
Dynamic Mapping and Load Balancing on Scalable Interconnection Networks

Alan Heirich, California Institute of Technology, Center for Advanced Computing Research

The problems of mapping and load balancing arbitrary programs and data structures on networks of computers are considered. A novel diffusion algorithm is presented to solve these problems. It complements the diffusion algorithms for load balancing which have enjoyed success on massively parallel processors. Like those algorithms it is infinitely scalable and adapts in real time to the requirements of dynamic problems. This is in contrast to existing mapping strategies, which are either not dynamic, not scalable, and/or inapplicable to irregular networks. Many recursive bisection algorithms have been proposed for mapping and load balancing on regular and irregular networks. Recursive spectral bisection (RSB) has been very popular because it has a solid theoretical foundation. RSB solves a series of constrained one-dimensional quadratic minimization subproblems in order to obtain a locally optimal solution to a multidimensional problem. The constraint prevents the algorithm from converging on a trivial solution. Incorporating the constraint into each subproblem leads to a series of eigenvalue problems, which are typically solved by (expensive) Lanczos iterations. The diffusion algorithm presented here solves the same quadratic minimization problem as RSB but solves it in multiple dimensions in a single step. As in RSB it is necessary to incorporate constraints in order to avoid the trivial solution. These constraints are incorporated directly from the problem instance and have a natural representation in the problem semantics. The resulting algorithm is scalable, dynamic, and applicable to any interconnection topology which can be contiguously embedded in R^n.

DIFFUSING COMPUTATIONS

Diffusion has been a metaphor for concurrent computation from the earliest days of parallel computing. Early papers introduced the metaphor, considered the problem of termination detection, and discussed mapping and process management.

Computation: a form of energy which diffuses through a (parallel) computer system.
Challenges: manage this (dynamic) energy for peak efficiency.
- Mapping: to minimize communication.
- Load balancing: to minimize idleness.
- Detect termination.

Dijkstra & Scholten, Termination detection for diffusing computations (1980).
Martin, A distributed implementation method for parallel programming (1980).
Chandy & Misra, Termination detection of diffusing computations in communicating sequential processes (1982).

DIFFUSION ALGORITHMS

With the advent of readily accessible parallel computers, several practical diffusion algorithms were proposed. The algorithms are based on computing an equilibrium of a dynamical system.

- "Solve" a Laplace system ∇²x = 0 by iteration.
- Converge to a fixed point (equilibrium) from any initial condition.
- Cybenko (1989) showed correctness for the load balancing problem.
- Heirich (1994), Heirich & Taylor (1995) showed infinite scalability.
- Heirich (1996) solves the mapping problem.

Cybenko, Dynamic load balancing for distributed memory multiprocessors (1989).
Heirich, Scalable load balancing by diffusion (1994).
Heirich & Taylor, A parabolic load balancing method (1995).
Heirich, Dynamic mapping and load balancing on scalable interconnection networks (1996).
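The load balancing iteration behind these results can be sketched in a few lines: each computer repeatedly exchanges a fixed fraction of its load difference with each neighbor, and the loads converge to the uniform equilibrium from any initial condition. This is an illustrative first-order scheme in the spirit of Cybenko's algorithm, not code from any of the cited papers; the function name and the step size alpha are choices of this sketch.

```python
import numpy as np

def diffusion_balance(load, adjacency, alpha=0.25, steps=200):
    """First-order diffusion sweep: each node exchanges a fraction alpha
    of its load difference with every neighbor.  `load` is a vector of
    workloads, `adjacency` a symmetric 0/1 matrix of network links."""
    load = np.asarray(load, dtype=float).copy()
    A = np.asarray(adjacency, dtype=float)
    for _ in range(steps):
        # diff[i, j] = load[j] - load[i]; flow into i from each neighbor j
        diff = load[np.newaxis, :] - load[:, np.newaxis]
        load += alpha * (A * diff).sum(axis=1)
    return load

# 4-node ring with all of the work initially on node 0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
balanced = diffusion_balance([8.0, 0.0, 0.0, 0.0], A)
```

Because the pairwise flows are antisymmetric, the total load is conserved while the imbalance decays exponentially, mirroring the equilibrium interpretation above.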

BIBLIOGRAPHY: DIFFUSION IN PARALLEL COMPUTING

At least a dozen closely related papers have been published since 1988. All of these apply diffusion to the load balancing problem.

1. Baden, SIAM J Sci Stat Comp, 12:1 (1991), 145-157.
2. Boillat, Concurrency, 2 (1990), 289-313.
3. Boillat, Bruge & Kropf, J Comp Phys, 96:1 (1991), 1-14.
4. Bruge & Fornili, Comp Phys Comm, 60 (1990), 39-45.
5. Chandy & Misra, ACM TOPLAS, 4:1 (1982).
6. Conley, Argonne Natl Lab Tech Rep ANL-93/40 (1993).
7. Cybenko, J Par Dist Comp, 7 (1989), 279-301.
8. Dijkstra & Scholten, Inf Proc Lett, 11:1 (1980), 1-4.
9. Heirich, Caltech Comp Sci Dept Tech Rep CS-TR-94-04 (1994).
10. Heirich & Taylor, Proc 24th Intl Conf Par Proc (1995), v. III, 192-202.
11. Hong, Tan & Chen, Proc ACM Sigmetrics Conf (1988), 73-82.
12. Horton, Par Comp, 19 (1993), 209-218.
13. Hosseini, Litow, Malwaki, Nadella & Vairavan, J Par Dist Comp, 10 (1990), 160-166.
14. Martin, Inf Proc, 80 (1980), 309-314.
15. Muniz & Zaluska, Par Comp, 21 (1995), 287-301.
16. Willebeek-LeMair & Reeves, IEEE Tr Par Dist Sys, 4 (1993), 979-993.
17. Xu & Lau, J Par Dist Comp, 16 (1992), 385-393.
18. Xu & Lau, J Par Dist Comp, 24 (1995), 72-85.

PROPERTIES OF GSL ITERATIONS

A Gauss-Seidel iteration on a Laplace equation is a general algorithmic paradigm for equilibrium computations. From a practical standpoint it has properties which make it robust and scalable in distributed systems. From a theoretical standpoint it fits a general framework of equilibrium computations in dynamical systems.

- Gauss-Seidel iteration on the Laplace equation ∇²x = 0.
- Discrete Laplacian matrix Q, split into upper (U), lower (L), and diagonal (D) parts.
- Discrete iteration: x ← −(D + L)^(−1) U x.
- Properties: concurrent, asynchronous, fault tolerant, scalable, fast.
- Scalability and convergence:
  - The model problem (degree-4 lattice) has known eigenvalues.
  - The time-dependent amplitude of a point disturbance is a linear superposition of modal amplitudes.
  - Convergence is exponential with respect to initial amplitude (i.e. logarithmic in time).
  - An arbitrary disturbance can be modeled as a composition of point disturbances.
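The paradigm is easiest to see in the simplest setting. A minimal sketch, assuming a 1-D chain with its two endpoints held fixed: each interior unknown is repeatedly replaced by the average of its neighbors, which is exactly a Gauss-Seidel sweep on the discrete Laplace equation. The function name and sizes are illustrative, not from the talk.

```python
import numpy as np

def gauss_seidel_laplace(x, fixed, sweeps=500):
    """Gauss-Seidel sweeps on the 1-D discrete Laplace equation
    x[i] = (x[i-1] + x[i+1]) / 2, with `fixed` indices pinned."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(1, len(x) - 1):
            if i not in fixed:
                # each update immediately uses the freshest neighbor values
                x[i] = 0.5 * (x[i - 1] + x[i + 1])
    return x

# Pin the endpoints at 0 and 1; the interior relaxes to the linear ramp,
# the unique equilibrium of the discrete Laplace equation.
x0 = np.zeros(9)
x0[-1] = 1.0
x = gauss_seidel_laplace(x0, fixed={0, 8})
```

The pinned endpoints play the role of the constraints discussed later: without them the iteration would relax to a trivial constant solution.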

SCALING FOR POINT DISTURBANCES

The scalability of GSL iterations comes from the near scale invariance of the Laplacian spectrum. The figures below show the eigenvalues of two GSL iterations on a model problem. Both model problems are a degree-4 lattice: on the left, three spectral components from a 64 x 64 lattice ("ev0_64", "ev16_64", "ev31_64"); on the right, the corresponding components from a 1024 x 1024 lattice ("ev0_1024", "ev256_1024", "ev511_1024").

[Figure: the three spectral components for each lattice size, with amplitudes ranging from -1 to 1; the two plots are nearly identical up to scale.]

The eigenvalues of a GSL iteration on an n x n model problem are

    lambda_{i,j} = (1/4) (cos(i*pi/n) + cos(j*pi/n))^2
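The eigenvalue formula can be checked numerically by assembling the Gauss-Seidel iteration matrix −(D + L)^(−1) U for the model problem and comparing its spectral radius against the largest lambda_{i,j}, which is cos^2(pi/n) at i = j = 1. A small sketch (the helper name is ours; interior unknowns are ordered lexicographically):

```python
import numpy as np

def gs_iteration_matrix(n):
    """Gauss-Seidel iteration matrix -(D+L)^{-1} U for the 5-point
    Laplacian on the interior of an n x n model lattice."""
    m = n - 1                          # interior points per side
    N = m * m
    A = np.zeros((N, N))
    for i in range(m):
        for j in range(m):
            k = i * m + j
            A[k, k] = 4.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < m and 0 <= jj < m:
                    A[k, ii * m + jj] = -1.0
    DL = np.tril(A)                    # D + L (lower triangle with diagonal)
    U = np.triu(A, 1)                  # strictly upper triangle
    return -np.linalg.solve(DL, U)

# Spectral radius matches the largest eigenvalue in the formula above.
G = gs_iteration_matrix(8)
rho = np.abs(np.linalg.eigvals(G)).max()
```

This is the classical result for consistently ordered matrices: the nonzero Gauss-Seidel eigenvalues are the squares of the Jacobi eigenvalues (cos(i*pi/n) + cos(j*pi/n))/2.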

The time-dependent amplitude of a point disturbance after tau iterates is

    x_{0,0}^(tau) = sum_{i,j} c_{i,j} (lambda_{i,j})^tau
                  = sum_{i,j} c_{i,j} [ (1/4) (cos(i*pi/n) + cos(j*pi/n))^2 ]^tau

Convergence to equilibrium occurs exponentially, i.e. time is logarithmic with respect to the height of the disturbance. Below left, the height of the point disturbance during 32 successive GSL iterates. Below right, the number of iterates required to decrease the disturbance by 90% for increasing problem sizes. This number is constant above a certain size.

[Figure: disturbance height decaying over 32 iterates (left); iterates to 90% decay versus problem sizes up to 256 (right).]

Young, Iterative solution of large linear systems (1971).
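The size-independence claim is easy to reproduce: apply Gauss-Seidel sweeps to a unit point disturbance on a zero-boundary lattice and count sweeps until its peak height falls by 90%. The sketch below is our own test harness, not the original experiment; the count comes out the same for a 16 x 16 and a 32 x 32 lattice because the disturbance decays before it feels the boundary.

```python
import numpy as np

def sweeps_to_decay(n, threshold=0.1):
    """Count Gauss-Seidel sweeps on an n x n zero-boundary Laplace
    lattice until a unit point disturbance at the center has its
    maximum height fall below `threshold` (a 90% decrease)."""
    x = np.zeros((n, n))
    x[n // 2, n // 2] = 1.0
    for sweep in range(1, 100):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                x[i, j] = 0.25 * (x[i - 1, j] + x[i + 1, j] +
                                  x[i, j - 1] + x[i, j + 1])
        if np.abs(x).max() < threshold:
            return sweep
    return None

small, large = sweeps_to_decay(16), sweeps_to_decay(32)
```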

GRAPH LAYOUT BY QUADRATIC MINIMIZATION

Many NP-hard and NP-complete problems can be characterized as graph layout problems: given a (capacitated) graph, find an arrangement of vertices which minimizes the sum of a metric over pairs of connected points.

Minimize (nontrivially) the aggregate distance z among a (weighted) set of connected points.

One dimension: z = sum_{i,j} (x_i - x_j)^2 c_{i,j}.
- Matrix bandwidth reduction.
- Mapping on a ring.

Two dimensions: z = sum_{i,j} [ (x_i - x_j)^2 + (y_i - y_j)^2 ] c_{i,j}.
- Mapping on a mesh, hierarchical, or irregular network.
- Detecting implicit control structures in compiled programs.
- VLSI placement.
- Graph isomorphism, many others.

Kung & Stevenson, A software technique for reducing the routing time on a parallel computer with a fixed interconnection network (1977).
Bokhari, On the mapping problem (1981).
Read & Corneil, The graph isomorphism disease (1977).
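For concreteness, the one-dimensional objective can be evaluated directly. A hypothetical three-vertex path shows that an arrangement respecting the graph's structure scores lower than one that does not:

```python
import numpy as np

def layout_cost_1d(x, c):
    """Aggregate squared distance z = sum over pairs of
    (x_i - x_j)^2 * c_{i,j} for a 1-D arrangement x of vertices
    with symmetric edge-weight matrix c."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]
    return 0.5 * np.sum(diff ** 2 * c)   # 0.5: each pair appears twice

# Path graph 0 - 1 - 2, unit edge weights
c = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
ordered = layout_cost_1d([0, 1, 2], c)   # vertices in path order
swapped = layout_cost_1d([0, 2, 1], c)   # vertices 1 and 2 exchanged
```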

MAPPING PROGRAMS BY GRAPH EMBEDDING

The mapping problem can be described as a graph layout problem in which a graph of communicating processes is embedded into a graph of connected computers.

- "Guest" graph G: vertices are processes, edges are communication channels.
- "Host" graph H: vertices are computers, edges are network links.
- Graph embedding problem (Rosenberg): map vertices of G onto vertices of H to minimize dilation (average distance) and equalize density.
- Mapping problem (Martin): map processes onto computers to minimize aggregate communication and equalize workload.

Rosenberg, Issues in the study of graph embeddings (1981).
Martin, A distributed implementation method for parallel programming (1980).

COORDINATE BISECTION ALGORITHMS

In specialized problems, such as solving partial differential equations, a mapping is implicit in the problem instance. The coordinates of a PDE grid provide a solution which can be found by repeated coordinate bisection.

- Repeatedly bisect along alternating axes using intrinsic problem coordinates.
- Example: partitioning an unstructured PDE grid.
- Advantages: scalable, O((n log n)/p); probably the best method for PDE problems.
- Problems: restricted to special cases in which vertices of G possess intrinsic coordinates. Inapplicable to irregular networks or dynamic problems. No provision for capacitated problems (weighted edges or vertices).

Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations (1991).
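Recursive coordinate bisection is short enough to sketch in full (an illustration in the spirit of the method Williams evaluates; the function name is ours): split the points at the median along alternating axes, recursing until the desired number of parts is reached.

```python
import numpy as np

def coordinate_bisect(points, depth):
    """Recursive coordinate bisection: split at the median along
    alternating axes, returning a partition label for each point.
    `depth` levels of bisection yield 2**depth parts."""
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)

    def recurse(idx, axis, d, base):
        if d == 0 or len(idx) <= 1:
            labels[idx] = base
            return
        order = idx[np.argsort(points[idx, axis])]   # sort along this axis
        half = len(order) // 2                       # median split
        nxt = (axis + 1) % points.shape[1]           # alternate the axis
        recurse(order[:half], nxt, d - 1, 2 * base)
        recurse(order[half:], nxt, d - 1, 2 * base + 1)

    recurse(np.arange(len(points)), 0, depth, 0)
    return labels

# A 4 x 4 grid of mesh points split into 4 perfectly balanced parts
pts = [(i, j) for i in range(4) for j in range(4)]
parts = coordinate_bisect(pts, depth=2)
```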

GRAPH CONTRACTION ALGORITHMS

Any problem instance can be solved by repeatedly coalescing vertices of a graph until it equals a standard topology. Efficient, scalable.

- Advantages: finds optimal solutions for standard host topologies.
- Problems: suboptimal solutions for irregular networks; inapplicable to dynamic problems; no provision for capacitated problems.

Ben-Natan & Barak, Parallel contraction of grids for task assignment to processor networks (1992).
Karabeg, Process partitioning through graph compaction (1995).

THE LAPLACIAN SPECTRUM

Call C the matrix of edge weights of G. Then optimal bisections of G can be found by solving a one-dimensional eigenvalue problem for C. Minimize z:

    z = (1/2) sum_{i,j} (x_i - x_j)^2 c_{i,j}
      = (1/2) sum_{i,j} (x_i^2 - 2 x_i x_j + x_j^2) c_{i,j}
      = (1/2) ( sum_i x_i^2 sum_j c_{i,j} - 2 sum_{i,j} x_i x_j c_{i,j} + sum_j x_j^2 sum_i c_{i,j} )
      = sum_i x_i^2 d_i - sum_{i,j} x_i x_j c_{i,j}        (d_i = sum_j c_{i,j}, the diagonal of D)
      = x^T (D - C) x
      = x^T Q x

To avoid triviality (x = 0), append the constraint x^T x = 1 and minimize the resulting Lagrangian:

    L = x^T Q x - lambda (x^T x - 1)
    dL/dx = 0  =>  2 Q x - 2 lambda x = 0  =>  Q x = lambda x

In two dimensions minimize x^T Q x + y^T Q y, and similarly in higher dimensions.

Hall, An r-dimensional quadratic placement algorithm (1970).
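The eigenvalue problem above yields a practical bisection rule: take the eigenvector of Q = D − C for the second-smallest eigenvalue (the Fiedler vector) and split the vertices at its median. A sketch, assuming a small hypothetical graph of two triangles joined by one bridge edge:

```python
import numpy as np

def spectral_bisect(C):
    """Bisect a graph with symmetric weight matrix C at the median of
    the Fiedler vector of the Laplacian Q = D - C, following Hall's
    quadratic placement formulation."""
    C = np.asarray(C, dtype=float)
    Q = np.diag(C.sum(axis=1)) - C        # graph Laplacian
    w, V = np.linalg.eigh(Q)              # eigenvalues in ascending order
    fiedler = V[:, 1]                     # second-smallest eigenvector
    return fiedler <= np.median(fiedler)  # boolean half-assignment

# Two triangles {0,1,2} and {3,4,5} joined by the weak edge (2,3):
# the optimal bisection cuts exactly that edge.
C = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    C[a, b] = C[b, a] = 1.0
part = spectral_bisect(C)
```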

RECURSIVE SPECTRAL BISECTION

Over 100 papers have been written in recent years about the RSB algorithm, most of which address its high cost. More recently it has been challenged on the grounds that equivalent-quality solutions can be obtained by heuristics (Karypis).

- Recursively bisect, using eigenvectors for optimal splitting at each step.
- Implementations by the Lanczos algorithm can be expensive and non-robust.
- The ultimate result can be suboptimal.
- Advantages: a solid theoretical foundation; allows edge weights in G.
- Problems: no edge weights in H; no vertex weights in G or H; suboptimal solutions; expensive algorithm; unscalable; inapplicable to irregular networks or dynamic problems.

Barnard & Simon, A parallel implementation of multilevel recursive spectral bisection for application to adaptive unstructured meshes (1995).
Karypis & Kumar, Multilevel graph partitioning schemes (1995).

A VISUAL METAPHOR

In 1963 Tutte proposed an algorithm to draw a planar graph in the plane. The proposal was to solve a Laplace equation on the graph vertices while constraining the values of some vertices in order to avoid the trivial solution. This proposal can be extended to general graphs in R^n if the algorithm is based on a GSL iteration.

- Equalizing edge lengths of a planar graph gives a (nonunique) optimal placement in the plane (Tutte).
- The metaphor extends to nonplanar graphs under a least-squares minimization.
- Graph embedding can be accomplished by identifying local regions of the plane (or R^n) with vertices of H.
- Natural interpretation of capacities:
  - Process weights: areas of vertices in G.
  - Communication capacity: weights of edges in G.
  - Computer work capacities: areas of regions associated with vertices in H.

Tutte, How to draw a graph (1963).
Heirich, Dynamic mapping and load balancing on scalable interconnection networks (1996).
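A minimal sketch of the Tutte-style relaxation (our own illustration, not Tutte's original formulation): pin a few vertices, then let every free vertex relax to the centroid of its neighbors, which solves the graph Laplace equation by Gauss-Seidel sweeps.

```python
import numpy as np

def tutte_layout(adj, pinned, sweeps=200):
    """Tutte-style layout: pinned vertices stay fixed, every free vertex
    relaxes to the centroid of its neighbors (Gauss-Seidel sweeps on
    the graph Laplace equation).  `pinned` maps vertex -> (x, y)."""
    n = len(adj)
    pos = np.zeros((n, 2))
    for v, p in pinned.items():
        pos[v] = p
    for _ in range(sweeps):
        for v in range(n):
            if v not in pinned:
                nbrs = np.nonzero(adj[v])[0]
                pos[v] = pos[nbrs].mean(axis=0)
    return pos

# A fifth vertex connected to all four corners of a pinned unit square
adj = np.zeros((5, 5), dtype=int)
for v in range(4):
    adj[v, 4] = adj[4, v] = 1
pins = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
pos = tutte_layout(adj, pins)
```

The pinned square is the constraint that excludes the trivial solution; the free vertex settles at the square's center.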

DIFFUSIVE MAPPING AND LOAD BALANCING

A robust, scalable and dynamic scheme for mapping and load balancing can be constructed from GSL iterations. In the first step, solve the mapping problem by solving a Laplace equation on a graph of communicating processes or a connected data structure. In the second step, solve the load balancing problem by solving a Laplace equation on the workloads of the computers.

1. Define the mapping space R^n. R^n represents the network of connected computers. The dimension n is chosen so that R^n can be partitioned according to the connectivity of the network. For example, R^2 provides a natural partitioning for a two-dimensional mesh.

2. Partition R^n according to the network connectivity. In R^2 this can be done by embedding the network in a square region of the plane, bisecting each network link, and using the bisectors to define the regions associated with each computer. The area of each region represents the workload capacity of each computer. Equivalent procedures can be used for R^n.

3. Obtain the guest graph. In many problems this is the form of the problem instance, for example a connected data structure or a graph of communicating processes. In other problems it may be necessary to construct this graph from a data set, for example in ray tracing disconnected polygons.

4. Place a (small) set of "distinguished" vertices of the guest graph. This requires an oracle, or knowledge specific to the problem instance. For example, a small number of processes are typically associated with I/O. These nodes will constrain the algorithm and prevent it from finding the trivial solution. The placement of these nodes is critical.

5. Perform a GSL iteration on the vertices of the guest graph. The vertices which were placed in the previous step do not move. The work requirement of guest vertices can be represented by the areas they occupy in R^n; this requires a trivial modification to the basic GSL iteration. The communication requirements of guest channels can be represented by nonuniform weights in Q. The GSL iteration takes these nonuniform weights into account in the desired way, by grouping more closely those vertices which are connected by higher-weighted edges.

6. Perform a GSL iteration on the workloads of the computers. The first GSL iteration will approach a fixed point and then converge very slowly to the ultimate configuration. At this point the vertices of G will be properly ordered, and further iterates only refine the solution. Shortcut the final iterates by directly fluxing vertices of G across boundaries between the computers. A vertex of G is a candidate for fluxing only if it is connected by an edge to a vertex on a different computer.

The resulting algorithm is scalable, fault tolerant, delay insensitive, fast, and dynamic. It finds a locally minimal solution to the quadratic minimization problem, just as RSB does. It naturally incorporates the capacities of computers, processes, and communication requirements. It does not address limited communication capacities of network links, nor network topologies which cannot be embedded in R^n for some n. The constraints result from proper initial placement of guest vertices, and these placements are critical to the quality of the resulting solution. These techniques are currently being applied to the problem of Monte Carlo ray tracing on parallel computers.
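Step 5 can be sketched as a weighted Gauss-Seidel relaxation of guest-vertex positions in the mapping space: edge weights pull strongly-communicating vertices together, and the pinned "distinguished" vertices exclude the trivial solution. This is an illustration under our own assumptions (a 2-D mapping space, a toy chain), not the author's implementation.

```python
import numpy as np

def diffusive_map(C, pinned, sweeps=500):
    """Gauss-Seidel relaxation of guest-vertex positions in R^2.
    C is the symmetric matrix of communication weights; `pinned` maps
    distinguished vertices to fixed positions.  Each free vertex moves
    to the weighted centroid of its neighbors."""
    n = len(C)
    pos = np.zeros((n, 2))
    for v, p in pinned.items():
        pos[v] = p
    for _ in range(sweeps):
        for v in range(n):
            if v not in pinned:
                w = C[v]
                pos[v] = (w[:, None] * pos).sum(axis=0) / w.sum()
    return pos

# A 4-vertex chain with its endpoints pinned at opposite ends of the
# mapping space: the free vertices spread out evenly between them.
C = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    C[a, b] = C[b, a] = 1.0
pos = diffusive_map(C, {0: (0.0, 0.0), 3: (3.0, 0.0)})
```

Partitioning the resulting positions by host regions (step 2) then gives the assignment of guest vertices to computers.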

APPLICATION TO PHOTOREALISTIC ANIMATION

(Figure and model courtesy of Greg Ward, Lawrence Berkeley Laboratories.) This model of an imaginary office comprises 18,579 polygons. The polygons are assembled into a graph and dynamically mapped across processors of the 512-node IBM SP2 at the Cornell Theory Center. A Euclidean metric in R^3 is used to cluster the polygons in a two-dimensional mapping space. Dynamic requirements in animation result from the movement of objects, including light sources.