An Improved Spectral Load Balancing Method*



Similar documents
DATA ANALYSIS II. Matrix Algorithms

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Nonlinear Iterative Partial Least Squares Method

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Section 1.1. Introduction to R n

11.1. Objectives. Component Form of a Vector. Component Form of a Vector. Component Form of a Vector. Vectors and the Geometry of Space

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Applied Algorithm Design Lecture 5

USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Topological Properties

A FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHS

Approximation Algorithms

x1 x 2 x 3 y 1 y 2 y 3 x 1 y 2 x 2 y 1 0.

Lecture 12: Partitioning and Load Balancing

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

ISOMETRIES OF R n KEITH CONRAD

Solving Geometric Problems with the Rotating Calipers *

SHARP BOUNDS FOR THE SUM OF THE SQUARES OF THE DEGREES OF A GRAPH

Offline sorting buffers on Line

A scalable multilevel algorithm for graph clustering and community structure detection

Mechanics lecture 7 Moment of a force, torque, equilibrium of a body

Rotation Rate of a Trajectory of an Algebraic Vector Field Around an Algebraic Curve

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

JUST-IN-TIME SCHEDULING WITH PERIODIC TIME SLOTS. Received December May 12, 2003; revised February 5, 2004

A Comparison of General Approaches to Multiprocessor Scheduling

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Linear algebra and the geometry of quadratic equations. Similarity transformations and orthogonal matrices

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Social Media Mining. Graph Essentials

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials

Computer Graphics. Geometric Modeling. Page 1. Copyright Gotsman, Elber, Barequet, Karni, Sheffer Computer Science - Technion. An Example.

THREE DIMENSIONAL GEOMETRY

3. INNER PRODUCT SPACES

Geometry of Vectors. 1 Cartesian Coordinates. Carlo Tomasi

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

Inner Product Spaces

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig

The Open University s repository of research publications and other research outputs

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

α = u v. In other words, Orthogonal Projection

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

Estimated Pre Calculus Pacing Timeline

Load Balancing between Computing Clusters

South Carolina College- and Career-Ready (SCCCR) Pre-Calculus

Big Ideas in Mathematics

1 Sets and Set Notation.

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the school year.

Math 241, Exam 1 Information.

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Figure 2.1: Center of mass of four points.

KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007

521493S Computer Graphics. Exercise 2 & course schedule change

Fault-Tolerant Routing Algorithm for BSN-Hypercube Using Unsafety Vectors

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Competitive Analysis of On line Randomized Call Control in Cellular Networks

136 CHAPTER 4. INDUCTION, GRAPHS AND TREES

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION

Prerequisite: High School Chemistry.

State of Stress at Point

Analysis of Internet Topologies: A Historical View

9 Multiplication of Vectors: The Scalar or Dot Product

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

DRAFT. Further mathematics. GCE AS and A level subject content

LINEAR ALGEBRA W W L CHEN

Mathematics (MAT) MAT 061 Basic Euclidean Geometry 3 Hours. MAT 051 Pre-Algebra 4 Hours

Prentice Hall Algebra Correlated to: Colorado P-12 Academic Standards for High School Mathematics, Adopted 12/2009

Support Materials for Core Content for Assessment. Mathematics

IN THIS PAPER, we study the delay and capacity trade-offs

Definition Given a graph G on n vertices, we define the following quantities:

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

ON THE DEGREES OF FREEDOM OF SIGNALS ON GRAPHS. Mikhail Tsitsvero and Sergio Barbarossa

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

COUNTING INDEPENDENT SETS IN SOME CLASSES OF (ALMOST) REGULAR GRAPHS

Common Core Unit Summary Grades 6 to 8

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

Social Media Mining. Network Measures

1 VECTOR SPACES AND SUBSPACES

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

New Hash Function Construction for Textual and Geometric Data Retrieval

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Transcription:

SAND93-016C An Improved Spectral Load Balancing Method* Bruce Hendrickson Robert Leland Abstract We describe an algorithm for the static load balancing of scientific computations that generalizes and improves upon spectral bisection. Through a novel use of multiple eigenvectors, our new spectral algorithm can divide a computation into or 8 pieces at once. This leads to balanced partitions that have lower communication overhead and are less expensive to compute than those of spectral bisection. In addition, our approach automatically works to minimize message contention on a hypercube or mesh architecture. 1 Introduction Efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. This mapping requirement can be abstracted to a graph problem in which nodes represent computation, edges represent communication, and the objective is to assign an equal number of vertices to each processor in a way that minimizes the number of edges crossing between processors. Practical experience has shown that the quality of this mapping can have a substantial impact on performance, hence there is considerable interest in effective mapping algorithms. A new heuristic for graph partitioning in the context of mapping parallel computations was recently proposed by Simon [1]. This spectral method recursively bisects a graph by considering an eigenvector of an associated matrix to gain an understanding of global properties of the graph. The method has received much attention because it offers a good balance between generality, quality and efficiency. The idea of using eigenvectors to partition graphs originated with work in the early 70's by Donath and Hoffman [3, 1, and Fiedler [5, 61. In recent years these ideas have been revived and further developed by several authors [Z, 11, 12, 131. With one exception, all this previous work has been based upon the application of recursive bisection, a strategy which is limiting in several important ways. Rend1 and Wolkowicz do describe an algorithm for partitioning a graph into an arbitrary number of sets without recursion [13]. However, their algorithm requires that IC eigenvectors of a matrix representing the graph be determined in order to partition the graph into k sets. This renders the algorithm impractical for dividing large graphs into hundreds or thousands of sets, precisely what is required in mapping grand challenge problems onto massively parallel machines. The distinguishing feature of the work described in this paper is that it offers a practical generalization of spectral bisection to allow the division into more than two sets at once, which is especially appropriate in the context of modern parallel supercomputing. 'This work was supported by the Applied Mathematical Sciences program, US. Department of Energy, Office of Energy Research, and was performed at Sandia National Laboratories, operated for the US. Department of Energy under contract No. DE-AC0-76DP00789. 'Sandia National Laboratories, Albuquerque, NM 87185 1

2 HENDRICKSON AND LELAND Specificdly, we describe two new algorithms, spectral quadrisection and spectral octasection, which partition into four sets using two eigenvectors and eight sets using three eigenvectors respectively. By weighing the effect of several cuts simultaneously, these new algorithms produce better partitions than spectral bisection, as measured in the hypercube hop (or Manhattan) metric. Empirical study has shown that this is the appropriate measure for modeling the performance of hypercube architecture machines since minimizing this metric corresponds to minimizing congestion within the communication network [7]. The hop metric is similarly appropriate for two and three dimensional mesh architectures (with or without wrap-around torus connections). Most distributed memory parallel machines have either mesh or hypercube architectures, so these new algorithms have wide application. In addition to producing better partitions, these higher dimensional spectral algorithms have two other significant advantages. First, the logarithmic relationship between the number of required eigenvectors and the number of partitions means that in practice the new methods actually require less net computation than spectral bisection to define the same number of partitions. Second, by exploiting available redundancy in the solution space, higher dimensional spectral methods are able to correctly partition highly symmetric graphs which give spectral bisection trouble. Load balancing algorithms based on spectral quadrisection and octasection are thus generally more powerful, efficient and robust than those based on spectral bisection. The results presented here are largely taken from a recently completed technical report [9]. The purpose of this paper is to present an abbreviated version of those results to a wider audience in a less formal and hopefully more intuitive manner. 2 Spectral bisection Various formulations of spectrd bisection can be found in papers by several authors [2, 11, 11. To allow for our generalization to higher dimensions we set up the problem as follows. Define the graph G by vertex set V and edge set E. Index the n vertices with i or j, so Vi refers to the vertex with index i, Ei,j denotes an extant edge between V; and Vj, and Cv, and CE,, specify sums over the vertices and existing edges, respectively. Now assign a variable xi to each such that xi = fl and xi = 0. The first condition stipulates a partition into two distinct sets. The second requires that the sets be of equal size, assuming an even number of vertices. (We call a vector 2 whose elements satisfy these conditions an indicator vector since it indicates the set assignment of each graph vertex.) The next step is to notice that the function f(z) = $ EEiJ(zi - xji2 counts the number of cross-edges between sets since (q - zj)2 contributes nothing to the sum if xi and xj have the same sign, and contributes if they have opposite sign. We now have an objective function to minimize. The solution will become apparent once we have converted f(z) to matrix form, so that is the next step. We start by defining the adjacency matrix Aij = 1 if (v,vj) E E 0 otherwise. We also define the degree matrix D = diag(di), where da is the graph degree of vi, i.e. the number of edges incident upon vertex 6. The conversion then proceeds as follows. Write

A N IMPROVED SPECTRAL LOADBALANCING METHOD 3 and recast the terms (3) () t x;) = El, 2 = 2 ( E (= Xfdi = XTDX. V; E13 Now define the Laplacian matrix of the graph G (5) note that L(G) = D - A, and conclude f ( x ) = axtlx. Coupling this with the constraints on z, we define the discrete bisection problem Minimize l -xtlx Subject to : x T 1 = 0, (6) x; = f l, where 1 is the n-vector (1,1,1...)T. Since graph partitioning is an NP-hard problem) we expect that there is no practical way to solve this problem. Undeterred, we relax the discreteness constraint on x and define the continuous bisection problem 1 -xtlx Subject to : xt1 = 0, Minimize (7) x T x = n, in which the elements of x may take on any value. This continuous problem is only an approximation to the discrete problem, and the values defining its solution must be mapped back to il by some appropriate scheme to define a partition. Let us emphasize that the relaxation of the discreteness constraint is the crucial approximation in application of spectral methods to graph partitioning. Ideally, the solution to the continuous problem will have entries clustered near f l so that it is a good approximation to the discrete problem. We now begin the solution of (7) by noting that the Laplacian matrix has several important properties. If u1, u2...u, are the normalized eigenvectors of L with corresponding eigenvalues A 1 5 A2 5.. 5 A,, the following theorem is proved in [9]. THEOREM 2.1. The matrix L has the following properties. ( I ) L is symmetric positive semi-definite. (II) The u; are pairwise orthogonal, (HI) u1 = 1, A 1 = 0. (IV) If G is connected, then A 1 is the only zero eigenvalue of L. Next express z in terms of the eigenvectors of L: x = Ci(riu;where the ai are real constants such that Cr=la: = n. Property I1 ensures that this is always possible since a set of n pairwise orthogonal vectors must span R".By substitution for z we find f ( x ) = + ( a ; A 2 t a 9 3 t..cxix,) since X I = 0. Clearly. (8) ( a ; t a; +... a i ) A 2 5 a;& t x 3 t...a;x,, given the ordering of the eigenvalues) so f ( x ) 2 nx2/. Notice that we can achieve f ( x ) = nx2/ by choosing x = f i u 2. Notice also that this choice of x satisfies the balance

HENDRICKSON A N D LELAND constraint since zt1 = f i u r u l = 0, by properties I1 and 111. Therefore, since x = f i u g satisfies the constraints and minimizes f(z), it is a solution to the continuous problem. If A 2 # A3, this solution is unique (z = 61 defines the same partition). There remains the task of mapping the solution of the continuous problem to a discrete partition. In the case of bisection there is a simple and natural way to do this. Find the median of the xi values and then map vertices with corresponding z; above the median to one set, and those below to the other. If several vertices share the median value they are assigned in a way that retains balance. This solution is the nearest discrete point to the continuous optimum. An immediate corollary of the reasoning used to solve the continuous minimization problem is that na2/ is a lower bound on the number of cuts produced by any balanced partitioning of the graph. That is because the solution space of the continuous problem subsumes the solution space of the discrete problem. This result can be slightly improved using insight from the higher dimensional partitioning schemes [9]. Unfortunately, while these lower bounds are simple and computable, they tend to be rather loose in practice. A theoretical result of more practical interest involves a theorem due to Fiedler [5, 61. This says that following a bisection, if the median value is larger than zero, then the subgraph consisting of all lower valued vertices is connected. Similarly, if the median value is less than zero, then the subgraph consisting of all higher valued vertices is connected. The other subgraph may be disconnected. A disconnected subgraph is problematic because we intend to apply the method recursively, and due to the breakdown of Theorem 2.1(1V), the method can fail badly if applied to a disconnected graph. We therefore monitor connectivity of the subgraphs at each stage of recursion. If a disjoint subgraph is detected, we add a minimal number of phantom edges to establish connectivity, partition this subgraph and then remove the phantom edges. We find that in practice several subgraphs do become disconnected in the course of the partitioning of a typical large graph, and our phantomedge strategy does noticeably improve overall results in those cases. 3 Spectral quadrisection In order to partition a graph into four sets we need a second indicator vector, y, so that we can associate two bistate coordinates with each vertex. To make set assignment explicit, we define a mapping from the indicator vector to binary digits ii = $(xi - l), & = $(yi - 1) and then assign o; to set 0, 1,2 or 3 by interpreting ii& as a binary number. The question is then how we should choose x and y in order to achieve a good mapping. Consider the objective function f ( z, y ) = $ ( s T L e ytly). This counts hypercube hops, as illustrated in Fig. 1. To understand this, notice that E ~ Jwhich, crosses the z = 0 plane, contributes 1 to the value of f ( x, y ) through the ztlx term. Similarly, E2,3 crosses the y = 0 plane and contributes 1 to f ( x, y ) through the ytly term. An edge like E1,3, however, crosses both planes and hence contributes 2 to the value of f(x, y). We also need to revise the minimization constraints. In the continuous bisection problem we had the single constraint x T 1 = 0 to ensure load balance in the x indicator vector. In the quadrisection case we clearly need to add yt1 = 0 to ensure balance in the y indicator vector. A less obvious but necessary constraint proved in [9] is that Cv,ziy; = zty = 0. This constraint serves to prevent a node distribution like that shown in Fig. 2, which is balanced with respect to both the x = 0 and y = 0 planes independently, but does not represent four balanced sets; a correct partitioning would assign two nodes to each quadrant. +

AN IMPROVED SPECTRALOAD BALANCING METHOD 5 $$= 00 t 2g = 10 FIG. 1. f(x,y) = a(xtax + ytay) counts hops. Note that by way of physical analogy with a point-mass distribution, the zt1 = 0 and yt1 = 0 constraints specify a coordinate system in which the origin is placed at the center of mass. The zty = 0 constraint, combined with norm constraints xtx = yty = n, stipulates a distribution in which the inertial tensor is a constant multiple of the identity matrix, i.e. there is no preferred axis of rotation. Intuitively, these conditions specify a roughly spherical mass distribution which is divided naturally into four balanced sets by the orthogonal axes. We might infer that for higher dimensional partitionings we must specify that higher moments are zero in order to preserve balance. This is in fact the case, as we prove algebraically in [9]. FIG. 2. An unbalanced but allowable partition unless xty = 0. With these amendments, we can now follow the same progression from discrete to continuous model used in the bisection case and write continuous quadrisection (9) 1 Minimize -(xtlz f ytly) T T Subject to : xt1 = y 1 = xtg = 0, z z = yty = n. The solution of (9) is actually very simple. We first note that if z and y are chosen to be any pair of distinct eigenvectors of L not including u1 and scaled by fi, the constraint equations are all satisfied. That is because, by Theorem 2.1, all such eigenvectors are orthogonal and are also orthogonal to u1 = 1. By extension of the algebraic argument used in (8) for the bisection case, we can prove that the minimum possible value of f(x, y) = n(xzfx3)/ is obtained by setting x = fiuz and y = fiu3, and that n(xzfx3)/

6 HENDRICKSON A N D LELAND is a lower bound on the number of hops induced by any balanced partition into four sets. This is a special case of the general solution theorem proved in [9]. Notice that if we set x = f i u z cos 8$ f i u g sin 8 and y = - f i u z sin 8 -l- f i u 3 cos 8 then z and y still satisfy the constraint equations and also produce the same minimum value of f(z,y). Hence there is actually a family of solutions to (9) corresponding to the various choices of the free parameter 9. Different values of 9 correspond to different solutions to the continuous problem that have the same value, but correspond to different partitionings. An obvious question is therefore how we might exploit this rotational degree of freedom to advantage. We do so by choosing 8 in a way that partially recovers the accuracy lost when we relaxed the discrete optimization model to the continuous one. That is, we attempt to minimize the discrepancy between the two models by making the continuous optimum as nearly discrete as possible. Specifically, we minimize g(8) = Cvi(1 - zf)2 (1 over 8 E (0,2n). After substituting the trigonometric expansions for x and y, this reduces to minimizing a constant coefficient quartic equation in sines and cosines of 8. The construction of the coefficients in this equation requires O ( n ) work, but the cost of the resulting minimization problem is independent of n and is not significant when partitioning large graphs. This is actually a global optimization, but in our experience the number of values of 0 corresponding to local minima is small, so a good solution can be found in most cases by a short sequence of local minimizations from random starting points. As in spectral bisection, there remains the problem of mapping the continuous, rotated solution back to discrete space. Unfortunately a simple generalization of the median technique used there is no longer adequate. Our guiding principle is once again to find the discrete solution nearest the continuous optimum. First we define a distance function from a, The problem is now to map the continuous values point (xi,yi) to a discrete point ( & l&l). to f l in a, way that produces a balanced partitioning in which the sum of the distances from the continuous values to their discrete assignments is minimized. This is an instance of a minimum cost assignment problem, for which efficient algorithms are known. In our code, we implemented an algorithm from [l51 that runs in O(nlog(n)) time. + Spectral octasection The development of the octasection algorithm follows closely that of quadrisection. We define a third indicator vector z mapped to a bit by i i = $ ( z i - 1) and assign each wi to one of eight sets by interpreting P & i i as a binary number. We then minimize the function f ( x, y, z ) = $ ( x T L z t ytly z T L z ) which counts hypercube hops on a three dimensional cube. To retain balance we must do this subject to the additional constraint that a particular third moment of the distribution is zero, i.e. Cvixiyizi = 0. (The physical analogy with mass distribution does not extend easily, although it is still true that the inertial tensor is a multiple of the identity.) Relaxing the discreteness constraint as before, we arrive at the continuous octasection problem -+ 1 - ( z T L x -t ytly i- z T L z ) Subject to : zt1 = y T 1 = z T 1 = zty = x T z = yt z = c s i y i z i = 0, Minimize xt x Vi = yty = z T z = n. Ignoring the third moment constraint, one solution to the problem is z = &uz, y = f i u 3 and z = f i u q with corresponding minimum (and hence lower bound on the number of

, * AN IMPROVED SPECTRALOAD BALANCING METHOD 7 hops induced by any balanced partition into eight sets) of n(x2 + X3 + X)/. Again there is redundancy in the solution space since any rotation of these eigenvectors generates another solution of equal value. Since we are working in 3D space, there are three rotational degrees of freedom. We use this freedom to satisfy the triple product constraint while, as in quadrisection, also trying to make the continuous optimum as nearly discrete as possible. This yields a constrained optimization problem in three variables involving a constant coefficient quartic polynomial in sines and cosines of the angular parameters. The coefficients are computed in O(n) time, after which the cost of the minimization is independent of n. The continuous, rotated solution is then mapped back to a discrete solution using the same algorithm for the minimum cost assignment problem employed in spectral quadrisection. 5 Higher dimensional spectral partitioning If a d dimensional partitioning scheme divides a mesh into 2d parts on each recursive pass, we have presented spectral schemes for 1 i d 5 3. Although some of the implementation details become more difficult, these ideas extend naturally to the d = case. When d > the moment constraints outnumber the rotational degrees of freedom, so it will not generally be possible to construct a balanced partition from the d t 1 lowest eigenvectors of L [9]. 6 Results We have compared empirically our recursive spectral quadrisection (RSQ) and recursive spectral octasection (RSO) algorithms to recursive spectral bisection (RSB) and another bisection algorithm in common use, the inertial method proposed by Nour-Omid [1]. We report here results for partitioning three representative application meshes. A more comprehensive reporting of our experimental results is contained in [8, 91. The first sample is a 2D finite element meshing of a multi-element airfoil provided by Barth and Jesperson of NASA Ames which has been commonly used in testing partitioners [l]; we have actually constructed and partitioned the dual ([VI = 803, IEl = 11813 ) of this mesh since that is more appropriate for some finite element calculations. The second is another 2D CFD mesh generated by Hammond at RIACS and available from RIACS through anonymous ftp (IVl = 720, IEl = 13722). The third is a three-dimensional finite difference mesh of a complex manufacturing component generated using advanced meshing software at Sandia National Laboratories (IVl = 6661, /El = 55600). Table 1 shows the results obtained when the partitioning methods were used to divide each sample mesh into eight pieces. The methods are listed in rank order by hop count, which, as we mentioned earlier, has been shown to closely correlate with the overhead due to communication for these sorts of applications [7]. We have also shown another common measure of partition quality, the number of cuts, i.e. the total number of edges crossing between processor sets. The suffix KL in RSOKL refers to the classic graph bisection heuristic devised by Kernighan and Lin [lo]. KL must be supplied with an initial partition which is then improved by a greedy local strategy. RSOKL is a composite algorithm in which the output of RSO at each stage of recursion is fed into a generalized KL algorithm capable of minimizing hops over an 8 way initial partitioning. The motivation for this strategy was to combine the global strength of RSO with the local finesse of KL. The resulting partition is clearly the best with respect to both hops and cuts. Notice also that the cut and hop totals are nearly equal indicating that most communication is nearest neighbor. Partitioning results are discussed in more detail in [8, 91.

8 HENDRICKSON AND LELAND - Barth Hammond Sandia Method cuts I hops cuts I hops cuts I hops F 7 Conclusions We have described a static load balancing algorithm that is well suited to unstructured problems in scientific computation. Our algorithm generalizes spectral bisection to allow for the division into four or eight sets at once. This approach is faster than spectral bisection, and explicitly accounts for message congestion in mesh, torus or hypercube architectures. Several authors have found spectral bisection to produce better mappings of unstructured meshes than other techniques [1, 161, and in sample problems our new algorithm significantly outperforms spectral bisection. References T. Barth. Personal Communication, December 1991. R. Bopana, Eigenvalues and graph bisection: An average case analysis, in Proc. 28th Annual Symposium on Foundations of Computer Science, IEEE, 1987, pp. 280-285. W. Donath and A, Hoffman, Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices, IBM Technical Disclosure Bulletin, 15 (1972), pp. 938-9. -, Lower bounds for the partitioning of graphs, IBM J. Res. Develop., 17 (1973), pp. 20-25. M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Math. J., 23 (1973), pp. 298-305. -, A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, Czechoslovak Math. J., 25 (1975), pp. 619-633. S. Hammond, Mapping unstructured grid computations to massively parallel computers, PhD thesis, Rensselaer Polytechnic Institute, Dept, of Computer Science, Rensselaer, NY, 1992. B. Hendrickson and R. Leland, Domain mapping of parallel scientific computations, Tech. Rep. SAND 92-161, Sandia National Laboratories, Albuquerque, NM, 1992. -, An improved spectral graph partitioning algorithm for mapping parallel computations, Tech. Rep. SAND 92-160, Sandia National Laboratories, Albuquerque, NM, 1992. B. Kernighan and S. Lin, An eficient heuristic procedure for partitioning graphs, Bell System Technical Journal, 29 (1970), pp. 291-307. A. Pothen, H. Simon, and K. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal., 11 (1990), pp. 30-52. D. Powers, Graph partitioning by eigenvectors, Lin. Alg. Appl., 101 (1988), pp. 121-133. F. Rend1 and H. Wolkowicz, A projection technique for partitioning the nodes of a graph, Tech. Rep. CORR 90-20, University of Waterloo, Faculty of Mathematics, Waterloo, Ontario, November 1990. H. Simon, Partitioning of unstructured problems for parallel processing, in Proc. Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications, Pergammon Press, 1991. T. Tokuyama and J. Nakano, Geometric algorithms for a minimum cost assignment problem, in Proc. 7th Annual Symposium on Computational Geometry, ACM, 1991, pp. 262-271. R. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Concurrency, 3 (1991), pp. 57-81.