
Numerical Linear Algebra with Applications, Vol. 1(1), 1–1 (1996)

Parallel Adaptive Multigrid Methods in Plane Linear Elasticity Problems

Peter Bastian, Stefan Lang
Institut für Computeranwendungen III, Universität Stuttgart, Pfaffenwaldring 27, D-70569 Stuttgart, Federal Republic of Germany ({peter,stefan}@ica3.uni-stuttgart.de)

and Knut Eckstein
Graduiertenkolleg Kontinua und Strömungen, ISD, Universität Stuttgart, D-70569 Stuttgart, Federal Republic of Germany (eckstein@isd.uni-stuttgart.de)

In this paper we discuss the implementation of parallel multigrid methods on unstructured, locally refined meshes for 2D linear elasticity calculations. The problem of rebalancing the workload of the processors during execution of the program is discussed in detail and a load balancing algorithm suited for hierarchical meshes is proposed. For large problems the efficiency per multigrid iteration ranges from 65% on 64 processors in the locally refined case to 85% on 256 processors in the uniformly refined case. All calculations were carried out on a CRAY T3D machine.

KEY WORDS  multigrid, unstructured grids, adaptive local grid refinement, MIMD computers, dynamic load balancing, linear elasticity

1. Introduction

The efficient numerical solution of partial differential equations is an important field of active research. Over the last decade a number of techniques have been developed to reduce computing time. The first possibility to save time is in the discretization step. Here adaptive local grid refinement concentrates degrees of freedom in the critical parts of the solution domain. To that end an a-posteriori error estimator (or indicator) is applied to a given numerical solution, and if the error exceeds a prescribed tolerance a modification strategy produces a refined mesh based on local quantities computed by the error estimator.

1070-5325/96/010001-01$5.50          Received 15 December 1995
© 1996 by John Wiley & Sons, Ltd.    Revised 30 May 1996

This procedure is repeated until the required tolerance has been reached. A review of these techniques is given in [1]. On each of the adaptively generated meshes a numerical solution has to be computed. Therefore a fast iterative solver is needed that can be stopped when the solution error has reached the discretization error. For elliptic partial differential equations multigrid methods are the fastest methods known so far. They have the important property that their convergence rate is independent of the mesh size. The optimality of the method for scalar elliptic problems on unstructured and locally refined meshes without assumptions on the regularity of the differential operator has been shown only very recently. An overview can be found in the paper by YSERENTANT [2]. The multigrid method fits nicely into the framework of adaptive local grid refinement, which has been exploited already in a number of computer codes like PLTMG [3] or KASKADE [4].

Yet another method to save computing time is the use of modern computer architectures, i.e. parallel computers. But it has to be emphasized that the use of parallel computers will allow one to solve bigger problems than before only if the methods mentioned above are implemented in parallel. If non-optimal solvers are used on the parallel machine, a large part of its capacity is wasted to compensate for the gains that could be achieved by using multigrid on a single processor.

Figure 1. Basic adaptive solution strategy (provide initial grid; assemble equations; solve; estimate discretization error and flag elements for refinement; compute mapping and migrate elements; compute new nested grid hierarchy and interpolate solution; exit when the tolerance is reached).

Therefore the aim of this paper is to combine adaptive local grid refinement and multigrid on parallel computers. This immediately leads to the problem of dynamic load balancing, since some processors will refine more elements than other processors. Since the location of refinement is in general not known in advance, the load balancing problem must be solved during run-time. The load balancing problem consists of two parts: First, a mapping of the data (e.g. elements) to the processors must be determined that balances execution time in the solver. Second, the data must be migrated according to this new mapping. The computation of the optimal mapping is an NP-complete discrete optimization problem

which must be solved approximately by a heuristic procedure that does not dominate the execution time of the multigrid solver. Fig. 1 shows an outline of the parallel adaptive solution strategy.

This paper is organized as follows. In the next section we will describe shortly the equations of linear elasticity in two space dimensions and their discretization with the Finite-Element method. Section 3 covers the other components of the adaptive solution algorithm, especially the multigrid solver. Section 4 gives an overview of the parallelization approach based on data partitioning. Then section 5 describes in detail the load balancing algorithm that has been developed. Finally section 6 contains some speedup results for uniform and adaptive calculations.

2. Mathematical Model and its Discretization

2.1. Variational Formulation

Subject of the computations presented in this paper is the classical planar, first order theory of elasticity. The restriction to linear material laws is going to be lifted soon, as the routines allowing for nonlinear material laws are currently under implementation. The calculation of the displacement response of an elastic body subjected to prescribed forces and displacements is equivalent to solving the following variational problem:

    min_u  (1/2) ∫_Ω σ(u) : ε(u) dx − ∫_Ω f · u dx − ∫_{Γ₁} t̄ · u ds    (2.1)

This expression, based on the mechanical axiom of static equilibrium, represents the minimal energy criterion for the elastic body which covers the domain Ω. The physical interpretation is such that under the above condition of minimal elastic energy the body enters the equilibrium state, with the sum of the internal forces being equivalent to the sum of the external forces. The minimal energy criterion contains three different variables describing the state of the elastic body: the two-dimensional displacement vector u, the deformation tensor ε and the stress tensor σ. The vectors f and t̄ hold the external forces and the boundary stresses respectively.

As displacement, deformation and stress components of the elastic body are not independent of each other, several approaches to solving the variational problem exist. We choose the so-called displacement formulation, which uses the dependencies between the description variables to eliminate σ as well as ε. The linear material law

    σ = E/(1+ν) · ( ε + ν/(1−2ν) · tr(ε) I )    (2.2)

is employed in the elimination of the stress tensor σ, where E is the Young modulus and ν the Poisson number. E and ν are set to values that are a common choice for steel materials. First order linearization of the kinematic coupling condition between the displacements and the deformations yields

    ε_ij = (1/2) ( ∂u_i/∂x_j + ∂u_j/∂x_i ).    (2.3)

Short-handing the above finally results in

    min_u  (1/2) ∫_Ω ε(u)ᵀ C ε(u) dx − ∫_Ω f · u dx − ∫_{Γ₁} t̄ · u ds    (2.4)

with the material tensor C containing expressions of E and ν. Γ₁ is usually being referred to as the Neumann boundary. The weak form corresponding to the variational problem (2.4) can be written as

    ∫_Ω ε(v)ᵀ C ε(u) dx = ∫_Ω f · v dx + ∫_{Γ₁} t̄ · v ds,    (2.5)

where the desired solution u has to satisfy this expression for any valid choice of test function v.
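For illustration, the sketch below builds the matrix form C of the material law (2.2) in Voigt notation and evaluates the linearized strain (2.3) from a displacement gradient. The plane-strain interpretation, the Voigt ordering (ε_xx, ε_yy, 2ε_xy) and the concrete numbers for E and ν are assumptions of this sketch, not values taken from the paper.

    import numpy as np

    def material_matrix(E, nu):
        # Plane-strain matrix C in Voigt ordering [eps_xx, eps_yy, 2*eps_xy],
        # derived from (2.2): sigma = E/(1+nu) (eps + nu/(1-2nu) tr(eps) I).
        lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))  # Lame parameter lambda
        mu = E / (2.0 * (1.0 + nu))                     # shear modulus
        return np.array([[lam + 2.0 * mu, lam,            0.0],
                         [lam,            lam + 2.0 * mu, 0.0],
                         [0.0,            0.0,            mu ]])

    def strain(grad_u):
        # Linearized strain (2.3) from a 2x2 displacement gradient, Voigt form.
        eps = 0.5 * (grad_u + grad_u.T)
        return np.array([eps[0, 0], eps[1, 1], 2.0 * eps[0, 1]])

    C = material_matrix(E=2.1e5, nu=0.3)   # steel-like values (assumption)
    sigma = C @ strain(np.array([[1.0e-3, 0.0], [0.0, -3.0e-4]]))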

2.2. Discretization

In order to find an approximate solution u_h on a given mesh denoted by T we define the Finite Element space V_h. We choose standard isoparametric triangles and quadrilaterals with piecewise linear/bilinear shape functions φ_i which result in 1 at the node i and in 0 at all other nodes. Then the discrete problem can be stated as follows: Find u_h ∈ V_h × V_h such that

    ∫_Ω ε(v_h)ᵀ C ε(u_h) dx = ∫_Ω f · v_h dx + ∫_{Γ₁} t̄ · v_h ds    for all v_h ∈ V_h × V_h.    (2.6)

In the case of the quadrilateral elements we employ selective reduced integration (SRI) on the shear components in order to reduce the well known problems this element exhibits when subjected to bending or volumetric deformation modes. By expressing the unknown (vector-valued) function u_h in the basis given by the φ_i with coefficients u_i we obtain the linear, symmetric and positive definite system of equations

    A u = b    (2.7)

that will be solved by the multigrid method. It should be noted, however, that the application of the multigrid method is not restricted to symmetric matrices, which is important for the extension of this work to elastoplastic problems. A detailed discussion and analysis of the variational problem of plane elasticity and its discretization outlined here can be found in [5] and [6].
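As a minimal sketch of how the global system (2.7) arises, the scatter-add loop below sums precomputed element stiffness matrices and load vectors into a global matrix. The element-level quadrature (including the SRI modification) is taken as given, and the dense global matrix is for brevity only; a real code would use a sparse format.

    import numpy as np

    def assemble(n_dofs, element_matrices, element_loads, dofmaps):
        # Sum element contributions into the global, symmetric positive
        # definite system A u = b of Eq. (2.7).
        # dofmaps[e]: global dof indices of the nodes of element e.
        A = np.zeros((n_dofs, n_dofs))
        b = np.zeros(n_dofs)
        for Ke, fe, idx in zip(element_matrices, element_loads, dofmaps):
            A[np.ix_(idx, idx)] += Ke    # scatter-add local into global
            b[idx] += fe
        return A, b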

3. The adaptive solution algorithm

3.1. Grid refinement

The multigrid method works on a sequence of successively finer meshes. The initial mesh is intentionally coarse, but should be fine enough to resolve important details of the geometry. The sequence of finer meshes is constructed with the refinement algorithm of BANK that is also used in the codes PLTMG (see [3]) and KASKADE (see [4]). However, we allow quadrilateral elements and more refinement rules (see Fig. 2).

Figure 2. Complete set of refinement rules: (a) regular rules, (b) irregular rules, (c) copy rules.

The refinement algorithm is explained with the help of Fig. 3. The coarsest grid level T₀ is assumed to be generated by some mesh generator, and all elements of this level are defined to be regular elements. Then a refinement rule can be applied to each element, resulting in the generation of new elements on the next finer level. Each refinement rule is either of type regular, irregular or copy (see Fig. 2 for all possible rules), producing regular, irregular or copy elements on the next finer level. An irregular or copy element allows only the application of copy rules, whereas all types of rules can be applied to regular elements. This strategy generates meshes satisfying a minimum angle criterion, since irregular refinement rules can only be applied once. Note that the refinement is local in the sense that an element may not be refined at all. The copy elements are only needed for an efficient implementation of the local multigrid method and do not destroy the optimal complexity of the method. The refinement algorithm is responsible for generating a conforming mesh on each level, i.e. the intersection of two different elements is either empty, a node or an edge. In practice it will also happen frequently that the error estimator decides to refine irregular elements or a regular element with irregular neighbours. In that case the irregular refinement is removed and replaced by a regular refinement rule. There exist other well known refinement strategies, e.g. hanging nodes or transition elements. The subsequently explained load balancing strategies are applicable to any refinement method.

Let T_l denote the set of elements on level l and N_l the set of nodes on level l generated by the refinement algorithm described so far. Then the mesh T defined by

    T = ∪_{l≥0} { t ∈ T_l | t is not further refined }    (3.8)

is the mesh that defines the Finite-Element space for the discrete solution of our problem. The nodes of this mesh are given by

    N = ∪_{l≥0} { n ∈ N_l | n did not exist in N_{l−1} }.    (3.9)
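The hierarchy of Section 3.1 can be pictured as an element tree whose leaves form the composed mesh (3.8). The following data-structure sketch encodes only the rule-type restrictions stated above; the field names and representation are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class Element:
        level: int
        kind: str = "regular"              # "regular", "irregular" or "copy"
        sons: list = field(default_factory=list)

    def refine(elem, rule_kind, n_sons):
        # Only copy rules may be applied to irregular or copy elements.
        if elem.kind != "regular" and rule_kind != "copy":
            raise ValueError("irregular/copy elements admit only copy rules")
        # A rule of type X produces elements of type X on the next finer level.
        elem.sons = [Element(elem.level + 1, rule_kind) for _ in range(n_sons)]

    def composed_mesh(roots):
        # The leaf elements over all levels: the mesh T of Eq. (3.8).
        leaves, stack = [], list(roots)
        while stack:
            e = stack.pop()
            if e.sons:
                stack.extend(e.sons)
            else:
                leaves.append(e)
        return leaves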

Figure 3. Nested local grid refinement.

3.2. Local multigrid method

In order to solve the discrete problem derived from a discretization on the mesh T, a sequence of auxiliary problems

    A_l x_l = b_l,    l = 0, ..., j,    (3.10, 3.11)

is used. A_l corresponds to a discretization on the grid level T_l. In particular, when we define

    S_l = { n ∈ N_l | n is a corner of a regular element t ∈ T_l } ∪ { n ∈ N_l | n is connected to such a node },    (3.12)

we need the stiffness matrix and the load vector only at the nodes corresponding to the set S_l. The additional layer of copy elements in the mesh structure allows us to compute exactly this part of A_l. The main conclusion of the papers [7] and [8] with respect to locally refined meshes was that it is sufficient to smooth only the unknowns corresponding to the nodes in the set S_l on level l in order to achieve a convergence rate that is independent of the number of refinement steps and the space dimension for elliptic problems. The number of arithmetical operations for one iteration of the multigrid method can be shown to be proportional to the size of the sets S_l, i.e. the dimension of the systems A_l.

ALGORITHM 3.1. One iteration of the multiplicative local multigrid method is given by algorithm mlmg. We only describe the V-cycle, since the W-cycle has no optimal complexity for arbitrary local refinement. The vectors x and b contain the current solution and the load vector and live on the nodes N.

mlmg(j, x, b):
(1)   d ← b − A x on the finest level j; v ← 0
(2)   for l = j, ..., 1:
(3)       apply ν₁ pre-smoothing steps, changing v only at the nodes in S_l
(4)       d ← d − A_l v (the defect changes only at nodes in S_l)
(5)       restrict d to level l−1 at the nodes of S_{l−1} that also lie in S_l; at all other nodes compute the defect on level l−1
(6)       v ← 0 on level l−1
(7)   solve A₀ v = d exactly on the coarsest level
(8)   for l = 1, ..., j:
(9)       update x with the correction at those nodes that are not changed any further on the higher levels
(10)      interpolate the correction from level l−1 to level l
(11)      apply ν₂ post-smoothing steps at the nodes in S_l
(12)  update x at the remaining nodes not yet corrected on the coarser levels

The aim of algorithm mlmg is the computation of a correction v to the given iterate x by applying one V-cycle to the defect problem. The algorithm starts in line 1 with the computation of the defect on the finest level (which does not necessarily cover the whole domain Ω!). The loop in line 2 goes from top to bottom and applies ν₁ pre-smoothing steps where only the values of v at the nodes corresponding to S_l are changed (line 3). This results in the change of the defect only at nodes in S_l (line 4). In order to obtain the defect on the next coarser level we restrict only at the nodes in S_{l−1} that are also in S_l on the fine mesh. For all other nodes the defect can be computed on the coarser grid level, since the solution has not been changed yet at those positions (line 5). Like in standard multigrid we start with a zero correction on the next coarser level in line 6. Line 7 contains the exact solve on the coarsest level. The loop in line 8 now realizes the upward part of the V-cycle. It starts with an update of the current iterate in line 9 with the corrections at those nodes that are not changed any further on the higher levels of the grid. Note that each component of the global vector is altered only once. Line 10 contains the interpolation step and line 11 the post-smoothing steps. Finally line 12 updates the current iterate at the remaining nodes not yet corrected at the coarser levels. As smoothers we use point-block variants of Jacobi, Gauß-Seidel or ILU (see [9]) iterations, i.e. whenever a division by the diagonal element occurs, we multiply with the inverse of the 2×2 block matrix corresponding to the two unknowns at a node.
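As an illustration of the control flow of mlmg, here is a serial, recursive sketch of a V-cycle that smooths only at the nodes flagged in S_l. It assumes dense numpy matrices, a damped point-Jacobi smoother and given transfer operators, and it recomputes full defects for clarity where the paper's version restricts all work to the sets S_l to retain optimal complexity; it is not the UG implementation.

    import numpy as np

    def vcycle(A, R, P, S, l, v, d, nu1=1, nu2=1, omega=0.8):
        # Correction v for A[l] v = d; smoothing restricted to the mask S[l].
        # A[l]: level-l matrix; R[l]: restriction l+1 -> l; P[l]: prolongation
        # l -> l+1; S[l]: boolean array marking the smoothing nodes S_l.
        if l == 0:
            return np.linalg.solve(A[0], d)          # exact coarsest solve
        diag = np.diag(A[l])
        for _ in range(nu1):                         # pre-smoothing on S_l
            r = d - A[l] @ v
            v = v + np.where(S[l], omega * r / diag, 0.0)
        r = d - A[l] @ v                             # defect after pre-smoothing
        w = vcycle(A, R, P, S, l - 1, np.zeros(A[l - 1].shape[0]),
                   R[l - 1] @ r, nu1, nu2, omega)
        v = v + P[l - 1] @ w                         # interpolate the correction
        for _ in range(nu2):                         # post-smoothing on S_l
            r = d - A[l] @ v
            v = v + np.where(S[l], omega * r / diag, 0.0)
        return v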

3.3. Refinement indicator

The adaptive grid refinement algorithm requires an indicator which decides whether a finite element is to be further refined or not. We choose a very simplistic one based on the idea of nodal stress comparison. With the linear isoparametric triangles and quadrilaterals employed in our calculations the stress components typically exhibit discontinuities across element borders. In our case an element will be refined when the relative change in the value of the Von Mises stress

    σ_vM = √( σ₁₁² + σ₂₂² − σ₁₁σ₂₂ + 3σ₁₂² )    (3.13)

in any node towards adjacent elements surpasses a given value. The Von Mises stress was chosen as it takes into account all three components of σ and as it is invariant of the coordinate system in use. In order to obtain relative values, the σ_vM are divided by the maximum main stress, which is the largest eigenvalue of the stress tensor found on the entire problem domain.

The above indicator was chosen for reasons of simplicity and ease of implementation. Our primary aim in this context is to demonstrate the usability and scalability of our data structures, numerical algorithms and general concepts in the context of a parallel adaptive application from structural mechanics. The indicator is based on practitioners' experience alone and does not claim a solid theoretical basis, yet the numerical results were satisfactory. This is due to the fact that the refinement algorithm presented above strictly avoids badly shaped elements with large aspect ratios, large differences in size between neighbouring elements and other grid deficiencies that would undoubtedly inhibit this indicator.
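In code, the indicator amounts to comparing the nodal von Mises stress (3.13) of neighbouring elements. The data layout (per-element nodal stress triples and a list of element pairs sharing a node) and the tolerance are assumptions of this sketch.

    import numpy as np

    def von_mises(s11, s22, s12):
        # Plane von Mises stress, Eq. (3.13).
        return np.sqrt(s11**2 + s22**2 - s11 * s22 + 3.0 * s12**2)

    def flag_elements(nodal_stress, shared_nodes, sigma_max, tol):
        # nodal_stress[e][n]: (s11, s22, s12) of element e at node n;
        # shared_nodes: triples (e, f, n) of adjacent elements e, f and a
        # common node n. Flags both elements when the von Mises jump,
        # normalized by the maximum main stress, exceeds tol.
        flagged = set()
        for e, f, n in shared_nodes:
            jump = abs(von_mises(*nodal_stress[e][n]) -
                       von_mises(*nodal_stress[f][n]))
            if jump / sigma_max > tol:
                flagged.update((e, f))
        return flagged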

4. Parallelization Approach

4.1. Grid Partitioning

The parallelization of all components of the adaptive multigrid method is based on a distribution of the data onto the set of processors. In our method the elements are assigned uniquely to the processors. This results in horizontal and vertical overlap as shown in Fig. 4.

Figure 4. Horizontal and vertical overlap in the data partitioning.

The left part of Fig. 4 shows the situation on one grid level (intra-grid). For instance the node in the center is stored in three copies on the three processors P1, P2 and P3. The right part of Fig. 4 shows the vertical or inter-grid situation. The white triangle t on level k−1 possesses four son triangles on level k, each assigned to a different processor. Here the rule applies that for every element assigned to a processor also a copy of the father element must be stored on that processor, including all its nodes. Consequently 4 additional copies of triangle t will exist on level k−1. Obviously the load balancer should avoid this situation whenever possible.

4.2. Assembling the stiffness matrix

In the Finite-Element method the sum of all element stiffness matrices yields the global stiffness matrix A_l on grid level l. In the parallel version the element stiffness matrices are only summed per processor p to give a processor stiffness matrix A_l^p. The same is done with the right hand side. Consequently we have the relations

    A_l = Σ_p A_l^p,    b_l = Σ_p b_l^p.    (4.14)

Note that A_l^p and b_l^p can be computed locally without any communication and that load balancing is optimal if each processor has the same number of elements. We say that the vector b is stored inconsistently, since at the nodes that are stored on more than one processor each processor knows only part of the global value. On the other hand, if for some vector x we have (x^p)_i = (x)_i, i.e. each processor knows the global value also in the overlap nodes, we say that x is stored consistently. In order to transform an inconsistent vector into a consistent vector, a communication operation over the interface nodes is required.
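The transformation from inconsistent to consistent storage is a sum over all processor copies of each interface node. The sketch below simulates that interface communication serially with per-processor dictionaries keyed by global node ids; it is a stand-in for actual message passing.

    def make_consistent(local_vectors):
        # local_vectors[p]: {global_node_id: partial value} on processor p
        # (inconsistent storage). Returns per-processor dictionaries in which
        # every node carries the sum over all its copies (consistent storage).
        totals = {}
        for vec in local_vectors:                  # gather partial values
            for n, val in vec.items():
                totals[n] = totals.get(n, 0.0) + val
        return [{n: totals[n] for n in vec} for vec in local_vectors]

    # Example: node 7 is stored on processors 0 and 1 with partial values.
    print(make_consistent([{7: 0.25, 3: 1.0}, {7: 0.75}]))
    # -> [{7: 1.0, 3: 1.0}, {7: 1.0}]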

4.3. Restriction and Prolongation

The concept of inconsistent vectors is very useful for the parallelization of the grid transfer operators. The restriction operates on the defect. Now let us assume that b_l and d_l are stored inconsistently and x_l is stored consistently. Together with the linearity of all the operators involved we have

    d_{l−1} = R_{l−1} d_l = R_{l−1} Σ_p (b_l^p − A_l^p x_l) = Σ_p R_{l−1}^p (b_l^p − A_l^p x_l) = Σ_p d_{l−1}^p,    (4.15)

which means that an inconsistent defect on the fine grid is restricted to an inconsistent defect on the coarse grid. This does not need any communication as long as the sons of an element t are mapped to the same processor as t. Under the same assumptions a consistently stored correction v_{l−1} can be interpolated without any communication.

4.4. Smoothers

For the smoother we either have the possibility of using Point-Jacobi or Block-Jacobi with inexact inner solvers. In the simple Point-Jacobi case we need diag(A_l) on each processor, which requires one communication at the beginning of the solution cycle since A_l is stored inconsistently. But then we compute:

    x_new = x + ω diag(A_l)⁻¹ Σ_p d^p.    (4.16)

The sum corresponds to the fact that the inconsistent defect must be transformed into a consistent defect, since only a consistent correction can be added to the consistently stored solution.

In the case of a Block-Jacobi smoother we assign also the nodes of the grid uniquely to the processors. For a node that is stored on more than one processor we define that the processor with the smallest number is responsible for this node and all other processors compute a zero correction for it. Now the matrix M is a block diagonal matrix M = diag(M₁₁, ..., M_PP), where M_pp corresponds to one step of Gauß-Seidel or ILU for the unknowns assigned to processor p. Now the smoothing iteration is given by

    x_new = x + ω M⁻¹ Σ_p d^p,    (4.17)

i.e. we need two communications over the interface nodes per smoothing step.

4.5. Refinement Algorithm

The most delicate part in the parallelization of the refinement algorithm is the green closure, where a conforming mesh has to be constructed from the refinement tags produced by the error indicator. Usually this step involves an iteration that can be implemented with optimal complexity on a serial machine. On the parallel machine this approach is not very useful, therefore we implemented a complete set of refinement rules (see Fig. 2) in the sense that for each possible pattern of refined edges (8 for triangles, 16 for quadrilaterals) there is a refinement rule that fits the given edge pattern. With this approach an iteration is completely avoided and the conforming mesh can be computed with only one communication over the interface elements. All these additional elements are considered as irregular elements and cannot be refined further.
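The complete rule set can be realized as a lookup from the bitmask of refined edges to a rule, which removes the closure iteration entirely. The triangle table below (3 edge bits, 8 patterns) is a hypothetical skeleton; the rule ids do not reproduce the paper's actual numbering in Fig. 2.

    # Hypothetical rule table for triangles: key = bitmask of refined edges
    # (bit i set <=> edge i refined); value = (rule id, rule type).
    TRI_RULES = {
        0b000: (12, "copy"),        # no edge refined: copy the element
        0b111: ( 1, "regular"),     # all edges refined: regular rule
        0b001: ( 5, "irregular"),   # green closure cases: one or two refined
        0b010: ( 6, "irregular"),   #   edges are matched by irregular rules
        0b100: ( 7, "irregular"),   #   that cannot be refined any further
        0b011: ( 8, "irregular"),
        0b101: ( 9, "irregular"),
        0b110: (10, "irregular"),
    }

    def closure_rule(refined_edges):
        # One table lookup per element replaces the serial closure iteration.
        mask = sum(1 << i for i in refined_edges)
        return TRI_RULES[mask]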

5. Load Balancing

5.1. Goals

The purpose of the load balancing algorithm is to assign the data to the processors in such a way that execution time is minimal. In this generality it is very hard to solve this problem (even approximately). Therefore we make the assumption that the computation time between any two synchronization points is reasonably large compared to the communication time (coarse granularity). This means that in the first place one should assign the data in such a way that the computation time is equally balanced between the synchronization points, and that minimizing the communication time is only the second goal.

The overall solution algorithm described so far consists of the different parts assembly, multigrid solver, error estimator and grid refinement. Fortunately it turns out that assigning an equal number of elements to each processor will balance the work load very well for all parts of the algorithm, especially in the element assembly phase, which can be the most time consuming part in case of nonlinear problems. The mapping of all other data objects, e.g. the nodes, is completely determined by the assignment of the elements. The most time consuming operation in the multigrid solver is the matrix vector product. If the mesh consists of either triangles or quadrilaterals it can be shown that the time needed for a matrix vector product is proportional to the number of elements plus a term proportional to the number of boundary nodes. As long as the number of boundary nodes is negligible compared to the number of interior nodes, the assignment of an equal number of elements to each processor will therefore also balance the work load in the multigrid solver.

The second goal of the load balancer is now to minimize communication requirements. The assembly part requires no communication at all, therefore we concentrate on the communication in the multigrid solver. Since the processors will be synchronized at least once on each grid level in the smoother, it is tempting to require that the elements of each grid level should be assigned to all processors in such a way that communication in the smoother is (approximately) minimized. The assignments of two consecutive grid levels should, however, be related in order to minimize also the communication in the grid transfers. In the case of local refinement even simple one-dimensional examples show that these two requirements are contradictory. Therefore the assignment must be done such that a compromise between low communication in the smoother and low communication in the grid transfer is achieved. In the following we propose a clustering strategy based on the multigrid hierarchy that will achieve this compromise. The outline of the load balancing strategy is as follows:

(i) Combine elements into clusters using the multigrid hierarchy. This step can be done completely in parallel.
(ii) Transfer all cluster information to the master processor.
(iii) Assign clusters to processors such that each processor has (approximately) the same number of elements on each grid level and communication on each level is low. This is done on a single processor.
(iv) Transfer the mapping information back to the cluster owners.
(v) Redistribute the data structures in parallel.

5.2. Clustering

For the clustering algorithm we require a tree-based (local) refinement strategy as was described in section 3.1 above. By T_l we denote all elements of level l and by T = ∪_{l≥0} T_l we denote the set of all elements. Since the refinement is based on subdividing individual elements we have the father relationship:

    F ⊂ T × T,  (t, t′) ∈ F  ⟺  t is generated by refinement of t′.    (5.18)

Since the father of an element is unique, we can express the relation also in a functional form, i.e. f(t) = t′ ⟺ (t, t′) ∈ F. The sons of an element are defined by

    S(t) = { t′ | (t′, t) ∈ F };    (5.19)

for elements of the finest level, S(t) = ∅. A simple recursive formula gives us the number of elements in the subtree that has a given element as its root:

    s(t) = 1 + Σ_{t′ ∈ S(t)} s(t′)  if S(t) ≠ ∅,  and  s(t) = 1  otherwise.    (5.20)
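Equation (5.20) in executable form is a memoized post-order count of the refinement subtree below each element; the dictionary-based element representation is an assumption of the sketch.

    def subtree_size(t, sons, memo=None):
        # s(t) of Eq. (5.20): number of elements in the subtree rooted at t.
        # sons[t]: list of the direct sons of t (empty for leaf elements).
        if memo is None:
            memo = {}
        if t not in memo:
            memo[t] = 1 + sum(subtree_size(u, sons, memo) for u in sons[t])
        return memo[t]

    # Element "a" refined into four sons, one of which is refined again:
    sons = {"a": ["b", "c", "d", "e"], "b": ["f", "g"],
            "c": [], "d": [], "e": [], "f": [], "g": []}
    assert subtree_size("a", sons) == 7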

In order to quantify the communication costs (approximately) we also need a neighbour relation on the elements:

    NB_l ⊂ T_l × T_l,  (t, t′) ∈ NB_l  ⟺  t is a neighbouring element of t′.    (5.21)

The union over all levels gives NB = ∪_{l≥0} NB_l. Now we are in a position to define the clusters. In general a clustering is a partitioning of the set T, i.e.

    C = { c₁, ..., c_m },  c_i ⊂ T,    (5.22)

such that ∪_i c_i = T and c_i ∩ c_j = ∅ for i ≠ j. The partitioning defines a mapping from elements to clusters that we will also denote by c: c(t) = c_i ⟺ t ∈ c_i. Some additional quantities can be derived from the partitioning. First we need the lowest and highest level of any element in a cluster:

    bot(c) = min { l | c ∩ T_l ≠ ∅ },   top(c) = max { l | c ∩ T_l ≠ ∅ }.    (5.23)

We also need the number of elements of a cluster on each level and in total:

    ε_l(c) = |{ t ∈ T_l | c(t) = c }|,   ε(c) = |{ t ∈ T | c(t) = c }|,    (5.24)

where |M| denotes the number of elements in a set M. In the following we require that the clustering has the following important properties:

(i) ε_{bot(c)}(c) = 1 for all clusters c ∈ C. The unique t ∈ c with level bot(c) is called the root element of the cluster and is denoted by root(c).

(ii) For all clusters c ∈ C we require: (bot(c) < l ≤ top(c)) ∧ (t ∈ c ∩ T_l) ⟹ f(t) ∈ c. This definition ensures that the elements in a cluster form a subtree of the element tree structure.

Therefore the relation F implies a relation F_C on the cluster set C by

    ((t, t′) ∈ F ∧ c(t) ≠ c(t′))  ⟹  (c(t), c(t′)) ∈ F_C.    (5.25)

The neighbour relation NB also implies a neighbour relation NB_C on the clusters via

    ((t, t′) ∈ NB ∧ c(t) ≠ c(t′))  ⟹  (c(t), c(t′)) ∈ NB_C.    (5.26)

The following algorithm constructs a clustering with the desired properties.

ALGORITHM 5.1. Clustering of an element set. The following algorithm cluster receives a multigrid hierarchy (with highest level j) as input and delivers a partitioning into clusters C as output. The parameters b, d and m control the algorithm: b is the baselevel, since in practice we start partitioning on a level higher than zero if the coarse grids are very coarse, or in the dynamic situation where we do not like to rebalance the coarsest levels; d is the desired depth of the clusters and m is the minimal size of the clusters.

cluster(T, b, d, m):
    C ← ∅
    for l = b, ..., j:
        for t ∈ T_l:
            if (s(t) ≥ m) ∧ ((l − b) mod d = 0):
                create a new cluster c;  C ← C ∪ {c}
                bot(c) ← top(c) ← l;  root(c) ← t;  ε_i(c) ← 0 for all i
            else:
                c ← c(f(t))
            c(t) ← c;  top(c) ← max(top(c), l)
            ε_l(c) ← ε_l(c) + 1;  ε(c) ← ε(c) + 1
    return C

The algorithm proceeds as follows: It runs over all levels from b to j and over all elements within each level. If the subtree defined by the current element is large enough and the level relative to b is a multiple of d, the current element becomes the root of a new cluster, else it is put into the cluster of its father element. In the dynamic situation, when the multigrid structure is already distributed over the processors, algorithm cluster can be run in parallel. If the parameters b, d and m are not changed within one run, then only the computation of s(t) requires communication (comparable to a restriction from the finest to the coarsest mesh). In our implementation the parallel grid refinement algorithm imposes an additional constraint that excludes some elements from becoming the root of a new cluster; since this constraint will be removed in a new version of our code we refer to [10] for details.
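A serial sketch of Algorithm 5.1 under the same element representation as the subtree-size snippet above; clusters are identified with their root elements, and small subtrees on the baselevel are rooted on their own (an assumption for the boundary case the listing leaves open).

    def cluster(levels, father, sons, b, d, m):
        # levels[l]: ids of the elements on level l (l = b, ..., j);
        # father[t]/sons[t]: tree structure. Returns a dict mapping every
        # element to the root element of its cluster.
        size = {}
        def s(t):                                  # subtree size, Eq. (5.20)
            if t not in size:
                size[t] = 1 + sum(s(u) for u in sons[t])
            return size[t]
        c = {}
        for l in range(b, len(levels)):
            for t in levels[l]:
                if s(t) >= m and (l - b) % d == 0:
                    c[t] = t                       # t roots a new cluster
                elif father.get(t) in c:
                    c[t] = c[father[t]]            # join the father's cluster
                else:
                    c[t] = t                       # baselevel boundary case
        return c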

5.3. Balancing the Clusters

After the clustering step, the clusters have to be assigned to the processors. This assignment problem is solved on a single processor in our current implementation. The assignment heuristic is given by the following two algorithms mg assign and assign.

ALGORITHM 5.2. Algorithm mg assign maps a set of clusters C to a set of processors P by repeatedly solving smaller assignment problems with particular subsets of C. As parameters it receives the baselevel b used in the clustering algorithm, the highest level j of the multigrid hierarchy and the minimum number of elements desired per processor, m̂. The number m̂ usually depends on the hardware. Algorithm mg assign uses another algorithm assign that solves the smaller assignment problems. Assign is a modification of standard graph partitioning algorithms and several variants will be discussed below.

mg_assign(C, P, b, j, m̂):
         for p ∈ P, l = b, ..., j:  load[l][p] ← 0
    (1)  for l = j, ..., b:
    (2)      C_l ← { c ∈ C | top(c) = l };  if C_l = ∅ continue
    (3)      n_l ← Σ_p load[l][p] + Σ_{c ∈ C_l} ε_l(c)
    (4)      determine P′ ⊆ P with |P′| = min(|P|, n_l / m̂)
    (5)      assign(l, C_l, P′, load)
    (6)      for c ∈ C_l:
    (7)          for i = bot(c), ..., top(c):  load[i][proc(c)] ← load[i][proc(c)] + ε_i(c)

Algorithm mg assign proceeds as follows. It uses a two-dimensional array load[l][p] to store the number of level-l elements that have been assigned to processor p. It proceeds from top to bottom (loop in line 1) and selects the clusters with the currently highest top level that have not yet been assigned (line 2). Line 3 computes the number of elements on this level and line 4 determines the number of processors that will be used for that level. Lines 3 and 4 implement a coarse grid agglomeration strategy that uses fewer processors when the grids get coarser (controlled by the parameter m̂). In line 5 algorithm assign is called to assign the clusters C_l to the (sub-)set of processors P′. Since some level-l elements have already been assigned in previous iterations, algorithm assign also receives the array load to take this into account. Finally lines 6 and 7 update the load array.

We now give a generic version of algorithm assign that is used in algorithm mg assign above. Assign is a modification of the recursive bisection idea that is able to take into account that some elements have already been assigned to some processors.

ALGORITHM 5.3. Algorithm assign maps a given set of clusters C′ to a given set of processors P′ such that the work on level l of the multigrid hierarchy is balanced. In order to take into account that the processors are already loaded with some elements on level l, it receives the array load. The output of the algorithm is given by the mapping proc: C′ → P′.

assign(l, C′, P′, load):
    (1)  if |P′| = 1: set proc(c) ← p for all c ∈ C′ (where P′ = {p}); return
    (2)  divide P′ into two halves P₀ and P₁
    (3)  for k = 0, 1:  load_k ← Σ_{p ∈ P_k} load[l][p]
    (4)  w ← load₀ + load₁ + Σ_{c ∈ C′} ε_l(c)
    (5)  determine C₀, C₁ with C₀ ∪ C₁ = C′, C₀ ∩ C₁ = ∅ such that
    (6)      | load₀ + Σ_{c ∈ C₀} ε_l(c) − w · |P₀| / |P′| |  →  min
    (7)  assign(l, C₀, P₀, load)
    (8)  assign(l, C₁, P₁, load)

Algorithm assign proceeds as follows. If P′ contains only one processor, the recursion ends and all clusters in C′ are assigned to this processor (line 1). Else the set of processors is divided into two halves P₀ and P₁ (line 2). Line 3 then computes the load that has already been assigned to the two processor sets (on level l) and line 4 computes the total load that is available on level l. Now the cluster set C′ must be divided into two halves C₀ and C₁ such that the elements on level l are equally distributed over the corresponding processor sets P₀ and P₁ (lines 5 and 6). Note that C₀ and C₁ are not required to contain the same number of clusters. Finally lines 7 and 8 contain the recursive calls that subdivide the new cluster sets again.
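A sketch of the recursive bisection of Algorithm 5.3 with a coordinate ordering as the splitting heuristic (the first variant discussed below). The prefix cut minimizing the imbalance of line 6 is found by a linear scan, and pre-existing loads enter the target exactly as in the listing; all names are illustrative.

    def assign(clusters, procs, eps, key, load, proc_of):
        # clusters: cluster ids; procs: processor ids; eps[c]: level-l elements
        # of cluster c; key[c]: ordering coordinate (e.g. x of root element);
        # load[p]: elements already assigned to p on this level.
        if len(procs) == 1:
            for c in clusters:
                proc_of[c] = procs[0]
                load[procs[0]] += eps[c]
            return
        p0, p1 = procs[:len(procs) // 2], procs[len(procs) // 2:]
        total = sum(load[p] for p in procs) + sum(eps[c] for c in clusters)
        target = total * len(p0) / len(procs)      # balance target of line 6
        ordered = sorted(clusters, key=lambda c: key[c])
        run, best, k0 = sum(load[p] for p in p0), None, 0
        for k in range(len(ordered) + 1):          # best prefix cut position
            if best is None or abs(run - target) < best:
                best, k0 = abs(run - target), k
            if k < len(ordered):
                run += eps[ordered[k]]
        assign(ordered[:k0], p0, eps, key, load, proc_of)
        assign(ordered[k0:], p1, eps, key, load, proc_of)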

Several strategies are available to bisect the cluster set in lines 5 and 6 of the algorithm above.

Orthogonal coordinate bisection. In this variant each cluster is assigned a coordinate (x, y) by taking the center of mass of the root element of the cluster. Then the clusters are ordered by their x (or y) coordinates. A given position x_cut (respectively y_cut) now defines the bisection C₀ = { c ∈ C′ | x(c) ≤ x_cut } and C₁ = { c ∈ C′ | x(c) > x_cut } (or alternatively in the y direction). The cut position x_cut is determined such that the expression in line 6 of algorithm assign is minimized. The bisection directions are chosen in an alternating fashion.

Inertial bisection. This is a method similar to coordinate bisection, but using a rotated coordinate system derived from the inertial moments. For details see [11] and [12].

Spectral bisection. All subsequent methods do not require any coordinate information but use only graph connectivity information. As a graph we consider G = (C_l, E) with E ⊂ C_l × C_l given by NB_C, where C_l is the cluster subset constructed in algorithm mg assign and NB_C is the cluster neighbourship relation defined in Eq. (5.26). Note that the neighbourship relation is restricted to the subset C_l; the connectivity between the sets C_i and C_j, i ≠ j, is not considered. The spectral bisection method then derives the so-called Laplace matrix L from the graph as follows:

    (L)_ij = degree(c_i)  if i = j;  −1  if (c_i, c_j) ∈ E;  0  else.    (5.27)

L is symmetric and positive semi-definite. Now each cluster c_i ∈ C_l is assigned a number x_i and the x_i define the vector x = (x₁, ..., x_m). According to [13] the graph bisection problem can be formulated as:

    Minimize xᵀ L x under the constraints Σ_i x_i = 0, x_i ∈ {−1, +1}.    (5.28)

This optimization problem is now solved with x ∈ ℝ^m instead of the discrete constraint. If the graph is connected, the continuous optimization problem has the solution x = e₂, where e₂ is the eigenvector corresponding to the second smallest eigenvalue of L. The components of e₂ are now used to define the bisection by setting C₀ = { c_i ∈ C_l | (e₂)_i ≤ x_cut } and C₁ = { c_i ∈ C_l | (e₂)_i > x_cut } for a given x_cut. The position x_cut is chosen in order to minimize the expression in line 6 of algorithm assign.
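A compact sketch of spectral bisection for (5.27)/(5.28): build the graph Laplacian, compute the Fiedler vector, and cut along its ordering at a position that balances the weights. numpy's dense eigensolver stands in for the Lanczos iteration a production code would use.

    import numpy as np

    def spectral_bisect(n, edges, weight):
        # edges: undirected pairs (i, j), each listed once (relation NB_C);
        # weight[i]: level-l elements of cluster i.
        L = np.zeros((n, n))                      # Laplace matrix, Eq. (5.27)
        for i, j in edges:
            L[i, j] = L[j, i] = -1.0
            L[i, i] += 1.0
            L[j, j] += 1.0
        _, vecs = np.linalg.eigh(L)
        fiedler = vecs[:, 1]                      # 2nd smallest eigenvalue
        order = np.argsort(fiedler)               # sort clusters by component
        half, acc, part0 = 0.5 * sum(weight), 0.0, set()
        for i in order:                           # cut balancing the weights
            if acc < half:
                part0.add(int(i))
                acc += weight[i]
        return part0, set(range(n)) - part0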

Kernighan-Lin bisection. In this method one starts with a random bisection into C₀ and C₁ that minimizes the expression in line 6 of algorithm assign. Then subsets of C₀ and C₁ are swapped repeatedly until a local minimum of the number of edges connecting C₀ and C₁ is found. Note that each swapping step does not change the load balance. By using the output of any of the other partitioning schemes as a starting partitioning instead of the random one, an improved method can often be obtained. For details we refer to [14], [12], [10].

Multi-level bisection. This method tries to apply ideas from multigrid to the solution of the graph bisection problem. This method is especially useful if the graph to be partitioned is very large. In the so-called coarsening phase neighbouring nodes of the given graph are assembled into clusters (since our nodes are already clusters, we have clusters of clusters now). These clusters, together with edges defined in the canonical way, form a new coarser graph which is coarsened repeatedly until a given minimal size is reached. For the coarsest graph a high quality bisection is determined using e.g. spectral bisection. Then the result is interpolated to the next finer graph in the canonical way. On the finer level the Kernighan-Lin heuristic is used to improve the interpolated coarse grid solution. This process is repeated until the finest level is reached. For details we refer to [12], [15].

All methods except coordinate bisection have been implemented in the load balancing software CHACO by HENDRICKSON and LELAND, see [12]. CHACO has been adapted to our code so that it can be used as algorithm assign above, for details see [16]. The load balancing scheme discussed in this section has been designed for the standard multiplicative multigrid algorithm. For the additive variants, like BPX (see [7]), load balancing can be done differently since the synchronization behavior is different. For a solution of the load balancing problem for additive multigrid (including local refinement) we refer to [10]. Finally we remark that the load balancing scheme can be extended immediately to the three-dimensional situation.

5.4. A simple parallel model

In this section we derive a simple model for parallel efficiency that shows the influence of the different load balancing schemes. Suppose we have a load balancing scheme A and that parallel efficiency is determined by the following formula:

    E^A = T / ( P · ( T_s + (T − T_s)/P + C_A ) ),    (5.29)

where T is the time needed by one processor, T_s is the serial part of the algorithm, the rest T − T_s is perfectly parallelizable, and C_A are the communication costs. In more detail, C_A is only that part of the communication costs that can be influenced by the load balancing scheme; e.g. message setup time would not be part of C_A, since the number of messages is almost not influenced by the choice of the load balancing scheme. Then we want to see how the efficiency is influenced when the interface length is reduced by a better load balancing scheme B. The efficiency for load balancing scheme B is modeled by

    E^B = T / ( P · ( T_s + (T − T_s)/P + γ C_A ) ),    (5.30)

where γ is a factor that describes the improvement in the communication costs. Typically the values of γ are in the range 0.5 to 1. After some algebraic manipulations we obtain the following formula that expresses the (better) efficiency E^B in terms of the (worse) efficiency E^A:

    E^B = E^A / ( 1 − (1 − γ)(1 − E^A) / (1 + σ) ),    (5.31)

where σ is the cost of the serial part relative to the communication costs:

    σ = T_s / C_A.    (5.32)

Formula (5.31) says how much the efficiency of load balancing scheme A is improved by shortening the interface length of the partitions. The improvement, however, depends also on E^A and the factor σ. E.g. if E^A is already close to one, then the variation of γ has not much influence. The same is true when the serial part is not negligible, i.e. when σ ≫ 1. Figure 5 shows an example with γ set to 0.5 and σ set to 0.0, 0.5, 1.0 and 2.0. It depicts the dependency of the efficiency E^B of the better load-balancing scheme B on the original efficiency E^A of the worse scheme A. In our case, where the efficiency of scheme A is already very good (about 0.9), the efficiency of B differs only slightly.

Figure 5. Dependency between the efficiencies of schemes A and B for relative serial parts σ = 0.0, 0.5, 1.0 and 2.0.
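The model (5.31) is cheap to evaluate; the snippet below reproduces the worked example quoted in Section 6.1.1 (E^A = 0.95, interfaces halved) and shows how a large relative serial part damps the gain.

    def improved_efficiency(e_a, gamma, sigma):
        # Eq. (5.31): efficiency of scheme B from the efficiency e_a of
        # scheme A, communication factor gamma, relative serial part sigma.
        return e_a / (1.0 - (1.0 - gamma) * (1.0 - e_a) / (1.0 + sigma))

    print(improved_efficiency(0.95, gamma=0.5, sigma=0.0))  # ~0.974
    print(improved_efficiency(0.95, gamma=0.5, sigma=2.0))  # ~0.958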

6. Numerical Results

6.1. Uniformly refined test case

The uniformly refined test case is evaluated with respect to the influence of the load-balancing scheme on the efficiency of our solution procedure. Therefore the equations of linear elasticity with the parameters given in section 2 are solved in a domain given by Fig. 6a, which is nonsymmetric and not singly connected. By choosing such a domain, which resembles a seal ring and which has mostly curved boundaries, we tried to eliminate initial advantages of specific load balancing schemes. Dirichlet boundary conditions have been used on the inner side of the ring and Neumann conditions on the outer side. The initial mesh consisting of 24 quadrilateral elements is shown in Fig. 6b. Part c depicts the first refinement stage with 96 elements.

Figure 6. a) problem domain, b) grid 0, c) grid 1.

6.1.1. Fixed-size problem. In this subsection we examine a problem with fixed size, i.e. the mesh from Fig. 6b has been refined uniformly 5 times, resulting in a hierarchy with 6 levels and 24576 elements on the finest level. This problem is then solved on 1, 4, 16 and 64 processors of the CRAY T3D. The parameters of the multigrid method were: V-cycle, one pre- and one post-smoothing step with the Block-Jacobi smoother and ILU as inexact inner solver. The parameters of the clustering algorithm cluster are chosen such that levels 0 and 1 are treated on processor 0.

Table 1 shows the results for various mapping schemes within algorithm assign. The various methods were: rcb (recursive coordinate bisection), rib (recursive inertial bisection), ribkl (recursive inertial bisection with Kernighan-Lin optimization), rsbkl (recursive spectral bisection with Kernighan-Lin optimization), mk50 (multi-level bisection with Kernighan-Lin optimization and a coarse graph size of 50 nodes), and mk200 (like mk50, but the coarse graph size was 200 nodes). All methods except rcb were taken from the CHACO library.

The column labeled E_it contains the efficiency for one multigrid iteration defined by E_it(P) = T_it(1) / (P · T_it(P)), where T_it(P) is the time per iteration on P processors. The column labeled E_sol gives the efficiency of the multigrid solver for obtaining a prescribed reduction in the Euclidean norm of the residual, i.e. this number includes also the convergence rate of the method (numerical efficiency). The column labeled E_ub shows an upper bound for E_it computed as: the total number of nodes on all levels in the serial calculation divided by P times the maximum number of nodes on all levels on one processor in the parallel calculation. The number E_ub therefore accounts for load imbalance and overlap, but not for communication costs and idle times. The column labeled Min/Max IF shows the minimum and maximum number of overlap nodes per partition on the finest level. The column labeled Min/Max Prod is similar to Min/Max IF, but each individual overlap node is weighted with the distance to the destination processor.

Table 1 shows that the efficiency per iteration, E_it, does not vary much with the different load balancing schemes. We explain this with the help of formula (5.31). In the case of very many unknowns per processor (P = 4), the efficiency E^A obtained with simple recursive coordinate bisection is already very good and cannot be improved much by reducing the interface length. E.g. if E^A in formula (5.31) is already 0.95 and the interface length is halved (γ = 0.5), the efficiency will be 0.974 according to this simple model. In the case of very few unknowns per processor (P = 64), the serial part of the algorithm due to the coarse grids is not negligible compared to the communication costs, i.e. σ ≫ 1 weakens the influence of γ in formula (5.31).

A detailed look at Table 1 reveals practically no correlation of interface length and measured parallel efficiency E_it. E.g. for P = 4, ribkl has a much smaller interface length than rib and also E_ub is better for ribkl, but the measured efficiency E_it is better for the rib method. An additional view on the mapping data reveals that all schemes showing an efficiency of approx. 92% in the case of P = 4 kept the maximum workload on each level on the same processor. Thus idle times were hidden by computation. This is not the case when the maximum workload is mapped to different processors on different levels. In the latter case only 87% efficiency is achieved. Furthermore, cache usage may be influenced by the shape of the partitions as shown in [10].

More important than the efficiency of a single iteration step is the performance of the complete solution process in order to obtain a fixed accuracy. Figure 7 shows the results of three different load balancing schemes: rcb (a) takes 11 iterations to complete; rib (b) shows the best parallel efficiency but needs the largest number of iterations, therefore it achieves the worst numerical efficiency E_sol. The best numerical efficiency is measured with rsbkl (c), which needs the least number of iterations. In general the more expensive load balancing algorithms yield a lower iteration count. The variation of the number of iterations is due to the block Jacobi smoothing, where the number, length and position of the partition interfaces possess significant influence.

Table 1. Results for the uniformly refined, fixed-size test case.

P    LB      # It.   E_it   E_sol   E_ub   Min/Max IF   Min/Max Prod
1    -       7       -      -       -      -            -
4    rcb     11      91.9   58.5    96.6   76 / 266     76 / 266
4    rib     9       91.9   71.5    97.7   84 / 255     89 / 260
4    ribkl   8       86.7   75.9    98.2   66 / 140     66 / 140
4    rsbkl   8       88.3   77.3    98.4   66 / 158     66 / 158
4    mk50    9       92.3   71.8    96.5   76 / 266     76 / 266
4    mk200   8       87.0   76.1    97.0   66 / 134     66 / 134
16   rcb     11      81.6   51.9    88.6   72 / 221     72 / 299
16   rib     13      82.7   44.6    89.2   70 / 212     70 / 323
16   ribkl   9       76.4   59.4    89.9   66 / 164     99 / 362
16   rsbkl   9       78.1   60.7    89.5   66 / 160     66 / 294
16   mk50    10      75.1   52.6    88.8   72 / 221     72 / 299
16   mk200   10      78.8   55.1    89.4   66 / 161     66 / 304
64   rcb     13      57.9   31.2    67.6   62 / 179     63 / 245
64   rib     13      57.6   31.0    69.4   62 / 119     69 / 315
64   ribkl   11      55.9   35.6    68.9   59 / 94      65 / 282
64   rsbkl   11      56.0   35.6    69.1   60 / 116     65 / 390
64   mk50    12      54.0   31.5    67.8   62 / 179     63 / 245
64   mk200   12      54.4   31.7    69.4   60 / 96      63 / 384

6.1.2. Scaled-size problem. In this section the same problem as in the previous section is solved, but the problem size per processor remains constant. This means that with a fourfold increase in the number of processors the multigrid hierarchy is extended by one level. Each processor is assigned 24576 elements on the finest level, leading to a total of 6291456 elements and 12601344 degrees of freedom on the finest level in the case of 256 processors.

Figure 7. Partitions obtained with a) rcb, b) rib, c) rsbkl.

The clustering parameters are set such that levels 0, 1 and 2 are always treated on one processor, whereas the remaining levels are divided into 1536 clusters distributed among all processors. Experiments with a progressively smaller number of processors for the coarse grids showed no improvement in the parallel efficiency.

Table 2. Results for the uniformly refined, scaled-size test case.

P    LB      # It.   E_it   E_sol   E_ub   Min/Max IF   Min/Max Prod
1    -       7       -      -       -      -            -
4    rcb     9       89.6   69.7    97.7   146 / 538    146 / 538
4    rib     9       89.9   69.9    98.2   146 / 498    146 / 498
4    ribkl   8       92.2   80.6    98.6   130 / 266    130 / 266
4    rsbkl   8       92.3   80.9    97.7   130 / 290    195 / 355
16   rcb     11      87.7   55.9    96.9   290 / 869    290 / 1177
16   rib     11      87.2   55.6    97.2   290 / 820    290 / 1239
16   ribkl   9       88.6   69.0    97.2   258 / 692    378 / 1418
16   rsbkl   10      88.1   61.7    97.2   258 / 629    258 / 1161
64   rcb     12      85.4   49.8    96.0   452 / 1286   483 / 2025
64   rib     11      86.0   54.8    96.5   484 / 903    485 / 2349
64   ribkl   10      86.0   60.2    96.5   451 / 774    455 / 2131
64   rsbkl   10      86.0   60.2    96.7   451 / 969    453 / 3555
256  rcb     12      84.5   49.3    96.1   451 / 1158   451 / 2326
256  rib     12      84.4   49.3    96.5   451 / 1033   451 / 6446
256  ribkl   13      84.8   45.7    96.4   451 / 903    453 / 7058
256  rsbkl   12      84.9   49.5    96.5   451 / 903    455 / 2650

Table 2 shows again E_it and E_sol as the most important results. Scaled efficiency per multigrid iteration reaches 85% on 256 processors due to the fast communication of the Cray T3D and the low surface to volume ratio. For the smaller processor numbers a good correlation of interface length and efficiency per iteration is visible. For increasing numbers of processors the variation in interface length becomes smaller, since the number of clusters per processor decreases. In the 256 processor case only 6 clusters are assigned to one processor. Using a smaller cluster depth would lead to an increasing number of clusters, yielding better partitionings on the one hand but additional inter-grid communication on the other hand.

In practice the best efficiency has been obtained with the above-mentioned parameters.

6.2. Adaptively refined test case

In order to investigate the parallel performance in the case of adaptive local grid refinement, the problem in Fig. 8 has been solved. The domain with 16 reentrant corners was chosen not to test the efficiency of the refinement indicator but the efficiency of the parallel multigrid solver on a hierarchically refined set of grids. Dirichlet boundary conditions have been applied at Γ₀ and Neumann boundary conditions at Γ₁. The refinement is concentrated at the reentrant corners. A locally refined mesh is shown in Fig. 9. Again a fixed-size and a scaled-size computation are presented.

Figure 8. Problem setup for the adaptively refined example.

Figure 9. Adaptively refined mesh.

6.2.3. Fixed-size problem. The mesh for the fixed-size calculation contained 23314 nodes (46628 degrees of freedom) and 7 grid levels. Table 3 shows the results for parallel and numerical efficiency.

In the 64 processor calculation each computing node was assigned only about 200 elements on the finest level. The decrease in parallel efficiency compared to the uniformly refined case is due to the following reasons: In contrast to the uniformly refined case we have no geometric growth in the number of nodes per level. The numbers of nodes on levels 3, 4, 5 and 6 were 3282, 8523, 18444 and 17249 in this example. This leads to a worse calculation to communication ratio. Algorithm mg assign requires that clusters with different top levels are balanced separately (cluster sets C_l). This leads to elements on the same grid level being assigned to processors in several independent steps, which leads in turn to higher communication overhead and idle times.

Table 3. Parallel adaptive problem.

P    LB      # It.   E_it   E_sol   E_ub   Min/Max IF   Min/Max Prod
1    -       16      -      -       -      -            -
4    rcb     16      88.6   88.6    97.4   25 / 28      25 / 28
4    rib     18      86.2   76.5    96.2   46 / 46      46 / 46
4    ribkl   18      79.7   70.8    96.2   20 / 20      20 / 20
4    rsbkl   17      76.9   72.3    91.5   22 / 50      22 / 70
16   rcb     20      58.0   46.4    85.5   63 / 131     63 / 140
16   rib     19      55.8   47.0    85.0   13 / 56      13 / 104
16   ribkl   19      53.6   45.1    82.9   6 / 47       6 / 91
16   rsbkl   19      54.5   45.9    81.2   8 / 60       8 / 161
64   rcb     22      24.7   18.0    65.8   30 / 111     32 / 138
64   rib     20      24.1   19.3    61.9   22 / 84      22 / 184
64   ribkl   20      24.5   19.6    63.2   8 / 69       8 / 155
64   rsbkl   20      21.6   17.3    53.3   16 / 81      16 / 238

6.2.4. Scaled-size problem. The same problem as in section 6.2.3 is solved, but now the number of nodes per processor was held approximately constant by varying the tolerance in the error indicator. Since the efficiency results did not depend much on the load balancing scheme, only rcb was used for this experiment. Table 4 shows E_it and E_sol on up to 64 processors. In order to judge the costs of the load balancing procedure, the execution times for the mapping (T_map) and for the load migration (T_mig) in the last balancing step, as well as the multigrid solution time (T_sol), have been included in Table 4.

Table 4. Scaled parallel adaptive problem.

P    Unknowns   # It.   T_sol   T_map   T_mig   E_it   E_sol   E_ub
1    23,314     16      49.9    -       -       -      -       -
4    96,538     16      51.2    0.66    1.15    92.4   92.4    97.6
16   299,587    18      49.1    3.57    4.52    80.7   71.7    93.6
64   691,309    19      29.5    3.36    2.50    65.4   55.1    85.4

7. Conclusions

In this paper we showed that adaptive multigrid methods can be effectively implemented on modern parallel computer architectures and applied to linear elasticity calculations. The efficiencies per multigrid iteration reached 85% on 256 processors of the CRAY T3D for a uniformly refined test case and 65% on 64 processors for an adaptively refined test case.

The problem of dynamic load balancing has been discussed in detail and a central scheme with a clustering strategy based on the mesh hierarchy has been proposed. Within this algorithm standard graph partitioning algorithms are used to partition special subsets of the clusters. A public domain library (CHACO) has been adapted to our code in order to compare different partitioning schemes. It has been found that parallel efficiency is only slightly influenced by the different schemes because of the following reasons: For large problems computation time dominates completely; for small problems the coarse grids in the multigrid process are a non-negligible serial part. In both cases the influence of partition interface length on efficiency is only weak. Mapping the maximum workload onto different processors on different levels prevents some of the communication time from being hidden behind computation time. Cache hit rates and therefore effective computation speed may depend on the shape of the partitions. On the other hand, it has been found that the number of multigrid iterations to reach a certain accuracy may vary greatly with the different load balancing schemes. Obviously the quality of the block Jacobi smoother depends on the number, shape and position of the partitions, especially if the elements are not isotropic. The investigations show that many competing effects influence the observed numerical efficiency, and it is felt that partition interface length, the standard measure for load balancing algorithms, is not the most important one. Most of the effects are problem and/or machine dependent and therefore difficult to consider in an improved balancing scheme. Further work will include the extension to nonlinear material laws and the construction of new load balancing schemes that also take the synchronization between multigrid levels and the load migration time into account.

Acknowledgements

The authors would like to thank the Konrad-Zuse-Center Berlin (ZIB) for the possibility of using their 256 processor CRAY T3D. Also the use of the CHACO load balancing library is gratefully acknowledged.

REFERENCES

1. K. Eriksson, D. Estep, P. Hansbo, and C. Johnson, Introduction to adaptive methods for differential equations, Acta Numerica, (1995).
2. H. Yserentant, Old and new convergence proofs for multigrid methods, Acta Numerica, (1993).
3. R. Bank, PLTMG Users Guide Version 6.0, SIAM, Philadelphia, 1990.
4. P. Deuflhard, P. Leinen, and H. Yserentant, Concepts of an adaptive hierarchical finite element code, IMPACT of Computing in Science and Engineering, 1, 3–35, (1989).

5. D. Braess, Finite Elemente, Springer, 1991.
6. T. J. R. Hughes, The Finite Element Method, Prentice Hall, 1987.
7. J. H. Bramble, J. E. Pasciak, J. Wang, and J. Xu, Parallel multilevel preconditioners, Math. Comp., 55, 1–22, (1990).
8. H. Yserentant, Two preconditioners based on the multi-level splitting of finite element spaces, Numer. Math., 58, 163–184, (1990).
9. G. Wittum, On the robustness of ILU smoothing, SIAM J. Sci. Statist. Comput., 10, 699–717, (1989).
10. P. Bastian, Parallele adaptive Mehrgitterverfahren, Teubner Skripten zur Numerik, Teubner-Verlag, 1996.
11. B. Nour-Omid, A. Raefsky, and G. Lyzenga, Solving finite-element equations on concurrent computers, in Parallel computations and their impact on mechanics, ed., A. K. Noor, 209–227, American Soc. Mech. Eng., New York, (1986).
12. B. Hendrickson and R. Leland, The Chaco user's guide version 1.0, Technical Report SAND93-2339, Sandia National Laboratories, (October 1993).
13. B. Hendrickson and R. Leland, Multidimensional spectral load balancing, Technical Report SAND93-0074, Sandia National Laboratories, (January 1993).
14. B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, The Bell System Technical Journal, 49, 291–307, (1970).
15. G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, Technical Report 95-035, University of Minnesota, Department of Computer Science, (1995).
16. S. Lang, Lastverteilung für parallele adaptive Mehrgitterberechnungen, Master's thesis, Universität Erlangen-Nürnberg, IMMD III, 1994.