A Prefix Code Matching Parallel Load-Balancing Method for Solution-Adaptive Unstructured Finite Element Graphs on Distributed Memory Multicomputers


The Journal of Supercomputing, 15, Kluwer Academic Publishers. Manufactured in The Netherlands.

A Prefix Code Matching Parallel Load-Balancing Method for Solution-Adaptive Unstructured Finite Element Graphs on Distributed Memory Multicomputers

YEH-CHING CHUNG, CHING-JUNG LIAO, DON-LIN YANG {ychung, cjliao, dlyang}@ecs.fcu.edu.tw
Department of Information Engineering, Feng Chia University, Taichung, Taiwan 407, ROC

Received February 18, 1998; final version accepted November 13.

Abstract. In this paper, we propose a prefix code matching parallel load-balancing method (PCMPLB) to efficiently deal with the load imbalance of solution-adaptive finite element application programs on distributed memory multicomputers. The main idea of the PCMPLB method is first to construct a prefix code tree for processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached. We have implemented the PCMPLB method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method and the multilevel diffusion method, and five mapping methods, the AE/ORB method, the AE/MC method, the MLkP method, the PARTY library method, and the JOSTLE-MS method. An unstructured finite element graph, Truss, was used as a test sample. During the execution, Truss was refined five times. Three criteria, the execution time of mapping/load-balancing methods, the execution time of an application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for an application program, are used for the performance evaluation. The experimental results show that
(1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is less than that of the mapping method; (2) the execution time of an application program under the PCMPLB method is less than that of the directed diffusion method and the multilevel diffusion method.

Keywords: distributed memory multicomputers, partitioning, mapping, remapping, load balancing, solution-adaptive unstructured finite element models

1. Introduction

The finite element method is widely used for the structural modeling of physical systems [27]. To solve a problem using the finite element method, in general, we need to establish the finite element graph of the problem. A finite element graph is a connected and undirected graph that consists of a number of finite elements. Each finite element is composed of a number of nodes. The number of nodes of a finite element is determined by an application. In Figure 1, an example of a 21-node finite element graph consisting of 24 finite elements is shown. Due to the

properties of computation-intensiveness and computation-locality, it is very attractive to implement the finite element method on distributed memory multicomputers [1, 41-42].

Figure 1. An example of a 21-node finite element graph with 24 finite elements (the circled and uncircled numbers denote the node numbers and finite element numbers, respectively).

In the context of parallelizing a finite element application program that uses iterative techniques to solve a system of equations [2-3], a parallel program may be viewed as a collection of tasks represented by nodes of a finite element graph. Each node represents a particular amount of computation and can be executed independently. To efficiently execute a finite element application program on a distributed memory multicomputer, we need to map nodes of the corresponding graph to processors of the multicomputer such that each processor has approximately the same amount of computational load and the communication among processors is minimized. Since this mapping problem is known to be NP-complete [13], many heuristic methods were proposed to find satisfactory suboptimal solutions [4-6, 8, 10-11, 14-15, 17-18, 24-26, 29, 31, 33, 35, 41, 43-44]. If the number of nodes of a finite element graph does not increase during the execution of a finite element application program, the mapping algorithm only needs to be performed once. For a solution-adaptive finite element application program, the number of nodes increases discretely due to the refinement of some finite elements during the execution. This may result in load imbalance of processors. A node remapping or a load-balancing algorithm has to be performed many times in order to balance the computational load of processors while keeping the communication cost among processors as low as possible. For the node remapping approach, some mapping algorithms can be used to partition a finite element graph from scratch. For the load-balancing approach, some load-balancing algorithms can be used to perform the load balancing process according to the current load of processors.
Since node remapping or load-balancing algorithms are performed at run-time, their execution must be fast and efficient.

In this paper, we propose a prefix code matching parallel load-balancing (PCMPLB) method to efficiently deal with the load imbalance of solution-adaptive finite element application programs on distributed memory multicomputers. In a finite element graph, the value of a node i depends on the values of node i and of the other nodes that are in the same finite elements as node i. For example, in Figure 1, nodes 3, 5, and 6 form element 3, nodes 5, 6, and 9 form element 5, and nodes 6, 9, and 10 form element 8. The value of node 6 depends on the values of node 6, node 3, node 5, node 9, and node 10. When nodes of a solution-adaptive finite element graph are evenly distributed to processors by some mapping algorithm, two processors, P_i and P_j, need to communicate with each other if two nodes in the same finite element are mapped to P_i and P_j, respectively. According to this communication property of a partitioned finite element graph, we can obtain a processor graph from the mapping. In a processor graph, nodes represent the processors and edges represent the communication needed among processors. The weights associated with nodes and edges denote the computation and the communication costs, respectively. Figure 2(a) shows a mapping of Figure 1 on 7 processors. The corresponding processor graph of Figure 2(a) is shown in Figure 2(b). When a finite element graph is refined at run-time, load imbalance among processors may result. To balance the computational load of processors, the PCMPLB method first constructs a prefix code tree for processors according to the processor graph, where the leaves of the prefix code tree are processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached.
Figure 2. (a) A mapping of Figure 1 on 7 processors. (b) The corresponding processor graph of Figure 2(a).

We have implemented this method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method [9, 44] and the multilevel diffusion method [37-38], and five mapping methods, the AE/ORB method [4, 8, 29, 48], the AE/MC method [8, 10, 12, 14, 26], the MLkP method [24], the PARTY library method [35], and the JOSTLE-MS method. An unstructured finite element graph Truss was used as the test sample. During the execution, Truss was refined five times. Three criteria, the execution time of mapping/load-balancing methods, the execution time of an application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for an application program, are used for the performance evaluation. The experimental results show that (1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is less than that of the mapping method; (2) the execution time of an application program under the PCMPLB method is less than that of the directed diffusion method and the multilevel diffusion method.

The paper is organized as follows. The related work is given in Section 2. In Section 3, the proposed prefix code matching parallel load-balancing method is described in detail. In Section 4, we present the cost model of mapping/load-balancing methods for unstructured finite element graphs on distributed memory multicomputers. The performance evaluation and simulation results are also presented in this section.

2. Related work

Many methods have been proposed in the literature to deal with the load imbalance problems of solution-adaptive finite element graphs on distributed memory multicomputers. They can be classified into two categories, the remapping method and the load redistribution method. For the remapping method, many finite element graph mapping methods can be used as remapping methods. In general, they can be divided into five classes, the orthogonal section approach [4, 8, 29, 48], the min-cut approach [8, 10, 12, 14, 26], the spectral approach [5-6, 41], the multilevel approach [5-6, 19, 24-25], and others [15, 31, 34, 41].
These methods were implemented in several graph partition libraries, such as Chaco [17], DIME [48], JOSTLE [45], METIS [24-25], PARTY [35], Scotch [33], and TOP/DOMDEC [11], to solve graph partition problems. Since our main focus is on load-balancing methods, we do not describe these mapping methods here. For the load redistribution method, many load-balancing algorithms can be used as load redistribution methods. In [46], a recent comparison study of dynamic load balancing strategies on highly parallel computers is given. The dimension exchange method (DEM) is applied to application programs without geometric structure [9]. It is conceptually designed for a hypercube system but may be applied to other topologies, such as k-ary n-cubes [51]. Ou and Ranka [32] proposed a linear programming-based method to solve the incremental graph partition problem. Wu [39, 49] proposed the tree walking, the cube walking, and the mesh walking run-time scheduling algorithms to balance the load of processors on tree-based, cube-based, and mesh-based paradigms, respectively.
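The dimension exchange idea can be sketched in a few lines. The code below is an illustrative sketch (not from the paper), assuming a hypercube of 2^d processors whose loads are modeled as simple numbers:

```python
# A minimal sketch of the dimension exchange method (DEM) on a
# hypercube: in each of the log2(P) dimensions, every processor pairs
# with its neighbor across that dimension and the pair averages its
# load. Loads are floats here for simplicity.

def dimension_exchange(loads):
    p = len(loads)                  # assumed to be a power of two
    loads = list(loads)
    dims = p.bit_length() - 1       # log2(p) dimensions
    for d in range(dims):
        for i in range(p):
            j = i ^ (1 << d)        # neighbor across dimension d
            if i < j:               # balance each pair exactly once
                avg = (loads[i] + loads[j]) / 2
                loads[i] = loads[j] = avg
    return loads
```

After sweeping all dimensions, every processor holds the global average, which is why DEM needs no geometric information about the application.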

Hu and Blake [22] proposed a directed diffusion method that computes the diffusion solution by using an unsteady heat conduction equation while optimally minimizing the Euclidean norm of the data movement. They proved that the diffusion solution could be found by solving a linear equation. The diffusion solution λ is a vector with n elements. For any two adjacent partitions i and j, if λ_i − λ_j is positive, then partition i needs to send λ_i − λ_j nodes to partition j; otherwise, partition j needs to send λ_j − λ_i nodes to partition i. Heirich and Taylor [16] proposed a direct diffusive load balancing method for scalable multicomputers. They derived a reliable and scalable load balancing method based on properties of the parabolic heat equation ∂u/∂t − ∇²u = 0. Horton [20] proposed a multilevel diffusion method by recursively bisecting a communication graph into two subgraphs and balancing the load of processors in the two subgraphs. In each bisection, the two subgraphs are connected by one or more edges and the difference in the number of processors in the two subgraphs is less than or equal to 1. Schloegel et al. [38] also proposed a multilevel diffusion scheme to construct a new partition of the graph incrementally. It contains three phases, a coarsening phase, a multilevel diffusion phase, and a multilevel refinement phase. A parallel multilevel diffusion algorithm was also described in [37]. These algorithms perform diffusion in a multilevel framework and minimize data movement without compromising the edge-cut. Their methods also include parameterized heuristics to specifically optimize edge-cut, total data migration, and the maximum amount of data migrated in and out of any processor. Walshaw et al. [44] implemented a parallel partitioner and a directed diffusion repartitioner in JOSTLE [45]. The directed diffusion method is based on the diffusion solver proposed in [22]. It has two distinct phases, the load-balancing phase and the local refinement phase. In the load-balancing phase, the diffusion solution guides vertex migration to balance the load of processors.
In the local refinement phase, a local view of the graph guides vertex migration to decrease the number of cut-edges between processors. They also developed a multilevel diffusion repartitioner in JOSTLE. Oliker and Biswas [30] presented a novel method to dynamically balance the processor workloads with a global view. Several novel features of their framework were described, such as the dual graph repartition, the parallel mesh repartitioner, the optimal and heuristic remapping cost functions, the efficient data movement and refinement schemes, and the accurate metrics comparing the computational gain and the redistribution cost. They also developed generic metrics to model the remapping cost for multiprocessor systems.

3. The prefix code matching parallel load-balancing method

To deal with the load imbalance of a solution-adaptive finite element application program on a distributed memory multicomputer, a load-balancing algorithm needs to balance the load of processors and minimize the communication among processors. Since a load-balancing algorithm is performed at run-time, its execution must be fast and efficient. In this section, we will describe

a fast and efficient load-balancing method, the prefix code matching parallel load-balancing (PCMPLB) method, for solution-adaptive finite element application programs on distributed memory multicomputers in detail. The main idea of the PCMPLB method is first to construct a prefix code tree for processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached. Once a schedule is determined, a physical load transfer procedure can be carried out to minimize the communication overheads among processors. The PCMPLB method can be divided into the following four phases.

Phase 1: Obtain a processor graph G from the initial partition.
Phase 2: Construct a prefix code tree for processors in G.
Phase 3: Determine the load transfer sequence by using the matching theorem.
Phase 4: Perform the load transfer.

In the following, we will describe them in detail.

The processor graph

In a finite element graph, the value of a node i depends on the values of node i and of the other nodes that are in the same finite elements as node i. When nodes of a solution-adaptive finite element graph are distributed to processors by some mapping algorithm, two processors, P_i and P_j, need to communicate with each other if two nodes in the same finite element are mapped to P_i and P_j, respectively. According to this communication property of a partitioned finite element graph, we can obtain a processor graph from the mapping. In a processor graph, nodes represent the processors and edges represent the communication needed among processors. The weights associated with nodes and edges denote the computation and the communication costs, respectively. We now give an example to explain it.

Example 1. Figure 3 shows an example of a processor graph. Figure 3(a) shows an initial partition of a 100-node finite element graph on 10 processors by using the MLkP method. In Figure 3(a),
all processors are assigned 10 finite element nodes. After a refinement, the numbers of nodes assigned to processors P0, P1, P2, P3, P4, P5, P6, P7, P8, and P9 are 10, 11, 11, 12, 10, 19, 16, 13, 13, and 13, respectively, as shown in Figure 3(b). The corresponding processor graph of Figure 3(b) is shown in Figure 3(c).

The construction of a prefix code tree

Based on the processor graph, we can construct a prefix code tree. The algorithm for constructing a prefix code tree T_Prefix is based on Huffman's algorithm [23] and is given as follows.

Figure 3. An example of a processor graph. (a) The initial partitioned finite element graph. (b) The finite element graph after a refinement. (c) The corresponding processor graph obtained from (b).

Algorithm build_prefix_code_tree(G)
1  Let V be a set of |P| isolated vertices, where |P| is the number of processors in a processor graph G. Each vertex P_i in V is the root of a complete binary tree (of height 0) with weight w_i = 1.
2  While |V| > 1
3    Find a tree T_i in V with the smallest root weight w_i.
4    If there are two or more candidates, choose the one whose leaf nodes have the smallest degree in G.
5    Among the trees in V whose leaf nodes are adjacent to those of T_i, find a tree T_j with the smallest root weight w_j. If there are two or more candidates, choose the one whose leaf nodes have the smallest degree in G.
6    Create a new (complete binary) tree T* with root weight w* = w_i + w_j, having T_i and T_j as its left and right subtrees, respectively.
7    Place T* in V and delete T_i and T_j.
8  end while
end of build_prefix_code_tree

We now give an example to explain algorithm build_prefix_code_tree.

Example 2. An example of the step-by-step construction of a prefix code tree from the processor graph shown in Figure 3(c) is given in Figure 4. The degrees of processors P0, P1, P2, P3, P4, P5, P6, P7, P8, and P9 are 3, 3, 3, 6, 4, 2, 4, 4, 4, and 5, respectively. The initial configuration is shown in Figure 4(a). In the first iteration of lines 2-8 of algorithm build_prefix_code_tree, P5 has the smallest degree among those trees with root weight 1. P5 is selected as the candidate in line 4. In line 5, both P6 and P7 are adjacent to P5 with root weight 1. The degrees of P6 and P7 are the same. We choose P6 as the candidate because P6 has a smaller rank. P5 and P6 are then combined as a tree in line 6. After the execution of line 7, we obtain a new configuration as shown in Figure 4(b). In the second iteration, P0 is selected in line 4. P1 is selected in

line 5. After the execution of lines 6 and 7, we obtain a new configuration as shown in Figure 4(c). By continuing the iteration seven times, we can obtain Figures 4(d)-4(j).

Figure 4. A step-by-step construction of a prefix code tree from Figure 3(c).

Determine a load transfer sequence by using the matching theorem

Based on the prefix code tree and the processor graph, we can obtain a communication pattern graph.
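Algorithm build_prefix_code_tree can be rendered in executable form. The sketch below is illustrative rather than the paper's implementation; it assumes the processor graph is given as an adjacency dictionary, interprets "smallest degree of the leaf nodes" as the smallest minimum leaf degree, and uses processor rank as the final tie-break:

```python
# An illustrative sketch of build_prefix_code_tree. Each tree is a
# tuple (root_weight, set_of_leaf_processors, nested_pair_structure);
# the nested pairs encode left/right subtrees, i.e., the prefix codes.

def build_prefix_code_tree(adj):
    trees = [(1, {p}, p) for p in sorted(adj)]

    def min_leaf_degree(leaves):
        return min(len(adj[p]) for p in leaves)

    while len(trees) > 1:
        # Lines 3-4: smallest root weight, ties broken by leaf degree,
        # then by smallest processor rank (an assumption).
        ti = min(trees, key=lambda t: (t[0], min_leaf_degree(t[1]), min(t[1])))
        # Line 5: among trees adjacent to ti in G, apply the same rule.
        adjacent = [t for t in trees
                    if t is not ti and any(q in t[1]
                                           for p in ti[1] for q in adj[p])]
        tj = min(adjacent, key=lambda t: (t[0], min_leaf_degree(t[1]), min(t[1])))
        # Lines 6-7: merge ti (left) and tj (right) into a new tree.
        trees.remove(ti)
        trees.remove(tj)
        trees.append((ti[0] + tj[0], ti[1] | tj[1], (ti[2], tj[2])))
    return trees[0][2]
```

On a 4-processor path graph 0-1-2-3, the sketch pairs the low-degree endpoints with their neighbors first and then joins the two subtrees, yielding the nested structure ((0, 1), (3, 2)).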

Definition 1. Given a processor graph G = (V, E) and a prefix code tree T_Prefix, the communication pattern graph G_c = (V_c, E_c) of G and T_Prefix is a subgraph of G. For every (P_i, P_j) ∈ E_c, P_i and P_j are in the left and the right subtrees of T_Prefix, respectively, and P_i, P_j ∈ V_c.

The communication pattern graph has several properties that can be used to determine the load transfer sequence.

Definition 2. A graph G = (V, E) is called bipartite if V = V_1 ∪ V_2 with V_1 ∩ V_2 = ∅, and every edge of G is of the form (a, b) with a ∈ V_1 and b ∈ V_2.

Theorem 1. A communication pattern graph G_c is a bipartite graph.

Proof: According to Definition 1, for every (P_i, P_j) ∈ E_c, P_i and P_j are in the left and the right subtrees of T_Prefix, respectively. Therefore G_c is a bipartite graph.

Definition 3. A subset M of E is called a matching in G if its elements are edges and no two of them are adjacent in G; the two ends of an edge in M are said to be matched under M. M is a maximum matching if G has no matching M' with |M'| > |M|.

Theorem 2. Let G = (V, E) be a bipartite graph with bipartition (V_1, V_2). Then G contains a matching that saturates every vertex in V_1 if and only if |N(S)| ≥ |S| for all S ⊆ V_1, where N(S) is the set of all neighbors of the vertices in S.

Proof: The proof can be found in [7].

Corollary 1. Let G_c = (V_c, E_c) be a communication pattern graph and let V_L and V_R be the sets of processors in the left and the right subtrees of T_Prefix, respectively, where V_L, V_R ⊆ V_c. Then we can find a maximum matching M from G_c such that for every element (P_i, P_j) ∈ M, P_i ∈ V_L and P_j ∈ V_R.

Proof: From Theorem 2 and the Hungarian method [7], we know that a maximum matching M from G_c can be found.

From the communication pattern graph, we can determine a load transfer sequence for processors in the left and the right subtrees of a prefix code tree by using the matching theorem to find a maximum matching among the edges of the communication pattern graph. Due to the construction process used in Phase 2, we can also obtain communication pattern graphs from the left and the right subtrees of a prefix code tree.
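The maximum matching required by Corollary 1 can be computed with the standard augmenting-path technique underlying the Hungarian method. The sketch below is illustrative; the function name and graph encoding are ours, not the paper's:

```python
# A minimal augmenting-path (Kuhn's algorithm) sketch for a maximum
# matching in a bipartite communication pattern graph. `edges` maps
# each left-subtree processor to its right-subtree neighbors.

def max_bipartite_matching(left, edges):
    match_right = {}                  # right vertex -> matched left vertex

    def try_augment(u, visited):
        for v in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in left:
        try_augment(u, set())
    return [(u, v) for v, u in match_right.items()]
```

Because the matched pairs are vertex-disjoint, all transfers of one scheduling step can proceed in parallel, which is exactly why the PCMPLB schedule is built from matchings.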
A load transfer sequence can be determined by concurrently and recursively performing the following steps,

Step 1. Divide a prefix code tree into two subtrees,
Step 2. Construct the corresponding communication pattern graph,

Step 3. Find a maximum matching for the communication pattern graph, and
Step 4. Determine the number of finite element nodes to be transferred among processors,

until a tree contains only one vertex. Assume that a processor graph has P processors and a refined finite element graph has N nodes. We define N/P as the average load of a processor. The load of a processor is defined as the number of finite element nodes assigned to it. Let load(P_i) and quota(P_i) represent the load and the average load of processor P_i, respectively. The algorithm to determine a load transfer sequence is given as follows.

Algorithm determine_load_transfer_sequence(G, T_Prefix)
1   Let ST be a set that contains the prefix code tree obtained in Phase 2, and let seq = 0;
2   while (|ST| < P) /* P is the number of processors in G */
3     for each T_Prefix ∈ ST, if (the number of vertices in T_Prefix > 1) then
4       Let T_L and T_R be the left and the right subtrees of T_Prefix, respectively;
5       Find the communication pattern graph G_c from the processor graph G and the prefix code tree T_Prefix;
6       Find a maximum matching M = {(P, Q) | P and Q are processors in T_L and T_R, respectively, and P and Q are adjacent in G} from G_c;
7       quota(T_R) = the sum of the average load of the processors in T_R;
8       load(T_R) = the sum of the load of the processors in T_R;
9       if (load(T_R) > quota(T_R)) then /* T_R sends nodes to T_L */
10        seq = seq + 1; LS_seq = ∅;
11        for each (P, Q) in M, do
12          m = (load(T_R) − quota(T_R)) / |M|;
13          if (load(Q) < m) then flag = 0 else flag = 1;
14          LS_seq = LS_seq ∪ {(Q, P, m, flag)};
15          load(Q) = load(Q) − m; load(P) = load(P) + m;
16        end do
17      /* T_R receives nodes from T_L */
18      else if (load(T_R) < quota(T_R)) then
19        seq = seq + 1; LS_seq = ∅;
20        for each (P, Q) in M, do
21          m = (quota(T_R) − load(T_R)) / |M|;
22          if (load(P) < m) then flag = 0 else flag = 1;
23          LS_seq = LS_seq ∪ {(P, Q, m, flag)};
24          load(P) = load(P) − m; load(Q) = load(Q) + m;
25        end do
26      end if
27      Place T_L and T_R in ST and delete T_Prefix from ST;
28    end for
29  end while

    /* Exception Handling */
30  if (there exists an element (SP, RP, m, 0) in LS_i, for some i = 1, ..., seq) then
31    For each processor P, reset load(P) to its initial value;
32    for (k = 1; k ≤ seq; k++)
33      For each element (SP, RP, m, flag) in LS_k, do
34        if (load(SP) < m) then flag = 0 else flag = 1;
35        if (flag == 1) then { load(SP) = load(SP) − m; load(RP) = load(RP) + m; }
36        else /* postpone this element to the end of the sequence */
37          { if ((SP, RP, m, flag) can be performed with the other elements in
38            LS_seq in parallel) then LS_seq = LS_seq ∪ {(SP, RP, m, flag)}
39          else { seq = seq + 1; LS_seq = {(SP, RP, m, flag)}; }
40          LS_k = LS_k − {(SP, RP, m, flag)}; }
41        end if
42      end do
43    end for
44  end if
end of determine_load_transfer_sequence

In the above description, lines 2-29 construct a load transfer sequence, LS_1, LS_2, ..., LS_seq, according to the processor graph, the prefix code tree, and the matching theorem. Each step LS_i in a load transfer sequence may contain more than one element; the load transfer process for each processor pair in LS_i can be performed in parallel. Lines 30-44 form an exception handling process. For an element (SP, RP, m, flag) in LS_i, it is possible that load(SP) < m. In this case, if we perform the load transfer process, processor SP fails to send the desired nodes to processor RP. To avoid this situation, in the exception handling process, we postpone the execution of all elements (SP, RP, m, 0) in LS_i by moving them to the end of the sequence. We have the following theorem.

Theorem 3. By executing the load transfer sequence obtained from algorithm determine_load_transfer_sequence, the load of processors can be fully balanced.

Proof: Assume that a processor graph has P processors and a refined finite element graph has N nodes, where N is divisible by P. According to lines 2-29, we know that a load transfer sequence, LS_1, LS_2, ..., LS_seq, can be generated. Each LS_i is generated by lines 9-26, where i = 1, ..., seq. For a processor SP (RP) in an element (SP, RP, m, flag) of LS_1, LS_2, ..., LS_seq, it needs to send (receive) m finite element nodes to (from) processor RP (SP).
Due to the method used to determine the load transfer numbers in lines 9-26, after a sequence of send and receive operations, the number of nodes in each processor will be equal to N/P. We have the following two cases.

Case 1. If the value of flag of each element (SP, RP, m, flag) in LS_1, LS_2, ..., LS_seq is equal to 1, that is, the load of SP is always greater than or equal to m,

then the load of processors can be balanced after executing the load transfer sequence produced by lines 2-29. In this case, the load transfer sequence is obtained according to the communication pattern graphs and the maximum matchings. The value of seq is equal to log P.

Case 2. If there exist some elements (SP, RP, m, 0) in LS_1, LS_2, ..., LS_seq, then a new load transfer sequence, LS'_1, LS'_2, ..., LS'_seq', is produced by lines 30-44. Since lines 30-44 only alter the execution order of the elements in LS_1, LS_2, ..., LS_seq, the load transfer pairs and the numbers of finite element nodes to be transferred between the processors of a load transfer pair in LS'_1, ..., LS'_seq' are the same as those in LS_1, ..., LS_seq. Therefore, if we can claim that the value of flag of each element (SP, RP, m, flag) in LS'_1, LS'_2, ..., LS'_seq' is equal to 1, then the load of processors can be balanced after the load transfer sequence LS'_1, LS'_2, ..., LS'_seq' is performed. Let CG = (V, E) be a communication graph, where V is the set of processors that appear in the first two positions of the elements (SP, RP, m, flag) in LS'_1, LS'_2, ..., LS'_seq', and E is the set of arcs that represent the send-receive relations of SP and RP for the elements (SP, RP, m, flag) in LS'_1, LS'_2, ..., LS'_seq'. Obviously, CG is a digraph. Assume that there exist some elements (SP, RP, m, flag) in LS'_1, LS'_2, ..., LS'_seq' in which the values of flag are equal to 0. According to lines 30-44, these elements are in LS'_k, ..., LS'_seq', where k > 1. When we perform load transfer according to LS'_1, LS'_2, ..., LS'_seq', whenever a processor sends nodes to another processor, we remove the corresponding send-receive relation from CG. After steps LS'_1, ..., LS'_{k-1} are performed, we have a communication graph CG' that corresponds to the send-receive relations of the elements (SP, RP, m, flag) in LS'_k, ..., LS'_seq'. Since the value of flag for each element (SP, RP, m, flag) in LS'_k, ..., LS'_seq' is equal to 0, each processor in CG' needs to receive nodes from other processors of CG' in order to balance its load.
Therefore, the in-degree of each vertex in CG' is greater than or equal to 1, which implies that CG' contains directed cycles. However, the construction of a load transfer sequence in lines 2-29 is based on the bipartite graphs and the maximum matchings, so CG contains no cycles. Since CG' is a subgraph of CG, it implies that CG' contains no cycles either. This contradicts our assumption. Therefore, the value of flag for each element (SP, RP, m, flag) in LS'_1, LS'_2, ..., LS'_seq' is equal to 1. In this case, the load transfer sequence is obtained according to lines 2-29 and lines 30-44. The number of elements (SP, RP, m, flag) that can be generated in lines 2-29 is less than or equal to (P log P)/2. Therefore, the number of steps of the load transfer sequence generated by lines 30-44 is less than or equal to (P log P)/2, i.e., seq' ≤ (P log P)/2.

We now give an example to explain the behavior of algorithm determine_load_transfer_sequence.

Example 3. Figure 5 shows, step by step, the communication pattern graphs and the corresponding maximum matchings for the example shown in Figures 3 and 4 when performing algorithm determine_load_transfer_sequence. Figure 5(a) shows the communication pattern graph for the prefix code tree with root at level 1. In

Figure 5. The matchings of the communication pattern graphs. (a) The first matching. (b) The second matching. (c) The third matching. (d) The fourth matching.

Figure 5(a), an arrow is an element of a matching. The number associated with an arrow denotes the number of finite element nodes that a processor needs to send to another processor. In this case, T_L = {P5, P6, P7, P9}, T_R = {P0, P1, P2, P3, P4, P8}, load(T_L) = 61, load(T_R) = 67, quota(T_L) = 51, and quota(T_R) = 77. According to line 20 of algorithm determine_load_transfer_sequence, the processors in T_R need to receive 10 nodes from the processors in T_L. From the communication pattern graph and the matching theorem, we have LS_1 = {(P6, P8, 4, 1), (P7, P4, 3, 1), (P9, P3, 3, 1)}. Figures 5(b) to 5(d) show the communication pattern graphs and the corresponding maximum matchings for the prefix code trees with roots at levels 2, 3, and 4, respectively. After the execution of lines 2-29 of algorithm determine_load_transfer_sequence, we have the following load transfer sequence, in which the processor pairs of LS_2 to LS_4 are those matched in Figures 5(b) to 5(d):

LS_1 = {(P6, P8, 4, 1), (P7, P4, 3, 1), (P9, P3, 3, 1)};
LS_2 = {(·, ·, 3, 1), (·, ·, 2, 1), (·, ·, 3, 1), (·, ·, 4, 1)};
LS_3 = {(·, ·, 3, 1), (·, ·, 2, 1), (·, ·, 1, 1)};
LS_4 = {(·, ·, 2, 1)}.

Since the value of flag of each element (SP, RP, m, flag) in LS_1, ..., LS_4 is equal to 1, the execution of algorithm determine_load_transfer_sequence is terminated.
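The arithmetic of one scheduling step can be checked with a short script. The sketch below reproduces the first matching of Example 3 from the Figure 3(b) loads; splitting the deficit evenly over the matched pairs and giving the remainder to the first pairs is our assumption about how the per-pair amounts 4, 3, 3 arise:

```python
# One step of the schedule (in the spirit of lines 18-26 of
# determine_load_transfer_sequence): the under-loaded subtree's deficit
# is divided over the matched (sender, receiver) pairs.

load = {0: 10, 1: 11, 2: 11, 3: 12, 4: 10, 5: 19, 6: 16, 7: 13, 8: 13, 9: 13}
quota = sum(load.values()) / len(load)        # 12.8 nodes per processor

T_L, T_R = {5, 6, 7, 9}, {0, 1, 2, 3, 4, 8}   # subtrees of the prefix tree
matching = [(6, 8), (7, 4), (9, 3)]           # matched (sender, receiver)

deficit = round(len(T_R) * quota) - sum(load[p] for p in T_R)  # 77 - 67 = 10
base, extra = divmod(deficit, len(matching))  # 3 per pair, 1 left over

step = []
for i, (sp, rp) in enumerate(matching):
    m = base + (1 if i < extra else 0)
    flag = 0 if load[sp] < m else 1           # flag 0 would be postponed
    step.append((sp, rp, m, flag))
    load[sp] -= m
    load[rp] += m
```

Running the step yields exactly the paper's LS_1, and the updated loads feed into the next level's matchings.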

Table 1. The load of each processor in each matching (the initial weight, the quota, and the load of each processor after each matching).

Table 1 shows the initial load of each processor, the quota of each processor, and the load of each processor after each matching for the given example.

Perform the physical load transfer

After the determination of the load transfer sequence, the physical load transfer can be carried out among the processors according to the load transfer sequence in parallel. The goals of the physical load transfer are balancing the load of processors and minimizing the communication cost among processors. Assume that processor P needs to send m finite element nodes to processor Q. To minimize the communication cost between processors P and Q, P sends finite element nodes that are adjacent to those in Q (we call these nodes boundary nodes) to Q. If the number of boundary nodes is greater than m, boundary nodes with smaller degrees will be sent from P to Q. If the number of boundary nodes is less than m, the boundary nodes and nodes that are adjacent to the boundary nodes will be sent from P to Q. The algorithm of the PCMPLB method is summarized as follows.

Algorithm Prefix_Code_Matching_Parallel_Load_Balancing(P, N)
/* P is the number of processors and N is the number of finite element nodes */
1. Obtain a processor graph G from the initial partition;
2. T_Prefix = build_prefix_code_tree(G);
3. determine_load_transfer_sequence(G, T_Prefix);
4. Transfer the finite element nodes according to the load transfer sequence in parallel;
end of Prefix_Code_Matching_Parallel_Load_Balancing
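The boundary-node selection described above can be sketched as follows; the graph encoding and the fallback traversal order are our illustrative choices, not the paper's implementation:

```python
# An illustrative sketch of the physical load transfer's node
# selection: sender P prefers nodes adjacent to Q's partition (boundary
# nodes), lowest degree first, and falls back to nodes adjacent to the
# boundary when the boundary is smaller than m.

def pick_nodes_to_send(adj, part_p, part_q, m):
    # Boundary nodes: nodes of P adjacent to at least one node of Q.
    boundary = [v for v in sorted(part_p)
                if any(u in part_q for u in adj[v])]
    if len(boundary) >= m:
        # Prefer low-degree boundary nodes to keep the cut small.
        return sorted(boundary, key=lambda v: len(adj[v]))[:m]
    # Otherwise also take nodes of P adjacent to the boundary nodes.
    chosen = list(boundary)
    for v in boundary:
        for u in adj[v]:
            if u in part_p and u not in chosen:
                chosen.append(u)
                if len(chosen) == m:
                    return chosen
    return chosen
```

Sending boundary nodes first moves the partition frontier rather than creating disconnected islands, which is what keeps the communication cost low after the transfer.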

Figure 6. The test sample Truss (7325 nodes, … elements).

4. Performance evaluation and experimental results

We have implemented the PCMPLB method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method (DD) [45] and the multilevel diffusion method (MD) [45], and five mapping methods, the AE/MC method [8], the AE/ORB method [8], the JOSTLE-MS method [45], the MLkP method [24], and the PARTY library method [35]. Three criteria, the execution time of mapping/load-balancing methods, the computation time of an application program under different mapping/load-balancing methods, and the speedups achieved by the mapping/load-balancing methods for an application program, are used for the performance evaluation. In dealing with the unstructured finite element graphs, the distributed irregular mesh environment (DIME) [48] is used. DIME is a programming environment for doing distributed calculations with unstructured triangular meshes. The mesh covers a two-dimensional manifold, whose boundaries may be defined by straight lines, arcs of circles, or Bezier cubic sections. It also provides functions for creating, manipulating, and refining unstructured triangular meshes. Since the number of nodes in an unstructured triangular mesh cannot exceed 10,000 in DIME, in this paper, we only use DIME to generate the initial test sample. From the initial test graph, we use our refining algorithms and data structures to generate the desired test graphs. The initial test graph used for the performance evaluation is shown in Figure 6. The numbers of nodes and elements for the test graph after each refinement are shown in Table 2.

Table 2. The number of nodes and elements of the test graph Truss (the initial sample and each refinement).

For the presentation purpose,

the number of nodes and the number of finite elements shown in Figure 6 are less than those shown in Table 2. To emulate the execution of a solution-adaptive finite element application program on an SP2 parallel machine, we have the following steps. First, we read the initial finite element graph. Then we use the AE MC method, the AE ORB method, the JOSTLE-MS method, the ML kP method, or the PARTY library method to map nodes of the initial finite element graph to processors. After the mapping, the computation of each processor is carried out. In our example, the computation is to solve Laplace's equation (Laplace solver). The algorithm for solving Laplace's equation is similar to that of [2]. Since it is difficult to predict the number of iterations for the convergence of a Laplace solver, we assume that the maximum number of iterations executed by our Laplace solver is …. When the computation has converged, the first refined finite element graph is read. To balance the computational load of processors, the AE MC method, the AE ORB method, the JOSTLE-MS method, the ML kP method, the PARTY library method, the directed diffusion method, the multilevel diffusion method, or the PCMPLB method is applied. After a mapping/load-balancing method is performed, the computation for each processor is carried out. The mesh refinement, load balancing, and computation processes are performed in turn until the execution of a solution-adaptive finite element application program is completed. By combining the initial mapping methods and the methods for load balancing, there are twenty methods used for the performance evaluation. We define M = {AE MC, AE ORB, JOSTLE-MS, ML kP, PARTY, AE MC+DD, AE ORB+DD, JOSTLE-MS+DD, ML kP+DD, PARTY+DD, AE MC+MD, AE ORB+MD, JOSTLE-MS+MD, ML kP+MD, PARTY+MD, AE MC+PCMPLB, AE ORB+PCMPLB, JOSTLE-MS+PCMPLB, ML kP+PCMPLB, PARTY+PCMPLB}. In M, AE ORB means that the AE ORB method is used to perform the initial mapping and the AE ORB method is used to balance the computational load of processors in each refinement.
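The emulation procedure above (initial mapping, computation, then alternating refinement and rebalancing) can be sketched as a driver loop. This is a hypothetical illustration only; the callables are stand-ins, not DIME or SP2 interfaces, and all names are assumptions.

```python
def emulate(initial_graph, refinements, initial_map, balance, solve):
    """Drive the map / solve / refine / rebalance cycle described in the text.

    initial_map: maps graph nodes to processors (e.g. AE ORB, ML kP, ...)
    balance:     rebalances an existing mapping after a refinement
                 (e.g. PCMPLB, DD, MD, or the mapping method itself)
    solve:       runs the Laplace solver to convergence on one mesh
    """
    mapping = initial_map(initial_graph)   # initial partitioning
    solve(initial_graph, mapping)          # computation phase
    for graph in refinements:              # each refined mesh, in order
        mapping = balance(graph, mapping)  # load balancing after refinement
        solve(graph, mapping)              # computation phase again
    return mapping
```

The loop makes explicit that the same mapping is threaded through successive refinements, which is what allows an incremental balancer such as PCMPLB to move only the excess load.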
AE ORB+PCMPLB means that the AE ORB method is used to perform the initial mapping and the PCMPLB method is used to balance the computational load of processors in each refinement.

4.1. The cost model for mapping unstructured solution-adaptive finite element models on distributed memory multicomputers

To map an N-node finite element graph on a P-processor distributed memory multicomputer, we need to assign nodes of the graph to processors of the multicomputer. There are P^N possible mappings. The execution time of a finite element graph on a distributed memory multicomputer under a particular mapping/load-balancing method L can be defined as follows:

    T_par(L) = max_{0 <= j <= P-1} { T_comp(L, P_j) + T_comm(L, P_j) },    (1)

where T_par(L) is the execution time of a finite element application program on a distributed memory multicomputer under L, T_comp(L, P_j) is the computation cost of processor P_j under L, and T_comm(L, P_j) is the communication cost of processor P_j under L, where i = 1, ..., P^N indexes the possible mappings and j = 0, ..., P-1. The cost model used in Equation 1 assumes a synchronous communication mode in which each processor goes through a computation phase followed by a communication phase. Therefore, the computation cost of processor P_j under a mapping/load-balancing method L can be defined as follows:

    T_comp(L, P_j) = S × load(P_j) × T_task,    (2)

where S is the number of iterations performed by a finite element method, load(P_j) is the number of nodes of a finite element graph assigned to processor P_j, and T_task is the time for a processor to execute a task. In our communication model, we assume that every processor can communicate with all other processors in one step. In general, it is possible to overlap the communication with the computation. In this case, T_comm(L, P_j) may not always reflect the true communication cost since it could be partially overlapped with the computation. However, T_comm(L, P_j) can provide a good estimate for the communication cost. Since we use a synchronous communication mode, T_comm(L, P_j) can be defined as follows:

    T_comm(L, P_j) = S × (c_j × T_setup + b_j × T_c),    (3)

where S is the number of iterations performed by a finite element method, c_j is the number of processors that processor P_j has to send data to in each iteration, T_setup is the setup time of the I/O channel, b_j is the total number of bytes that processor P_j has to send out in each iteration, and T_c is the data transmission time of the I/O channel per byte. Let T_seq denote the execution time of a finite element graph on a distributed memory multicomputer with one processor. The speedup resulting from a mapping/load-balancing method L for an application program is defined as

    Speedup(L) = T_seq / T_par(L),    (4)

Let T_i(L)
denote the time for the Laplace solver to execute one iteration for the ith refinement of the test finite element graph under a mapping/load-balancing method L, where i = 0, 1, ..., 5 and L ∈ M. For the presentation purpose, we assume that the initial finite element graph is the 0th refined finite element graph. T_i(L) is defined as follows:

    T_i(L) = max_{0 <= j <= P-1} { T_comp(L, P_j) + T_comm(L, P_j) },    (5)

where the computation and communication costs are those of Equations 2 and 3 evaluated with S = 1 on the ith refined graph, since T_i(L) is the time for a single iteration.
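The cost model of Equations 1 to 3 can be sketched directly. This is a minimal illustration under the stated synchronous model; the parameter names and any example values are assumptions for illustration, not measurements from the paper.

```python
def t_comp(S, load_j, T_task):
    # Equation 2: computation cost of processor j over S iterations.
    return S * load_j * T_task

def t_comm(S, c_j, b_j, T_setup, T_c):
    # Equation 3: communication cost of processor j
    # (c_j destination processors and b_j bytes sent per iteration).
    return S * (c_j * T_setup + b_j * T_c)

def t_par(S, loads, dests, bytes_out, T_task, T_setup, T_c):
    # Equation 1: the parallel time is dominated by the slowest processor.
    return max(t_comp(S, loads[j], T_task) +
               t_comm(S, dests[j], bytes_out[j], T_setup, T_c)
               for j in range(len(loads)))
```

For example, with two processors holding 100 and 120 nodes and identical communication terms, the heavier processor determines T_par, which is why load balancing lowers the parallel time even when total work is unchanged.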

The total execution time of the test finite element graphs on a distributed memory multicomputer is defined as follows:

    T_total(L) = T_exec(L) + Σ_{i=0}^{5} T_i(L) × S_i,    (6)

where T_total(L) is the total execution time of the test samples under a mapping/load-balancing method L on a distributed memory multicomputer, L ∈ M, T_exec(L) is the total execution time of a mapping/load-balancing method L for the test samples, and S_i is the number of iterations executed by the Laplace solver for the ith refinement. From Equation 6, we can derive the speedup achieved by a mapping/load-balancing method as follows:

    Speedup(L) = (Σ_{i=0}^{5} Seq_i × S_i) / (T_exec(L) + Σ_{i=0}^{5} T_i(L) × S_i),    (7)

where Speedup(L) is the speedup achieved by a mapping/load-balancing method L for the test samples, L ∈ M, and Seq_i is the time for the Laplace solver to execute one iteration for the ith refinement of the test graphs sequentially. The maximum speedup achieved by a mapping/load-balancing method L can be derived by letting the values of S_i go to infinity. In this case, T_exec(L) is negligible. We have the following equation:

    Speedup_max(L) = (Σ_{i=0}^{5} Seq_i) / (Σ_{i=0}^{5} T_i(L)),    (8)

where Speedup_max(L) is the maximum speedup achieved by mapping/load-balancing method L and L ∈ M.

4.2. Comparisons of the execution time of mapping/load-balancing methods

The execution times of different mapping/load-balancing methods for Truss on 10, 30, and 50 processors of an SP2 parallel machine are shown in Table 3. In Table 3, we list the initial mapping time and the refinement time for the mapping/load-balancing methods. The initial mapping time is the execution time of the mapping methods to map finite element nodes of the initial test sample to processors. The refinement time is the sum of the execution times of the mapping/load-balancing methods to balance the load of processors after each refinement. Since we deal with the load-balancing issue in this paper, we focus on the refinement time comparison of the mapping/load-balancing methods. From Table 3, we can see that, in general, the refinement time of the load-balancing methods is less than that of the mapping methods. The reasons are:
(1) the mapping methods have a higher time complexity than the load-balancing methods;

Table 3. The execution time of different mapping/load-balancing methods for the test sample on different numbers of processors (10, 30, and 50 processors; initial mapping time and refinement time, in seconds, for each of: AE MC, AE MC+DD, AE MC+MD, AE MC+PCMPLB, AE ORB, AE ORB+DD, AE ORB+MD, AE ORB+PCMPLB, JOSTLE-MS, JOSTLE-MS+DD, JOSTLE-MS+MD, JOSTLE-MS+PCMPLB, ML kP, ML kP+DD, ML kP+MD, ML kP+PCMPLB, PARTY, PARTY+DD, PARTY+MD, PARTY+PCMPLB; the numeric entries were lost in transcription).

and (2) the mapping methods need to perform gather-scatter operations that are time consuming in each refinement. For the same initial mapping method, the refinement time of the PCMPLB method, in general, is less than that of the directed diffusion and the multilevel diffusion methods. The reasons are as follows: (1) The PCMPLB method has a lower time complexity than the directed diffusion and the multilevel diffusion methods. (2) The number of data movement steps among processors in the PCMPLB method is less than those of the directed diffusion method and the multilevel diffusion method.

4.3. Comparisons of the execution time of the test sample under different mapping/load-balancing methods

The times for a Laplace solver to execute one iteration (computation + communication) for the test sample under different mapping/load-balancing methods on 10, 30, and 50 processors of an SP2 parallel machine are shown in Figure 7, Figure 8, and Figure 9, respectively. Since we assume a synchronous communication model, the total time for a Laplace solver to complete its job is the sum of the computation time and the communication time. From Figure 7 to Figure 9, we can see that if the initial mapping is performed by a mapping method (for example, AE ORB) and the same mapping method or a load-balancing method (DD, MD, PCMPLB) is performed for each refinement, the execution time of a Laplace solver under the proposed load-balancing method is less than that of the other methods. The reasons are as follows: (1) In the PCMPLB method, data migration is done between adjacent processors. This local data migration mechanism can reduce the communication cost of a Laplace solver. (2) The PCMPLB method can fully balance the load of processors. In JOSTLE-MS, ML kP, and PARTY, a 3% to 5% load imbalance is allowed due to the trade-off between the computation and the communication costs. The DD and MD methods may sometimes fail to balance the load of processors. This load imbalance may result in a high computation cost.

Figure 7. The time for the Laplace solver to execute one iteration (computation + communication) for the test sample under different mapping/load-balancing methods on 10 processors.

Figure 8. The time for the Laplace solver to execute one iteration (computation + communication) for the test sample under different mapping/load-balancing methods on 30 processors.

Figure 9. The time for the Laplace solver to execute one iteration (computation + communication) for the test sample under different mapping/load-balancing methods on 50 processors.

4.4. Comparisons of the speedups under the mapping/load-balancing methods for the test sample

The speedups and the maximum speedups under the mapping/load-balancing methods on 10, 30, and 50 processors of an SP2 parallel machine for the test sample are shown in Table 4. From Table 4, we can see that if the initial mapping is performed by a mapping method (for example, AE ORB) and the same mapping method or a load-balancing method (DD, MD, PCMPLB) is performed for each refinement, the proposed load-balancing method has the best speedup among the mapping/load-balancing methods. We can also see that, under the same conditions, the proposed load-balancing method has the best maximum speedup among the mapping/load-balancing methods. For the mapping methods, AE MC has the best maximum speedups for the test samples. For the load-balancing methods, AE MC+PCMPLB has the best maximum speedups for the test samples. From Table 4, we can see that if a better initial mapping method is used, a better maximum speedup can be expected when the PCMPLB method is used in each refinement.

Table 4. The speedups and maximum speedups for the test sample under the mapping/load-balancing methods on an SP2 parallel machine (10, 30, and 50 processors; speedup and maximum speedup for each of: AE MC, AE MC+DD, AE MC+MD, AE MC+PCMPLB, AE ORB, AE ORB+DD, AE ORB+MD, AE ORB+PCMPLB, JOSTLE-MS, JOSTLE-MS+DD, JOSTLE-MS+MD, JOSTLE-MS+PCMPLB, ML kP, ML kP+DD, ML kP+MD, ML kP+PCMPLB, PARTY, PARTY+DD, PARTY+MD, PARTY+PCMPLB; the numeric entries were lost in transcription).
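The speedup and maximum-speedup figures discussed above follow Equations 6 to 8. A minimal sketch of how they might be computed from per-refinement timings (the variable names and any sample values are assumptions for illustration):

```python
def speedup(seq, t_iter, iters, t_exec):
    """Equation 7.

    seq[i], t_iter[i]: sequential and parallel one-iteration times
                       (Seq_i and T_i(L)) for refinement i
    iters[i]:          iteration count S_i for refinement i
    t_exec:            total load-balancing overhead T_exec(L)
    """
    num = sum(s * n for s, n in zip(seq, iters))
    den = t_exec + sum(t * n for t, n in zip(t_iter, iters))
    return num / den

def max_speedup(seq, t_iter):
    # Equation 8: as every S_i grows, the overhead T_exec(L) becomes
    # negligible and the speedup approaches this bound.
    return sum(seq) / sum(t_iter)
```

The bound explains the paper's observation that a load-balancing method with small per-refinement overhead mainly wins through shorter per-iteration times, not through the overhead term itself.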

5. Conclusions

In this paper, we have proposed a prefix code matching parallel load-balancing method, the PCMPLB method, to deal with the load imbalance of solution-adaptive finite element application programs. We have implemented this method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method and the multilevel diffusion method, and five mapping methods, the AE MC method, the AE ORB method, the JOSTLE-MS method, the ML kP method, and the PARTY library method. An unstructured finite element graph Truss was used as the test sample. Three criteria, the execution time of mapping/load-balancing methods, the execution time of a solution-adaptive finite element application program under different mapping/load-balancing methods, and the speedups under the mapping/load-balancing methods for a solution-adaptive finite element application program, are used for the performance evaluation. The experimental results show that (1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is shorter than that under the mapping method; (2) the execution time of an application program under the PCMPLB method is shorter than that under the directed diffusion method and the multilevel diffusion method.

Acknowledgments

The authors would like to thank Dr. Robert Preis, Professor Karypis, and Professor Chris Walshaw for providing the PARTY, METIS, and JOSTLE software packages. The work of this paper was partially supported by the National Science Council of the Republic of China under contract NSC ….

References

1. I. G. Angus, G. C. Fox, J. S. Kim, and D. W. Walker. Solving Problems on Concurrent Processors, Vol. 2. Prentice-Hall, Englewood Cliffs, N.J.
2. C. Aykanat, F. Ozguner, F. Ercal, and P. Sadayappan. Iterative algorithms for solution of large sparse systems of linear equations on hypercubes. IEEE Trans. on Computers, 37(12), 1988.
3. C. Aykanat, F. Ozguner, S.
Martin, and S. M. Doraivelu. Parallelization of a finite element application program on a hypercube multiprocessor. Hypercube Multiprocessors, 1987.
4. S. B. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. SIAM Journal on Scientific and Statistical Computing, 12(1), 1991.
5. S. T. Barnard and H. D. Simon. Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurrency: Practice and Experience, 6(2), 1994.

6. S. T. Barnard and H. D. Simon. A parallel implementation of multilevel recursive spectral bisection for application to adaptive unstructured meshes. Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, Feb. 1995.
7. J. A. Bondy and U. S. R. Murty. Graph Theory with Applications. Elsevier North Holland, New York, 1976.
8. Y. C. Chung and C. J. Liao. A processor oriented partitioning method for mapping unstructured finite element models on SP2 parallel machines. Technical report, Institute of Information Engineering, Feng Chia University, Taichung, Taiwan.
9. G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7(2), 1989.
10. F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. Journal of Parallel and Distributed Computing, 10:35-44, 1990.
11. C. Farhat and H. D. Simon. TOP/DOMDEC: a software tool for mesh partitioning and parallel processing. Technical report RNR-…, NASA Ames Research Center.
12. C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. Proceedings of the 19th IEEE Design Automation Conference, 1982.
13. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, 1979.
14. J. R. Gilbert and E. Zmijewski. A parallel graph partitioning algorithm for a message-passing multiprocessor. International Journal of Parallel Programming, 16(6).
15. J. R. Gilbert, G. L. Miller, and S. H. Teng. Geometric mesh partitioning: implementation and experiments. Proceedings of the 9th International Parallel Processing Symposium, Santa Barbara, Calif., Apr. 1995.
16. A. Heirich and S. Taylor. A parabolic load balancing method. Proceedings of ICPP '95, 1995.
17. B. Hendrickson and R. Leland. The Chaco user's guide: version 2.0. Technical report SAND…, Sandia National Laboratories, Albuquerque, NM, Oct. 1995.
18. B. Hendrickson and R. Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 16(2), 1995.
19. B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. Proceedings of Supercomputing '95, Dec. 1995.
20. G. Horton. A multi-level diffusion method for dynamic load balancing. Parallel Computing, 19, 1993.
21. S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10(2), 1990.
22. Y. F. Hu and R. J. Blake. An optimal dynamic load balancing algorithm. Technical report DL-P-…, Daresbury Laboratory, Warrington, UK.
23. D. A. Huffman. A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40:1098-1101, 1952.
24. G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Technical report, Department of Computer Science, University of Minnesota, Minneapolis.
25. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical report, Department of Computer Science, University of Minnesota, Minneapolis.
26. B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291-307, 1970.
27. L. Lapidus and G. F. Pinder. Numerical Solution of Partial Differential Equations in Science and Engineering. Wiley, New York, 1982.
28. F. C. H. Lin and R. M. Keller. The gradient model load balancing method. IEEE Trans. on Software Engineering, SE-13(1):32-38, 1987.
29. D. M. Nicol. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 23(2), 1994.
30. L. Oliker and R. Biswas. Efficient load balancing and data remapping for adaptive grid calculations. Technical report, NASA Ames Research Center, Moffett Field, Calif., 1997.


More information

universitat Autónoma' de Barcelona

universitat Autónoma' de Barcelona unverstat Autónoma' de Barcelona A new dstrbuted dffuson algorthm for dynamc load-balancng n parallel systems Departament d'informàtca Untat d'arqutectura d'ordnadors Sstemes Operatus A thess submtted

More information

An ILP Formulation for Task Mapping and Scheduling on Multi-core Architectures

An ILP Formulation for Task Mapping and Scheduling on Multi-core Architectures An ILP Formulaton for Task Mappng and Schedulng on Mult-core Archtectures Yng Y, We Han, Xn Zhao, Ahmet T. Erdogan and Tughrul Arslan Unversty of Ednburgh, The Kng's Buldngs, Mayfeld Road, Ednburgh, EH9

More information

QoS-based Scheduling of Workflow Applications on Service Grids

QoS-based Scheduling of Workflow Applications on Service Grids QoS-based Schedulng of Workflow Applcatons on Servce Grds Ja Yu, Rakumar Buyya and Chen Khong Tham Grd Computng and Dstrbuted System Laboratory Dept. of Computer Scence and Software Engneerng The Unversty

More information

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and POLYSA: A Polynomal Algorthm for Non-bnary Constrant Satsfacton Problems wth and Mguel A. Saldo, Federco Barber Dpto. Sstemas Informátcos y Computacón Unversdad Poltécnca de Valenca, Camno de Vera s/n

More information

Ants Can Schedule Software Projects

Ants Can Schedule Software Projects Ants Can Schedule Software Proects Broderck Crawford 1,2, Rcardo Soto 1,3, Frankln Johnson 4, and Erc Monfroy 5 1 Pontfca Unversdad Católca de Valparaíso, Chle FrstName.Name@ucv.cl 2 Unversdad Fns Terrae,

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

Period and Deadline Selection for Schedulability in Real-Time Systems

Period and Deadline Selection for Schedulability in Real-Time Systems Perod and Deadlne Selecton for Schedulablty n Real-Tme Systems Thdapat Chantem, Xaofeng Wang, M.D. Lemmon, and X. Sharon Hu Department of Computer Scence and Engneerng, Department of Electrcal Engneerng

More information

BERNSTEIN POLYNOMIALS

BERNSTEIN POLYNOMIALS On-Lne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems STAN-CS-73-355 I SU-SE-73-013 An Analyss of Central Processor Schedulng n Multprogrammed Computer Systems (Dgest Edton) by Thomas G. Prce October 1972 Techncal Report No. 57 Reproducton n whole or n part

More information

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ). REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or

More information

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Conversion between the vector and raster data structures using Fuzzy Geographical Entities Converson between the vector and raster data structures usng Fuzzy Geographcal Enttes Cdála Fonte Department of Mathematcs Faculty of Scences and Technology Unversty of Combra, Apartado 38, 3 454 Combra,

More information

A New Task Scheduling Algorithm Based on Improved Genetic Algorithm

A New Task Scheduling Algorithm Based on Improved Genetic Algorithm A New Task Schedulng Algorthm Based on Improved Genetc Algorthm n Cloud Computng Envronment Congcong Xong, Long Feng, Lxan Chen A New Task Schedulng Algorthm Based on Improved Genetc Algorthm n Cloud Computng

More information

Fault tolerance in cloud technologies presented as a service

Fault tolerance in cloud technologies presented as a service Internatonal Scentfc Conference Computer Scence 2015 Pavel Dzhunev, PhD student Fault tolerance n cloud technologes presented as a servce INTRODUCTION Improvements n technques for vrtualzaton and performance

More information

An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks

An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks 2007 Internatonal Conference on Convergence Informaton Technology An Adaptve and Dstrbuted Clusterng Scheme for Wreless Sensor Networs Xnguo Wang, Xnmng Zhang, Guolang Chen, Shuang Tan Department of Computer

More information

Heuristic Static Load-Balancing Algorithm Applied to CESM

Heuristic Static Load-Balancing Algorithm Applied to CESM Heurstc Statc Load-Balancng Algorthm Appled to CESM 1 Yur Alexeev, 1 Sher Mckelson, 1 Sven Leyffer, 1 Robert Jacob, 2 Anthony Crag 1 Argonne Natonal Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439,

More information

Enabling P2P One-view Multi-party Video Conferencing

Enabling P2P One-view Multi-party Video Conferencing Enablng P2P One-vew Mult-party Vdeo Conferencng Yongxang Zhao, Yong Lu, Changja Chen, and JanYn Zhang Abstract Mult-Party Vdeo Conferencng (MPVC) facltates realtme group nteracton between users. Whle P2P

More information

Minimal Coding Network With Combinatorial Structure For Instantaneous Recovery From Edge Failures

Minimal Coding Network With Combinatorial Structure For Instantaneous Recovery From Edge Failures Mnmal Codng Network Wth Combnatoral Structure For Instantaneous Recovery From Edge Falures Ashly Joseph 1, Mr.M.Sadsh Sendl 2, Dr.S.Karthk 3 1 Fnal Year ME CSE Student Department of Computer Scence Engneerng

More information

A Multi-Camera System on PC-Cluster for Real-time 3-D Tracking

A Multi-Camera System on PC-Cluster for Real-time 3-D Tracking The 23 rd Conference of the Mechancal Engneerng Network of Thaland November 4 7, 2009, Chang Ma A Mult-Camera System on PC-Cluster for Real-tme 3-D Trackng Vboon Sangveraphunsr*, Krtsana Uttamang, and

More information

A Performance Analysis of View Maintenance Techniques for Data Warehouses

A Performance Analysis of View Maintenance Techniques for Data Warehouses A Performance Analyss of Vew Mantenance Technques for Data Warehouses Xng Wang Dell Computer Corporaton Round Roc, Texas Le Gruenwald The nversty of Olahoma School of Computer Scence orman, OK 739 Guangtao

More information

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters Frequency Selectve IQ Phase and IQ Ampltude Imbalance Adjustments for OFDM Drect Converson ransmtters Edmund Coersmeer, Ernst Zelnsk Noka, Meesmannstrasse 103, 44807 Bochum, Germany edmund.coersmeer@noka.com,

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS Magdalena Rogalska 1, Wocech Bożeko 2,Zdzsław Heduck 3, 1 Lubln Unversty of Technology, 2- Lubln, Nadbystrzycka 4., Poland. E-mal:rogalska@akropols.pol.lubln.pl

More information

Real-Time Process Scheduling

Real-Time Process Scheduling Real-Tme Process Schedulng ktw@cse.ntu.edu.tw (Real-Tme and Embedded Systems Laboratory) Independent Process Schedulng Processes share nothng but CPU Papers for dscussons: C.L. Lu and James. W. Layland,

More information

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson

More information

An Alternative Way to Measure Private Equity Performance

An Alternative Way to Measure Private Equity Performance An Alternatve Way to Measure Prvate Equty Performance Peter Todd Parlux Investment Technology LLC Summary Internal Rate of Return (IRR) s probably the most common way to measure the performance of prvate

More information

Conferencing protocols and Petri net analysis

Conferencing protocols and Petri net analysis Conferencng protocols and Petr net analyss E. ANTONIDAKIS Department of Electroncs, Technologcal Educatonal Insttute of Crete, GREECE ena@chana.tecrete.gr Abstract: Durng a computer conference, users desre

More information

Graph Calculus: Scalable Shortest Path Analytics for Large Social Graphs through Core Net

Graph Calculus: Scalable Shortest Path Analytics for Large Social Graphs through Core Net Graph Calculus: Scalable Shortest Path Analytcs for Large Socal Graphs through Core Net Lxn Fu Department of Computer Scence Unversty of North Carolna at Greensboro Greensboro, NC, U.S.A. lfu@uncg.edu

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

A Fast Incremental Spectral Clustering for Large Data Sets

A Fast Incremental Spectral Clustering for Large Data Sets 2011 12th Internatonal Conference on Parallel and Dstrbuted Computng, Applcatons and Technologes A Fast Incremental Spectral Clusterng for Large Data Sets Tengteng Kong 1,YeTan 1, Hong Shen 1,2 1 School

More information

Automated information technology for ionosphere monitoring of low-orbit navigation satellite signals

Automated information technology for ionosphere monitoring of low-orbit navigation satellite signals Automated nformaton technology for onosphere montorng of low-orbt navgaton satellte sgnals Alexander Romanov, Sergey Trusov and Alexey Romanov Federal State Untary Enterprse Russan Insttute of Space Devce

More information

Realistic Image Synthesis

Realistic Image Synthesis Realstc Image Synthess - Combned Samplng and Path Tracng - Phlpp Slusallek Karol Myszkowsk Vncent Pegoraro Overvew: Today Combned Samplng (Multple Importance Samplng) Renderng and Measurng Equaton Random

More information

Availability-Based Path Selection and Network Vulnerability Assessment

Availability-Based Path Selection and Network Vulnerability Assessment Avalablty-Based Path Selecton and Network Vulnerablty Assessment Song Yang, Stojan Trajanovsk and Fernando A. Kupers Delft Unversty of Technology, The Netherlands {S.Yang, S.Trajanovsk, F.A.Kupers}@tudelft.nl

More information

Methodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications

Methodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications Methodology to Determne Relatonshps between Performance Factors n Hadoop Cloud Computng Applcatons Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng and

More information

Software project management with GAs

Software project management with GAs Informaton Scences 177 (27) 238 241 www.elsever.com/locate/ns Software project management wth GAs Enrque Alba *, J. Francsco Chcano Unversty of Málaga, Grupo GISUM, Departamento de Lenguajes y Cencas de

More information

A Dynamic Energy-Efficiency Mechanism for Data Center Networks

A Dynamic Energy-Efficiency Mechanism for Data Center Networks A Dynamc Energy-Effcency Mechansm for Data Center Networks Sun Lang, Zhang Jnfang, Huang Daochao, Yang Dong, Qn Yajuan A Dynamc Energy-Effcency Mechansm for Data Center Networks 1 Sun Lang, 1 Zhang Jnfang,

More information

An MILP model for planning of batch plants operating in a campaign-mode

An MILP model for planning of batch plants operating in a campaign-mode An MILP model for plannng of batch plants operatng n a campagn-mode Yanna Fumero Insttuto de Desarrollo y Dseño CONICET UTN yfumero@santafe-concet.gov.ar Gabrela Corsano Insttuto de Desarrollo y Dseño

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

Research of concurrency control protocol based on the main memory database

Research of concurrency control protocol based on the main memory database Research of concurrency control protocol based on the man memory database Abstract Yonghua Zhang * Shjazhuang Unversty of economcs, Shjazhuang, Shjazhuang, Chna Receved 1 October 2014, www.cmnt.lv The

More information

320 The Internatonal Arab Journal of Informaton Technology, Vol. 5, No. 3, July 2008 Comparsons Between Data Clusterng Algorthms Osama Abu Abbas Computer Scence Department, Yarmouk Unversty, Jordan Abstract:

More information

Distributed Multi-Target Tracking In A Self-Configuring Camera Network

Distributed Multi-Target Tracking In A Self-Configuring Camera Network Dstrbuted Mult-Target Trackng In A Self-Confgurng Camera Network Crstan Soto, B Song, Amt K. Roy-Chowdhury Department of Electrcal Engneerng Unversty of Calforna, Rversde {cwlder,bsong,amtrc}@ee.ucr.edu

More information

Power Consumption Optimization Strategy of Cloud Workflow. Scheduling Based on SLA

Power Consumption Optimization Strategy of Cloud Workflow. Scheduling Based on SLA Power Consumpton Optmzaton Strategy of Cloud Workflow Schedulng Based on SLA YONGHONG LUO, SHUREN ZHOU School of Computer and Communcaton Engneerng Changsha Unversty of Scence and Technology 960, 2nd Secton,

More information

A Load-Balancing Algorithm for Cluster-based Multi-core Web Servers

A Load-Balancing Algorithm for Cluster-based Multi-core Web Servers Journal of Computatonal Informaton Systems 7: 13 (2011) 4740-4747 Avalable at http://www.jofcs.com A Load-Balancng Algorthm for Cluster-based Mult-core Web Servers Guohua YOU, Yng ZHAO College of Informaton

More information

J. Parallel Distrib. Comput. Environment-conscious scheduling of HPC applications on distributed Cloud-oriented data centers

J. Parallel Distrib. Comput. Environment-conscious scheduling of HPC applications on distributed Cloud-oriented data centers J. Parallel Dstrb. Comput. 71 (2011) 732 749 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. ournal homepage: www.elsever.com/locate/pdc Envronment-conscous schedulng of HPC applcatons

More information

An interactive system for structure-based ASCII art creation

An interactive system for structure-based ASCII art creation An nteractve system for structure-based ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract Non-Photorealstc Renderng (NPR), whose am

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems 1 Mult-Resource Far Allocaton n Heterogeneous Cloud Computng Systems We Wang, Student Member, IEEE, Ben Lang, Senor Member, IEEE, Baochun L, Senor Member, IEEE Abstract We study the mult-resource allocaton

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

Master s Thesis. Configuring robust virtual wireless sensor networks for Internet of Things inspired by brain functional networks

Master s Thesis. Configuring robust virtual wireless sensor networks for Internet of Things inspired by brain functional networks Master s Thess Ttle Confgurng robust vrtual wreless sensor networks for Internet of Thngs nspred by bran functonal networks Supervsor Professor Masayuk Murata Author Shnya Toyonaga February 10th, 2014

More information

All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing

All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing All Roads Lead to Rome: Optmstc Recovery for Dstrbuted Iteratve Data Processng Sebastan Schelter Stephan Ewen Kostas Tzoumas Volker Markl Technsche Unverstät Berln, Germany frstname.lastname@tu-berln.de

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS

METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng

More information

A Simple Approach to Clustering in Excel

A Simple Approach to Clustering in Excel A Smple Approach to Clusterng n Excel Aravnd H Center for Computatonal Engneerng and Networng Amrta Vshwa Vdyapeetham, Combatore, Inda C Rajgopal Center for Computatonal Engneerng and Networng Amrta Vshwa

More information

Application of Multi-Agents for Fault Detection and Reconfiguration of Power Distribution Systems

Application of Multi-Agents for Fault Detection and Reconfiguration of Power Distribution Systems 1 Applcaton of Mult-Agents for Fault Detecton and Reconfguraton of Power Dstrbuton Systems K. Nareshkumar, Member, IEEE, M. A. Choudhry, Senor Member, IEEE, J. La, A. Felach, Senor Member, IEEE Abstract--The

More information