Partition And Load Balancer on World Wide Web


JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 17, (2001)

UMPAL: An Unstructured Mesh Partitioner and Load Balancer on World Wide Web

WILLIAM C. CHU *, DON-LIN YANG, JEN-CHIH YU AND YEH-CHING CHUNG
* Department of Computer and Information Science, TungHai University, Taichung, Taiwan 407, R.O.C. chu@cis.thu.edu.tw
Department of Information Engineering, Feng Chia University, Taichung, Taiwan 407, R.O.C. {jcyu, dlyang, ychung}@iecs.fcu.edu.tw

The finite element method (FEM) has been widely used for the structural modeling of physical systems. Because of its computation-intensiveness and computation-locality, it is attractive to implement the finite element method on distributed memory multicomputers. Many research efforts have already provided solid algorithms for mesh partitioning and load balancing. However, without proper support, mesh partitioning and load balancing are labor intensive and tedious. In this paper, we present an unstructured mesh partitioner and load balancer (UMPAL) on the World Wide Web (WWW). UMPAL is an integrated tool that consists of five components: a partitioner, a load balancer, a simulator, a visualization tool, and a Web interface. In the partitioner, three partitioning methods, Jostle/DDM, Metis/DDM, and Party/DDM, are provided. The load balancer provides two load-balancing methods, prefix code matching parallel load-balancing and binomial tree based parallel load-balancing. The simulator provides a performance simulation environment for a partitioned mesh. By inputting the parameters of a target distributed memory multicomputer, one can get the execution result of a partitioned mesh from the simulator. The visualization tool provides a way for users to view a partitioned mesh. The Web interface provides a means for users to use UMPAL via the Internet and integrates the other four parts. Through the Web interface, the other four components can be operated independently or together.
Additionally, UMPAL provides several demonstrations and their corresponding mesh models that beginners can download and experiment with. UMPAL is designed with ease of use, efficiency, and transparency in mind. The experimental results demonstrate the practicality and usefulness of UMPAL.

Keywords: World Wide Web, partitioner, load balancer, unstructured mesh, Internet

Received September 14, 1999; revised May 25, 2000; accepted June 27. Communicated by Gen-Huey Chen.

1. INTRODUCTION

The finite element method has been widely used for structural modeling of physical systems. To solve a problem using the finite element method on a distributed memory multicomputer, in general, we first need to establish a finite element model for the problem. Usually, the model is a 2D or 3D mesh, which is a connected and undirected graph that consists of a number of finite elements, each of which is composed of a number of nodes. The number of nodes of a finite element is determined by the application. In Fig. 1, an example of a 21-node 2D mesh of 24 finite elements is shown. Because of its computation-intensiveness and computation-locality, it is very attractive to implement the finite element method on distributed memory multicomputers.

Fig. 1. An example of a 21-node 2D mesh with 24 finite elements (the circled and uncircled numbers denote the node numbers and finite element numbers, respectively).

To efficiently execute a finite element application program on a distributed memory multicomputer, we need to map the nodes of the corresponding mesh to the processors of the multicomputer such that each processor has the same amount of computational load and the communication among processors is minimized. Since this mapping problem is known to be NP-complete [9], many heuristics have been proposed to find satisfactory sub-optimal solutions [1, 7-8, 10, 12-13, 16-19, 21-26]. Based on these heuristics, many graph partitioners were developed [12, 16-17, 21-22, 24, 26]. Among them, Jostle [24], Metis [16], and Party [21] are regarded as the best graph partitioners available to date. If the number of nodes in a mesh does not increase during the execution of a finite element application program, the mapping algorithm needs to be performed only once. For an adaptive mesh application program, the number of nodes increases during the execution because some finite elements are refined. This results in a load imbalance on the processors.
A load-balancing algorithm has to be performed many times in order to balance the computational load on processors while also keeping the communication cost among processors as low as possible. To deal with the load imbalance problem of an adaptive mesh computation, many load-balancing methods have been proposed in the literature [2-6, 11, 14-15, 20, 22, 24-25].

Without tool support, mesh partitioning and load balancing are labor intensive and tedious. In this paper, we present an unstructured mesh partitioner and load balancer (UMPAL) on the WWW. UMPAL is an integrated tool that consists of five components: a partitioner, a load balancer, a simulator, a visualization tool, and a Web interface. In the partitioner, three partitioning methods, Jostle/DDM, Metis/DDM, and Party/DDM, are provided. In the load balancer, UMPAL provides two load-balancing methods, the prefix code matching parallel load-balancing method [3] and the binomial tree based parallel load-balancing method [4]. The simulator provides a performance simulation environment for a partitioned mesh. By inputting the parameters of a target distributed memory multicomputer, one can obtain the simulated execution result of a partitioned mesh. The visualization tool provides a way for users to view the partitioned mesh. The Web interface provides a means for users to use UMPAL via the Internet and integrates the other four parts. Through the Web interface, the other four components can be operated independently or in cooperation with one another. Additionally, UMPAL provides several demonstration examples and their corresponding models, which beginners can download and experiment with. The design of UMPAL is based on ease of use, efficiency, and transparency. Our experimental results demonstrate the practicality and usefulness of UMPAL.

The rest of this paper is organized as follows. Relevant work is given in Section 2. In Section 3, UMPAL is described in detail. In Section 4, some experimental results of using UMPAL are presented.

2. RELATED WORK

Many methods have been proposed to deal with the partitioning/mapping problems of irregular graphs on distributed memory multicomputers. In general, they can be divided into five classes: orthogonal section [23, 25], min-cut [7-11, 18], spectral [1, 13, 23], multilevel [1, 16-17, 22, 24], and other miscellaneous approaches [10, 19, 25]. These methods were implemented in several graph partition libraries, such as Chaco [12], DIME [26], Jostle [24], Metis [16], and Party [21].
For the orthogonal section approach, an irregular graph is partitioned into modules by recursively cutting the graph into two subgraphs according to the node coordinates, along the x- and y-axes in turn. Each partitioned module has the same amount of computational load. These modules are then mapped to processors. Although this approach does not consider any connectivity information in the graph, it tries to group nodes that are close together in the graph into the same modules.

For the min-cut approach, the Kernighan-Lin (KL) heuristic [18] is the most frequently used method for local bisection. It uses a sequence of logical vertex pair exchanges to determine the sets to be physically exchanged. Several heuristics have been proposed to improve the performance of the KL heuristic [8]. In [7], a recursive min-cut bipartitioning algorithm was proposed to map graphs on hypercubes.

The spectral approach is based on algebraic graph theory. In this method, a matrix similar to the adjacency matrix of the graph is constructed and some specific eigenvectors of this matrix are computed. The determination of the eigenvectors is the major computational task of this method. Nodes of the graph are distributed to corresponding processors according to the values of these eigenvectors. Spectral methods are effective for graph partitioning. However, the time and space they require to partition a graph are quite high.
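The orthogonal section approach described above can be illustrated with a short Python sketch. It is not the code of any of the cited libraries; the function name `orthogonal_bisection`, the `coords` dictionary, and the restriction to 2D coordinates and to a power-of-two number of modules are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def orthogonal_bisection(
    coords: Dict[int, Tuple[float, float]],
    node_ids: List[int],
    levels: int,
    axis: int = 0,
) -> List[List[int]]:
    """Split node_ids into 2**levels modules of near-equal size.

    At each level the nodes are sorted along one coordinate axis and cut
    at the median; the axis alternates between x (0) and y (1).
    Connectivity is ignored, as in the orthogonal section approach.
    """
    if levels == 0:
        return [node_ids]
    ordered = sorted(node_ids, key=lambda n: coords[n][axis])
    half = len(ordered) // 2
    next_axis = 1 - axis  # alternate between the x- and y-axes
    return (
        orthogonal_bisection(coords, ordered[:half], levels - 1, next_axis)
        + orthogonal_bisection(coords, ordered[half:], levels - 1, next_axis)
    )


# Hypothetical example: a 4 x 4 grid of 16 nodes cut into 4 modules.
grid = {i: (float(i % 4), float(i // 4)) for i in range(16)}
modules = orthogonal_bisection(grid, list(range(16)), levels=2)
```

Each module in the example holds four geometrically adjacent nodes, which shows how the method balances load by node count while relying only on coordinates.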

The multilevel approach is based on a coarsening strategy that decreases the size of a graph over several levels using matching techniques. After coarsening, it uses the spectral method, a k-way partitioning method, or another partitioning method to partition the coarsened graph. After partitioning, the partition of the coarse graph is extrapolated to the original one. Then the partitioned modules are assigned to processors.

Other partition methods, such as index-based mapping [19], projection based mapping [10], and simulated annealing (SA) [25], use other heuristics to do graph partitioning and do not belong to the approaches described above.

To solve the load imbalance problem of adaptive mesh computations, many load-balancing algorithms can be used to balance the load on processors. The dimension exchange method (DEM) is applied to application programs without geometric structure [6]. Ou and Ranka [20] proposed a linear programming-based method to solve the incremental graph partitioning problem. Since their method limits the scope of the nodes that can be transferred, it may sometimes yield no solution. Hu and Blake [15] proposed a direct diffusion method that computes the diffusion solution by using an unsteady heat conduction equation, while minimizing the Euclidean norm of the data movement. They proved that a diffusion solution can be found by solving a linear equation. Heirich and Taylor [11] proposed a direct diffusive load-balancing method for scalable multicomputers. They derived a reliable and scalable load-balancing method based on properties of the parabolic heat equation $u_t - \alpha \nabla^2 u = 0$. Horton [14] proposed a multilevel diffusion method that recursively bisects a graph into two subgraphs and balances the load of the two subgraphs. This method assumes that the graph can be recursively bisected into two connected graphs. Schloegel et al.
[22] also proposed a multilevel diffusion scheme to construct a new partition of the graph incrementally. It contains three phases: a coarsening phase, a multilevel diffusion phase, and a multilevel refinement phase. These algorithms perform diffusion in a multilevel framework and minimize data movement without compromising the edge-cut. Their methods also include parameterized heuristics to specifically optimize edge-cut, total data migration, and the maximum amount of data migrated in and out of each processor. Walshaw et al. [24] implemented a parallel partitioner and a direct diffusion repartitioner in Jostle that is based on the diffusion solver proposed by Hu and Blake [15]. They also developed a multilevel diffusion repartitioner in Jostle.

Although several graph partitioning and load balancing methods have been implemented as tools or libraries [12, 16, 21, 24, 26], none of them offers a Web interface or high level support to users.

3. THE SYSTEM STRUCTURE OF UMPAL

The system structure of UMPAL is shown in Fig. 2. It consists of five components: a partitioner, a load balancer, a simulator, a visualization tool, and a Web interface. Users can upload the unstructured mesh data and get the running results using any Web browser. Through the Web interface, the other four components can be operated independently or run cooperatively. In the following, we describe them in detail.

Fig. 2. The system structure of UMPAL. (Diagram blocks: User, Web interface, visualization tool, simulator, load balancer, partitioner.)

3.1 The Partitioner

In the partitioner, we provide three partitioning methods, Jostle/DDM, Metis/DDM, and Party/DDM. They were implemented based on the best algorithms provided in Jostle, Metis, and Party, respectively, combined with the dynamic diffusion optimization method (DDM) [5]. The partitioner of UMPAL has the following advantages:

1. In Jostle, Metis, and Party, a 3% to 5% load imbalance among partitioned modules is allowed. The dynamic diffusion optimization method can efficiently balance this residual load imbalance and, in doing so, improve the total cut-edges of the partitioned modules. Therefore, the partition methods provided in the partitioner perform better than their regular counterparts, i.e., Jostle, Metis, and Party.

2. The partition results of Jostle, Metis, and Party depend on the shapes of unstructured meshes. It is difficult to tell which one performs best for a given unstructured mesh. If we want to get the best result among these three partitioners, we need to run them separately. Since the parameters used in these three partitioners are different, it may take some time to get the desired results. Because the Jostle/DDM, Metis/DDM, and Party/DDM methods are integrated in one partitioner and their parameters are uniform in UMPAL, one can try each method once and take the best partitioning result.

The flow chart of the partitioner is given in Fig. 3. From Fig. 3, we can see that the inputs of the partitioner are the number of processors and a file of an unstructured mesh connection model. The number of processors specifies how many processors will be involved in the partitioning process.
The file of an unstructured mesh connection model can be uploaded from a user's Web browser or can be specified by using the demo model. In UMPAL, we provide five 2D and two 3D unstructured demo meshes. In an unstructured mesh connection model file, the first line specifies the numbers of nodes and edges of an unstructured mesh. The second line describes the neighbors of node 1. The third line describes the neighbors of node 2, and so on. Fig. 4 shows an example file of the connection model of an unstructured mesh.

Fig. 3. The flow chart of the partitioner. (Inputs: unstructured mesh connection model, number of processors. Partition methods: Metis/DDM, Party/DDM, and Jostle/DDM. Outputs: partitioned unstructured mesh file, total cut-edges, and load balancing degree.)

Fig. 4. Format of the unstructured mesh connection model file. (Line 1: the model contains 100 nodes and 300 edges. Line 2: the neighbor nodes of node 1, which are nodes 2, 3, 4, and so on. Line 3: the neighbor nodes of node 2, which are nodes 1, 3, and 4.)

The outputs of the partitioner are a partitioned unstructured mesh file and the partitioned results. In a partitioned unstructured mesh file, a number j in line i indicates that node i belongs to processor j. Users can download the partitioned unstructured mesh file for further use and see the partitioned results on a Web browser. The partitioned results include the load balancing degree and the total cut-edges of a partitioned unstructured mesh. Figs. 5 and 6 show the Web page of the partitioner and the partitioned results of an unstructured mesh Truss, respectively.

3.2 The Load Balancer

In the load balancer, we provide two load-balancing methods, the prefix code matching parallel load-balancing (PCMPLB) method [3] and the binomial tree based parallel load-balancing (BINOTPLB) method [4]. From the flow chart of the load balancer given in Fig. 7, we can see that the inputs of the load balancer are the number of processors and files of the connection model, the element model, and the partitioned model of an unstructured mesh. The number of processors specifies how many processors will be involved in the load balancing process.
The data format of the connection model file of an unstructured mesh is the same as that described in the partitioner. In the element model file of an unstructured mesh, the first line specifies the number of elements. The second line describes the nodes of element 1. The third line describes the nodes of element 2, and so on. Fig. 8 gives an example file showing the format of the element model of an unstructured mesh. The data format of the partitioned model file of an unstructured mesh is the same as that of the output file of the partitioner. In the load balancer, users can also use the partitioned unstructured demo mesh model provided by UMPAL. In this case, the inputs are the load imbalance degree and the number of processors.

Fig. 5. The Web page of the partitioner.

Fig. 6. The partitioned results of Truss.

Fig. 7. The flow chart for the load balancer. (Inputs: unstructured mesh element model, unstructured mesh connection model, partitioned unstructured mesh file, number of processors. Load balancing methods: PCMPLB and BINOTPLB. Outputs: load-balanced unstructured mesh file, total cut-edges, and load balancing degree.)

Fig. 8. Format of the unstructured mesh element model file. (Line 1: the model contains 300 elements. Line 2: nodes 2, 3, and 4 form element 1. Line 3: nodes 1, 3, and 4 form element 2.)
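The connection model and partitioned mesh file formats described above can be read with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the files are assumed to hold whitespace-separated integers with 1-based node numbers, and the imbalance measure (maximum load relative to the average load) is an assumed definition, since the text does not define the load balancing degree.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_connection_model(text: str) -> Tuple[int, int, Dict[int, List[int]]]:
    """Line 1: '<nodes> <edges>'; line i+1: the neighbors of node i."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    num_nodes, num_edges = int(lines[0][0]), int(lines[0][1])
    neighbors = {i: [int(v) for v in lines[i]] for i in range(1, num_nodes + 1)}
    return num_nodes, num_edges, neighbors


def parse_partition(text: str) -> Dict[int, int]:
    """A number j in line i means that node i belongs to processor j."""
    return {i: int(tok) for i, tok in enumerate(text.split(), start=1)}


def total_cut_edges(neighbors: Dict[int, List[int]], part: Dict[int, int]) -> int:
    """Count each undirected edge once if its endpoints lie on different processors."""
    return sum(
        1
        for u, nbrs in neighbors.items()
        for v in nbrs
        if u < v and part[u] != part[v]
    )


def load_imbalance(part: Dict[int, int], num_procs: int) -> float:
    """Assumed measure: (max load - average load) / average load."""
    loads = Counter(part.values())
    average = len(part) / num_procs
    return (max(loads.values()) - average) / average


# Hypothetical 4-node ring (1-2, 2-3, 3-4, 4-1) split over two processors.
connection_text = "4 4\n2 4\n1 3\n2 4\n1 3\n"
partition_text = "0\n0\n1\n1\n"
_, _, ring = parse_connection_model(connection_text)
ring_part = parse_partition(partition_text)
```

For this ring, two edges (2-3 and 4-1) cross the processor boundary and both processors hold two nodes, so the sketch reports two cut-edges and zero imbalance.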

The outputs of the load balancer are a load-balanced unstructured mesh file and the load balancing results. The data format of a load-balanced unstructured mesh file is the same as that of the output file of the partitioner. Users can download the load-balanced unstructured mesh file for further use and see the load balancing results on a Web browser. The load balancing results include the load balancing degree and the total cut-edges. Fig. 9 shows the Web page of the load balancer. Fig. 10 shows the load balancing results of a partitioned unstructured mesh Truss with a 5% load imbalance on 10 processors.

Fig. 9. The Web page of the load balancer.

Fig. 10. The load-balanced results of Truss.

3.3 The Simulator

The simulator provides a simulated distributed memory multicomputer for the performance evaluation of a partitioned unstructured mesh. The execution time of an unstructured mesh on a P-processor distributed memory multicomputer under a particular mapping/load-balancing method $L_i$ can be defined as follows:

$T_{par}(L_i) = \max_{0 \le j \le P-1} \{ T_{comp}(L_i, P_j) + T_{comm}(L_i, P_j) \}$, (1)

where $T_{comp}(L_i, P_j)$ is the computation cost of processor $P_j$ under $L_i$, and $T_{comm}(L_i, P_j)$ is the communication cost of processor $P_j$ under $L_i$, where $j = 0, \ldots, P-1$. The cost model used in Equation 1 assumes a synchronous communication mode in which each processor goes through a computation phase followed by a communication phase. Therefore, the computation cost of processor $P_j$ under a mapping/load-balancing method $L_i$ can be defined as follows:

$T_{comp}(L_i, P_j) = S \times load_i(P_j) \times T_{task}$, (2)

where $S$ is the number of iterations performed by a finite element method, $load_i(P_j)$ is the number of nodes of an unstructured mesh assigned to processor $P_j$, and $T_{task}$ is the time for a processor to execute the tasks of a node.

For the communication model, we assume a synchronous communication mode and that every pair of processors can communicate with each other in one step. In general, it is possible to overlap communication with computation. In this case, $T_{comm}(L_i, P_j)$ may not always reflect the true communication cost since it would partially overlap the computation. However, $T_{comm}(L_i, P_j)$ should provide a good estimate for the communication cost. Since we use a synchronous communication mode, $T_{comm}(L_i, P_j)$ can be defined as follows:

$T_{comm}(L_i, P_j) = S \times (\delta \times T_{setup} + \phi \times T_c)$, (3)

where $S$ is the number of iterations performed by a finite element method, $\delta$ is the number of processors that processor $P_j$ sends data to in each iteration, $T_{setup}$ is the setup time of the I/O channel, $\phi$ is the total amount of data that processor $P_j$ sends out in each iteration, and $T_c$ is the data transmission time of the I/O channel per byte.

Fig. 11. Simulator flow chart. (Inputs: partitioned or load-balanced unstructured mesh file; communication setup time, data transmission time, and executing time for one task. Output: maximum processing time among all processors.)

The simulator flow chart is given in Fig. 11. To use the simulator, users need to input the partitioned or load-balanced unstructured mesh file and the values of $S$, $T_{setup}$, $T_c$, $T_{task}$, and the number of bytes sent by a finite element node to its neighbors. The partitioned or load-balanced unstructured mesh file can be uploaded from a user's browser or it can be a demo file provided by UMPAL. The data format of the partitioned or load-balanced unstructured mesh file is the same as that described for the partitioner and the load balancer. The outputs of the simulator are the execution time of the unstructured mesh on a simulated distributed memory multicomputer and the total cut-edges of a partitioned unstructured mesh. Fig. 12 shows the Web page for the simulator, and Fig.
13 presents the simulation results of Truss on a simulated 10-processor SP2.

Fig. 12. Simulator Web page.

Fig. 13. Simulation results.

3.4 The Visualization Tool

UMPAL also provides a visualization tool for visualizing the partitioned unstructured mesh. The work flow of the visualization tool is shown in Fig. 14. The inputs of the visualization tool are files of the coordinate model, the element model, the partitioned unstructured mesh models, and the image size. In the coordinate model file for an unstructured mesh, line 1 specifies the number of nodes, line 2 specifies the x, y, z coordinates of node 1, line 3 specifies the coordinates of node 2, and so on. Fig. 15 illustrates the file format for an unstructured mesh coordinate model. The data formats of the element model and the partitioned model of an unstructured mesh are the same as those described in the load balancer.

Fig. 14. Visualization tool flow chart. (Inputs: unstructured mesh element model, unstructured mesh coordinate model, partitioned or load-balanced unstructured mesh file, width and height of image. Output: image of unstructured mesh.)

Fig. 15. Data format of an unstructured mesh coordinate model. (Line 1: the model contains 300 nodes. Line 2: x, y, z coordinates of node 1. Line 3: x, y, z coordinates of node 2.)

After rendering, a Web browser displays the unstructured mesh in different colors, with each color representing one processor. Currently, the visualization tool can only display partitioned 2D meshes. The visualization of partitioned 3D meshes is still under development. Fig. 16 shows the visualization tool Web page. Fig. 17 shows the rendering result of Letter_S.
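The cost model of Section 3.3 (Equations 1 to 3) can be sketched in Python as follows. This is a minimal illustration, not the UMPAL simulator itself (which the paper states is written in C). The function name `simulate` is hypothetical, processors are assumed to be numbered 0 to P-1, and the way φ is counted (one message of `bytes_per_node` bytes for every neighbor that lives on another processor) is an assumption, because the text does not spell out how φ is derived from the mesh.

```python
from typing import Dict, List


def simulate(
    neighbors: Dict[int, List[int]],
    part: Dict[int, int],
    num_procs: int,
    S: int,
    t_setup: float,
    t_c: float,
    t_task: float,
    bytes_per_node: int,
) -> float:
    """Return T_par = max_j (T_comp(P_j) + T_comm(P_j)) for one partition."""
    loads = [0] * num_procs                     # load(P_j): nodes on P_j
    dests = [set() for _ in range(num_procs)]   # processors P_j sends to
    sent = [0] * num_procs                      # phi: bytes P_j sends per iteration
    for u, nbrs in neighbors.items():
        p = part[u]
        loads[p] += 1
        for v in nbrs:
            q = part[v]
            if q != p:                          # edge crosses a processor boundary
                dests[p].add(q)
                sent[p] += bytes_per_node
    times = []
    for j in range(num_procs):
        t_comp = S * loads[j] * t_task                          # Equation (2)
        t_comm = S * (len(dests[j]) * t_setup + sent[j] * t_c)  # Equation (3)
        times.append(t_comp + t_comm)
    return max(times)                                           # Equation (1)


# Hypothetical 4-node ring split over two processors, times in microseconds.
ring = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
ring_part = {1: 0, 2: 0, 3: 1, 4: 1}
t_par = simulate(ring, ring_part, num_procs=2, S=1,
                 t_setup=46.0, t_c=0.035, t_task=350.0, bytes_per_node=40)
```

In the example each processor computes two nodes (700 µs), sets up one channel (46 µs), and sends 80 bytes (2.8 µs), giving roughly 748.8 µs per iteration.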

Fig. 16. Visualization tool Web page.

Fig. 17. Result of rendering Letter_S.

3.5 The Web Interface

The Web interface allows users to try the various components of UMPAL. The interface consists of two parts, an HTML interface and a CGI interface. The HTML interface provides Web pages for users to input requests from Web browsers. The CGI interface is responsible for handling these requests. Through the Web interface, the other four components can be operated independently or run in cooperation. The Web interface flow chart is shown in Fig. 18. When a user operates a component independently, the Web interface passes the requests to that component. The component then processes the requests and produces an output. When the request involves more than one component, the Web interface has to control the data flow between the requested components. In this case, the partitioner is always executed before the load balancer, the load balancer before the simulator, and the simulator before the visualization tool. Fig. 19 gives an example of specifying a cooperative process, while Fig. 20 presents the computed solution for the specified problem.

3.6 Implementation of UMPAL

In order to support standard WWW browsers, the front end is written in HTML and uses CGI. The CGI interface is implemented in the Perl language. The CGI interface receives the data and parameters from the forms of the HTML interface, and then it calls external tools to handle the requests. The partitioner, load balancer, and simulator tools of UMPAL are coded in the C programming language. They receive parameters from the CGI interface and use the specified methods to process user requests. To support an interactive visualization tool, the client/server software architecture is used in UMPAL. On the client side, a Java Applet is implemented to display images rendered by the server. On the server side, a Java server-let is implemented as a Java Application.
The Java server-let renders an image of the specified size from the unstructured mesh models. When the server finishes rendering, it sends the final image to the client side so that users can view it.
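The fixed execution order that Section 3.5 describes for cooperative requests can be sketched as a small Python function. This is only an illustration of the ordering rule; it is not the Perl CGI code of UMPAL, and the component identifiers and the name `order_components` are hypothetical.

```python
from typing import Iterable, List

# Order stated in Section 3.5: partitioner, load balancer, simulator,
# visualization tool. The identifier strings are assumed names.
PIPELINE_ORDER = ["partitioner", "load_balancer", "simulator", "visualization_tool"]


def order_components(requested: Iterable[str]) -> List[str]:
    """Return the requested components in the order they must be executed."""
    wanted = set(requested)
    unknown = wanted - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [name for name in PIPELINE_ORDER if name in wanted]


# A request that names the simulator and the partitioner runs the
# partitioner first, whatever order the user ticked the boxes in.
plan = order_components(["simulator", "partitioner"])
```

Keeping the order in one list means that the output of each stage can be handed to the next requested stage without any per-request scheduling logic.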

Fig. 18. The Web interface flow chart. (Diagram labels: User; HTML interface; upload required unstructured mesh models; CGI; connection model; partitioner; partitioned model; element, connection, and partitioned models; load balancer; load balanced model; simulator; total cut-edges and simulation time; element and coordinate models; visualization tool; image of unstructured mesh model; service provider's Web pages.)

Fig. 19. An example of specifying a cooperative process through the Web interface. (a) The upper half. (b) The lower half.

Fig. 20. The solution for the problem given in Fig. 19.

4. EXPERIENCE AND EXPERIMENTAL RESULTS

In this section, we present some experimental results for unstructured meshes obtained by using the partitioner, the load balancer, and the simulator of UMPAL through a Web browser.

4.1 Experimental Results for the Partitioner

To evaluate the performance of Jostle/DDM, Metis/DDM, and Party/DDM, three 2D and two 3D unstructured meshes are used as test samples. The initial 2D unstructured meshes, Hook, Letter_S, and Truss, were created by using the distributed irregular mesh environment (DIME) [26] and then refined by our mesh refinement algorithm. The 3D unstructured meshes, Femur and Tibia, were produced by using our auto mesh generation program on source images obtained from CT (computed tomography). These five unstructured meshes are part of a set of demo meshes provided in UMPAL and are shown in Fig. 21. The number of nodes, the number of elements, and the number of edges of these five unstructured meshes are given in Table 1. For presentation purposes, the number of nodes, number of elements, and number of edges of the irregular finite element graphs shown in Fig. 17 are smaller than those shown in Table 1.

Fig. 21. Unstructured meshes used to evaluate performance.
(a) Hook (1849 nodes, 3411 elements)
(b) Letter_S (6075 nodes, elements)
(c) Truss (7325 nodes, elements)
(d) Femur (6141 nodes, 7448 elements)
(e) Tibia (973 nodes, 1168 elements)

Table 2 shows the total cut-edges of Jostle/DDM, Metis/DDM, and Party/DDM and of their counterparts for the three 2D and two 3D unstructured meshes on 70 processors. The total cut-edges of Jostle, Metis, and Party were obtained by running these three partitioners with default values. The load imbalance degrees allowed by Jostle, Metis, and Party are 3%, 5%, and 5%, respectively. The total cut-edges of Jostle/DDM, Metis/DDM, and Party/DDM were obtained by applying the dynamic diffusion optimization method (DDM) [5] to the partitioned results of Jostle, Metis, and Party, respectively. Jostle/DDM, Metis/DDM, and Party/DDM guarantee that the load among partitioned modules is fully balanced. From Table 2, we can see that the methods provided in the partitioner produce fewer total cut-edges than their counterparts.

Table 1. The number of nodes, elements, and edges in the test samples.
Columns: Samples, #node, #element, #edges
Rows: Hook, Letter_S, Truss, Femur, Tibia

Table 2. The total cut-edges of the methods provided in the partitioner and their counterparts.
Columns: Model, Jostle, Jostle/DDM, Metis, Metis/DDM, Party, Party/DDM
Rows: Truss, Letter_S, Hook, Tibia, Femur

4.2 Experimental Results for the Load Balancer

To evaluate the performance of the prefix code matching parallel load-balancing method (PCMPLB) [3] and the binomial tree based parallel load-balancing method (BINOTPLB) [4] provided in the load balancer, we compare these two methods with the direct diffusion method (DD) and the multilevel diffusion method (MD). In the experiment, 3%, 5%, and 10% load imbalance cases for the 2D and 3D unstructured meshes were tested. We modified the multilevel k-way partitioning (MLkP) program provided in Metis to generate the desired test samples. The methods provided in the load balancer guarantee that the load among partitioned modules will be fully balanced, whereas the DD and MD methods do not. Table 3 shows the total cut-edges produced by DD, MD, PCMPLB, and BINOTPLB for three 2D unstructured meshes on 50 processors. We can see that the methods provided in the load balancer outperform the DD and MD methods in most cases. The load balancing results of PCMPLB and BINOTPLB depend on the test samples. It is difficult to tell which performs better than the other for a given partitioned unstructured mesh.
However, one can run both methods in the load balancer, compare the results, and choose the better one.

Table 3. The total cut-edges produced by DD, MD, PCMPLB, and BINOTPLB for three 2D unstructured meshes on 50 processors.
Columns: Model, Load imbalance degree, DD, MD, PCMPLB, BINOTPLB
Rows: Truss, Letter_S, Hook, Tibia, Femur, each at 3%, 5%, and 10% load imbalance

4.3 Experience with the Simulator

In this experimental test, we simulate the execution of a parallel Laplace solver on a 70-processor SP2 parallel machine. According to [3], the values of $T_{setup}$, $T_c$, and $T_{task}$ are 46 µs, 0.035 µs, and 350 µs, respectively. Each finite element node needs to send 40 bytes to its neighbor nodes. The number of iterations performed by the Laplace solver is set to [...]. Table 4 shows the simulator output for the test samples shown in Fig. 21 under different partitioning methods. For comparison, we also include the simulation results of the test samples under Jostle, Metis, and Party. From Tables 2 and 4, we can see that, in general, the fewer the total cut-edges, the lower the execution time. This simulation may provide a reference to help in choosing the right method for a given unstructured mesh.

Table 4. The simulator output for the test samples shown in Fig. 21. Time in seconds.
Columns: Model, Jostle, Jostle/DDM, Metis, Metis/DDM, Party, Party/DDM
Rows: Truss, Letter_S, Hook, Tibia, Femur

5. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a software tool, UMPAL, for solving partitioning and load balancing problems for unstructured meshes on the World Wide Web. Users can try UMPAL by accessing its Internet address.

UMPAL is an integrated tool that consists of five components: a partitioner, a load balancer, a simulator, a visualization tool, and a Web interface. It was designed to be easy to use, efficient, and transparent. The experimental results presented here demonstrate the practicality and usefulness of UMPAL.

There are several advantages to using UMPAL over the Web. First, users do not need to obtain licenses for any of the software packages. Second, there is no need to install, maintain, or upgrade the software. All users work with the latest version, which brings standardization to the tools. The integration of different methods into UMPAL makes experiments and simulations of parallel programs simple and cost-effective. UMPAL offers a high-level, user-friendly interface. Furthermore, the demonstrations can teach beginners how to apply FEM to solve parallel problems.

In UMPAL, we only offer a simulator to execute the partitioned/load-balanced results produced by the partitioner/load balancer. It is possible to generate parallel codes for real machines, e.g., IBM SP2 or PC clusters, according to the partitioner/load balancer results. In the future, we plan to add a parallel PDE code generator to UMPAL.

Tools on the Web share one typical shortcoming: performance degrades when there are multiple simultaneous requests. To solve this problem, UMPAL can either be executed on a more powerful computer or be executed on a cluster of machines. With the current implementation of UMPAL, execution on a more powerful computer is the only option. In the future, we will implement a parallel/distributed version of UMPAL to enhance its performance.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Robert Preis, Professor G. Karypis, and Professor Chris Walshaw for providing the Party, Metis, and Jostle software packages, respectively.

REFERENCES

1. S. T. Barnard and H. D. Simon, "Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems," Concurrency: Practice and Experience, Vol. 6, No. 2, 1994.
2. Y. C. Chung and C. J. Liao, "Tree-based parallel load-balancing methods for solution-adaptive unstructured finite element models on distributed memory multicomputers," IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 4, 1999.
3. Y. C. Chung, C. J. Liao, and D. L. Yang, "A prefix code matching parallel load-balancing method for solution-adaptive unstructured finite element graphs on distributed memory multicomputers," The Journal of Supercomputing, Vol. 15, No. 1, 2000.
4. Y. C. Chung and C. J. Liao, "A binomial tree-based parallel load-balancing methods for solution-adaptive unstructured finite element graphs on distributed memory multicomputers," in Proceedings of 1998 International Conference on Parallel CFD, 1998.

5. Y. C. Chung, D. L. Yang, C. C. Chen, and C. J. Liao, "A dynamic diffusion optimization method for irregular finite element graph partitioning," The Journal of Supercomputing, Vol. 17, No. 1, 2000.
6. G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," Journal of Parallel and Distributed Computing, Vol. 7, No. 2, 1989.
7. F. Ercal, J. Ramanujam, and P. Sadayappan, "Task allocation onto a hypercube by recursive mincut bipartitioning," Journal of Parallel and Distributed Computing, Vol. 10, No. 1, 1990.
8. C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network partitions," in Proceedings of the 19th IEEE Design Automation Conference, 1982.
9. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco, CA: Freeman.
10. J. R. Gilbert, G. L. Miller, and S. H. Teng, "Geometric mesh partitioning: implementation and experiments," in Proceedings of 9th International Parallel Processing Symposium, 1995.
11. A. Heirich and S. Taylor, "A parabolic load balancing method," in Proceedings of International Conference on Parallel Processing '95, 1995.
12. B. Hendrickson and R. Leland, "The Chaco User's Guide: Version 2.0," Technical Report SAND, Sandia National Laboratories, Albuquerque, NM.
13. B. Hendrickson and R. Leland, "An improved spectral graph partitioning algorithm for mapping parallel computations," SIAM Journal on Scientific Computing, Vol. 16, No. 2, 1995.
14. G. Horton, "A multi-level diffusion method for dynamic load balancing," Parallel Computing, Vol. 19, No. 2, 1993.
15. Y. F. Hu and R. J. Blake, "An Optimal Dynamic Load Balancing Algorithm," Technical Report DL-P, Daresbury Laboratory, Warrington, UK.
16. G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," Journal of Parallel and Distributed Computing, Vol. 48, No. 1, 1998.
17. G. Karypis and V. Kumar, "A parallel algorithm for multilevel graph partitioning and sparse matrix ordering," Journal of Parallel and Distributed Computing, Vol. 48, No. 1, 1998.
18. B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, Vol. 49, No. 2, 1970.
19. C. W. Ou, S. Ranka, and G. Fox, "Fast and parallel mapping algorithms for irregular problems," The Journal of Supercomputing, Vol. 10, No. 2, 1996.
20. C. W. Ou and S. Ranka, "Parallel incremental graph partitioning," IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 8, 1997.
21. R. Preis and R. Diekmann, "The PARTY Partitioning Library User Guide Version 1.1," Heinz Nixdorf Institute, Universität Paderborn, Germany.
22. K. Schloegel, G. Karypis, and V. Kumar, "Multilevel diffusion schemes for repartitioning of adaptive meshes," Journal of Parallel and Distributed Computing, Vol. 47, No. 2, 1997.
23. H. D. Simon, "Partitioning of unstructured problems for parallel processing," Computing Systems in Engineering, Vol. 2, No. 2/3, 1991.

24. C. H. Walshaw, M. Cross, and M. G. Everett, "Parallel dynamic graph partitioning for adaptive unstructured meshes," Journal of Parallel and Distributed Computing, Vol. 47, No. 2, 1997.
25. R. D. Williams, "Performance of dynamic load balancing algorithms for unstructured mesh calculations," Concurrency: Practice and Experience, Vol. 3, No. 5, 1991.
26. R. D. Williams, DIME: Distributed Irregular Mesh Environment, California Institute of Technology.

William Cheng-Chung Chu is an Associate Professor in the Department of Computer and Information Science at Tung-Hai University, Taiwan. From 1994 to 1998, he was an Associate Professor in the Department of Information Engineering at Feng-Chia University, Taiwan. Prior to that, he was a research scientist at the Software Technology Center of the Palo Alto Research Laboratories of Lockheed Missiles and Space Company, Inc., where he received special contribution awards from Lockheed in both 1992 and [...]. In 1992, he was a Visiting Scholar in the Department of Engineering Economic Systems at Stanford University, where he was involved in projects related to intelligent knowledge-based expert systems. His current research interests include software reengineering, maintenance, reuse, software quality, and e-commerce. He received his M.S. and Ph.D. degrees, both in Computer Science, from Northwestern University in Evanston, Illinois, in 1987 and 1989, respectively. His e-mail address is: chu@cis.thu.edu.tw

Don-Lin Yang received a B.E. degree in Computer Science from Feng Chia University in 1973, an M.S. degree in Applied Science from the College of William and Mary in 1979, and a Ph.D. degree in Computer Science from the University of Virginia in [...]. Prior to joining the Department of Information Engineering at Feng Chia University in 1991, he was a staff programmer at IBM Santa Teresa Laboratory from 1985 to 1987 and a member of technical staff at AT&T Bell Laboratories from 1987 to [...]. Dr. Yang is currently an associate professor. His research interests include distributed and parallel computing, data mining, image processing, and network management. He is also a member of the IEEE Computer Society and the ACM.

Jen-Chih Yu received his B.S. and M.S. degrees in Information Engineering from Feng Chia University, Taichung, Taiwan, in 1997 and 1999, respectively. His research interests include parallel volume rendering design, computer graphics, visualization, parallel processing, and parallel algorithms.

Yeh-Ching Chung was born in [...]. He received a B.S. degree in computer science from Chung Yuan Christian University in 1983, and M.S. and Ph.D. degrees in computer and information science from Syracuse University in 1988 and 1992, respectively. Currently, he is a Professor and the Chair of the Department of Information Engineering at Feng Chia University, where he directs the Parallel and Distributed Processing Laboratory. His research interests include parallel compilers, parallel programming tools, mapping, scheduling, load balancing, embedded systems, and virtual reality.