DYNAMIC LOAD BALANCING APPLICATIONS ON A HETEROGENEOUS UNIX/NT CLUSTER

European Congress on Computational Methods in Applied Sciences and Engineering
ECCOMAS 2000
Barcelona, September 2000

H.U. Akay, A. Ecer, E. Yilmaz, and L.P. Loo
Computational Fluid Dynamics Laboratory
Purdue School of Engineering and Technology, IUPUI
Indianapolis, Indiana, USA

Key words: Parallel Computing, Dynamic Load Balancing, Heterogeneous Cluster, CFD.

Abstract. This study involves parallel computing and dynamic load-balancing applications on heterogeneous computer systems using our previously developed tools. A large-scale turbomachinery CFD code is chosen as the application code and implemented on a cluster of Unix/IBM RS6000 and Windows NT/Pentium II workstations. The objective is to be able to run large-scale codes routinely for solving problems on a heterogeneous network of computers. To evaluate the performance of each workstation in heterogeneous environments, the measured computation and communication times are analyzed. The speedup and efficiency of each workstation during the parallel computations are also compared. For dynamic load balancing, two test cases were run on the combined Unix and NT cluster. The first case involves a curved duct flow with three different initial load distributions: i) random loading, ii) heavy loading on the Unix machines, and iii) equal loading on all machines. The second test case is a rotor-stator problem. In this case, it is assumed that the system is initially loaded heavily on the Unix machines. The effects of load balancing on the total elapsed time for achieving balanced loads in a heterogeneous environment are demonstrated.

1. INTRODUCTION

Parallel computing on clusters of workstations has become popular during the last decade, since it provides a cost-effective alternative to expensive supercomputers for scientific computations. As the performance of the networks between computers improves, computing on multiple clusters is becoming feasible for large-scale computations. In such applications, computing resources may have different operating systems as well as different hardware configurations. Since these are typically multi-user environments and each cluster may have different performance, the conditions for load balancing become very complex and difficult to determine a priori. Hence, measurements of computer loads, communication speeds, and computation performance are essential for deciding the load on each compute node. For such applications, dynamic load balancing techniques gain importance. Periodically moving parallel tasks from one computer to another under changing system conditions may result in substantial savings in total elapsed time.

Domain decomposition-based parallel computing and dynamic load balancing algorithms developed by our group at IUPUI were previously limited to applications on Unix workstations [1,2]. More recently, we have extended such applications to PCs with NT operating systems [3]. The tools we have developed are also applicable to parallel computing of CFD problems in heterogeneous environments comprising computers and operating systems of different types, speeds, and memory sizes. In our approach, a given computational grid is subdivided into several solution blocks, typically a multiple of the number of available compute nodes [1]. The dynamic load balancer program (DLB) continuously measures the computation and communication costs of each compute node in a given system [2].
Using an optimization algorithm, the DLB then decides whether to redistribute the loads at predetermined intervals, based on the performance measured during execution.

In this paper, we present recent applications of the tools we have developed to a turbomachinery flow code called ADPAC [4]. This code was originally developed by NASA and Rolls-Royce Allison. A parallel version on Unix was developed by our group in 1994 [5]. An NT version of the same parallel code was implemented recently [6]. Here, applications of this parallel code on a heterogeneous cluster consisting of Unix and NT operating systems are highlighted. The use of dynamic load balancing for cost-effective computing in heterogeneous environments is demonstrated. Dynamic load balancing examples involving different initial load distributions on heterogeneous compute nodes are considered for curved duct and rotor-stator test cases.

2. GOVERNING FLOW EQUATIONS

The numerical solution in ADPAC uses the conservation law form of the Navier-Stokes equations. For a rotating finite control volume, the inviscid form of the equations, i.e., the Euler equations, is expressed as:

$$\frac{\partial}{\partial t}\int Q \, dv + L_{inv}(Q) = \int K \, dv \tag{1}$$

where:

$$L_{inv}(Q) = \int_{da} \left[ F_{inv} \, da_z + G_{inv} \, da_r + (H_{inv} - r\omega Q) \, da_\theta \right] \tag{2}$$

Q is the vector of conservation variables in the form:

$$Q = [\rho, \; \rho\nu_z, \; \rho\nu_r, \; \rho\nu_\theta, \; \rho e_t]^T, \qquad K = [0, \; 0, \; (\rho\nu_\theta^2 + p), \; 0, \; 0]^T \tag{3}$$

where the total internal energy, $\rho e_t$, for a perfect gas is defined as:

$$\rho e_t = \frac{p}{\gamma - 1} + \frac{1}{2}\rho\left(\nu_z^2 + \nu_r^2 + \nu_\theta^2\right) \tag{4}$$

Here $\gamma$ is the specific heat ratio, and $\rho$, $p$, $\nu_z$, $\nu_r$, $\nu_\theta$ denote the density, pressure, and the axial, radial, and circumferential velocity components, respectively, relative to the coordinate system used in ADPAC. $F_{inv}$, $G_{inv}$, and $H_{inv}$ are the inviscid flux vectors.

3. PARALLEL COMPUTING ENVIRONMENT

The test bed used for our applications is a cluster consisting of six Unix-based workstations and 16 Windows NT-based Pentium PCs. The Unix-based processors are IBM RS/6000 Model 43P-260 (RS6K) workstations with 512 MB of memory each. The PCs have 400 MHz Intel Pentium II processors, each with 256 MB of memory. Both systems are connected through a 100 Mb switch, which serves as the hub of the communication network. The PVM library, version 3.4, is used for inter-processor communication. Double-precision floating-point arithmetic is used for all real operations. Figure 1 shows the layout of the computing environment used throughout this study.

4. PARALLEL PERFORMANCE EVALUATION

The details of our parallel computing and dynamic load balancing tools may be found in our earlier works [1-4]. In this paper, we report on the applications of these tools in a heterogeneous environment. For timing evaluations, two test cases were considered. These test cases correspond to different grid sizes for similar flow conditions. The timing results obtained for these cases were based on the steady-state solution of the Euler (inviscid) version of the ADPAC code. For both cases, the total elapsed time for one processor was estimated from the average elapsed time per iteration per node obtained with two processors.
In obtaining the timing results, each case was run in both the NT and the NT/Unix heterogeneous environments by varying the number of machines or processors from 2 to 12 (half from each system in the heterogeneous case). The timing results for RS6K extend only to six machines because only six RS6K machines are available in the cluster. Hence, in some cases, the performance of each computing environment is evaluated based on two, four, and six machines only. The timing records include total elapsed time and CPU time. A series of elapsed times for the various computing environments was obtained statistically. Based on these timing results, performance is assessed from three aspects: relative computational speed, parallel speedup, and parallel efficiency. Fully optimized Fortran compiler options were used on both the Unix and NT machines.

Figure 1: Layout of the Computing Environment at IUPUI's CFD Laboratory.

4.1 Test Case 1: interface size / block size = 1.5%

This test case is a simple duct problem. The grid structure and geometry are shown in Figure 2. The mesh has 765x25x25 grid points in the x, y, and z directions, respectively. Figure 3 shows the total elapsed times of the two-, four-, and six-block solutions of the problem on the Unix, NT, and heterogeneous systems. The same figure also shows the interface solver times, which are due to information exchange between the blocks. The RS6K machines are computationally faster than the NT machines by a factor that decreases from 1.85 with two machines to 1.45 with six machines. For the same number of machines, the elapsed time for the heterogeneous environment falls between those of NT and RS6K.
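The interface-to-block ratio quoted in the heading of this test case can be illustrated with a short calculation. The sketch below is one possible reading only: it assumes that the grid is split into equal blocks along its longest direction, that each block exchanges one grid plane per interface, and that 12 blocks are used. None of these choices is stated in the text, and the function name `interface_to_block_ratio` is hypothetical.

```python
def interface_to_block_ratio(planes_along_split: int, n_blocks: int) -> float:
    """Ratio of one interface plane to the grid planes held by one block.

    Assumes equal blocks cut along a single direction, so the cross-section
    of the interface and of the block cancel out of the ratio.
    """
    planes_per_block = planes_along_split / n_blocks
    return 1.0 / planes_per_block


# 765x25x25 grid cut along x into 12 blocks (the block count is an assumption).
print(f"{interface_to_block_ratio(765, 12):.2%}")  # 1.57%

# 67x67x105 grid of Test Case 2 cut along z into 12 blocks (same assumption).
print(f"{interface_to_block_ratio(105, 12):.2%}")  # 11.43%
```

Under these assumptions the two values are of the same order as the 1.5% and 10% ratios given for the two test cases, which suggests why the second grid stresses communication far more than the first.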

The corresponding speedup and efficiency curves versus the number of machines are plotted in Figure 4. The parallel efficiency for RS6K drops faster as the number of machines increases, whereas the efficiencies for the NT and heterogeneous systems are higher. This is because the ratio of communication to computation is higher on the RS6Ks, which are the faster machines in this case. The efficiency for the NT and heterogeneous systems drops below 80% when 12 processors are used; for RS6K, this already happens in the 6-processor case. The reason is that the computational grid on each machine is relatively small, so the time spent on message passing becomes more dominant on the faster RS6K machines. These observations imply that, as the computing power of the computers increases, the benefit of using more than a certain number of processors for a given number of data blocks diminishes.

Figure 2: Curved duct mesh.

Figure 3: Elapsed Time Comparison for Network of Workstations (HTR = heterogeneous system).
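The speedup and efficiency measures discussed above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes that the unmeasured single-processor time is estimated as twice the elapsed time of the two-processor run, which is one reading of the estimate described in Section 4. The function name and all timing values are hypothetical.

```python
def speedup_and_efficiency(elapsed, baseline_nodes=2):
    """Return {n: (speedup, efficiency)} from elapsed times keyed by node count.

    The single-processor time is not measured. It is estimated by scaling the
    baseline run, assuming that run is perfectly efficient.
    """
    t1_estimate = baseline_nodes * elapsed[baseline_nodes]
    results = {}
    for n in sorted(elapsed):
        speedup = t1_estimate / elapsed[n]
        results[n] = (speedup, speedup / n)  # efficiency = speedup / n
    return results


# Hypothetical elapsed times in seconds (not taken from the paper).
timings = {2: 1000.0, 4: 520.0, 6: 380.0}
for n, (s, e) in speedup_and_efficiency(timings).items():
    print(f"{n} machines: speedup {s:.2f}, efficiency {e:.0%}")
```

With this definition the baseline run always has an efficiency of 100%, so the curves show only how efficiency degrades relative to the two-processor case.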

As mentioned above, the poor speedup and efficiency can be attributed to the increase in communication time. Although the increase in communication time is not significant for these cases, as is normally true for steady flow, any decrease in computation time was quickly offset by the increase in communication time. Generally, as the number of machines increases, communication between processors increases. In this case, the physical location of the machines could be the major contributor to the increase in communication time. For example, the relatively low speedup and efficiency of the heterogeneous system are consequences of the higher communication cost due to machine locations.

Figure 4: Efficiency and Speedup Comparison for Network of Workstations (HTR = heterogeneous system).

4.2 Test Case 2: interface size / block size = 10%

In this case, the interface size was increased to about 10% of the block size by using a different mesh for the same geometry given in Figure 2. The mesh has 67x67x105 grid points in the x, y, and z directions, respectively. This was done to investigate the effect of interface size on the performance of the computing environments. Figure 5 shows the elapsed times. The corresponding speedup and efficiency curves for each case are shown in Figure 6. The slower machines substantially affect the total elapsed time for the heterogeneous computing environment. Again, the RS6K machines were found to have the fastest computational speed. An interesting feature observed here is that the parallel speedup for NT exceeds that of the RS6K and heterogeneous environments. For instance, in the 6-processor case, the parallel speedup for RS6K is approximately 30% lower than the ideal speedup, while the NT and heterogeneous speedups are only 8% to 14% lower. In general, the relation between speedup and the number of processors or machines follows a linear trend. In most cases, using the heterogeneous computing environment improved on the performance of the NT and RS6K systems in terms of computational speed and parallel speedup.

Figure 5: Elapsed Time Comparison for Network of Workstations (HTR = heterogeneous system).

Figure 6: Efficiency and Speedup Comparison for Network of Workstations (HTR = heterogeneous system).

As readily observed, the RS6K processors are faster, while the parallel efficiency for NT is higher than for RS6K. The efficiency for RS6K is low when six processors are used for parallel computing. This low efficiency can be attributed to the relatively high communication time compared to the block solver time. Nevertheless, a higher efficiency is achieved when RS6K machines are combined with NT machines for heterogeneous computations.

5. DYNAMIC LOAD BALANCING CASE STUDIES

5.1 Curved Duct Flow

The mesh for this problem is the same as the one given in Figure 2. It has 765x25x25 grid points. A total of 60 blocks were generated by dividing the mesh in the x direction only. Three numerical experiments were performed to see the effect of the initial load distribution on the same computers with Unix (IBM RS/6K) and NT (Pentium PC) operating systems. In the first run, the loads are distributed randomly. In the second run, the Unix side has five times the load of the NT side, and all compute nodes within each cluster have equal numbers of blocks. In the third run, the loads are distributed equally on all computers.

For the first run, the initial random loading is shown in Figure 7. Figure 8 gives the dynamic load balancing results obtained by using the Greedy algorithm [4,5]. An extraneous load exists on one of the Unix machines during the ADPAC simulation for this run. Four cycles are required to reduce the total elapsed time by 30%. After that, further cycles do not provide any time reduction. This suggests that a local minimum, which is the criterion used to determine the minimum time, has been reached. Although the final distribution is still somewhat unbalanced, the total reduction in time is considered impressive.

For the second run, the initial loading is given in Figure 9. The loads on the Unix side were intentionally chosen to be five times those on the NT side.
After four cycles of load balancing with the Greedy algorithm, a distribution that is almost 28% more time-efficient has been obtained. The new load distribution is given in Figure 10. Note that at the beginning there were two extraneous tasks that belonged to other users, whereas at the final cycle there is only one.

For the third run, equal loading was chosen initially for all computers. The initial loads are shown in Figure 11. There is only one extraneous load, on the Unix side. The distribution of blocks after four cycles of load balancing is given in Figure 12. The gain in elapsed time in this case is just 1%. Only three blocks have moved when the first and last cycles are compared.
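The Greedy algorithm used by the DLB is not described in this paper. Purely as an illustration of the general idea, the sketch below assigns blocks one at a time to whichever machine would finish earliest. It is a generic greedy heuristic, not the authors' implementation, and it rests on several assumptions: every machine has a known relative speed, each extraneous process takes an equal share of the CPU, and communication cost is ignored. The `Machine` class, the function name, and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    speed: float             # relative computation speed (hypothetical value)
    extraneous: int = 0      # other users' processes sharing the CPU
    blocks: list = field(default_factory=list)
    work: float = 0.0        # total cost of the blocks assigned so far

    def finish_time(self, extra_work: float = 0.0) -> float:
        # Assumption: the CPU is shared equally with extraneous processes.
        return (self.work + extra_work) * (1 + self.extraneous) / self.speed


def greedy_distribute(block_costs, machines):
    """Assign blocks, largest first, to the machine that would finish earliest.

    Returns the estimated elapsed time, set by the slowest machine.
    """
    for block_id, cost in sorted(block_costs.items(), key=lambda kv: -kv[1]):
        target = min(machines, key=lambda m: m.finish_time(cost))
        target.blocks.append(block_id)
        target.work += cost
    return max(m.finish_time() for m in machines)


# Hypothetical setup: 60 equal blocks, two faster and two slower machines,
# with one of the faster machines carrying an extraneous load.
machines = [Machine("rs6k-1", 1.5), Machine("rs6k-2", 1.5, extraneous=1),
            Machine("nt-1", 1.0), Machine("nt-2", 1.0)]
costs = {i: 1.0 for i in range(60)}
elapsed = greedy_distribute(costs, machines)
for m in machines:
    print(m.name, len(m.blocks), round(m.finish_time(), 2))
print("estimated elapsed time:", round(elapsed, 2))
```

This sketch builds a distribution from scratch using assumed speeds, whereas the DLB described here works from measured computation and communication costs and redistributes blocks over several cycles. It does show why a machine with an extraneous load ends up with fewer blocks than an otherwise identical idle machine.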

Figure 7: Initial random-loading for the duct problem.

Figure 8: Dynamic load balancing result after four cycles for initial random-loading for the duct problem.

Figure 9: Initial heavy-loading on Unix side for the duct problem.

Figure 10: Dynamic load balancing result after four cycles for initial heavy-loading on Unix side for the duct problem.

Figure 11: Initial equal-loading for all computers for the duct problem.

Figure 12: Dynamic load balancing result after four cycles for initial equal-loading for all computers for the duct problem.

5.2 Single-stage rotor-stator combination

In this case, a rotor-stator stage known as Stage 37 is solved with ADPAC. Stage 37 is a single rotor and stator combination with a total mesh size of 146x25x25 grid points. The mesh is shown in Figure 13. For parallel computing and load balancing, 24 divisions were used in the x direction. A total of eight single-CPU computers, four IBM RS/6K and four PC/NT machines, were chosen.

Figure 13: Mesh of Stage 37, a single-stage rotor-stator combination.

The blocks were initially distributed as five on each RS/6K machine and one on each PC/NT machine. Figure 14 shows the initial loading. There are two extraneous loads: one on the Unix side and the other on the NT side.

Figure 14: Initial heavy-loading on Unix side for the Stage 37 problem.

After four cycles of load balancing, the new distribution is shown in Figure 15. The two extraneous loads are still on the same processors as in the initial distribution. A gain of around 2% in elapsed time is observed. This small gain may be due to the size of the problem, the communication traffic in the network, and the instantaneous loading of the computers used in the present application. However, the load distribution is much better than the initial one.

Figure 15: Dynamic load balancing result after four cycles for the Stage 37 problem.

6. CONCLUSIONS

Algorithms are presented for running large-scale CFD codes on heterogeneous systems consisting of Unix and NT operating systems. The results indicate the feasibility of using available resources for parallel computing, in spite of the imbalances which may occur due to differences in operating systems and networks, as well as processor speed and memory size. The issues associated with achieving optimum efficiency present interesting possibilities for load balancing. Prior knowledge of the performance of the available computing resources can help a user choose a more reasonable initial load distribution. However, in multi-user and multi-cluster computing environments such a distribution may not always be obvious. Moreover, users generally cannot monitor extraneous loads when submitting their jobs. Therefore, it is necessary to employ load balancing to obtain higher performance on such clusters.

ACKNOWLEDGEMENTS

This research was supported by the NASA Glenn Research Center. The authors would like to express their gratitude to NASA and Rolls-Royce Allison for providing the inviscid version of the ADPAC code for this research. The support and advice of the following individuals are acknowledged: Dr. J.D. Chen, I. Tarkan, and R. Payli from the CFD Laboratory, IUPUI, and Dr. E.J. Hall from Rolls-Royce Allison, Indianapolis.

REFERENCES

[1] H.U. Akay, R.A. Blech, A. Ecer, D. Ercoskun, B. Kemle, A. Quealy, and A.A. Williams, A Database Management System for Parallel Processing of CFD Algorithms, Proceedings of Parallel CFD 92, Edited by Pelz, A.B., et al., Elsevier, Amsterdam, pp. 9-23.

[2] Y.P. Chien, A. Ecer, H.U. Akay, F. Carpenter, and R.A. Blech, Dynamic Load Balancing on a Network of Workstations for Solving Computational Fluid Dynamics Problems, Computer Methods in Applied Mechanics and Engineering, Vol. 119, pp.

[3] Y.P. Chien, J.D. Chen, A. Ecer, and H.U. Akay, Dynamic Load Balancing for Parallel CFD on NT Networks, Proceedings of Parallel CFD 99, Edited by Keyes, et al., Elsevier, Amsterdam, 2000 (in print).

[4] E.J. Hall, R.A. Delaney, and J.L. Bettner, Investigation of Advanced Counterrotation Blade Configuration Concepts for High Speed Turboprop Systems, NASA Contractor Report CR, May.

[5] A. Ecer, H.U. Akay, W.B. Kemle, H. Wang, D. Ercoskun, and E.J. Hall, Parallel Computation of Fluid Dynamics Problems, Computer Methods in Applied Mechanics and Engineering, Vol. 112, 1994, pp.

[6] L.P. Loo, Parallel Computing and Dynamic Load Balancing of ADPAC in a Heterogeneous Cluster of Unix and Windows/NT Computers, Master's Thesis, IUPUI, May 2000 (in progress).