European Congress on Computational Methods in Applied Sciences and Engineering ECCOMAS 2000 Barcelona, 11-14 September 2000

LOAD BALANCING FOR MULTIPLE PARALLEL JOBS

A. Ecer, Y. P. Chien, H. U. Akay and J. D. Chen
Purdue School of Engineering and Technology
Indiana University-Purdue University Indianapolis
723 W. Michigan Street, Indianapolis, Indiana 46202, U.S.A.
e-mail: ecer@engr.iupui.edu

Key words: Parallel CFD, multiple parallel jobs, load balancing.

Abstract. Parallel processing on networked heterogeneous computers has been widely adopted. Load balancing is essential for taking full advantage of parallel processing. Previously reported load balancing techniques assume that only one parallel job runs on a given set of computers. When many parallel jobs must be executed on the same set of computers, a high-level central manager is usually involved to enforce the queuing and balancing policy. Such a policy usually allows only one parallel job to run on the reserved computers to ensure load balance. These arrangements discourage the sharing of hardware resources among computer owners and can be used only on computers that are dedicated to parallel processing. In this paper, we introduce a new load balancing method that does not need a high-level central manager. The method assumes that each parallel job has no knowledge of the other parallel jobs and that jobs can start execution at different times. It also ensures that a newly joined parallel job gets its share of computation power immediately without disturbing the load balance of the other parallel jobs.
1. INTRODUCTION

With the rapid advancement of computer technology, parallel processing has been implemented on parallel supercomputers, networked workstations, and combinations of workstations and supercomputers. Since a large number of computers sit idle in the evening in many organizations, it is desirable to use these computing resources for parallel computing. One unsolved problem in parallel processing is how to distribute multiple mutually dependent parallel jobs fairly among computing resources while achieving optimal computation speed for each parallel job. When mutually dependent parallel processes are not balanced on the given computers, processes that finish their computation early must periodically wait for the other processes to collect the information needed to proceed; the given computing resources are thus not fully utilized and the best computation speed is not achieved. To increase the efficiency and speed of parallel computation, the computation load should be distributed among the computers in such a way that the elapsed execution time of the slowest process is minimized. To our knowledge, all available load-balancing methods assume that there is only one parallel job to be load balanced. There are several approaches to computer load balancing for a single parallel job: (1) balancing the distribution of program instructions at the compilation stage, (2) balancing the sub-domain assignment at the domain decomposition stage, and (3) balancing the process distribution at the program execution stage. Most load-balancing methods use the second approach [1-5]. These methods rest on the following assumptions: (1) the computers (or processors of a parallel computer) are homogeneous, (2) the computers operate in single-user mode, and (3) the computation domain can be arbitrarily divided. The first assumption can be satisfied on parallel computers and on a small network of workstations.
However, it is difficult to satisfy in a large computer network consisting of computers of different brands and models. The second assumption requires the permission of the system administrator and usually inconveniences other users. The third assumption is also difficult to satisfy due to structural and computational constraints in many applications. To avoid these restrictions, we adopted an approach that balances the computation load in both the domain decomposition stage and the program execution stage. In the domain decomposition stage, the task domain is divided into N sub-domains (blocks). The sizes of the blocks need not be the same, but similar sizes are desirable (as explained later). Each block may have a different number of neighboring blocks, and each block needs to exchange boundary conditions periodically with its neighbors. It is assumed that M networked computers (or processors of a parallel computer) are used in the parallel computation, where N > M. These computers may have different computation speeds. One task is to find the optimal distribution of data blocks over the computers such that the overall execution time of the application is minimized. In this paper, we describe a new method for dynamic load balancing of multiple parallel jobs. The main objective of the method is to distribute parallel jobs over the available computer
resources fairly and optimally. The basic idea is to ensure that load balancing of one parallel job does not affect the load balancing of the other parallel jobs. The method is developed under the following assumptions: the available networked computers are heterogeneous and may run under UNIX or Windows NT operating systems; parallel jobs can be added to or removed from the computers by their owners at any time; load balancing of a parallel job is the responsibility of the owner of the application program; and the host computer informs each parallel job of the availability of computers and of the proper time to perform load balancing. The paper is organized as follows. Section 2 summarizes the load balancing method for a single parallel job that we developed previously [6,7]; this method is used as part of our new load balancing method for multiple parallel jobs. Section 3 describes the concept behind the load balancing method for multiple parallel jobs and its procedures. Section 4 demonstrates the effectiveness of the new method, and Section 5 concludes the paper.

2. BACKGROUND

This section describes our earlier work that is used in the load balancing of multiple parallel jobs. While most load-balancing methods achieve balance by adjusting the block sizes (assuming each computer executes one parallel process), we use a different approach that makes load balancing less problem-dependent. We assume that: (1) the problem domain can be pre-divided into many sub-domains (more than the number of computers), (2) each sub-domain is handled by one process, and (3) each computer can execute many processes. Load balancing is achieved by moving processes among computers. While each computer process is responsible for solving one block of data, it has to communicate with the others during the execution of the parallel job.
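The block-to-computer model described above can be sketched in a few lines of code. The sketch below is our own illustration, not the paper's DLB algorithm: it distributes N pre-divided blocks over M heterogeneous computers (N > M) with a simple greedy heuristic, aiming to minimize the elapsed time of the slowest computer. The function name and inputs are assumptions for this example.

```python
# Illustrative sketch (not the paper's DLB algorithm) of distributing N
# pre-divided blocks over M heterogeneous computers, N > M.  Each block is
# handled by one process and each computer may run several processes; the
# goal is to minimize the elapsed time of the slowest computer.  The greedy
# largest-block-first heuristic below is only one plausible strategy.

def assign_blocks(block_work, speeds):
    """block_work: work per block (e.g. grid points); speeds: work units
    per second for each computer.  Returns (placement, elapsed_time)."""
    loads = [0.0] * len(speeds)            # accumulated work on each computer
    placement = [None] * len(block_work)
    # Place the largest blocks first, each on the computer whose elapsed
    # time grows the least when the block is added there.
    for b in sorted(range(len(block_work)), key=lambda b: -block_work[b]):
        m = min(range(len(speeds)),
                key=lambda m: (loads[m] + block_work[b]) / speeds[m])
        loads[m] += block_work[b]
        placement[b] = m
    elapsed = max(l / s for l, s in zip(loads, speeds))
    return placement, elapsed
```

For example, nine equal blocks on two computers of speeds 1 and 2 end up split three on the slower computer and six on the faster one, which is why blocks of similar sizes make such process-level balancing effective.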
To separate the time used for process computation from that used for inter-process communication, the parallel programs are divided into two parts: block solvers and interface solvers [8]. The block solver computes the solution for a block, while the interface solver exchanges information across block boundaries. The execution time of each process is affected by several time-varying factors, e.g., the load of the computers, the load of the network, the solution scheme used for each block, and the sizes of the blocks. Therefore, without load balancing, some processes may finish execution much earlier than others and wait for information from the other processes; such waiting significantly increases the elapsed execution time of the parallel job. We have developed a dynamic load balancing (DLB) method for a single parallel job on a set of heterogeneous computers. The method distributes the processes among the computers so that the communication time and the waiting time are minimized. Its basic idea is to minimize the elapsed execution time of the slowest parallel process. The basic assumptions used in DLB are as follows. There are a large number of computers available in different locations, managed by different owners. At the initiation of the run, the user defines a set of available computers and can access all or any subset of them. Each of the multi-user computers operates under a time-sliced operating system (e.g., UNIX or Windows NT). The parallel
application software uses MPI [9] or PVM [10]. When the load is balanced, the effective computation speeds of all computers, as seen by each process, are the same, provided the communication times between parallel processes are the same. In fact, since the block size cannot be made arbitrarily small, "the same" here means that the difference between the effective speeds of the computers cannot be reduced further. When many parallel jobs need to be executed on a large network of computers with many owners, reserving dedicated time for a parallel job becomes difficult. In a multi-user environment, multiple parallel jobs may be executed concurrently on the same set of computers. When the mutually dependent load is not balanced, computers that finish their computation early must periodically wait for the other computers in order to collect the information needed to proceed, thus losing their share of computing resources and not achieving the best computation speed. We have tried assigning a load balancer to each parallel job and letting each load balancer balance its own parallel job independently; because of conflicts of interest among the load balancers, their load balancing results interfere with each other.

3. DYNAMIC LOAD BALANCING FOR MULTIPLE PARALLEL PROCESSES

To avoid such process thrashing in the load balancing of multiple parallel jobs, we set out to develop an approach in which the load balancing of one parallel job does not affect the load balancing of the other running parallel jobs. In this section, we prove that the round robin load balancing approach satisfies this requirement. However, this approach is valid only for parallel applications with small communication costs.

3.1. Nomenclature

To simplify the description, the following nomenclature is used:

J: the set of all parallel jobs running on the system.
J_n: computation job with job number n.
J_n = {p_i}, where p denotes a process and i is the process number within the job, i = 1, 2, ..., k_n. J_n can be either a parallel job or a sequential job. It is assumed that different users own different jobs J_n, that different J_n may be different parallel codes, and that a user has no knowledge of the parallel jobs owned by other users.
S_m: computation speed of computer m.
E_m: effective computation speed of computer m as seen by one computation process running on m.
T_n: the elapsed computation time of J_n.
DLB: a dynamic load balancer that balances one parallel job on a given set of computers.
H: the set of computer hosts available for parallel computing.
HU: the set of hosts running at least one parallel job; HU ⊆ H.
HN: the set of hosts not used by any parallel job; HN = H − HU.
HU_n: the set of hosts actually used for executing J_n; HU = ∪ HU_n over all J_n ∈ J.
HR_n: the set of hosts requested by J_n; HR_n ⊆ H.

The dynamic load-balancing problem is how to map J onto H such that a newly added parallel job is load balanced without affecting the load balance of the existing parallel jobs.

3.2 Round robin load balancing of multiple parallel jobs

The round robin load balancing algorithm for multiple parallel jobs is based on the following definitions and arguments. It is assumed that all computers use time-slicing protocols in their operating systems.

Definition 1. A parallel job J_n is load balanced on H if T_n is minimized.

Definition 2. An extraneous process with respect to a parallel job J_n is a computer process that runs on the same computers as J_n but cannot be started or stopped by J_n.

Definition 3. The effective computation speeds of two computers are defined to be the same if moving any process from one computer to the other does not reduce the difference between the effective computation speeds of the two computers.

Round robin load balancing of multiple parallel jobs is based on the following arguments. If a computer that runs a parallel job carries an extraneous load, a parallel job with equal-sized parallel processes can best utilize its given CPU share. If a parallel job is load balanced on a given set of computers and the sizes of its parallel processes are the same, the effective speeds of all computers are the same for every process of that job. If the loads of all load-balanced computers change by the same percentage, the relative speeds of the hosts as seen by each parallel process remain the same. If a set of computers is executing load-balanced parallel jobs and a new parallel job J_n is added to it by the load balancer of J_n, the newly added J_n should not affect the load balance of the existing parallel jobs. If HU_m ⊆ HU_n and parallel jobs J_m and J_n are load balanced, deleting J_m may affect the load balance of J_n.
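The proportional-change argument can be checked numerically. The sketch below is our own illustration under a simplified time-slicing model, not code from the paper: the effective speed of a host is taken to be the host speed divided by the number of resident processes, and all values are assumed example inputs.

```python
# Simplified time-slicing model (our illustration, not the paper's code):
# each of the n_m processes resident on host m sees an effective speed of
# about S_m / n_m.  If a second job that is itself balanced is laid over the
# SAME host set, every host's load changes by the same percentage, so the
# relative speeds seen by the first job's processes are unchanged and the
# first job stays balanced.

def effective_speeds(speeds, procs):
    return [s / n for s, n in zip(speeds, procs)]

speeds = [100.0, 200.0, 400.0]   # heterogeneous host speeds (assumed units)
job1 = [1, 2, 4]                 # balanced: every process of job 1 sees 100.0
before = effective_speeds(speeds, job1)

job2 = [1, 2, 4]                 # a second job, balanced on the same hosts
after = effective_speeds(speeds, [a + b for a, b in zip(job1, job2)])

# Every effective speed changes by the same factor, so job 1 stays balanced.
ratios = [a / b for a, b in zip(after, before)]
```

If job 2 instead overlapped only part of job 1's host set, the ratios would differ between hosts and job 1's balance would be destroyed, which motivates the host-set restrictions in the algorithms of Section 3.3.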
Assume that the load distribution for all jobs is initially balanced and that it becomes unbalanced due to the completion of some jobs. Sequentially adding another balanced job, or rebalancing an existing parallel job, will make all current parallel jobs more balanced.

3.3 Developing a DLB procedure that satisfies the conditions

Assuming that the parallel jobs are added to the computer system sequentially, one by one, the following algorithm ensures that the load becomes more balanced each time a new parallel job is added to or rebalanced on the computers. In the following algorithms, it is assumed that
J_n is added to the computers while load-balanced jobs J_m are being executed.

Algorithm 1:
Step 1: [Determine HU_n] While load-balanced J_m is being executed, J_n may only request computers in HU_n such that HU_n ⊇ HU_m or HU_n ∩ HU_m = ∅, for every m.
Step 2: [Load balancing for J_n] DLB determines an optimal load distribution for J_n on HU_n.

The algorithm is based on the equal-effective-speed argument of Section 3.2, which is valid only when the communication cost is negligible. When the communication cost is significant, the algorithm can be modified as follows.

(A) The algorithm can still be used if the communication costs between different pairs of processes are the same. This condition holds when a massively connected parallel machine or a local area network is used for parallel computing: the effect of communication on the effective computation speed is then the same for every parallel process, so the equal-effective-speed argument can still be considered valid. However, the optimal load balance in such a situation may use only a small subset of the hosts in order to reduce the communication overhead. The load-balancing algorithm therefore needs to be modified as follows:

Algorithm 2:
Step 1: [Determine HR_n] While load-balanced J_m is being executed, J_n may only request computers in HR_n such that HR_n ⊇ HU_m or HR_n ∩ HU_m = ∅, for every m.
Step 2: [Load balancing for J_n] DLB determines an optimal load distribution for J_n on HR_n.
Step 3: [Find HU_n] Since the optimal distribution may not use all the computers requested in HR_n, HU_n ⊆ HR_n.

(B) If the communication costs between different pairs of computers differ greatly, as when wide area networks are used for parallel computing, the equal-effective-speed argument may not hold. In this situation, Algorithm 1 is modified into Algorithm 3, which does not allow two parallel jobs to share any computers, preventing mutual interference between the jobs.
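One plausible reading of the Step 1 host-request conditions can be expressed as a small predicate. The helper below is hypothetical (its name and interface are ours, not the paper's): under Algorithms 1 and 2 the requested host set must, for every running job, either fully contain that job's host set or be disjoint from it; under Algorithm 3 it must be disjoint from every running job's host set.

```python
# Hypothetical helper (names are ours) expressing the Step 1 constraints:
# under Algorithms 1 and 2 the requested host set HR_n must, for every
# running job J_m, either contain HU_m entirely or be disjoint from it;
# under Algorithm 3 it must be disjoint from every HU_m.

def request_allowed(requested, running_host_sets, disjoint_only=False):
    """requested: set of hosts HR_n; running_host_sets: the HU_m sets."""
    for used in running_host_sets:
        if requested.isdisjoint(used):
            continue                       # no interference with this job
        if disjoint_only or not requested >= used:
            return False                   # partial overlap is never allowed
    return True
```

For instance, with an existing job on {C1, C2}, requesting {C1, C2, C3} is acceptable under Algorithms 1 and 2 but not under Algorithm 3, while requesting {C2, C3} is never acceptable because the partial overlap would change C2's load without changing C1's.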
Algorithm 3:
Step 1: [Determine HR_n] While load-balanced J_m is being executed, J_n may only request computers in HR_n such that HR_n ∩ HU_m = ∅, for every m.
Step 2: [Load balancing for J_n] DLB determines an optimal load distribution for J_n on HR_n.
Step 3: [Find HU_n] Since the optimal distribution may not use all the computers requested in HR_n, HU_n ⊆ HR_n.

4. EXPERIMENTAL RESULTS

The applicability of Algorithm 1 is demonstrated in the following load balancing experiment. The experiment is designed to show that, when the parallel jobs are load balanced one by one in round robin fashion, the jobs balanced later do not affect the balance of the previously balanced jobs. In the experiment, three parallel jobs, J1, J2 and J3, are executed on five computers, C1 to C5. The job sizes and numbers of blocks of the parallel jobs are given in Table 1. The load distribution of each parallel job is represented by a bar chart with five bars. Figure 1 depicts the load distributions of the three parallel jobs during the experiment; it has three parts (1a, 1b, and 1c). Bar chart A in Figure 1(a) shows the initial unbalanced load distribution of job J1. The gray blocks in a bar chart are the blocks of J1; the black blocks are the extraneous load with respect to J1. The measured elapsed execution time of J1 for this load distribution is 1.76 seconds per time step. After load balancing by DLB, the new load distribution is shown by bar chart B, and the measured elapsed execution time of J1 is reduced to 1.01 seconds per time step.

Table 1. Information on the parallel jobs.

                           J1         J2         J3
  Job size (grid points)   250,000    500,000    160,000
  Number of blocks         30         30         20

While load-balanced J1 is being executed, J2 is added to the computers. Column C in Figure 1(b) shows the load distribution of J1 (as in Figure 1(a)) and the initial unbalanced load distribution of J2. In the bar charts of a job x, the gray blocks are the blocks of x, the black blocks are the extraneous load with respect to x, and the white blocks are the processes of x moved to that computer from other computers.
The measured elapsed execution times of J1 and J2 are labeled at the top of each bar chart. After load balancing of J2 by DLB, the new load distributions of J1 and J2 are shown in column D of Figure 1(b); the measured elapsed execution times of both J1 and J2 are reduced. While the balanced J1 and J2 are being executed, J3 is added to the computers. Column E in Figure 1(c) shows the load distributions of J1 and J2 (as in Figure 1(b)) and the initial unbalanced load distribution of J3. After load balancing of J3 by DLB, the new load distributions of J1, J2 and J3 are shown in column F of Figure 1(c); the measured elapsed execution times of all three jobs are reduced. It can be seen that when an unbalanced job is introduced, the other balanced jobs become unbalanced
(see columns C and E). When the unbalanced job is balanced, the other jobs become more balanced too (see columns D and F). In this experiment, the elapsed execution time of every job decreases after the load balancing of each job. This behavior should not be considered general: in fact, after a parallel job is load balanced, the elapsed execution times of the other parallel jobs may increase. The explanation is as follows. If parallel job A is not load balanced, it does not fully utilize its share of computational resources (in a time-slicing system). The unused computational resources are consumed by the other, load-balanced parallel jobs on the same computers, which therefore run faster than they otherwise would. Once job A is load balanced, it utilizes its full share of computational resources; since the extra resources for the other jobs vanish, their elapsed execution times may increase. The reader may also notice that the measured extraneous load on a computer is smaller than the number of parallel processes assigned to that computer. This is explained as follows. Because of process synchronization, each parallel process periodically waits for information from its neighboring processes. Since only the processes on the run queue are counted statistically as extraneous load, only a fraction of the parallel processes on a computer is counted. The more unbalanced the distribution of the parallel jobs, the larger the difference between the number of processes assigned to a computer and the number of processes measured on it. Even when the parallel job is balanced, only about 80% of the processes can be measured.

5. CONCLUSION

This paper described a practical method for dynamic load balancing of multiple parallel jobs on networked computers. The applicability of the method is both theoretically proved and experimentally demonstrated.
A software tool that supports the round robin load balancing method for multiple parallel jobs is currently being developed.

ACKNOWLEDGEMENTS

Financial support provided for this research by the NASA Glenn Research Center is gratefully acknowledged. The authors are also grateful for the computer support provided by the IBM Research Center throughout this study.

REFERENCES

[1] Williams, D. (1990), Performance of Dynamic Load Balancing Algorithms for Unstructured Grid Calculations, CalTech Report C3P913.
[2] Simon, H. (1991), Partitioning of Unstructured Problems for Parallel Processing, NASA Ames Tech Report RNR-91-008.
[3] Lohner, R., Ramamurti, R., and Martin, D. (1993), A Parallelizable Load Balancing
Algorithm, Proceedings of the 31st Aerospace Sciences Meeting & Exhibit, January 11-14, Reno, Nevada.
[4] Tezduyar, T. E., Aliabadi, S., Behr, M., Johnson, A., and Mittal, S. (1993), Parallel Finite Element Computation of 3D Flows, IEEE Computer, pp. 27-36.
[5] Maini, H., Mehrotra, K., Mohan, C., and Ranka, S. (1994), Genetic Algorithms for Graph Partitioning and Incremental Graph Partitioning, Proceedings of Supercomputing '94, Washington, D.C., pp. 449-457.
[6] Chien, Y. P., Ecer, A., Akay, H. U., and Carpenter, F. (1994), Dynamic Load Balancing on a Network of Workstations for Solving Computational Fluid Dynamics Problems, Computer Methods in Applied Mechanics and Engineering, 119, pp. 17-33.
[7] Chien, Y. P., Carpenter, F., Ecer, A., and Akay, H. U. (1995), Computer Load Balancing for Parallel Computation of Fluid Dynamics Problems, Computer Methods in Applied Mechanics and Engineering, 125.
[8] Akay, H. U., Blech, R., Ecer, A., Ercoskun, D., Kemle, B., Quealy, A., and Williams, A. (1993), A Database Management System for Parallel Processing of CFD Algorithms, Parallel Computational Fluid Dynamics '92, edited by J. Hauser et al., Elsevier Science Publishers, The Netherlands.
[9] Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J. (1998), MPI: The Complete Reference, The MIT Press.
[10] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. (1993), PVM 3.0 User's Guide and Reference Manual, Oak Ridge National Laboratory Technical Report ORNL/TM-12187.
Figure 1(a). Load distributions of parallel job 1 before (bar chart A, 1.76 s/step) and after (bar chart B, 1.01 s/step) load balancing.

Figure 1(b). Load distributions of parallel jobs 1 and 2 before (column C: J1 2.05 s/step, J2 2.63 s/step) and after (column D: J1 1.92 s/step, J2 2.26 s/step) load balancing of job 2.
Figure 1(c). Load distributions of parallel jobs 1, 2, and 3 before (column E: J1 2.70 s/step, J2 3.14 s/step, J3 2.95 s/step) and after (column F: J1 2.52 s/step, J2 2.98 s/step, J3 2.72 s/step) load balancing of job 3.