A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture




Yangsuk Kee
Department of Computer Engineering, Seoul National University, Seoul 151-742, Korea

Soonhoi Ha*
Department of Computer Engineering, Seoul National University, Seoul 151-742, Korea

Abstract

The working condition of a multicomputer system based on message-passing communication is changeable and unpredictable. Any good algorithm for such a system must adapt to dynamic changes in the working condition. In this paper, we propose a new algorithm called RAS (Reservation And work Stealing). RAS addresses the load-balancing, fault-tolerance, and processor-selection problems through work stealing and reservation-based distribution. In our experiments on the IBM SP2 with a matrix multiplication program, RAS proved superior to weighted factoring under a shared running environment and comparable under a dedicated running environment.

Keywords: dynamic load balancing, fault tolerance, processor selection

1 Introduction

As the performance of general-purpose processors and networks improves, the multicomputer system emerges as a viable platform for high-performance computing. Interprocessor communication in such a system is achieved mainly by explicit message passing. Because of the high communication overhead of message passing, only coarse-grain parallelism can be exploited profitably; hence, we specifically target data-parallel applications with coarse-grain parallelism. In a time-shared environment, where the system is not dedicated to one program, a program may exhibit very dynamic behavior during its run: the operating system and other user processes interfere with the program. Changes in load severely affect overall performance, since the running time of an application depends on the speed of the slowest node. Therefore, balancing loads among nodes is essential to achieving a shorter running time.
In addition, there are several other considerations in exploiting parallelism on a multicomputer system. First, a multicomputer system, particularly a NOW/COW, generally does not guarantee the reliability of its nodes. Few previous dynamic load-balancing approaches, if any, allow for the failure of nodes and links; to guarantee the termination of a program, an algorithm should tolerate such dynamic failures. Second, performance does not always increase in proportion to the number of nodes, because high communication overhead allows the master to manage only a limited number of slaves without degradation. Superfluous nodes may only raise the scheduling burden of the master, and the master may make wrong decisions by sending data to slower nodes. Therefore, it is necessary to select the faster nodes, up to a certain threshold.

We propose a new method, RAS (Reservation And work Stealing), to exploit coarse-grain data parallelism on a multicomputer system by solving the load-balancing, fault-tolerance, and node-selection problems at the same time. In this paper, we compare RAS with the weighted factoring algorithm on a data-parallel application, matrix multiplication. Our experiments are performed on the IBM SP2 using MPI. We first review prior approaches to the dynamic load-balancing problem, then present the proposed algorithm in detail, and last report experimental results that demonstrate the robustness of our algorithm under various running conditions.

2 Previous Works

The simplest way to share N data items among P processors is to divide the whole data set into chunks of N/P items each. However, this method is not appropriate for a multicomputer system, because it considers neither the dynamic change of the running environment nor the communication overhead. Hence, several approaches have been proposed to solve these problems.

One approach balances workloads by reducing the chunk size as scheduling progresses. GSS [1] allocates 1/P of the remaining data to the next idle node, so the chunk size for the ith request is G_i = (1 - 1/P)^i * (N/P). Factoring [2] schedules data in batches of P equally sized chunks; in practice, each batch covers half of the remaining data, so the chunk size in the ith batch is F_i = (1/2)^(i+1) * (N/P). Both algorithms were originally proposed for shared-memory systems. When they are applied to a multicomputer system, the large chunks assigned to the slowest node at the beginning of the computation can delay the overall finishing time, because the remaining data may not be sufficient to restore a good load balance.

Weighted factoring [3] is a variant of factoring proposed for multicomputer systems. It is identical to factoring except that the chunk sizes within a batch are proportional to the node weights W_j: the size of the jth chunk in the ith batch is F_ij = (1/2)^(i+1) * N * W_j / (W_1 + ... + W_P). Even though weighted factoring considers the heterogeneity of nodes, it still cannot reduce the effect of superfluous nodes, as mentioned in [3]. Table 1 illustrates sample chunk sizes for these algorithms.

Table 1: Sample chunk sizes (P = 4, N = 100, weights = 2, 1.5, 1.5, 1)

    Static Chunk        25 25 25 25
    GSS                 25 19 14 11 8 6 5 3 3 2 1 1 1 1
    Factoring           13 13 13 13 6 6 6 6 3 3 3 3 2 2 2 2 1 1 1 1
    Weighted Factoring  17 13 13 9 8 6 6 4 4 3 3 2 2 2 2 1 1 1 1 1 1

The other approach is work stealing, in which work migrates dynamically from heavily loaded nodes to lightly loaded ones. Hybrid scheduling [4] combines a static scheduling method with a dynamic one: the master evenly distributes data to the slaves at the static level, and at the dynamic level the slaves divide the given data into chunks using an algorithm like GSS. After executing each chunk, the slaves periodically send messages informing the master of their performance, and when the master detects a significant workload imbalance, it directs the slaves to redistribute the data. Hamdi's approach [5] is similar to hybrid scheduling: the master periodically polls the slaves to collect workload information and partitions the data into several regions; to avoid unnecessary communication, a slave that has no data left to compute requests data in its region directly from the slave that holds it. Both algorithms incur high scheduling overhead in collecting load information, because lock-step synchronization and periodic polling are burdensome operations.

3 RAS Algorithm

We propose a new adaptive algorithm, RAS, for multicomputer systems. To reduce or hide communication overhead, RAS overlaps communication with computation. In a message-passing environment, sending one large message is preferable to sending many small ones, because it amortizes the high start-up cost. However, as shown in figure 1, a large message delays the start time of the slaves, so achieving high performance requires a trade-off between the start-up cost and the start time. Communication overhead is proportional to the message size once the size is large enough [6]. From the experiment shown in figure 2, we can find the point at which this linearity breaks: messages larger than 1KB give a relatively early start time without significant overhead.

Figure 1: Large message vs. small message
Figure 2: Overhead per 1KB of MPI_Send on the SP2

RAS consists of three phases, as shown in figure 3. In the first phase, the master estimates the performance of the slaves in terms of their computation latency. After the estimation, the master distributes data on the basis of this performance; in this distribution phase, the master can select as many of the faster slaves in the slave pool as can be used profitably. In the last phase, the master monitors the workloads of the slaves and redistributes data when a load imbalance is detected. Because the master is responsible for both scheduling and distribution of data, RAS can let several slaves compute the same data redundantly; this mechanism makes RAS tolerant of slave failures.

Figure 3: RAS algorithm flowchart

3.1 Performance Estimation of Slaves

The performance estimation of the slaves must be carried out in advance so that data can be scheduled precisely. We divide the whole data set into basic chunks, which are the unit of computation; in matrix multiplication, one row or column can be a basic chunk. If a basic chunk is smaller than 1KB, several basic chunks are packed into one basic message to reduce communication overhead. The size of a basic message is ceil(1KB / S_basic) * S_basic, where S_basic denotes the size of a basic chunk in bytes. When RAS starts up, the master sends one basic message to each slave for estimation; the first basic message sent to each slave is called a probe. The average execution time of a basic message is used to represent a slave's performance.
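The chunk-size rules of section 2 and the basic-message packing rule above can be sketched in Python. This is an illustration only: the paper does not specify a rounding rule, so we use the ceiling, which reproduces the sample sizes in Table 1, and `basic_message_size` assumes 1KB = 1024 bytes.

```python
import math

def gss_chunks(n, p):
    """GSS: each idle node gets ceil(remaining / P) items (G_i above)."""
    chunks, remaining = [], n
    while remaining > 0:
        c = math.ceil(remaining / p)
        chunks.append(c)
        remaining -= c
    return chunks

def weighted_factoring_chunks(n, weights):
    """Weighted factoring: each batch covers half the remaining data,
    split in proportion to the node weights (F_ij above).  Plain
    factoring is the special case of equal weights."""
    chunks, remaining, total_w = [], n, sum(weights)
    while remaining > 0:
        batch = remaining / 2
        for w in weights:
            if remaining <= 0:
                break
            c = min(remaining, max(1, math.ceil(batch * w / total_w)))
            chunks.append(c)
            remaining -= c
    return chunks

def basic_message_size(s_basic, threshold=1024):
    """Pack basic chunks of s_basic bytes into one basic message of at
    least `threshold` bytes: ceil(1KB / S_basic) * S_basic."""
    return math.ceil(threshold / s_basic) * s_basic
```

For example, `gss_chunks(100, 4)` and `weighted_factoring_chunks(100, [2, 1.5, 1.5, 1])` yield exactly the GSS and Weighted Factoring rows of Table 1, and a 100x100 matrix of 8-byte doubles (800-byte rows) is packed two rows per basic message.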
The master would like the slaves to report their performance after every computation of a basic message, for accurate estimation. However, frequent reports would burden both the slaves and the network. As a compromise, a slave reports its average execution time, piggybacked on its computation results, when one of the following conditions holds:

- the slave finishes computation of the probe;
- the slave receives the End Of Data Distribution message;
- the slave has produced results larger than 1KB;
- the master interrupts the slave to steal data;
- the slave has no data left to compute.

The estimation is used both in the distribution phase and in the work-stealing phase. Until the end of the distribution phase, the master keeps checking for probe replies from any slaves that have not yet reported their performance.

3.2 Data Distribution on the Basis of Reservation

In the distribution phase, the master sends data in the form of basic messages. The master converts each slave's average execution time into a sending period for basic messages and distributes basic messages on the basis of this period. The sending period of the ith slave is P_i = T_exec(i) / T_overhead, where T_exec(i) denotes the average execution time of a basic message on the ith slave and T_overhead the overhead of one send operation for a basic message; the period is expressed as a number of MPI_Send operations. When the ith basic message is sent to the jth slave, the (i + P_j)th basic message is reserved for that slave.

Two ambiguous cases can arise during this reservation. One is that several slaves try to reserve the same data; in this case, the master selects the fastest slave and sends it the data, and the slaves that fail to get the data reserve the next data. The other is that no slave has reserved a given data item, which splits into two sub-cases: when no slave has yet replied to its probe, the master sends data in round-robin fashion to all slaves; when one or more slaves have replied to their probes, the master sends data in round-robin fashion to the slaves that responded.

Figure 4: Scenario of data distribution based on reservation
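As an illustration only, the reservation bookkeeping can be sketched in Python. This toy model makes simplifying assumptions that are not in the paper: every reserved delivery immediately reserves a future slot (in the real protocol, reservations are driven by slave replies), the initial probes bootstrap the reservation chains, and conflicts are resolved purely by comparing sending periods, with a shorter period meaning a faster slave.

```python
def distribute_with_reservation(n_messages, periods):
    """Toy model of reservation-based distribution.  periods[j] is the
    sending period P_j of slave j in send operations.  A reserved
    delivery of message i to slave j reserves message i + P_j for j;
    a conflict goes to the faster slave and the loser moves to the next
    free slot.  Unreserved slots are filled round-robin, and such
    superfluous messages create no new reservations."""
    reserved = {}          # message index -> slave id holding it
    owner = []             # owner[i] = slave that receives message i
    rr = 0                 # round-robin cursor for unreserved slots
    for i in range(n_messages):
        if i in reserved:
            j = reserved.pop(i)
            renew = True                  # reserved delivery: re-reserve
        else:
            j = rr % len(periods)         # fallback: round-robin
            rr += 1
            renew = i < len(periods)      # probes bootstrap the chains
        owner.append(j)
        if renew:
            slot = i + periods[j]
            while slot in reserved:
                if periods[j] < periods[reserved[slot]]:
                    # faster slave takes the slot; slower one moves on
                    j, reserved[slot] = reserved[slot], j
                slot += 1
            reserved[slot] = j
    return owner
```

With periods (6, 9, 5, 7) the first four messages act as the probes, and over a long run the fastest slave (period 5) receives the most messages while the slowest repeatedly loses reservation conflicts, which is the node-selection effect described above.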
These two cases make some slaves receive superfluous data; the master does not use such superfluous data for future reservations. This distribution scheme enables the master to select the faster slaves: when the number of slaves exceeds a certain threshold, all data are reserved by some slaves, and since the master sends data to the faster slaves, the slower ones repeatedly fail to get data.

A scenario of reservation-based data distribution is shown in figure 4, where the sending periods of the slaves are 6, 9, 5, and 7, respectively. The first four data items M1, M2, M3 and M4 are the probes used for performance estimation. Since M5 and M6 are not reserved, the master distributes them in round-robin fashion to slave1 and slave2. As the master receives a reply from slave1 while sending M7, it sends M8 to slave1 and reserves M14 for slave1. After that, the master receives a reply from slave3, sends M9 to slave3, and reserves M14 for slave3. M14 is now reserved by two slaves; the master selects the faster one, slave3, and sends it the data, while slave1, which fails to get M14, reserves the next data item, M15.

3.3 Work Stealing

Even though the master tries to distribute data in proportion to the speed of the slaves, dynamic changes in the running condition make the finishing times of the slaves inconsistent, and superfluous data distributed in round-robin style can be another source of load imbalance. The master redistributes data by work stealing when it detects a significant load imbalance. When a slave S_i has no basic chunks left to compute, the master selects the slave S_M that has the largest number of remaining basic chunks. The master steals a portion of the basic chunks from S_M and sends them to S_i. The number of basic chunks to steal is T_exec(M) / (T_exec(i) + T_exec(M)) * D_M, where D_M is the number of remaining basic chunks of the richest slave. When this amount is less than 1, the master instead redundantly sends one basic chunk of S_M to S_i, avoiding the interrupt overhead on the richest slave.

When a slave is heavily loaded, it may fail to reply with its computational results. In that case, the master can send the duplicated data, which was already sent to that slave, to other slaves. With this mechanism, RAS tolerates slave failures as long as the master is alive. We adopted a software interrupt to implement the work-stealing mechanism; since MPI does not define an interrupt function between processes running on different nodes, we implemented the interrupt using asynchronous I/O over a socket [7].

4 Experimental Results

We compare RAS with weighted factoring (WF) under two running environments, dedicated and shared. We measure the net elapsed time of matrix multiplication, excluding the initialization time of MPI. To create a shared running condition, we execute a simple parallel version of the matrix multiplication program as a background process. We take the mean running time over 10 executions of both programs. The number of processors refers to the total number of processors allocated, including the master.
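The steal-size rule of section 3.3 can be sketched as follows; this is a minimal illustration, and the function name and the duplicate-one-chunk return convention are our own.

```python
def chunks_to_steal(t_exec_idle, t_exec_rich, d_rich):
    """Number of basic chunks to move from the richest slave M to the
    idle slave i: D = T_exec(M) / (T_exec(i) + T_exec(M)) * D_M.
    The faster the idle slave relative to the rich one, the larger the
    share of the rich slave's backlog (d_rich remaining chunks) it gets."""
    d = t_exec_rich / (t_exec_idle + t_exec_rich) * d_rich
    # Less than one chunk is not worth an interrupt: the master instead
    # sends the idle slave a redundant copy of one of M's chunks.
    return int(d) if d >= 1 else 0
```

For example, two equally fast slaves split the backlog evenly, while an idle slave three times faster than the rich one takes three quarters of it.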
The results show that RAS is slightly better than weighted factoring even under the dedicated running condition and clearly superior under the shared running condition. This means that adaptation to dynamic changes in the working condition pays off, even though it requires additional scheduling overhead. Note the results of the experiment with 16 processors shown in figures 7 and 8: RAS balances loads adaptively under the dynamic environment, while weighted factoring suffers from load imbalance, mainly because its master sends too much data to the slower slaves.

Figure 5: 100x100 matrix multiplication under the dedicated running environment
Figure 6: 200x200 matrix multiplication under the dedicated running environment
Figure 7: 100x100 matrix multiplication under the shared running environment
Figure 8: 200x200 matrix multiplication under the shared running environment

Since RAS balances loads by work stealing, it is useful to observe the number of steals. Tables 2 and 3 show the number of steals under various configurations; we count the number of interrupts, excluding redundant computations. The number of steals indicates the degree of decision inaccuracy: if the master distributed data on the basis of correct information, few steals would be needed, whereas incorrect scheduling forces the master to balance loads with additional stealing. As expected, the number of steals increases as the number of processors grows and as the environment becomes more dynamic.

Table 2: Number of steals under the dedicated running environment

    matrix size   4 procs   8 procs   16 procs
    100x100       2.93      4.50      8.00
    200x200       1.39      3.30      4.15

Table 3: Number of steals under the shared running environment

    matrix size   4 procs   8 procs   16 procs
    100x100       3.67      4.67      8.00
    200x200       3.73      8.53      9.20

To test real fault tolerance, we would have to rewrite the error-handling routines of MPI. For simplicity, we modeled a sudden failure, or a surge of load on a node, with an infinite loop: we let a slave fall into an infinite loop during program execution, so that the master could obtain no further information about it. Since the master is responsible for distributing data, it redundantly sent the data already sent to that slave to the other slaves. While weighted factoring did not terminate, RAS terminated successfully without significant performance degradation.

5 Conclusion

In this paper, we applied RAS to parallelizing a matrix multiplication problem. RAS showed good performance under both dedicated and shared running environments, although the margin varies with the running environment. Furthermore, RAS was fault-tolerant and adaptive to the number of slaves. For a more rigorous evaluation, it is necessary to run various applications, and we also plan to apply RAS to PC clusters and clusters of workstations. In addition, the library needs a user-friendly interface for application programmers.

References

[1] Constantine D. Polychronopoulos and David J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers, Vol. C-36, No. 12, pp. 1425-1439, Dec. 1987.
[2] Susan Flynn Hummel, Edith Schonberg, and Lawrence E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, Vol. 35, No. 8, pp. 90-101, Aug. 1992.
[3] Susan Flynn Hummel, Jeanette Schmidt, R. N. Uma, and Joel Wein. Load-Sharing in Heterogeneous Systems via Weighted Factoring. SPAA, pp. 318-328, 1996.
[4] Oscar Plata and Francisco F. Rivera. Combining Static and Dynamic Scheduling on Distributed-Memory Multiprocessors. ICS, pp. 186-195, 1994.
[5] Mounir Hamdi and Chi-Kin Lee. Dynamic Load Balancing of Data Parallel Applications on a Distributed Network. ICS, pp. 170-179, 1995.
[6] Zhiwei Xu and Kai Hwang. Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2. IEEE Parallel and Distributed Technology, Spring 1996, pp. 9-23.
[7] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. May 1994.