Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Dr. Maurice Eggen, Nathan Franklin
Department of Computer Science, Trinity University, San Antonio, Texas 78212

Dr. Roger Eggen
Department of Computer and Information Sciences, University of North Florida, Jacksonville, Florida 32224

Abstract: Networks of workstations (NOW) have become popular environments for parallel and distributed processing. In many cases, the researcher or educator is faced with a heterogeneous collection of workstations of different ages and processing capabilities. Often the NOW is non-dedicated, so external workload plays a role in the performance of the cluster. The use of known workstation characteristics can help determine an ideal load distribution. In this paper we present methods for utilizing system information to determine a dynamic load balance in a heterogeneous, non-dedicated NOW. We focus on two examples and show benefits resulting in decreased execution time.

Keywords: distributed parallel processing, dynamic load balancing, message passing interface

1 Introduction

Networks of workstations (NOW) have become the de facto standard for distributed and parallel processing applications in our modern computing environment. Whether or not the machines are tightly coupled, the workstation cluster serves as the researcher's or educator's laboratory. In general, many factors influence the performance of a NOW. The age and speed of the machines, the nature of the processors, and the external workload all play a role in the throughput provided by the network. It is often necessary to provide effective load balancing on such a workstation cluster in order to maximize its performance. Load balancing is often, depending on the problem at hand, a very difficult task. There are many theoretical methods for accomplishing load balancing, but in practice, problems are often not suitable for application of these methods.
In this paper we present some very practical methods for load balancing on a NOW such as may be found in many universities and small research institutions. These methods are easy to apply and yield some interesting results.

2 Problem solving

To address the problem of load balancing, we first discuss issues associated with problem solving in general. To develop a parallel algorithm for the solution of a given problem, a four-step approach may be used. First, the original problem must be partitioned as finely as possible. This partitioning allows us to uncover any inherent parallelism in the solution to the problem. The original decomposition should be as fine as possible in order to allow maximum flexibility in the design of the solution. Next, the required communication between the tasks identified in the original partition must be examined. While some problems exhibit simple communication patterns, it is often the case that tasks in the partition cannot execute independently. Information must be exchanged between the tasks so the required computation can continue to a suitable conclusion. The third step in the algorithm development process requires us to re-examine the original partition to discover related tasks that may be combined. Often the original partition is much too fine to provide efficient execution. The topology of the NOW onto which the partition must be mapped must also be considered. Thus, the fourth step in the design process is finding a suitable mapping to existing hardware. This step is where many interesting questions arise. Providing a mapping that balances the load for each of the processors in our network is a challenge.
3 Load balancing

A wide variety of load balancing algorithms have been proposed. Some are general methods and others are task specific. Load balancing can be static or dynamic. Static methods decide in advance the mapping of tasks to processors and leave the balance unchanged for the duration of the execution of the algorithm. Much more exciting is the prospect of dynamic load balancing, which involves changing the allocation of tasks to processors during the execution of the algorithm. One possibility is a client-server model in which the server continually polls clients to find out who needs more work, and allocates work to those clients. A possible difficulty with such a model is the overhead associated with the communication of tasks between clients and servers. Moreover, the topology of the workstation cluster, the nature of the task communication, and the combination of tasks accomplished in the third problem-solving step all play a role in the dynamic mapping of tasks to processors.

4 The method

In this paper we propose a load balancing method to be used on a non-dedicated NOW. The proposal provides methods to be used on a collection of heterogeneous workstations and deals with the performance issues caused by external workload factors that may be beyond the control of the parallel cluster user. The method initially determines a load distribution based on the speed of the processors and on the amount of workload on each host. The method then attempts to maintain a well-balanced load among the available workstations by evaluating the completion of subsets of the work and basing future load distributions on those results.

4.1 Determining the initial distribution

In general, the CPU rate of a particular processor cannot be directly translated to the workstation's performance on a particular application, and so is not a correct measure of performance [7].
However, for our purposes, the clock rate does provide a way to make a general approximation of the relative performance of the workstation. In fact, the length of the run queue is all that is required to describe the workload [6]. We use the average number of jobs in the run queue, Qavg, to describe the workload. Then the portion of the processing power available to a task on one of the workstations in the cluster is

    PA = Nprocs / (Qavg + Ntasks)                    (1)

where Nprocs is the number of processors on that workstation and Ntasks is the total number of new tasks being started on that particular machine. Therefore the normalized performance is

    NormPerf = PA * CPUrate                          (2)

and the total available performance is the sum of the normalized performances of the workstations:

    NormTotal = Σ_i NormPerf_i                       (3)

Thus, for each parallel task executing on a workstation, the portion of the work it should be assigned is related to the portion of the total cluster power it contributes. The initial amount of work for each is

    W_i = W * NormPerf_i / NormTotal                 (4)

where W is the total amount of work for the application.

4.2 Dynamic load balancing

The algorithm now maintains and promotes a fair distribution of the load by evaluating the progression of the load and using this information as a metric for the next load distribution. The load is redistributed at specific intervals during the course of the program's execution. The redistribution is determined by each processor's most recent performance, which is simply Performance = W/t, where W is the amount of work completed during the time interval t. With the performance information from each process, the total performance is calculated as above (3) and the amount of work dynamically allocated to each process is determined (4). This adaptive
distribution of the load helps to correct any imbalances in the load, which might be the result of changing external workload or the complex nature of the application.

5 The experiment

The proposed method for load balancing is a useful and simple technique for parallel applications to better utilize a network of workstations. To evaluate this method, we tested and compared it against other possible methods. We selected two different and contrasting test applications for our investigation. The first was a volcanic ash dispersion model; the second was a finite difference model.

5.1 Volcanic ash model

The volcanic ash model is a dispersion and accumulation model developed by Suzuki [8]. It is an empirical model which describes volcanic ash fallout and can be utilized for volcanic hazard assessment. Such a model aids hazard assessment by making the variability of volcanic systems quantifiable without the use of more complex and computationally expensive physical models. This model computes the mass of ash X accumulated at a location x and y distances from the volcanic vent.

5.2 Finite difference model

Many scientific problems have solutions that involve partial differential equations. To numerically solve equations with partial derivatives, the derivatives can be replaced with finite difference approximations. Our test case consists of computing the steady-state temperature distribution in a rectangular slab [3]. For the solution to this problem, we use an iterative method called Liebmann's method. With this method, the domain of interest (the slab) is represented by an array of evenly spaced points. The boundary points are given specific temperatures that remain constant. The new temperature for any point in the interior can be computed from the temperatures of its neighboring cells.
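As a concrete illustration of Liebmann's method, the serial core of the iteration can be sketched as follows. The grid size, boundary temperatures, and convergence tolerance here are our own choices for illustration, not values from the paper, and the paper's actual implementation distributes this computation across slave processes with MPI.

```c
/* A minimal serial sketch of Liebmann's method (Gauss-Seidel iteration)
   for the steady-state temperature of a rectangular slab. */
#include <math.h>

#define N 8  /* points per side, including the fixed boundary rows/columns */

/* Sweep the interior until no point changes by more than tol; boundary
   points hold their fixed temperatures. Returns the number of sweeps. */
static int liebmann(double t[N][N], double tol)
{
    int sweeps = 0;
    double delta;
    do {
        delta = 0.0;
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                /* new temperature = average of the four neighbors */
                double new_t = 0.25 * (t[i - 1][j] + t[i + 1][j] +
                                       t[i][j - 1] + t[i][j + 1]);
                double d = fabs(new_t - t[i][j]);
                if (d > delta) delta = d;
                t[i][j] = new_t;  /* in-place update: Gauss-Seidel */
            }
        sweeps++;
    } while (delta > tol);
    return sweeps;
}
```

Because each point is updated in place, new values propagate within a sweep, which is what distinguishes Liebmann's (Gauss-Seidel) iteration from a Jacobi iteration over the same grid.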
Due to the limited space offered by this forum, the authors will concentrate on providing results for the volcanic ash model, and request that persons interested in the results of the finite difference model contact the authors directly.

5.3 Model implementation

The proposed load balancing method is a useful and simple technique for parallel applications which facilitates efficient utilization of a network of workstations. To evaluate the method, it was tested and compared with other possible methods, described below. Each of the models described above was coded in C utilizing the features of the Message Passing Interface (MPI). A master-slave model was used for the test cases. In this scenario, a master process is responsible for keeping track of the model organization, while the slave processes do the work. The master process handles all input-output duties as well as all of the computation associated with the dynamic load distribution to the slaves.

5.4 Workstation environment

The workstations used for our experiment ran the Linux operating system (Red Hat Linux v6.2). Machines used in our experiments consisted of single-processor workstations, dual-processor workstations, and one four-processor workstation. Each of the multiprocessor workstations has a symmetric multiprocessor architecture. The machines are connected with 100 megabit fast Ethernet. From the Linux operating system, we were able to obtain both the processor speed and the current load information. The Linux kernel maintains a file system, /proc/, which contains information on running processes,
memory, devices and network details [1]. One file, loadavg, contains information on system load, including the average number of jobs in the run queue averaged over the last 1, 5, and 15 minutes. The cpuinfo file contains information about the processors, including the number of processors, their type, and their speed. Functions were written to obtain this information from the /proc/ file system. A process using these functions can collect the average run queue length, the number of processors, and the processor's clock rate. Functions were also written to calculate equations (1), (2), and (3) described previously. We also calculated the relative speedup of using this adaptive method. The results are summarized below.

6 Experimental results

6.1 The methods

Four load balancing methods were implemented and timed with the volcanic ash dispersion model: even distribution, self-scheduling, characteristics based, and adaptive characteristics based. The even distribution method and the characteristics based method are static methods; the distribution of the load to the processes is decided in advance. The characteristics based distribution utilizes the CPU clock rate and the average run queue length as described earlier. The master process sends the data and the correct percentage of the points to calculate to each slave process. The self-scheduling algorithm for our volcanic ash dispersion model is a task-queue load balancing method. This method distributes a single point to each machine to be calculated; after a slave process calculates the result, it returns it to the master process and requests another calculation. With our adaptive characteristics based distribution, only a portion of the total grid points was originally distributed in a manner determined by the characteristics of the machines, as in the characteristics based distribution.
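The share calculations behind the characteristics based and adaptive distributions, equations (1) through (4) together with the Performance = W/t rule of Section 4.2, can be sketched as follows. The struct and function names are our own illustration; the paper's implementation additionally reads these inputs from /proc/ and wraps the distribution in MPI communication.

```c
/* Sketch of the share calculations used by the characteristics based
   and adaptive distributions. Types and names are illustrative only. */

typedef struct {
    int    nprocs;   /* Nprocs: processors on this workstation        */
    double q_avg;    /* Qavg: average run-queue length (from loadavg) */
    int    ntasks;   /* Ntasks: new tasks being started on this host  */
    double cpu_mhz;  /* CPU clock rate, a rough speed proxy           */
} host_info;

/* Equations (1) and (2): normalized performance of one workstation. */
static double norm_perf(const host_info *h)
{
    double avail = (double)h->nprocs / (h->q_avg + h->ntasks);  /* (1) */
    return avail * h->cpu_mhz;                                  /* (2) */
}

/* Equations (3) and (4): initial share of W work units for each host. */
static void initial_shares(const host_info *h, int n, double W, double *w)
{
    double total = 0.0;                         /* NormTotal, (3) */
    for (int i = 0; i < n; i++)
        total += norm_perf(&h[i]);
    for (int i = 0; i < n; i++)
        w[i] = W * norm_perf(&h[i]) / total;    /* W_i, (4)       */
}

/* Section 4.2: redistribute remaining work using measured Performance = W/t. */
static void redistribute(const double *work_done, const double *elapsed,
                         int n, double w_left, double *w)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += work_done[i] / elapsed[i];
    for (int i = 0; i < n; i++)
        w[i] = w_left * (work_done[i] / elapsed[i]) / total;
}
```

For example, an otherwise idle uniprocessor (Qavg = 0, one new task) contributes twice the normalized performance of an identical machine that already has one job in its run queue, and so receives twice the share of the grid points.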
After each machine finishes its calculations, it requests an additional set of points to calculate. The master process determines each slave's processing speed by keeping track of how long it takes that slave to complete a given number of calculations. With this performance data, all future distributions are based on the performance of the process relative to the total performance of the cluster.

6.2 The experimental apparatus

To test the relative performance of the methods described above, both homogeneous and heterogeneous clusters of workstations were assembled. Four-processor clusters were used for uniformity of test results; these clusters are all homogeneous. The first cluster, named Atlas, consisted of four 233 MHz Pentium II machines. The second cluster, named Janus, consisted of four 500 MHz Pentium III machines. The third cluster, named Xena, consisted of four 550 MHz Pentium III machines. The fourth cluster, named Dwarf, consisted of two dual-processor 500 MHz machines. The last cluster, named Snowwhite (SW), consisted of one four-processor 500 MHz machine. In addition to the above clusters, four heterogeneous clusters were assembled. The first, H1, consisted of two Atlas and two Janus machines. The second, H2, consisted of two Janus and two Xena machines. The third, H3, consisted of 9 total processors: one Atlas, one Janus, one Xena, two Dwarf, and one SW machine. The fourth, H4, consisted of 20 processors: four Atlas, four Janus, four Xena, four Dwarf, and SW.

6.3 The experimental results

First, the homogeneous clusters were tested with no external load. The results are illustrated in figure 1 below.
[Figure One. Homogeneous clusters, no external load: run times for the even distribution, self-scheduling, characteristics based, and adaptive methods on the Atlas, Janus, Xena, Dwarf, and SW clusters.]

We can see from the figure that the static load balance and the self-scheduling algorithm performed the worst. We also see that the adaptive dynamic load balance performed best in all cases, if only slightly better in some. Next, we tested the heterogeneous clusters H1 through H4. The results are shown in figure two below.

[Figure Two. Heterogeneous clusters: run times for the four distribution methods on H1 through H4.]

We can see that in this case the adaptive distribution method produced the lowest run times in three of the four cases. We attribute the slightly higher run times on H4 to the communication among the 20 processors involved in that cluster. Finally, to simulate an unevenly distributed external workload on the homogeneous clusters, a workload-producing program was developed, capable of producing up to 100% CPU utilization. Under this artificial workload, the times for the volcanic ash dispersion model increased from the results shown in Figure One. The results are shown in figure three below.
[Figure Three. Homogeneous clusters under an imbalanced external workload: run times for the four distribution methods on Atlas, Janus, Xena, Dwarf, and SW.]

We can see that in most cases the adaptive characteristics based load balancing algorithm produced excellent results compared to the static methods.

7 Summary and conclusions

We formulated and implemented a simple dynamic load balancing method based on the characteristics of the cluster involved and its workload. We rebalanced the load dynamically at steps during the execution of the application, with each new distribution based upon the performance of the workstations during the last computation phase. We have shown that the adaptive characteristics based dynamic method performs at least as well as, and in most cases better than, other simple load balancing methods. Our results indicate that in non-dedicated clusters of workstations, improvements in performance can be achieved by these calculations. We also point out that improvement was seen even though the load balancing calculations are done alongside the parallel computation. We hope that the development of these methods will improve the use of parallel applications on the shared networks of workstations found in many educational and research environments. Testing of these methods continues. We encourage anyone interested in methods for improving the performance of networks of workstations to contact the authors for more detail concerning the methods used.

8 References

[1] Bishop, A. M., "The /proc File System and ProcMeter," Linux Journal, #36, 1997.
[2] Foster, I., Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, Reading, MA, 1995.
[3] Gerald, C., and Wheatley, P., Applied Numerical Analysis, Addison-Wesley, Reading, MA, 1999.
[4] Gropp, W. E.
et al., Using MPI: Portable Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
[5] Hargrove, W., and Hoffman, F., "Optimizing Master/Slave Dynamic Load Balancing in Heterogeneous Parallel Environments," Oak Ridge National Laboratory, Oak Ridge, TN, 1999.
[6] Kunz, T., "The Influence of Different Workload Descriptions on a Heuristic Load Balancing Scheme," IEEE Transactions on Software Engineering, 1991.
[7] Hennessy, J. L., and Patterson, D. A., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco, CA, 1998.
[8] Suzuki, T., "A Theoretical Model for the Dispersion of Tephra," in Arc Volcanism: Physics and Tectonics, Terra Scientific Publishing, Tokyo, 1983.