Load Balancing on a Non-dedicated Heterogeneous Network of Workstations




Dr. Maurice Eggen, Nathan Franklin
Department of Computer Science, Trinity University, San Antonio, Texas 78212

Dr. Roger Eggen
Department of Computer and Information Sciences, University of North Florida, Jacksonville, Florida 32224

Abstract: Networks of workstations (NOW) have become popular environments for parallel and distributed processing. In many cases, the researcher or educator is faced with a heterogeneous collection of workstations of different ages and processing capabilities. The NOW is often non-dedicated, so external workload plays a role in the performance of the cluster. Known workstation characteristics can help determine an ideal load distribution. In this paper we present methods for using system information to determine a dynamic load balance in a heterogeneous, non-dedicated NOW. We focus on two examples and show benefits in the form of decreased execution time.

Keywords: distributed parallel processing, dynamic load balancing, message passing interface

1 Introduction

Networks of workstations (NOW) have become the de facto standard for distributed and parallel processing applications in our modern computing environment. Whether or not the machines are tightly coupled, the workstation cluster serves as the researcher's or educator's laboratory. In general, many factors influence the performance of a NOW. The age and speed of the machines, the nature of the processors, and the external workload all play a role in the throughput provided by the network. It is often necessary to provide effective load balancing on such a workstation cluster in order to maximize its performance. Depending on the problem at hand, load balancing can be a very difficult task. There are many theoretical methods for accomplishing load balancing, but in practice problems are often not suitable for these methods. In this paper we present some very practical methods for load balancing on a NOW of the kind found in many universities and small research institutions. These methods are easy to apply and yield some interesting results.

2 Problem solving

To address the problem of load balancing, we first discuss issues associated with problem solving in general. To develop a parallel algorithm for the solution of a given problem, a four-step approach may be used. First, the original problem must be partitioned as finely as possible. This partitioning uncovers any inherent parallelism in the solution to the problem, and a decomposition that is as fine as possible allows maximum flexibility in the design of the solution. Next, the required communication between the tasks identified in the original partition must be examined. While some problems exhibit simple communication patterns, it is often the case that tasks in the partition cannot execute independently; information must be exchanged between the tasks so the required computation can continue to a suitable conclusion. The third step in the algorithm development process requires us to re-examine the original partition and combine related tasks, since the original partition is often much too fine to provide efficient execution. The topology of the NOW onto which the partition must be mapped must also be considered. Thus, the fourth step in the design process is finding a suitable mapping to the existing hardware. This step is where many interesting questions arise. Providing a mapping that balances the load for each of the processors in our network is a challenge.
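As a baseline for the mapping step (and for the even-distribution method timed later in Section 6.1), the simplest mapping gives every process an equal block of the fine partition. The following MPI sketch illustrates this; the task count and the do_task() helper are placeholders, not part of our implementation.

    /* Hypothetical sketch: mapping a finely partitioned problem onto the
     * available processes with a plain block (even) distribution. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_task(long i) { (void)i; /* placeholder for one unit of work */ }

    int main(int argc, char **argv)
    {
        long ntasks = 1000000;          /* total tasks from the fine partition (illustrative) */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process takes a contiguous block; the remainder is spread
         * over the first (ntasks % size) ranks. */
        long base  = ntasks / size;
        long extra = ntasks % size;
        long mine  = base + (rank < extra ? 1 : 0);
        long first = rank * base + (rank < extra ? rank : extra);

        for (long i = first; i < first + mine; i++)
            do_task(i);

        printf("rank %d handled %ld tasks\n", rank, mine);
        MPI_Finalize();
        return 0;
    }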

3 Load balancing

A wide variety of load balancing algorithms have been proposed. Some are general methods and others are task specific. Load balancing can be static or dynamic. Static methods decide the mapping of tasks to processors in advance and leave the balance unchanged for the duration of the execution of the algorithm. Much more interesting is the prospect of dynamic load balancing, which changes the allocation of tasks to processors during the execution of the algorithm. One possibility is a client-server model in which the server continually polls the clients to find out which need more work and allocates work to those clients. A possible difficulty with such a model is the overhead associated with the communication of tasks between the clients and the server. Moreover, the topology of the workstation cluster, the nature of the task communication, and the combination of tasks performed in the third problem-solving step all play a role in the dynamic mapping of tasks to processors.

4 The method

In this paper we propose a load balancing method to be used on a non-dedicated NOW. The proposal provides methods for a collection of heterogeneous workstations and deals with the performance issues caused by external workload factors that may be beyond the control of the parallel cluster user. The method initially determines a load distribution based on the speed of the processors and on the amount of workload on each host. It then attempts to maintain a well balanced load among the available workstations by evaluating the completion of subsets of the work and basing future load distributions on those results.

4.1 Determining the initial distribution

In general, the CPU clock rate of a particular processor cannot be directly translated into the workstation's performance on a particular application, and so it is not a correct measure of performance [7]. For our purposes, however, the clock rate does provide a general approximation of the relative performance of the workstation. In fact, the length of the run queue is all that is required to describe the workload [6]. We use the average number of jobs in the run queue, Qavg, to describe the workload. Then the portion of the processing power that is available to a task on one of the workstations in the cluster is

    PA = Nprocs / (Qavg + Ntasks)    (1)

where Nprocs is the number of processors on that workstation and Ntasks is the total number of new tasks being started on that particular machine. The normalized performance is therefore

    NormPerf = PA * CPUrate    (2)

and the total available performance is the sum of the normalized performances of all the workstations:

    NormTotal = Σ_i NormPerf_i    (3)

For example, a dual-processor 500 MHz workstation with Qavg = 1 and a single new task has PA = 2/(1+1) = 1 and NormPerf = 500, twice the weight of an otherwise idle 250 MHz uniprocessor. For each parallel task executing on a workstation, the portion of the work it should be assigned is related to the portion of the total cluster power it contributes. Thus the initial amount of work for each is

    W_i = W * NormPerf_i / NormTotal    (4)

where W is the total amount of work for the application.

4.2 Dynamic load balancing

The algorithm then maintains a fair distribution of the load by evaluating the progression of the work and using this information as the metric for the next load distribution. The load is redistributed at specific intervals during the course of the program's execution, with each redistribution determined by each processor's most recent performance. This performance is simply

    Performance = W / t

where W is the amount of work completed during the time interval t. With the performance information from each process, the total performance is calculated as in (3) and the amount of work dynamically allocated to each process is determined as in (4). This adaptive distribution of the load helps to correct imbalances that might result from changing external workload or from the complex nature of the application.
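As a concrete illustration of the characteristics-based calculation of Section 4.1, the sketch below shows how a single Linux host might evaluate equations (1) and (2). The helper names, the use of the one-minute load average, the sysconf() processor count, and the simple /proc/cpuinfo parse are illustrative assumptions; equation (4) is reduced to the single-host case because NormTotal in (3) is only known once every host has reported.

    /* Hypothetical sketch of equations (1)-(4); names are illustrative,
     * not taken from the functions described in Section 5.4. */
    #include <stdio.h>
    #include <unistd.h>

    /* Qavg: one-minute run-queue average from /proc/loadavg */
    static double read_qavg(void)
    {
        double q = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (f) { fscanf(f, "%lf", &q); fclose(f); }
        return q;
    }

    /* CPUrate: first "cpu MHz" entry in /proc/cpuinfo */
    static double read_cpu_mhz(void)
    {
        char line[256];
        double mhz = 0.0;
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) return 0.0;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) break;
        fclose(f);
        return mhz;
    }

    int main(void)
    {
        double W = 1.0e6;            /* total work units, application defined */
        int    ntasks = 1;           /* new tasks being started on this host  */
        long   nprocs = sysconf(_SC_NPROCESSORS_ONLN);

        double PA       = (double)nprocs / (read_qavg() + ntasks);   /* eq. (1) */
        double NormPerf = PA * read_cpu_mhz();                       /* eq. (2) */

        /* In the full method the master sums NormPerf over all hosts to get
         * NormTotal, eq. (3); a single-host value stands in for it here.   */
        double NormTotal = NormPerf;
        double Wi = W * NormPerf / NormTotal;                        /* eq. (4) */

        printf("PA=%.3f NormPerf=%.1f share=%.0f work units\n", PA, NormPerf, Wi);
        return 0;
    }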

5 The experiment

The proposed method for load balancing is a useful and simple technique that lets parallel applications better utilize a network of workstations. To evaluate the method, we tested it and compared it against other possible methods. We selected two different and contrasting test applications for our investigation: the first is a volcanic ash dispersion model, and the second is a finite difference model.

5.1 Volcanic ash model

The volcanic ash model is a dispersion and accumulation model developed by Suzuki [8]. It is an empirical model that describes volcanic ash fallout and can be used for volcanic hazard assessment. Such a model aids hazard assessment by making the variability of volcanic systems quantifiable without the use of more complex and computationally expensive physical models. The model computes the mass of ash X accumulated at a location x and y distances from the volcanic vent according to Suzuki's empirical formula [8].

5.2 Finite difference model

Many scientific problems have solutions that involve partial differential equations. To solve such equations numerically, the partial derivatives can be replaced with finite difference approximations. Our test case consists of computing the steady-state temperature distribution in a rectangular slab [3]. We solve this problem with an iterative method called Liebmann's method. With this method, the domain of interest (the slab) is represented by an array of evenly spaced points. The boundary points are given specific temperatures that remain constant, and the new temperature at any interior point is computed from the temperatures of its neighboring cells. Because of the limited space offered by this forum, we concentrate on results for the volcanic ash model and ask that readers interested in the results of the finite difference model contact the authors directly.
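A minimal serial sketch of this update follows; the grid size, boundary temperatures, and convergence tolerance are illustrative assumptions, and the parallel decomposition used in our experiments is not shown.

    /* Hypothetical serial sketch of Liebmann's (Gauss-Seidel) iteration for
     * the steady-state temperature of a rectangular slab. */
    #include <math.h>
    #include <stdio.h>

    #define N 64    /* grid points per side, including the fixed boundary */

    int main(void)
    {
        static double T[N][N] = {0};
        double tol = 1e-4, diff;

        /* Fixed boundary temperatures: top edge hot, other edges cold. */
        for (int j = 0; j < N; j++) T[0][j] = 100.0;

        do {
            diff = 0.0;
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++) {
                    double tnew = 0.25 * (T[i-1][j] + T[i+1][j] + T[i][j-1] + T[i][j+1]);
                    double d = fabs(tnew - T[i][j]);
                    if (d > diff) diff = d;
                    T[i][j] = tnew;   /* in-place update = Liebmann / Gauss-Seidel */
                }
        } while (diff > tol);

        printf("centre temperature: %.2f\n", T[N/2][N/2]);
        return 0;
    }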

5.3 Model implementation

As noted above, the proposed method was tested against several other methods, described below. Each of the models was coded in C using the Message Passing Interface (MPI). A master-slave model was used for the test cases: a master process is responsible for keeping track of the model organization, while the slave processes do the work. The master process handles all input and output duties as well as all of the computation associated with the dynamic load distribution to the slaves.

5.4 Workstation environment

The workstations used in our experiments ran the Linux operating system (Red Hat Linux 6.2). The machines included single-processor workstations, dual-processor workstations, and one four-processor workstation; each of the multiprocessor workstations has a symmetric multiprocessor architecture. The machines are connected by 100 Mbit Fast Ethernet. From Linux we were able to obtain both the processor speed and the current load information. The Linux kernel maintains a /proc file system which contains information on running processes, memory, devices, and network details [1]. One file, loadavg, reports the system load as the average number of jobs in the run queue over the last 1, 5, and 15 minutes. The cpuinfo file contains information about the processors, including their number, type, and speed. Functions were written to obtain this information from the /proc file system; using them, a process can collect the average run queue length, the number of processors, and the processor clock rate. Functions were also written to calculate equations (1), (2), and (3) described previously. We also calculated the relative speedup obtained by using the adaptive method. The results are summarized below.

6 Experimental results

6.1 The methods

Four load distribution methods were implemented and timed with the volcanic ash dispersion model: even distribution, self-scheduling, characteristics-based, and adaptive characteristics-based. The even distribution and characteristics-based methods are static: the distribution of the load to the processes is decided in advance. The characteristics-based distribution uses the CPU clock rate and the average run queue length as described earlier; the master process sends the data and the appropriate percentage of the points to calculate to each slave process. The self-scheduling algorithm for the volcanic ash dispersion model is a task-queue load balancing method: a single point is distributed to each machine to be calculated, and after a slave process calculates the result it returns it to the master process and requests another calculation. With the adaptive characteristics-based distribution, only a portion of the total grid points is distributed initially, in a manner determined by the characteristics of the machines as in the characteristics-based distribution. After each machine finishes its calculations, it requests an additional set of points. The master process determines the processing speed by keeping track of how long each slave process takes to complete a particular number of calculations, and all future distributions are based on the performance of the process relative to the total performance of the cluster.
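The adaptive characteristics-based method just described amounts to a simple master-slave exchange: each slave times its chunk, and the master sizes the next round of chunks in proportion to the measured rates (Performance = W / t). The sketch below is a hypothetical illustration of that loop; the round structure, message layout, and compute_points() kernel are assumptions, not the code used in our experiments.

    /* Hypothetical sketch of the adaptive master-slave loop (Sections 4.2 and 6.1). */
    #include <mpi.h>
    #include <stdio.h>

    static void compute_points(long n)       /* stand-in for the model kernel */
    {
        volatile double s = 0.0;
        for (long i = 0; i < n; i++) s += i * 1e-9;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int  rounds = 10;
        const long round_work = 1000000;     /* points distributed per round */

        if (rank == 0) {                     /* master */
            int nslaves = size - 1;
            double rate[64];                 /* assumes < 64 slaves, for brevity */
            for (int s = 1; s <= nslaves; s++)
                rate[s] = 1.0;               /* could be seeded from eq. (2) instead */

            for (int r = 0; r < rounds; r++) {
                double total = 0.0;
                for (int s = 1; s <= nslaves; s++) total += rate[s];

                for (int s = 1; s <= nslaves; s++) {          /* proportional chunks */
                    long chunk = (long)(round_work * rate[s] / total);
                    MPI_Send(&chunk, 1, MPI_LONG, s, 0, MPI_COMM_WORLD);
                }
                for (int s = 1; s <= nslaves; s++) {          /* collect timings */
                    double report[2];                         /* {work done, seconds} */
                    MPI_Recv(report, 2, MPI_DOUBLE, s, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    rate[s] = report[0] / (report[1] > 0.0 ? report[1] : 1e-9);  /* W / t */
                }
            }
            long stop = -1;                                   /* tell slaves to finish */
            for (int s = 1; s <= nslaves; s++)
                MPI_Send(&stop, 1, MPI_LONG, s, 0, MPI_COMM_WORLD);
        } else {                             /* slave */
            for (;;) {
                long chunk;
                MPI_Recv(&chunk, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (chunk < 0) break;
                double t0 = MPI_Wtime();
                compute_points(chunk);
                double report[2] = { (double)chunk, MPI_Wtime() - t0 };
                MPI_Send(report, 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }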
6.2 The experimental apparatus

To test the relative performance of the methods described above, both homogeneous and heterogeneous clusters of workstations were assembled. Four-processor clusters were used for uniformity of the test results; these clusters are all homogeneous. The first cluster, named Atlas, consisted of four Pentium II machines at 233 MHz. The second cluster, named Janus, consisted of four Pentium III machines at 500 MHz. The third cluster, named Xena, consisted of four Pentium III machines at 550 MHz. The fourth cluster, named Dwarf, consisted of two dual-processor machines at 500 MHz. The last cluster, named Snowwhite (SW), consisted of one four-processor machine at 500 MHz. In addition to the above clusters, four heterogeneous clusters were assembled. The first, H1, consisted of two Atlas and two Janus machines. The second, H2, consisted of two Janus and two Xena machines. The third, H3, consisted of nine processors in total: one Atlas, one Janus, one Xena, two Dwarf, and one SW machine. The fourth, H4, consisted of 20 processors: four Atlas, four Janus, four Xena, four Dwarf, and SW.

6.3 The experimental results

First, the homogeneous clusters were tested with no external load. The results are illustrated in Figure One below.

[Figure One. Homogeneous clusters, no external load: run times of the even distribution, self-scheduling, characteristics-based, and adaptive methods on Atlas, Janus, Xena, Dwarf, and SW.]

We can see from the figure that the static load balance and the self-scheduling algorithm performed the worst. We also see that the adaptive dynamic load balance performed best in all cases, if only slightly better in some cases. Next, we tested the heterogeneous clusters H1-H4. The results are shown in Figure Two below.

[Figure Two. Heterogeneous clusters: run times of the even distribution, self-scheduling, characteristics-based, and adaptive methods on H1-H4.]

We can see that in this case the adaptive distribution method produced the lowest run times in three of the four cases. We attribute the slightly higher run times on H4 to the communication among the 20 processors involved in that cluster. Finally, to simulate an unevenly distributed external workload on the homogeneous clusters, a workload-producing program capable of driving a CPU to 100% utilization was developed.
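Such a generator can be as simple as a few processes spinning on the CPU for a fixed time; the sketch below is a hypothetical stand-in, not the program actually used.

    /* Hypothetical sketch of an artificial load generator: each child process
     * spins on the CPU for a requested number of seconds. */
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        int nburners = (argc > 1) ? atoi(argv[1]) : 1;   /* one per CPU to saturate it */
        int seconds  = (argc > 2) ? atoi(argv[2]) : 60;

        for (int i = 0; i < nburners; i++) {
            if (fork() == 0) {                           /* child: spin until time is up */
                time_t end = time(NULL) + seconds;
                volatile double x = 0.0;
                while (time(NULL) < end)
                    x += 1e-9;                           /* busy work, ~100% of one CPU */
                _exit(0);
            }
        }
        for (int i = 0; i < nburners; i++)
            wait(NULL);
        return 0;
    }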

Due to this artificial workload, the times for the volcanic ash dispersion model increased from those shown in Figure One. The results are shown in Figure Three below.

[Figure Three. Imbalanced external workload, homogeneous clusters: run times of the even distribution, self-scheduling, characteristics-based, and adaptive methods on Atlas, Janus, Xena, Dwarf, and SW.]

We can see that in most cases the adaptive characteristics-based load balancing algorithm produced excellent results compared to the static methods.

7 Summary and conclusions

We formulated and implemented a simple dynamic load balancing method based on the characteristics of the cluster involved and on its workload. The load is rebalanced dynamically at steps during the execution of the application, with each new distribution based on the performance of the workstations during the previous computation phase. We have shown that in most cases the adaptive characteristics-based dynamic method performs at least as well as, and usually better than, other simple load balancing methods. Our results indicate that performance improvements can be achieved in non-dedicated clusters of workstations by means of these calculations. We also point out that improvement was seen even though the load balancing calculations are performed alongside the parallel computation. We hope that the development of these methods will improve the use of parallel applications on the shared networks of workstations found in many educational and research environments. Testing of these methods continues, and we encourage anyone interested in methods for improving the performance of networks of workstations to contact the authors for more detail concerning the methods used.

8 References

[1] Bishop, A.M., The /proc File System and ProcMeter, Linux Journal, no. 36, 1997.
[2] Foster, I., Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, Reading, MA, 1995.
[3] Gerald, C. and Wheatley, P., Applied Numerical Analysis, Addison-Wesley, Reading, MA, 1999.
[4] Gropp, W. et al., Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
[5] Hargrove, W. and Hoffman, F., Optimizing Master/Slave Dynamic Load Balancing in Heterogeneous Parallel Environments, Oak Ridge National Laboratory, Oak Ridge, TN, 1999.
[6] Kunz, T., The Influence of Different Workload Descriptions on a Heuristic Load Balancing Scheme, IEEE Transactions on Software Engineering, 1991.

[7] Hennessy, J.L. and Patterson, D.A., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco, CA, 1998.
[8] Suzuki, T., A theoretical model for the dispersion of tephra, in Arc Volcanism: Physics and Tectonics, Terra Scientific Publishing, Tokyo, 1983.