Comet: A Communication-Efficient Load Balancing Strategy for Multi-Agent Cluster Computing

KA-PO CHOW¹, YU-KWONG KWOK¹, HAI JIN¹, AND KAI HWANG¹,²
¹The University of Hong Kong and ²University of Southern California
Email: {ykwok, kpchow, hjin, kaihwang}@eee.hku.hk

Abstract

This paper proposes a new load balancing strategy, called Comet, for fast multi-agent cluster computing. We use a new load index that takes into account the cost of inter-agent communication. Agents with predictable workload are assigned statically to cluster nodes, whereas agents with unpredictable workload are allowed to migrate dynamically between cluster nodes using a credit-based load balancing algorithm built on the proposed load index. We have implemented and tested our load balancing strategy on a multi-agent cluster platform consisting of Linux PC machines. Experimental results indicate that the proposed Comet system outperforms traditional distributed load balancing strategies for applications with regular as well as irregular communication patterns.

Keywords: Load balancing, cluster computing, multi-agent systems, task migration, credit-based algorithms.

1 Introduction

A multicomputer cluster is a collection of complete computers that are physically connected by local area networks or high-bandwidth switches [8], [12]. The most distinctive feature of a cluster computing system, in contrast to traditional distributed systems, is that a cluster offers a single system image (SSI) [3], [5], [6] at a wide range of abstraction levels. SSI is highly desirable for resiliently harnessing the aggregate computing power of a cluster across a wide range of applications, including not only scientific tasks but also commercial workloads. To achieve SSI, we must tackle many intricate resource management chores, such as checkpointing, process state management, task migration, and load balancing.
In this paper, we focus on the load balancing aspect of executing a multi-agent application on a Linux PC cluster. In our cluster platform (see Figure 1(a)), a user accesses the cluster with a Web browser. The agent name server (ANS) indicates the location of each agent in the system. An agent is essentially a light-weight software process running on a single machine. All drafted agents must interact with the database server, which updates the Oracle database of the addressed application. The final decision result is returned to the user by the ANS through the Internet.

This research was supported by the Hong Kong Research Grants Council under contract numbers HKU 2/96C and HKU 7022/97E, by the Information Technology Development Fund at HKU, and by a special research grant from the Engineering School of the University of Southern California.

Our multi-agent system has been designed to manage investors' portfolios and provide customized alarm services. It continuously monitors the risk level and expected return of each user's portfolio and advises investors on portfolio selection. Alarm signals can be set according to the user's criteria on the risk level of the portfolio or on other stock data. Different classes of agents are organized in a hierarchy: portfolio management agents, economic performance agents, factor analysis agents, and indicator agents. At the lowest level of the hierarchy, the indicator agents acquire information from outside sources. There are currently 33 indicator agents and ten factor analysis agents residing in our system. To respond efficiently to users' queries, the agents are required to complete their designated tasks in the shortest possible time. However, if the mapping of agents to machines is not handled properly, some machines are likely to be overloaded with too many agents while others sit idle, which results in unnecessarily long query response times.
Thus, we need a judicious agent management scheme to monitor the execution of agents and properly balance the workload of the machines. In this paper, we propose a new load balancing strategy, called Comet, for balanced multi-agent cluster computing. We use a credit-based load index associated with each agent to keep track of the cost of the inter-agent messages it exchanges. Agents with predictable workload are assigned statically to cluster nodes. Agents with unpredictable workload are allowed to migrate dynamically between cluster nodes based on the credit-based load index, which combines the computational load and the communication load of an agent and thus indicates the affinity of the agent to a cluster node. The objective of the load balancing algorithm is to let agents migrate so that expensive inter-machine messages are replaced with intra-machine communication. Our load balancing strategy differs from previous work in that we consider both the communication and the computation workload of an agent in making migration decisions, whereas most traditional distributed load balancing schemes take only the computation part into account [13], [14]. We have implemented several multi-agent applications and the proposed Comet system using the Java programming language on a Linux PC cluster. Our extensive experimental results indicate that the proposed credit-based approach outperforms a traditional distributed load balancing method in that it almost always yields a more balanced workload throughout the entire execution span of the applications. This in turn implies that much shorter response times can be obtained in servicing users' queries.

Figure 1: (a) A multi-agent cluster computing system (Web browser, Internet, Agent Name Server, Database Server, Oracle databases, and agents a1 to a7); (b) an undirected graph modeling the communication structure of the system; (c) an example tree-structured multi-agent system.

This paper is organized as follows. In the next section, we describe in more detail the architectural characteristics of our multi-agent cluster computing system. We then formalize the proposed load index in Section 3 and the load balancing algorithm in Section 4. This is followed by the experimental results in Section 5. The last section concludes the paper.

2 A Multi-Agent Software System for Clusters

Unlike ordinary processes in a distributed system, an agent is a light-weight unit of execution capable of performing multiple functions autonomously. Agents are persistent software objects with long-lived missions [1], [9], [10], [11], [12], [15]. Specifically, an agent consists of two parts. The first part is a set of data structures manipulated by the agent. For instance, in our financial database application, the set of data types might consist of primitive attribute types (e.g., exchange rates) as well as record types derived from those attribute types. The second part is a set of functions, which may or may not be multithreaded, that manipulate these data types. These are typically function calls for gathering data and computing results. The distinctive feature of an agent is that its embedded computation is minimal compared with the cost of communication among agents.
Although the agents are autonomous, they usually cooperate to collectively satisfy a user's query or command. For example, if the user issues a request about a certain stock market index, some agents may be responsible for collecting the necessary raw data from the databases, while other agents may be responsible for computing the intermediate or final results. Thus, the coordination among agents is represented by some structure (e.g., a tree or a mesh) rather than being entirely disconnected. Such a multi-agent software system differs from a message-passing parallel program in that agents are loosely coupled and there are no strict precedence constraints like those among the tasks of a parallel program. The load characteristics of a multi-agent system are identified below; they are used to establish the proposed load balancing scheme.

Life-span: In our multi-agent system, some agents are essentially perpetual, e.g., indicator and factor analysis agents. They are rarely added or removed, and their workloads are predictable. Other agents, such as portfolio management agents, are instantiated on demand. Their number may increase or decrease at any time, and their workloads change dynamically and unpredictably. Comet takes this into account: it submits the static agents to dedicated machines according to their predicted workloads and arranges the dynamic agents according to the credit-based load index.

Computation cost: Certain indicator agents are idle most of the time; their computational cost is minimal and predictable. On the other hand, some data, such as exchange rates, change almost in real time, so the associated indicator agents are kept busy most of the time. The complexity and computational requirements differ from agent to agent, and their workloads change dynamically. We assign the agents with minimal, predictable computation cost to a small group of machines in the cluster, and spread the agents with dynamically changing workloads over the remaining machines in the cluster.

Communication pattern: In the multi-agent financial database system, although the number of static agents remains fairly constant, their interrelationships may change over time as new variance patterns are discovered by examining covariances among the components. Moreover, the stocks in users' portfolios may change, which also affects the communication pattern because a different set of stocks needs to be consulted. Therefore, we include the amount of inter-agent communication as part of the load index, and agents migrate within the cluster according to this load index.

Persistence: Most agents run from system initiation and never stop; it is their interrelationships and computational requirements that vary. Sometimes the mathematical formula for computing a certain index may change, so that the workload of the corresponding agents increases or decreases. Such events may trigger the execution of the load balancing scheme.

Agent migration: In our system, no persistent state needs to be preserved during migration; every agent works correctly when restarted at any instant. Using the JATLite system, agents can un-register and re-register at another location and resume their monitoring tasks without returning to a previous state. All references to other agents are resolved through the name server.

With the above examination of agent characteristics, we introduce the proposed credit-based load index in the next section.

3 A Credit-Based Load Index

In the multi-agent system considered in our study, we assume an application is composed of n agents executable on any of the p homogeneous machines of the cluster. The structure of the application is modeled by the interdependence relationships among the agents. More specifically, we use an undirected graph to model the application structure. For example, the undirected graph shown in Figure 1(b) can be used to model the structure of the multi-agent system depicted in Figure 1(a).
Also, a binary tree structure, as shown in Figure 1(c), can be used to model most divide-and-conquer types of applications. An undirected graph is an appropriate generic model because a multi-agent application executes perpetually and produces results continuously in response to user queries (e.g., financial database queries). One particular feature of our multi-agent system is that the communication pattern among the agents is known; even when it changes, all changes are registered with the Agent Name Server and the JATLite Message Router. Using this feature, we can arrange the agents to minimize communication overhead over the interconnection network. Notice that the computational load of an agent and the communication load between two agents may differ from query to query. Thus, the data or control dependencies among agents are not constant, and the communication dependency between any two agents cannot be modeled by a directed edge. Indeed, an edge between two agents only indicates that they communicate while processing a particular query; it does not imply a precedence relationship. Given these characteristics, the application is also inherently iterative, in that each iteration corresponds to the execution of the application for one particular query. Figure 2 illustrates the dynamics and structure of the multi-agent application.

Figure 2: (a) Iterative and dynamic nature of a multi-agent application (phases of heavy, moderate, or light computation and communication over time); (b) the structure of an agent (computation interleaved with local and remote communication over time).

Traditional load balancing strategies commonly use the computational load of a process as the load index, based on the assumption that computation is the dominant activity in a process and that communication can be fully overlapped with computation [13], [14]. While this approximation might be valid for heavy-weight processes in a distributed system, such a load index is clearly not appropriate for the multi-agent system considered in our study. As mentioned earlier, the computation load within an agent is sometimes minimal (e.g., a fast execution of a certain financial formula), and a BSP-style multithreaded programming model [16], in which computation and communication phases alternate, describes an agent more accurately. Thus, we propose a composite load attribute for an agent that takes into account the effect of remote and local communication among agents. Let $M(a_i)$ denote the index of the machine on which agent $a_i$ resides. The load of an agent $a_i$ executing on machine $m_k$ is defined as the sum of its computational load $w_i$ and its communication load $u_i$, where

$$u_i = h_i + g_i = \sum_{M(a_j) = M(a_i)} c(a_i, a_j) + \frac{f}{2} \sum_{M(a_j) \neq M(a_i)} c(a_i, a_j) \qquad (1)$$

(note that $a_j$ may be local or remote depending upon the value of $M(a_j)$). Here, $h_i$ and $g_i$ represent the intra-machine and inter-machine communication load, respectively. The factor 2 is included to avoid double counting the inter-machine communication. The value of $w_i$ is computed statically by profiling different execution instances of the agent and measuring the running times [7]. Note that the communication cost $c(a_i, a_j)$ can be computed by each agent from the message sizes. The scaling factor $f$ (> 1) is system dependent and calculated from the network bandwidth of the system (in our implementation, the point-to-point bandwidth of the ATM network is 155 Mbps, so inter-machine communication is roughly an order of magnitude or more slower than intra-machine communication). The load $L_k$ of a machine $m_k$ is defined as the sum of the loads of all its local agents:

$$L_k = \sum_{M(a_i) = k} (w_i + u_i) \qquad (2)$$

The goal of a load balancing algorithm is to minimize the variance of load among all the machines in the cluster, which in turn reduces the average response time in serving users' queries.

4 The Comet Load Balancing Scheme

With the above definition of the load index, an overview of the proposed Comet load balancing algorithm is in order. Below we describe the important aspects of the Comet system.

Agent segregation: During the start-up phase of a multi-agent application, the Comet system first segregates the agents with predictable workload from the others.
Then, we map these agents to dedicated machines in a balanced manner. The remaining agents are assigned to the machines in a round-robin fashion and are subject to migration during the execution span of the application. Grouping the agents into two subclasses with predictable and unpredictable workloads reduces the number of agents migrating among the cluster nodes, since the agents with predictable workload are assigned statically. Migrating the agents with dynamically changing workload can reduce the communication overhead significantly, and each migration is made cheap by the checkpoint file, which has already been mirrored on neighboring machines.

Information policy: In the Comet system, we employ a distributed periodic information policy. Specifically, the machines in the system synchronize periodically and check their local aggregate load against the high ($T_H$) and low ($T_L$) load thresholds, both of which are computed during application start-up from the mean and standard deviation of the load. A centralized approach is also viable if one machine is assigned as the controller that collects load information from the other machines.

Transfer policy: After the load information collection phase, the machines perform a complete exchange to determine the machine with the highest load (provided that its load $L > T_H$). Agents then migrate from this sender machine to receiver machines so as to reduce its load below $T_H$. This completes one iteration, after which the machines perform a complete exchange again to determine the next sender machine. This process is repeated until no overloaded machine remains.

Migration policy: Each agent $a_i$ is associated with a credit $C_i$, defined as

$$C_i = x_1 w_i + x_2 h_i - x_3 g_i,$$

where $x_1$, $x_2$, and $x_3$ are positive, application-dependent real coefficients.
The rationale of this credit attribute is to capture the affinity of an agent to its machine: the intra-machine communication component contributes positively to the credit, whereas the inter-machine communication component contributes negatively. In the sender machine, the agent with the smallest credit is selected for migration, because such an agent spends a dominant amount of time communicating with remote agents and hence is a suitable candidate for migration in order to reduce the local load level. After migration, the heavy inter-machine communication becomes local communication in the receiver machine. Although some local communication in the sender machine also becomes inter-machine communication, the overall effect is still desirable because the sender's load is reduced.

Location policy: After a migrating agent is selected, we need to determine the target machine. In the Comet system, each agent $a_i$ keeps track of a $p$-element vector $V_i$ that stores the amount of communication between $a_i$ and the agents on every machine in the cluster (the local communication component is stored as the $M(a_i)$-th element). Specifically, each element of the vector is

$$v_s = \sum_{M(a_j) = s} c(a_i, a_j).$$

Suppose the $y$-th element ($y \neq M(a_i)$) is the largest remote element. Then machine $m_y$ is chosen as the receiver of the agent. The proposed Comet multi-agent load balancing scheme is specified by the following algorithm.

COMET LOAD BALANCING
1. Application start-up: segregate the agents with statically predictable load from the others. Map these agents to the machines in a load-balanced manner (note that multiprogramming is assumed). Map the remaining dynamic agents to the machines in a round-robin manner.
2. Do the following forever (with period $\tau$):
3.   repeat
4.     All machines participate in a complete exchange to elect the overloaded machine with the heaviest load.
5.     The most heavily loaded machine lets the agent $a_i$ with the lowest credit migrate; the target machine is indicated by the communication component vector $V_i$. This step is repeated until the machine's load falls below $T_H$. The credit values of the migrated agents are updated.
6.   until no overloaded machine exists

Notice that the location policy of the proposed scheme is novel in that a migrating agent implicitly specifies its own receiver machine. By contrast, most existing load balancing schemes require the system to determine a receiver, which is usually the machine with the lowest load. For multi-agent systems in which communication is the dominant event, choosing the lowest-load machine may not be a suitable strategy, because that machine does not necessarily reduce the inter-machine communication overhead, which is a dominant part of the aggregate load. The rationale for selecting the lowest-load machine as the receiver is to avoid thrashing. However, the Comet system is inherently free of thrashing because the total load across all machines is reduced after each migration.

The advantage of Comet lies in its communication efficiency. Agents are mostly reactive, continuous, mobile, communicative, collaborative, and light-weight software processes, and the embedded computation in each agent is minimal compared with the communication cost among agents. In this sense, Comet is specially tailored for multi-agent cluster computing systems, aiming at hiding communication among agents.
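The steps above can be condensed into a self-contained Python sketch of one balancing period (steps 3 to 6) that combines Eq. (1), Eq. (2), the credit $C_i$, and the vector $V_i$. It is an illustration under stated assumptions, not the authors' Java implementation: the agent graph is a dictionary of pairwise costs $c(a_i, a_j)$ with one entry per communicating pair, the placement $M$ is a dictionary from agent to machine index, $T_H$ is passed in (the paper derives it from the start-up mean and SD without giving the exact rule), and the coefficients default to $x_1 = x_2 = x_3 = 1$. The sketch migrates one agent per iteration and re-elects the heaviest machine each time, which folds the inner loop of step 5 into the outer repeat loop. The `max_moves` bound and the skipping of agents without remote traffic are safeguards added here, and all function names are hypothetical.

```python
def comm_vector(agent, comm, placement, p):
    """V_i: summed cost c(a_i, a_j) of the agent's traffic with each machine."""
    v = [0.0] * p
    for (a, b), cost in comm.items():
        if agent in (a, b):
            v[placement[b if a == agent else a]] += cost
    return v


def loads_and_terms(w, comm, placement, p, f):
    """Machine loads L_k (Eq. 2) and per-agent (h_i, g_i, V_i) (Eq. 1)."""
    loads, terms = [0.0] * p, {}
    for a in w:
        v = comm_vector(a, comm, placement, p)
        home = placement[a]
        h = v[home]                          # intra-machine load h_i
        g = (f / 2.0) * (sum(v) - v[home])   # scaled inter-machine load g_i
        terms[a] = (h, g, v)
        loads[home] += w[a] + h + g
    return loads, terms


def comet_round(w, comm, placement, p, f, t_high, x=(1.0, 1.0, 1.0), max_moves=100):
    """One balancing period (hypothetical sketch); returns a new placement."""
    placement = dict(placement)
    for _ in range(max_moves):               # safety bound added in this sketch
        loads, terms = loads_and_terms(w, comm, placement, p, f)
        sender = max(range(p), key=lambda k: loads[k])  # step 4
        if loads[sender] <= t_high:
            break                            # step 6: no overloaded machine
        candidates = []
        for a in w:
            if placement[a] != sender:
                continue
            h, g, v = terms[a]
            remote = [y for y in range(p) if y != sender and v[y] > 0]
            if remote:                       # location policy: largest remote v_y
                credit = x[0] * w[a] + x[1] * h - x[2] * g
                candidates.append((credit, a, max(remote, key=lambda y: v[y])))
        if not candidates:
            break                            # no sender agent has remote traffic
        _, agent, receiver = min(candidates) # step 5: lowest credit migrates
        placement[agent] = receiver
    return placement


# Hypothetical toy input: a2 talks mostly to a3, which lives on machine 1.
w = {"a1": 1.0, "a2": 2.0, "a3": 1.0}
comm = {("a1", "a2"): 1.0, ("a2", "a3"): 6.0}
start = {"a1": 0, "a2": 0, "a3": 1}
before, _ = loads_and_terms(w, comm, start, p=2, f=10.0)
after = comet_round(w, comm, start, p=2, f=10.0, t_high=25.0)
print(before, after)  # [35.0, 31.0] {'a1': 0, 'a2': 1, 'a3': 1}
```

In this toy run, the machine loads fall from [35.0, 31.0] to [6.0, 20.0]: moving a2 turns its heavy traffic with a3 into local communication, at the cost of making its light traffic with a1 remote. Recomputing $h_i$, $g_i$, and $V_i$ at every iteration stands in for updating the credits of migrated agents.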
Other cluster job management systems, in contrast, are better suited to balancing general workloads that involve more computation. For brevity, other analytical results are not shown here but can be found in [2].

5 Experimental Results and Interpretations

We have implemented the Comet system using Java on a Linux PC cluster. To evaluate the effectiveness of the system, several multi-agent financial analysis benchmark applications have been developed in the Linux cluster environment. We combine mirrored checkpointing [6] with the Comet strategy to balance benchmark workloads on two research clusters. Mirrored checkpointing avoids the transfer of agent states when agents migrate between neighboring machines. One Beowulf cluster has been built and tested at the High Performance Computing Research Laboratory at the University of Hong Kong; another Beowulf PC cluster has been built and tested at the Internet and Cluster Computing Laboratory at the University of Southern California. We performed a number of experiments using a cluster of four Pentium Pro PCs running Linux, each with 64 MB of memory. In all the experiments, we set the period of load monitoring and balancing to ten seconds (i.e., $\tau = 10$ seconds). For comparison, we also implemented a classical sender-initiated workload-based load balancing scheme (abbreviated as WBLB) [13]. In WBLB, the machines also synchronize periodically to identify sender (overloaded) and receiver (lightly loaded) cluster nodes. The sender node then transfers the agents with the greatest computational load to the receivers. Notice that such a classical load balancing method ignores communication load. We considered an irregularly structured financial application with 18 agents, executed under the control of the Comet system. We collected performance data over a period of approximately 1400 seconds, and the same process was repeated several times.
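For contrast, one WBLB transfer decision can be sketched as below. This is a hypothetical simplification, not the implementation used in the experiments: load is judged by computation $w_i$ alone, the receiver is assumed to be the least loaded machine, and threshold checks and transfer limits are omitted. The function name `wblb_step` is illustrative.

```python
def wblb_step(w, placement, p):
    """One WBLB transfer (sketch): move the agent with the largest w_i from the
    most loaded to the least loaded machine, ignoring communication entirely."""
    loads = [0.0] * p
    for a, m in placement.items():
        loads[m] += w[a]                      # computation-only load index
    sender = max(range(p), key=lambda k: loads[k])
    receiver = min(range(p), key=lambda k: loads[k])  # assumed receiver rule
    local = [a for a in w if placement[a] == sender]
    if sender == receiver or not local:
        return None                           # nothing to transfer
    agent = max(local, key=lambda a: w[a])    # greatest computational load
    return agent, sender, receiver


# Hypothetical toy input: a1 is compute-heavy, while a2 talks mostly to a3.
w = {"a1": 3.0, "a2": 1.0, "a3": 1.0}
placement = {"a1": 0, "a2": 0, "a3": 1}
print(wblb_step(w, placement, p=2))  # ('a1', 0, 1)
```

Here WBLB moves a1 because it has the largest computational load, whereas a credit-based choice would favor a2, whose traffic with a3 on machine 1 dominates its load. This difference in selection criteria is what the comparison below exercises.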
To evaluate the performance of the proposed algorithm, we computed the standard deviation (SD) of the load over time, normalized with respect to the mean load. The reason for normalization is that, as the multi-agent application proceeds, the load level may fluctuate because extra workload may be generated or destroyed on the fly, depending on the user queries. Thus, the normalized SD is a more accurate measure of the instantaneous degree of load balance. Figure 3(a) shows the normalized SD of the two algorithms over time. As can be seen, the proposed Comet system almost always resulted in a lower normalized load SD. Indeed, close scrutiny of the execution traces revealed that the WBLB scheme showed a moderate degree of thrashing: some agents were repeatedly transferred among the machines without actually being executed until the maximum number of transfers was reached. Apart from instability and overhead, an adverse effect is that the resulting load was usually not balanced, as illustrated by the wide range of load (normalized with respect to the mean load) shown in Figure 3(b). Using the same four-node Linux cluster, we varied the number of agents in the application from 12 to 24; the results are shown in Figure 3(c). The Comet system outperformed the WBLB scheme in all cases. These results indicate that the proposed credit-based load index and balancing strategy are effective for executing applications composed of dynamic light-weight agents. Similar results were obtained for a tree-structured application, as detailed in [2].
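The evaluation metrics can be sketched as follows, assuming one vector of per-machine loads per sampling instant. The paper does not say whether a population or sample SD was used; `pstdev` (population SD over the machines) is an assumption here, as are the helper names and the sample data.

```python
from statistics import mean, pstdev


def normalized_sd(loads):
    """SD of the machine loads divided by the mean load at one instant."""
    mu = mean(loads)
    return pstdev(loads) / mu if mu > 0 else 0.0


def normalized_range(loads):
    """(max, min) machine load divided by the mean load, as in Figure 3(b)."""
    mu = mean(loads)
    return max(loads) / mu, min(loads) / mu


# Hypothetical samples of four machine loads at two instants.
samples = [[8.0, 12.0, 8.0, 12.0], [10.0, 10.0, 10.0, 10.0]]
print([normalized_sd(s) for s in samples])      # [0.2, 0.0]
print(mean(normalized_sd(s) for s in samples))  # average normalized SD
print(normalized_range(samples[0]))             # (1.2, 0.8)
```

Averaging `normalized_sd` over all samples of a run gives a single figure per configuration, which is the kind of quantity plotted against the number of agents in Figure 3(c).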

Figure 3: (a) Normalized standard deviation of load across all cluster nodes over time, 0 to 1400 seconds (Comet vs. WBLB); (b) maximum and minimum normalized load levels over time for Comet and WBLB; (c) average normalized standard deviation of load across all machines over a time period of 1400 seconds with various numbers of agents (12 to 24).

6 Conclusions

A novel load balancing strategy is presented for multi-agent cluster computing systems. The proposed Comet system uses a load index that takes into account the effect of both intra-machine and inter-machine communication. This is important for agent-based systems because the embedded computation in each agent is minimal compared with the communication cost among agents. The load balancing algorithm is credit-based and works by selecting the agent with the heaviest inter-machine communication for migration. We have implemented the whole system using Java on a Linux PC cluster and tested it using several regular and irregular financial database applications. The experimental results indicate that the Comet system considerably outperforms an existing distributed load balancing algorithm.

References

[1] K.A. Arisha, F. Ozcan, R. Ross, V.S. Subrahmanian, T. Eiter, and S. Kraus, "Impact: A Platform for Collaborating Agents," IEEE Intelligent Systems, vol. 14, no. 2, pp. 64-72, Mar./Apr. 1999.
[2] K.-P. Chow, "Load Balancing in Distributed Multi-Agent Computing," M.Phil. Thesis, The University of Hong Kong, Aug. 1999.
[3] D.W. Duke, T.P. Thomas, and J.L. Pasko, "Research Toward a Heterogeneous Networked Computing Cluster: The Distributed Queueing System, Version 3.0," Florida State University, May 1994.
[4] D.H.J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Journal on Future Generations of Computer Systems, vol. 12, 1996.
[5] M. Harchol-Balter and A.B. Downey, "Exploiting Process Lifetime Distribution for Dynamic Load Balancing," Proc. ACM SIGMETRICS Conf., pp. 13-24, 1996.
[6] K. Hwang, H. Jin, E. Chow, C.-L. Wang, and Z. Xu, "Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space," IEEE Concurrency, vol. 7, no. 1, pp. 60-69, Jan./Mar. 1999.
[7] K. Hwang, C. Wang, C.-L. Wang, and Z. Xu, "Resource Scaling Effects on MPP Performance: The STAP Benchmark Implications," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 5, pp. 509-527, May 1999.
[8] K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.
[9] N.R. Jennings and M.J. Wooldridge (eds.), Agent Technology: Foundations, Applications, and Markets, Springer-Verlag, Berlin, 1998.
[10] A. Joshi, N. Ramakrishnan, and E.N. Houstis, "Multi-Agent System Support of Networked Scientific Computing," IEEE Internet Computing, vol. 2, no. 3, pp. 69-83, May/June 1998.
[11] N.M. Karnik and A.R. Tripathi, "Design Issues in Mobile-Agent Programming Systems," IEEE Concurrency, vol. 6, no. 3, pp. 52-61, July/Sept. 1998.
[12] G.F. Pfister, In Search of Clusters, 2nd ed., Prentice-Hall, 1998.
[13] N.G. Shivaratri, P. Krueger, and M. Singhal, "Load Distributing for Locally Distributed Systems," IEEE Computer, vol. 25, no. 12, pp. 33-44, Dec. 1992.
[14] P. Williams, "Dynamic Load Sharing within Workstation Clusters," Honours Dissertation in Information Technology, University of Western Australia, Oct. 1994.
[15] M.J. Wooldridge and N.R. Jennings, "Agent Theories, Architectures, and Languages: A Survey," Proc. ECAI-94 Workshop on Agent Theories, Architectures and Languages, Springer-Verlag, pp. 1-39, 1995.
[16] L.G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, vol. 33, no. 8, pp. 103-111, Aug. 1990.