Parallel CBIR implementations with load balancing algorithms


1 J. Parallel Distrib. Comput. 66 (2006) Parallel CBIR implementations with load balancing algorithms José L. Bosque a,, Oscar D. Robles a, Luis Pastor a, Angel Rodríguez b a Dpto. de Informática, Estadística y Telemática, U. Rey Juan Carlos, C. Tulipán, s/n, Móstoles, Madrid, Spain b Dept. de Tecnología Fotónica, UPM, Campus de Montegancedo s/n, Boadilla del Monte, Spain Received 22 December 2003; received in revised form 23 February 2005; accepted 7 April 2006 Available online 13 June 2006 Abstract The purpose of content-based information retrieval (CBIR) systems is to retrieve, from real data stored in a database, information that is relevant to a query. When large volumes of data are considered, as it is very often the case with databases dealing with multimedia data, it may become necessary to look for parallel solutions in order to store and gain access to the available items in an efficient way. Among the range of parallel options available nowadays, clusters stand out as flexible and cost effective solutions, although the fact that they are composed of a number of independent machines makes it easy for them to become heterogeneous. This paper describes a heterogeneous cluster-oriented CBIR implementation. First, the cluster solution is analyzed without load balancing, and then, a new load balancing algorithm for this version of the CBIR system is presented. The load balancing algorithm described here is dynamic, distributed, global and highly scalable. Nodes are monitored through a load index which allows the estimation of their total amount of workload, as well as the global system state. Load balancing operations between pairs of nodes take place whenever a node finishes its job, resulting in a receptor-triggered scheme which minimizes the system s communication overhead. Globally, the CBIR cluster implementation together with the load balancing algorithm can cope effectively with varying degrees of heterogeneity within the cluster; the experiments presented within the paper show the validity of the overall strategy. Together, the CBIR implementation and the load balancing algorithm described in this paper span a new path for performant, cost effective CBIR systems which has not been explored before in the technical literature Elsevier Inc. All rights reserved. Keywords: Parallel implementations; CBIR systems; Load balancing algorithms 1. Introduction The tremendous improvements experimented by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the best information a very hard task. CBIR 1 systems try to solve this problem by offering mechanisms for selecting the data items which resemble most a specific query among all the available information [12,34], although the complexity Corresponding author. addresses: joseluis.bosque@urjc.es (J.L. Bosque), oscardavid.robles@urjc.es (O.D. Robles), luis.pastor@urjc.es (L. Pastor), arodri@dtf.fi.upm.es (A. Rodríguez). 1 Content-based information retrieval. of this task depends heavily on the volume of data stored in the system. As usual, parallel solutions can be used to alleviate this problem, given the fact that the search operations present a large degree of data parallelism. Distributed solutions on clusters offer a good cost/performance ratio to solve this problem, given their excellent scalability, fault tolerance and flexibility attributes [37,6,4]. 
Also, this architecture allows concurrent access to disks, which are considered the main bottleneck in CBIR systems. Although homogeneous clusters could also be considered for this application, it is difficult to preserve the homogeneity of this type of system throughout its life cycle. Among the factors that affect their configuration stability we can mention the addition of new nodes or the substitution of faulty ones, technological evolution, and even exploitation aspects such as disk fragmentation. In consequence, clusters present additional challenges, since they

2 J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) can easily become heterogeneous, requiring load distributions that take into consideration each node s computational features [33]. This way, one of the critical parameters to be fixed in order to keep the efficiency high for this architectures is the workload assigned to each of the cluster nodes. Even though load balancing has received a considerable amount of interest, it is still not definitely solved, particularly for heterogeneous systems [10,18,41,45]. Nevertheless, this problem is central for minimizing the applications response time and optimizing the exploitation of resources, avoiding overloading some processors while others are idling. This paper describes the architecture, implementation and performance achieved by a parallel CBIR system implemented on a heterogeneous cluster that includes load balancing. The flexibility of the architecture herein presented allows the dynamical addition or removal of nodes from the cluster between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows a dynamic management of specific databases that can be incorporated to or removed from the CBIR system in function of the desired user query. The heterogeneity of the system is managed by a new dynamic and distributed load balancing algorithm, introducing a new load index that takes into account the computational nodes capabilities and a more accurate measure of their workload. The proposed method introduces a very small system overhead when departing from a reasonably balanced starting point. As mentioned before, the amount of data to be managed in CBIR systems is so huge nowadays that it is almost mandatory to use parallelism in order to achieve a reasonable user response times. Two alternatives were tested in a previous work: a shared-memory multiprocessor and a cluster [6]. Since the cluster implementation has given better results, it seems advisable to introduce load balancing strategies to improve the efficiency in heterogeneous clusters. The selected approach is based on a dynamic, distributed, global and highly scalable load balancing algorithm. An heterogeneous load index based on the number of running tasks and the computational power of each node is defined to determine the state of the nodes. The algorithm automatically turns itself off in global overloading or under-loading situations. Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for performant, cost effective CBIR systems which has not been explored before in the technical literature. The rest of this article is organized as follows: Section 2 presents an overview of parallel CBIR systems and load balancing algorithms. Section 3 presents an analysis of a sequential version of the CBIR algorithm and a brief description of its parallel implementation on a cluster (without load balancing). Section 4 describes the distributed load balancing algorithm applied to the parallel CBIR system and Section 5 details its implementation on a heterogeneous cluster. Section 6 shows the tests performed in order to measure the improvement achieved by the heterogeneous cluster version with load balancing and the results achieved. Finally, Section 7 presents the conclusions and ongoing work. 2. 
Previous work The technological development experimented during the last 20 years has turned into a spectacular increase in the volume of data managed by information systems. This fact has lead to the search for methods to automate the process of extracting structured information from these systems [12,31]. The potential importance of CBIR systems has been reflected in the variety of approaches taken while dealing with different aspects of CBIR systems. The multidisciplinary nature of this problem has often resulted in partial advances that have been integrated later on in new prototypes and commercial systems. For example, it is possible to find research work that takes into consideration man machine interaction issues [32]; the users behavior from a psychological modeling standpoint [27]; multidimensional indexing techniques [5]; multimedia database management system issues [19]; pattern recognition algorithms [17]; multimedia signal processing [39]; object representation and modeling techniques [21]; benchmarks for testing the performance of CBIR systems [16,24]; etc. In any case, most of the research effort for CBIR systems has been focused on the search for powerful representation techniques for discriminating elements among the global database. Although the data nature is a crucial factor to be taken into consideration, most often the final representation is a feature vector 2 extracted from the raw data, which reflects somehow its content. While dealing with 2D images, it is possible to find techniques using color, shape, or texture-based primitives. Other techniques use spatial relationships among the image components or a combination of the above-mentioned approaches. For higher-dimensionality input data, it is possible to find proposals dealing with 3D images or video sequences. Nowadays, one of the most promising research lines is to increase the abstraction level of the semantics associated to the primitives managed, representing high-level concepts derived from the images or the multimedia data. From the computational complexity point of view, CBIR systems are potentially expensive and have user response times growing with the ever-increasing sizes of the databases associated to them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms inherent parallelism at implementation time. However, the novelty of CBIR systems hinders finding references dealing with this aspect. Some contributions that can be cited are Zaki s compilation [43], and the contributions of Srakaew et al. [37] and Bosque et al. [6]. Another reason that has made difficult widespread parallel CBIR system development is that prototype analysis demands a manual image classification stage that limits in practice the number of images used in the tests. Nevertheless, the volume of data managed by current DBs, and obviously those with multimedia information, will demand parallel optimizations for commercial implementations of CBIR systems. In those cases, load balancing operations preventing the coexistence of idling and overloaded 2 Named signature or primitive.

3 1064 J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) processors will be almost required, since total response times are usually considerably improved with the introduction of even simple load balancing approaches. Load balancing techniques can be classified according to different criteria [8]. First, algorithms can be labeled as static or dynamic. Static methods perform workload distribution at compilation time, not taking into consideration the system state variations. Dynamic methods are able to redistribute workload among nodes at run time, depending on changes in the system state. The work of Rajagopalan et al. [28] and Obeloer et al. [25] are agent-based techniques. These are flexible and configurable approaches but the amount of resources needed for agent implementation is considerably large. Grosu et al. [15] present a very different cooperative approach to the load balancing problem, considering it as a game in which each cluster node is a player and must minimize its job execution time. Banicescu et al. propose a load balancing library for scientific applications on distributed memory architectures. The library integrates dynamic loop scheduling as an object migration policy with the object migration mechanism provided by the data movement and control substrate which is extended with a mobile object layer [2]. Load balancing algorithms can also be classified as centralized or distributed. In the first case, there is a single central node in charge of keeping the system s information updated, making decisions and actually performing the load balancing operations. In distributed methods, every node takes part in the load balancing operations; Zaki et al. [44] show that distributed algorithms yield better results than their centralized counterparts. Last, load balancing algorithms can be classified as global or local. In the first case, a global view of the system state is kept [10]. In the second case, nodes are arranged in sets or domains, and distribution decisions are made only within each domain [9,40]. Other approaches mix this taxonomy by combining several features that could be considered mutually exclusive, like the work of Ahmad and Ghafoor [1], where a semidistributed algorithm with a two level hierarchy is presented; their work focus on static networks where communication latency is very important and depends on node placement. In this type of networks, distributed algorithms may produce instability, scalability and bottleneck problems. The improvement of dynamic network technologies solves these problems with broadcast solutions and very low latencies. The technique proposed by Ahmad and Ghafoor [1], although interesting, is not easily applicable to general, unrestricted distributed systems: it was developed for static network environments, where latency is dependent on node location and where broadcast operations are very costly in terms of system performance. Clusters, which in the present work appear as a very attractive option for CBIR systems in terms of cost/performance ratio, present very different communication features, and therefore, advise using a different approach. 
Although a set of projects have been developed to implement CBIR systems on clusters like the IRMA project for medical images [14] and the DISCOVIR project (distributed contentbased visual information retrieval system on peer-to-peer network) [13], none of them include a load balancing algorithm to distribute the workload of the cluster nodes and therefore they cannot manage system heterogeneity. 3. CBIR system description The experimental work presented in this paper has been performed on a test CBIR system containing information from 29.5 million color pictures. The system provides the user with a data set containing the p images considered most similar to the query one. If the result does not satisfy the user, he/she can choose one of the selected images or enter a new one that presents some kind of similarity with the desired image. The following sections describe the heart of the CBIR system, where the signature is extracted from each image (a feature vector describing the image content), as well as the processes involved in serving a user s query. More detailed analysis of the retrieval techniques involved in the CBIR system and the method s stages from the standpoint of parallel optimization can be found elsewhere [30,29,6], respectively Signature computation Many different approaches can be used for computing the images signatures, as mentioned in Section 2. In the work presented here, a primitive that represents the color information of the original image at different resolution levels has been selected. To achieve a multiresolution representation, a wavelet transform is first applied to the image [22,11] Analysis of the sequential CBIR algorithm The search for images contained in a CBIR system can be broken down into the following stages: (1) Input/query image introduction: The user first selects a pixel bidimensional image to be used as a search reference. Then the system computes its signature as described above. The whole process can be efficiently implemented using an O(i_s) order algorithm, i_s being the image s size [38]. This stage does not require high computational resources since the system deals with just one image. (2) Query and DB image s signature comparison and sorting: The signature obtained in the previous stage is compared with all of the DB images signatures using an Euclidean distance-based metric. After this process, the identifiers of the p images most similar to the input image are extracted, ranked by their similarity. Even though this process of signature comparison, selection and ranking is not very demanding from the computational point of view, it has to be performed with all of the images within the DB. (3) Results display: The following step is to assemble a mosaic made up of the selected p images which has to be presented to the user as the search result (see Fig. 1). (4) Query image update: If the user considers the search result to be unsatisfactory, he may select one of the displayed images as a new input and then return to the first stage.
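To make stage (2) of the list above concrete, the following minimal C sketch compares a query signature against a set of database signatures using a squared Euclidean distance and keeps the identifiers of the p closest ones. It is only an illustration of the ranking described above, not the authors' code: the signature length SIG_LEN, the value of P_BEST and the array layout are assumptions of this example.

    #include <float.h>
    #include <stddef.h>

    #define SIG_LEN 84          /* hypothetical signature length          */
    #define P_BEST  20          /* number of results returned to the user */

    /* Squared Euclidean distance between two signatures (monotonic in the
     * true distance, so it ranks identically and avoids the square root). */
    static double sig_distance(const float *a, const float *b)
    {
        double d = 0.0;
        for (int i = 0; i < SIG_LEN; i++) {
            double diff = (double)a[i] - (double)b[i];
            d += diff * diff;
        }
        return d;
    }

    /* Scan all database signatures and keep the P_BEST closest ones,
     * maintaining a small sorted list in the spirit of the repeatedly
     * sorted set P_j used by the slave processes in Section 3.3. */
    void rank_query(const float *query, const float *db, int num_sigs,
                    int best_id[P_BEST], double best_dist[P_BEST])
    {
        for (int i = 0; i < P_BEST; i++) { best_id[i] = -1; best_dist[i] = DBL_MAX; }

        for (int k = 0; k < num_sigs; k++) {
            double d = sig_distance(query, db + (size_t)k * SIG_LEN);
            if (d < best_dist[P_BEST - 1]) {              /* better than current worst */
                int pos = P_BEST - 1;
                while (pos > 0 && best_dist[pos - 1] > d) { /* shift worse entries down */
                    best_dist[pos] = best_dist[pos - 1];
                    best_id[pos]   = best_id[pos - 1];
                    pos--;
                }
                best_dist[pos] = d;
                best_id[pos]   = k;
            }
        }
    }

The same loop runs unchanged on each slave in the parallel version described next; only the subset of database signatures it scans differs.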

4 J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) Fig. 1. Visual result of a query. Upon observing the operations involved, it is possible to notice that the comparison and sorting stage involves a much larger computational load than the others. Luckily, the exploitation of data parallelism can be done just by dividing the workload among n independent nodes, since there are no dependencies. This can be accomplished by distributing off-line the CBIR image s signatures across the processing nodes. Then, each node can compare the image query s signature with every available signature. In order to ease also the storage requirements, it is possible to distribute images, signatures and computation over all of the n available nodes Parallel implementations without load balancing Global strategy A remarkable feature of the signature comparison and sorting stage is the problem s fine granularity: it is possible to perform an efficient data-oriented parallelization by combining the signature comparison and sorting stages, and distributing among the different nodes only the data needed to perform this stage, which are the signatures of the DB images assigned to each node as well as a scalar defining the total number of signatures to be returned, p. It has to be noted that the amount of communications among the corresponding processes is very small, since only the input image s signature and the p identifiers from the most similar images which have been found at each node, together with their corresponding similarity measures, have to be exchanged among the processes involved, as we will see below. The programmed optimization strategy is based on a farm model, in which a master process distributes the data to be dealt with upon a set of slave processes which analyze the data and return the partial results to the master once they have finished their computations. Since this approach makes it possible to maintain a large degree of data handling locality, it is well suited for distributed memory multiprocessors with message passing communication. Further advantages of this solution are its good price/performance ratio and its high level of scalability, whenever the number of images stored in the database is increased. In our case, the following solution has been adopted: (1) The master process computes the signature of the input image and broadcasts it to the n slave processes. (2) The slave processes then proceed to compare the signature of the input image with the signatures of the images assigned to their corresponding process node. Once each comparison has been performed, a check is then carried out to ascertain whether the result obtained is one of the best p images and, should that be the case, it is then incorporated into the set which is repeatedly sorted using a bubble sorting algorithm. (3) The slave processes forward the p image identifiers and similarity measurements to the master process after comparing and selecting the p images which are most similar within each process node. (4) The master process collects the similarity results obtained from each of the n process nodes and sorts the n p similarity results, truncating the sort so as to include only the best p.

Fig. 2. Process communication in the cluster implementation without load balancing (the master broadcasts the query image signature, plus selected image identifiers if any, to slaves 1..n, and each slave returns its list of the p most similar images together with any requested images to show).

(5) Finally, the master process requests the process nodes that contain the previously selected images to forward them so that they may be presented to the user and, once available, proceeds to compose a mosaic that is then displayed to the user.

Fig. 2 represents a schematic diagram of the communication between the processes involved in the unbalanced system. It must be noticed that each node of the heterogeneous cluster runs two processes: a master to attend the user queries and a slave to provide the local results achieved by each process node to the master process of the cluster node where the query has been generated. This situation is very similar to that found on a grid.

3.3.2. MPI cluster implementation

The application has been programmed using the MPI libraries as communication primitives between the master and slave processes. MPI has been selected given that it currently constitutes a standard for message passing communications on parallel architectures, offering a good degree of portability among parallel platforms [23]. The MPI version used is LAM, from the Laboratory for Scientific Computing of Notre Dame University, a free distribution of MPI [36]. The pseudo-code corresponding to the implementation of the master and slave processes is shown below.

Master:
  loop
    Request an image from the user
    Compute its signature
    Forward the signature to each of the n slave processes using the MPI_BCAST (broadcast) primitive
    Receive the results of the n slave processes using the MPI_RECV (receive) primitive
    Sort the partial n·p comparisons, selecting the top p
    Request the p most similar images from the slave processes where the corresponding images are stored
    Receive the p most similar images from the nodes containing them using the MPI_RECV primitive
    Compose the mosaic to be presented to the user
  end loop

Slave j (1 ≤ j ≤ n), M being the number of images stored in process node j:
  loop
    Receive the signature of the query image forwarded from the master using the MPI_BCAST (reception from a previous broadcast) primitive
    Initialize the P_j set which shall contain the p best results of the comparisons
    for k = 1 to M do
      Find the signature of image k
      Compare the query signature with that of the current image, obtaining the similarity measurement ms_jk
      if ms_jk improves on the worst result currently in P_j then
        Eliminate the worst result of P_j
        Incorporate the result corresponding to image k into P_j
        Sort P_j using a bubble sorting algorithm
      end if
    end for
    Forward P_j to the master
    if the master requests images to compose the mosaic then
      Forward the requested images
    end if
  end loop
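The pseudo-code above maps almost directly onto MPI calls. The sketch below is a hedged illustration of the master's communication pattern for one query only; error handling, the mosaic composition and the signature comparison itself are omitted, and names such as SIG_LEN, P_BEST and the (id, distance) result layout are assumptions of the example, not the paper's actual data structures.

    #include <mpi.h>
    #include <stdlib.h>

    #define SIG_LEN 84   /* assumed signature length      */
    #define P_BEST  20   /* results requested per query   */

    /* Master side of one query: broadcast the signature, then gather each
     * slave's P_BEST (distance, identifier) pairs for the final merge.
     * Rank 0 is assumed to be the master; slaves are ranks 1..nslaves. */
    void master_query(float *query_sig, int nslaves, MPI_Comm comm)
    {
        /* Broadcast the query signature to every slave. */
        MPI_Bcast(query_sig, SIG_LEN, MPI_FLOAT, 0, comm);

        double *dist = malloc((size_t)nslaves * P_BEST * sizeof(double));
        int    *ids  = malloc((size_t)nslaves * P_BEST * sizeof(int));

        for (int s = 1; s <= nslaves; s++) {
            MPI_Recv(dist + (s - 1) * P_BEST, P_BEST, MPI_DOUBLE, s, 0, comm, MPI_STATUS_IGNORE);
            MPI_Recv(ids  + (s - 1) * P_BEST, P_BEST, MPI_INT,    s, 1, comm, MPI_STATUS_IGNORE);
        }

        /* ... merge the nslaves*P_BEST partial results, keep the global top
         *     P_BEST and request the corresponding images from the nodes
         *     that store them ... */

        free(dist);
        free(ids);
    }

On the slave side, the matching MPI_Bcast call with the same root receives the signature, and two MPI_Send calls return the partial results; in the actual system the exchange also includes requests for the selected images themselves.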

The size of the data corresponding to each one of the p best results that are transferred from every slave to the master is around 336 bytes. Therefore, each slave process transfers 336·p bytes per query. For example, for p = 20 and n = 25, the traffic involved in the response will be less than 165 kB (these figures do not take into consideration either the data corresponding to the images presented to the user or the overheads originated by the communication primitives, although the latter could be considered negligible).

4. Description of the load balancing algorithm

A dynamic, distributed, global and highly scalable load balancing algorithm has been developed for the CBIR application and tested with the CBIR parallel application previously described. A more detailed description of the load balancing algorithm can also be found in [7]. A load index based on the number of running tasks and the computational power of each node is used to determine the nodes' state, which is exchanged among all of the cluster nodes at regular time intervals. The initiation rule is receiver-triggered and based on workload thresholds. Finally, the distribution rule takes into account the heterogeneous nature of the cluster nodes as well as the communication time needed for the workload transmission in order to divide the amount of workload between a pair of nodes in every load balancing operation. These ideas are detailed along the following sections.

4.1. State rule

The load balancing algorithm is based on a load index which estimates how loaded a node is in comparison to the rest of the nodes that compose the cluster. Many approaches can be taken to compute the load index. Like in any estimation process, it is necessary to find a trade-off between accuracy and cost, since keeping frequently updated node rankings according to their workload might be costly. The index is based on the number of tasks in the run-queue of each CPU [20]. These data are exchanged among all of the nodes in the cluster to update the global state information. Moreover, each node takes into account the following information about the rest of the cluster nodes:

- Cluster heterogeneity: each node can have a different computational power P_i, so this factor is an important parameter to take into account when computing the load index. It is defined as the inverse of the time taken by node i to process a single signature.
- Total amount of workload for each node: it is evaluated when the application begins its execution and it is updated if there are any changes in a node.
- Percentage of the workload performed by each node, W_i: it is defined as a function of the total workload, the computational power and the number of tasks in this node.
- Period of time from the last update, D, and total execution time, T.

Therefore, the updates of the number of tasks are performed as

    N_ave = ((N_last · T) + (N_cur · D)) / (T + D),    (1)

where N_cur is the number of currently running tasks in the node, N_last is the average number of tasks running since the last update, T is the total execution time of the N_last tasks considered, and D is the interval of time since the last update. This expression gives the average number of tasks of the node during the execution time of the application. So, the percentage of workload processed in each node, W_i, is evaluated as

    W_i = (P_i · T) / (W · N_ave) · 100,    (2)

where W denotes the total workload (number of signatures) assigned to the node.
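As an illustration of Eqs. (1) and (2), the small C sketch below shows how a load daemon might maintain the time-averaged number of tasks and the percentage of processed workload. The structure fields and function names are invented for this example, and sampling the run-queue length is assumed to happen elsewhere in a platform-specific way.

    /* Per-node state used by the load index of Section 4.1 (names assumed). */
    struct node_state {
        double n_last;    /* average number of tasks since the start       */
        double t_total;   /* total execution time accounted so far (s)     */
        double power;     /* P_i: signatures processed per second          */
        double workload;  /* W: total number of signatures assigned        */
    };

    /* Eq. (1): fold the current run-queue length n_cur, observed over the
     * last interval of d seconds, into the running average. */
    void update_task_average(struct node_state *s, double n_cur, double d)
    {
        s->n_last   = (s->n_last * s->t_total + n_cur * d) / (s->t_total + d);
        s->t_total += d;
    }

    /* Eq. (2): percentage of this node's workload already processed. */
    double workload_percentage(const struct node_state *s)
    {
        return (s->power * s->t_total) / (s->workload * s->n_last) * 100.0;
    }

In the paper's implementation these values are refreshed at fixed intervals and broadcast to the other load daemons, as described next.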
4.2. Information rule

Given that the load balancing approach described here is dynamic, distributed and global, every node in the system needs updated information about how loaded the remaining system nodes are [42]. The selected information rule is periodic: each node broadcasts its own load index to the rest of the nodes at specific time instants. A periodic rule is necessary because each node has to compute the amount of workload processed by the rest of the cluster nodes, based on the average number of tasks per node. To evaluate the average number of tasks, the information has to be updated periodically, which makes other information rules, such as event-driven or on-demand rules, unsuitable.

4.3. Initiation rule

The initiation rule determines the instants at which load balancing operations are performed. It is a receiver-initiated rule, where load balancing operations involve pairs of idling and heavily loaded nodes: whenever a processor finishes its assigned workload, it looks for a busy node and asks it to share part of its remaining workload. Since each node keeps information about the amount of pending work of the remaining nodes, the selection of busy nodes is simple. The initiation rule described above minimizes the number of load balancing operations, reducing the algorithm overhead. Also, all the operations improve the system performance, because the total response times of the nodes involved in the load balancing operation are equalized, provided that there are no additional changes in their state and they are not involved in other load balancing operations.

4.4. Load balancing operation

The load balancing operation is broken down into three phases: first, it is necessary to find an adequate node which will provide part of its workload (localization rule). Then, the amount of workload to be transferred has to be computed (distribution rule). Finally, the workload has to be actually transferred.

4.4.1. Localization rule

Whenever a node finishes its workload, it looks for a sender node to start a load balancing operation. The receiver node checks the state of the rest of the cluster nodes and computes a node list, ordered by the amount of pending work. To select the sender node, the receiver checks its own position in the list and selects the node which is in the symmetric position; for example, if nodes are ranked according to their workload, the least loaded node will look for the most loaded one, the second least loaded node will look for the second most loaded, and so on. In consequence, each pair of sender and receiver nodes will have, between both of them, a similar amount of workload. Apart from being very simple to implement, this approach gives good results, since whenever a node finishes its work it is placed at one end of the list, selecting a heavily loaded node (at the other end of the list). This way, the selection of the sender node is very coherent: the underloaded nodes take workload from the overloaded nodes, while the nodes in the middle positions of the list do not receive a load balancing request (since it is very unlikely that a node placed in an intermediate position starts a load balancing operation). Additionally, if several nodes are looking for a sender at the same time, it is unlikely that they address their requests to the same sender. This way, situations where one loaded node receives several load balancing petitions while the rest of the loaded nodes do not receive any are avoided. Finally, this approach is not time consuming, because the nodes always have up-to-date state information with which to build their own list.

Whenever a node receives a load balancing request, it can accept or reject it. In order to accept it, the sender node should have a minimum amount of work left. Otherwise, the sender node is close to completing its workload and the cost of the load balancing operation can be higher than finishing the remaining workload locally. In that case, the receiver node will select another node from the list using the same procedure, until an adequate node is found or the end of the list is reached.
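As a rough illustration of this symmetric pairing, the C fragment below sorts a copy of the state table by pending work and picks the partner in the mirror position of the requesting node. It is a sketch only: the table layout, the pending-work field and the fixed-size local copy are assumptions of the example, not the paper's data structures.

    #include <stdlib.h>
    #include <string.h>

    struct node_info {
        int    id;            /* node identifier                  */
        double pending_work;  /* signatures still to be processed */
    };

    /* Sort by pending work, most loaded first. */
    static int by_pending_desc(const void *a, const void *b)
    {
        const struct node_info *x = a, *y = b;
        if (x->pending_work < y->pending_work) return  1;
        if (x->pending_work > y->pending_work) return -1;
        return 0;
    }

    /* Return the identifier of the sender chosen by node self_id,
     * i.e. the node in the position symmetric to its own in the list. */
    int choose_sender(const struct node_info *table, int n, int self_id)
    {
        struct node_info sorted[64];          /* assume n <= 64 for the sketch */
        memcpy(sorted, table, n * sizeof(*table));
        qsort(sorted, n, sizeof(*sorted), by_pending_desc);

        int my_pos = 0;
        for (int i = 0; i < n; i++)
            if (sorted[i].id == self_id) { my_pos = i; break; }

        return sorted[n - 1 - my_pos].id;     /* symmetric position */
    }

In the real algorithm the chosen sender may still reject the request if its remaining work is below the acceptance threshold, in which case the next candidate in the list is probed.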
4.4.2. Distribution rule

The distribution rule computes the amount of work that has to be moved from the sender to the receiver node. An appropriate rule should take into consideration the relative capabilities and availabilities of the nodes, so that they finish processing their jobs at the same time (provided that no additional operations change their processing conditions). The communication time needed to transfer the workload between the nodes is also taken into consideration, because the receiver node cannot run the newly assigned task until it receives the corresponding load, which introduces an additional delay. The global equilibrium is obtained through successive operations between pairs of nodes.

The proposed distribution rule is based on two parameters: the number of running tasks NT_i and the computational power P_i of the nodes which take part in the operation. This reflects the fact that the contribution of a powerful node might be hampered by a large amount of external workload. Both parameters are combined into the node's actual computational power, Pact_i, which is obtained as

    Pact_i = P_i / NT_i.    (3)

This is a multi-phase application with two different phases: comparison and sorting. Once the load balancing operation is finished, the sender node has to finish the comparison phase with the remaining workload. Then, it must sort all the processed workload. The receiver should compare and sort the new workload. Additionally, the communication time has to be taken into account, because the receiver cannot continue processing until it receives the new workload. The distribution rule is then determined by the following expressions:

    T_s = W_s / Pact_s + (W − W_r) / Pact_s,
    T_r = W_r / Pact_r + W_r / Pact_r + W_r / P_c,
    W = W_s + W_r,    (4)

where T_s and T_r are the response times of the sender and receiver processors, measured from the moment the load balancing operation is finished; W is the total workload of the sender which has not been processed yet, W_s is the workload remaining in the sender node after the load balancing operation, and W_r is the workload sent to the receiver. Pact_s and Pact_r are the current computational powers of the sender and the receiver. Finally, P_c is the communication power, expressed in units of workload per second. The communication power is obtained by computing off-line the number of signatures that can be exchanged between two of the cluster nodes per second. This model relies on two assumptions: the computational power of a node is the same in both the comparison and sorting phases, and the response time in both phases as well as the communication time are linear with respect to the workload. Equating T_s and T_r and solving these expressions, the sender and receiver workloads can be computed as

    W_r = (2 · W · Pact_r · P_c) / (2 · P_c · Pact_s + 2 · Pact_r · P_c + Pact_s · Pact_r),
    W_s = W − W_r.    (5)

The values of both workloads, W_r and W_s, take into account the heterogeneity of the nodes, their current state, the communication times and the two different phases of the application. In consequence, the load balancing algorithm described is dynamic, being able to redistribute workload among nodes at run time depending on how the system state changes. It is also distributed, because every node takes part in the load balancing operations. And finally, it is global, because a global view of the system state is always kept. The following section describes the implementation of this algorithm on a CBIR system running on a heterogeneous cluster.
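To close the description of the algorithm, the following small C sketch applies Eqs. (3)-(5) directly. It is a hedged illustration with invented parameter names; in the real implementation the quantities are counts of signatures and the resulting split is negotiated between the two distribution processes over MPI.

    /* Distribution rule of Section 4.4.2: split the sender's pending
     * workload w between sender and receiver so that both finish together. */
    struct split { double w_sender; double w_receiver; };

    /* p_s, p_r: raw computational powers; nt_s, nt_r: numbers of running
     * tasks; p_c: communication power (workload units per second);
     * w: the sender's pending workload. */
    struct split distribution_rule(double p_s, double nt_s,
                                   double p_r, double nt_r,
                                   double p_c, double w)
    {
        double pact_s = p_s / nt_s;                 /* Eq. (3) */
        double pact_r = p_r / nt_r;

        double w_r = (2.0 * w * pact_r * p_c) /
                     (2.0 * p_c * pact_s + 2.0 * pact_r * p_c
                      + pact_s * pact_r);           /* Eq. (5) */

        struct split s = { .w_sender = w - w_r, .w_receiver = w_r };
        return s;
    }

As a sanity check, for two nodes with equal actual powers and a very fast link (p_c much larger than both), w_r tends to w/2, i.e. the pending work is simply halved, as one would expect.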

5. Distributed load balancing implementation on a heterogeneous cluster

5.1. Process structure of the load balance implementation

Two replicated processes are distributed among each one of the cluster nodes:

(1) Load daemon: this process implements both the state and the information rules.
(2) Distribution daemon: it collects requests from slave nodes demanding workload and proceeds with the transference.

Fig. 3 shows a decomposition of all the actions that must be carried out when a slave node finishes its local workload and triggers the initiation rule. First, it requests new load from the distribution process, which obtains the demanded load and sends it to the slave node so that the computation can continue. The following section describes the structure of the groups of processes and their functions.

Fig. 3. General overview of the whole load balancing algorithm (the flow shown in the figure: the local distribution process receives the slave's request, asks the load daemon for a sender node's identifier, which the daemon selects by sorting its state table, sends a load request to the selected node's distribution process, receives the amount of load and the load itself, stores it, and finally sends the execution order to the slave).

5.2. Groups of processes

As mentioned in Section 3.3.2, communication and synchronization between processes are based on MPI. A structure of groups of processes based on communicators [23,26,35] has been implemented, where the groups make it possible to establish communication structures between processes and to use global communication functions over subsets of processes. This way, each type of process belongs to its own group:

- MPI_COMM_MS: this group is composed of the master process and all of the slave processes.
- MPI_COMM_DIST: this group is formed by the distribution processes.
- MPI_COMM_LOAD: this group is composed of the load daemons of each of the nodes.

The group concept is the most natural way to implement this process scheme, because most often the messages transmitted involve processes that belong to the same group. Fig. 4 presents this communication hierarchy.
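A minimal way to build such groups is MPI_Comm_split over MPI_COMM_WORLD, using the process role as the colour. The sketch below only illustrates that idea; the role numbering and the way roles are assigned to ranks are assumptions of the example, not necessarily the paper's actual scheme.

    #include <mpi.h>

    enum role { ROLE_MASTER_SLAVE = 0, ROLE_DIST = 1, ROLE_LOAD = 2 };

    /* Split MPI_COMM_WORLD into the three per-role communicators described
     * above: master+slaves, distribution daemons and load daemons. A process
     * that does not belong to a given group obtains MPI_COMM_NULL for it. */
    void build_groups(enum role my_role,
                      MPI_Comm *comm_ms, MPI_Comm *comm_dist, MPI_Comm *comm_load)
    {
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        MPI_Comm_split(MPI_COMM_WORLD,
                       my_role == ROLE_MASTER_SLAVE ? 0 : MPI_UNDEFINED,
                       world_rank, comm_ms);
        MPI_Comm_split(MPI_COMM_WORLD,
                       my_role == ROLE_DIST ? 0 : MPI_UNDEFINED,
                       world_rank, comm_dist);
        MPI_Comm_split(MPI_COMM_WORLD,
                       my_role == ROLE_LOAD ? 0 : MPI_UNDEFINED,
                       world_rank, comm_load);
    }

The same effect can also be obtained with MPI_Group_incl followed by MPI_Comm_create; the split-by-colour form is simply the shortest way to express it.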

5.3. Load daemon

The main function of this process is to compute the local load index, to send this information to the load daemons of the other nodes and to transmit all of the information available to the local distribution process whenever it is required to do so. It is also in charge of initializing and managing a table that stores the state of the other nodes. The table stores the following information for each of the nodes: computational power, average number of active tasks while the application is running, percentage of completed work, time of the last update, total execution time with some load level, and number of signatures to be processed.

At predetermined fixed intervals, the process evaluates the load index of the node where it is running and sends the state information to the other nodes. The rest of the time it remains blocked waiting for messages from other processes; its functionality depends on the received messages. Table 1 summarizes the messages involved with the load daemon and the tasks associated with each one.

Fig. 4. Group scheme (MPI_COMM_WORLD contains MPI_COMM_MS with the master and the slaves, MPI_COMM_DIST with the distribution processes, and MPI_COMM_LOAD with the load daemons).

Table 1. Messages and associated functions of the load daemon.
Message identifier - Associated tasks
0 - Task information
1 - The distribution process has finished and demands the identifier of a transmitter node
2 - The distribution process informs about the number of signatures delivered to another node
3 - The distribution process notifies that there are no available nodes to transfer load
4 - The distribution process shows the number of signatures obtained from another node
5 - Another load daemon informs about its new number of signatures due to their transference to another node
6 - Another load daemon reports the new number of signatures assigned to it
7 - Another load daemon tells that there are no nodes to transfer load

5.4. Distribution process

The main function of the load distribution process is to implement the initiation rule and the load balancing operation. Whenever a particular slave finishes its local work, the distribution process is alerted; it then evaluates the initiation rule, finds a candidate node, carries out the negotiation and delivers the load to the slave. On the other hand, if the node receives a load balancing request, the distribution rule must be triggered and the appropriate workload is sent to the remote node.

6. Analysis of the CBIR implementation with load balancing in a heterogeneous cluster

A set of experiments has been performed to test the behavior of the parallel CBIR system implemented on the heterogeneous cluster using the above distributed load balancing algorithm. To compare the results achieved by the parallel CBIR system with and without the distributed load balancing algorithm, the total response time of the CBIR system, with and without load balancing, has been measured. Additionally, two classical load balancing algorithms have been implemented as references: the random algorithm [3] and the Probin algorithm [9]. The random algorithm is one of the simplest distributed load balancing algorithms, because each node makes decisions based on local information. A node is considered a sender if the queue length of its CPU exceeds a predetermined, constant threshold. The receiver is selected randomly, because the nodes do not share any information about the system status. The Probin algorithm is a diffusion-based algorithm, where the information is exchanged locally, defining communication domains between neighbor nodes. Several levels of coordination can be established by varying the domain size.

The experiments have been executed on a heterogeneous cluster composed of 25 nodes, linked through a 100 Mbit/s Ethernet. Each of the process nodes features 4 GB of storage capacity in an IDE hard disk linked through DMA with a 16.6 MB/s transfer speed. The PCs' operating system is Linux. The heterogeneity is determined by the hard disk features. It has to be noted that this component determines each node's response, as shown in Fig. 5, since in this CBIR system (as in many others), I/O operations are predominant with respect to CPU operations.

Fig. 5. Computational power of the cluster nodes, measured in workload units/second.
Two different tests have been performed to measure the improvement achieved by the heterogeneous cluster implementation using the distributed load balancing algorithm. The first one analyzes search operations within a 30 million image database using an underloaded system. Since none of the nodes is overloaded, this test studies how heterogeneity affects the system performance, and how this performance is improved using the load balancing algorithm. The second experiment adds some artificial external tasks to a node in order to test how well the load balancing algorithm copes with a situation of strong load imbalance. In this case the underloaded nodes would otherwise have to wait for the overloaded one to finish the application; the load balancing algorithm must eliminate the underloaded nodes' idle time.

Fig. 6. Results without external tasks (speedup with respect to the algorithm without load balancing): (a) response time and (b) speedup.

Table 2. Response time without external workload, measured in seconds. Columns: No. nodes; Without alg.; Random alg.; Probin alg.; Proposed alg.

Table 3. Speedup without external tasks. Columns: No. nodes; Speedup random alg.; Speedup Probin alg.; Speedup proposed alg.

6.1. Tests considering cluster heterogeneity and load balancing overhead

The main purposes of these tests were to detect the amount of overhead introduced by the load balancing algorithm, and how the algorithm can manage the system heterogeneity. The tests were performed on clusters with 5, 10, 15, 20 and 25 slave nodes plus a master node, in order to evaluate the algorithm scalability. The results are presented in Table 2 and in Fig. 6. Table 2 shows that the response times are always shorter with some load balancing algorithm, which means that the overhead introduced by the algorithm is smaller than the improvements achieved by using any of the implemented load balancing algorithms. From these results two main considerations can be pointed out:

- The tested load balancing algorithms always improved the response times, between 10% and 15%. The best results were achieved by the proposed algorithm.
- The proposed approach proved to be more stable, while the results obtained with the other algorithms were less consistent.

Fig. 6(b) and Table 3 present the speedup of these algorithms, where the speedup refers to these improvements. An interesting parameter for estimating the methods' behavior is the standard deviation of the response times of the different cluster nodes, shown in Table 4 and in Fig. 7. The standard deviation of the nodes' response times is a measurement directly related to idling times of nodes waiting for other nodes to finish their assignments.

Table 4. Standard deviation of the cluster nodes without external tasks. Columns: No. nodes; Without alg.; Random alg.; Probin alg.; Proposed alg.

Fig. 7. Standard deviation without external tasks (time in seconds vs. number of processors).

The load balancing algorithm presented here decreases the standard deviation, equilibrating the response times; with the random algorithm a slight reduction is achieved, while with the Probin algorithm this value is erratic, depending highly on the probed nodes. Finally, the proposed algorithm achieves the best values of all the load balancing algorithms tested, with reductions of the standard deviation ranging from 86.45% with 5 nodes to 93.56% with 25 nodes, with respect to the response times without a load balancing algorithm.

6.2. Results with system overload

For these experiments the system is slightly overloaded, with one of the nodes heavily loaded. The goal of this test is to measure the algorithm's ability to distribute the work of the loaded node among the remaining cluster nodes, without affecting the system performance. The tests were performed on a heterogeneous cluster with 5, 10, 15, 20, and 25 slave nodes and a master node, using a database of 12.5 million images. Table 5 and Fig. 8 present the results achieved in this experiment.

Fig. 8. Results with external tasks: (a) response time and (b) speedup.

Table 5. Response time with external tasks on a node, measured in seconds. Columns: No. nodes; Without alg.; Random alg.; Probin alg.; Proposed alg.

For these tests, the differences obtained between executions with and without load balancing were very strong. The reductions in response times range from 45% with 5 nodes to 38% with 25 nodes. As the number of nodes increases, the differences in response times decrease. Again, the best results are achieved with the proposed algorithm. Table 6 and Fig. 8(b) show the speedup achieved in these tests. Finally, Table 7 and Fig. 9 present the standard deviation results. In these tests, the reduction of the standard deviation ranged from 90% to 95%.

Table 6. Speedup with external tasks. Columns: No. nodes; Speedup random alg.; Speedup Probin alg.; Speedup proposed alg.

Table 7. Standard deviation with external tasks. Columns: No. nodes; Without alg.; Random alg.; Probin alg.; Proposed alg.

Fig. 9. Standard deviation with external tasks (time in seconds vs. number of processors).

An interesting point to be remarked is the lack of consistency of the results provided by the random algorithm. This method provides only marginal improvements with respect to the algorithm without load balancing for more than nodes. The Probin algorithm has a better behavior for fewer than 10 nodes, although the relative improvements drop dramatically above 15 nodes.
Finally, the proposed algorithm has very stable values, achieving better results when the number of nodes is increased. The method takes advantage of the availability of additional nodes, having them all finish within a short time interval.

6.3. Results increasing the load of a heavily loaded node

The last experiment presented in this paper has been performed by increasing the number of external tasks on a heavily loaded node. This way, the imbalance among the different nodes is higher, and the algorithm's behavior when confronted with highly overloaded nodes can be checked. Table 8 and Fig. 10 present the results achieved.

Table 8. Response time increasing the number of external tasks, measured in seconds for a 25 node cluster. Columns: No. tasks; Without alg.; Proposed alg.

Fig. 10. Response time considering a loaded node for a 25 node cluster (response time in seconds vs. number of external tasks per node, without algorithm and with the proposed algorithm).

From the results above it can be seen that the response times without any load balancing algorithm increase linearly with the external load. When the proposed load balancing algorithm is introduced, depending on the system's external workload status, only a small increment in the system's response time may appear. But globally, the response times remain almost invariant when the amount of external workload is increased, as can be seen in Table 8 and Fig. 10. This behavior proves that, as long as there are some underloaded nodes, the extra workload can be split among them and the application response time can be kept constant.

7. Conclusions and future work

This paper begins with an analysis of the operations involved in a typical CBIR system. From the analysis of the sequential version, a lack of data or algorithmic dependencies can be observed. This allows efficient cluster implementations of CBIR systems, since the cluster is a parallel architecture that meets the application needs very well [6]. Improvements on the cluster implementation have been made by introducing a dynamic, distributed, global and scalable load balancing algorithm which has been designed specifically for the parallel CBIR application implemented on a heterogeneous cluster. An additional important feature is that the load balancing algorithm takes into account the system heterogeneity originated both by the different node computational attributes and by external factors such as the presence of external tasks. The experiments presented here show that the amount of overhead introduced by this method is very small. In fact, this overhead is hidden by the improvements achieved whenever any degree of system heterogeneity shows up, a common situation in grid systems. All these experiments have also shown that using the load balancing algorithm results in large execution time reductions and in a more uniform distribution of the nodes' response times, which can be detected through strong reductions in the standard deviation of the response times. As shown in the experiments presented here, another important aspect that should be stressed is the algorithm's scalability: increasing the number of system nodes does not significantly change the execution time increments originated by the introduction of the load balancing algorithm. At this moment, considering a network with a much higher number of nodes is not possible with the available resources.
In any case, it is feasible and even simple to extend the current implementation to define a hierarchical algorithm using MPI communicators. The cluster version of the CBIR system that includes the load balancing algorithm is nowadays fully operative. Finally, further work will be devoted to the evaluation of the effects in the method s performance of using more complex node load indices and initiation rules. New efforts will be made in order to refine the primitives used in the CBIR system, and to introduce fault tolerance mechanisms in order to increase the system robustness. Analysis on the system response will also be made after distributing the database of the CBIR system between different clusters. Future migration of the implemented CBIR system to a grid will also be performed. Acknowledgments This work has been partially funded by the Spanish Ministry of Education and Science (Grant TIC C02) and Government of the Community of Madrid (Grant GR/SAL/0940/2004). References [1] I. Ahmad, A. Ghafoor, Semi-distributed load balancing for massively parallel multicomputer systems, IEEE Trans. Software Engrg. 17 (10) (1991) [2] I. Banicescu, R. Carino, J. Pabico, M. Balasubramaniam, Design and implementation of a novel dynamic load balancing library for cluster computing, Parallel Comput. 31 (7) (2005) [3] K.M. Baumgartner, B.W. Wah, Computer scheduling algorithms past, present and future, Inform. Sci (1991)

[4] G. Bell, J. Gray, What's next in high-performance computing?, Commun. ACM 45 (2) (2002).
[5] A.P. Berman, L.G. Shapiro, A flexible image database system for content-based retrieval, Comput. Vision Image Understanding 75 (1/2) (1999).
[6] J.L. Bosque, O.D. Robles, A. Rodríguez, L. Pastor, Study of a parallel CBIR implementation using MPI, in: V. Cantoni, C. Guerra (Eds.), Proceedings of the International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2000, Padova, Italy, 2000.
[7] J.L. Bosque, O.D. Robles, L. Pastor, Load balancing algorithms for CBIR environments, in: Proceedings of the International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2003, The Center for Advanced Computer Studies, University of Louisiana at Lafayette, IEEE, New Orleans, USA, 2003.
[8] T.L. Casavant, J.G. Kuhl, A taxonomy of scheduling in general-purpose distributed computing systems, in: T.L. Casavant, M. Singhal (Eds.), Readings in Distributed Computing Systems, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[9] A. Corradi, L. Leonardi, F. Zambonelli, Diffusive load-balancing policies for dynamic applications, IEEE Concurrency 7 (1) (1999).
[10] S.K. Das, D.J. Harvey, R. Biswas, Parallel processing of adaptive meshes with load balancing, IEEE Trans. Parallel Distributed Systems 12 (12) (2001).
[11] I. Daubechies, Ten Lectures on Wavelets, vol. 61 of CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA.
[12] A. del Bimbo, Visual Information Retrieval, Morgan Kaufmann Publishers, San Francisco, CA, 1999.
[13] Department of Computer Science and Engineering, The Chinese University of Hong Kong, DISCOVIR: Distributed Content-based Visual Information Retrieval System on Peer-to-Peer (P2P) Network, Web, URL miplab/discovir/.
[14] Department of Diagnostic Radiology, Department of Medical Informatics, Division of Medical Image Processing and Lehrstuhl für Informatik VI of the Aachen University of Technology (RWTH Aachen), IRMA: Image Retrieval in Medical Applications, Web, URL index_en.php.
[15] D. Grosu, A. Chronopoulos, M. Leung, Load balancing in distributed systems: an approach using cooperative games, in: 16th International Parallel and Distributed Processing Symposium, IPDPS '02, IEEE, 2002.
[16] N.J. Gunther, G. Beretta, A benchmark for image retrieval using distributed systems over the Internet: BIRDS-I, Technical Report HPL, Imaging Systems Laboratory, Hewlett Packard, December.
[17] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, vol. I, Addison-Wesley, Reading, MA, 1992.
[18] C.-C. Hui, S.T. Chanson, Hydrodynamic load balancing, IEEE Trans. Parallel Distributed Systems 10 (11) (1999).
[19] S. Khoshafian, A.B. Baker, Multimedia and Imaging Databases, Morgan Kaufmann, San Francisco, CA.
[20] T. Kunz, The influence of different workload descriptions on a heuristic load balancing scheme, IEEE Trans. Software Engrg. 17 (7) (1991).
[21] L.J. Latecki, R. Melter, A. Gross, et al., Special issue on shape representation and similarity for image databases, Pattern Recognition 35 (1).
[22] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989).
[23] MPI Forum, A message-passing interface standard.
[24] H. Müller, W. Müller, D.M. Squire, S. Marchand-Maillet, T. Pun, Performance evaluation in content-based image retrieval: overview and proposals, Pattern Recognition Lett. 22 (2001).
[25] W. Obeloer, C. Grewe, H. Pals, Load management with mobile agents, in: 24th Euromicro Conference, vol. 2, IEEE, 1998.
[26] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers Inc., San Francisco.
[27] J.S. Payne, L. Hepplewhite, T.J. Stonham, Evaluating content-based image retrieval techniques using perceptually based metrics, in: Proceedings of SPIE on Applications of Artificial Neural Networks in Image Processing IV, vol. 3647, SPIE, 1999.
[28] A. Rajagopalan, S. Hariri, An agent based dynamic load balancing system, in: International Workshop on Autonomous Decentralized Systems, IEEE, 2000.
[29] O.D. Robles, A. Rodríguez, M.L. Córdoba, A study about multiresolution primitives for content-based image retrieval using wavelets, in: M.H. Hamza (Ed.), IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2001), IASTED, ACTA Press, Marbella, Spain, 2001.
[30] A. Rodríguez, O.D. Robles, L. Pastor, New features for content-based image retrieval using wavelets, in: F. Muge, R.C. Pinto, M. Piedade (Eds.), V Ibero-American Symposium on Pattern Recognition, SIARP 2000, Lisbon, Portugal, 2000.
[31] S. Santini, Exploratory Image Databases: Content-based Retrieval, Communications, Networking, and Multimedia, Academic Press, New York, 2001.
[32] S. Santini, A. Gupta, R. Jain, Emergent semantics through interaction in image databases, IEEE Trans. Knowledge Data Engrg. 13 (3) (2001).
[33] B. Schnor, S. Petri, R. Oleyniczak, H. Langendörfer, Scheduling of parallel applications on heterogeneous workstation clusters, in: K. Yetongnon, S. Hariri (Eds.), Proceedings of the ISCA Ninth International Conference on Parallel and Distributed Computing Systems, vol. 1, ISCA, Dijon, 1996.
[34] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000).
[35] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, Cambridge.
[36] J.M. Squyres, K.L. Meyer, M.D. McNally, A. Lumsdaine, LAM/MPI User Guide, University of Notre Dame, LAM 6.3.
[37] S. Srakaew, N. Alexandridis, P.P. Nga, G. Blankenship, Content-based multimedia data retrieval on cluster system environment, in: P. Sloot, M. Bubak, A. Hoekstra, B. Hertzberger (Eds.), High-Performance Computing and Networking, Seventh International Conference, HPCN Europe 1999, Springer, Berlin, 1999.
[38] E.J. Stollnitz, T.D. DeRose, D.H. Salesin, Wavelets for Computer Graphics, Morgan Kaufmann Publishers, San Francisco.
[39] Y. Wang, Z. Liu, J.-C. Huang, Multimedia content analysis, IEEE Signal Process. Mag. 16 (6) (2000).
[40] M.H. Willebeek-LeMair, A.P. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE Trans. Parallel Distributed Systems 4 (9) (1993).
[41] L. Xiao, S. Chen, X. Zhang, Dynamic cluster resource allocations for jobs with known and unknown memory demands, IEEE Trans. Parallel Distributed Systems 13 (3) (2002).
[42] C. Xu, F. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, Dordrecht.
[43] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurrency 7 (4) (1999).
[44] M.J. Zaki, S. Parthasarathy, W. Li, Customized Dynamic Load Balancing, vol. 1, Architectures and Systems, Prentice-Hall PTR, Upper Saddle River, NJ, 1999 (Chapter 24).
[45] A.Y. Zomaya, Y.-H. Teh, Observations on using genetic algorithms for dynamic load-balancing, IEEE Trans. Parallel Distributed Systems 12 (9) (2001).

Jose L. Bosque graduated in Computer Science and Engineering from the Universidad Politécnica de Madrid, where he also received the Ph.D. degree in Computer Science and Engineering. His Ph.D. was centered on theoretical models and algorithms for heterogeneous clusters. He has been an associate professor at the Universidad Rey Juan Carlos in Madrid, Spain. His research interests are parallel and distributed processing, performance and scalability evaluation, and load balancing.

Oscar D. Robles received his degree in Computer Science and Engineering and the Ph.D. degree from the Universidad Politécnica de Madrid in 1999 and 2004, respectively. His Ph.D. was centered on content-based image and video retrieval techniques on parallel architectures. Currently he is an Associate Professor at the Universidad Rey Juan Carlos and has published works in the fields of multimedia retrieval and parallel computer systems. His research interests include content-based multimedia retrieval, as well as computer vision and computer graphics. He is a Eurographics member.

Luis Pastor received the B.S.EE degree from the Universidad Politécnica de Madrid in 1981, the M.S.EE degree from Drexel University in 1983, and the Ph.D. degree from the Universidad Politécnica de Madrid. Currently he is a Professor at the Universidad Rey Juan Carlos (Madrid, Spain). His research interests include image processing and synthesis, virtual reality, 3D modeling, and parallel computing.

Angel Rodríguez received his degree in Computer Science and Engineering and the Ph.D. degree from the Universidad Politécnica de Madrid in 1991 and 1999, respectively. His Ph.D. was centered on the tasks of modeling and recognizing 3D objects in parallel architectures. He is an Associate Professor in the Photonics Technology Department, Universidad Politécnica de Madrid (UPM), Spain, and has published works in the fields of parallel computer systems, computer vision and computer graphics. He is an IEEE and an ACM member.
