RESERV: A Distributed, Load Balanced Information System for Grid Applications

Gábor Vincze, Zoltán Novák, Zoltán Pap, Rolland Vida
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics
{vincze, novak, pap, vida}@tmit.bme.hu

Abstract

Resource information systems are a key component of Computational Grids. Centralized information systems hamper scalability and reliability; completely distributed resource information systems based on Distributed Hash Tables have therefore been proposed. In some cases resource distribution may be highly uneven, so load balancing of data becomes a crucial problem. However, current load balancing schemes cannot handle large amounts of data corresponding to a single resource type. In this paper we therefore propose RESERV, a distributed information system for Grid applications with a novel load balancing approach that is able to handle extreme load imbalance.

1. Introduction

Computational Grids provide means to organize globally distributed resources into a virtual supercomputer, thus supplying the computing power to solve grand challenge problems such as financial modelling, earthquake simulation or global climate prediction [1]. The heart of any Grid system is the information service, which allows applications to find resources appropriate for their needs. A centralized information system, such as the Monitoring and Discovery System (MDS) of the Globus Toolkit [2], can however quickly become a performance bottleneck, limiting scalability and introducing a single point of failure. These shortcomings have led to the introduction of peer-to-peer (p2p) information systems that organize MDS directories into a flat p2p network similar to Gnutella [3]. Unfortunately, the proposed protocol is based on flooding query messages, which also limits scalability.
Distributed Hash Tables (DHTs), such as Chord [4] and Kademlia [5], offer scalable mechanisms for information lookup, but the use of cryptographic hash functions means that looking up ranges of information is not possible (only specific, well-defined keys can be retrieved). Nevertheless, when looking for resources that match the needs of a given application, it is not always possible or necessary to define exact query parameters (e.g., a certain task may need computers that have at least 512 MB of RAM). There was thus a need for range query algorithms that can answer partially defined queries in a distributed manner.

The main problem in designing a range queryable information system for Grid applications lies in the uneven distribution of resources in the attribute space. For attributes with numerical values (such as CPU speed, the amount of available memory, or the available free disk space), data might get concentrated on an attribute value interval, but not on a single value. This situation can be successfully handled with current load balancing algorithms. In contrast, in the case of string attributes (CPU type, operating system), some values might correspond to the majority of the resources (e.g., most of the machines will use the Windows OS). The node responsible for such a value will thus have to handle an extremely high load. Current range queryable systems are not able to cope with such uneven load distribution.

In this paper, we introduce RESERV (REsource SERVice), a range queryable information system for Grid applications with a novel load balancing approach. In RESERV, similarly to previous range query algorithms, we organize resources into a multidimensional attribute space. However, instead of storing data about resources in a distributed database, information about a node's resources is stored in the node's address itself.
Thus, finding a resource in RESERV simply means routing a message to a node with specific attribute values. Uneven distribution of nodes in the attribute space will thus affect routing tables instead of data distribution between nodes, and the key problem in RESERV will be load-balanced routing. This approach allows RESERV to operate efficiently even in cases of extremely skewed node distribution in the attribute space.

2. Related work

2.1. Range query algorithms

Range query algorithms are generally built over already existing DHTs, or use custom-designed DHTs. Squid [6] and the range query algorithm proposed by Andrzejak et al. [7] have a very similar approach to RESERV. Resources are organized into a multidimensional attribute space, where each dimension indexes data along one attribute. Squid uses a recursive Hilbert space-filling curve to walk through the overlay network for range queries. Andrzejak et al. also use a Hilbert curve over a DHT to provide range query functionality. MAAN [8] uses a locality-preserving hash function over Chord to extend its functionality to range queries, similarly to the solution proposed by Gupta et al. [9]. SWORD [10] also uses a locality-preserving hash function to map data onto a DHT. Mercury [11] uses a routing hub for each attribute; all the hubs have to be contacted successively during a range query. P-Tree [12] supports range queries by using B+ trees that can remain in a temporarily inconsistent state. Brushwood [13] uses a linearized indexing tree for single-attribute range queries, and a K-D tree for multi-attribute range queries.

2.2. Load balancing in range query algorithms

Load balancing is a critical problem in Grid information systems, as data distribution in the attribute space can be very uneven. The above-mentioned range query algorithms are all based on either a passive or an active load balancing solution. Passive load balancing, as in SWORD, is accomplished by the locality-preserving hash function, which tries to smooth the unevenness of data distribution. Active load balancing methods work in two similar ways: nodes with high load try to hand over part of the data they are responsible for (that is, part of the range along one attribute) to less loaded neighbour nodes (as in Squid, for example), or nodes with high load leave the system and join again with an address corresponding to a less loaded part of the key space (as in Squid, Mercury, SWORD or Brushwood).

However, neither of these methods can cope with extreme load imbalance, where most of the resources correspond to a single attribute-value pair (e.g., for most of the nodes, OS = Windows). In a DHT-based range query algorithm all these data will always be mapped onto the same node, so trying to alleviate the load of this node by distributing a range of values among more nodes does not help. RESERV, the load balanced information system that we propose in this paper, targets exactly these very realistic cases.

3. Distributed resource information service with RESERV

3.1. Background: Kademlia

Kademlia [5] is one of the most used DHTs in practical applications such as the KAD network or trackerless BitTorrent clients. Routing in RESERV is heavily based on the Kademlia DHT; thus, before presenting the details of our approach, we summarize the basic operation of Kademlia.

Kademlia uses a 160-bit address space, into which both nodes and keys are mapped. Every node stores the data whose keys are closest to its address in terms of the binary XOR operator. Every node maintains 160 k-buckets for routing information. The i-th k-bucket contains at most K nodes whose distance from the current node is between $2^{160-i}$ and $2^{160-i+1}$, where K is a pre-chosen system parameter. Figure 1 shows the address space of a 3-bit Kademlia network. The encircled subtrees correspond to the three k-buckets of the node represented by the black dot.

Figure 1: Kademlia k-buckets

The Kademlia protocol contains four RPCs that all other functions are built on:
PING, to check if a node is still connected;
STORE, to store a key and the corresponding data;
FIND_NODE, with an address as its parameter, to look for the K closest nodes to the given address;
FIND_VALUE, with a key as its parameter; if a node stores data corresponding to the key, the return value is the stored data; otherwise it behaves identically to FIND_NODE.

When node A looks for node B with address y, the search goes through the following steps:

I. Node A creates a list L containing the K closest addresses to y. It first fills this list from its own k-buckets. It also marks every node in the list on which it has already run the FIND_NODE RPC.
II. Node A selects α unmarked nodes from the list, and runs the FIND_NODE RPC on them (α is a system-wide parameter).
III. Node A updates the list using the return values of the FIND_NODE RPCs, so that it still contains the K closest addresses to y.
IV. If A has not found the node it was looking for, or the list still contains unmarked nodes, it returns to step II.

K-buckets are ordered lists of nodes, with the most recently seen node at the beginning of the list. If we receive a message from another node, we try to insert that node into the appropriate k-bucket, if there is still room. If there is not, we PING the node at the end of the list; if it replies, we move it to the head of the list; if it does not, we delete it from the list and replace it with the new node. With adequate network traffic, k-buckets remain consistent thanks to these procedures.

A new node joining a Kademlia system simply has to know about one other node already in the system. It chooses a random address for itself, and then searches for its own address. By doing so, it learns about nearby nodes. The new node then fills its k-buckets by selecting random addresses from the node lists returned by the successive FIND_NODE RPCs. In parallel, other nodes also gain knowledge of the new node. When a node leaves the network, it simply copies the data stored on it to the node nearest to itself, and disconnects.

3.2. The RESERV approach

The basic idea behind RESERV is that node addresses are not assigned randomly, but depend on the attribute-value pairs corresponding to that node. The address of a node is composed of as many parts as there are attributes. The goal is to give similar addresses to nodes with similar attribute values, in order to facilitate range queries. Let us suppose we want to code each attribute on 10 bits. In the case of string attributes, such as the operating system, we can obtain addresses for example by using a hash function which converts ASCII strings to a 10-bit binary number. In the case of numerical attributes, such as the amount of available memory, this address can be obtained directly, or by using some transformation which preserves locality (for example, taking the base 2 logarithm of the original value). Thus nodes with similar attributes will be near each other in the N-dimensional attribute space. Of course, no matter which space-filling curve we use, some addresses which are near each other in the N-dimensional attribute space will be far from each other on the space-filling curve.

We thus create an N-dimensional attribute space (where N is the number of attributes), where nodes occupy a position depending on their attribute values. A node's address along each dimension has a first part corresponding to the value of that attribute, and a second, random part to differentiate nodes from each other. Figure 2 shows the attribute space for two attributes: system memory (a numerical attribute) and operating system (a string attribute).

Figure 2: RESERV attribute space

We then use a Z-order space-filling curve to create one-dimensional addresses from the N-dimensional addresses, as shown in Figure 3.

Figure 3: Z-order curve application

This is accomplished by interleaving bits from each attribute successively, which yields a one-dimensional bit string address. Thus, for the example shown in Figure 3, the one-dimensional address obtained from attributes with binary values 10 (on the x-axis) and 01 (on the y-axis) will be 1001 (by interleaving the first bit of the first attribute, the first bit of the second attribute, the second bit of the first attribute, and finally the second bit of the second attribute). Let us take a more complex example: we have four attributes with the following binary values: 110110, 011, 1, 11010 (as this example shows, attributes need not be of the same length). We then write these attribute values in a matrix, one attribute per row, padding short rows with x:

1 1 0 1 1 0
0 1 1 x x x
1 x x x x x
1 1 0 1 0 x

We obtain the one-dimensional address by reading the bits column by column, from top to bottom, and omitting the x fields. Thus, the one-dimensional address will be: 1011 111 010 11 10 0.

To avoid address collisions between nodes belonging to the same category, a random bit sequence of pre-defined length is added at the end of the linearized address of each node.

The notion of k-buckets in RESERV is very similar to the original Kademlia network: the i-th k-bucket contains at most K nodes whose distance from the current node is between $2^{L-i}$ and $2^{L-i+1}$, where L is the total length of the linearized address of nodes. K-buckets in RESERV correspond to successively larger and farther away portions of the attribute space from the current node, as shown in Figure 4.

Figure 4: k-buckets and attribute space of node 0110

However, contrary to Kademlia, where k-buckets are filled randomly with nodes we receive messages from, in RESERV k-buckets are filled with the nearest nodes (based on the XOR distance between addresses) from the part of the attribute space which corresponds to each k-bucket. This is achieved by the following mechanism. When a node x joins, it starts a search for the node with address $x \oplus 100000\ldots$, that is, the nearest node to itself whose address differs in the first bit. After joining, we keep the K nearest nodes in each k-bucket, except in cases where load balancing dictates otherwise (see section 3.4).

3.3. Node lookup in RESERV

As no data is stored in RESERV, we can only look up nodes. There are two types of lookups: simple lookup, where every attribute value is specified, and range lookup, where for some attributes no value is specified, or a range of values is given instead of a single value.

The mechanism of simple lookup is identical to the Kademlia search mechanism described in section 3.1: when looking for a well-specified resource (a node with all attribute values defined), we can find the K closest nodes satisfying the search criteria by using the RESERV addressing mechanism and the Kademlia lookup mechanism.

For range lookups, we introduce a new RPC: FIND_INTERVAL. In a range lookup, we can specify a list of values or a range of values for some attributes (or leave the attribute value blank, in which case the range of the lookup will be the entire attribute range). The lookup interval in the attribute space will be the Cartesian product of the specified sets.

In order to handle range lookups, we define a new XOR distance metric between an address and a set of addresses as the distance between the address and the address nearest to it in the set. This distance is easy to calculate by exploiting the fact that the set of addresses in the range of the query is the Cartesian product of the sets of addresses in the range of the query along each attribute. We thus simply have to find the smallest distance along each attribute, and linearize the resulting attribute-value pairs using the Z-order curve. Because a simple greedy range lookup could quickly reach a dead end, we use binary stochastic beam search.

3.4. Load balancing

As addresses in RESERV are not assigned randomly, nodes corresponding to rare resources would have many more links pointing to them than ordinary nodes, since without load balancing each node would try to fill its k-buckets uniformly from the address space. This would mean that these nodes, probably also constituting the most valuable resources in the system, would have to take on a disproportionately large part of the system maintenance effort. This is why load balancing of routing tables is a crucial question in RESERV.
The principle of load balancing is that every node tries to estimate how many other nodes know of a given node before inserting it into its routing tables. The new node is only inserted in the corresponding k-bucket if the result of this estimation is smaller than K.
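
The admission test implied by this principle can be sketched in a few lines. This is a minimal illustration, not the RESERV implementation: it assumes that addresses are plain integers compared with the XOR metric, that the routing table is a flat collection of known addresses, and that the estimate of how many nodes know a candidate z is taken to be the number of already known nodes that are XOR-closer to z than the current node x. The function names `xor_distance`, `closer_known_nodes` and `should_insert` are hypothetical.

```python
from typing import Iterable


def xor_distance(a: int, b: int) -> int:
    """XOR metric between two integer addresses."""
    return a ^ b


def closer_known_nodes(x: int, z: int, routing_table: Iterable[int]) -> int:
    """Count known nodes that are strictly closer to z than x is.

    The candidate z itself is skipped (an assumption of this sketch), so the
    same function can be used to re-check nodes already in the table.
    """
    limit = xor_distance(x, z)
    return sum(
        1 for y in routing_table
        if y != z and xor_distance(y, z) < limit
    )


def should_insert(x: int, z: int, routing_table: Iterable[int], k: int) -> bool:
    """Admit z only if fewer than k known nodes are closer to z than x."""
    return closer_known_nodes(x, z, routing_table) < k


if __name__ == "__main__":
    me = 0b0110
    known = {0b0111, 0b0100, 0b0010}
    # Two known nodes are closer to 0101 than we are, so with K=2 it is rejected.
    print(should_insert(me, 0b0101, known, k=2))
    # No known node is closer to 1110 than we are, so it is admitted.
    print(should_insert(me, 0b1110, known, k=2))
```

With a test of this kind, a node in a densely populated region tends to be admitted only by its closest neighbours, which is what keeps the number of incoming links per node bounded.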

We achieve this by modifying the original Kademlia k-bucket handling rules. As in Kademlia, whenever a node with address x receives a message from another node with address z, it tries to insert z into its k-buckets. However, a new node z is only inserted into the k-buckets of node x if the number of elements in the set $\{\, y \mid y \in R(x) \wedge y \oplus z < x \oplus z \,\}$ (where R(x) is the union of all k-buckets of node x) is smaller than K. In other words, x only inserts z into its routing table if it does not know about K nodes nearer to z than itself. Each time the k-buckets are updated, this criterion has to remain true for all nodes in the routing table of node x. If we find a node z in the k-buckets of node x which has more than K nodes nearer to it than x, we delete node z from the routing table.

The basis of this load balancing technique is that since every k-bucket of a node x contains the nodes nearest to x, nodes known to x will also know each other with a very high probability, especially in the case of nodes nearer to a target node than x. The use of this load balancing technique also means that RESERV can run with a very sparsely populated attribute space; the dynamically changing distribution of nodes in the attribute space does not affect system performance, and the fact that attribute ranges and granularity are fixed in advance is not a serious limitation.

4. Evaluation

RESERV was implemented in Java as part of a distributed job execution and data storage system. As our approach to creating a distributed resource information system was quite different from previous work, comparing RESERV to other range query algorithms would have made little sense. The goal of our evaluation was to examine the effect of k-bucket size on routing efficiency, scalability, and resilience to skewed node distribution in the address space.

Simulations were run on one computer, with each node running as a separate thread. During each test, N nodes were connected sequentially, with each node choosing a random node for bootstrapping. Each node x ran one lookup for the address farthest away from itself in the address space (i.e., the node with address $x \oplus 111\ldots11$). At the end of the test, the number of FIND_NODE RPCs was divided by N, giving us the average length of the longest possible lookup in the system.

Our first test was to evaluate how this average length depends on the network size, as shown in Figure 5.

Figure 5: Number of nodes and route length

In this test, we varied the number of connected nodes, with a constant k-bucket size (K=5). As expected, the route length scales sub-linearly with the number of nodes.

Figure 6 shows the effect of node distribution on routing path length. We ran the test with N=500 and K=5. However, in this test, connecting nodes did not choose attribute values uniformly, but according to a Zipf distribution with a variable parameter.

Figure 6: Zipf node distribution with parameter S and routing length

The results might seem surprising at first: we get a shorter routing path length for a more uneven node distribution. The explanation is however simple: with a more uneven distribution, a larger proportion of nodes will have an address with the same prefix. To calculate the average routing path length, we take into account a lookup performed by each node; thus, many nodes with similar addresses will compensate for the bad results achieved by rare nodes. This does not mean that an unbalanced distribution is an advantage, especially if we look for rare nodes.

Figures 7 and 8 show the number of links pointing to rare nodes without and with load balancing. In these tests, we were not interested in routing path length, but rather in the effects of load balancing.
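
The number of links pointing to a node, the quantity examined in these tests, is simply its in-degree in the graph formed by all routing tables. The sketch below shows one way such counts and their summary statistics could be computed. It assumes, hypothetically, that each node's routing table is available as a set of node identifiers in a dictionary; the names `link_counts` and `summarize` are illustrative, and the use of the population standard deviation is an assumption, since the kind of standard deviation is not specified.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, Hashable, Set


def link_counts(routing_tables: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """Return, for every node, how many other nodes keep a link to it."""
    # Start every known node at zero so unreferenced nodes are still reported.
    counts = Counter({node: 0 for node in routing_tables})
    for owner, known in routing_tables.items():
        for target in known:
            if target != owner:  # ignore self-links
                counts[target] += 1
    return dict(counts)


def summarize(counts: Dict[Hashable, int]) -> Dict[str, float]:
    """Most popular, least popular, average and standard deviation of link counts."""
    values = list(counts.values())
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "stdev": pstdev(values),  # population standard deviation (assumption)
    }


if __name__ == "__main__":
    # Toy network: every node joined through "a", so "a" is the most popular.
    tables = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    print(summarize(link_counts(tables)))
```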

Network size was 500 nodes, with K=5. To further increase the imbalance, every node joined through the same initial node. After the join process, we examined the number of links pointing to a given node. For these tests, we calculated the number of links pointing to the most popular node, the number of links pointing to the least popular node, the average number of links, and the standard deviation of the number of links. In the first figure, we did not represent the most popular node, which was the initial node: all the 499 other nodes kept a link to it in their routing tables in every test. The most representative data is the high standard deviation.

Figure 7: Number of connections without load balancing

Figure 8: Number of connections with load balancing

In Figure 8 we can see that at most about 60 links point to the most popular node instead of 499, and that with load balancing the standard deviation of the number of links is much lower. These tests show that our load balancing scheme works as expected.

5. Conclusion

The goal of RESERV was to create a distributed information system for Grid applications that can handle the uneven distribution of data which can arise in the case of resource attributes with discrete values. As our tests show, load balancing did not affect the $O(\log N)$ routing complexity typical of DHT systems. RESERV is thus a solution that supports load balancing even in cases of extremely skewed distributions, while preserving scalability and routing efficiency. After these encouraging initial results, we plan to deploy RESERV on PlanetLab to test it on a larger scale and in a real network environment.

6. References

[1] I. Foster, C. Kesselman, "The Grid: Blueprint for a New Computing Infrastructure," Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[2] I. Foster, C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," Int. Journal of High Performance Computing Applications, Vol. 11, No. 2, pp. 115-128.
[3] A. Iamnitchi, I. Foster, D. Nurmi, "A Peer-to-Peer Approach to Resource Discovery in Grid Environments," Proc. of the 11th Symposium on High Performance Distributed Computing, 2002.
[4] I. Stoica, et al., "Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications," IEEE/ACM Transactions on Networking, Vol. 11, No. 1, pp. 17-32, February 2003.
[5] P. Maymounkov, D. Mazières, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric," Proc. of 1st International Workshop on Peer-to-Peer Systems (IPTPS), Cambridge, March 2002.
[6] C. Schmidt, M. Parashar, "Enabling Flexible Queries with Guarantees in P2P Systems," IEEE Internet Computing, Vol. 8, No. 3, pp. 19-26, May/June 2004.
[7] A. Andrzejak, Z. Xu, "Scalable Efficient Range Queries for Grid Information Services," Proc. IEEE P2P 2002, Linköping, Sweden, September 2002.
[8] M. Cai, M. Frank, J. Chen, P. Szekely, "MAAN: A Multi-Attribute Addressable Network for Grid Information Services," Journal of Grid Computing, Springer, 2004.
[9] A. Gupta, D. Agrawal, A. El Abbadi, "Approximate Range Selection Queries in Peer-to-Peer Systems," Proc. of CIDR '03, Asilomar, California, USA, January 2003.
[10] D. Oppenheimer, J. Albrecht, D. Patterson, A. Vahdat, "Distributed Resource Discovery on PlanetLab with SWORD," Proc. of WORLDS '04, Santa Fe, New Mexico, USA, December 2004.
[11] A. R. Bharambe, M. Agrawal, S. Seshan, "Mercury: Supporting Scalable Multi-attribute Range Queries," Proc. SIGCOMM '04, Portland, Oregon, USA, 2004.
[12] A. Crainiceanu, et al., "P-Tree: A P2P Index for Resource Discovery Applications," Proc. of WWW '04, New York, USA, May 2004.
[13] C. Zhang, A. Krishnamurthy, R. Y. Wang, "Brushwood: Distributed Trees in Peer-to-Peer Systems," Proc. of IPTPS '05, New York, USA, 2005.