A Fast Algorithm for Online Placement and Reorganization of Replicated Data


R. J. Honicky
Storage Systems Research Center, University of California, Santa Cruz

Ethan L. Miller
Storage Systems Research Center, University of California, Santa Cruz

Abstract

As storage systems scale to thousands of disks, data distribution and load balancing become increasingly important. We present an algorithm for allocating data objects to disks in a system as it grows from a few disks to hundreds or thousands. A client using our algorithm can locate a data object in microseconds without consulting a central server or maintaining a full mapping of objects or buckets to disks. Despite requiring little global configuration data, our algorithm is probabilistically optimal in both distributing data evenly and minimizing data movement when new storage is added to the system. Moreover, our algorithm supports weighted allocation and variable levels of object replication, both of which are needed to permit systems to grow efficiently while accommodating new technology.

1 Introduction

As the prevalence of large distributed systems and clusters of commodity machines has grown, significant research has been devoted toward designing scalable distributed storage systems. Scalability for such systems has typically been limited to allowing the construction of a very large system in a single step, rather than the slow accretion over time of components into a large system. This bias is reflected in techniques for ensuring data distribution and reliability that assume the entire system configuration is known when each object is first written to a disk. In modern storage systems, however, configuration changes over time as new disks are added to supply needed capacity or bandwidth.

The increasing popularity of network-attached storage devices (NASDs) [11], which allow the use of thousands of smart disks directly attached to the network, has complicated storage system design.
In NASD-based systems, disks may be added by connecting them to the network, but efficiently utilizing the additional storage may be difficult. Such systems cannot rely on central servers because doing so would introduce scalability and reliability problems. It is also impossible for each client to maintain detailed information about the entire system because of the number of devices involved.

Our research addresses this problem by providing an algorithm for a client to map any object to a disk using a small amount of infrequently-updated information. Our algorithm distributes objects to disks evenly, redistributing as few objects as possible when new disks are added to preserve this even distribution. Our algorithm is very fast, and scales with the number of disk groups added to the system. For example, a 1000-disk system in which disks were added ten at a time would run in time proportional to 100. In such a system, a modern client would require about 10 µs to map an object to a disk. Because there is no central directory, clients can do this computation in parallel, allowing thousands of clients to access thousands of disks simultaneously.

Our algorithm also enables the construction of highly reliable systems. Objects may have an arbitrary, adjustable degree of replication, allowing storage systems to replicate data sufficiently to reduce the risk of data loss. Replicas are distributed evenly to all of the disks in the system, so the load from a failed disk is distributed evenly to all other disks in the system. As a result, there is little performance loss when a large system loses one or two disks.

Even with all of these benefits, our algorithm is simple. It requires fewer than 100 lines of C code, reducing the likelihood that a bug will cause an object to be mapped to the wrong server. Each client need only keep a table of all of the servers in the system, storing the network address and a few bytes of additional information for each server.
In a system with thousands of clients, a small, simple distribution mechanism is a big advantage.

This paper appeared at the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), April 2003, Nice, France.

2 Related Work

Litwin, et al. describe a class of data structures and algorithms on those data structures which the authors dubbed Scalable Distributed Data Structures (SDDS) [2]. There are three main properties which a data structure must meet in order to be considered an SDDS.

1. A file expands to new servers gracefully, and only when servers already used are efficiently loaded.

2. There is no master site that object address computations must go through, e.g., a centralized directory.
3. File access and maintenance primitives, e.g., search, insertion, split, etc., never require atomic updates to multiple clients.

While the second and third properties are clearly important for highly scalable data structures designed to place objects over hundreds or thousands of disks, the first property, as it stands, could be considered a limitation. In essence, a file that expands to new servers based on storage demands rather than on resource availability will present a very difficult administration problem. Often, an administrator wants to add disks to a storage cluster and immediately rebalance the objects in the cluster to take advantage of the new disks for increased parallelism. An administrator does not want to wait for the system to decide to take advantage of the new resources based on algorithmic characteristics and parameters that they do not understand. This is a fundamental flaw in all of the LH* variants discussed below.

Furthermore, Linear Hashing and LH* variants split buckets (disks in this case) in half, so that on average, half of the objects on a split disk will be moved to a new, empty disk. Moving half of the objects from one disk to another causes wide differences in the number of objects stored on different disks in the cluster, and results in suboptimal disk utilization [2]. Splitting in LH* will also result in a hot spot of disk and network activity between the splitting node and the recipient. Our algorithm, on the other hand, always moves a statistically optimal number of objects from every disk in the system to each new disk, rather than from one disk to one disk.

LH* variants such as LH*M [19], LH*G [21], LH*S [18], LH*SA [17], and LH*RS [22] describe techniques for increasing availability of data or storage efficiency by using mirroring, striping and checksums, Reed-Solomon codes and other standard techniques in conjunction with the basic LH* algorithm.
Our algorithm can also easily take advantage of these standard techniques, although that is not the focus of this paper. The LH* variants do not provide a mechanism for weighting different disks to take advantage of disks with heterogeneous capacity or throughput. This is a reasonable requirement for storage clusters which grow over time; we always want to add the highest performance or highest capacity disks to our cluster. Our algorithm allows weighting of disks.

Breitbart, et al. [2] discuss a distributed file organization which resolves the issues of disk utilization (load) in LH*. They do not, however, propose any solution for data replication.

Kröll and Widmayer [14] propose another SDDS that they call Distributed Random Trees (DRTs). DRTs are optimized for more complex queries such as range queries and closest match, rather than the simple primary key lookup supported by our algorithm and LH*. Additionally, DRTs support server weighting. Because they are SDDSs, however, they have the same difficulties with data-driven reorganization (as opposed to administrator-driven reorganization) as do LH* variants. In addition, the authors present no algorithm for data replication, although metadata replication is discussed extensively. Finally, although they provide no statements regarding the average case performance of their data structure, DRT has worst-case performance which is linear in the number of disks in the cluster. In another paper, the authors prove a lower bound of $\Omega(\sqrt{m})$ on the average case performance of any tree based SDDS [15], where $m$ is the number of objects stored by the system. Our algorithm has performance which is $O(n \log n)$ in the number of groups of disks added; if disks are added in large groups, as is often the case, then performance will be nearly constant time.

Brinkmann, et al. [3, 4] propose a method for pseudo-random distribution of data to multiple disks using partitioning of the unit range.
This method accommodates growth of the collection of disks by repartitioning the range and relocating data to rebalance the load. However, this method does not allow for the placement of replicas, an essential feature for modern scalable storage systems.

Chau and Fu discuss and propose algorithms for declustered RAID whose performance degrades gracefully with failures [5]. Our algorithm exhibits similarly graceful degradation of performance: the pseudo-random distribution of objects (declustering) means that the load on the system is distributed evenly when a disk fails.

Peer-to-peer systems such as CFS [1], PAST [24], Gnutella [23], and FreeNet [7] assume that storage nodes are extremely unreliable. Consequently, data has a very high degree of replication. Furthermore, most of these systems make no attempt to guarantee long term persistence of stored objects. In some cases, objects may be garbage collected at any time by users who no longer want to store particular objects on their node, and in others, objects which are seldom used are automatically discarded. Because of the unreliability of individual nodes, these systems use replication for high availability, and are less concerned with maintaining balanced performance across the entire system.

Other large scale persistent storage systems such as Farsite [1] and OceanStore [16] provide more file system-like semantics. Objects placed in the file system are guaranteed (within some probability of failure) to remain in the file system until they are explicitly removed (if removal is supported). OceanStore guarantees reliability by a very high degree of replication. The inefficiencies which are introduced by the peer-to-peer and wide area storage systems address security, reliability in the face of highly unstable

nodes, and client mobility (among other things). These features introduce far too much overhead for a tightly coupled mass object storage system.

Distributed file systems such as AFS [13] use a client server model. These systems typically use replication at each storage node, such as RAID [6], as well as client caching to achieve reliability. Scaling is typically done by adding volumes as demand for capacity grows. This strategy for scaling can result in very poor load balancing, and requires too much maintenance for large disk arrays. In addition, it does not solve the problem of balancing object placement.

3 Object Placement Algorithm

We have developed an object placement algorithm that organizes data optimally over a system of disks or servers while allowing online reorganization in order to take advantage of newly available resources. The algorithm allows replication to be determined on a per-object basis, and permits weighting to distribute objects unevenly to best utilize different performance characteristics for different servers in the system. The algorithm is completely decentralized and has minimal storage overhead and minimal computational requirements.

3.1 Object-based Storage Systems

NASD-based storage systems are built from large numbers of relatively small disks attached to a high bandwidth network, as shown in Figure 1. Often, NASD disks manage their own storage allocation, allowing clients to store objects rather than blocks on the disks. Objects can be any size and may have any 64-bit name, allowing the disk to store an object anywhere it can find space. If the object name space is partitioned among the clients, several clients can store different objects on a single disk without the need for distributed locking. In contrast, blocks must be a fixed size and must be stored at a particular location on disk, requiring the use of a distributed locking scheme to control allocation. NASD devices that support an object interface are called object-based storage devices (OBSDs)¹ [25].
We assume that the storage system on which our algorithm runs is built from OBSDs.

¹ OBSDs may also be called object-based disks (OBDs).

Figure 1. A typical NASD-based storage system. (Diagram: several clients and several OBSDs, each OBSD with its own CPU, connected by a network.)

Our discussion of the algorithm assumes that each object can be mapped to a key $x$. While each object must have a unique identifier in the system, the key used for our algorithm need not be unique for each object. Instead, objects are mapped to a set that may contain hundreds or thousands of objects, all of which share the key $x$ while having different identifiers. Once the algorithm has located the set in which an object resides, that set may be searched for the desired object; this search can be done locally on the OBSD and the object returned to the client. By restricting the magnitude of $x$ to a relatively small number, perhaps $10^6$ or $10^7$, we make the object balancing described in Section 6.1 simpler to implement without losing the desirable balancing characteristics of the algorithm.

Most previous work has either assumed that storage is static, or that storage is added for additional capacity. We believe that additional storage will be necessary as much for additional performance as for capacity, requiring that objects be redistributed to new disks. If objects are not rebalanced when storage is added, newly created objects will be more likely to be stored on new disks. Since new objects are more likely to be referenced, this will leave the existing disks underutilized.

We assume that disks are added to the system in clusters, with the $j$th cluster of disks containing $m_j$ disks. If a system contains $N$ objects and $n_j = \sum_{i=0}^{j-1} m_i$ disks, adding $m$ more disks will require that we relocate $N \cdot \frac{m}{n_j + m}$ objects to the new disks to preserve the balanced load. For all of our algorithms, we assume that existing clusters are numbered $0 \ldots c-1$, and that we are adding cluster $c$. The $c$th cluster contains $m_c$ disks, with $n_c$ disks already in the system.
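The relocation count above is easy to check numerically. The following is a small sketch in Python; the function names and the example cluster sizes are hypothetical and chosen only for illustration.

```python
def disks_before(cluster_sizes, j):
    """n_j: total number of disks in clusters 0 .. j-1."""
    return sum(cluster_sizes[:j])


def expected_relocated(num_objects, existing_disks, new_disks):
    """Expected number of objects that must move: N * m / (n_j + m)."""
    return num_objects * new_disks / (existing_disks + new_disks)


# Hypothetical system: three clusters of 10 disks, then 10 more disks are added.
cluster_sizes = [10, 10, 10]
n = disks_before(cluster_sizes, len(cluster_sizes))
moved = expected_relocated(1_000_000, n, 10)
print(n, moved)  # 30 existing disks; a quarter of the objects move
```

With 30 existing disks and 10 new ones, one quarter of the objects move, which is exactly the share of the total capacity that the new disks represent.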
3.2 Basic Algorithm

We will call disks servers since this algorithm might be used to distribute data in an object database or other more complex service. Our algorithm operates on the basic principle that in order to move the (statistically) optimal number of objects into a new cluster of servers, we can simply pick a pseudo-random integer $z_x = f(x)$ based on each object's key $x$ such that $0 \le z_x < n_c + m_c$. If $z_x < m_c$, then the object in question moves to the new cluster. Our algorithm is applied recursively; each time we add a new cluster of servers, we

add another step in the lookup process. To find a particular object, we work backward through the clusters, starting at the most recently added, deciding whether the object would have been moved to that cluster. The basic algorithm for determining the placement of some object with key $x$, before making considerations for object replication and weighting, is shown in Figure 2.

    j ← c
    while (object not mapped)
        seed a random number generator with the object's key x
        advance the random number generator j steps
        generate a random number 0 ≤ z < (n_j + m_j)
        if z ≥ m_j
            j ← j − 1
        else
            map the object to server n_j + (z mod m_j)

Figure 2. Algorithm for mapping objects to servers without replication or weighting.

We use a uniform random number generator which allows "jump-ahead": the next $s$ numbers generated by the generator can be skipped, and the $(s+1)$st number can be generated directly. The generator which we use can be advanced $s$ steps in $O(\log s)$ time, but we are currently exploring generators which can generate parametric random numbers in $O(1)$ time, as described in Section 5.1.

Using a simple induction, we sketch a proof that the expected number of objects placed in the new cluster by this basic algorithm is $\frac{m_c}{n_c + m_c} N$, and that objects will be randomly distributed uniformly over all of the servers after the reorganization. We also demonstrate that the algorithm minimizes the expected number of objects which get moved in a reorganization where only a single cluster is added, and that the algorithm is therefore optimal in the number of objects moved during such a reorganization.

In the base case, all objects should clearly go to the first cluster since $n_0 = 0$, meaning that $\frac{m_0}{n_0 + m_0} N = N$. Furthermore, since $z$ comes from a uniform distribution and each object will be placed on server $0 + (z \bmod m_0) = z \bmod m_0$, the probability of choosing a given server is $\frac{1}{m_0}$. Thus each server has an equal probability of being chosen, so the objects will be distributed uniformly over all of the servers after placing them on the first cluster.
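The lookup loop of Figure 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a SHA-256 hash of the key and the cluster index stands in for seeding a jump-ahead generator with $x$ and advancing it $j$ steps, and the names `pseudo_random` and `locate` are hypothetical.

```python
import hashlib


def pseudo_random(key, step, bound):
    """Deterministic integer in [0, bound) derived from (key, step).

    Assumption: replaces "seed with x, advance j steps" with a hash.
    """
    digest = hashlib.sha256(f"{key}:{step}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % bound


def locate(key, cluster_sizes):
    """Return the server index for `key`, scanning clusters newest to oldest."""
    j = len(cluster_sizes) - 1
    while True:
        n_j = sum(cluster_sizes[:j])   # servers in clusters 0 .. j-1
        m_j = cluster_sizes[j]         # servers in cluster j
        z = pseudo_random(key, j, n_j + m_j)
        if z < m_j:
            return n_j + (z % m_j)     # object lives in cluster j
        j -= 1                         # otherwise try the previous cluster


# Hypothetical example: two clusters of 5 servers, then a third is added.
before = [locate(k, [5, 5]) for k in range(3000)]
after = [locate(k, [5, 5, 5]) for k in range(3000)]
moved = sum(1 for b, a in zip(before, after) if a != b)
print(moved / 3000)  # should be near 5 / 15
```

Because the draw for cluster $j$ depends only on the key, on $j$, and on $n_j + m_j$, appending a cluster leaves every earlier draw unchanged, so an object either stays where it was or moves into the new cluster. The loop always terminates at cluster 0, where $n_0 = 0$ forces $z < m_0$.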
For the induction step, assume that $N$ objects are randomly distributed uniformly over $n_c$ servers divided into $c - 1$ clusters, and we add cluster $c$ containing $m_c$ servers. We will optimally place $\frac{m_c}{n_c + m_c} N$ objects in cluster $c$. Since each random number $0 \le z < n_c + m_c$ is equally likely, we have a probability of $\frac{m_c}{n_c + m_c}$ of moving any given object to a server in cluster $c$. With $N$ objects, the total number of objects moved to a server in cluster $c$ is $\frac{m_c}{n_c + m_c} N$, the optimal value.

Since the $N$ objects in the system are distributed uniformly over $n_c$ servers by our inductive hypothesis, a relocated object has an equal probability of coming from any of the $n_c$ servers. The expected number of objects moved from any given server $S$ (where $S < n_c$) is $\frac{1}{n_c} \cdot \frac{m_c}{n_c + m_c} N$, so the expected number of objects remaining on any server $S$ will be $\frac{1}{n_c} \left(1 - \frac{m_c}{n_c + m_c}\right) N = \frac{N}{n_c + m_c}$. Since the expected number of objects placed in cluster $c$ is $\frac{m_c}{n_c + m_c} N$, the expected number of objects placed on a given server in cluster $c$ is $\frac{1}{m_c} \cdot \frac{m_c}{n_c + m_c} N = \frac{N}{n_c + m_c}$. Because the expected number of objects on any server in the system after reorganization is $\frac{N}{n_c + m_c}$, the distribution of objects in the system remains uniform. Since the decision regarding which objects to move and where to move them is made using a pseudo-random process, the distribution of objects in the system also remains random.

Finally, we can see that the algorithm moves an approximately optimal number of objects during the reorganization by noting two facts. First, an object mapped to a given cluster will never move to a different cluster unless it is mapped to a newly added cluster: objects may move to new clusters, but never to old ones. When we add a new cluster, all objects that move must therefore move into the new cluster. Second, the expected number of objects in a new cluster is exactly the number of objects which will allow the distribution of objects over the clusters to remain uniform, so the algorithm could not move fewer objects into the new cluster and remain correct.
We therefore move approximately the minimum number of objects for the algorithm to remain correct. Therefore, the algorithm moves the optimal number of objects during a reorganization.

4 Cluster Weighting and Replication

Simply distributing objects to uniform clusters is not sufficient for large-scale storage systems. In practice, large clusters of disks will require weighting to allow newer disks (which are likely to be faster and larger) to contain a higher proportion of objects than existing servers. Such clusters will also need replication to overcome the frequent disk failures that will occur in systems with thousands of servers.

4.1 Cluster Weighting

In most systems, clusters of servers have different properties: newer servers are faster and have more capacity. We must therefore add weighting to the algorithm to allow some server clusters to contain a higher proportion of objects than others. To accomplish this, we use an integer weight adjustment factor $w_j$ for every cluster $j$. This factor

will likely be a number which describes the power (such as capacity, throughput, or some combination of the two) of the server. For example, if clusters are weighted by the capacity of the drives, and each drive in the first cluster is 60 gigabytes, and each drive in the second cluster is 100 gigabytes, then $w_0$ might be initialized to 60, and $w_1$ might be initialized to 100. We then use $m'_j = m_j w_j$ in place of $m_j$ and $n'_j = \sum_{i=0}^{j-1} m'_i$ in place of $n_j$ in Figure 2. Once an object's cluster has been selected, it can be mapped to a server by $n_j + v \bmod m_j$, as done in the basic algorithm. The use of 64-bit integers and arithmetic allows for very large systems; a 1,000 terabyte system that weights by gigabytes will have a total weight of only 1 million. If weights are naturally fractional (as for bandwidth, perhaps), they can all be scaled by a constant factor $c_w$ to ensure that all $w_j$ remain integers.

4.2 Replication

The algorithm becomes slightly more complicated when we add replication because we must guarantee that no two replicas of an object are placed on the same server, while still allowing the optimal placement and migration of objects to new server clusters. This version of the algorithm, shown in Figure 3, relies on the fact that multiplying some number $n < m$ by a prime $p$ which is larger than $m$ and taking the modulus (i.e., $(np) \bmod m$) defines a bijection between the ordered set $S = \{0 \ldots m-1\}$ and some permutation of $S$ [9]. Furthermore, the number of unique bijections is equal to the number of elements of $S$ which are relatively prime to $m$. In other words, multiplying by a prime larger than $m$ permutes the elements of $S$ in one of $\varphi(m)$ ways, where $\varphi(\cdot)$ is the Euler Phi function [9], as described in Section 4.3.

Again, $x$ is the key of the object being placed, $c$ is the number of clusters, $n_j$ is the total number of servers in the first $j - 1$ clusters, and $m_j$ is the number of servers in cluster $j$, where $j \in \{0 \ldots c-1\}$. Let $R$ equal the maximum degree of replication for an object, and $r \in \{0 \ldots R-1\}$ be the replica number of the object in question.
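The permutation property that the replicated algorithm relies on can be checked directly. The sketch below, in Python, uses hypothetical helper names and an arbitrary example modulus of 12; it illustrates the number-theoretic fact only and is not part of the placement algorithm itself.

```python
from math import gcd


def permute_by_prime(m, p):
    """Map each n in 0 .. m-1 to (n * p) mod m.

    The result is a permutation of range(m) whenever gcd(p, m) == 1,
    which always holds for a prime p larger than m.
    """
    return [(n * p) % m for n in range(m)]


def euler_phi(m):
    """Count the integers in 1 .. m that are relatively prime to m."""
    return sum(1 for u in range(1, m + 1) if gcd(u, m) == 1)


m = 12
perm = permute_by_prime(m, 17)          # 17 is prime and larger than 12
print(perm)

# Different primes give the same permutation when they agree modulo m,
# so the number of distinct permutations is limited to phi(m).
primes = [13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
distinct = {tuple(permute_by_prime(m, p)) for p in primes}
print(len(distinct), euler_phi(m))
```

For a modulus of 12 only four distinct permutations arise, one for each residue that is relatively prime to 12.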
z and s are pseudo-random values used by the algorithm. The algorithm also assumes that $m_0 \ge R$. That is, the number of servers in the first cluster is at least as large as the maximum degree of replication. This makes intuitive sense since if it were not true, there would not be a sufficient number of servers available to accommodate all of the replicas of an object when the system is first brought online.

In the case where $m_j < R$, our algorithm (intuitively speaking) first pretends that the cluster is of size $R$. It then selects only those object replicas which would be allocated to the first $m_j$ servers in our imaginary cluster of $R$ servers. In this way, we can avoid mapping more than one replica to the same server. When $m_j < R$, the fraction of objects which get mapped into cluster $j$ is $\frac{m_j}{R} \cdot \frac{R w_j}{n'_j + m'_j} = \frac{m'_j}{n'_j + m'_j}$, so the $R$ factor cancels completely.

    j ← c
    while object is not mapped
        seed a random number generator with the object's key x
        advance the generator j steps
        m'_j ← m_j w_j
        n'_j ← Σ_{i=0}^{j−1} m'_i
        generate a random number 0 ≤ z < (n'_j + m'_j)
        choose a random prime number p ≥ m'_j
        v ← x + z + r × p
        z ← (z + r × p) mod (n'_j + m'_j)
        if m_j ≥ R and z < m'_j
            map the object to server n_j + (v mod m_j)
        else if m_j < R and z < R · w_j and v mod R < m_j
            map the object to server n_j + (v mod R)
        else
            j ← j − 1

Figure 3. Algorithm for mapping objects to servers with replication and weighting.

Let the total weight in the system $W$ be $\sum_{i=0}^{c} w_i m_i$. The fraction of the total weight possessed by a server in cluster $j$ is thus $\frac{w_j}{W}$. We must therefore show that the expected number of object replicas owned by some server in cluster $j$ is $\frac{w_j}{W} N R$. We also must show that no two replicas of the same object get placed on the same server. Again, we can prove these facts using induction. We omit the proof that the objects remain distributed over the other clusters according to their weights, since the argument is essentially identical to that in the basic algorithm described in Section 3.2.

In the base case, $n_0 = 0$, and $z$ is taken modulo $n'_0 + m'_0 = m'_0$ (and hence $z < m'_0$).
Since we require that the first cluster have at least $R$ servers, we will always map the object to server $n_0 + (v \bmod m_0) = v \bmod m_0$, which is in the first cluster, as described in Figure 3. $v$ is a pseudo-random number (because $z$ is pseudo-random), so an object has equal probability of being placed on any of the servers in cluster 0. Therefore, the expected number of objects placed on a given server when there is only one cluster is $\frac{1}{m_0} N R = \frac{w_0}{m_0 w_0} N R = \frac{w_0}{W} N R$, which is what we wanted to prove.

Now, $[x + z + r p]_{m_0} \equiv [x + z]_{m_0} + [r p]_{m_0}$. We can therefore examine the $(x + z) \bmod m_0$ term and the $(r p) \bmod m_0$ term separately. Recall that $x$ is the key of an object. Since $x$ and $z$ can be any value, both of which are (potentially) different for each object, but the same for each replica of the object, $x + z$ can

be viewed as defining a random offset within the servers in the first cluster from which to start placing objects. $p$ and $m_0$ are relatively prime, so by the Chinese Remainder Theorem [9], for a given $y$, $[r p]_{m_0} \equiv y$ has a unique solution modulo $m_0$. In other words, $p$ defines a bijection from the ordered set $\{0, \ldots, m_0 - 1\}$ to some permutation of that set. Thus we can think of $(x + z + r p) \bmod m_0$ as denoting some permutation of the set $\{0, \ldots, m_0 - 1\}$, shifted by $(x + z) \bmod m_0$.² In other words, if we rotate the last element to the first position $x + z$ times, then we have the set defined by $f(x) = (x + z + r p) \bmod m_0$. Since this is also a permutation of $\{0, \ldots, m_0 - 1\}$, and since $r < m_0$, each replica of an object maps to a unique server, as shown in Figure 4.

² The number of unique permutations of $\{0, \ldots, m - 1\}$ which can be obtained by multiplying by a coprime of $m$ is equal to the Euler Phi Function $\varphi(m)$, as described in Section 4.3.

Figure 4. The mapping of the ordered set of integers $\{0, \ldots, m_j - 1\}$ to a permutation of that set using the function $f(x) = (x + z + r p) \bmod m_j$. (Diagram labels: $(x + z) \bmod m$ and $(r \times p) \bmod m$.)

For the induction step, assume that each cluster is weighted by some per-server (unnormalized) weight $w_j$ where $j < c$, and that all of the object replicas in the system are distributed randomly over all of the servers according to each server's respective weight (defined by the server's cluster). If we add a cluster $c$ containing $m_c$ servers, then $w_c m_c$ is the total weight allocated to cluster $c$. Since a given object replica is placed in cluster $c$ with probability $\frac{w_c m_c}{W}$, the expected number of objects placed in cluster $c$ is $\frac{w_c m_c}{W} N R$. As in the base case, the object replicas will be distributed over the servers in cluster $c$ uniformly, so the expected number of object replicas allocated to a server in cluster $c$ is $\frac{w_c}{W} N R$, which is what we wanted to show.

Since $p$ defines a bijection between the ordered set $\{0, \ldots, m_c - 1\}$ and some permutation of that set, each replica that is placed in cluster $c$ is placed on a unique server. Note that at most $m_c$ out of $R$ replicas of a given object can be placed in cluster $c$, since the other $R - m_c$ replicas will be mapped mod $R$ to values which are greater than or equal to $m_c$ when $m_c < R$. Thus, no two replicas of the same object get placed on the same server. Furthermore, following the same argument as given in Section 3.2 (omitted here for the sake of brevity), the algorithm moves (approximately) the optimal number of objects during a reorganization where a single cluster is added.

4.3 Choosing Prime Numbers

Our algorithm uses a random prime number, which must be known by every server and client in the system. It is sufficient to choose a random prime from a large pool of primes. This prime $p$ will be relatively prime to any modulus $m < p$, as will $p \bmod m$. Furthermore, choosing a random prime and computing $p \bmod m$ is statistically equivalent to making a uniform draw from the set $Z_m = \{x < m : \gcd(x, m) = 1\}$ of integers which are relatively prime to $m$. A proof of this is beyond the scope of this paper. The number of integers in the set $Z_m$ (these relatively prime integers will be called coprimes for the remainder of this section) is described by the Euler Phi Function:

$\varphi(m) = m \prod_{p \mid m} \frac{p - 1}{p}$

where $p \mid m$ ranges over the set of all $p$ such that $p$ is a prime factor of $m$ [9]. Since $\varphi(m) < m!$ when $m > 2$, the number of bijections described by the set of coprimes to $m$ is smaller in general than the number of possible permutations of a set of integers $\{0, \ldots, m - 1\}$. It is also beyond the scope of this paper to show the precise statistical impact of this difference. The practical impact of this difference, however, can be seen in Figure 6(c).

5 Performance and Operating Characteristics

5.1 Theoretical Complexity

In this section we demonstrate that our algorithm has time complexity of $O(nr)$ where $n$ is the number of server additions made, and $r$ is the time it takes to generate an appropriate random number. The algorithm that we are currently using to generate random numbers takes $O(\log n)$ time.
This can theoretically be reduced to $O(1)$. As noted in Section 4.3, appropriate prime numbers can be chosen in $O(1)$ time, and the rest of the operations other

than those related to generating random numbers are arithmetic, so every operation besides those used for generating random numbers runs in $O(1)$ time. The algorithm for seeding and actually generating random numbers is also constant time [26]. The algorithm for jump-ahead, or advancing the random number generator a given number of steps without calculating intermediate values, however, takes $O(\log n)$ time. Specifically, the algorithm for jump-ahead requires modular exponentiation, which is known to run in $O(\log n)$ time [9]. Since we must jump ahead by the cluster group number each iteration, each iteration of the algorithm takes, on average, $O(\log n)$ time.

In the worst case, an object replica will be placed in the first server cluster, in which case the algorithm must examine every cluster to determine where the object belongs. The average case depends on the size and weighting of the different clusters, and thus is not a good metric for performance. If the weights and cluster sizes are distributed evenly, then clearly we will need, on average, $\frac{n}{2}$ iterations. However, we believe that newer clusters will tend to have exponentially higher weights, so that in the average case, we only need to calculate $\log n$ iterations.

Rather than using jump-ahead to generate statistically random hash values that are parameterized by the server cluster number, we have examined another approach using parametric random number generators [8]. These random number generators are popular for distributed random number generation. By parameterizing the generated sequence, the generators can assign a different parameter to each processor in a cluster, while using the same seed. This guarantees unique, deterministic pseudo-random number sequences for each processor. One simple method, based on Linear Congruential Generators (LCGs) [8], allows the parameterization to occur in $O(1)$ time. LCGs, however, are notorious for generating numbers which all lie on a higher dimensional hyperplane, and thus are strongly correlated for some purposes.
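The link between jump-ahead and modular exponentiation can be made concrete with a small Python sketch. It assumes a multiplicative congruential generator with the widely used Park-Miller constants; the generator actually used by the authors [26] is not described in this section, so both the choice of generator and the function names are assumptions made for illustration.

```python
# Park-Miller "minimal standard" constants, chosen here only as an example.
MODULUS = 2**31 - 1
MULTIPLIER = 48271


def step(state):
    """Advance the generator one step: x_{k+1} = a * x_k mod M."""
    return (MULTIPLIER * state) % MODULUS


def jump_ahead(state, steps):
    """Advance `steps` steps at once using x_{k+s} = a^s * x_k mod M.

    Python's three-argument pow performs modular exponentiation by repeated
    squaring, so the cost grows with log(steps) rather than with steps.
    """
    return (pow(MULTIPLIER, steps, MODULUS) * state) % MODULUS


seed = 12345                 # any value in 1 .. MODULUS - 1
sequential = seed
for _ in range(1000):
    sequential = step(sequential)

print(sequential == jump_ahead(seed, 1000))
```

The comparison confirms that skipping 1000 steps in one operation reaches the same state as stepping 1000 times, which is the property the lookup loop needs when it advances the generator by the cluster number.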
Unfortunately, this correlation among LCG outputs results in very poor distribution of objects in our algorithm, making LCGs unusable for object distribution. We are currently examining other, more sophisticated generators.

As a final note, our algorithm does actually support $O(n)$ operation, but this is mostly of theoretical interest. $O(n)$ operation can be achieved as follows. On the first iteration, seed the generator and advance it $n$ steps, as would normally be done. Next, instead of re-seeding the generator and advancing it $n - 1$ steps, retain the state of the generator (do not reseed it), and then advance it by the period of the generator (in this case, the maximum value of an unsigned long integer) minus 1. Since the period of the generator is a known quantity which does not depend on $n$, this can be done in $O(1)$ time. Of course, advancing the generator by such a large quantity is very slow, so the classification as $O(n)$ is of academic interest only.

5.2 Performance

In order to understand the real world performance of our algorithm, we tested the average time per lookup under many different configurations. First, we ran a test in which 4, object replicas were placed into configurations starting with 10 servers in a single cluster to isolate the effect of server addition. We computed the average time for these 4 lookups, and then added clusters of servers, 10 servers at a time, and timed the same 4, lookups over the new server organization. In Figure 5(a), we can see that the line for lookups under this configuration grows faster than linear, but much slower than $n \log n$.

In Figure 5(b), there are two lines which grow approximately logarithmically. Since disk capacity has been growing exponentially [12], we also consider the performance of the algorithm when the weight of (and hence the number of objects assigned to) new clusters grows exponentially. The bottom line illustrates a 5% growth in capacity between cluster additions, and the middle line represents a 1% growth.
The weighting of new servers can therefore significantly improve the performance of the algorithm. This is consistent with the predictions made in the preceding analysis.

[Figure 5: two plots of computation time (us) versus number of clusters. (a) Time per lookup compared to linear and n log n functions. (b) Time per lookup with no weighting (even distribution) and exponential weighting (1% and 5%).]
Figure 5. Time for looking up an object versus the number of server clusters in the system. All times computed on an Intel Pentium III 45.

5.3 Failure Resilience

When a server fails, clients must read and write to other servers for each of the objects stored on the failed server. If the replicas of all objects on a particular server are stored on the same set of servers, e.g., if all of the replicas for objects on server 3 are stored on server 4 and server 5, then a server failure will cause the read load on the mirror servers to increase by a factor of (R - 1)/R, where R is the degree of object replication (meaning that the load on each of the mirror servers nearly doubles). This value assumes that the replicated clients are not using quorums for reads; if they are, all mirrors participate in reads, so that there will be no increase in load. This is a false benefit, however, since it is achieved by using resources inefficiently during normal operation; (R - 1)/R can be a severe burden when R is 2 or 3, as likely will be used in large-scale systems.

In order to minimize the load on servers during a failure, our algorithm places replicas of objects pseudo-randomly, so that when a server fails, the load on the failed server is absorbed by every other server in the system. Figure 6(a) shows a histogram of the distribution of objects which replicate objects on server 6. In this case the load is very uniform, as it is in Figure 6(b), where the weight of each server cluster increases. In Figure 6(c), we see several spikes, and several servers which have no replicas of objects on server 6. This occurs because the cluster with which server 6 was added is of size 12, which is a composite number (12 = 2 × 2 × 3). Depending on the degree of replication and the number of distinct prime factors of the size of the cluster, if the size of a cluster is composite, some empty spots may occur in the cluster. Even when the cluster size is composite, the objects are distributed relatively uniformly over most of the servers. Clearly such a distribution is far superior to a simplistic sequential distribution as illustrated in Figure 6(d), in which a few servers in the system (R - 1, where R is the degree of replication, to be exact) will take on all of the load from the failed server. Instead, our algorithm distributes load from failed servers nearly uniformly over all of the working servers in the system.

[Figure 6: four histograms of number of objects versus server ID. (a) Server 6 fails in a system with 4 evenly weighted clusters of 5 servers. (b) Server 6 fails in a system with 4 clusters of 5 servers, each cluster having increasing weight. (c) Server 6 fails in a system with 2 clusters of 5 servers, and 1 cluster of 12 servers. The failed server is in the cluster of 12 servers. (d) Server 6 fails in a system with 4 clusters of 5 servers, where object replicas are distributed to adjacent servers.]
Figure 6. The distribution of the replicas of objects stored on a failed server, where the server fails under different system configurations. A total of 3, objects are stored in the system.
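The contrast between the two placement styles can be reproduced with a small simulation. This is only a sketch under stated assumptions: `place_random` is a stand-in that draws a per-object seeded sample and is not the placement algorithm of this paper, `place_grouped` is one simple way to realize a fixed mirror set, and all names and parameters are hypothetical.

```python
import random
from collections import Counter


def place_grouped(obj, num_servers, replicas):
    """Fixed mirror sets: servers are split into groups of `replicas`."""
    group = obj % (num_servers // replicas)
    return [group * replicas + i for i in range(replicas)]


def place_random(obj, num_servers, replicas):
    """Stand-in for pseudo-random placement: a per-object seeded sample."""
    return random.Random(obj).sample(range(num_servers), replicas)


def takeover_load(place, failed, num_servers, replicas, num_objects):
    """Count, per surviving server, the objects it shares with `failed`."""
    load = Counter()
    for obj in range(num_objects):
        servers = place(obj, num_servers, replicas)
        if failed in servers:
            load.update(s for s in servers if s != failed)
    return load
```

With 20 servers and two replicas, `takeover_load(place_grouped, 6, 20, 2, 30000)` puts every affected object on server 7 alone, while the same call with `place_random` spreads the affected objects over all 19 surviving servers.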

6 Operational Issues

Our algorithm easily supports two desirable features for large-scale storage systems: online reconfiguration for load balancing, and variable degrees of replication for different objects.

6.1 Online Reconfiguration

Our algorithm easily allows load balancing to be done online while the system is still servicing object requests. The basic mechanism is to identify all of the sets that will move from an existing disk to a new one; this can be done by iterating over all possible values of x to identify those sets that will move. Note that our balancing algorithm will never move any objects from one existing disk to another existing disk; objects are only moved to new disks. This identification pass is very quick, particularly when compared to the time required to actually copy objects from one disk to another.

During the process of adding disks, there are two basic reasons why the client might not locate the object at the correct server. First, server clusters may have been reconfigured, but the client may not have updated its algorithm configuration and server map. In that case, the client can receive an updated configuration from the server from which it requested the object in question, and then re-run our algorithm using the new configuration. Second, the client may have the most recent configuration, but the desired object has not yet been moved to the correct server. In that case, if the client thought that the object replica should be located in cluster j, but did not find it, it can simply continue searching as if cluster j had not been added yet. Once it finds the object, it can write the object in the correct location and delete it from the old one. Different semantics for object locking and configuration locking will be necessary depending on other parameters in the system, such as the commit protocol used, but our algorithm is equally suited for online or batch reorganization.
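The second case above, where the client has the current configuration but the object has not yet been moved, can be sketched as follows. The placement rule here is a toy stand-in built on a hash and is not the algorithm of this paper; it only preserves the property that the search proceeds from the newest cluster to the oldest. The names `candidate_locations`, `read_and_migrate`, and the dictionary `store` are hypothetical.

```python
import hashlib


def _unit_hash(*parts):
    """Deterministic value in [0, 1) derived from the given parts."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def candidate_locations(obj, cluster_sizes):
    """Yield (cluster, server) candidates, newest configuration first.

    The first candidate is the correct location. Each later candidate is
    where the object would live if the previous cluster had not been added.
    """
    for j in range(len(cluster_sizes) - 1, -1, -1):
        share = cluster_sizes[j] / sum(cluster_sizes[: j + 1])
        if j == 0 or _unit_hash(obj, j) < share:
            server = int(_unit_hash(obj, j, "server") * cluster_sizes[j])
            yield (j, server)


def read_and_migrate(obj, cluster_sizes, store):
    """Read obj, moving it to its correct location if it has not migrated."""
    candidates = candidate_locations(obj, cluster_sizes)
    correct = next(candidates)
    if (correct, obj) in store:
        return store[(correct, obj)]
    for old in candidates:
        if (old, obj) in store:
            data = store.pop((old, obj))    # delete from the old location
            store[(correct, obj)] = data    # write to the correct location
            return data
    raise KeyError(obj)
```

In this toy model an object either stays where it was or moves into the newly added cluster, so a reader that misses at the correct location finds the object at the next candidate.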
6.2 Adjustable Replication

Our algorithm allows the degree of replication of any or all of the objects to vary over time, with the following constraint: when the system is initially configured, the administrator must set the maximum degree of replication. This value can be no more than the size of the initial cluster (since we must have a unique location in which to place all replicas). The client can then decide on a per-object basis how many replicas to place. If it places fewer than the maximum number possible, the spots for the remaining replicas can be used if a higher degree of replication is desired at a later time. Practically speaking, a client might use per-file metadata to determine the degree of replication of the different objects which compose a file in an OBSD.

7 Future Work

Our algorithm distributes data evenly and handles disk failures well, but there are further issues we are currently investigating. We are studying a more efficient parameterizable random number generation or hashing function, which will make the worst-case performance of the algorithm O(n). In addition, we are studying a modification to the algorithm which will allow for cluster removal. In exchange for this capability, the algorithm will need to look up all R replicas at once. This should not significantly affect performance if locations are cached after they are calculated. We are also considering the exact protocols for the distribution of new cluster configuration information. These protocols will not require any global locks on clients, and in some cases where optimistic locking semantics are acceptable, will not require any locks at all. We are considering different read/write semantics for different types of storage systems, and are integrating this algorithm into a massively scalable cluster file system.
Finally, we are considering a fast-recovery technique that automatically creates an extra replica of any object affected by a failure in order to significantly increase the mean time to failure for a given degree of replication [27].

8 Conclusions

The algorithm described in this paper exhibits excellent performance and distributes data in a highly reliable way. It also provides for optimal utilization of storage with increasing storage capacity, and achieves balanced distribution by moving as little data as possible. The use of weighting allows systems to be built from heterogeneous clusters of servers. In addition, by using replica identifiers to indicate the location of different stripes of an object, we can also use our algorithm to place stripes for Reed-Solomon coding or other similar striping and data protection schemes. Using these techniques, it will be possible to build multi-petabyte storage systems that can grow in capacity and overall performance over time while balancing load over both old and new components.

Acknowledgments

Ethan Miller was supported in part by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B

References

[1] A. Adya, W. J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002. USENIX.
[2] Y. Breitbart, R. Vingralek, and G. Weikum. Load control in scalable distributed file structures. Distributed and Parallel Databases, 4(4).
[3] A. Brinkmann, K. Salzwedel, and C. Scheideler. Efficient, distributed data placement strategies for storage area networks. In Proceedings of the 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA). ACM Press, 2000. Extended Abstract.
[4] A. Brinkmann, K. Salzwedel, and C. Scheideler. Compact, adaptive placement schemes for non-uniform capacities. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 53-62, Winnipeg, Manitoba, Canada, Aug. 2002.
[5] S.-C. Chau and A. W.-C. Fu. A gracefully degradable declustered RAID architecture. Cluster Computing Journal, 5(1):97-105, 2002.
[6] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2), June.
[7] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A distributed anonymous information storage and retrieval system. Lecture Notes in Computer Science, 2009:46+, 2001.
[8] P. D. Coddington. Random number generators for parallel computers. NHSE Review, 1(2).
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press, Cambridge, Massachusetts, 2001.
[10] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), Banff, Canada, Oct. 2001. ACM.
[11] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92-103, San Jose, CA, Oct.
[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 3rd edition, 2003.
[13] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51-81, Feb.
[14] B. Kröll and P. Widmayer. Distributing a search tree among a growing number of processors. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. ACM Press.
[15] B. Kröll and P. Widmayer. Balanced distributed search trees do not exist. In Proceedings of the 4th International Workshop on Algorithms and Data Structures. Springer, Aug.
[16] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, MA, Nov. 2000. ACM.
[17] W. Litwin, J. Menon, and T. Risch. LH* schemes with scalable availability. Technical Report RJ 10121 (91937), IBM Research, Almaden Center, May.
[18] W. Litwin, M. Neimat, G. Levy, S. Ndiaye, T. Seck, and T. Schwarz. LH*S: a high-availability and high-security scalable distributed data structure. In Proceedings of the 7th International Workshop on Research Issues in Data Engineering, 1997, Birmingham, UK, Apr. IEEE.
[19] W. Litwin and M.-A. Neimat. High-availability LH* schemes with mirroring. In Proceedings of the Conference on Cooperative Information Systems.
[20] W. Litwin, M.-A. Neimat, and D. A. Schneider. LH*: a scalable, distributed data structure. ACM Transactions on Database Systems, 21(4):480-525.
[21] W. Litwin and T. Risch. LH*g: a high-availability scalable distributed data structure by record grouping. IEEE Transactions on Knowledge and Data Engineering, 14(4), 2002.
[22] W. Litwin and T. Schwarz. LH*RS: A high-availability scalable distributed data structure using Reed-Solomon codes. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, May 2000. ACM.
[23] M. Ripeanu, A. Iamnitchi, and I. Foster. Mapping the Gnutella network. IEEE Internet Computing, 6(1):50-57, Aug. 2002.
[24] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), Banff, Canada, Oct. 2001. ACM.
[25] R. O. Weber. Information technology: SCSI object-based storage device commands (OSD). Technical Council Proposal Document T10/1355-D, Technical Committee T10, Aug. 2002.
[26] B. A. Wichmann and I. D. Hill. Algorithm AS 183: An efficient and portable pseudo-random number generator. Applied Statistics, 31(2):188-190.
[27] Q. Xin, E. L. Miller, D. D. E. Long, S. A. Brandt, T. Schwarz, and W. Litwin. Reliability mechanisms for very large storage systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies. IEEE, Apr. 2003. To appear.