Take me to your leader! Online Optimization of Distributed Storage Configurations

Artyom Sharov, Alexander Shraer, Arif Merchant, Murray Stokely
{shraex, aamerchant,
Google, Inc.

Computer Science Department, Technion, Israel. Work done as part of a summer internship in Google's Distributed Storage Analytics team.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12. Copyright 2015 VLDB Endowment /15/08.

ABSTRACT

The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the best configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader and quorum based replication schemes and divide the framework into three optimization tiers, dynamically optimizing different configuration aspects: 1) leader placement, 2) roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally in Google and demonstrate that most client applications significantly benefit from using our framework, reducing average operation latency by up to 94%.

1. INTRODUCTION

Storage is changing from being mostly in-house and local to become a fully globally-distributed service. Cloud storage services such as Amazon S3, Microsoft Azure, and Google Cloud Storage form the underpinnings of many Internet services with clients distributed all over the world. Typically, applications need continuous availability and reasonably good data access latency, which translates into storing multiple data replicas in different geographic locations. Distributed storage systems usually provide consistency across replicas of data (or metadata) using serialization or conflict resolution protocols. Distributed atomic commit or consensus-based protocols are often used when strong consistency is required, while simpler protocols suffice when replicas need only preserve weak or eventual consistency. Such protocols typically define multiple possible roles for the replicas, such as a leader or master replica coordinating updates, and appoint some of the replicas to participate in the commit protocol.

The performance of such a system depends on the configuration: how many replicas, where they are located, and which roles they serve; for optimal performance, the configuration must be tuned to the workloads. Different applications using the same storage service may have completely different workloads. For example, a logging system may use the storage mostly for writes and have relatively few clients, while an application responsible for access control may be read heavy. The workloads can be extremely variable, both in the short term (for example, load may come from different parts of the world at different times of the day for a social application) and in the long term (the administrators of the service may reconfigure their servers periodically, causing different load patterns). Long term workload variation could also be due to organic changes in the demands on the Internet service itself; for example, if a shopping service becomes more popular in a region, its demands on the underlying storage system may shift. The cloud storage service must adapt to these changes seamlessly.
In fact, elasticity is an integral part of cloud computing and one of its most attractive premises.

Reconfiguring a distributed storage system at run-time while preserving its consistency and availability is a very challenging problem, and misconfigurations have been cited as a primary cause for production failures [29]. Due to its practical significance the problem has received abundant attention in recent years both in academia and in the industry (see Section 7), focusing mainly on the design and implementation of efficient and non-disruptive mechanisms for reconfiguration. Yet, little insight exists on how to set policies and use reconfiguration mechanisms in order to improve system performance. For example, dynamic reconfiguration APIs have recently been added to the Apache ZooKeeper distributed coordination system [25, 19] and users have since been asking for automatic workload-based configuration management, e.g. [6]. At Google, site reliability engineers (SREs) are masters in the black art of determining deployment policies and tuning system parameters. However, hand-tuning a cloud storage system that supports hundreds of distinct workloads is difficult and prone to misconfigurations.

We describe the design and implementation of a workload-driven framework for automatically and dynamically optimizing the replication policy of distributed storage systems. We showcase our framework by optimizing a large-scale distributed storage system used internally in Google. In this work we focus on operation latency as the main optimization objective. We assume that the various aspects of load distribution and balancing are addressed by the underlying storage system, as is the case for our storage system and for many other massively scalable systems.

Section 2 contains an overview of our storage model. In short, users define databases which are then partitioned and replicated [1, 10, 14, 15, 16, 17, 26]. The replication policy is defined by the database administrator, usually an SRE responsible for a specific client application. For example, if most writers are expected to reside in western Europe, the administrator will likely place replicas in European locations, and make some of them voters so that a quorum required to commit state changes can be collected in Europe without using expensive cross-continental network links. One of the voters (elected dynamically) acts as a leader and coordinates the replication. Since leaders are involved in many of the operations, their location significantly affects latency.

Section 3 describes the first tier of our optimization framework, which, given replica locations and roles, optimizes leader placement. Our algorithm uses a detailed characterization of the relevant operations supported by the storage system. Many of the previously proposed methods, such as placing the leader close to the writers (previously suggested, for example, in the context of Google Megastore or Yahoo! PNUTS), may work for one workload but not for another. In contrast, we leverage Google's monitoring infrastructure to dynamically and accurately track the global workload of each database, as well as network latencies. Our evaluation, using both simulations with production workloads and production experiments, presented in Section 6, shows that our method is fast, accurate and significantly outperforms previously proposed "common sense" heuristics.
We show that it results in a substantial speed-up for the vast majority of databases in our system, halving the operation latency for 17% of the databases and reducing the latency by up to 94% for others.

The goal of the second optimization tier, presented in Section 4, is, given replica locations, to determine the best replica roles in addition to the optimal leader placement. While the number of voter replicas is usually determined based on availability requirements, their locations are flexible. For example, if 2 simultaneous replica failures must be tolerated, 5 voting replicas will be used and our system determines which 5 replicas are to be voters based on the observed workload. Usually in such systems the leader is one of the voters, and thus optimizing voter placement affects not only the latency of commits but also that of other operations involving the leader. Our evaluation in Section 6 confirms the benefits of dynamic role assignment for both write and read heavy workloads, achieving a speed-up of up to 50% on top of the first optimization tier for some databases.

Finally, in Section 5 we present two new algorithms used to determine replica locations (in addition to choosing replica roles and a leader), which constitutes our third optimization tier. For example, if a database suddenly experiences a surge in client operations coming from Asia, we will detect the change in workload and identify the best replica locations to minimize latency. The algorithm additionally determines which among the replicas should be voters and which voter should be leader. Section 6 shows that the two algorithms are near optimal and identifies workloads where each of the algorithms is preferable. We also show how these algorithms can be used to determine the desired number of replicas.

The optimization algorithms are dynamic: the best configuration for a given time interval is determined considering the workload in previous time intervals, weighted according to recency. The algorithms can thus react to workload changes without being overly sensitive to spikes.
In summary, the main contribution of this paper is the design, implementation and evaluation of an optimization framework for dynamically optimizing the replication policy of leader-based storage systems. Our system frees administrators from manually and periodically trying to adjust database replication based on the currently observed workloads, which is a very difficult task since such systems usually support hundreds of distinct workloads and multiple operation types, each involving a series of message exchanges. Our system uses monitoring information to perform the optimization automatically, allowing a detailed and separate consideration of each workload. Our evaluation demonstrates dramatic latency improvements for a production distributed storage system used in Google.

2. STORAGE MODEL

We assume a common distributed storage model combining partitioning and replication to achieve scalability and fault tolerance: users (administrators) define databases, each database is sharded into multiple partitions, and each partition is replicated separately [1, 10, 14, 15, 16, 17, 26]. We call these partitions replication groups. A single replication policy, defined by the database administrator, governs all replication groups in a database and determines the configuration of each group: the number of replicas, their locations and their roles in the replication protocol (see footnote 1).

A replica can be either read-write or read-only. Both are full replicas and can serve reads locally. Whereas read-write replicas vote on commits, read-only replicas are notified of state changes only after they occur. Such categorization exists in many systems, such as acceptors and learners in Paxos [21] or participants and observers in ZooKeeper [19]. One of the voting replicas in every replication group is chosen as the group leader. The leader is chosen dynamically and when it fails, a different leader is elected.
Leaders are typically responsible for coordinating replication in their group, for locking and concurrency control to support transactions, and for participating in transactions involving multiple groups.

We denote the voting replicas of a group $g$ by $V(g)$. Recall that all groups in a database share the same configuration, and hence are replicated across the same locations (i.e., clusters or datacenters). Since we are solely interested in the location of the replicas, we simply use $V(d)$ to refer to $V(g)$ for any group $g$ in a database $d$. Similarly, $R(d)$ refers to the set of all replicas (read-write and read-only). We omit $d$ and $g$ when they are clear from the context. A client is denoted by $c$, the replica closest to the client (in terms of network latency) by $\mathrm{nearest}(c, R)$ (note that it is taken from the set $R$), and the leader of a group $g$ by $\mathrm{leader}(g)$. Finally, we denote the set of client locations by $C$ and the universe of potential replica locations by $S$ (note that $R \subseteq S$).

Footnote 1: The storage system we used for evaluation supports multiple replication policies per database; however, few databases make use of this feature and all such databases perform manual configuration tuning.
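The notation above can be made concrete with a minimal Python sketch. The nested-dictionary layout `rtt[a][b]`, the location names, and the latency values are illustrative assumptions, not part of the system described here; only the meaning of nearest(c, R) and the containment of R in S follow the text.

```python
from typing import Dict, Iterable

# rtt[a][b]: average round-trip time in ms between locations a and b.
# The nested-dict layout is an assumption made for this sketch.
Rtt = Dict[str, Dict[str, float]]


def nearest(client: str, replicas: Iterable[str], rtt: Rtt) -> str:
    """Return the replica with the lowest round-trip time to `client`.

    Ties are broken by name so that the result is deterministic.
    """
    return min(sorted(replicas), key=lambda r: rtt[client][r])


# Hypothetical example: S is the universe of locations, R the replicas in use.
S = {"us", "eu", "asia"}
R = {"us", "eu"}
example_rtt: Rtt = {"client_eu": {"us": 100.0, "eu": 8.0, "asia": 240.0}}
print(nearest("client_eu", R, example_rtt))
```

Note that nearest is evaluated against whichever set is passed in, so the same client can map to a different replica when R changes.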

Operations are invoked by clients, and multiple operations may be executed together as part of transactions. Transactions may involve one or more groups within a database. Next, we identify five representative operations, most commonly supported (under various names) in replicated distributed storage systems. Our framework optimizes operation latency. As such it requires knowledge of the flow of each operation. Hence, for each operation, we give a possible message flow which we use to showcase our optimization framework in the following sections. This flow is simplified and actual implementations may use various optimizations, which may in turn require adjustments to the optimization formulation. However, such optimizations are orthogonal to the contributions of this paper.

weak read. A read from one of the replicas (typically the replica closest to the client). As replicas may lag behind the leader, the read can return stale values. Typically, a client $c$ sends a request to $\mathrm{nearest}(c, R)$, which replies with its local copy of the requested data.

bounded read. This read typically includes a bound on the staleness of returned values. If $\mathrm{nearest}(c, R)$ is not sufficiently up to date, it forwards the request to the leader $\mathrm{leader}(g)$, which responds with the latest version of the data. Once updated, $\mathrm{nearest}(c, R)$ responds to the client.

strong read. This read returns the last committed value, typically returned by the leader. In some systems, clients read directly from the leader; in others, the read is relayed through $\mathrm{nearest}(c, R)$.

For example, Yahoo! PNUTS [15] supports all three types of reads. Apache ZooKeeper, as well as Amazon SimpleDB [2], DynamoDB [1], MongoDB [4], and many others, support only the weak (sometimes called "eventually consistent") and the strong (sometimes called "consistent") read types. Often, instead of providing an explicit API for bounded reads, systems such as ZooKeeper make guarantees about the maximum allowed staleness of replicas.
The choice among the reads may be explicitly made by the user or automatically done by the storage system based on higher level user preferences (such as in MongoDB or Riak [3]). Next, we define two state update operations:

single-group transaction. This is a state-changing operation or transaction that involves the data of a single replication group $g$. It is executed by $\mathrm{leader}(g)$ on behalf of a client $c$, and, for high availability, requires $\mathrm{leader}(g)$ to persist the state changes on a quorum (usually a majority) of voting replicas. To this end, $\mathrm{leader}(g)$ sends messages to the voting replicas $V(g)$ and waits for a quorum to respond. If the minimal required set of affirmative responses is collected, the transaction commits and a commit message is sent to the client as well as to the group replicas.

multi-group transaction. When committing a transaction involving data from multiple groups, a distributed atomic commit protocol must be executed across the groups in order to either commit or abort the transaction atomically in all involved groups. Typically, this protocol is two-phase commit and the leader of one of the groups involved in the transaction acts as the coordinator while the others are participants. In order to be fault tolerant, every step of the protocol must be agreed upon by the members of each group, and not only by its leader. For the purposes of this paper, we assume the following simple protocol: the coordinator leader $\mathrm{leader}(g)$ broadcasts a prepare message to participant leaders. Each participant leader $\mathrm{leader}(g')$ checks locally whether the transaction may be committed and if so persists its intention to commit to a quorum of voters $V(g')$. It then sends an ack message to $\mathrm{leader}(g)$. Once all participants have acked, $\mathrm{leader}(g)$ commits the transaction by persisting it to a quorum of $V(g)$ and responds to the client. It then sends commit messages to the participant leaders, which in turn commit the operation in their respective groups by persisting the state changes to a quorum.
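The quorum wait that both transaction types rely on can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the quorum is taken to be a simple majority of the voters, the leader's own acknowledgement is included in the list (typically at a time close to zero), and the function names and millisecond values are made up for the example.

```python
import math
from typing import List, Optional


def majority(num_voters: int) -> int:
    """Smallest number of voters that forms a majority quorum."""
    return math.ceil((num_voters + 1) / 2)


def commit_time(ack_times_ms: List[float], num_voters: int) -> Optional[float]:
    """Time at which the leader has collected a majority of acknowledgements.

    `ack_times_ms` holds the arrival time of each acknowledgement received,
    including the leader's own. Returns None if no quorum was reached.
    """
    k = majority(num_voters)
    if len(ack_times_ms) < k:
        return None
    # The commit point is the arrival of the k-th fastest acknowledgement.
    return sorted(ack_times_ms)[k - 1]


# Five voters: the leader commits once the third acknowledgement arrives.
print(commit_time([0.0, 12.0, 30.0, 80.0, 150.0], num_voters=5))
```

The slowest voters do not affect the commit time, which is why the placement of the fastest majority relative to the leader matters most for write latency.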
For example, VoltDB [26], HyperDex [17], DynamoDB and Microsoft Orleans [14] support both transaction types, whereas PNUTS [15], ZooKeeper [19] and many NoSQL stores only support variants of single-group transactions.

3. TIER 1: LEADER PLACEMENT

In this work, we focus on optimizing operation latency. From the description in Section 2, it is easy to see that operation latency is affected by the location of the client, the location of the replica closest to the client, the locations of the read-write replicas, and finally, by the choice of the leader among the read-write replicas. Our first algorithm, described in this section, optimizes leader placement without modifying the server roles and locations ($V$ and $R$) given by the database administrator.

Consider a database with one thousand groups and five read-write replicas per group, each of which can become leader. It may seem that in order to optimize leader placement for such a database we have to consider 5000 different placement configurations. Unfortunately, this is not the case. Multi-group transactions involve several groups, whose leaders execute a distributed atomic commit protocol. Changing the placement of one group leader may therefore impact the optimal placement of the leaders of other groups. In fact, for our numerical example, in the worst case the number of different placement options is $5^{1000}$.

In order to achieve a practical solution we must reduce the solution space drastically. We chose to optimize leader placement at the granularity of a database instead of a single group. Since all groups in a database are replicated in the same way, this method reduces the solution space to choosing one of the read-write replica locations for this database. In practice, while our optimization algorithm outputs a single location, the storage system may place the different leaders of groups belonging to the database close to this location, taking various constraints related to load balancing, failure diversity, etc., into account.
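The arithmetic behind the numerical example can be checked with a short Python sketch. It assumes only the figures given in the text (one thousand groups, five read-write replicas each); the variable names are illustrative.

```python
num_groups = 1000
replicas_per_group = 5

# If each group could be optimized on its own, the work would grow linearly.
independent_options = num_groups * replicas_per_group

# Multi-group transactions couple the choices, so in the worst case every
# combination of per-group leaders is a distinct placement.
joint_options = replicas_per_group ** num_groups

# Optimizing per database leaves one choice among the read-write locations.
per_database_options = replicas_per_group

print(independent_options, len(str(joint_options)), per_database_options)
```

The joint count is a number with hundreds of decimal digits, which is why exhaustive per-group search is not an option and a coarser granularity is needed.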
We denote the leader location produced by our algorithm for a database in the $i$-th time interval by $\lambda^{(i)}$. For simplicity, we focus on optimizing the average operation latency; this metric is generalized in Section 3.1. Intuitively, to minimize average operation latency we compute it for every possible leader location (any read-write replica can potentially become leader), and then choose the location yielding minimum average latency. Assuming that different clients in the same cluster experience similar latencies when communicating with servers, we logically group clients within each cluster and consider client clusters rather than individual clients in our analysis.

Since different operation types may have different latency profiles, we compute a weighted average. Specifically, at interval $i$ we (a) determine the average latency $t^{(i)}_{\alpha,c}(\ell)$ of each type of operation $\alpha$ from every client cluster $c$, for each potential leader location $\ell$ ranging over the set $V$ of read-write replica locations as defined for the database, and (b) quantify the number of operations $n^{(i)}_{\alpha,c}$ of each type $\alpha$ for each client cluster $c$. Finally, we choose a leader $\lambda^{(i)}$ that minimizes the following expression (for simplicity we omit the denominator of the weighted average):

$$\lambda^{(i)} = \arg\min_{\ell \in V} \{\mathrm{score}^{(i)}(\ell)\}, \quad \text{where} \quad \mathrm{score}^{(i)}(\ell) = \sum_{\alpha,c} t^{(i)}_{\alpha,c}(\ell) \cdot n^{(i)}_{\alpha,c}.$$

The location $\lambda^{(i)}$ can then serve as a prediction for the optimal location $\lambda^{(i+1)}$ in the $(i+1)$-st time interval.

Next, we calculate $t^{(i)}_{\alpha,c}(\ell)$ by considering the flow of different operation types described in Section 2. We assume that leader identities are exposed to the clients, and hence clients send strong reads and transactions directly to the relevant leaders. Denote the average roundtrip-time latency between nodes $a$ and $b$ in time interval $i$ by $\mathrm{rtt}^{(i)}_{a,b}$. For weak reads, the average latency is simply $\mathrm{rtt}^{(i)}_{c,\mathrm{nearest}(c,R)}$, and similarly for strong reads, given a candidate leader location $\ell$, we get $\mathrm{rtt}^{(i)}_{c,\ell}$. For bounded reads, using the linearity of expectation, we have that the average latency is $\mathrm{rtt}^{(i)}_{c,\mathrm{nearest}(c,R)} + \mathrm{rtt}^{(i)}_{\mathrm{nearest}(c,R),\ell}$, which corresponds to a roundtrip between the client and the nearest replica and another roundtrip between that replica and the leader. For simplicity, we do not distinguish here between bounded reads served locally versus those forwarded to the leader, but do implement the distinction.

Calculating single-group transaction latency is slightly more complex, as it requires the leader to wait for a majority of voter responses, i.e., for the median fastest response. Unfortunately, the median and average operations don't commute in general. We found, however, that cross-cluster latency distributions tend to be very narrow around their respective mean values and overlap only for tail latencies. They can therefore be approximated as fixed values for computing the average of the median. The median of average rtt latencies is therefore a very good approximation for the average of median rtt latencies.
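The approximation in the last paragraph can be illustrated numerically. The sketch below is an assumption-laden toy, not a measurement: it models each leader-to-voter latency as a narrow Gaussian (1 ms standard deviation around well-separated means, with the leader's own response centered at zero) and compares the median of the per-voter averages with the average of the per-sample medians.

```python
import random
import statistics
from typing import List, Tuple


def compare_median_and_average(
    means_ms: List[float],
    stddev_ms: float = 1.0,
    samples: int = 10000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (median of per-voter averages, average of per-sample medians)."""
    rng = random.Random(seed)
    # One row per simulated commit: a latency draw for every voter.
    draws = [[rng.gauss(m, stddev_ms) for m in means_ms] for _ in range(samples)]
    average_of_medians = statistics.fmean([statistics.median(row) for row in draws])
    per_voter_averages = [statistics.fmean(column) for column in zip(*draws)]
    median_of_averages = statistics.median(per_voter_averages)
    return median_of_averages, average_of_medians


# Hypothetical mean latencies from a candidate leader to five voters.
print(compare_median_and_average([0.0, 10.0, 40.0, 80.0, 200.0]))
```

When the distributions barely overlap, the same voter supplies the median in almost every sample, so the two quantities nearly coincide. With heavily overlapping distributions the gap would widen, which is the case the text excludes.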
Hence, the average operation latency of single-group transactions can be estimated as $\mathrm{rtt}^{(i)}_{c,\ell} + q^{(i)}_\ell$, where $q^{(i)}_\ell = \mathrm{median}_{v \in V} \{\mathrm{rtt}^{(i)}_{\ell,v}\}$.

The computation for multi-group transactions is made complex by the fact that every multi-group transaction involves a different set of groups. Fortunately, our decision to optimize per database (rather than per group) considerably simplifies the calculation. First, since we are looking for the best single location $\lambda^{(i)}$ for all group leaders in the database, we only need to consider assignments which result in the same placement for all group leaders, and the latency between these collocated leaders is effectively negligible. Second, since all groups in a database usually share the same configuration, the set of latency distributions between the leaders and their respective voters is the same for all group leaders in the database.

To leverage this, let us recap the flow of multi-group transactions. The client sends a message to one of the group leaders and waits for a response. The average roundtrip latency between the client and the leader is simply $\mathrm{rtt}^{(i)}_{c,\ell}$. The leader (coordinator) then contacts other leaders (participants). Since all leaders are in the same location this latency is negligible. Each participant leader then sends a message to its voters and waits for a majority of responses, which takes $q^{(i)}_\ell$. Then, participant leaders contact the coordinator leader. Finally, the coordinator leader commits the transaction by sending it to the voters of its group and waits for a majority of responses, which again takes $q^{(i)}_\ell$, on average. Overall, the average latency of a multi-group commit can be estimated as $\mathrm{rtt}^{(i)}_{c,\ell} + 2 q^{(i)}_\ell$.

To summarize, the score of a candidate leader location $\ell \in V$ is given by Equation 1. Both $n^{(i)}$ and the average $\mathrm{rtt}^{(i)}$ latencies are measured for the last observed ($i$-th) time interval and, as presented thus far, used to predict $\lambda^{(i+1)}$. Observe that there is a tradeoff between the chosen time interval length and the accuracy of prediction.
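The per-operation latency estimates derived above combine into a single score per candidate leader. The Python sketch below is one possible reading of that computation, not the production implementation. Its assumptions: latencies are stored in a nested dictionary `rtt[a][b]` that includes a zero entry from each location to itself, the quorum latency is the k-th smallest leader-to-voter latency with k a simple majority (this equals the median for an odd number of voters), and the operation-type keys are names invented for the example.

```python
import math
from typing import Dict, Iterable

Rtt = Dict[str, Dict[str, float]]
Counts = Dict[str, Dict[str, int]]  # counts[client][operation type]


def nearest(client: str, replicas: Iterable[str], rtt: Rtt) -> str:
    """Replica with the lowest round-trip time to the client."""
    return min(sorted(replicas), key=lambda r: rtt[client][r])


def quorum_latency(leader: str, voters: Iterable[str], rtt: Rtt) -> float:
    """Latency until a majority of voters (leader included) has answered."""
    latencies = sorted(rtt[leader][v] for v in voters)
    k = math.ceil((len(latencies) + 1) / 2)
    return latencies[k - 1]


def score(leader: str, replicas, voters, counts: Counts, rtt: Rtt) -> float:
    """Unnormalized weighted latency for one candidate leader location."""
    q = quorum_latency(leader, voters, rtt)
    total = 0.0
    for client, n in counts.items():
        near = nearest(client, replicas, rtt)
        to_near = rtt[client][near]
        to_leader = rtt[client][leader]
        total += n.get("weak_read", 0) * to_near
        total += n.get("bounded_read", 0) * (to_near + rtt[near][leader])
        total += n.get("strong_read", 0) * to_leader
        total += n.get("single_group_txn", 0) * (to_leader + q)
        total += n.get("multi_group_txn", 0) * (to_leader + 2 * q)
    return total


def best_leader(replicas, voters, counts: Counts, rtt: Rtt) -> str:
    """Candidate in `voters` with the minimum score (ties broken by name)."""
    return min(sorted(voters), key=lambda l: score(l, replicas, voters, counts, rtt))
```

Because every candidate is scored against every client cluster, the cost of one evaluation grows with the number of voters times the number of client clusters, which stays small when the search is restricted to one location per database.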
Choosing a short interval (e.g., one minute) makes our solution very sensitive to workload spikes, deteriorating prediction accuracy. With long intervals (e.g., one day) the prediction may be more accurate, yet it may average out potentially interesting workload changes (such as diurnal patterns). Instead of attempting to pick the best time interval length (which may vary across different databases), in Equation 2 we introduce a decay parameter $\tau$ and compute the score based on multiple past intervals (and not just the $i$-th interval), weighting them according to recency using an exponential moving average (for simplicity we again omit the denominator of the weighted average).

$$\begin{aligned} \mathrm{score}^{(i)}(\ell, R, q^{(i)}_\ell) = \sum_{c \in C} \Big[ \; & n^{(i)}_{\text{weak read},c} \cdot \mathrm{rtt}^{(i)}_{c,\mathrm{nearest}(c,R)} \\ + \; & n^{(i)}_{\text{bounded read},c} \cdot \big(\mathrm{rtt}^{(i)}_{c,\mathrm{nearest}(c,R)} + \mathrm{rtt}^{(i)}_{\mathrm{nearest}(c,R),\ell}\big) \\ + \; & n^{(i)}_{\text{strong read},c} \cdot \mathrm{rtt}^{(i)}_{c,\ell} \\ + \; & n^{(i)}_{\text{single-group transaction},c} \cdot \big(\mathrm{rtt}^{(i)}_{c,\ell} + q^{(i)}_\ell\big) \\ + \; & n^{(i)}_{\text{multi-group transaction},c} \cdot \big(\mathrm{rtt}^{(i)}_{c,\ell} + 2 q^{(i)}_\ell\big) \Big] \end{aligned} \tag{1}$$

$$\begin{aligned} \mathrm{agg\_score}^{(i)}(\ell, R, q^{(i)}_\ell) &= \frac{1}{\tau} \, \mathrm{agg\_score}^{(i-1)}(\ell, R, q^{(i-1)}_\ell) + \mathrm{score}^{(i)}(\ell, R, q^{(i)}_\ell) \\ \lambda^{(i)} &= \arg\min_{\ell \in V} \{\mathrm{agg\_score}^{(i)}(\ell, R, q^{(i)}_\ell)\} \end{aligned} \tag{2}$$

Interval $i = 1$ is the first interval considered for our analysis and $\mathrm{agg\_score}^{(0)}(\cdot,\cdot,\cdot) = 0$.

Note that Equation 1 can be generalized to account for multiple replication policies (configurations) per database. However, a straightforward extension is exponential in the number of configurations, as it has to account for multi-group transactions involving every possible subset of configurations, and every possible leader placement in each configuration. The design of a more efficient method is an interesting topic for future research.

3.1 Optimizing Tail Latency

For some users, optimizing tail latency is more important than optimizing for the mean. Although currently our system optimizes for average latency, in the future we plan to extend it to allow database owners to specify the desired percentile and optimize for that percentile when determining the best configuration for the database.
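Before turning to tail latency in detail, the recency-weighted aggregation of Equation 2 can be sketched in Python. The sketch assumes that the previous aggregate is multiplied by 1/τ with τ > 1 before the newest score is added, and that the aggregate before the first interval is zero; the function names, the per-interval score dictionaries, and the numbers are illustrative.

```python
from typing import Dict, List


def aggregate_score(interval_scores: List[float], tau: float) -> float:
    """Decayed sum of per-interval scores, oldest first.

    Assumes agg(0) = 0 and agg(i) = agg(i-1) / tau + score(i).
    """
    agg = 0.0
    for s in interval_scores:
        agg = agg / tau + s
    return agg


def pick_leader(history: List[Dict[str, float]], tau: float) -> str:
    """Candidate with the lowest decayed aggregate over all intervals.

    `history[i][leader]` is the score of `leader` in interval i (oldest first).
    """
    candidates = sorted(history[-1])
    return min(
        candidates,
        key=lambda l: aggregate_score([h[l] for h in history], tau),
    )


# Hypothetical scores: "eu" looks best only in the most recent interval.
history = [
    {"us": 10.0, "eu": 100.0},
    {"us": 10.0, "eu": 100.0},
    {"us": 60.0, "eu": 20.0},
]
print(pick_leader(history, tau=2.0), pick_leader(history[-1:], tau=2.0))
```

With the full history the short-lived change in the last interval is not enough to move the leader, whereas looking at the last interval alone would move it. A larger τ forgets the past faster and makes the choice more reactive.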
When considering tail latency, we can no longer use the nice linearity properties we leveraged so far. Below we show how to extend the score calculation in Equation 1. As input, instead of the average roundtrip-time latencies, we now need to know the roundtrip-time latency distribution $H_{a,b}$ between each pair of locations $a$ and $b$. We assume that these distributions are independent. For simplicity, assume that latencies are discretized as multiples of 1ms.

When computing the latency of each operation type, instead of summing up averages we should compute the distribution of the sum of random variables. As an example, consider the simple case of a bounded read, which travels from a client $c$ to the closest replica $\mathrm{nearest}(c, R)$, then from $\mathrm{nearest}(c, R)$ to the leader $\ell$ and back all the way to the client. In order to find the latency distribution of this operation we perform a discrete convolution $H_{c,\mathrm{nearest}(c,R)} * H_{\mathrm{nearest}(c,R),\ell}$, as follows:

$$\begin{aligned} \Pr\big(t^{(i)}_{\text{bounded read},c}(\ell) = x\big) &= \Pr\big(\mathrm{rtt}_{c,\mathrm{nearest}(c,R)} + \mathrm{rtt}_{\mathrm{nearest}(c,R),\ell} = x\big) \\ &= \sum_{k=m}^{x} \Pr\big(\mathrm{rtt}_{c,\mathrm{nearest}(c,R)} = k, \; \mathrm{rtt}_{\mathrm{nearest}(c,R),\ell} = x - k\big) \\ &= \sum_{k=m}^{x} \Pr\big(\mathrm{rtt}_{c,\mathrm{nearest}(c,R)} = k\big) \cdot \Pr\big(\mathrm{rtt}_{\mathrm{nearest}(c,R),\ell} = x - k\big), \end{aligned}$$

where $m$ denotes the minimum possible value of $t^{(i)}_{\text{bounded read},c}(\ell)$ and $\mathrm{rtt}$ is the random variable corresponding to the latency (rather than the average latency). Once the distribution of the sum has been computed, the required percentile can be taken from this distribution.

Computing the distribution of the quorum latency is more complex. The simplest numeric method is to perform a Monte Carlo simulation, repeatedly sampling the distributions $H_{\ell,v}$ for $v \in V$ and computing the median latency each time. For an analytical solution, observe that the leader needs to collect $\mathit{majority} - 1$ responses from other servers, where $\mathit{majority} = \lceil (|V| + 1)/2 \rceil$, and assume that the leader's own response arrives faster than any other response. The CDF of the maximum response time from any set of read-write replicas is simply the product of the CDFs of response time for the individual replicas. For example, for 3 read-write replicas $\ell$, $v$ and $w$, where $\ell$ is the candidate leader:

$$\Pr\big(\max(\mathrm{rtt}_{\ell,v}, \mathrm{rtt}_{\ell,w}) \le x\big) = \Pr\big(\mathrm{rtt}_{\ell,v} \le x, \; \mathrm{rtt}_{\ell,w} \le x\big) = \Pr\big(\mathrm{rtt}_{\ell,v} \le x\big) \cdot \Pr\big(\mathrm{rtt}_{\ell,w} \le x\big)$$

We can therefore construct the CDF of maximum response time for every subset of the read-write replicas.
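The two building blocks described so far, the discrete convolution and the product of CDFs, can be sketched in Python as follows. The representation is an assumption of this sketch: each distribution is a dictionary mapping a latency in whole milliseconds to its probability, and the example histograms are made up.

```python
from typing import Dict, Iterable

Hist = Dict[int, float]  # latency in ms -> probability


def convolve(h1: Hist, h2: Hist) -> Hist:
    """Distribution of the sum of two independent discretized latencies."""
    out: Hist = {}
    for a, pa in h1.items():
        for b, pb in h2.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def percentile(h: Hist, p: float) -> int:
    """Smallest latency x with Pr(latency <= x) >= p, for 0 < p <= 1."""
    cumulative = 0.0
    for x in sorted(h):
        cumulative += h[x]
        if cumulative >= p - 1e-12:  # tolerate floating-point rounding
            return x
    return max(h)


def cdf(h: Hist, x: int) -> float:
    """Pr(latency <= x)."""
    return sum(p for k, p in h.items() if k <= x)


def max_cdf(hists: Iterable[Hist], x: int) -> float:
    """Pr(all of the independent latencies are <= x): product of the CDFs."""
    product = 1.0
    for h in hists:
        product *= cdf(h, x)
    return product


# Hypothetical client-to-replica and replica-to-leader distributions.
client_to_replica = {10: 0.5, 20: 0.5}
replica_to_leader = {1: 0.5, 3: 0.5}
bounded_read = convolve(client_to_replica, replica_to_leader)
print(percentile(bounded_read, 0.5), percentile(bounded_read, 0.99))
```

The nested loop costs time proportional to the product of the two support sizes, which is acceptable for coarse 1 ms buckets; a longer chain of hops is handled by convolving one hop at a time.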
From the per-subset CDFs of the maximum response time, using the inclusion-exclusion principle [5], we can compute the probability of the event that at least one subset of the read-write replicas, of cardinality $\mathit{majority} - 1$, has maximum response latency at most $x$, for each $x$. But this event is equivalent to the event that the quorum's response time is at most $x$, hence it gives us the CDF of the quorum response time. Continuing our example for 3 read-write replicas, we get:

$$\Pr\big(q^{(i)}_\ell \le x\big) = \Pr\big(\mathrm{rtt}_{\ell,v} \le x\big) + \Pr\big(\mathrm{rtt}_{\ell,w} \le x\big) - \Pr\big(\max(\mathrm{rtt}_{\ell,v}, \mathrm{rtt}_{\ell,w}) \le x\big)$$

4. TIER 2: LEADER AND REPLICA ROLES

Our tier-1 optimization algorithm, described in the previous section, optimizes leader placement while keeping $V$ and $R$ fixed. In this section, we introduce our tier-2 optimization algorithm that determines the best voter locations $V$ from $R$ as well as the best leader location $\ell$ from $V$. This algorithm does not modify $R$ (this is the topic of Section 5). As we explain next, for reasons of efficiency we do not directly use the method described in Section 3, but rather the optimization objective in Equation 2.

In order to evaluate and compare different configurations we must take into account the best leader location possible with the configuration. Given a configuration with $|R|$ replica locations, a brute-force approach (shown in Algorithm 1) is to enumerate $\binom{|R|}{|V|}$ configurations, corresponding to different possible subsets of $\mathit{num\_voters} = |V|$ read-write replicas, and for each one find the optimal leader using the algorithm given in Section 3. Recall that our tier-1 algorithm evaluates every read-write replica as a potential leader by considering the resulting cost for every client cluster. For a typical database in the production system considered in Section 6, this algorithm would result in more than 26 million computations per database and time interval.

Algorithm 1: Brute-force algorithm for tier-2.
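1: procedure tier-2-brute-force($R$, $\mathit{num\_voters}$)
2:   for each set $V \subseteq R$ with $|V| = \mathit{num\_voters}$
3:     for each replica $\ell \in V$
4:       $q^{(i)}_\ell = \mathrm{median}_{v \in V} \{\mathrm{rtt}^{(i)}_{\ell,v}\}$
5:       $\mathit{score}_\ell \leftarrow \mathrm{agg\_score}^{(i)}(\ell, R, q^{(i)}_\ell)$ (Equation 2)
6:     $\lambda_V \leftarrow \arg\min_{\ell \in V} \{\mathit{score}_\ell\}$
7:     $\mathit{score}_V \leftarrow \mathrm{agg\_score}^{(i)}(\lambda_V, R, q^{(i)}_{\lambda_V})$
8:   $V_{opt} = \arg\min_V \{\mathit{score}_V\}$
9:   return $(\lambda_{V_{opt}}, V_{opt})$ // Optimal leader and quorum

We propose a much more efficient alternative (Algorithm 2): instead of picking read-write replicas first and then the leader among them, we reverse the decision order and eliminate configurations that are clearly sub-optimal due to their poor leader score. More precisely, for every candidate leader $\ell$, out of all the replica locations $R$ (not just read-write), we find the $k$-th smallest rtt to other replicas, where $k = \lceil (\mathit{num\_voters} + 1)/2 \rceil$, and use it to calculate $\mathit{score}^{(i)}_\ell$. We then choose the leader $\lambda^{(i)}$ for which $\mathit{score}^{(i)}_\ell$ is minimized. Finally, we compute the set of voters for leader $\lambda^{(i)}$ by picking the $\mathit{num\_voters}$ replicas with minimum rtt from the leader (in fact, we could take just the $k$ fastest replicas and pick the remaining $\mathit{num\_voters} - k$ replicas arbitrarily, since quorum latency is determined by the fastest majority of votes).

Algorithm 2: Efficient algorithm for tier-2.

1: procedure tier-2-efficient($R$, $\mathit{num\_voters}$)
2:   for each replica $\ell \in R$
3:     $q^{(i)}_\ell \leftarrow \lceil (\mathit{num\_voters} + 1)/2 \rceil$-th smallest $\mathrm{rtt}^{(i)}_{\ell,r}$, $r \in R$
4:     $\mathit{score}_\ell \leftarrow \mathrm{agg\_score}^{(i)}(\ell, R, q^{(i)}_\ell)$ (Equation 2)
5:   $\lambda = \arg\min_{\ell \in R} \{\mathit{score}_\ell\}$
6:   $\mathit{max\_voter\_rtt} \leftarrow \mathit{num\_voters}$-th smallest $\mathrm{rtt}_{\lambda,r}$, $r \in R$
7:   $U_\lambda \leftarrow$ k-closest($\lambda$, $\mathit{num\_voters}$, $\mathit{max\_voter\_rtt}$, $R$)
8:   return $(\lambda, U_\lambda)$
9: procedure k-closest($\ell$, $\mathit{num\_voters}$, $\mathit{max\_latency}$, $R$)
10:   $P_< \leftarrow \{r \in R \mid \mathrm{rtt}_{\ell,r} < \mathit{max\_latency}\}$
11:   $P_= \leftarrow \{r \in R \mid \mathrm{rtt}_{\ell,r} = \mathit{max\_latency}\}$
12:   return $P_< \cup \{(\mathit{num\_voters} - |P_<|) \text{ elements from } P_=\}$

To see why Algorithm 2 returns the optimal solution, let us consider the leader $\lambda_{V_{opt}}$ and set of voters $V_{opt}$ returned by Algorithm 1. Algorithm 2 evaluates $\lambda_{V_{opt}}$ as candidate leader (line 2).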
Optimality follows from the fact that Algorithm 2 chooses voters for $\lambda_{V_{opt}}$ from a larger set of candidates (since $V_{opt} \subseteq R$) and therefore quorum latency for $\lambda_{V_{opt}}$ is necessarily smaller or equal in Algorithm 2 (line 3) compared to Algorithm 1 (line 4).

Complexity of Algorithm 2. It takes $O(|R|)$ to consider every replica as candidate leader and, for each candidate, another $O(|R|)$ to find the $k$-th smallest rtt to other replicas (using a worst-case linear time selection algorithm). Invocation of k-closest in the last step of the algorithm takes $O(|R|)$, overall yielding $O(|R|^2)$ complexity, clearly better compared to the exponential complexity of Algorithm 1. For a typical database in our storage system, Algorithm 2 requires roughly 2.8 thousand computations per database and time interval, which is 4 orders of magnitude less than brute force.

Failure Diversity. Since progress in quorum-based replicated systems is usually guaranteed only as long as a quorum of read-write replicas is available and can communicate in a timely manner, read-write replicas must be located in different failure domains. To account for this fact, we slightly modify our algorithm as follows: instead of choosing the $k$-th smallest rtt for each leader candidate out of a set of $|R|$ replicas, we pre-process the set for each candidate leader by bucketing replicas according to the different failure domains and choosing the replica with smallest rtt from the candidate leader as a representative replica from each domain (bucket), filtering out the remaining replicas in each domain. We then select the $k$-th smallest rtt from the reduced set of replicas. The linear time pre-processing does not increase the complexity of our algorithm and may potentially speed up the execution of k-closest. Note that in practice there may be multiple different diversity constraints that one may want to consider, and further adjustments may be required. Our system recommends an alternative set of replicas for a given database preserving the failure diversity level currently met by the database configuration. In the future, we plan to offer clients several alternative configurations trading off increased fault tolerance and operation latency.

5. TIER 3: REPLICA LOCATIONS, ROLES, AND LEADER

In this section we expand the scope of our optimization and present two efficient algorithms to select the best set of replicas $R$ from the possible locations $S$, a set of voters $V \subseteq R$ and the best leader from $V$ (one of the algorithms makes direct use of Algorithm 2). Unlike the algorithms in Sections 3 and 4, which find the optimal solution, the algorithms in this section are heuristics and we compare the achieved solutions with the optimum in Section 6.4. The algorithms in this section take the desired number of replicas as a parameter. How to determine this number is further explored in Section 6.4.

Similarly to Section 4, a brute force algorithm is straightforward, but exponential in complexity and highly impractical: such an algorithm could consider every possible subset of replicas $R \subseteq S$, execute tier-2-efficient($R$, $|V|$), and choose the best combination of leader, voters and replicas.

Recall that in Section 4 we reduced the search space by finding the best score for every possible leader in linear time. This approach yielded an optimal solution because the locations of all replicas were fixed and thus, for each client, $\mathrm{nearest}(c, R)$ and hence also $\mathrm{rtt}_{c,\mathrm{nearest}(c,R)}$ did not depend on the choice of the set of voters or the leader. Our goal was to minimize the rest of the expression in Equation 1. Here, on the other hand, optimizing the function $\mathrm{nearest}$ is part of the problem. Given the location of a candidate leader $\ell$, we cannot, for example, greedily choose the closest replicas to $\ell$ to be voters since it may be better to trade off quorum latency for decreasing the latency between the clients and their closest replicas (e.g., for a read-heavy database). Our problem is a variant of non-metric facility location, which is known to be NP-Complete.

We present two efficient heuristics for choosing replica locations; both make use of Algorithm 3, a variant of the weighted K-Means algorithm.
Algorithm 3 assigns a weight $w_c$ to each client cluster $c$ based on the total number of operations performed by $c$:

$$w_c = \sum_{\alpha} n^{(i)}_{\alpha,c} = n^{(i)}_{\text{weak read},c} + n^{(i)}_{\text{bounded read},c} + n^{(i)}_{\text{strong read},c} + n^{(i)}_{\text{single-group transaction},c} + n^{(i)}_{\text{multi-group transaction},c}$$

The goal of Algorithm 3 is to find a set of servers $G$ ($G \subseteq S$) such that $cost(G)$ is minimized:

$$cost(G) = \sum_{c \in C} w_c \cdot rtt^{(i)}_{c,\,nearest(c,G)} \qquad (3)$$

Algorithm 3 Weighted K-Means for choosing replica locations.
 1: // L_fixed: set of fixed replica locations, which can't be moved
 2: // num_replicas: total number of replicas to be placed
 3: procedure weighted-k-means(L_fixed, num_replicas)
 4:   // pick initial centroids
 5:   G ← L_fixed
 6:   sort all client clusters c ∈ C by descending w_c
 7:   while |G| < num_replicas and more client clusters remain
 8:     c ← next client cluster in C
 9:     if nearest(c, S) ∉ G then
10:       add nearest(c, S) to G
11:   new_cost ← cost(G)
12:   repeat
13:     prev_cost ← new_cost
14:     // cluster clients according to nearest centroid
15:     ∀ g ∈ G let C_g ← {c | g = nearest(c, G)}
16:     // attempt to adjust centroids
17:     for each g ∈ G \ L_fixed
18:       g' ← v ∈ S s.t. Σ_{c ∈ C_g} w_c · rtt^(i)_{c,v} is minimized
19:       update centroid g to g'
20:     new_cost ← cost(G)
21:   until |new_cost − prev_cost| < threshold
22:   return G

Algorithm 3 gets an initial set of replica locations (centroids) L_fixed and the total desired number of locations num_replicas as parameters. First, we choose initial locations for the remaining centroids (lines 6-10) by placing them close to the heaviest client clusters (according to $w_c$). Each centroid location $g$ defines a set of client clusters $C_g$ for which $g$ is the nearest centroid (line 15). The remainder of Algorithm 3 tries to adjust the position of each centroid $g$ in a way that minimizes the cost (weighted round-trip time) for clients in $C_g$. Note that the centroids in L_fixed are not moved. The algorithm terminates, returning the set of centroids $G$, once there is no sufficient improvement in the total cost, i.e., $cost(G)$.
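The pseudo-code of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration, not the production implementation: the dictionary layout `rtt[client_cluster][site]`, the function names, and the default `threshold` are assumptions made for the example. The guard that keeps two centroids from landing on the same site is also an addition of this sketch, and at least one centroid is assumed to exist after seeding.

```python
from typing import Dict, Iterable, List

Rtt = Dict[str, Dict[str, float]]  # hypothetical layout: rtt[client_cluster][site]


def cost(centroids: Iterable[str], weights: Dict[str, float], rtt: Rtt) -> float:
    """Equation 3: weighted rtt from each client cluster to its nearest centroid."""
    cs = list(centroids)
    return sum(w * min(rtt[c][g] for g in cs) for c, w in weights.items())


def weighted_k_means(fixed: Iterable[str], num_replicas: int, sites: List[str],
                     weights: Dict[str, float], rtt: Rtt,
                     threshold: float = 1e-9) -> List[str]:
    pinned = set(fixed)
    G = list(pinned)
    # Lines 6-10: seed the remaining centroids next to the heaviest client clusters.
    for c in sorted(weights, key=weights.get, reverse=True):
        if len(G) >= num_replicas:
            break
        site = min(sites, key=lambda v: rtt[c][v])
        if site not in G:
            G.append(site)
    new_cost = cost(G, weights, rtt)
    while True:
        prev_cost = new_cost
        # Line 15: assign every client cluster to its nearest centroid.
        members: Dict[str, List[str]] = {g: [] for g in G}
        for c in weights:
            members[min(G, key=lambda g: rtt[c][g])].append(c)
        # Lines 17-19: move each non-pinned centroid to the site that is
        # cheapest for the client clusters currently assigned to it.
        for idx, g in enumerate(list(G)):
            if g in pinned or not members[g]:
                continue
            best = min(sites,
                       key=lambda v: sum(weights[c] * rtt[c][v] for c in members[g]))
            if best not in G:  # assumption of this sketch: one centroid per site
                G[idx] = best
        new_cost = cost(G, weights, rtt)
        # Line 21: stop once the improvement is below the threshold.
        if abs(prev_cost - new_cost) < threshold:
            return G


# Toy input (made-up numbers): three candidate sites, three client clusters.
sites = ["us-east", "us-west", "eu"]
weights = {"ny": 10.0, "sf": 5.0, "paris": 1.0}
rtt = {
    "ny": {"us-east": 10, "us-west": 70, "eu": 80},
    "sf": {"us-east": 70, "us-west": 10, "eu": 150},
    "paris": {"us-east": 80, "us-west": 150, "eu": 10},
}
free = weighted_k_means([], 2, sites, weights, rtt)
pinned_eu = weighted_k_means(["eu"], 2, sites, weights, rtt)
```

With no pinned locations, the two replicas in this toy input end up next to the two heaviest client clusters. Pinning `"eu"` leaves only one centroid free to move, which is the situation Algorithm 3 faces when a set of fixed locations is passed in.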
Recall that our goal is not only to find good replica locations, but also to find a quorum and a leader. Our two new algorithms differ in the order in which they perform these tasks. Algorithm 4 first places all replicas in strategic locations using Algorithm 3 and then invokes Algorithm 2 to determine the leader and voters from within the replicas. Algorithm 5, on the other hand, first sets the leader and a quorum of voters and then invokes Algorithm 3 to place the remaining replicas close to the clients. More specifically, we go over all possible leader locations in $S$ and find the best quorum for each leader. This quorum is then used as centroids during the invocation of Algorithm 3, but these centroids are pinned down and not moved by the algorithm.

Algorithm 4 Algorithm KQ.
 1: procedure KMeans-Quorum(num_replicas, num_voters)
 2:   G ← weighted-k-means(∅, num_replicas)
 3:   (λ, V) ← tier-2-efficient(G, num_voters)
 4:   // Return the leader, set of voters and set of replicas
 5:   return (λ, V, G)

Algorithm 5 Algorithm QK.
 1: procedure Quorum-KMeans(num_replicas, num_voters)
 2:   majority ← ⌈(num_voters + 1) / 2⌉
 3:   minority ← num_voters − majority
 4:   for each replica ℓ ∈ S
 5:     q_ℓ^(i) ← majority-th smallest rtt^(i)_{ℓ,s}, s ∈ S
 6:     Q_ℓ ← k-closest(ℓ, majority, q_ℓ^(i), S)
 7:     G_ℓ ← weighted-k-means(Q_ℓ, num_replicas)
 8:     score_ℓ ← agg_score^(i)(ℓ, G_ℓ, q_ℓ^(i)) (Equation 2)
 9:   λ = arg min_{ℓ ∈ S} {score_ℓ}
10:   O ← any minority locations from G_λ \ Q_λ
11:   // Return the leader, set of voters and set of replicas
12:   return (λ, Q_λ ∪ O, G_λ)

Note that in Algorithm 5, unlike in Algorithm 4, we know both the leader and the quorum latency when invoking Algorithm 3, and therefore in line 18 of Algorithm 3 we actually use the cost function given in Equation 1 (with the change that the summation is done only over clients in $C_g$) instead of the simplified cost model given in Equation 3. For simplicity, this is omitted from the pseudo-code.

6. EVALUATION

In this section we describe the evaluation of our optimization framework with one of Google's large-scale distributed storage systems. This particular system supports the five representative operation types described in Section 2, which made it the perfect candidate for optimization. We implemented a system consisting of three tools: a data collection pipeline, an optimizer, and a simulator.
The data collection pipeline fetches relevant inputs on the number and latencies of relevant operations from Google's monitoring tools, as well as relevant data from the database schemas such as the network QoS class used by each database, and then prepares it for consumption by the optimizer. The data is broken down into several non-overlapping time intervals, within each interval by database, and within each database by client cluster and by operation type. The optimizer generates scores for each one of the requested optimization tiers on each one of the time intervals 1..i reported by the collection pipeline, using an exponential moving average as demonstrated in Section 3 (Equation 2) with τ = 2 as the decay parameter. It then gives a placement recommendation for each interval based on the previous ones. Finally, the simulator compares the optimizer's recommended placement strategy for each interval with other reasonable placement heuristics as well as with the optimal placement for the time interval.

Our experiments were carried out on machines with a 12-core 3.50GHz Xeon(R) CPU and 32 GB RAM. The running times of our tools for tiers 1 and 2 for 48 time intervals on all production databases combined were under 1.5 minutes. In what follows we present experiments dedicated to each of the optimization tiers.

6.1 Leader Placement

In this section we show experimental results demonstrating the benefit of optimizing leader placement for the vast majority of databases in our storage system.

Speedup potential. In the following experiment, we scored the current configuration of each database and compared it with the configuration proposed by our optimizer. We analyzed the average operation latency of databases during one typical workday partitioned into 48 non-overlapping intervals of 30 minutes each.
In our storage system, a database administrator can specify an optional preferred leader location, and the storage system picks a location close to it (it may not always be possible to use the preferred location due to lack of available resources or ongoing maintenance). When assigning a score to the current database configuration, we need to distinguish databases with and without the preferred leader setting. To each database with the preferred leader set to a location $\ell \in V$ we assign the score $score^{(i)} = score^{(i)}(\ell, R, q_\ell^{(i)})$ at each interval $i = 1, 2, \ldots, 48$. For databases without a specified preferred leader, the group leaders are assumed to be spread uniformly across $V$ (according to our observations, this assumption closely models real deployments in our system). Accordingly, the score of such a database configuration is the average of the scores $score^{(i)}(\ell, R, q_\ell^{(i)})$ across $V$:

$$score^{(i)} = \frac{1}{|V|} \sum_{\ell \in V} score^{(i)}(\ell, R, q_\ell^{(i)}) \qquad (4)$$

For each interval $i = 1, 2, \ldots, 48$, we calculate $1 - score^{(i)}(\lambda^{(i-1)}, R, q^{(i-1)}_{\lambda^{(i-1)}}) / score^{(i)}$, the potential reduction in latency when following the placement recommendation of our optimizer, which places the leader in interval $i$ based on the preceding intervals $1 \ldots i-1$. For each database, we calculate the average latency reduction over all the values of $i$.

Figures 1(a) and 1(b) demonstrate the effectiveness of optimizing leader placement for databases with and without an existing preferred leader setting, respectively. Observe that there is a significant divide, in terms of latency reduction when following our recommended leader placement, between databases that manually set the preferred leader and those that do not. Typically, administrators who choose to set the preferred leader set it in a way that matches the recommendation of our optimizer; this can be seen in Figure 1(a), which shows that over 75% of the databases of this kind are found in the first bucket [0, 1%], i.e., they cannot further benefit from our recommendation.
This serves as a validation that our model matches the intention of database administrators in all these cases. We see, however, that for some databases the manual setting is sub-optimal, as evidenced by the existence of 10% outliers, the score of which is off by at least 10% from the optimum. Our recommendations can help speed up such outliers. We found, however, that only 25% of all databases specify a preferred leader, and with more new databases created, this percentage diminishes further. This surprising fact further motivates the need for automatic optimizations. For the remaining 75% of the databases our tool provides significant latency improvements, as shown in Figure 1(b). The average operation latency in many of the cases can be reduced by tens of percent. Over 17% of such databases can have their average operation latency cut by more than half by following our placement recommendation. For some databases, latency is reduced by more than 90%.

Figure 1: Histogram of latency reduction for databases with and without a preferred leader setting. Bins on the x-axis denote % latency reduction compared to current placement. The height of each bin (y-axis) is the percent of databases (with or without a preferred leader setting) for which the corresponding reduction in latency was measured.

Optimizer output and recommendation oscillation. Figure 2 shows a sample output of our optimizer, which outputs the best leader location every 30 minutes and the latency overhead for alternative locations, compared to the best one (for brevity, we show only 2 additional locations). Notice the oscillation in recommendations between clusters 1 and 3, caused both by their similar scores and by workload changes between 00:30-02:00 and 03:00-05:00. Our algorithm mitigates minor workload spikes by using a decay parameter τ, which counters the spikes with historic scores. A second level of defence should be deployed which considers the costs and benefits of moving the leader to a different location. For example, moving the leader may not be worthwhile if the optimizer predicts a 2% latency reduction.

22:30: opt 1, 2nd best 2 = 4%, 3rd best 3 = 9.1%
23:00: opt 1, 2nd best 2 = 5.77%, 3rd best 3 = 9.77%
23:30: opt 1, 2nd best 3 = 5.2%, 3rd best 3 = 23.53%
00:00: opt 1, 2nd best 2 = 5.24%, 3rd best 3 = 7.68%
00:30: opt 3, 2nd best 1 = 5.59%, 3rd best 2 = 13.07%
...
02:00: opt 3, 2nd best 1 = 14.32%, 3rd best 2 = 23.42%
02:30: opt 1, 2nd best 3 = 7.38%, 3rd best 2 = 9.16%
03:00: opt 3, 2nd best 1 = 22.6%, 3rd best 2 = 33.09%
...
05:00: opt 3, 2nd best 1 = 11.49%, 3rd best 2 = 23.46%
05:30: opt 3, 2nd best 1 = 3.3%, 3rd best 2 = 15.08%
06:00: opt 1, 2nd best 3 = 0.92%, 3rd best 2 = 11.73%

Figure 2: Sample output of the optimizer.

Comparison with other placement strategies. We use our simulator to compare four placement policies using historical storage activity data from one typical day, discretized into 48 intervals of 30 minutes each. For $i = 2, 3, 4, \ldots, 48$ and each one of the strategies, the simulator sets the leader at time interval $i$ based on the prediction provided by the placement strategy on interval $i-1$, and assigns a score $s^{(i)}$ to that prediction based on the actual workload data for interval $i$. We considered four strategies for each database:

(optimized) placing the leader at the location predicted by the optimizer (with decay τ = 2) using data from the intervals preceding $i$, whose score on interval $i$ is $score^{(i)}(\lambda^{(i-1)}, R, q^{(i-1)}_{\lambda^{(i-1)}})$,

(closest-to-writes) placing the leader statically in a cluster $\ell$ from which most of the transactions in interval $i = 0$ originated, with score $score^{(i)}(\ell, R, q_\ell^{(i)})$,

(smallest-quorum) placing the leader in a cluster $\ell \in V$ where the median round-trip time $\mathrm{median}_{v \in V}\{rtt^{(i)}_{\ell,v}\}$ from the leader to the majority of voters is minimal, with score $score^{(i)}(\ell, R, q_\ell^{(i)})$,

(average) random leader location across all the groups of the database, achieving the average score as in (4).

We compare with the closest-to-writes and smallest-quorum strategies, since they are sometimes employed by database administrators when setting the preferred leader, and with the average strategy, since it reflects the performance of databases without the preferred leader setting, as explained in the previous experiment. The closest-to-writes strategy is a common heuristic also used in other systems (see Section 7).
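The replay procedure used by the simulator can be illustrated with a short sketch. It is a simplified stand-in under stated assumptions, not the simulator itself: per-interval scores are assumed to be precomputed in a hypothetical `scores[i][location]` table (lower is better), and the two example strategies below are deliberately naive substitutes for the strategies listed above (in particular, `best_of_previous` ignores the decay parameter τ).

```python
from typing import Callable, Dict, List, Sequence

IntervalScores = Dict[str, float]  # hypothetical: location -> score in one interval
Strategy = Callable[[Sequence[IntervalScores]], str]


def replay(scores: Sequence[IntervalScores], strategy: Strategy) -> List[float]:
    """Choose a leader for interval i from earlier intervals only, then
    score that choice on the actual data of interval i."""
    results = []
    for i in range(1, len(scores)):
        leader = strategy(scores[:i])      # history visible to the strategy
        results.append(scores[i][leader])  # s^(i): measured on interval i
    return results


def best_of_previous(history: Sequence[IntervalScores]) -> str:
    """Naive predictor: the best location of the most recent interval."""
    last = history[-1]
    return min(last, key=last.get)


def static_from_first(history: Sequence[IntervalScores]) -> str:
    """Static placement: decided once from the first interval, never moved."""
    first = history[0]
    return min(first, key=first.get)


# Made-up scores for three intervals and two candidate locations.
scores = [
    {"a": 10.0, "b": 20.0},
    {"a": 30.0, "b": 15.0},
    {"a": 12.0, "b": 40.0},
]
dynamic = replay(scores, best_of_previous)
static = replay(scores, static_from_first)
```

On this toy input the naive dynamic predictor chases the previous interval and is penalized whenever the workload flips back, which is the kind of oscillation that smoothing over historic scores is meant to dampen.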
Our baseline is the optimal oracle strategy, which sets the leader for interval $i$ at $\lambda^{(i)}$ by considering the $i$-th interval workload itself when determining, a posteriori, the best leader location for interval $i$. The latency overhead $s^{(i)} / score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}}) - 1$ with respect to the optimum score $score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}})$ is calculated for each strategy score $s^{(i)}$ on each interval $i \geq 2$.

Figures 3(a) and 3(b) show the latency overhead (in percent) for two databases with no manual preferred leader location. The optimized strategy perfectly predicted the optimum for both databases. In general, for all the databases, predictions were nearly perfect, with small deviations from the optimum due to sudden workload spikes. More than 90% of the operations belonging to the database in Figure 3(a) were bounded reads, which is why the closest-to-writes and the smallest-quorum policies, both of which disregard the locations of readers, underperform compared to the optimized strategy, which considers client locations and all operation types. The smallest-quorum policy is slightly better than closest-to-writes due to its choice of a well-connected replica as leader. In Figure 3(b), 38% of operations are strong reads and 60% of operations are weak reads. Once again the optimized strategy perfectly predicts the optimum and outperforms the average and the smallest-quorum strategies by a large margin (more than 60% on average); the closest-to-writes strategy is not applicable in this case, as there were virtually no transactions in the considered database.

Figure 3: Latency overhead of various placement strategies compared to the oracle optimum score $score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}})$.

6.2 Evaluation in Production

We are working directly with customers to validate our models in production. We present the results of one such experiment in Figure 4. For simplicity, in this experiment we only reconfigured the chosen database once, even though the optimizer outputs recommendations continuously. We monitored the 50-th percentile latency (solid line) as the database is reconfigured (at 16:00), causing all leaders to migrate to the location λ recommended by our optimizer (the dashed curve shows the percent of leaders in location λ). We observe a reduction of 70% in latency when the migration completes (around 16:15), after which the latency slightly increased and stabilized at 40% of its initial value, exceeding the predicted improvement by a factor of 2. Even though our model currently optimizes mean latency, it is interesting to note that in this experiment we saw a reduction of 30% in 99-th percentile latency (the 90-th percentile latency, however, did not improve).

In another experiment with a different database, we measured a 33%, 25% and 15% speedup in 50-th, 90-th and 99-th percentile latencies, respectively. Note that our tool predicted a reduction of 39.7% in average latency, which is fairly close to what was observed.

The discrepancy between the predicted and the actual reduction can be ascribed to the fact that at any given point in time the number of leaders at the different locations in $V$ is not exactly the same (though across a longer period of time, on average, it is close to uniform). We found that for the first database mentioned above, one of the locations in $V$ was taken down for maintenance at the time of the experiment (leaders were evenly spread across the remaining locations). For the second database, the predicted latency reduction was calculated under the assumption that all the leaders have an equal probability of 20% to be in any one of the 5 possible locations, but in reality about one-third of all the leaders were found in the same location. In the future we intend to measure the actual leader distribution across $V$ dynamically and incorporate it in our model.
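The effect of a skewed leader distribution on the predicted reduction can be seen in a small worked example. The sketch below generalizes the uniform average of Equation 4 to an arbitrary leader distribution. The function names, the per-location scores and the skewed shares are all made-up illustrations, not measurements from the experiments above.

```python
from typing import Dict, Optional


def expected_score(per_location: Dict[str, float],
                   leader_share: Optional[Dict[str, float]] = None) -> float:
    """Score of the current configuration as a weighted average over leader
    locations. With no shares given, leaders are assumed to be spread
    uniformly, which reduces to Equation 4."""
    if leader_share is None:
        leader_share = {loc: 1.0 / len(per_location) for loc in per_location}
    if abs(sum(leader_share.values()) - 1.0) > 1e-9:
        raise ValueError("leader shares must sum to 1")
    return sum(leader_share[loc] * per_location[loc] for loc in per_location)


def predicted_reduction(current: float, recommended: float) -> float:
    """Relative latency reduction, 1 - recommended / current."""
    return 1.0 - recommended / current


# Hypothetical per-location scores for a database with 5 voter locations.
per_location = {"A": 100.0, "B": 80.0, "C": 60.0, "D": 120.0, "E": 140.0}
recommended = min(per_location.values())  # leader placed at the best location

uniform = expected_score(per_location)  # each location hosts 20% of leaders
skewed = expected_score(
    per_location, {"A": 0.2, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.4})

print(f"uniform: {predicted_reduction(uniform, recommended):.1%}")
print(f"skewed:  {predicted_reduction(skewed, recommended):.1%}")
```

When leaders concentrate in a poorly placed location, the baseline score is worse than the uniform model assumes, so the reduction observed after migration can exceed the prediction. Concentration in a good location has the opposite effect.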
6.3 Replica Roles

Next, we evaluate our tier-2 algorithm, which determines the optimal replica types in addition to leader placement. Before conducting our experiments, we intuitively expected to find databases with workflows exhibiting the follow-the-sun phenomenon (see footnote 2). For example, we expected to see clients in the US and in Europe with intense activity during daytime and reduced activity at night, such that the overall center of activity oscillates between the US and Europe every 12 hours. We found, however, that often the traffic originating from US-based clients is greater than that originating from non-US clients even during night time in the US, and therefore the center of activity always remains in the US. This is demonstrated for one database in Figure 5, which shows that European traffic amounts to 35% of US traffic during 120 consecutive hours.

Figure 4: Production experiment with one database. The figure depicts the drop in 50-th percentile latency (solid line) along with the migration of leaders to the recommended location (dashed line). The latency base (100% mark, left y-axis) is chosen as the average latency over the 3 hours preceding the start of the experiment (1pm-4pm).

Figure 5: Europe traffic as a percentage of US traffic over 120 hours, for a single database.

Nevertheless, we discovered diurnal patterns between the US East Coast and West Coast, as shown in Figure 6, where we plot the ratio between the number of operations originating on the East Coast and the number of operations from clients on the West Coast across 48 consecutive hours for one database, overlaid with the leader locations suggested by our optimizer. Vertical lines delineate the points at which our optimizer suggested switching leader placement from a cluster on one coast to a cluster on the other coast. The reader can readily notice the correlation between ratios larger than 1 and optimizer recommendations for leader placement on the East Coast (as well as between ratios smaller than 1 and recommendations for the West Coast).
Footnote 2: Apache ZooKeeper users have a similar intuition [6].

Figure 6: Ratio of East Coast to West Coast traffic, for a single database. Vertical lines denote times at which the recommended leader placement changed from East Coast to West Coast or vice versa.

The charts in Figure 7 show the reduction in latency of tier-2 and tier-1 optimizations versus the current score, for two different databases. Figure 7(a) features a database that does not set the preferred leader and in which 98% of operations are strong reads; for this database the tier-2 optimization was considerably better than that of tier-1. The leader placement in our tier-1 optimization was chosen in Central US, whereas in tier-2 it migrated to the Pacific Northwest. This reduction in latency looks paradoxical at first, considering that the locations of the voters and quorum are only supposed to affect the latency of transactions, which are virtually nonexistent in this database. This phenomenon is readily explained by the fact that our optimization in tier-2 allows us to consider all the replicas in $R$ as potential leader candidates, instead of just the pre-determined set of read-write replicas considered by our tier-1 optimization. Indeed, the Pacific Northwest replica was initially configured as a read-only replica and thus could not function as leader, whereas in tier-2, where we can convert it to a read-write replica, it has become a legitimate candidate (and an eventual "winner"), thereby bringing about the surprising reduction in latency.

Figure 7(b) shows a different case, where both tier-1 and tier-2 optimizations suggested the same leader placement, but tier-2 chose a different quorum, due to which the average operation latency was cut by an additional 15% compared to tier-1. This database also does not set a preferred leader. About 57% of operations are strong reads and an additional 42% are multi-group transactions; the latter operations significantly benefited from a new, better connected quorum of replicas. In this case tier-2 approximately doubles the reduction in latency achieved by tier-1.
Note that in all the experiments above we first analyzed the level of failure diversity currently preserved by the database configuration and only suggested alternative configurations maintaining the same diversity level.

Figure 7: Latency reduction due to tier-1 and tier-2 optimizations across 3 days of workload data for two selected databases.

6.4 Replica Locations

In tier-3 of optimization, we experimented with the performance of the KQ and QK heuristics (see Section 5). We start by comparing their performance with that of the exhaustive search preserving the failure diversity constraints of the current database configuration. Figure 8 shows the average ratio between the score given by the optimizer to the exhaustive search and the scores of the QK and KQ heuristics, as a function of the total number of replicas, across the 12 largest (by amount of traffic) databases in our system, when the number of voters $|V|$ was fixed at 3. On the same chart, we also plot the average ratio between the score of the exhaustive search and the best of the two heuristics across the same 12 largest databases. For some databases KQ is better than QK, whereas for others QK outperforms KQ, resulting in a perhaps surprising phenomenon where the average Best(QK,KQ) score is better than both the average KQ and the average QK score. For $|R| \in \{6, 7\}$, KQ was consistently better than QK, which is why Best(QK,KQ) coincides with KQ at that point.

The performance of QK on the chart is worse on average than that of KQ because most of the considered databases are read-heavy and the number of replicas considered is relatively small, of which 2 are wasted by QK on the quorum. For such databases it is worthwhile to spread out the replicas and place them as close to most of the clients as possible, which is where KQ excels in comparison with QK. We notice that already with $|R| = 5$ replicas, the best of the two heuristics performs within a 5% margin of the optimum produced by exhaustive search, with the added benefit of being substantially faster.
For $|R| = 7$, both QK and KQ, which are polynomial (in $|R|$), generated results for all 12 databases within seconds, whereas the exponential exhaustive search took several orders of magnitude longer.

Figure 8: Score of the exhaustive search in percent of the score of KQ, QK and the best of the two heuristics (with $|V| = 3$), as a function of $|R|$.

Next, we compare KQ and QK, with the specific aim of identifying workloads where each of the two algorithms should be preferred over the other. In the following experiment, run with $|V| = 5$ and $|R| = 7$, we broke down all databases into buckets by the percentage of transactions among all operations and compared the two algorithms for databases in each bucket. Figure 9 shows a positive correlation between the percentage of transactions and the superiority of QK, which, for databases with more than 60% transactions, performs better by more than 80% compared with KQ. A second experiment, in which the databases were broken down into buckets by the percentage of weak reads, shows a strong correlation between that percentage and the superiority of KQ; the results of this experiment, which appear in Figure 10, demonstrate that for read-heavy databases the speedup of KQ versus QK can be as high as 23% on average. Similar experiments with breakdowns of databases by percentages of bounded reads and strong reads did not yield a conclusive outcome.

Figure 9: Speedup of QK vs. KQ heuristic as a function of the percentage of transactions in a database.

Figure 10: Speedup of KQ vs. QK heuristic as a function of the percentage of weak reads.

In the following experiment, we set $|V| = 3$ and let $|R|$ range between 4 and 13. We then measure the scores of both the QK and KQ heuristics using the workload from one day for one database. Figure 11 demonstrates, for each heuristic and $|R| \in \{4, 5, 6, \ldots, 13\}$, its average latency slowdown in percent versus the score obtained with 13 replicas, which equals the optimal score for any tier-3 optimization with 3 voters (obtained using an exhaustive search). We can readily see that both heuristics flatten out very soon; specifically, with $|R| = 11$, both are within a 13.5% margin from the optimal score. This demonstrates the diminishing returns of adding more replicas: initially each new replica halves the average operation latency, while adding the 12th or 13th replica barely makes any difference.

Figure 11: Slowdown of QK and KQ heuristics in percent from the optimum.

How many replicas do you need? Whereas the number of read-write replicas is usually set by an administrator to meet certain fault tolerance goals, the total number of replicas is usually more flexible. The cost of adding, moving or maintaining a replica is often significant, as it requires allocating resources, copying data, and potentially deploying other relevant services if colocation dependencies exist.
At minimum, the number of replicas should be sufficient to withstand the expected database load. But often, additional replicas are added close to the clients in order to reduce latency. Our framework can help explore the cost/benefit tradeoff of adding such replicas by examining the potential latency gains, and can determine their locations.

7. RELATED WORK

It has long been realized that distributed systems need to be dynamic, i.e., adjust their membership and other configuration parameters over time. Many storage systems [11, 18, 24] use an auxiliary coordination service such as Chubby [13] or ZooKeeper [19] to coordinate reconfiguration, while others use the system itself [23, 25]. See [22, 12] for a tutorial on different approaches for reconfiguration of replicated state machines (i.e., Paxos-like systems) and [8] for a survey on reconfiguring strongly consistent key-value stores. Far fewer works provide insights on how to determine the best storage configuration at runtime. Since LAN and WAN environments pose very different challenges, below we focus on storage systems that dynamically reconfigure in WAN.

PNUTS [15] and Megastore [10] place master/leader replicas close to the writers. Earlier works propose other heuristics, e.g., that the current leader should hand off leadership to another replica if that replica forwards more requests to the leader than it receives from elsewhere [28]. These heuristics may work well for some workloads but not for others. For example, in Section 6.1 we show that placing the leader close to the origin of the majority of writes performs poorly on our production workloads, which are mostly read dominant (and yet involve the leader). Furthermore, unlike in [28], we consider network latencies, and instead of looking at the aggregate number of requests (or just one request type, such as writes), we consider the detailed flow of each request type and perform an optimization for the entire workload.
In this work we formally state optimality criteria, and our solution achieves optimal leader placement.

Adaptive replication mechanisms in PNUTS [20] and Nomad [27] dynamically create replicas based on locally observed reads. In Nomad, for example, a replica is created at a given location when an object is read more than a certain number of times from that location, over a certain period of time, or at a certain rate. The authors of [20] state that they considered more exact methods but decided to use local heuristics, since efficiently acquiring, tracking and collecting access statistics from around the world is a complex and expensive process. In this work we leverage Google's monitoring infrastructure to dynamically and accurately track the workload of each database, as well as network latencies. We demonstrate that a solution optimizing the entire workload can be both fast and practical.

Volley [7] proposes a heuristic for placing application data across data centers while minimizing client latency as well as synchronization latency arising from data inter-dependency. The Volley algorithm does not support data replication and was not evaluated with replicated state. The authors briefly propose to model replicas as distinct data items that may have a certain amount of inter-item communication. Note, however, that with replication each client request is only sent to one of the replicas; unlike Volley, our tier-3 algorithm takes such workload partitioning into account when placing the replicas. Furthermore, unlike Volley, our cost model considers multiple types of client requests with different flows, and we compare our replica placement heuristics with the optimum achieved by an exhaustive search using production workloads.

Tuba [9] is an extension of Microsoft Azure Storage that provides a geo-replicated key-value store and automatically reconfigures its master and set of replicas based on the workload. Unlike Tuba, our algorithms do not require any changes to the storage system. Tuba uses exhaustive search to enumerate all placement options and choose the best one. It was evaluated with three storage locations using a synthetic workload. We tried exhaustive search, but it was not practical for our Google-scale storage system.
A highy optimized exhaustive search agorithm for repica pacement (Section 6.4), akin to the exhaustive search in Tuba, took more than a day to compete and was ony sighty better than our heuristic: up to 5% better for 5 repicas per group and ess than 1% for arger configurations. In contrast, our optima agorithms for choosing eader and repica roes (tiers 1 and 2) and heuristic methods for repica pacement (tier 3) terminated in ess than 2 minutes for a the databases combined. 8. CONCLUSION Athough mechanisms exist for changing the repication poicy of distributed storage systems at runtime, system administrators are usuay entrusted with determining the best configuration manuay. We deveoped a new workoaddriven optimization framework that dynamicay and automaticay determines the optima configuration for eader and quorum based systems. Our system optimizes three aspects of the configuration: 1) eader ocation, 2) roes of different servers in the repication protoco, and 3) repica ocations. We show that by just appying the first optimization tier to a arge-scae distributed storage system used internay in Googe, we can reduce the atency of 17% of the databases by more than haf, incuding some databases with a speed-up over 90%. We demonstrate that the second optimization tier further reduces atency by up to 50% in some cases. Finay, we evauated and compared different strategies for seecting repica ocations and showed that they are cose to optima. 9. REFERENCES [1] Amazon dynamo. [2] Amazon simpe. [3] Basho riak. [4] Mongo. [5] Incusion excusion principe. wiki/incusion-excusion_principe#in_probabiity, Retrieved February 10, [6] Zookeeper feature request. https: //issues.apache.org/jira/browse/zookeeper-2027, Retrieved February 10, [7] S. Agarwa et a. Voey: Automated data pacement for geo-distributed coud services. USENIX NSDI, Berkeey, CA, USA, [8] M. K. Aguiera, I. Keidar, D. Makhi, J.-P. Martin, and A. Shraer. Reconfiguring repicated atomic storage: A tutoria. Bu. 
of EATCS, 102,
[9] M. S. Ardekani and D. B. Terry. A self-configurable geo-replicated cloud storage system. USENIX OSDI, pages , Oct
[10] J. Baker et al. Megastore: Providing scalable, highly available storage for interactive services. CIDR,
[11] M. Balakrishnan, D. Malkhi, J. D. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Trans. Comput. Syst., 31(4):10,
[12] K. Birman, D. Malkhi, and R. van Renesse. Virtually synchronous methodology for dynamic service replication. Technical Report 151, MSR, Nov
[13] M. Burrows. The chubby lock service for loosely-coupled distributed systems. In OSDI, pages ,
[14] S. Bykov, A. Geller, G. Kliot, J. R. Larus, R. Pandya, and J. Thelin. Orleans: cloud computing for everyone. In ACM SOCC,
[15] B. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2), Aug
[16] J. Corbett et al. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst., 31(3), Aug
[17] R. Escriva, B. Wong, and E. G. Sirer. Hyperdex: A distributed, searchable key-value store. In ACM SIGCOMM,
[18] S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In SOSP, pages 29-43,
[19] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC,
[20] S. Kadambi et al. Where in the world is my data? PVLDB, 4(11): ,
[21] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2): ,
[22] L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a state machine. SIGACT News, 41(1):63-73, Mar
[23] J. Lorch et al. The smart way to migrate replicated stateful services. In EuroSys,
[24] J. MacCormick et al. Boxwood: Abstractions as the foundation for storage infrastructure. In OSDI,
[25] A. Shraer, B. Reed, D. Malkhi, and F. Junqueira. Dynamic reconfiguration of primary/backup clusters. USENIX ATC,
[26] M. Stonebraker and A. Weisberg. The VoltDB main memory DBMS. IEEE Data Eng. Bull., 36(2),
[27] N. Tran, M. K. Aguilera, and M. Balakrishnan.
Online migration for geo-distributed storage systems. USENIX ATC, pages 15-15, Berkeley, CA, USA,
[28] O. Wolfson, S. Jajodia, and Y. Huang. An adaptive data replication algorithm. ACM Trans. Database Syst., 22(2): , June
[29] Z. Yin et al. An empirical study on configuration errors in commercial and open source systems. In SOSP,