Take me to your leader! Online Optimization of Distributed Storage Configurations

Artyom Sharov, Alexander Shraer, Arif Merchant, Murray Stokely
Google, Inc.

Computer Science Department, Technion, Israel. Work done as part of a summer internship in Google's Distributed Storage Analytics team.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Obtain permission prior to any use beyond those covered by the license. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12. Copyright 2015 VLDB Endowment.

ABSTRACT

The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the best configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader and quorum based replication schemes and divide the framework into three optimization tiers, dynamically optimizing different configuration aspects: 1) leader placement, 2) roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally in Google and demonstrate that most client applications significantly benefit from using our framework, reducing average operation latency by up to 94%.

1. INTRODUCTION

Storage is changing from being mostly in-house and local to become a fully globally-distributed service. Cloud storage services such as Amazon S3, Microsoft Azure, and Google Cloud Storage form the underpinnings of many Internet services with clients distributed all over the world. Typically, applications need continuous availability and reasonably good data access latency, which translates into storing multiple data replicas in different geographic locations. Distributed storage systems usually provide consistency across replicas of data (or metadata) using serialization or conflict resolution protocols. Distributed atomic commit or consensus-based protocols are often used when strong consistency is required, while simpler protocols suffice when replicas need only preserve weak or eventual consistency. Such protocols typically define multiple possible roles for the replicas, such as a leader or master replica coordinating updates, and appoint some of the replicas to participate in the commit protocol. The performance of such a system depends on the configuration: how many replicas, where they are located, and which roles they serve; for optimal performance, the configuration must be tuned to the workloads.

Different applications using the same storage service may have completely different workloads; for example, a logging system may use the storage mostly for writes and have relatively few clients, while an application responsible for access control may be read heavy. The workloads can be extremely variable, both in the short term (for example, load may come from different parts of the world at different times of the day for a social application) and in the long term (the administrators of the service may reconfigure their servers periodically, causing different load patterns). Long term workload variation could also be due to organic changes in the demands on the Internet service itself; for example, if a shopping service becomes more popular in a region, its demands on the underlying storage system may shift. The cloud storage service must adapt to these changes seamlessly. In fact, elasticity is an integral part of cloud computing and one of its most attractive premises.
Reconfiguring a distributed storage system at run-time while preserving its consistency and availability is a very challenging problem and misconfigurations have been cited as a primary cause for production failures [29]. Due to its practical significance the problem has received abundant attention in recent years both in academia and in the industry (see Section 7), focusing mainly on the design and implementation of efficient and non-disruptive mechanisms for reconfiguration. Yet, little insight exists on how to set policies and use reconfiguration mechanisms in order to improve system performance. For example, dynamic reconfiguration APIs have recently been added to the Apache ZooKeeper distributed coordination system [25, 19] and users have since been asking for automatic workload-based configuration management, e.g. [6]. At Google, site reliability engineers (SREs) are masters in the black art of determining deployment policies and tuning system parameters. However, hand-tuning a cloud storage system that supports hundreds of distinct workloads is difficult and prone to misconfigurations.

We describe the design and implementation of a workload-driven framework for automatically and dynamically optimizing the replication policy of distributed storage systems. We showcase our framework by optimizing a large-scale distributed storage system used internally in Google. In this work we focus on operation latency as the main optimization objective. We assume that the various aspects of load distribution and balancing are addressed by the underlying storage system, as is the case for our storage system and for many other massively scalable systems.

Section 2 contains an overview of our storage model. In short, users define databases which are then partitioned and replicated [1, 10, 14, 15, 16, 17, 26]. The replication policy is defined by the database administrator, usually an SRE responsible for a specific client application. For example, if most writers are expected to reside in western Europe, the administrator will likely place replicas in European locations, and make some of them voters so that a quorum required to commit state changes can be collected in Europe without using expensive cross-continental network links. One of the voters (elected dynamically) acts as a leader and coordinates the replication. Since leaders are involved in many of the operations, their location significantly affects latency.

Section 3 describes the first tier of our optimization framework, which, given replica locations and roles, optimizes leader placement. Our algorithm uses a detailed characterization of the relevant operations supported by the storage system. Unlike many of the previously proposed methods, such as placing the leader close to the writers (previously suggested, for example, in the context of Google Megastore or Yahoo! PNUTS), which may work for one workload but not for another, we leverage Google's monitoring infrastructure to dynamically and accurately track the global workload of each database, as well as network latencies. Our evaluation, using both simulations with production workloads and production experiments, presented in Section 6, shows that our method is fast, accurate and significantly outperforms previously proposed "common sense" heuristics.
We show that it results in a substantial speed-up for the vast majority of databases in our system, halving the operation latency for 17% of the databases and reducing the latency by up to 94% for others.

The goal of the second optimization tier, presented in Section 4, is, given replica locations, to determine the best replica roles in addition to the optimal leader placement. While the number of voter replicas is usually determined based on availability requirements, their locations are flexible. For example, if 2 simultaneous replica failures must be tolerated, 5 voting replicas will be used and our system determines which 5 replicas are to be voters based on the observed workload. Usually in such systems the leader is one of the voters and thus optimizing voter placement not only affects the latency of commits but also of other operations involving the leader. Our evaluation in Section 6 confirms the benefits of dynamic role assignment for both write and read heavy workloads, achieving a speed-up of up to 50% on top of the first optimization tier, for some databases.

Finally, in Section 5 we present two new algorithms used to determine replica locations (in addition to choosing replica roles and a leader), which constitutes our third optimization tier. For example, if a database suddenly experiences a surge in client operations coming from Asia, we will detect the change in workload and identify the best replica locations to minimize latency. The algorithm additionally determines which among the replicas should be voters and which voter should be leader. Section 6 shows that the two algorithms are near optimal and identifies workloads where each of the algorithms is preferable. We also show how these algorithms can be used to determine the desired number of replicas.

The optimization algorithms are dynamic: the best configuration for a given time interval is determined considering the workload in previous time intervals, weighted according to recency. The algorithms can thus react to workload changes without being overly sensitive to spikes.
In summary, the main contribution of this paper is the design, implementation and evaluation of an optimization framework for dynamically optimizing the replication policy of leader-based storage systems. Our system frees administrators from manually and periodically trying to adjust database replication based on the currently observed workloads, which is a very difficult task since such systems usually support hundreds of distinct workloads and multiple operation types, each involving a series of message exchanges. Our system uses monitoring information to perform the optimization automatically, allowing a detailed and separate consideration of each workload. Our evaluation demonstrates dramatic latency improvements for a production distributed storage system used in Google.

2. STORAGE MODEL

We assume a common distributed storage model combining partitioning and replication to achieve scalability and fault tolerance: users (administrators) define databases, each database is sharded into multiple partitions, and each partition is replicated separately [1, 10, 14, 15, 16, 17, 26]. We call these partitions replication groups. A single replication policy, defined by the database administrator, governs all replication groups in a database and determines the configuration of each group: the number of replicas, their locations and their roles in the replication protocol (footnote 1).

A replica can be either read-write or read-only. Both are full replicas and can serve reads locally. Whereas read-write replicas vote on commits, read-only replicas are notified of state changes only after they occur. Such categorization exists in many systems, such as acceptors and learners in Paxos [21] or participants and observers in ZooKeeper [19]. One of the voting replicas in every replication group is chosen as the group leader. The leader is chosen dynamically and when it fails, a different leader is elected.
Leaders are typically responsible for coordinating replication in their group, for locking and concurrency control to support transactions, and for participating in transactions involving multiple groups.

We denote the voting replicas of a group $g$ by $V(g)$. Recall that all groups in a database $d$ share the same configuration, and hence are replicated across the same locations (i.e., clusters or datacenters). Since we are solely interested in the location of the replicas, we simply use $V(d)$ to refer to $V(g)$ for any group $g$ in a database $d$. Similarly, $R(d)$ refers to the set of all replicas (read-write and read-only). We omit the index $d$ and $g$ when it is clear from the context. A client is denoted by $c$, the replica closest to the client (in terms of network latency) by $nearest(c, R)$ (note that it is taken from the set $R$), and the leader of a group $g$ by $leader(g)$. Finally, we denote the set of client locations by $C$ and the universe of potential replica locations by $S$ (note that $R \subseteq S$).

Footnote 1: The storage system we used for evaluation supports multiple replication policies per database; however, few databases make use of this feature and all such databases perform manual configuration tuning.

Operations are invoked by clients and multiple operations may be executed together as part of transactions. Transactions may involve one or more groups within a database. Next, we identify five representative operations, most commonly supported (under various names) in replicated distributed storage systems. Our framework optimizes operation latency. As such it requires knowledge of the flow of each operation. Hence, for each operation, we give a possible message flow which we use to showcase our optimization framework in the following sections. This flow is simplified and actual implementations may use various optimizations, which may in turn require adjustments to the optimization formulation. However, such optimizations are orthogonal to the contributions of this paper.

weak read. A read from one of the replicas (typically the replica closest to the client). As replicas may lag behind the leader, the read can return stale values. Typically, a client $c$ sends a request to $nearest(c, R)$, which replies with its local copy of the requested data.

bounded read. This read typically includes a bound on the staleness of returned values. If $nearest(c, R)$ is not sufficiently up to date, it forwards the request to the leader $leader(g)$, which responds with the latest version of the data. Once updated, $nearest(c, R)$ responds to the client.

strong read. This read returns the last committed value, typically returned by the leader. In some systems, clients read directly from the leader; in others, the read is relayed through $nearest(c, R)$.

For example, Yahoo! PNUTS [15] supports all three types of reads. Apache ZooKeeper, as well as Amazon SimpleDB [2], DynamoDB [1], MongoDB [4], and many others, support only the weak (sometimes called "eventually consistent") and the strong (sometimes called "consistent") read types. Often, instead of providing an explicit API for bounded reads, systems such as ZooKeeper make guarantees about the maximum allowed staleness of replicas.
The choice among the reads may be explicitly made by the user or automatically done by the storage system based on higher level user preferences (such as in MongoDB or Riak [3]). Next, we define two state update operations:

single-group transaction. This is a state-changing operation or transaction that involves the data of a single replication group $g$. It is executed by $leader(g)$ on behalf of a client $c$, and, for high availability, requires $leader(g)$ to persist the state changes on a quorum (usually a majority) of voting replicas. To this end, $leader(g)$ sends messages to the voting replicas $V(g)$ and waits for a quorum to respond. If the minimal required set of affirmative responses is collected, the transaction commits and a commit message is sent to the client as well as to the group replicas.

multi-group transaction. When committing a transaction involving data from multiple groups, a distributed atomic commit protocol must be executed across the groups in order to either commit or abort the transaction atomically in all involved groups. Typically, this protocol is two-phase commit and the leader of one of the groups involved in the transaction acts as the coordinator while the others are participants. In order to be fault tolerant, every step of the protocol must be agreed upon by the members of each group, and not only by its leader. For the purposes of this paper, we assume the following simple protocol: the coordinator leader $leader(g)$ broadcasts a prepare message to participant leaders. Each participant leader $leader(g')$ checks locally whether the transaction may be committed and if so persists its intention to commit to a quorum of voters $V(g')$. It then sends an ack message to $leader(g)$. Once all participants have acked, $leader(g)$ commits the transaction by persisting it to a quorum of $V(g)$ and responds to the client. It then sends commit messages to the participant leaders which in turn commit the operation in their respective groups by persisting the state changes to a quorum.
For example, VoltDB [26], HyperDex [17], DynamoDB and Microsoft Orleans [14] support both transaction types, whereas PNUTS [15], ZooKeeper [19] and many NoSQL stores only support variants of single-group transactions.

3. TIER 1: LEADER PLACEMENT

In this work, we focus on optimizing operation latency. From the description in Section 2, it is easy to see that operation latency is affected by the location of the client, location of the replica closest to the client, locations of the read-write replicas, and finally, by the choice of the leader among the read-write replicas. Our first algorithm, described in this section, optimizes leader placement without modifying server roles and locations ($V$ and $R$) given by the database administrator.

Consider a database with one thousand groups and five read-write replicas per group, each of which can become leader. It may seem that in order to optimize leader placement for such a database we have to consider 5000 different placement configurations. Unfortunately, this is not the case. Multi-group transactions involve several groups, whose leaders execute a distributed atomic commit protocol. Changing the placement of one group leader may therefore impact the optimal placement of the leaders of other groups. In fact, for our numerical example, in the worst case the number of different placement options is $5^{1000}$.

In order to achieve a practical solution we must reduce the solution space drastically. We chose to optimize leader placement on the granularity of a database instead of a single group. Since all groups in a database are replicated in the same way, this method reduces the solution space to one of the read-write replica locations for this database. In practice, while our optimization algorithm outputs a single location, the storage system may place the different leaders of groups belonging to the database close to this location, taking various constraints related to load balancing, failure diversity, etc., into account.
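The size gap between the three ways of counting placements in the numerical example can be checked directly. The snippet below is only an arithmetic illustration; the variable names are ours and do not come from the system described here.

```python
groups = 1000          # replication groups in the example database
voters_per_group = 5   # read-write replicas per group, each a potential leader

# Treating groups as independent: one 5-way choice per group.
naive_count = groups * voters_per_group

# Worst case with multi-group transactions: leaders must be chosen jointly.
joint_count = voters_per_group ** groups

# Per-database optimization: a single shared leader location.
per_database_count = voters_per_group

digits = len(str(joint_count))
print(naive_count, per_database_count, digits)
```

The joint count has several hundred decimal digits, which is why enumerating per-group placements is not an option and a per-database choice is used instead.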
We denote the leader location produced by our algorithm for database $d$ in the $i$-th time interval by $\lambda_d^{(i)}$. For simplicity, we focus on optimizing the average operation latency; this metric is generalized in Section 3.1. Intuitively, to minimize average operation latency we compute it for every possible leader location (any read-write replica can potentially become leader), and then choose the location yielding minimum average latency. Assuming that different clients in the same cluster experience similar latencies when communicating with servers, we logically group clients within each cluster and consider client clusters rather than individual clients in our analysis. Since different operation types may have different latency profiles, we compute a weighted average. Specifically, at interval $i$ we (a) determine the average latency $t^{(i)}_{\alpha,c}(\ell)$ of each type of operation $\alpha$ from every client cluster $c$, for each potential leader location $\ell$ ranging over the set $V(d)$ of read-write replica locations as defined for database $d$, and (b) quantify the number of operations $n^{(i)}_{\alpha,c}$ of each type $\alpha$ for each client cluster $c$. Finally, we choose a leader $\lambda_d^{(i)}$ that minimizes the following expression (for simplicity we omit the denominator of the weighted average):

$$\lambda_d^{(i)} = \arg\min_{\ell \in V(d)} \{score^{(i)}(\ell)\}, \quad \text{where} \quad score^{(i)}(\ell) = \sum_{\alpha,c} t^{(i)}_{\alpha,c}(\ell) \cdot n^{(i)}_{\alpha,c}.$$

The location $\lambda_d^{(i)}$ can then serve as a prediction for the optimal location $\lambda_d^{(i+1)}$ in the $(i+1)$-st time interval.

Next, we calculate $t^{(i)}_{\alpha,c}(\ell)$ by considering the flow of different operation types described in Section 2. We assume that leader identities are exposed to the clients, and hence clients send strong reads and transactions directly to the relevant leaders. Denote the average roundtrip-time latency between nodes $a$ and $b$ in time interval $i$ by $rtt^{(i)}_{a,b}$. For weak reads, the average latency is simply $rtt^{(i)}_{c,nearest(c,R)}$, and similarly for strong reads, given a candidate leader location $\ell$, we get $rtt^{(i)}_{c,\ell}$. For bounded reads, using the linearity of expectation, we have that the average latency is $rtt^{(i)}_{c,nearest(c,R)} + rtt^{(i)}_{nearest(c,R),\ell}$, which corresponds to a roundtrip between the client and the nearest replica and another roundtrip between that replica and the leader. For simplicity, we do not distinguish here between bounded reads served locally versus those forwarded to the leader, but do implement the distinction.

Calculating single-group transaction latency is slightly more complex, as it requires the leader to wait for a majority of voter responses, i.e., for the median fastest response. Unfortunately, the median and average operations don't commute in general. We found, however, that cross-cluster latency distributions tend to be very narrow around their respective mean values and overlap only for tail latencies. They can therefore be approximated as fixed values for computing the average of the median. The median of average rtt latencies is therefore a very good approximation for the average of median rtt latencies.
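The approximation in the last paragraph can be illustrated on synthetic data. The sketch below assumes, purely for illustration, that each voter's rtt is its mean plus a small uniform jitter, so that the per-voter distributions are narrow and do not overlap; the numbers and helper names are hypothetical and are not measurements from the production system.

```python
import random
import statistics

def average_of_medians(means, jitter, samples, rng):
    """Average, over many rounds, of the median of one sampled rtt per voter."""
    total = 0.0
    for _ in range(samples):
        draws = [mu + rng.uniform(-jitter, jitter) for mu in means]
        total += statistics.median(draws)
    return total / samples

rng = random.Random(0)  # fixed seed so that repeated runs agree

# Hypothetical mean rtts (ms) from a candidate leader to five voters.
voter_rtt_means_ms = [2.0, 35.0, 80.0, 120.0, 150.0]

median_of_averages = statistics.median(voter_rtt_means_ms)
avg_of_medians = average_of_medians(
    voter_rtt_means_ms, jitter=3.0, samples=10_000, rng=rng
)
print(median_of_averages, round(avg_of_medians, 2))
```

Because the sampled distributions never overlap here, the median always comes from the same voter and the two quantities agree closely. With wide, overlapping distributions the gap would grow, which is the case the text sets aside as affecting only tail latencies.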
Hence, the average operation latency of single-group transactions can be estimated as $rtt^{(i)}_{c,\ell} + q^{(i)}_{\ell}$, where $q^{(i)}_{\ell} = \mathrm{median}_{v \in V(d)} \{rtt^{(i)}_{\ell,v}\}$.

The computation for multi-group transactions is made complex by the fact that every multi-group transaction involves a different set of groups. Fortunately, our decision to optimize per database (rather than per group) considerably simplifies the calculation. First, since we are looking for the best single location $\lambda_d^{(i)}$ for all group leaders in the database, we only need to consider assignments which result in the same placement for all group leaders, and the latency between these collocated leaders is effectively negligible. Second, since all groups in a database usually share the same configuration, the set of latency distributions between the leaders and their respective voters is the same for all group leaders in the database. To leverage this, let us recap the flow of multi-group transactions: The client sends a message to one of the group leaders and waits for a response. The average roundtrip latency between the client and the leader is simply $rtt^{(i)}_{c,\ell}$. The leader (coordinator) then contacts other leaders (participants). Since all leaders are in the same location this latency is negligible. Each participant leader then sends a message to its voters and waits for a majority of responses, which takes $q^{(i)}_{\ell}$. Then, participant leaders contact the coordinator leader. Finally, the coordinator leader commits the transaction by sending it to the voters of its group and waits for a majority of responses, which again takes $q^{(i)}_{\ell}$, on average. Overall, the average latency of a multi-group commit can be estimated as $rtt^{(i)}_{c,\ell} + 2 \cdot q^{(i)}_{\ell}$.

To summarize, the score of a candidate leader location $\ell \in V$ is given by Equation 1. Both $n^{(i)}$ and average $rtt^{(i)}$ latencies are measured for the last observed ($i$-th) time interval and, as presented thus far, used to predict $\lambda^{(i+1)}$. Observe that there is a tradeoff between the chosen time interval length and the accuracy of prediction.
By choosing a short interval (e.g., one minute) our solution becomes very sensitive to workload spikes, deteriorating prediction accuracy. By using long intervals (e.g., one day) the prediction may be more accurate yet it may average out potentially interesting workload changes (such as diurnal patterns). Instead of attempting to pick the best time interval length (which may vary across different databases), in Equation 2 we introduce a decay parameter $\tau$ and compute the score based on multiple past intervals (and not just the $i$-th interval), weighting them according to recency using an exponential moving average (for simplicity we again omit the denominator of the weighted average).

$$\begin{aligned}
score^{(i)}(\ell, R, q^{(i)}_{\ell}) = \sum_{c \in C} \Big[\; & n^{(i)}_{\text{weak read},c} \cdot rtt^{(i)}_{c,nearest(c,R)} \\
{}+{} & n^{(i)}_{\text{bounded read},c} \cdot \big(rtt^{(i)}_{c,nearest(c,R)} + rtt^{(i)}_{nearest(c,R),\ell}\big) \\
{}+{} & n^{(i)}_{\text{strong read},c} \cdot rtt^{(i)}_{c,\ell} \\
{}+{} & n^{(i)}_{\text{single-group transaction},c} \cdot \big(rtt^{(i)}_{c,\ell} + q^{(i)}_{\ell}\big) \\
{}+{} & n^{(i)}_{\text{multi-group transaction},c} \cdot \big(rtt^{(i)}_{c,\ell} + 2 \cdot q^{(i)}_{\ell}\big) \;\Big]
\end{aligned} \qquad (1)$$

$$\begin{aligned}
agg\_score^{(i)}(\ell, R, q^{(i)}_{\ell}) &= \frac{1}{\tau} \cdot agg\_score^{(i-1)}(\ell, R, q^{(i-1)}_{\ell}) + score^{(i)}(\ell, R, q^{(i)}_{\ell}) \\
\lambda^{(i)} &= \arg\min_{\ell \in V} \{agg\_score^{(i)}(\ell, R, q^{(i)}_{\ell})\}
\end{aligned} \qquad (2)$$

Interval $i = 1$ is the first interval considered for our analysis and $agg\_score^{(0)}(\cdot, \cdot, \cdot) = 0$. Note that Equation 1 can be generalized to account for multiple replication policies (configurations) per database. However, a straightforward extension is exponential in the number of configurations, as it has to account for multi-group transactions involving every possible subset of configurations, and every possible leader placement in each configuration. The design of a more efficient method is an interesting topic for future research.

3.1 Optimizing Tail Latency

For some users, optimizing tail latency is more important than optimizing for the mean. Although currently our system optimizes for average latency, in the future, we plan to extend it to allow database owners to specify the desired percentile and optimize for that percentile when determining the best configuration for the database.
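Before going further into tail latency, the average-latency objective of Equations 1 and 2 can be sketched as follows. This is a minimal illustration rather than the production implementation: the function names, the dictionary layout for operation counts, and the toy rtt table are assumptions made for the example, and a single rtt table is reused for all intervals to keep it short.

```python
from statistics import median

def symmetric(pairs, nodes):
    """Build a symmetric rtt table (ms) with zero latency from a node to itself."""
    table = {(n, n): 0.0 for n in nodes}
    for (a, b), ms in pairs.items():
        table[(a, b)] = table[(b, a)] = float(ms)
    return table

def nearest(c, replicas, rtt):
    """Replica with the smallest rtt to client cluster c."""
    return min(replicas, key=lambda r: rtt[(c, r)])

def score(leader, replicas, voters, counts, rtt):
    """Equation 1 for one interval: operation counts weighted by estimated latency."""
    q = median(rtt[(leader, v)] for v in voters)  # quorum latency for this leader
    total = 0.0
    for c, n in counts.items():
        near = nearest(c, replicas, rtt)
        c_near, c_leader = rtt[(c, near)], rtt[(c, leader)]
        near_leader = rtt[(near, leader)]
        total += n.get("weak_read", 0) * c_near
        total += n.get("bounded_read", 0) * (c_near + near_leader)
        total += n.get("strong_read", 0) * c_leader
        total += n.get("single_group_txn", 0) * (c_leader + q)
        total += n.get("multi_group_txn", 0) * (c_leader + 2 * q)
    return total

def choose_leader(intervals, replicas, voters, rtt, tau):
    """Equation 2: agg_i = agg_{i-1} / tau + score_i, then argmin over the voters."""
    agg = {l: 0.0 for l in voters}
    for counts in intervals:
        for l in voters:
            agg[l] = agg[l] / tau + score(l, replicas, voters, counts, rtt)
    return min(voters, key=agg.get), agg

# Toy setup: three voting replicas A, B, C and two client clusters x, y.
replicas = voters = ["A", "B", "C"]
rtt = symmetric(
    {("A", "B"): 10, ("A", "C"): 100, ("B", "C"): 90,
     ("x", "A"): 1, ("x", "B"): 12, ("x", "C"): 100,
     ("y", "A"): 100, ("y", "B"): 90, ("y", "C"): 1},
    nodes=["A", "B", "C", "x", "y"],
)
interval_1 = {"x": {"single_group_txn": 100}, "y": {"weak_read": 10}}
interval_2 = {"y": {"strong_read": 1000}}

leader_1, agg_1 = choose_leader([interval_1], replicas, voters, rtt, tau=2.0)
leader_2, agg_2 = choose_leader([interval_1, interval_2], replicas, voters, rtt, tau=2.0)
print(leader_1, leader_2)
```

In this toy run the writes from cluster x favor replica A in the first interval, while the surge of strong reads from cluster y in the second interval outweighs the decayed history and moves the preferred leader to C.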
When considering tail latency, we can no longer use the nice linearity properties we leveraged so far. Below we show how to extend the score calculation in Equation 1. As input, instead of the average roundtrip-time latencies, we now need to know the roundtrip-time latency distribution $H_{a,b}$ between each pair of locations $a$ and $b$. We assume that these distributions are independent. For simplicity, assume that latencies are discretized as multiples of 1 ms.

When computing the latency of each operation type, instead of summing up averages we should compute the distribution of the sum of random variables. As an example, consider the simple case of a bounded read, which travels from a client $c$ to the closest replica $nearest(c, R)$, then from $nearest(c, R)$ to the leader $\ell$ and back all the way to the client. In order to find the latency distribution of this operation we perform a discrete convolution $H_{c,nearest(c,R)} * H_{nearest(c,R),\ell}$, as follows:

$$\begin{aligned}
Pr\big(t^{(i)}_{\text{bounded read},c}(\ell) = x\big) &= Pr\big(rtt_{c,nearest(c,R)} + rtt_{nearest(c,R),\ell} = x\big) \\
&= \sum_{k=m}^{x} Pr\big(rtt_{c,nearest(c,R)} = k,\; rtt_{nearest(c,R),\ell} = x - k\big) \\
&= \sum_{k=m}^{x} Pr\big(rtt_{c,nearest(c,R)} = k\big) \cdot Pr\big(rtt_{nearest(c,R),\ell} = x - k\big),
\end{aligned}$$

where $m$ denotes the minimum possible value of $t^{(i)}_{\text{bounded read},c}(\ell)$ and $rtt$ is the random variable corresponding to the latency (rather than the average latency). Once the distribution of the sum has been computed, the required percentile can be taken from this distribution.

Computing the distribution of the quorum latency is more complex. The simplest numeric method is to perform a Monte Carlo simulation, repeatedly sampling the distributions $H_{\ell,v}$ for $v \in V$ and computing the median latency each time. For an analytical solution, observe that the leader needs to collect $majority - 1$ responses from other servers, where $majority = \lceil (|V|+1)/2 \rceil$, and assume that the leader's own response arrives faster than any other response. The CDF of the maximum response time from any set of read-write replicas is simply the product of the CDFs of response time for the individual replicas. For example, for 3 read-write replicas $\ell$, $v$ and $w$, where $\ell$ is the candidate leader:

$$Pr\big(\max(rtt_{\ell,v}, rtt_{\ell,w}) \le x\big) = Pr\big(rtt_{\ell,v} \le x,\; rtt_{\ell,w} \le x\big) = Pr\big(rtt_{\ell,v} \le x\big) \cdot Pr\big(rtt_{\ell,w} \le x\big)$$

We can therefore construct the CDF of maximum response time for every subset of the read-write replicas.
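The two building blocks used so far, the convolution of independent latency distributions and the product of CDFs for the maximum, can be sketched as follows. Distributions are represented as dictionaries mapping a latency in milliseconds to its probability; this representation, the function names and the sample numbers are assumptions made for the illustration.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete latencies (ms -> probability)."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

def cdf(p, x):
    """Pr(latency <= x) for a discrete distribution p."""
    return sum(prob for value, prob in p.items() if value <= x)

def percentile(p, fraction):
    """Smallest latency x such that Pr(latency <= x) >= fraction."""
    running = 0.0
    for value in sorted(p):
        running += p[value]
        if running >= fraction - 1e-12:  # small tolerance for rounding error
            return value
    return max(p)

def max_cdf(dists, x):
    """CDF of the slowest of several independent responses: product of the CDFs."""
    result = 1.0
    for d in dists:
        result *= cdf(d, x)
    return result

# Hypothetical distributions: client to nearest replica, nearest replica to leader.
h_client_nearest = {1: 0.5, 2: 0.5}
h_nearest_leader = {10: 0.9, 50: 0.1}
bounded_read = convolve(h_client_nearest, h_nearest_leader)

# Hypothetical leader-to-voter distributions for two voters v and w.
h_leader_v = {10: 0.9, 50: 0.1}
h_leader_w = {20: 0.5, 60: 0.5}

print(sorted(bounded_read.items()))
print(percentile(bounded_read, 0.5), percentile(bounded_read, 0.99))
print(max_cdf([h_leader_v, h_leader_w], 20))
```

The median of the bounded read stays close to the common case, while the 99th percentile is dominated by the slow 50 ms hop, which is exactly the effect an average-based score cannot see.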
From these, using the inclusion-exclusion principle [5], we can compute the probability of the event that at least one subset of the read-write replicas, of cardinality $majority - 1$, has maximum response latency of at most $x$, for each $x$. But this event is equivalent to the event that the quorum's response time is at most $x$, hence it gives us the CDF of the quorum response time. Continuing our example for 3 read-write replicas, we get:

$$Pr\big(q^{(i)}_{\ell} \le x\big) = Pr\big(rtt_{\ell,v} \le x\big) + Pr\big(rtt_{\ell,w} \le x\big) - Pr\big(\max(rtt_{\ell,v}, rtt_{\ell,w}) \le x\big)$$

4. TIER 2: LEADER AND REPLICA ROLES

Our tier-1 optimization algorithm, described in the previous section, optimizes leader placement while keeping $V$ and $R$ fixed. In this section, we introduce our tier-2 optimization algorithm that determines the best voter locations $V$ from $R$ as well as the best leader location from $V$. This algorithm does not modify $R$ (this is the topic of Section 5). As we explain next, for reasons of efficiency we do not directly use the method described in Section 3, but rather the optimization objective in Equation 2.

In order to evaluate and compare different configurations we must take into account the best leader location possible with the configuration. Given a configuration with $|R|$ replica locations, a brute-force approach (shown in Algorithm 1) is to enumerate $\binom{|R|}{|V|}$ configurations, corresponding to different possible subsets of $num\_voters = |V|$ read-write replicas, and for each one find the optimal leader using the algorithm given in Section 3. Recall that our tier-1 algorithm evaluates every read-write replica as a potential leader by considering the resulting cost for every client cluster. For a typical database in the production system considered in Section 6, this algorithm would result in more than 26 million computations per database and time interval.

Algorithm 1 Brute-force algorithm for tier-2.
1: procedure tier-2-brute-force($R$, $num\_voters$)
2:   for each set $V \subseteq R$ with $|V| = num\_voters$
3:     for each replica $\ell \in V$
4:       $q^{(i)}_{\ell} = \mathrm{median}_{v \in V} \{rtt^{(i)}_{\ell,v}\}$
5:       $score_{\ell} \leftarrow agg\_score^{(i)}(\ell, R, q^{(i)}_{\ell})$ (Equation 2)
6:     $\lambda_V \leftarrow \arg\min_{\ell \in V} \{score_{\ell}\}$
7:     $score_V \leftarrow agg\_score^{(i)}(\lambda_V, R, q^{(i)}_{\lambda_V})$
8:   $V_{opt} = \arg\min_{V} \{score_V\}$
9:   return $(\lambda_{V_{opt}}, V_{opt})$ // Optimal leader and quorum

We propose a much more efficient alternative (Algorithm 2): instead of picking read-write replicas first and then the leader among them, we reverse the decision order and eliminate configurations that are clearly sub-optimal due to their poor leader score. More precisely, for every candidate leader $\ell$, out of all the replica locations $R$ (not just read-write), we find the $k$-th smallest rtt to other replicas, where $k = \lceil (num\_voters+1)/2 \rceil$, and use it to calculate $score^{(i)}_{\ell}$. We then choose the leader $\lambda^{(i)}$ for which $score^{(i)}$ is minimized. Finally, we compute the set of voters for leader $\lambda^{(i)}$ by picking the $num\_voters$ replicas with minimum rtt from the leader (in fact, we could take just the $k$ fastest replicas and pick the remaining $num\_voters - k$ replicas arbitrarily since quorum latency is determined by the fastest majority of votes).

Algorithm 2 Efficient algorithm for tier-2.
1: procedure tier-2-efficient($R$, $num\_voters$)
2:   for each replica $\ell \in R$
3:     $q^{(i)}_{\ell} \leftarrow \lceil (num\_voters+1)/2 \rceil$-th smallest $rtt^{(i)}_{\ell,r}$, $r \in R$
4:     $score_{\ell} \leftarrow agg\_score^{(i)}(\ell, R, q^{(i)}_{\ell})$ (Equation 2)
5:   $\lambda = \arg\min_{\ell \in R} \{score_{\ell}\}$
6:   $max\_voter\_rtt \leftarrow num\_voters$-th smallest $rtt_{\lambda,r}$, $r \in R$
7:   $U_{\lambda} \leftarrow$ k-closest($\lambda$, $num\_voters$, $max\_voter\_rtt$, $R$)
8:   return $(\lambda, U_{\lambda})$
9: procedure k-closest($\ell$, $num\_voters$, $max\_latency$, $R$)
10:   $P_{<} \leftarrow \{r \in R \mid rtt_{\ell,r} < max\_latency\}$
11:   $P_{=} \leftarrow \{r \in R \mid rtt_{\ell,r} = max\_latency\}$
12:   return $P_{<} \cup \{(num\_voters - |P_{<}|)$ elements from $P_{=}\}$

To see why Algorithm 2 returns the optimal solution, let us consider the leader $\lambda_{V_{opt}}$ and set of voters $V_{opt}$ returned by Algorithm 1. Algorithm 2 evaluates $\lambda_{V_{opt}}$ as candidate leader (line 2).
Optimality follows from the fact that Algorithm 2 chooses voters for $\lambda_{V_{opt}}$ from a larger set of candidates (since $V_{opt} \subseteq R$) and therefore quorum latency for $\lambda_{V_{opt}}$ is necessarily smaller or equal in Algorithm 2 (line 3) compared to Algorithm 1 (line 4).

Complexity of Algorithm 2. It takes $O(|R|)$ to consider every replica as candidate leader and, for each candidate, another $O(|R|)$ to find the $k$-th smallest rtt to other replicas (using a worst-case linear time selection algorithm). Invocation of k-closest in the last step of the algorithm takes $O(|R|)$, overall yielding $O(|R|^2)$ complexity, clearly better compared to the exponential complexity of Algorithm 1. For a typical database in our storage system, Algorithm 2 requires roughly 2.8 thousand computations per database and time interval, which is 4 orders of magnitude less than brute force.

Failure Diversity. Since progress in quorum-based replicated systems is usually guaranteed only as long as a quorum of read-write replicas is available and can communicate in a timely manner, read-write replicas must be located in different failure domains. To account for this fact, we slightly modify our algorithm as follows: instead of choosing the $k$-th smallest rtt for each leader candidate out of a set of $|R|$ replicas, we pre-process the set for each candidate leader by bucketing replicas according to the different failure domains and choosing the replica with smallest rtt from the candidate leader as a representative replica from each domain (bucket), filtering out the remaining replicas in each domain. We then select the $k$-th smallest rtt from the reduced set of replicas. The linear time pre-processing does not increase the complexity of our algorithm and may potentially speed up the execution of k-closest. Note that in practice there may be multiple different diversity constraints that one may want to consider, and further adjustments may be required. Our system recommends an alternative set of replicas for a given database preserving the failure diversity level currently met by the database configuration. In the future, we plan to offer clients several alternative configurations trading off increased fault tolerance and operation latency.
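The relationship between the brute-force and the efficient tier-2 procedures can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the production code: `leader_score` is a hypothetical stand-in for the aggregate score of Equation 2 that only counts single-group transactions, sorting is used in place of linear-time selection, failure domains are ignored, and replicas are placed on a line so that rtt is simply the distance between positions.

```python
import math
from itertools import combinations

def kth_smallest(values, k):
    """k-th smallest element (1-based); sorting is used here for brevity."""
    return sorted(values)[k - 1]

def leader_score(leader, q, writes, rtt):
    """Stand-in score: each write costs client-to-leader rtt plus quorum latency q."""
    return sum(n * (rtt[(c, leader)] + q) for c, n in writes.items())

def tier2_brute_force(replicas, num_voters, writes, rtt):
    """Try every voter subset and every leader inside it (as in Algorithm 1)."""
    k = math.ceil((num_voters + 1) / 2)
    best = None
    for voters in combinations(replicas, num_voters):
        for leader in voters:
            q = kth_smallest([rtt[(leader, v)] for v in voters], k)
            s = leader_score(leader, q, writes, rtt)
            if best is None or s < best[0]:
                best = (s, leader, set(voters))
    return best

def tier2_efficient(replicas, num_voters, writes, rtt):
    """Pick the leader first, then its closest replicas as voters (as in Algorithm 2)."""
    k = math.ceil((num_voters + 1) / 2)

    def quorum_latency(l):
        return kth_smallest([rtt[(l, r)] for r in replicas], k)

    leader = min(
        replicas, key=lambda l: leader_score(l, quorum_latency(l), writes, rtt)
    )
    voters = sorted(replicas, key=lambda r: rtt[(leader, r)])[:num_voters]
    return leader_score(leader, quorum_latency(leader), writes, rtt), leader, set(voters)

# Toy layout: positions on a line, rtt = distance. A..E are replicas, x and y clients.
positions = {"A": 0, "B": 10, "C": 20, "D": 100, "E": 110, "x": 5, "y": 105}
rtt = {(a, b): abs(pa - pb) for a, pa in positions.items() for b, pb in positions.items()}
replicas = ["A", "B", "C", "D", "E"]
writes = {"x": 10, "y": 1}

brute = tier2_brute_force(replicas, 3, writes, rtt)
fast = tier2_efficient(replicas, 3, writes, rtt)
print(brute, fast)
```

Both procedures reach the same score and leader on this input. The voter sets may legitimately differ on other inputs, since only the fastest majority determines quorum latency and the remaining voters can be chosen arbitrarily.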
TIER 3: REPLICA LOCATIONS, ROLES, AND LEADER In this section we expand the scope of our optimization and present two efficient agorithms to seect the best set of repicas R from the possibe ocations S, a set of voters V R and the best eader from V (one of the agorithms makes direct use of Agorithm 2). Unike the agorithms in Sections 3 and 4, which find the optima soution, the agorithms in this section are heuristics and we compare the achieved soutions with the optimum in Section 6.4. The agorithms in this section take the desired number of repicas as a parameter. How to determine this number is further expored in Section 6.4. Simiary to Section 4, a brute force agorithm is straightforward, but exponentia in compexity and highy impractica: such an agorithm coud consider every possibe subset of repicas R S, execute tier-2-efficient(r, V ), and choose the best combination of eader, voters and repicas. Reca that in Section 4 we reduced the search space by finding the best score for every possibe eader in inear time. This approach yieded an optima soution because the ocations of a repicas were fixed and thus, for each cient, nearest(c, R) and hence aso rtt c,nearest(c,r) did not depend on the choice of the set of voters or the eader. Our goa was to minimize the rest of the expression in Equation 1. Here, on the other hand, optimizing the function nearest is part of the probem. Given the ocation of a candidate eader, we cannot, for exampe, greediy choose the cosest repicas to to be voters since it may be better to trade off quorum atency for decreasing the atency between the cients and their cosest repicas (e.g., for a read-heavy database). Our probem is a variant of non-metric faciity ocation, which is known to be NP-Compete. We present two efficient heuristics for choosing repica ocations, both make use of Agorithm 3, a variant of the weighted K-Means agorithm. 
The algorithm assigns a weight $w_c$ to each client cluster $c$ based on the total number of operations performed by $c$:

$$w_c = \sum_{\alpha} n^{(i)}_{\alpha,c} = n^{(i)}_{\text{weak read},c} + n^{(i)}_{\text{bounded read},c} + n^{(i)}_{\text{strong read},c} + n^{(i)}_{\text{single-group transaction},c} + n^{(i)}_{\text{multi-group transaction},c}$$

The goal of Algorithm 3 is to find a set of servers $G$ ($G \subseteq S$) such that cost($G$) is minimized:

$$cost(G) = \sum_{c \in C} w_c \cdot rtt^{(i)}_{c,nearest(c,G)} \quad (3)$$

Algorithm 3 Weighted K-Means for choosing replica locations.
 1: // L_fixed: set of fixed replica locations, which can't be moved
 2: // num_replicas: total number of replicas to be placed
 3: procedure weighted-k-means(L_fixed, num_replicas)
 4:   // pick initial centroids
 5:   G ← L_fixed
 6:   sort all client clusters c ∈ C by descending w_c
 7:   while |G| < num_replicas and more client clusters remain
 8:     c ← next client cluster in C
 9:     if nearest(c, S) ∉ G then
10:       add nearest(c, S) to G
11:   new_cost ← cost(G)
12:   repeat
13:     prev_cost ← new_cost
14:     // cluster clients according to nearest centroid
15:     ∀g ∈ G let C_g ← {c | g = nearest(c, G)}
16:     // attempt to adjust centroids
17:     for each g ∈ G \ L_fixed
18:       g' ← v ∈ S s.t. Σ_{c ∈ C_g} w_c · rtt^(i)_{c,v} is minimized
19:       update centroid g to g'
20:     new_cost ← cost(G)
21:   until |new_cost − prev_cost| < threshold
22:   return G

Algorithm 3 gets an initial set of replica locations (centroids) L_fixed and the total desired number of locations num_replicas as parameters. First, we choose initial locations for the remaining centroids (lines 6-10) by placing them close to the heaviest client clusters (according to $w_c$). Each centroid location $g$ defines a set of client clusters $C_g$ for which $g$ is the nearest centroid (line 15). The remainder of Algorithm 3 tries to adjust the position of each centroid $g$ in a way that minimizes the cost (weighted round-trip time) for the clients in $C_g$. Note that the centroids in L_fixed are not moved. The algorithm terminates, returning the set of centroids $G$, once there is no sufficient improvement in the total cost, i.e., cost($G$).
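The structure of Algorithm 3 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the nested dictionary `rtt[c][s]`, the `weights` mapping, the `max_rounds` cap and the rule that a centroid never moves onto a site already held by another centroid are all assumptions introduced here, and the simplified cost of Equation 3 is used throughout.

```python
def nearest(c, sites, rtt):
    """Site in `sites` with the smallest RTT from client cluster c."""
    return min(sites, key=lambda s: rtt[c][s])


def cost(G, weights, rtt):
    """Equation 3: weighted RTT from each client cluster to its nearest centroid."""
    return sum(w * rtt[c][nearest(c, G, rtt)] for c, w in weights.items())


def weighted_k_means(fixed, num_replicas, sites, weights, rtt,
                     threshold=1e-9, max_rounds=100):
    """Sketch of Algorithm 3. `fixed` holds pinned centroids (L_fixed)."""
    G = list(fixed)
    # Lines 6-10: seed centroids next to the heaviest client clusters.
    for c in sorted(weights, key=weights.get, reverse=True):
        if len(G) >= num_replicas:
            break
        s = nearest(c, sites, rtt)
        if s not in G:
            G.append(s)
    new_cost = cost(G, weights, rtt)
    for _ in range(max_rounds):  # assumption: hard cap on the number of rounds
        prev_cost = new_cost
        # Line 15: group client clusters by their nearest centroid.
        members = {g: [] for g in G}
        for c in weights:
            members[nearest(c, G, rtt)].append(c)
        # Lines 17-19: move each non-pinned centroid to the best site for its group.
        for i, g in enumerate(G):
            if g in fixed or not members[g]:
                continue
            best = min(sites, key=lambda v: sum(weights[c] * rtt[c][v]
                                                for c in members[g]))
            if best not in G:  # assumption: keep centroids distinct
                G[i] = best
        new_cost = cost(G, weights, rtt)
        if abs(new_cost - prev_cost) < threshold:
            break
    return set(G)
```

Passing a non-empty `fixed` list corresponds to pinning centroids, which is how a caller could keep an already chosen quorum in place while the remaining replicas are positioned.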
Recall that our goal is not only to find good replica locations, but also to find a quorum and a leader. Our two new algorithms differ in the order in which they perform these tasks. Algorithm 4 first places all replicas in strategic locations using Algorithm 3 and then invokes Algorithm 2 to determine the leader and voters from within the replicas. Algorithm 5, on the other hand, first sets the leader and a quorum of voters and then invokes Algorithm 3 to place the remaining replicas close to the clients. More specifically, we go over all possible leader locations in $S$ and find the best quorum for each leader. This quorum is then used as a set of centroids during the invocation of Algorithm 3, but these centroids are pinned down and not moved by the algorithm.

Algorithm 4 Algorithm KQ.
1: procedure KMeans-Quorum(num_replicas, num_voters)
2:   G ← weighted-k-means(∅, num_replicas)
3:   (λ, V) ← tier-2-efficient(G, num_voters)
4:   // Return the leader, set of voters and set of replicas
5:   return (λ, V, G)

Algorithm 5 Algorithm QK.
 1: procedure Quorum-KMeans(num_replicas, num_voters)
 2:   majority ← ⌈(num_voters + 1) / 2⌉
 3:   minority ← num_voters − majority
 4:   for each replica l ∈ S
 5:     q_l^(i) ← majority-th smallest rtt^(i)_{l,s}, s ∈ S
 6:     Q_l ← k-closest(l, majority, q_l^(i), S)
 7:     G_l ← weighted-k-means(Q_l, num_replicas)
 8:     score_l ← agg_score^(i)(l, G_l, q_l^(i)) (Equation 2)
 9:   λ = arg min_{l ∈ S} {score_l}
10:   O ← any minority locations from G_λ \ Q_λ
11:   // Return the leader, set of voters and set of replicas
12:   return (λ, Q_λ ∪ O, G_λ)

Note that in Algorithm 5, unlike in Algorithm 4, we know both the leader and the quorum latency when invoking Algorithm 3, and therefore in line 18 of Algorithm 3 we actually use the cost function given in Equation 1 (with the change that the summation is done only over clients in $C_g$) instead of the simplified cost model given in Equation 3. For simplicity, this is omitted from the pseudo-code.

6. EVALUATION

In this section we describe the evaluation of our optimization framework with one of Google's large-scale distributed storage systems. This particular system supports the five representative operation types described in Section 2, which made it a natural candidate for optimization. We implemented a system consisting of three tools: a data collection pipeline, an optimizer, and a simulator.
The data collection pipeline fetches relevant inputs on the number and latencies of relevant operations from Google's monitoring tools, as well as relevant data from the database schemas such as the network QoS class used by each database, and then prepares it for consumption by the optimizer. The data is broken down into several nonoverlapping time intervals, within each interval by database, and within each database by client cluster and by operation type. The optimizer generates scores for each one of the requested optimization tiers on each one of the time intervals 1..i reported by the collection pipeline, using an exponential moving average as demonstrated in Section 3 (Equation 2) with τ = 2 as the decay parameter. It then gives a placement recommendation for each interval based on the previous ones. Finally, the simulator compares the optimizer's recommended placement strategy for each interval with other reasonable placement heuristics as well as with the optimal placement for the time interval.

Our experiments were carried out on machines with a 12-core 3.50GHz Xeon(R) CPU and 32 GB RAM. The running times of our tools for tiers 1 and 2 for 48 time intervals on all production databases combined were under 1.5 minutes. In what follows we present experiments dedicated to each of the optimization tiers.

6.1 Leader Placement

In this section we show experimental results demonstrating the benefit of optimizing leader placement for the vast majority of databases in our storage system.

Speedup potential. In the following experiment, we scored the current configuration of each database and compared it with the configuration proposed by our optimizer. We analyzed the average operation latency of databases during one typical workday partitioned into 48 nonoverlapping intervals of 30 minutes each.
In our storage system, a database administrator can specify an optional preferred leader location, and the storage system picks a location close to it (it may not always be possible to use the preferred location due to lack of available resources or ongoing maintenance). When assigning a score to the current database configuration, we need to distinguish databases with and without the preferred leader setting. To each database with preferred leader set to location $l \in V()$ we assign the score $score^{(i)} = score^{(i)}(l, R, q^{(i)}_l)$ at each interval $i = 1, 2, \ldots, 48$. For databases without a specified preferred leader, the group leaders are assumed to be spread uniformly across $V()$ (according to our observations, this assumption closely models real deployments in our system). Accordingly, the score of such a database configuration is the average of the scores $score^{(i)}(l, R, q^{(i)}_l)$ across $V()$:

$$score^{(i)} = \frac{1}{|V()|} \sum_{l \in V()} score^{(i)}(l, R, q^{(i)}_l). \quad (4)$$

For each interval $i = 1, 2, \ldots, 48$, we calculate $1 - score^{(i)}(\lambda^{(i-1)}, R, q^{(i-1)}_{\lambda^{(i-1)}}) / score^{(i)}$, the potential reduction in latency when following the placement recommendation of our optimizer, which places the leader in interval $i$ based on the preceding intervals $1 \ldots i-1$. For each database, we calculate the average latency reduction over all the values of $i$.

Figures 1(a) and 1(b) demonstrate the effectiveness of optimizing leader placement for databases with and without an existing preferred leader setting, respectively. Observe that there is a significant divide between databases that manually set the preferred leader and those that do not in terms of latency reduction when following our recommended leader placement. We can see that typically, administrators who choose to set the preferred leader set it in a way matching the recommendation of our optimizer; this can be seen in Figure 1(a), which shows that over 75% of the databases of this kind are found in the first bucket [0, 1%], i.e., they cannot further benefit from our recommendation.
This serves as a validation that our model matches the intention of database administrators in all these cases. We see, however, that for some databases the manual setting is sub-optimal, as evidenced by the existence of 10% outliers, the score of which is off by at least 10% from the optimum. Our recommendations can help speed up such outliers.

We found, however, that only 25% of all databases specify a preferred leader and, with more new databases created, this percentage diminishes further. This surprising fact further motivates the need for automatic optimizations. For the remaining 75% of the databases our tool provides significant latency improvements, as shown in Figure 1(b). The average operation latency in many of the cases can be reduced by tens of percent. Over 17% of such databases can have their average operation latency reduced by more than half by following our placement recommendation. For some databases, latency is reduced by more than 90%.

Figure 1: Histogram of latency reduction for databases with and without a preferred leader setting. Bins on the x-axis denote % latency reduction compared to current placement. The height of each bin (y-axis) is the percent of databases (with or without a preferred leader setting) for which the corresponding reduction in latency was measured.

Optimizer output and recommendation oscillation. Figure 2 shows a sample output of our optimizer, which outputs the best leader location every 30 minutes and the latency overhead for alternative locations, compared to the best one (for brevity, we show only 2 additional locations). Notice the oscillation in recommendations between clusters 1 and 3, caused both by their similar scores and by workload changes between 00:30–02:00 and 03:00–05:00. Our algorithm mitigates minor workload spikes by using a decay parameter τ which counters the spikes with historic scores. A second level of defence should be deployed which considers the costs and benefits of moving the leader to a different location. For example, moving the leader may not be worthwhile if the optimizer predicts a 2% latency reduction.

22:30: opt 1, 2nd best 2 = 4%, 3rd best 3 = 9.1%
23:00: opt 1, 2nd best 2 = 5.77%, 3rd best 3 = 9.77%
23:30: opt 1, 2nd best 3 = 5.2%, 3rd best 3 = 23.53%
00:00: opt 1, 2nd best 2 = 5.24%, 3rd best 3 = 7.68%
00:30: opt 3, 2nd best 1 = 5.59%, 3rd best 2 = 13.07%
...
02:00: opt 3, 2nd best 1 = 14.32%, 3rd best 2 = 23.42%
02:30: opt 1, 2nd best 3 = 7.38%, 3rd best 2 = 9.16%
03:00: opt 3, 2nd best 1 = 22.6%, 3rd best 2 = 33.09%
...
05:00: opt 3, 2nd best 1 = 11.49%, 3rd best 2 = 23.46%
05:30: opt 3, 2nd best 1 = 3.3%, 3rd best 2 = 15.08%
06:00: opt 1, 2nd best 3 = 0.92%, 3rd best 2 = 11.73%

Figure 2: Sample output of the optimizer.

Comparison with other placement strategies. We use our simulator to compare four placement policies using historical storage activity data from one typical day, discretized into 48 intervals of 30 minutes each. For $i = 2, 3, 4, \ldots, 48$ and each one of the strategies, the simulator sets the leader at time interval $i$ based on the prediction provided by the placement strategy on interval $i-1$, and assigns a score $s^{(i)}$ to that prediction based on the actual workload data for interval $i$. We considered four strategies for each database:

(optimized) placing the leader at $\lambda^{(i-1)}$ (with decay τ = 2), as predicted by the optimizer using data from intervals preceding $i$, whose score on interval $i$ is $score^{(i)}(\lambda^{(i-1)}, R, q^{(i-1)}_{\lambda^{(i-1)}})$,

(closest-to-writes) placing the leader statically in a cluster $l$ from which most of the transactions in interval $i = 0$ originated, with score $score^{(i)}(l, R, q^{(i)}_l)$,

(smallest-quorum) placing the leader in a cluster $l$ where the round-trip latency $median_{v \in V()}\{rtt^{(i)}_{l,v}\}$ from the leader to the majority of voters is minimal, with score $score^{(i)}(l, R, q^{(i)}_l)$,

(average) random leader location across all the groups of the database, achieving the average score as in (4).

We compare with the closest-to-writes and smallest-quorum strategies, since they are sometimes employed by database administrators when setting the preferred leader, and with the average strategy, since it reflects the performance of databases without the preferred leader setting, as explained in the previous experiment. The closest-to-writes strategy is a common heuristic also used in other systems (see Section 7).
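The replay procedure used by the simulator and the smallest-quorum heuristic can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: `rtt` is a hypothetical nested dictionary of round-trip times (with `rtt[l][l] == 0`), the median is taken over all voters including the candidate itself, and `score` is a caller-supplied stand-in for the paper's scoring function.

```python
from statistics import median


def smallest_quorum_leader(voters, rtt):
    """Voter whose median RTT to the voters (itself included) is minimal."""
    return min(voters, key=lambda l: median(rtt[l][v] for v in voters))


def replay(strategy, score, intervals):
    """Choose the leader for interval i from interval i-1 data, then
    score that choice against the actual workload of interval i."""
    results = []
    for previous, current in zip(intervals, intervals[1:]):
        leader = strategy(previous)
        results.append(score(leader, current))
    return results
```

With 48 intervals, `replay` returns 47 scores, one for each interval $i \geq 2$, matching the way each strategy is evaluated above.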
Our baseline is the optimal oracle strategy, which sets the leader for interval $i$ at $\lambda^{(i)}$ (considering the $i$-th interval workload when determining, a posteriori, the best leader location for interval $i$, using τ = ∞). The latency overhead $s^{(i)} / score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}}) - 1$ with respect to the optimum score $score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}})$ is calculated for each strategy score $s^{(i)}$ on each interval $i \geq 2$.

Figures 3(a) and 3(b) show the latency overhead (in percent) for two databases with no manual preferred leader location. The optimized strategy perfectly predicted the optimum for both databases. In general, for all the databases, predictions were nearly perfect, with small deviations from the optimum due to sudden workload spikes. More than 90% of operations belonging to the database in Figure 3(a) were bounded reads, which is why the closest-to-writes and the smallest-quorum policies, both of which disregard the locations of readers, underperform compared to the optimized strategy, which considers client locations and all operation types. The smallest-quorum policy is slightly better than closest-to-writes due to its choice of a well-connected replica as leader. In Figure 3(b), 38% of operations are strong reads and 60% of operations are weak reads. Once again the optimized strategy perfectly predicts the optimum and outperforms the average and the smallest-quorum strategies by a large margin (more than 60% on average); the closest-to-writes strategy is not applicable in this case, as there were virtually no transactions in the considered database.

Figure 3: Latency overhead of various placement strategies compared to the oracle optimum score $score^{(i)}(\lambda^{(i)}, R, q^{(i)}_{\lambda^{(i)}})$.

6.2 Evaluation in Production

We are working directly with customers to validate our models in production. We present the results of one such experiment in Figure 4. For simplicity, in this experiment we only reconfigured the chosen database once, even though the optimizer outputs recommendations continuously. We monitored the 50-th percentile latency (solid line) as the database is reconfigured (at 16:00), causing all leaders to migrate to the location λ recommended by our optimizer (the dashed curve shows the percent of leaders in location λ). We observe a reduction of 70% in latency when the migration completes (around 16:15), after which the latency slightly increased and stabilized at 40% of its initial value, exceeding the predicted improvement by a factor of 2. Even though our model currently optimizes mean latency, it is interesting to note that in this experiment we saw a reduction of 30% in 99-th percentile latency (however, 90-th percentile latency did not improve). In another experiment with a different database, we measured a 33%, 25% and 15% speedup in 50-th, 90-th and 99-th percentile latencies, respectively. Note that our tool predicted a reduction of 39.7% in average latency, which is fairly close to what was observed.

The discrepancy between the predicted and the actual reduction can be ascribed to the fact that at any given point in time the number of leaders at the different locations from $V()$ is not exactly the same (though across a longer period of time, on average, it is close to uniform). We found that for the first database mentioned above, one of the locations in $V()$ was taken down for maintenance at the time of the experiment (leaders were evenly spread across the remaining locations). For the second database, the predicted latency reduction was calculated under the assumption that all the leaders have an equal probability of 20% to be in any one of the 5 possible locations, but in reality about one-third of all the leaders were found in the same location. In the future we intend to measure the actual leader distribution across $V()$ dynamically and incorporate it in our model.
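To make the effect of a non-uniform leader distribution concrete, the following Python sketch computes the baseline score as a weighted average over leader locations. With no weights it reduces to the uniform average of Equation 4. The function names, the `leader_share` mapping and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_score(score_by_leader, leader_share=None):
    """Average score over leader locations. With leader_share=None every
    location is weighted equally, as in Equation 4; otherwise the shares
    (counts or fractions of group leaders) are normalized and used as weights."""
    leaders = list(score_by_leader)
    if leader_share is None:
        leader_share = {l: 1.0 for l in leaders}
    total = sum(leader_share[l] for l in leaders)
    return sum(score_by_leader[l] * leader_share[l] for l in leaders) / total


def predicted_reduction(optimized_score, baseline_score):
    """Fractional latency reduction of the optimized placement over the baseline."""
    return 1.0 - optimized_score / baseline_score


# Hypothetical scores for 5 leader locations; one-third of leaders sit in "e".
scores = {"a": 10.0, "b": 20.0, "c": 30.0, "d": 40.0, "e": 50.0}
uniform = expected_score(scores)
skewed = expected_score(scores, {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2})
```

In this made-up example the skewed baseline is higher than the uniform one, so a prediction computed under the uniform assumption would understate the achievable reduction. The direction of the error depends on where the leaders actually concentrate.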
6.3 Replica Roles

Next, we evaluate our tier-2 algorithm, which determines the optimal replica types in addition to leader placement. Before conducting our experiments, we intuitively expected to find databases with workflows exhibiting the follow-the-sun phenomenon (footnote 2). For example, we expected to see clients in the US and in Europe with intense activity during daytime and reduced activity at night, such that the overall center of activity oscillates between the US and Europe every 12 hours. We found, however, that often the traffic originating from US-based clients is greater than that originating from non-US clients even during night time in the US, and therefore the center of activity always remains in the US. This is demonstrated for one database in Figure 5, which shows that European traffic amounts to 35% of US traffic during 120 consecutive hours.

Figure 4: Production experiment with one database. The figure depicts the drop in 50-th percentile latency (solid line) along with the migration of leaders to the recommended location (dashed line). The latency base (100% mark, left y-axis) is chosen as the average latency over the 3 hours preceding the start of the experiment (1pm–4pm).

Figure 5: Europe traffic as a percentage of US traffic over 120 hours, for a single database.

Nevertheless, we discovered diurnal patterns between the US East Coast and West Coast, as shown in Figure 6, where we plot the ratio between the number of operations originating in the East Coast and the number of operations from clients on the West Coast across 48 consecutive hours with one database, overlaid with the leader locations as suggested by our optimizer. Delineated by vertical lines are points at which our optimizer suggested switching leader placement from a cluster on one coast to a cluster on the other coast. The reader can readily notice the correlation between ratios larger than 1 and optimizer recommendations for leader placement on the East Coast (as well as between ratios smaller than 1 and recommendations for the West Coast).
Footnote 2: Apache ZooKeeper users have a similar intuition [6].

Figure 6: Ratio of East Coast to West Coast traffic, for a single database. Vertical lines denote times at which the recommended leader placement changed from East Coast to West Coast or vice versa.

The charts in Figure 7 show the reduction in latency of tier-2 and tier-1 optimizations versus the current score, for two different databases. Figure 7(a) features a database which does not set the preferred leader and in which 98% of operations are strong reads, for which the optimization of tier-2 was considerably better than that of tier-1. The leader placement in our tier-1 optimization was chosen in Central US, whereas in tier-2 it migrated to the Pacific Northwest. This reduction in latency looks paradoxical at first, considering the fact that the locations of the voters and quorum are only supposed to affect the latency of transactions, which are virtually nonexistent in this database. This phenomenon is readily explained by the fact that our optimization in tier-2 allows us to consider all the replicas in $R$ as potential leader candidates, instead of just the pre-determined set of read-write replicas considered by our tier-1 optimization. Indeed, the Pacific Northwest replica was initially configured as a read-only replica and thus could not function as leader, whereas in tier-2, where we can convert it to a read-write replica, it has become a legitimate candidate (and an eventual winner), thereby bringing about the surprising reduction in latency.

Figure 7(b) shows a different case, where both tier-1 and tier-2 optimizations suggested the same leader placement, but tier-2 chose a different quorum, due to which the average operation latency was cut by an additional 15% compared to tier-1. This database also does not set a preferred leader. About 57% of operations are strong reads and an additional 42% are multi-group transactions; the latter operations significantly benefited from a new, better connected quorum of replicas. In this case tier-2 approximately doubles the reduction in latency achieved by tier-1.
Note that in all the experiments above we first analyzed the level of failure diversity currently preserved by the database configuration and only suggested alternative configurations maintaining the same diversity level.

Figure 7: Latency reduction due to tier-1 and tier-2 optimizations across 3 days of workload data for two selected databases.

6.4 Replica Locations

In tier-3 of optimization, we experimented with the performance of the KQ and QK heuristics (see Section 5). We start by comparing their performance with that of the exhaustive search preserving the failure diversity constraints of the current database configuration. Figure 8 shows the average ratio between the score given by the optimizer to the exhaustive search and the scores of the QK and KQ heuristics, as a function of the total number of replicas, across the 12 largest (by amount of traffic) databases in our system, when the number of voters $|V|$ was fixed at 3. On the same chart, we also plot the average ratio between the score of the exhaustive search and the best of the two heuristics across the same 12 largest databases. For some databases KQ is better than QK, whereas for others QK outperforms KQ, resulting in a perhaps surprising phenomenon where the average Best(QK,KQ) score is better than both the average KQ and the average QK score. For $|R| \in \{6, 7\}$, KQ was consistently better than QK, which is why Best(QK,KQ) coincides with KQ at those points. The performance of QK on the chart is worse on average than that of KQ because most of the considered databases are read-heavy and the number of replicas considered is relatively small, of which 2 are spent by QK on the quorum. For such databases it is worthwhile to spread out the replicas to place them as close to most of the clients as possible, which is where KQ excels in comparison with QK. We notice that already with $|R| = 5$ replicas, the best of the two heuristics performs within a 5% margin of the optimum produced by exhaustive search, with the added benefit of being substantially faster.
For $|R| = 7$, both QK and KQ, which are polynomial (in $|R|$), generated results for all 12 databases within seconds, whereas the exponential exhaustive search took several orders of magnitude longer.

Next, we compare KQ and QK, with the specific aim of identifying workloads where each of the two algorithms should be preferred over the other. In the following experiment, run with $|V| = 5$ and $|R| = 7$, we broke down all databases into buckets by the percentage of transactions among all operations and compared the two algorithms for databases in each bucket. Figure 9 shows a positive correlation between the percentage of transactions and the superiority of QK, which, for databases with more than 60% transactions, performs better by more than 80% compared with KQ. A second experiment, in which the databases were broken down into buckets by the percentage of weak reads, shows a strong correlation between that percentage and the superiority of KQ; the results of this experiment, which appear in Figure 10, demonstrate that for read-heavy databases the speedup of KQ versus QK can be as high as 23% on average. Similar experiments with breakdowns of databases by percentages of bounded reads and strong reads did not yield a conclusive outcome.

Figure 8: Score of the exhaustive search in percent of the score of KQ, QK and the best of the two heuristics (with $|V| = 3$), as a function of $|R|$.

Figure 9: Speedup of QK vs. KQ heuristic as a function of the percentage of transactions in a database.

Figure 10: Speedup of KQ vs. QK heuristic as a function of the percentage of weak reads.

In the following experiment, we set $|V| = 3$ and let $|R|$ range between 4 and 13. We then measure the scores of both the QK and KQ heuristics using the workload from one day for one database. Figure 11 demonstrates, for each heuristic and $|R| \in \{4, 5, 6, \ldots, 13\}$, its average latency slowdown in percent versus the score obtained with 13 replicas, which equals the optimal score for any tier-3 optimization with 3 voters (obtained using an exhaustive search). We can readily see that both heuristics flatten out very soon; specifically, with $|R| = 11$, both are within a 13.5% margin of the optimal score. This demonstrates the diminishing returns of adding more replicas: initially each new replica halves the average operation latency, while adding the 12th or 13th replica barely makes any difference.

Figure 11: Slowdown of QK and KQ heuristics in percent from the optimum.

How many replicas do you need? Whereas the number of read-write replicas is usually set by an administrator to meet certain fault tolerance goals, the total number of replicas is usually more flexible. The cost of adding, moving or maintaining a replica is often significant, as it requires allocating resources, copying data, and potentially deploying other relevant services if colocation dependencies exist.
At minimum, the number of replicas should be sufficient to withstand the expected database load. But often, additional replicas are added close to the clients in order to reduce latency. Our framework can help explore the cost/benefit tradeoff of adding such replicas by examining the potential latency gains, and can determine their locations.

7. RELATED WORK

It has long been realized that distributed systems need to be dynamic, i.e., adjust their membership and other configuration parameters over time. Many storage systems [11, 18, 24] use an auxiliary coordination service such as Chubby [13] or ZooKeeper [19] to coordinate reconfiguration, while others use the system itself [23, 25]. See [22, 12] for a tutorial on different approaches for reconfiguration of replicated state machines (i.e., Paxos-like systems) and [8] for a survey on reconfiguring strongly consistent key-value stores. Far fewer works provide insights on how to determine the best storage configuration at runtime. Since LAN and WAN environments pose very different challenges, below we focus on storage systems that dynamically reconfigure in WAN.

PNUTS [15] and Megastore [10] place master/leader replicas close to the writers. Earlier works propose other heuristics, e.g., that the current leader should hand off leadership to another replica if that replica forwards more requests to the leader than it receives from elsewhere [28]. These heuristics may work well for some workloads but not for others. For example, in Section 6.1 we show that placing the leader close to the origin of the majority of writes performs poorly on our production workloads, which are mostly read dominant (and yet involve the leader). Furthermore, unlike in [28], we consider network latencies, and instead of looking at the aggregate number of requests (or just one request type, such as writes), we consider the detailed flow of each request type and perform an optimization for the entire workload.
In this work we formally state optimality criteria and our solution achieves optimal leader placement.

Adaptive replication mechanisms in PNUTS [20] and Nomad [27] dynamically create replicas based on locally observed reads. In Nomad, for example, a replica is created at a given location when an object is read more than a certain number of times from that location, over a certain period of time, or at a certain rate. The authors of [20] state that they considered more exact methods but decided to use local heuristics, since efficiently acquiring, tracking and collecting access statistics from around the world is a complex and expensive process. In this work we leverage Google's monitoring infrastructure to dynamically and accurately track the workload of each database, as well as network latencies. We demonstrate that a solution optimizing the entire workload can be both fast and practical.

Volley [7] proposes a heuristic for placing application data across data centers while minimizing client latency as well as synchronization latency arising from data inter-dependency. The Volley algorithm does not support data replication and was not evaluated with replicated state. The authors briefly propose to model replicas as distinct data items that may have a certain amount of inter-item communication. Note, however, that with replication each client request is only sent to one of the replicas; unlike Volley, our tier-3 algorithm takes such workload partitioning into account when placing the replicas. Furthermore, unlike Volley, our cost model considers multiple types of client requests with different flows, and we compare our replica placement heuristics with the optimum achieved by an exhaustive search using production workloads.

Tuba [9] is an extension of Microsoft Azure Storage that provides a geo-replicated key-value store and automatically reconfigures its master and set of replicas based on the workload. Unlike Tuba, our algorithms do not require any changes to the storage system. Tuba uses exhaustive search to enumerate all placement options and choose the best one. It was evaluated with three storage locations using a synthetic workload. We tried exhaustive search, but it was not practical for our Google-scale storage system.
A highly optimized exhaustive search algorithm for replica placement (Section 6.4), akin to the exhaustive search in Tuba, took more than a day to complete and was only slightly better than our heuristic: up to 5% better for 5 replicas per group and less than 1% for larger configurations. In contrast, our optimal algorithms for choosing leader and replica roles (tiers 1 and 2) and heuristic methods for replica placement (tier 3) terminated in less than 2 minutes for all the databases combined.

8. CONCLUSION

Although mechanisms exist for changing the replication policy of distributed storage systems at runtime, system administrators are usually entrusted with determining the best configuration manually. We developed a new workload-driven optimization framework that dynamically and automatically determines the optimal configuration for leader and quorum based systems. Our system optimizes three aspects of the configuration: 1) leader location, 2) roles of different servers in the replication protocol, and 3) replica locations. We show that by just applying the first optimization tier to a large-scale distributed storage system used internally in Google, we can reduce the latency of 17% of the databases by more than half, including some databases with a speed-up over 90%. We demonstrate that the second optimization tier further reduces latency by up to 50% in some cases. Finally, we evaluated and compared different strategies for selecting replica locations and showed that they are close to optimal.

9. REFERENCES

[1] Amazon Dynamo.
[2] Amazon simple.
[3] Basho Riak.
[4] Mongo.
[5] Inclusion exclusion principle. wiki/inclusion-exclusion_principle#in_probability, Retrieved February 10,
[6] ZooKeeper feature request. https://issues.apache.org/jira/browse/zookeeper-2027, Retrieved February 10,
[7] S. Agarwal et al. Volley: Automated data placement for geo-distributed cloud services. USENIX NSDI, Berkeley, CA, USA,
[8] M. K. Aguilera, I. Keidar, D. Malkhi, J.-P. Martin, and A. Shraer. Reconfiguring replicated atomic storage: A tutorial. Bull. of EATCS, 102,
[9] M. S. Ardekani and D. B. Terry. A self-configurable geo-replicated cloud storage system. USENIX OSDI, pages , Oct
[10] J. Baker et al. Megastore: Providing scalable, highly available storage for interactive services. CIDR,
[11] M. Balakrishnan, D. Malkhi, J. D. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Trans. Comput. Syst., 31(4):10,
[12] K. Birman, D. Malkhi, and R. van Renesse. Virtually synchronous methodology for dynamic service replication. Technical Report 151, MSR, Nov
[13] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, pages ,
[14] S. Bykov, A. Geller, G. Kliot, J. R. Larus, R. Pandya, and J. Thelin. Orleans: cloud computing for everyone. In ACM SOCC,
[15] B. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2), Aug
[16] J. Corbett et al. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst., 31(3), Aug
[17] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A distributed, searchable key-value store. In ACM SIGCOMM,
[18] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pages 29–43,
[19] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC,
[20] S. Kadambi et al. Where in the world is my data? PVLDB, 4(11): ,
[21] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2): ,
[22] L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a state machine. SIGACT News, 41(1):63–73, Mar
[23] J. Lorch et al. The SMART way to migrate replicated stateful services. In EuroSys,
[24] J. MacCormick et al. Boxwood: Abstractions as the foundation for storage infrastructure. In OSDI,
[25] A. Shraer, B. Reed, D. Malkhi, and F. Junqueira. Dynamic reconfiguration of primary/backup clusters. USENIX ATC,
[26] M. Stonebraker and A. Weisberg. The VoltDB main memory DBMS. IEEE Data Eng. Bull., 36(2),
[27] N. Tran, M. K. Aguilera, and M. Balakrishnan. Online migration for geo-distributed storage systems. USENIX ATC, pages 15–15, Berkeley, CA, USA,
[28] O. Wolfson, S. Jajodia, and Y. Huang. An adaptive data replication algorithm. ACM Trans. Database Syst., 22(2): , June
[29] Z. Yin et al. An empirical study on configuration errors in commercial and open source systems. In SOSP,

Advanced ColdFusion 4.0 Application Development - 3 - Server Clustering Using Bright Tiger

Advanced ColdFusion 4.0 Application Development - 3 - Server Clustering Using Bright Tiger Advanced CodFusion 4.0 Appication Deveopment - CH 3 - Server Custering Using Bri.. Page 1 of 7 [Figures are not incuded in this sampe chapter] Advanced CodFusion 4.0 Appication Deveopment - 3 - Server

More information

Secure Network Coding with a Cost Criterion

Secure Network Coding with a Cost Criterion Secure Network Coding with a Cost Criterion Jianong Tan, Murie Médard Laboratory for Information and Decision Systems Massachusetts Institute of Technoogy Cambridge, MA 0239, USA E-mai: {jianong, medard}@mit.edu

More information

Fast Robust Hashing. ) [7] will be re-mapped (and therefore discarded), due to the load-balancing property of hashing.

Fast Robust Hashing. ) [7] will be re-mapped (and therefore discarded), due to the load-balancing property of hashing. Fast Robust Hashing Manue Urueña, David Larrabeiti and Pabo Serrano Universidad Caros III de Madrid E-89 Leganés (Madrid), Spain Emai: {muruenya,darra,pabo}@it.uc3m.es Abstract As statefu fow-aware services

More information

WHITE PAPER BEsT PRAcTIcEs: PusHIng ExcEl BEyond ITs limits WITH InfoRmATIon optimization

WHITE PAPER BEsT PRAcTIcEs: PusHIng ExcEl BEyond ITs limits WITH InfoRmATIon optimization Best Practices: Pushing Exce Beyond Its Limits with Information Optimization WHITE Best Practices: Pushing Exce Beyond Its Limits with Information Optimization Executive Overview Microsoft Exce is the

More information

Load Balancing in Distributed Web Server Systems with Partial Document Replication *

Load Balancing in Distributed Web Server Systems with Partial Document Replication * Load Baancing in Distributed Web Server Systems with Partia Document Repication * Ling Zhuo Cho-Li Wang Francis C. M. Lau Department of Computer Science and Information Systems The University of Hong Kong

More information

Fixed income managers: evolution or revolution

Fixed income managers: evolution or revolution Fixed income managers: evoution or revoution Traditiona approaches to managing fixed interest funds rey on benchmarks that may not represent optima risk and return outcomes. New techniques based on separate

More information

Chapter 3: e-business Integration Patterns

Chapter 3: e-business Integration Patterns Chapter 3: e-business Integration Patterns Page 1 of 9 Chapter 3: e-business Integration Patterns "Consistency is the ast refuge of the unimaginative." Oscar Wide In This Chapter What Are Integration Patterns?

More information

Australian Bureau of Statistics Management of Business Providers

Australian Bureau of Statistics Management of Business Providers Purpose Austraian Bureau of Statistics Management of Business Providers 1 The principa objective of the Austraian Bureau of Statistics (ABS) in respect of business providers is to impose the owest oad

More information

Math: Fundamentals 100

Math: Fundamentals 100 Math: Fundamentas 100 Wecome to the Tooing University. This course is designed to be used in conjunction with the onine version of this cass. The onine version can be found at http://www.tooingu.com. We

More information

Teamwork. Abstract. 2.1 Overview

Teamwork. Abstract. 2.1 Overview 2 Teamwork Abstract This chapter presents one of the basic eements of software projects teamwork. It addresses how to buid teams in a way that promotes team members accountabiity and responsibiity, and

More information

GREEN: An Active Queue Management Algorithm for a Self Managed Internet

GREEN: An Active Queue Management Algorithm for a Self Managed Internet : An Active Queue Management Agorithm for a Sef Managed Internet Bartek Wydrowski and Moshe Zukerman ARC Specia Research Centre for Utra-Broadband Information Networks, EEE Department, The University of

More information

Early access to FAS payments for members in poor health

Early access to FAS payments for members in poor health Financia Assistance Scheme Eary access to FAS payments for members in poor heath Pension Protection Fund Protecting Peope s Futures The Financia Assistance Scheme is administered by the Pension Protection

More information

Design Considerations

Design Considerations Chapter 2: Basic Virtua Private Network Depoyment Page 1 of 12 Chapter 2: Basic Virtua Private Network Depoyment Before discussing the features of Windows 2000 tunneing technoogy, it is important to estabish

More information

HYBRID FUZZY LOGIC PID CONTROLLER. Abstract

HYBRID FUZZY LOGIC PID CONTROLLER. Abstract HYBRID FUZZY LOGIC PID CONTROLLER Thomas Brehm and Kudip S. Rattan Department of Eectrica Engineering Wright State University Dayton, OH 45435 Abstract This paper investigates two fuzzy ogic PID controers

More information

Pricing Internet Services With Multiple Providers

Pricing Internet Services With Multiple Providers Pricing Internet Services With Mutipe Providers Linhai He and Jean Warand Dept. of Eectrica Engineering and Computer Science University of Caifornia at Berkeey Berkeey, CA 94709 inhai, wr@eecs.berkeey.edu

More information

eg Enterprise vs. a Big 4 Monitoring Soution: Comparing Tota Cost of Ownership Restricted Rights Legend The information contained in this document is confidentia and subject to change without notice. No

More information

ELEVATING YOUR GAME FROM TRADE SPEND TO TRADE INVESTMENT

ELEVATING YOUR GAME FROM TRADE SPEND TO TRADE INVESTMENT Initiatives Strategic Mapping Success in The Food System: Discover. Anayze. Strategize. Impement. Measure. ELEVATING YOUR GAME FROM TRADE SPEND TO TRADE INVESTMENT Foodservice manufacturers aocate, in

More information

Art of Java Web Development By Neal Ford 624 pages US$44.95 Manning Publications, 2004 ISBN: 1-932394-06-0

Art of Java Web Development By Neal Ford 624 pages US$44.95 Manning Publications, 2004 ISBN: 1-932394-06-0 IEEE DISTRIBUTED SYSTEMS ONLINE 1541-4922 2005 Pubished by the IEEE Computer Society Vo. 6, No. 5; May 2005 Editor: Marcin Paprzycki, http://www.cs.okstate.edu/%7emarcin/ Book Reviews: Java Toos and Frameworks

More information

The growth of online Internet services during the past decade has increased the

The growth of online Internet services during the past decade has increased the IEEE DS Onine, Voume 2, Number 3 March 2001 Strategies for CORBA Middeware-Based Load Baancing Ossama Othman, Caros O'Ryan, and Dougas C. Schmidt University of Caifornia, Irvine The growth of onine Internet

More information

Distribution of Income Sources of Recent Retirees: Findings From the New Beneficiary Survey

Distribution of Income Sources of Recent Retirees: Findings From the New Beneficiary Survey Distribution of Income Sources of Recent Retirees: Findings From the New Beneficiary Survey by Linda Drazga Maxfied and Virginia P. Rena* Using data from the New Beneficiary Survey, this artice examines

More information

NCH Software FlexiServer

NCH Software FlexiServer NCH Software FexiServer This user guide has been created for use with FexiServer Version 1.xx NCH Software Technica Support If you have difficuties using FexiServer pease read the appicabe topic before

More information

The growth of online Internet services during the past decade has

The growth of online Internet services during the past decade has IEEE DS Onine, Voume 2, Number 4 Designing an Adaptive CORBA Load Baancing Service Using TAO Ossama Othman, Caros O'Ryan, and Dougas C. Schmidt University of Caifornia, Irvine The growth of onine Internet

More information

Scheduling in Multi-Channel Wireless Networks

Scheduling in Multi-Channel Wireless Networks Scheduing in Muti-Channe Wireess Networks Vartika Bhandari and Nitin H. Vaidya University of Iinois at Urbana-Champaign, USA vartikab@acm.org, nhv@iinois.edu Abstract. The avaiabiity of mutipe orthogona

More information

Pay-on-delivery investing

Pay-on-delivery investing Pay-on-deivery investing EVOLVE INVESTment range 1 EVOLVE INVESTMENT RANGE EVOLVE INVESTMENT RANGE 2 Picture a word where you ony pay a company once they have deivered Imagine striking oi first, before

More information

Setting Up Your Internet Connection

Setting Up Your Internet Connection 4 CONNECTING TO CHANCES ARE, you aready have Internet access and are using the Web or sending emai. If you downoaded your instaation fies or instaed esigna from the web, you can be sure that you re set

More information

TCP/IP Gateways and Firewalls

TCP/IP Gateways and Firewalls Gateways and Firewas 1 Gateways and Firewas Prof. Jean-Yves Le Boudec Prof. Andrzej Duda ICA, EPFL CH-1015 Ecubens http://cawww.epf.ch Gateways and Firewas Firewas 2 o architecture separates hosts and

More information

Order-to-Cash Processes

Order-to-Cash Processes TMI170 ING info pat 2:Info pat.qxt 01/12/2008 09:25 Page 1 Section Two: Order-to-Cash Processes Gregory Cronie, Head Saes, Payments and Cash Management, ING O rder-to-cash and purchase-topay processes

More information

Introduction the pressure for efficiency the Estates opportunity

Introduction the pressure for efficiency the Estates opportunity Heathy Savings? A study of the proportion of NHS Trusts with an in-house Buidings Repair and Maintenance workforce, and a discussion of eary experiences of Suppies efficiency initiatives Management Summary

More information

Bite-Size Steps to ITIL Success

Bite-Size Steps to ITIL Success 7 Bite-Size Steps to ITIL Success Pus making a Business Case for ITIL! Do you want to impement ITIL but don t know where to start? 7 Bite-Size Steps to ITIL Success can hep you to decide whether ITIL can

More information

Comparison of Traditional and Open-Access Appointment Scheduling for Exponentially Distributed Service Time

Comparison of Traditional and Open-Access Appointment Scheduling for Exponentially Distributed Service Time Journa of Heathcare Engineering Vo. 6 No. 3 Page 34 376 34 Comparison of Traditiona and Open-Access Appointment Scheduing for Exponentiay Distributed Service Chongjun Yan, PhD; Jiafu Tang *, PhD; Bowen

More information

arxiv:1506.05851v1 [cs.ai] 18 Jun 2015

arxiv:1506.05851v1 [cs.ai] 18 Jun 2015 Smart Pacing for Effective Onine Ad Campaign Optimization Jian Xu, Kuang-chih Lee, Wentong Li, Hang Qi, and Quan Lu Yahoo Inc. 7 First Avenue, Sunnyvae, Caifornia 9489 {xuian,kcee,wentong,hangqi,qu}@yahoo-inc.com

More information

Business schools are the academic setting where. The current crisis has highlighted the need to redefine the role of senior managers in organizations.

Business schools are the academic setting where. The current crisis has highlighted the need to redefine the role of senior managers in organizations. c r o s os r oi a d s REDISCOVERING THE ROLE OF BUSINESS SCHOOLS The current crisis has highighted the need to redefine the roe of senior managers in organizations. JORDI CANALS Professor and Dean, IESE

More information

Life Contingencies Study Note for CAS Exam S. Tom Struppeck

Life Contingencies Study Note for CAS Exam S. Tom Struppeck Life Contingencies Study Note for CAS Eam S Tom Struppeck (Revised 9/19/2015) Introduction Life contingencies is a term used to describe surviva modes for human ives and resuting cash fows that start or

More information

Best Practices for Push & Pull Using Oracle Inventory Stock Locators. Introduction to Master Data and Master Data Management (MDM): Part 1

Best Practices for Push & Pull Using Oracle Inventory Stock Locators. Introduction to Master Data and Master Data Management (MDM): Part 1 SPECIAL CONFERENCE ISSUE THE OFFICIAL PUBLICATION OF THE Orace Appications USERS GROUP spring 2012 Introduction to Master Data and Master Data Management (MDM): Part 1 Utiizing Orace Upgrade Advisor for

More information

WHITE PAPER UndERsTAndIng THE VAlUE of VIsUAl data discovery A guide To VIsUAlIzATIons

WHITE PAPER UndERsTAndIng THE VAlUE of VIsUAl data discovery A guide To VIsUAlIzATIons Understanding the Vaue of Visua Data Discovery A Guide to Visuaizations WHITE Tabe of Contents Executive Summary... 3 Chapter 1 - Datawatch Visuaizations... 4 Chapter 2 - Snapshot Visuaizations... 5 Bar

More information

INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005

INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005 INDUSTRIAL PROCESSING SITES COMPLIANCE WITH THE NEW REGULATORY REFORM (FIRE SAFETY) ORDER 2005 Steven J Manchester BRE Fire and Security E-mai: manchesters@bre.co.uk The aim of this paper is to inform

More information

READING A CREDIT REPORT

READING A CREDIT REPORT Name Date CHAPTER 6 STUDENT ACTIVITY SHEET READING A CREDIT REPORT Review the sampe credit report. Then search for a sampe credit report onine, print it off, and answer the questions beow. This activity

More information

Pricing and Revenue Sharing Strategies for Internet Service Providers

Pricing and Revenue Sharing Strategies for Internet Service Providers Pricing and Revenue Sharing Strategies for Internet Service Providers Linhai He and Jean Warand Department of Eectrica Engineering and Computer Sciences University of Caifornia at Berkeey {inhai,wr}@eecs.berkeey.edu

More information

Wide-Area Traffic Management for. Cloud Services

Wide-Area Traffic Management for. Cloud Services Wide-Area Traffic Management for Coud Services Joe Wenjie Jiang A Dissertation Presented to the Facuty of Princeton University in Candidacy for the Degree of Doctor of Phiosophy Recommended for Acceptance

More information

The eg Suite Enabing Rea-Time Monitoring and Proactive Infrastructure Triage White Paper Restricted Rights Legend The information contained in this document is confidentia and subject to change without

More information

Chapter 2 Traditional Software Development

Chapter 2 Traditional Software Development Chapter 2 Traditiona Software Deveopment 2.1 History of Project Management Large projects from the past must aready have had some sort of project management, such the Pyramid of Giza or Pyramid of Cheops,

More information

Precise assessment of partial discharge in underground MV/HV power cables and terminations

Precise assessment of partial discharge in underground MV/HV power cables and terminations QCM-C-PD-Survey Service Partia discharge monitoring for underground power cabes Precise assessment of partia discharge in underground MV/HV power cabes and terminations Highy accurate periodic PD survey

More information

Enabling Direct Interest-Aware Audience Selection

Enabling Direct Interest-Aware Audience Selection Enabing Direct Interest-Aware Audience Seection ABSTRACT Arie Fuxman Microsoft Research Mountain View, CA arief@microsoft.com Zhenhui Li University of Iinois Urbana-Champaign, Iinois zi28@uiuc.edu Advertisers

More information

Sorting, Merge Sort and the Divide-and-Conquer Technique

Sorting, Merge Sort and the Divide-and-Conquer Technique Inf2B gorithms and Data Structures Note 7 Sorting, Merge Sort and the Divide-and-Conquer Technique This and a subsequent next ecture wi mainy be concerned with sorting agorithms. Sorting is an extremey

More information

Network/Communicational Vulnerability

Network/Communicational Vulnerability Automated teer machines (ATMs) are a part of most of our ives. The major appea of these machines is convenience The ATM environment is changing and that change has serious ramifications for the security

More information

PERFORMANCE ANALYSIS OF GANG SCHEDULING IN A PARTITIONABLE PARALLEL SYSTEM

PERFORMANCE ANALYSIS OF GANG SCHEDULING IN A PARTITIONABLE PARALLEL SYSTEM PERFORMANCE ANALYSIS OF GANG SCHEDULING IN A PARTITIONABLE PARALLEL SYSTEM Heen D. Karatza Department of Informatics Aristote University of Thessaoniki 54124 Thessaoniki, Greece E-mai: karatza@csd.auth.gr

More information

Leakage detection in water pipe networks using a Bayesian probabilistic framework

Leakage detection in water pipe networks using a Bayesian probabilistic framework Probabiistic Engineering Mechanics 18 (2003) 315 327 www.esevier.com/ocate/probengmech Leakage detection in water pipe networks using a Bayesian probabiistic framework Z. Pouakis, D. Vaougeorgis, C. Papadimitriou*

More information

Multi-Robot Task Scheduling

Multi-Robot Task Scheduling Proc of IEEE Internationa Conference on Robotics and Automation, Karsruhe, Germany, 013 Muti-Robot Tas Scheduing Yu Zhang and Lynne E Parer Abstract The scheduing probem has been studied extensivey in

More information

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 1

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 1 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013 1 Scaabe Muti-Cass Traffic Management in Data Center Backbone Networks Amitabha Ghosh, Sangtae Ha, Edward Crabbe, and Jennifer

More information

CONTRIBUTION OF INTERNAL AUDITING IN THE VALUE OF A NURSING UNIT WITHIN THREE YEARS

CONTRIBUTION OF INTERNAL AUDITING IN THE VALUE OF A NURSING UNIT WITHIN THREE YEARS Dehi Business Review X Vo. 4, No. 2, Juy - December 2003 CONTRIBUTION OF INTERNAL AUDITING IN THE VALUE OF A NURSING UNIT WITHIN THREE YEARS John N.. Var arvatsouakis atsouakis DURING the present time,

More information

Betting Strategies, Market Selection, and the Wisdom of Crowds

Betting Strategies, Market Selection, and the Wisdom of Crowds Betting Strategies, Market Seection, and the Wisdom of Crowds Wiemien Kets Northwestern University w-kets@keogg.northwestern.edu David M. Pennock Microsoft Research New York City dpennock@microsoft.com

More information

A New Statistical Approach to Network Anomaly Detection

A New Statistical Approach to Network Anomaly Detection A New Statistica Approach to Network Anomay Detection Christian Caegari, Sandrine Vaton 2, and Michee Pagano Dept of Information Engineering, University of Pisa, ITALY E-mai: {christiancaegari,mpagano}@ietunipiit

More information

With the arrival of Java 2 Micro Edition (J2ME) and its industry

With the arrival of Java 2 Micro Edition (J2ME) and its industry Knowedge-based Autonomous Agents for Pervasive Computing Using AgentLight Fernando L. Koch and John-Jues C. Meyer Utrecht University Project AgentLight is a mutiagent system-buiding framework targeting

More information

IBM Security QRadar SIEM

IBM Security QRadar SIEM IBM Security QRadar SIEM Boost threat protection and compiance with an integrated investigative reporting system Highights Integrate og management and network threat protection technoogies within a common

More information

Qualifications, professional development and probation

Qualifications, professional development and probation UCU Continuing Professiona Deveopment Quaifications, professiona deveopment and probation Initia training and further education teaching quaifications Since September 2007 a newy appointed FE ecturers,

More information

(12) United States Patent Rune

(12) United States Patent Rune (12) United States Patent Rune US006304913B1 (10) Patent N0.: (45) Date of Patent: US 6,304,913 B1 on. 16, 2001 (54) INTERNET SYSTEM AND METHOD FOR SELECTING A CLOSEST SERVER FROM A PLURALITY OF ALTERNATIVE

More information

Recognition of Prior Learning

Recognition of Prior Learning Recognition of Prior Learning Information Guideines for Students EXTENDED CAMPUS This subject materia is issued by Cork Institute of Technoogy on the understanding that: Cork Institute of Technoogy expressy

More information

3.5 Pendulum period. 2009-02-10 19:40:05 UTC / rev 4d4a39156f1e. g = 4π2 l T 2. g = 4π2 x1 m 4 s 2 = π 2 m s 2. 3.5 Pendulum period 68

3.5 Pendulum period. 2009-02-10 19:40:05 UTC / rev 4d4a39156f1e. g = 4π2 l T 2. g = 4π2 x1 m 4 s 2 = π 2 m s 2. 3.5 Pendulum period 68 68 68 3.5 Penduum period 68 3.5 Penduum period Is it coincidence that g, in units of meters per second squared, is 9.8, very cose to 2 9.87? Their proximity suggests a connection. Indeed, they are connected

More information

Integrating Risk into your Plant Lifecycle A next generation software architecture for risk based

Integrating Risk into your Plant Lifecycle A next generation software architecture for risk based Integrating Risk into your Pant Lifecyce A next generation software architecture for risk based operations Dr Nic Cavanagh 1, Dr Jeremy Linn 2 and Coin Hickey 3 1 Head of Safeti Product Management, DNV

More information

Finance 360 Problem Set #6 Solutions

Finance 360 Problem Set #6 Solutions Finance 360 Probem Set #6 Soutions 1) Suppose that you are the manager of an opera house. You have a constant margina cost of production equa to $50 (i.e. each additiona person in the theatre raises your

More information

AN APPROACH TO THE STANDARDISATION OF ACCIDENT AND INJURY REGISTRATION SYSTEMS (STAIRS) IN EUROPE

AN APPROACH TO THE STANDARDISATION OF ACCIDENT AND INJURY REGISTRATION SYSTEMS (STAIRS) IN EUROPE AN APPROACH TO THE STANDARDSATON OF ACCDENT AND NJURY REGSTRATON SYSTEMS (STARS) N EUROPE R. Ross P. Thomas Vehice Safety Research Centre Loughborough University B. Sexton Transport Research Laboratory

More information

SPOTLIGHT. A year of transformation

SPOTLIGHT. A year of transformation WINTER ISSUE 2014 2015 SPOTLIGHT Wecome to the winter issue of Oasis Spotight. These newsetters are designed to keep you upto-date with news about the Oasis community. This quartery issue features an artice

More information

PREFACE. Comptroller General of the United States. Page i

PREFACE. Comptroller General of the United States. Page i - I PREFACE T he (+nera Accounting Office (GAO) has ong beieved that the federa government urgenty needs to improve the financia information on which it bases many important decisions. To run our compex

More information

The guaranteed selection. For certainty in uncertain times

The guaranteed selection. For certainty in uncertain times The guaranteed seection For certainty in uncertain times Making the right investment choice If you can t afford to take a ot of risk with your money it can be hard to find the right investment, especiay

More information

Virtual trunk simulation

Virtual trunk simulation Virtua trunk simuation Samui Aato * Laboratory of Teecommunications Technoogy Hesinki University of Technoogy Sivia Giordano Laboratoire de Reseaux de Communication Ecoe Poytechnique Federae de Lausanne

More information

Learning from evaluations Processes and instruments used by GIZ as a learning organisation and their contribution to interorganisational learning

Learning from evaluations Processes and instruments used by GIZ as a learning organisation and their contribution to interorganisational learning Monitoring and Evauation Unit Learning from evauations Processes and instruments used by GIZ as a earning organisation and their contribution to interorganisationa earning Contents 1.3Learning from evauations

More information

NCH Software MoneyLine

NCH Software MoneyLine NCH Software MoneyLine This user guide has been created for use with MoneyLine Version 2.xx NCH Software Technica Support If you have difficuties using MoneyLine pease read the appicabe topic before requesting

More information

Lecture 7 Datalink Ethernet, Home. Datalink Layer Architectures

Lecture 7 Datalink Ethernet, Home. Datalink Layer Architectures Lecture 7 Dataink Ethernet, Home Peter Steenkiste Schoo of Computer Science Department of Eectrica and Computer Engineering Carnegie Meon University 15-441 Networking, Spring 2004 http://www.cs.cmu.edu/~prs/15-441

More information

Human Capital & Human Resources Certificate Programs

Human Capital & Human Resources Certificate Programs MANAGEMENT CONCEPTS Human Capita & Human Resources Certificate Programs Programs to deveop functiona and strategic skis in: Human Capita // Human Resources ENROLL TODAY! Contract Hoder Contract GS-02F-0010J

More information

SELECTING THE SUITABLE ERP SYSTEM: A FUZZY AHP APPROACH. Ufuk Cebeci

SELECTING THE SUITABLE ERP SYSTEM: A FUZZY AHP APPROACH. Ufuk Cebeci SELECTING THE SUITABLE ERP SYSTEM: A FUZZY AHP APPROACH Ufuk Cebeci Department of Industria Engineering, Istanbu Technica University, Macka, Istanbu, Turkey - ufuk_cebeci@yahoo.com Abstract An Enterprise

More information

Breakeven analysis and short-term decision making

Breakeven analysis and short-term decision making Chapter 20 Breakeven anaysis and short-term decision making REAL WORLD CASE This case study shows a typica situation in which management accounting can be hepfu. Read the case study now but ony attempt

More information

A Supplier Evaluation System for Automotive Industry According To Iso/Ts 16949 Requirements

A Supplier Evaluation System for Automotive Industry According To Iso/Ts 16949 Requirements A Suppier Evauation System for Automotive Industry According To Iso/Ts 16949 Requirements DILEK PINAR ÖZTOP 1, ASLI AKSOY 2,*, NURSEL ÖZTÜRK 2 1 HONDA TR Purchasing Department, 41480, Çayırova - Gebze,

More information

Design and Analysis of a Hidden Peer-to-peer Backup Market

Design and Analysis of a Hidden Peer-to-peer Backup Market Design and Anaysis of a Hidden Peer-to-peer Backup Market Sven Seuken, Denis Chares, Max Chickering, Mary Czerwinski Kama Jain, David C. Parkes, Sidd Puri, and Desney Tan December, 2015 Abstract We present

More information

Key Features of Life Insurance

Key Features of Life Insurance Key Features of Life Insurance Life Insurance Key Features The Financia Conduct Authority is a financia services reguator. It requires us, Aviva, to give you this important information to hep you to decide

More information

APIS Software Training /Consulting

APIS Software Training /Consulting APIS Software Training /Consuting IQ-Software Services APIS Informationstechnoogien GmbH The information contained in this document is subject to change without prior notice. It does not represent any

More information

Measuring operational risk in financial institutions

Measuring operational risk in financial institutions Measuring operationa risk in financia institutions Operationa risk is now seen as a major risk for financia institutions. This paper considers the various methods avaiabe to measure operationa risk, and

More information

Traffic classification-based spam filter

Traffic classification-based spam filter Traffic cassification-based spam fiter Ni Zhang 1,2, Yu Jiang 3, Binxing Fang 1, Xueqi Cheng 1, Li Guo 1 1 Software Division, Institute of Computing Technoogy, Chinese Academy of Sciences, 100080, Beijing,

More information

Overview of Health and Safety in China

Overview of Health and Safety in China Overview of Heath and Safety in China Hongyuan Wei 1, Leping Dang 1, and Mark Hoye 2 1 Schoo of Chemica Engineering, Tianjin University, Tianjin 300072, P R China, E-mai: david.wei@tju.edu.cn 2 AstraZeneca

More information

Perfect competition. By the end of this chapter, you should be able to: 7 Perfect competition. 1 Microeconomics. 1 Microeconomics

Perfect competition. By the end of this chapter, you should be able to: 7 Perfect competition. 1 Microeconomics. 1 Microeconomics erfect 7 12 By the end of this chapter, you shoud be abe to: HL expain the assumptions of perfect HL distinguish between the demand curve for the industry and for the firm in perfect HL expain how the

More information

TERM INSURANCE CALCULATION ILLUSTRATED. This is the U.S. Social Security Life Table, based on year 2007.

TERM INSURANCE CALCULATION ILLUSTRATED. This is the U.S. Social Security Life Table, based on year 2007. This is the U.S. Socia Security Life Tabe, based on year 2007. This is avaiabe at http://www.ssa.gov/oact/stats/tabe4c6.htm. The ife eperiences of maes and femaes are different, and we usuay do separate

More information

The Use of Cooling-Factor Curves for Coordinating Fuses and Reclosers

The Use of Cooling-Factor Curves for Coordinating Fuses and Reclosers he Use of ooing-factor urves for oordinating Fuses and Recosers arey J. ook Senior Member, IEEE S& Eectric ompany hicago, Iinois bstract his paper describes how to precisey coordinate distribution feeder

More information

AUSTRALIA S GAMBLING INDUSTRIES - INQUIRY

AUSTRALIA S GAMBLING INDUSTRIES - INQUIRY Mr Gary Banks Chairman Productivity Commission PO Box 80 BELCONNEN ACT 2616 Dear Mr Banks AUSTRALIA S GAMBLING INDUSTRIES - INQUIRY I refer to the Issues Paper issued September 1998 seeking submissions

More information

Market Design & Analysis for a P2P Backup System

Market Design & Analysis for a P2P Backup System Market Design & Anaysis for a P2P Backup System Sven Seuken Schoo of Engineering & Appied Sciences Harvard University, Cambridge, MA seuken@eecs.harvard.edu Denis Chares, Max Chickering, Sidd Puri Microsoft

More information

Business Banking. A guide for franchises

Business Banking. A guide for franchises Business Banking A guide for franchises Hep with your franchise business, right on your doorstep A true understanding of the needs of your business: that s what makes RBS the right choice for financia

More information

Face Hallucination and Recognition

Face Hallucination and Recognition Face Haucination and Recognition Xiaogang Wang and Xiaoou Tang Department of Information Engineering, The Chinese University of Hong Kong {xgwang1, xtang}@ie.cuhk.edu.hk http://mmab.ie.cuhk.edu.hk Abstract.

More information
