Replica selection in Apache Cassandra


DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Replica selection in Apache Cassandra
Reducing the tail latency for reads using the C3 algorithm

SOFIE THORSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)

Replica selection in Apache Cassandra
Reducing the tail latency for reads using the C3 algorithm
Val av replikor i Apache Cassandra

SOFIE THORSEN
[email protected]

Master's Thesis at CSC
Supervisor: Per Austrin
Examiner: Johan Håstad
Employer: Spotify


Abstract

Keeping response times low is crucial in order to provide a good user experience. Especially the tail latency proves to be a challenge to keep low as size, complexity and overall use of services scale up. In this thesis we look at reducing the tail latency for reads in the Apache Cassandra database system by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann. Through extensive benchmarks with several stress tools, we find that C3 indeed decreases the tail latencies of Cassandra on generated load. However, when evaluating C3 on production load, results do not show any particular improvement. We argue that this is mostly due to the variable-size records in the data set and token awareness in the production client. We also present a client-side implementation of C3 in the DataStax Java driver in an attempt to remove the caveat of token aware clients. The client-side implementation did give positive results, but as the benchmark results showed a lot of variance we deem the results to be too inconclusive to confirm that the implementation works as intended. We conclude that the server-side C3 algorithm will work effectively for systems with homogeneous row sizes where the clients are not token aware.

Sammanfattning

Val av replikor i Apache Cassandra

In order to offer a good user experience it is of the utmost importance to keep response times low. The tail latency in particular is a challenge to keep low as today's applications grow in size, complexity and usage. In this report we examine the tail latency for reads in the Apache Cassandra database system and whether it can be improved, by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann. Through extensive tests with several different stress tools we find that C3 indeed improves Cassandra's tail latencies on generated load. However, using C3 on production load did not show any major improvement. We argue that this is mainly due to variable-size records in the data set and the production client being token aware. We also present a client-side implementation of C3 in the Java driver from DataStax, in an attempt to address the problem with token aware clients. The client-side implementation of C3 gave positive results, but as the test results showed large variance we consider the results too uncertain to confirm that the implementation works as intended. We conclude that C3, implemented on the server, works effectively on systems with homogeneous data sizes and where clients are not token aware.

Contents

Acknowledgements
1 Introduction
  1.1 Problem statement
2 Background
  2.1 Terminology and definitions
    2.1.1 Load balancing and replica selection
    2.1.2 Percentiles and tail latency
    2.1.3 CAP theorem
    2.1.4 Eventual consistency
    2.1.5 SQL
    2.1.6 NoSQL
    2.1.7 Accrual failure detection
    2.1.8 Exponentially weighted moving averages (EWMA)
    2.1.9 RAID
    2.1.10 Apache Cassandra
  2.2 Load balancing techniques in distributed systems
    2.2.1 The power of d choices
    2.2.2 Join-Shortest-Queue
    2.2.3 Join-Idle-Queue
    2.2.4 Speculative retries
    2.2.5 Tied requests
  2.3 The C3 algorithm
    2.3.1 Replica ranking
    2.3.2 Rate control
    2.3.3 Notes on the C3 implementation
3 Method
  3.1 Tools for testing
    3.1.1 The cassandra-stress tool
    3.1.2 The Yahoo Cloud Serving Benchmark
    3.1.3 The Java driver stress tool
    3.1.4 Darkloading
  3.2 Test environment setup
    3.2.1 Testing on generated load
    3.2.2 Testing on production load
4 Implementation
  4.1 Implementing C3 in Cassandra
  4.2 Implementing C3 in the DataStax Java driver
    4.2.1 Naive implementation
  4.3 Benchmarking with YCSB
  4.4 Benchmarking with cassandra-stress
  4.5 Benchmarking with the java-driver stress tool
  4.6 Darkloading
5 Results
  5.1 Benchmarking with YCSB
  5.2 Benchmarking with cassandra-stress
  5.3 Benchmarking with the java-driver stress tool
    5.3.1 Performance of the C3 client
  5.4 Darkloading
    5.4.1 Performance with token awareness
    5.4.2 Performance with round robin
6 Discussion
  6.1 Performance of server side C3
    6.1.1 YCSB vs. cassandra-stress
    6.1.2 Darkloading
  6.2 Performance of client side C3
  6.3 Conclusion
A Results from benchmarks
  A.1 YCSB
  A.2 cassandra-stress
  A.3 java-driver stress
  A.4 Darkloading
    A.4.1 Token aware
    A.4.2 Round robin
Bibliography

Acknowledgements

I want to thank Lalith Suresh and Marco Canini for continuously discussing thoughts and sharing ideas throughout this project. I also want to thank Jimmy Mårdell for his support and expertise with the quirkiness and caveats that Cassandra presents, as well as for volunteering to be my supervisor in the first place.


Chapter 1

Introduction

For all service-oriented applications, fast response times are vital for a good user experience. To examine the exact impact of server delays, Amazon and Google conducted experiments where they added extra delays on every query before sending back results to users [21]. One of their findings was that an extra delay of only 500 milliseconds per query resulted in a 1.2% loss of users and revenue, with the effect persisting even after the delay was removed.

However, keeping response times low is not an easy task. As Google reported [12], especially the tail latency is challenging to keep low as size, complexity and overall use of services scale up. When serving a single user request, multiple servers can be involved. Bad latency on a few machines then quickly results in higher overall latencies, and the more machines, the worse the tail latency. To illustrate why, consider a client making a request to a single server. Suppose that the server has an acceptable response time in 99% of the requests, but the last 1% of the requests takes a second or more to serve. This scenario is not too bad, as it only means that one client gets a slightly slower response every now and then. Consider instead a hundred servers like this and that a request requires a response from all servers. This will greatly change the responsiveness of the system. From 1% of the requests being slow, suddenly 63%¹ of the requests will take more than a second to serve. It is then apparent that the tail latency must be taken seriously in order to provide a good service.

¹ Assuming independence between response times, the probability that at least one response takes more than a second is 1 − 0.99^100 ≈ 0.63.

Apache Cassandra is the database of choice at Spotify for end user facing features. Spotify runs more than 80 Cassandra clusters on over 650 servers, managing important

data such as playlists, music collections, account information, user/artist followers and more. Since an end user request often involves reading from several databases, poor tail latencies will affect the user experience negatively for a large number of users.

In this thesis a replica selection algorithm for usage with Cassandra was implemented and evaluated, with focus on reducing the tail latency for reads.

1.1 Problem statement

The data in Cassandra is replicated to several nodes in the cluster to provide high availability. The performance of the nodes in the cluster varies over time though, for instance due to internal data maintenance operations and Java garbage collections. When data is read, a replica selection algorithm in Cassandra determines which node in the cluster the request should be sent to. The built-in replica selection algorithm provides good median latency, but the tail latency is often an order of magnitude worse than the median, which leads to the following question:

Can the tail latency for reads in Cassandra be reduced in practice by using a more sophisticated replica selection algorithm?

Chapter 2

Background

2.1 Terminology and definitions

In this section we discuss concepts and technology necessary to follow the thesis. The reader familiar with the concepts can skip this section.

2.1.1 Load balancing and replica selection

Load balancing is the process of distributing workload across multiple computing resources, such as servers. Replica selection is a form of load balancing as it tries to balance requests across the set of nodes that own the requested data.

2.1.2 Percentiles and tail latency

A percentile is a statistical measure that indicates the value below which a given percentage of observations in a group of observations fall. For example, the 95th percentile is the smallest value which is greater than or equal to 95% of the observations.

In the context of latencies, percentiles are important measures when analyzing data. For example, if only using mean and median in analysis, outliers can remain hidden. In contrast, the maximum value gives a pessimistic view since it can be distorted by a single data point. Consider the graph in Figure 2.1, showing latencies over time. If only presenting mean and median latency, crucial information is lost. After 5 hours, the 99th percentile shows a spike that is not noticeable in the mean or median. After 8.5 hours the 99th percentile shows that 1% of the users are experiencing more than 800 ms latencies, while the mean is only 75 ms. The higher percentiles, commonly the 95th to 99th, are often referred to as the tail latency.

Figure 2.1: Latencies over time.

2.1.3 CAP theorem

The CAP theorem, also known as Brewer's theorem [14], states that for a distributed computer system it is impossible to simultaneously provide all three of the following:

Consistency - all nodes see the same data at the same time.
Availability - every request receives a response about whether it succeeded or failed.
Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.

2.1.4 Eventual consistency

Eventual consistency is a consistency model used in distributed systems to achieve high availability. The consistency model informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

2.1.5 SQL

Structured Query Language (SQL) is a special-purpose programming language designed for managing data held in a traditional relational database management system (RDBMS). The data model in a relational database uses tables with rows and

columns, with rows containing information about one specific entity and columns being the separate data points. For example, a row could represent a specific car, in which the columns are Model, Color and so on. The tables can have relationships between each other and the data is queried using SQL.

2.1.6 NoSQL

NoSQL¹ databases are an alternative to the tabular relations used in relational databases. The motivation for this approach includes simplicity of design, horizontal scaling and availability. The data structures used by NoSQL databases (e.g. column, document, key-value or graph) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The suitability of a particular database, regardless of it being relational or NoSQL, depends on the problem it must solve. There are many different distributed NoSQL databases and their functionality can differ a lot depending on which two properties from the CAP theorem they support.

¹ Interpreted as "Not only SQL", to emphasize that they may also support SQL-like languages.

2.1.7 Accrual failure detection

In distributed systems, a failure detector is an application or a subsystem that is responsible for detecting slow or failing nodes. This mechanism is important to detect situations where the system would perform better by excluding the culprit node or putting it on probation. To decide if a node is subject for exclusion/probation a suspicion level is used. For example, traditional failure detectors use boolean information as the suspicion level: a node is simply suspected or not suspected.

Accrual failure detectors are a class of failure detectors where the information is a value on a continuous scale rather than a boolean value. The higher the value, the higher the confidence that the monitored node has failed. If an actual crash occurs, the output of the accrual failure detector will accumulate over time and tend towards infinity (hence the name). This model provides more flexibility as the application itself can decide an appropriate suspicion threshold. Note that a low threshold means quick detection in the event of a real crash, but also increases the likelihood of incorrect suspicion. On the other hand, a high threshold makes fewer mistakes but makes the failure detector slower to detect failing nodes.

Hayashibara et al. describe an implementation of such an accrual failure detector in [17], called the φ accrual failure detector. In the φ failure detector the arrival times of heartbeats² are used to approximate the probabilistic distribution of future heartbeat messages. With this information, a value φ is calculated with a scale that changes dynamically to match recent network conditions.

² A periodic signal generated by hardware/software for activation or synchronization purposes.
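As a rough illustration (a simplified form of the estimator in [17]; the actual detector fits a probability distribution to the observed heartbeat inter-arrival times), the suspicion value can be expressed as:

$$\varphi(t_{\mathrm{now}}) = -\log_{10}\bigl(P_{\mathrm{later}}(t_{\mathrm{now}} - T_{\mathrm{last}})\bigr)$$

where $T_{\mathrm{last}}$ is the arrival time of the most recent heartbeat and $P_{\mathrm{later}}(t)$ is the estimated probability that a heartbeat arrives more than $t$ time units after the previous one. The longer a node stays silent, the smaller $P_{\mathrm{later}}$ becomes and the larger $\varphi$ grows.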

2.1.8 Exponentially weighted moving averages (EWMA)

A moving average (also known as rolling average or running average) is a technique used to analyze trends in a data set by creating a series of averages of different subsets of the full data set. Given a sequence of numbers and a fixed subset size, the first element of the moving average sequence is obtained by taking the average of the initial fixed subset of the number sequence. Then the subset is modified by excluding the first number of the series and including the next number following the original subset in the series. This creates a new averaged subset of numbers. More mathematically formulated: given a sequence $\{a_i\}_{i=1}^{N}$, an $n$-moving average is a new sequence $\{s_i\}_{i=1}^{N-n+1}$ defined from the $a_i$ sequence by taking the mean of subsequences of $n$ terms:

$$s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j$$

The sequences $S_n$ giving the $n$-moving averages are then:

$$S_2 = \tfrac{1}{2}\,(a_1 + a_2,\; a_2 + a_3,\; \ldots,\; a_{N-1} + a_N)$$
$$S_3 = \tfrac{1}{3}\,(a_1 + a_2 + a_3,\; a_2 + a_3 + a_4,\; \ldots,\; a_{N-2} + a_{N-1} + a_N)$$

and so on. An example of different moving averages can be seen in Figure 2.2.

Figure 2.2: The 2- (red), 3- (green), and 4- (blue) moving averages for 20 data points.
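As a small illustration, the n-moving average above can be computed with a sliding window sum; the following is a minimal Java sketch (illustrative only, not code from the thesis):

public final class MovingAverage {
    /** Returns the n-moving average sequence s of the input sequence a,
     *  where s[i] is the mean of a[i] .. a[i+n-1]. Assumes 1 <= n <= a.length. */
    static double[] movingAverage(double[] a, int n) {
        double[] s = new double[a.length - n + 1];
        double windowSum = 0;
        for (int j = 0; j < n; j++) {
            windowSum += a[j];                    // sum of the first window
        }
        s[0] = windowSum / n;
        for (int i = 1; i < s.length; i++) {
            windowSum += a[i + n - 1] - a[i - 1]; // slide the window one step
            s[i] = windowSum / n;
        }
        return s;
    }
}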

An exponentially weighted moving average (EWMA), instead of using the average of a fixed subset of data points, applies weighting factors to the data points. The weighting for each older data point decreases exponentially, never reaching zero. The EWMA for a series $Y$ can be calculated as:

$$S_1 = Y_1$$
$$S_t = \alpha\,Y_t + (1-\alpha)\,S_{t-1} \quad \text{for } t > 1$$

where $\alpha$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A higher value of $\alpha$ discounts older observations faster. $Y_t$ is the value at a time period $t$, and $S_t$ is the value of the EWMA at a time period $t$.

2.1.9 RAID

RAID³ is a virtualization technology for data storage which combines multiple disk drives into one logical unit. Data is distributed across the drives in different ways called RAID levels, depending on the specific level of redundancy and performance wanted. The different schemes are named by the word RAID followed by a number (e.g. RAID 0, RAID 1). Each scheme provides a different balance between the key goals: reliability, availability, performance and capacity. RAID 10, or RAID 1+0, is a scheme where throughput and latency are prioritized, and it is therefore the preferable RAID level for I/O intense applications such as databases.

³ Originally redundant array of inexpensive disks, now commonly redundant array of independent disks.

2.1.10 Apache Cassandra

Apache Cassandra, born at Facebook [18] and built on ideas from Amazon's Dynamo [13] and Google's BigTable [3], is an open source NoSQL distributed database system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

DataStax

DataStax is a computer software company whose business model centers around selling an enterprise distribution of the Cassandra project which includes extensions to Cassandra, analytics and search functionality. DataStax also employ more than ninety percent of the Cassandra committers.

Replication

To ensure fault tolerance and reliability, Cassandra stores copies of data, called replicas, on multiple nodes. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is one copy of each row on one node. A factor of two means two copies of each row, where each copy is on a different node [7].

When a client read or write request is issued, it can go to any node in the cluster since all nodes in Cassandra are peers. When a client connects to a node, that node serves as the coordinator for that particular client operation. What the coordinator then does is to act as a proxy between the client application and the nodes that own the requested data. The coordinator is responsible for determining which node should get the request based on the cluster configuration and replica placement strategy.

Partitioners and tokens

A partitioner in Cassandra determines how the data is distributed across the nodes in a cluster, including replicas. In essence, the partitioner is a hash function for deriving a token, representing a row, from its partition key⁴ [9]. The basic idea is that each node in the Cassandra cluster is assigned a token that determines what data in the cluster it is responsible for [2]. The tokens assigned to the nodes need to be distributed throughout the entire possible range of tokens. As a simple example, consider a cluster with four nodes and a possible token range of, say, 0 to 79. Then you would want the tokens for the nodes to be 0, 20, 40, 60, making each node responsible for an equal portion of the data.

⁴ The partition key is the first column declared in the PRIMARY KEY definition. Each row of data is uniquely identified by the partition key.

Data consistency

As Cassandra sacrifices consistency for availability and partition tolerance, making it an AP system in the CAP theorem sense, replicas may not always be synchronized. Cassandra extends the concept of eventual consistency by offering tunable consistency, meaning that the client application can decide how consistent the requested data must be. In the context of read requests, the consistency level specifies how many replicas must respond to a read request before data is returned to the client application. Examples of consistency levels can be seen in Table 2.1.

Level     Description
ALL       Returns the data after all replicas have responded. The read operation fails if a replica does not respond.
QUORUM    Returns the data once a quorum, i.e. a majority, of replicas has responded.
ONE       Returns the data from the closest replica.
TWO       Returns the data from the two closest replicas.

Table 2.1: Examples of read consistency levels.

To minimize the amount of data sent over the network when doing reads with a consistency level above ONE, Cassandra makes use of digest requests. A digest request is just like a regular read request except that instead of the node actually sending the data it only returns a digest, i.e. a hash of the data. The intent is to discover whether two or more nodes agree on what the current data is, without actually sending the data over the network and thereby save bandwidth. Cassandra sends one data request to one replica and digest requests to the remaining replicas. Note that the digest queried nodes still will do all the work of fetching data, they will just not return it.

Replica selection

In order for the coordinator node to route requests efficiently it makes use of a snitch. A snitch informs Cassandra about the network topology and determines which data centers and racks nodes belong to. This information allows Cassandra to distribute replicas according to the replication strategy [11] by grouping machines into data centers and racks.

In addition, all snitches also use a dynamic snitch layer that provides an adaptive behaviour when performing reads [24]. It uses an accrual failure detection mechanism, based on the φ failure detector discussed in section 2.1.7, to calculate a per node threshold that takes into account network performance, workload and historical latency conditions. This information is used to detect failing or slow nodes, but also for calculating the best host in terms of latency, i.e. selecting the best replica.

However, calculating the best host is expensive. If too much CPU time is spent on calculations it would become counterproductive as it would sacrifice overall read throughput. The dynamic snitch therefore adopts two separate operations. One is receiving the updates, which is cheap, and the other is calculating scores for each host, which is more expensive. In the update part latencies of the hosts are sampled and weighted with EWMAs. The calculation part in turn iterates through the recorded latencies of each host to

find the worst latency as a measure for the scoring. After finding the worst latency it makes a second pass over the hosts and scores them against the maximum value. This calculation has been configured to only run once every 100 ms to reduce the cost. As hosts can not inform the system of their recovery once put on probation, all computed scores are also reset once every ten minutes.

Client drivers and token awareness

To enable communication between client applications and a Cassandra cluster, multiple client drivers for Cassandra exist. Cassandra supports two communication protocols, the legacy Thrift interface [22], and the newer native binary protocol that enables use of the Cassandra Query Language (CQL) [6], resembling SQL. Different drivers can therefore use different protocols. Popular drivers include Astyanax, which uses the Thrift interface, and the Java driver from DataStax which only supports CQL. As these drivers can get the token information from the nodes during initialization, they can be configured to be token aware. This means that the client driver can make a qualified choice about which nodes to issue requests to, based on the data requested.

2.2 Load balancing techniques in distributed systems

There exist numerous ideas and techniques to improve load balancing in distributed systems. The problem is often to decide on a good trade-off between exchanging a lot of communication between servers and clients and making guesses and approximations on the traffic. Intuitively, more information makes it easier to make good decisions, but information passing can be costly. This section briefly discusses previous work, ideas and algorithms for load balancing techniques in distributed systems, not necessarily with focus on improving the tail latency.

2.2.1 The power of d choices

Consider a system with $n$ requests and $n$ servers to serve them. If each request is dispatched independently and uniformly at random to a server then the maximum load, or the largest number of requests at any server, is approximately $\frac{\log n}{\log \log n}$. Suppose instead that each request gets placed sequentially onto the least loaded (in terms of number of requests enqueued on a server) of $d \geq 2$ servers chosen independently and uniformly at random. It has then been shown that with high probability⁵ the maximum load is instead only $\frac{\log \log n}{\log d} + C$, where $C$ is a constant factor [1] [20]. This means that getting two choices instead of just one leads to an exponential improvement in the maximum load.

⁵ High probability means here at least $1 - \frac{1}{n}$, where $n$ is the number of requests.
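To make the dispatch rule concrete, the following is a minimal Java sketch of choosing the least loaded of d randomly sampled servers (illustrative only, not tied to any particular system; queueLengths is an assumed in-memory view of the servers' current queue lengths):

import java.util.concurrent.ThreadLocalRandom;

public final class PowerOfDChoices {
    /** Picks the index of the server to dispatch to: sample d servers uniformly
     *  at random and return the one with the fewest queued requests. */
    static int pickServer(int[] queueLengths, int d) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int best = rnd.nextInt(queueLengths.length);
        for (int i = 1; i < d; i++) {
            int candidate = rnd.nextInt(queueLengths.length);
            if (queueLengths[candidate] < queueLengths[best]) {
                best = candidate;
            }
        }
        return best;
    }
}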

This result demonstrates the power of two choices, which is a commonly used property in load balancing strategies. When referring to this idea the common way to denote it is by SQ(d), meaning shortest-queue-of-d-choices.

2.2.2 Join-Shortest-Queue

The Join-Shortest-Queue (JSQ) algorithm is a popular routing policy used in processor sharing server clusters. In JSQ, an incoming request gets dispatched to the server with the least number of currently active requests. Ties are broken by choosing randomly between the tied servers. JSQ therefore tries to load balance across servers by reducing the chance of one server having multiple jobs while another server has none. This is a greedy policy since the incoming request prefers sharing a server with as few jobs as possible. Figure 2.3 illustrates the algorithm, with the clients at the top, A-C being servers with their respective queues and pending jobs.

An interesting result that was shown by Gupta et al. [15] is that the performance of JSQ on a processor sharing system shows near insensitivity to differences in the job size distribution. This is different from similar routing policies like Least-Work-Left (send the job to the host with the least total work) or Round-Robin, which are highly sensitive to the job size distribution. JSQ is not optimal⁶, but was still shown to have great performance in comparison to algorithms with much higher complexity. A potential drawback with JSQ though, is that as the system grows, the amount of communication over the network between dispatchers and servers could get overwhelming, given that each of the distributed dispatchers will need to obtain the number of jobs at every server before every job assignment.

⁶ In the optimal solution, each incoming job is assigned so as to minimize the mean response time for all jobs currently in the system, assuming there are 0 future arrivals.

2.2.3 Join-Idle-Queue

The Join-Idle-Queue (JIQ) algorithm, described in [19], tries to decouple detection of lightly loaded servers from the job assignment. The idea is to have idle processors inform the dispatchers as they become idle, without interfering with job arrivals. This removes the load balancing work from request processing.

JIQ consists of two parts, the primary and the secondary load balancing problem, which communicate via a data structure called an I-queue. An I-queue is a list of processors that have reported themselves as idle. When a processor becomes idle it joins an I-queue based on a load balancing algorithm. Two load balancing algorithms for this purpose were considered in [19]: Random and SQ(d).

Figure 2.3: The join-shortest-queue algorithm. Clients prefer the server with the shortest queue.

With JIQ-Random an idle processor joins an I-queue uniformly at random, and with JIQ-SQ(d) an idle processor chooses d random I-queues and joins the one with the shortest queue length. If a client does not have any servers in its I-queue it will in turn make a choice based on the SQ(d) algorithm. Figure 2.4 illustrates the algorithm, again with the clients at the top with their respective I-queues, A-F being servers with their respective queues and pending jobs.

It is worth noting that JIQ-Random has the additional advantage of having one-way communication, without requiring messages from the I-queues. Lu et al. showed three interesting results:

JIQ-Random outperforms traditional SQ(2) with respect to mean response time.
JIQ-SQ(2) achieves close to the minimum possible mean response time.
Both JIQ-Random and JIQ-SQ(2) are near-insensitive to job size distribution with processor sharing in a finite system.

2.2.4 Speculative retries

Speculative retries, also denoted eager retries and hedged requests [12], is the process of sending requests to several servers and using the one that responds first. The client initially sends one request to the server that is believed to perform the best, but falls back on sending a secondary request after a delay. The client cancels remaining outstanding requests once a result is received.

Figure 2.4: The join-idle-queue algorithm. Servers join an I-queue based on the power of d choices algorithm. If a client does not have any servers in its I-queue it will in turn make a choice based on the power of d choices algorithm.

Implementing speculative retries adds some overhead, but can still give latency-reduction effects while increasing load only modestly. A way to achieve this is by waiting to send a second request until the first one has been outstanding for more than the 95th or 99th percentile expected latency for that type of request. This limits the additional load to only a couple of percent (~1-5%) while substantially reducing the tail latency, since the pending request might for example end up in a several second timeout. Speculative retries were implemented in Cassandra with the default of sending the next request at the 99th percentile [10].

2.2.5 Tied requests

Dean and Barroso [12] stated that instead of letting the client choose according to the SQ(d) algorithm, you should let the request be sent to multiple servers simultaneously while making sure that the servers are allowed to communicate updates on the status of the request with each other. These requests where servers use status updates are called tied requests. As soon as one server starts processing a request, it sends a cancellation message to the other servers ("ties"), which keeps the client out of the loop for the cancel logic. The corresponding requests, if still enqueued on the other servers, can then be aborted immediately or be deprioritized.
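As a concrete illustration of the speculative retry idea from section 2.2.4, the following is a minimal Java sketch (illustrative only, not code from Cassandra): the backup request is only fired if the primary has not responded within a chosen hedge delay, for example the observed 99th percentile latency.

import java.util.concurrent.*;
import java.util.function.Supplier;

public final class HedgedRead {

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Sends the primary request immediately and, if no response has arrived
     *  after hedgeDelayMillis, also sends a backup request. The returned future
     *  completes with whichever response arrives first. */
    static <T> CompletableFuture<T> hedged(Supplier<T> primary,
                                           Supplier<T> backup,
                                           long hedgeDelayMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();

        CompletableFuture.supplyAsync(primary).whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else {
                result.completeExceptionally(error);
            }
        });

        SCHEDULER.schedule(() -> {
            if (!result.isDone()) {                        // primary is late: hedge
                CompletableFuture.supplyAsync(backup).whenComplete((value, error) -> {
                    if (error == null) {
                        result.complete(value);            // first successful reply wins
                    }
                });
            }
        }, hedgeDelayMillis, TimeUnit.MILLISECONDS);

        return result;
    }
}

A real implementation would additionally cancel the still-outstanding request once a result is received, as described above.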

2.3 The C3 algorithm

The C3 algorithm, described in [23], is a replica selection algorithm for Cassandra usage. Suresh et al. argue that replica selection is an overlooked process which should be a cause for concern. They argue that putting mechanisms such as speculative retries on top of bad replica selection may increase system utilization for little benefit.

C3 tries to solve the problem by using two concepts. Firstly it uses additional feedback from server nodes in order for the clients to rank them and prefer faster ones. Secondly, the clients implement a rate control mechanism to prevent nodes from being overwhelmed. A note worth making is that a client in the C3 design is actually the coordinator node in Cassandra, so the entire algorithm is implemented server side. The current implementation is in Cassandra version 2.0.0.

2.3.1 Replica ranking

In the C3 replica ranking, the clients rank the server nodes using a scoring function, just like the dynamic snitch, with the score working as a measure of the latency to expect from the node in question. Clients prefer lower scores, which correspond to faster nodes, for each request.

Instead of only using the latency, the C3 scoring function tries to minimize the product of the job queue size⁷ $q_s$ and the service time $1/\mu_s$ (the time to fetch the requested rows) across every server $s$. Along with each response to a client, the servers send back additional information about their queue sizes and service times. The queue size is recorded after a request has been served and when the response is about to be returned. To make a better forecast, the values are smoothed with EWMAs, denoting the new values $\bar{q}_s$ and $\bar{\mu}_s$. In addition to these values, the response time $R_s$ (i.e. the difference between the latency for the entire request and the service time) is also recorded and smoothed.

⁷ The job queue size refers to the number of pending requests at a server.

To account for other clients in the system as well as ongoing requests, each client also maintains, for each server $s$, an instantaneous count of its outstanding requests $os_s$ (requests for which a response is yet to be received). It is assumed that each client knows how many other clients there are in the system ($n$). The clients then make an estimate of the queue size of each server as:

$$\hat{q}_s = \frac{os_s}{n} + \bar{q}_s + 1 \qquad (2.1)$$

where the $os_s/n$ term is referred to as the concurrency compensation. The idea behind the concurrency compensation is that clients will account for the scenario of multiple clients concurrently issuing requests to the same server. Clients with a higher value of $os_s$ will therefore give a higher estimate of the queue size at $s$ and rank it lower than a client with fewer requests to $s$. The result is that clients with a higher demand will be more likely to rank $s$ lower than clients with a lighter demand.

Using this estimation, clients compute the queue size to service rate ratio ($\hat{q}_s / \bar{\mu}_s$) of each server and rank them accordingly. However, a function linear in $\hat{q}_s$ is not sufficient, as it would demand a rather large increase in queue size in order for a client to switch back to a slower server again, which could result in accumulation of jobs at the faster nodes. Instead, C3 penalizes longer queue lengths by raising the $\hat{q}_s$ term to a higher power, $b$: $(\hat{q}_s)^b / \bar{\mu}_s$. For higher values of $b$, clients are less greedy about preferring a server with a lower service time as the $(\hat{q}_s)^b$ term will dominate the scoring function more strongly. In C3, $b$ is set to 3, yielding a cubic function. This results in a final scoring function:

$$\Psi_s = \bar{R}_s + (\hat{q}_s)^3 / \bar{\mu}_s \qquad (2.2)$$

where $\bar{R}_s$ and $\bar{\mu}_s$ are the EWMAs of the response time and service rate and $\hat{q}_s$ is the queue size estimate described in equation 2.1.

2.3.2 Rate control

To prevent exceeding server capacity, clients incorporate a rate limiting mechanism inspired by the congestion control in the CUBIC TCP implementation [16]. This mechanism is decentralized as clients do not inform each other of their demands of a server. Every client uses a rate limiter for each server which limits the number of requests sent within a configured time window (measured in milliseconds). The limit is referred to as the sending rate (srate). By letting the clients track the number of responses received from a server in the same window (the receive rate, rrate) the rate limiter adapts and adjusts srate to match the rrate of the server.

When a client receives a response from a server $s$, the client compares the current srate and rrate for $s$. If srate is found to be lower than rrate, the client increases its rate according to a cubic function:

$$srate \leftarrow \gamma \left( T - \sqrt[3]{\frac{\beta R_0}{\gamma}} \right)^3 + R_0 \qquad (2.3)$$

where $T$ is the elapsed time since the last rate decrease, and $R_0$ is the rate at the time of the last rate decrease. If the rrate is lower than the srate, the client instead decreases its srate multiplicatively by $\beta$, in C3 set to 0.2. The value $\gamma$ represents a scaling factor and is used to set the desired duration of the saddle region. Additionally a cap for the step size is set by a parameter $s_{max}$. The scaling factor $\gamma$ in C3 is set to 100 milliseconds and the cap size is set to 10.

To get a better understanding of the properties of the rate controlling function, consider Figure 2.5. The proposed benefit of using this function is mostly the configurable saddle region. While the sending rate is significantly lower than the saturation rate, the client will increase the rate aggressively (low rate region). When the sending rate is close to the perceived saturation point of the server, that is, $R_0$, the client stabilizes its sending rate and increases it conservatively (saddle region). Lastly, when the client has spent enough time in the stable region, it will again increase its rate aggressively, probing for more capacity (optimistic probing region).

Figure 2.5: Growth curve for the rate control function, with a low rate region, a saddle region around $R_0$, and an optimistic probing region.

2.3.3 Notes on the C3 implementation

Some notes are worth making regarding the C3 algorithm. Firstly, C3 will always route requests solely based on the replica scoring. This means that if the coordinator already has the requested data locally, it might route the request to a remote node, if that node has a better score than the coordinator itself.

Secondly, although all replicas get sorted, C3 will stop processing as soon as it has found the best replica that is not rate limited and put it first in queue for request processing. This means that when using consistency level QUORUM, i.e. sending multiple requests, only the data request will be rate limited, leaving the digest requests unaffected by the rate limiting part of C3.
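To summarize the mechanics of sections 2.3.1-2.3.3, the following is a minimal Java sketch of the scoring, the cubic rate increase and the stop-at-first-non-limited selection (illustrative only; names, the single smoothing constant and the fallback when all replicas are limited are assumptions and do not mirror the actual Cassandra/C3 source):

import java.util.Comparator;
import java.util.List;

public final class C3Sketch {

    static final double ALPHA = 0.9;        // assumed EWMA smoothing factor

    static final class ReplicaState {
        double ewmaQueueSize;               // smoothed queue size reported by the server
        double ewmaServiceRate;             // smoothed service rate, requests per ms
        double ewmaResponseTime;            // smoothed response time, in ms
        int outstanding;                    // requests still in flight to this replica

        /** Folds the feedback piggy-backed on a response into the EWMAs. */
        void onResponse(double queueSize, double serviceRate, double responseTimeMs) {
            ewmaQueueSize = ALPHA * queueSize + (1 - ALPHA) * ewmaQueueSize;
            ewmaServiceRate = ALPHA * serviceRate + (1 - ALPHA) * ewmaServiceRate;
            ewmaResponseTime = ALPHA * responseTimeMs + (1 - ALPHA) * ewmaResponseTime;
            outstanding--;
        }

        /** Score of equation (2.2); lower is better. n is the number of clients. */
        double score(int n) {
            double qHat = (double) outstanding / n + ewmaQueueSize + 1;  // equation (2.1)
            return ewmaResponseTime + Math.pow(qHat, 3) / ewmaServiceRate;
        }

        boolean rateLimiterAllows() { return true; }   // placeholder for the rate limiter
    }

    /** Cubic srate growth of equation (2.3): t is the time since the last rate
     *  decrease and r0 the sending rate at that point. */
    static double cubicIncrease(double gamma, double beta, double t, double r0) {
        double k = Math.cbrt(beta * r0 / gamma);       // inflection point of the curve
        return gamma * Math.pow(t - k, 3) + r0;
    }

    /** Ranks replicas by score and returns the best one whose rate limiter still
     *  has budget; stops at the first non-limited replica as described above. */
    static ReplicaState pick(List<ReplicaState> replicas, int n) {
        replicas.sort(Comparator.comparingDouble((ReplicaState r) -> r.score(n)));
        for (ReplicaState r : replicas) {
            if (r.rateLimiterAllows()) {
                return r;
            }
        }
        return replicas.get(0);                        // all limited: fall back to best score
    }
}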


Chapter 3

Method

Evaluating performance is not a trivial task. While the focus in this thesis was on improving the tail latency, it was important not to achieve this by sacrificing the average case performance, i.e. the average latency of a request. A good starting point was to implement C3 in the Cassandra version that Spotify uses, to try and verify whether the performance gains seen by the C3 authors in version 2.0.0 could also be seen in the newer version, despite the version gap.

3.1 Tools for testing

This section describes tools used while implementing the algorithm and evaluating Cassandra performance. In the process of benchmarking, guidelines and advice from DataStax [8] were adhered to.

3.1.1 The cassandra-stress tool

The cassandra-stress tool is a stress testing utility for Cassandra clusters written in Java which is included in the Cassandra installation [5]. It has three modes of operation: inserting data, reading data and indexed range slicing. For the purpose of this thesis the read mode is what was used for analysis. During a run, the cassandra-stress tool reports information at a configurable interval. Example output can be seen below:

    total, interval_op_rate, interval_key_rate, latency, 95th, 99.9th, elapsed_time
    ..., 1057, 1057, 15.4, 36.4, 571.6, ...
    ..., 1620, 1620, 10.5, 32.9, 475.8, ...
    ..., 2071, 2071, 4.0, 29.4, 380.6, ...
    ..., 2436, 2436, 2.5, 27.1, 378.1, ...

Here, each line reports data for the interval between the last elapsed time and the current elapsed time (default is 10 seconds). The columns of particular interest are latency, 95th and 99.9th. The latency column describes the average latency in milliseconds for each operation during that interval. The 95th and 99.9th columns describe the percentiles, i.e. 95% and 99.9% of the time the latency was less than the number displayed. The cassandra-stress tool is highly configurable, for example it is possible to specify the number of threads, read and write consistency and the size of the records.

3.1.2 The Yahoo Cloud Serving Benchmark

The Yahoo Cloud Serving Benchmark (YCSB) is a framework for benchmarking various cloud serving systems [4]. The YCSB client is a workload generator, and the core workloads included in the installation are a set of workload scenarios to be executed by the generator. Just like the cassandra-stress tool, the YCSB client is highly configurable. For example it is possible to specify the number of threads, read and write consistency, size of the records and format of the output. Below is example output where the format is a time series:

    [READ], 40, ...
    [READ], 50, ...
    [READ], 60, ...
    [READ], 70, ...
    [READ], 80, ...

Here, each line reports the average read latency (in microseconds) at an interval of ten milliseconds.

3.1.3 The Java driver stress tool

The Java driver stress tool is a simple example application that uses the DataStax Java driver to stress test Cassandra - which also stress tests the Java driver as a result. The example tool is by no means a complete stress application and supports only a very limited number of stress scenarios.

3.1.4 Darkloading

To test new versions of Cassandra, Spotify makes use of Darkloading. Darkloading is the process of duplicating the traffic of a certain system and replaying it on another system, to compare the performance.

This is done by snooping on the traffic to the original system and then making a duplicate request to another system.

3.2 Test environment setup

In the process of evaluating the performance of different Cassandra versions, the task was divided into two parts. The first was evaluating performance by using stress tools such as cassandra-stress, YCSB and the Java driver stress tool, which generate the workload and traffic by themselves. The other part was evaluating performance on production workload and traffic, which was obtained with the Darkloading strategy.

Testing on dedicated hardware is preferable as it removes the uncertainty of skewed results due to resource sharing. Therefore, dedicated hardware was used for both cases. For the Cassandra cluster, machines suited for databases were provisioned, with 16 cores, 32 GB of RAM and spinning disks in a RAID 10 configuration. For the machines which send the traffic, dedicated service machines with 32 cores and 64 GB of RAM were used instead.

When different benchmarks were conducted it was deemed interesting to test both consistency level ONE and QUORUM. Testing with speculative retries both enabled and disabled was tried, but as this did not yield any interesting results¹ it was omitted as a testing parameter.

¹ A slight improvement could be seen in the higher percentiles, but as this improved performance equally across different versions, it was deemed irrelevant.

3.2.1 Testing on generated load

When testing on generated workload there were two things in particular desirable to achieve. The first was that enough data was inserted to ensure that the entire dataset does not fit in memory. The other part was running the test long enough, since a cluster has very bad performance at the start of a run (due to the Java Virtual Machine warming up). Due to this the first 15% of all recorded values were discarded, to only record values when the cluster performance had stabilized. The 15% breakpoint was not thoroughly analyzed, but was simply decided appropriate when looking at the raw output from test runs.

3.2.2 Testing on production load

To try and make the comparison between different Cassandra versions as fair as possible, the same production traffic was used in each test run. The data was sampled from the real service and saved to file, making it possible to replay the same data multiple times.

As the traffic was replayed at a fixed rate (in production the rate varies over the day) it only made sense to compare test runs against each other and not against the real production cluster performance.

Chapter 4

Implementation

4.1 Implementing C3 in Cassandra

As Spotify uses a newer version of Cassandra for new applications, their development environment is also suited for that version. Due to the fact that Cassandra 2.0.0 and the newer version are incompatible, C3 was instead implemented directly in the version used at Spotify, making the comparisons and cluster setup easier in the Spotify environment. The implementation did not need much additional reworking of the newer code¹, making the process simple.

¹ To make C3 work in the newer version, moving some method calls was sufficient.

4.2 Implementing C3 in the DataStax Java driver

As previously mentioned, the entire C3 algorithm is implemented server side. However, a client implementation may be preferable as many newer Cassandra client drivers are token aware, meaning that the coordinator node will be able to serve the requested data directly. By implementing C3 in the client, we can send the request to the best replica in the first step, removing the need of going through the coordinator node just to rank the replicas.

With that in mind, the C3 algorithm was implemented in the DataStax Java driver. The Java driver was chosen since it is actively maintained, uses the newer communication protocol and also since it has good support for implementing new load balancing policies.

There were some impediments along the way though. Firstly, the queue size and service time as recorded by the server could not be used as this is an extension in the C3 server code. This means that the replica scoring only used metrics as seen by the clients, which might have had a significant impact on the performance.

Secondly, as the driver code is substantially different from the server code, the parameters set in the C3 server code might not have been suitable values for the client.

4.2.1 Naive implementation

To decide which hosts to send a request to, the driver makes use of a load balancing policy. For each request, the load balancing policy is responsible for returning an iterator containing the hosts to query. This served as a suitable place to implement the replica scoring part of C3. Therefore a new policy called HostScoringPolicy was implemented, responsible for the logic of ranking hosts.

As mentioned earlier, the scoring function was simplified as the metrics from the servers used in the original C3 version were not available. The metrics used in the client-side ranking are the latency for the entire request ($L_s$), the queue size ($q_s$), and the outstanding requests to a host ($os_s$), all as seen by the client. Just like the server implementation, EWMAs were used to smooth the values. The client version of $\hat{q}_s$ is therefore defined just as before:

$$\hat{q}_s = \frac{os_s}{n} + \bar{q}_s + 1 \qquad (4.1)$$

but with the difference that the queue size here is recorded from the client perspective and not by the server itself as in the original C3 implementation. This results in the final client scoring function:

$$\Psi_s = \bar{L}_s + (\hat{q}_s)^3\,\bar{L}_s \qquad (4.2)$$

Here we can notice the big difference that we do not have the service time metric, leaving us with the entire latency of the request as the only measure. The rate limiting part of C3 was however easily plugged in, as the functionality is self contained and not relying on external metrics.

4.3 Benchmarking with YCSB

To confirm that C3 performs as suggested, as well as to verify that the implementation worked as intended, it was desirable to reproduce the results presented

by Suresh et al. in [23]. To achieve this, the YCSB framework was used, just like in the original paper. The test scenario with a read-heavy workload (95% reads, 5% writes) was chosen to be reproduced.

In the original experiment 15 Cassandra nodes were used, with a replication factor of 3, and 500 million records of 1KB each were inserted across the nodes, yielding approximately 100 GB of data per machine. Since the test setup only had 8 Cassandra nodes the record count was modified to be similar to the load in the original experiment. Therefore 250 million records of 1KB each were inserted, yielding close to 100 GB of data per machine.

Just like the original test scenario three YCSB instances were used (running on separate machines), each running 40 threads, yielding a total of 120 generators. Then for each Cassandra version and consistency level, just like in the original test, two million rows were read, five times. The duration of a read run was on the order of minutes, depending on consistency level.

4.4 Benchmarking with cassandra-stress

As the cassandra-stress tool already comes packaged together with the Cassandra installation, C3 was also tested with this tool, to gain further confidence about the performance of C3.

The deployment again consisted of the 8 Cassandra nodes, and one separate service machine running the cassandra-stress tool with the default of 50 threads. 250 million records of 1KB each were inserted across the cluster. Due to a design choice in the cassandra-stress tool², as many rows were read as had been inserted. The duration of a read run was about 5-7 hours depending on consistency level.

² For example, inserting a given number of rows will write rows with key values in the corresponding range, meaning that if you try to read rows of a different magnitude, the keys will not match and the read will fail.

4.5 Benchmarking with the java-driver stress tool

As creating a custom stress tool for the purpose of client evaluation is outside the scope of this thesis, the stress application that comes together with the Java driver was used to evaluate the client implementation of C3. By making some small modifications in the source code of the stress application it was possible to test the different load balancing policies with different consistency levels.

The deployment again consisted of the 8 Cassandra nodes, and 6 service machines, each running 100 threads. 250 million records of 1KB each were inserted across the cluster. For each Cassandra version and consistency level, 100 million rows were read. The duration of a read run was about 5-7 hours depending on consistency level.

4.6 Darkloading

In order to benchmark the performance of C3 under production load, a cluster had to be duplicated. A suitable cluster was decided on with recommendations from Jimmy Mårdell at Spotify. The chosen cluster consists of 8 Cassandra nodes with approximately 130 GB of data per node and 6 service machines sending traffic to the cluster. The read/write ratio of the incoming requests to the service is approximately 97% reads and 3% writes.

To send traffic to the test cluster, two versions of the service client were used. The first version was token aware and used consistency level QUORUM, just like the original service. In the other version the token awareness was replaced by plain round robin, and the consistency level was set to ONE, to try and match the settings that the original C3 was developed with. Due to the service client using the Astyanax client and not the Java driver, it was unfortunately not possible to Darkload the C3 client. Although Astyanax supports a beta version that uses the Java driver under the hood, it only does so for older versions of the Java driver.

For each setup, the sampled traffic was replayed at a configured rate which resulted in a disk I/O utilization of around 50-60%, making sure that the cluster had as much traffic as possible without choking the disks. Note however that even though the same traffic was replayed, writes altered the data in the cluster, potentially affecting some reads, but given the low amount of writes this was deemed to be negligible.

In the Darkloading setup an extension to C3 was also tried, where the coordinator node would always serve the data locally if possible (while using round robin in the client), but as this showed no difference in performance, that particular test was omitted.

Chapter 5

Results

Here we present the results from our different benchmarks. The standard deviation for each measure is marked in all charts. In some charts, where the difference was small, we have omitted the average latencies as the focus lies on improving the tail latency. All exact numbers, including averages, are available in Appendix A.

5.1 Benchmarking with YCSB

Here we present the results from the YCSB runs. The results are the averages of the combined values outputted from the three YCSB instances. In Figure 5.1 we have consistency level ONE to the left and QUORUM to the right.

Figure 5.1: Benchmark of C3 with YCSB, showing mean, 95th, 99th and 99.9th percentile latencies for consistency levels ONE and QUORUM.

5.2 Benchmarking with cassandra-stress

Here we present the results from the cassandra-stress runs. The results are the averages from the single cassandra-stress instance. In Figure 5.2 we have the results for the 95th and 99.9th percentile latencies, with consistency level ONE to the left and QUORUM to the right.

Figure 5.2: Benchmark of C3 with cassandra-stress, showing 95th and 99.9th percentile latencies for consistency levels ONE and QUORUM.

5.3 Benchmarking with the java-driver stress tool

Here we present the results from the java-driver stress runs. The default we compare against is the java-driver with the default LoadBalancingPolicy, which is token aware.

5.3.1 Performance of the C3 client

In Figure 5.3 we have the results for the mean, 95th and 99th percentile latencies, with consistency level ONE to the left and QUORUM to the right. For both the default and the C3 client, the same Cassandra version was running server side.

Figure 5.3: Benchmark of client C3 with the default server-side Cassandra, showing mean, 95th and 99th percentile latencies for consistency levels ONE and QUORUM.

5.4 Darkloading

Here we present the results from the Darkloading runs. First we present the performance with token awareness in the client, followed by the performance with plain round robin.

5.4.1 Performance with token awareness

In Figure 5.4 we have the results for the 95th, 98th, 99th and 99.9th percentile latencies.

Figure 5.4: Darkloading with token awareness, consistency level QUORUM.

5.4.2 Performance with round robin

In Figure 5.5 we have the results for the 95th, 98th, 99th and 99.9th percentile latencies.

Figure 5.5: Darkloading with round robin, consistency level ONE.


Chapter 6

Discussion

6.1 Performance of server side C3

6.1.1 YCSB vs. cassandra-stress

The YCSB stress runs confirm the results from the original experiment, that C3 is superior to the original dynamic snitch. Furthermore we found that regardless of using consistency level ONE or QUORUM (in the original experiment only consistency level ONE was evaluated), C3 proved to reduce both latency and variance across all percentiles.

In our cassandra-stress runs, results were again positive but not at all with the same confidence as in the YCSB runs. Although it would have been reassuring to get more similar results between tools, we want to emphasize the differences between setups. The cassandra-stress runs were read only, whereas the YCSB runs were read heavy. We had a different number of instances running, as well as a different thread count. We also do not have any control over the read patterns in cassandra-stress, which also could contribute to the differing results. Additional YCSB runs similar to the cassandra-stress setup are desirable to see if the difference between results decreases, but due to time constraints we leave this to future work.

6.1.2 Darkloading

When evaluating C3 on production load, results were a bit different from the stress tool results. In the case of the token aware client, we actually saw a little bit (around 100 µs) of overhead in the average case. Not until the 98th percentile did we see an actual improvement, and it was only by a couple of ms, which is not strong enough to suggest an actual performance gain.

We believe that one reason for not seeing much improvement in this case is the fact that the client is token aware. The client will therefore already send the request to a node that has the data, meaning that C3 in some cases will not be able to improve the routing. Darkloading C3 with round robin in the client (and consistency level ONE) actually did improve the results, supporting this claim. Although still having the small 100 µs overhead in the average case, we could now see an improvement already at the 95th percentile, with the 99.9th percentile having improved by about 20%.

Even though we did see this improvement, there is still a big gap in performance gain compared to the stress tool results. This could have several reasons. Firstly, when generating workload, all the records were of equal size (1KB), meaning that all read requests are equally large. In the case of production load, some rows might contain more data than others due to the nature of the Darkloaded service. This means that some reads will have higher latencies, not due to slow servers but due to how the data is structured. The result of this would be that C3 might rank fast servers as slow ones just because they happen to get heavier reads.

Another point worth making is that problems such as garbage collections, where C3 really could improve the performance, commonly do not occur until the cluster has been running for a couple of weeks, which makes it a hard scenario to simulate in the scope of this thesis.

6.2 Performance of client side C3

Although not having the exact metrics like the server C3, the C3 client implementation did lower the tail latency. However, the benchmark showed a lot of variance, making the results inconclusive. Since the variance was present in both the default java-driver version and the C3 implementation, we deem this to be a fault in the benchmark setup and not in the implementation. We suggest that making repeated benchmarks and perhaps tweaking the parameters could give a more conclusive result. However, we are under the impression that C3 in the client could work well, and perhaps be a substitute for token aware clients.

6.3 Conclusion

Given the right conditions the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We would recommend the current implementation in systems where row sizes are homogeneous, as variable size records are not taken into account in the scoring function.

However, we see no problem with extending the algorithm to take into account

6.3 Conclusion

Given the right conditions, the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We recommend the current implementation for systems where row sizes are homogeneous, since variable-size records are not taken into account in the scoring function. However, we see no problem with extending the algorithm to handle variable-size rows: given that one can obtain the size of the data requested, it should be possible to construct a weighted scoring function, but this is outside the scope of this thesis.

We would also argue that C3 will be most effective when the client is not token aware. A client implementation of C3 could resolve this, but the results found in this thesis were too inconclusive to support this claim, and further testing is needed.
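As one possible shape for the weighted scoring function suggested above, the sketch below normalizes each latency sample by the number of bytes the read returned before it enters the EWMAs used by the score. Both the normalization and the parameter names are our own assumptions, not something that was evaluated in this thesis.

/**
 * Sketch of a size-weighted latency sample: instead of feeding raw response
 * times into the replica score, feed a per-byte cost so that a replica is not
 * penalized merely for serving larger rows.
 */
final class SizeWeightedSample {

    private static final double MIN_BYTES = 1.0; // guard against division by zero

    /**
     * @param latencyMs     measured response time for the read
     * @param bytesReturned size of the result for this read
     * @return a size-normalized latency to feed into the latency EWMA
     */
    static double normalize(double latencyMs, long bytesReturned) {
        return latencyMs / Math.max(MIN_BYTES, (double) bytesReturned);
    }
}

The open question is where to obtain bytesReturned cheaply: on the server it is known once the read has completed, whereas a client would have to measure the size of the decoded result, which adds overhead of its own.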


Appendix A

Results from benchmarks

A.1 YCSB

                        System              C3
Average latency (ms)    11.59, σ =          , σ =
   th percentile (ms)   21.22, σ =          , σ =
   th percentile (ms)   30.28, σ =          , σ =
   th percentile (ms)   54.85, σ =          , σ = 2.32

Table A.1: YCSB read latencies with consistency level ONE.

                        System              C3
Average latency (ms)    16.11, σ =          , σ =
   th percentile (ms)   28.46, σ =          , σ =
   th percentile (ms)   40.69, σ =          , σ =
   th percentile (ms)   80.45, σ =          , σ = 4.37

Table A.2: YCSB read latencies with consistency level QUORUM.

A.2 cassandra-stress

                        System              C3
Average latency (ms)    3.93, σ =           , σ =
   th percentile (ms)   27.12, σ =          , σ =
   th percentile (ms)   , σ =               , σ = 23.89

Table A.3: cassandra-stress read latencies with consistency level ONE.

                        System              C3
Average latency (ms)    8.44, σ =           , σ =
   th percentile (ms)   34.54, σ =          , σ =
   th percentile (ms)   , σ =               , σ = 32.18

Table A.4: cassandra-stress read latencies with consistency level QUORUM.

A.3 java-driver stress

                        System (java-driver client)    C3
Average latency (ms)    8.75, σ =                      , σ =
   th percentile (ms)   75.05, σ =                     , σ =
   th percentile (ms)   , σ =                          , σ = 55.80

Table A.5: java-driver stress read latencies with consistency level ONE.

                        System (java-driver client)    C3
Average latency (ms)    14.95, σ =                     , σ =
   th percentile (ms)   , σ =                          , σ =
   th percentile (ms)   , σ =                          , σ = 42.77

Table A.6: java-driver stress read latencies with consistency level QUORUM.

A.4 Darkloading

A.4.1 Token aware

                        System              C3
50th percentile (ms)    0.90, σ =           , σ =
   th percentile (ms)   1.12, σ =           , σ =
   th percentile (ms)   14.59, σ =          , σ =
   th percentile (ms)   27.31, σ =          , σ =
   th percentile (ms)   36.97, σ =          , σ =
   th percentile (ms)   70.61, σ =          , σ = 7.47

Table A.7: Darkloading read latencies with consistency level QUORUM.

A.4.2 Round robin

                        System              C3
50th percentile (ms)    0.88, σ =           , σ =
   th percentile (ms)   1.11, σ =           , σ =
   th percentile (ms)   12.27, σ =          , σ =
   th percentile (ms)   23.25, σ =          , σ =
   th percentile (ms)   31.85, σ =          , σ =
   th percentile (ms)   61.96, σ =          , σ = 3.95

Table A.8: Darkloading read latencies with consistency level ONE.


