Replica selection in Apache Cassandra


DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Replica selection in Apache Cassandra
Reducing the tail latency for reads using the C3 algorithm

SOFIE THORSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)

Replica selection in Apache Cassandra
Reducing the tail latency for reads using the C3 algorithm
Val av replikor i Apache Cassandra

SOFIE THORSEN
[email protected]

Master's Thesis at CSC
Supervisor: Per Austrin
Examiner: Johan Håstad
Employer: Spotify


Abstract

Keeping response times low is crucial in order to provide a good user experience. Especially the tail latency proves to be a challenge to keep low as size, complexity and overall use of services scale up. In this thesis we look at reducing the tail latency for reads in the Apache Cassandra database system by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann. Through extensive benchmarks with several stress tools, we find that C3 indeed decreases the tail latencies of Cassandra on generated load. However, when evaluating C3 on production load, results do not show any particular improvement. We argue that this is mostly due to the variable-size records in the data set and token awareness in the production client. We also present a client-side implementation of C3 in the DataStax Java driver in an attempt to remove the caveat of token aware clients. The client-side implementation did give positive results, but as the benchmark results showed a lot of variance we deem the results to be too inconclusive to confirm that the implementation works as intended. We conclude that the server-side C3 algorithm will work effectively for systems with homogeneous row sizes where the clients are not token aware.

Sammanfattning

Val av replikor i Apache Cassandra

In order to offer a good user experience it is of the utmost importance to keep response times low. The tail latency in particular is a challenge to keep low as today's applications grow in size, complexity and usage. In this report we examine the tail latency for reads in the Apache Cassandra database system and whether it can be improved, by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann. Through extensive tests with several different stress tools we find that C3 indeed improves Cassandra's tail latencies on generated load. However, using C3 on production load did not show any major improvement. We argue that this is mainly due to variable-size records in the data set and the production client being token aware. We also present a client-side implementation of C3 in the Java driver from DataStax, in an attempt to address the problem with token aware clients. The client-side implementation of C3 gave positive results, but as the test results showed large variance we consider the results too uncertain to confirm that the implementation works as intended. We conclude that C3, implemented on the server, works effectively on systems with homogeneous data sizes and where clients are not token aware.

Contents

Acknowledgements
1 Introduction
  1.1 Problem statement
2 Background
  2.1 Terminology and definitions
    2.1.1 Load balancing and replica selection
    2.1.2 Percentiles and tail latency
    2.1.3 CAP theorem
    2.1.4 Eventual consistency
    2.1.5 SQL
    2.1.6 NoSQL
    2.1.7 Accrual failure detection
    2.1.8 Exponentially weighted moving averages (EWMA)
    2.1.9 RAID
    2.1.10 Apache Cassandra
  2.2 Load balancing techniques in distributed systems
    2.2.1 The power of d choices
    2.2.2 Join-Shortest-Queue
    2.2.3 Join-Idle-Queue
    2.2.4 Speculative retries
    2.2.5 Tied requests
  2.3 The C3 algorithm
    2.3.1 Replica ranking
    2.3.2 Rate control
    2.3.3 Notes on the C3 implementation
3 Method
  3.1 Tools for testing
    3.1.1 The cassandra-stress tool
    3.1.2 The Yahoo Cloud Serving Benchmark
    3.1.3 The Java driver stress tool
    3.1.4 Darkloading
  3.2 Test environment setup
    3.2.1 Testing on generated load
    3.2.2 Testing on production load
4 Implementation
  4.1 Implementing C3 in Cassandra
  4.2 Implementing C3 in the DataStax Java driver
    4.2.1 Naive implementation
  4.3 Benchmarking with YCSB
  4.4 Benchmarking with cassandra-stress
  4.5 Benchmarking with the java-driver stress tool
  4.6 Darkloading
5 Results
  5.1 Benchmarking with YCSB
  5.2 Benchmarking with cassandra-stress
  5.3 Benchmarking with the java-driver stress tool
    5.3.1 Performance of the C3 client
  5.4 Darkloading
    5.4.1 Performance with token awareness
    5.4.2 Performance with round robin
6 Discussion
  6.1 Performance of server side C3
    6.1.1 YCSB vs. cassandra-stress
    6.1.2 Darkloading
  6.2 Performance of client side C3
  6.3 Conclusion
A Results from benchmarks
  A.1 YCSB
  A.2 cassandra-stress
  A.3 java-driver stress
  A.4 Darkloading
    A.4.1 Token aware
    A.4.2 Round robin
Bibliography

Acknowledgements

I want to thank Lalith Suresh and Marco Canini for continuously discussing thoughts and sharing ideas throughout this project. I also want to thank Jimmy Mårdell for his support and expertise with the quirkiness and caveats that Cassandra presents, as well as for volunteering to be my supervisor in the first place.


Chapter 1

Introduction

For all service-oriented applications, fast response times are vital for a good user experience. To examine the exact impact of server delays, Amazon and Google conducted experiments where they added extra delays on every query before sending back results to users [21]. One of their findings was that an extra delay of only 500 milliseconds per query resulted in a 1.2% loss of users and revenue, with the effect persisting even after the delay was removed.

However, keeping response times low is not an easy task. As Google reported [12], especially the tail latency is challenging to keep low as size, complexity and overall use of services scale up. When serving a single user request, multiple servers can be involved. Bad latency on a few machines then quickly results in higher overall latencies, and the more machines, the worse the tail latency. To illustrate why, consider a client making a request to a single server. Suppose that the server has an acceptable response time in 99% of the requests, but the last 1% of the requests takes a second or more to serve. This scenario is not too bad, as it only means that one client gets a slightly slower response every now and then. Consider instead a hundred servers like this and that a request requires a response from all servers. This will greatly change the responsiveness of the system. From 1% of the requests being slow, suddenly 63%¹ of the requests will take more than a second to serve. It is then apparent that the tail latency must be taken seriously in order to provide a good service.

¹ Assuming independence between response times, the probability that at least one response takes more than a second is 1 − 0.99^100 ≈ 0.63.

Apache Cassandra is the database of choice at Spotify for end user facing features. Spotify runs more than 80 Cassandra clusters on over 650 servers, managing important

data such as playlists, music collections, account information, user/artist followers and more. Since an end user request often involves reading from several databases, poor tail latencies will affect the user experience negatively for a large number of users.

In this thesis a replica selection algorithm for usage with Cassandra was implemented and evaluated, with focus on reducing the tail latency for reads.

1.1 Problem statement

The data in Cassandra is replicated to several nodes in the cluster to provide high availability. The performance of the nodes in the cluster varies over time though, for instance due to internal data maintenance operations and Java garbage collections. When data is read, a replica selection algorithm in Cassandra determines which node in the cluster the request should be sent to. The built-in replica selection algorithm provides good median latency, but the tail latency is often an order of magnitude worse than the median, which leads to the following question:

Can the tail latency for reads in Cassandra be reduced in practice by using a more sophisticated replica selection algorithm?

Chapter 2

Background

2.1 Terminology and definitions

In this section we discuss concepts and technology necessary to follow the thesis. The reader familiar with the concepts can skip this section.

2.1.1 Load balancing and replica selection

Load balancing is the process of distributing workload across multiple computing resources, such as servers. Replica selection is a form of load balancing as it tries to balance requests across the set of nodes that own the requested data.

2.1.2 Percentiles and tail latency

A percentile is a statistical measure that indicates the value below which a given percentage of observations in a group of observations fall. For example, the 95th percentile is the smallest value which is greater than or equal to 95% of the observations.

In the context of latencies, percentiles are important measures when analyzing data. For example, if only using mean and median in analysis, outliers can remain hidden. In contrast, the maximum value gives a pessimistic view since it can be distorted by a single data point. Consider the graph in Figure 2.1, showing latencies over time. If only presenting mean and median latency, crucial information is lost. After 5 hours, the 99th percentile shows a spike that is not noticeable in the mean or median. After 8.5 hours the 99th percentile shows that 1% of the users are experiencing more than 800 ms latencies, while the mean is only 75 ms. The higher percentiles, commonly the 95th to 99th, are often referred to as the tail latency.

Figure 2.1: Latencies over time.

2.1.3 CAP theorem

The CAP theorem, also known as Brewer's theorem [14], states that for a distributed computer system it is impossible to simultaneously provide all three of the following:

Consistency - all nodes see the same data at the same time.
Availability - every request receives a response about whether it succeeded or failed.
Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.

2.1.4 Eventual consistency

Eventual consistency is a consistency model used in distributed systems to achieve high availability. The consistency model informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

2.1.5 SQL

Structured Query Language (SQL) is a special-purpose programming language designed for managing data held in a traditional relational database management system (RDBMS). The data model in a relational database uses tables with rows and

columns, with rows containing information about one specific entity and columns being the separate data points. For example, a row could represent a specific car, in which the columns are Model, Color and so on. The tables can have relationships between each other and the data is queried using SQL.

2.1.6 NoSQL

NoSQL¹ databases are an alternative to the tabular relations used in relational databases. The motivation for this approach includes simplicity of design, horizontal scaling and availability. The data structures used by NoSQL databases (e.g. column, document, key-value or graph) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The suitability of a particular database, regardless of it being relational or NoSQL, depends on the problem it must solve. There are many different distributed NoSQL databases and their functionality can differ a lot depending on which two properties from the CAP theorem they support.

¹ Interpreted as "Not only SQL", to emphasize that they may also support SQL-like languages.

2.1.7 Accrual failure detection

In distributed systems, a failure detector is an application or a subsystem that is responsible for detecting slow or failing nodes. This mechanism is important to detect situations where the system would perform better by excluding the culprit node or putting it on probation. To decide if a node is subject for exclusion/probation a suspicion level is used. For example, traditional failure detectors use boolean information as the suspicion level: a node is simply suspected or not suspected.

Accrual failure detectors are a class of failure detectors where the information is a value on a continuous scale rather than a boolean value. The higher the value, the higher the confidence that the monitored node has failed. If an actual crash occurs, the output of the accrual failure detector will accumulate over time and tend towards infinity (hence the name). This model provides more flexibility as the application itself can decide an appropriate suspicion threshold. Note that a low threshold means quick detection in the event of a real crash, but also increases the likelihood of incorrect suspicion. On the other hand, a high threshold makes fewer mistakes but makes the failure detector slower to detect failing nodes.

Hayashibara et al. describe an implementation of such an accrual failure detector in [17], called the φ accrual failure detector. In the φ failure detector the arrival times of heartbeats² are used to approximate the probabilistic distribution of future heartbeat messages. With this information, a value φ is calculated with a scale that changes dynamically to match recent network conditions.

² A periodic signal generated by hardware/software for activation or synchronization purposes.
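As a rough illustration (a simplified form of the estimator in [17]; the actual detector fits a probability distribution to the observed heartbeat inter-arrival times), the suspicion value can be expressed as:

$$\varphi(t_{\mathrm{now}}) = -\log_{10}\bigl(P_{\mathrm{later}}(t_{\mathrm{now}} - T_{\mathrm{last}})\bigr)$$

where $T_{\mathrm{last}}$ is the arrival time of the most recent heartbeat and $P_{\mathrm{later}}(t)$ is the estimated probability that a heartbeat arrives more than $t$ time units after the previous one. The longer a node stays silent, the smaller $P_{\mathrm{later}}$ becomes and the larger $\varphi$ grows.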

2.1.8 Exponentially weighted moving averages (EWMA)

A moving average (also known as rolling average or running average) is a technique used to analyze trends in a data set by creating a series of averages of different subsets of the full data set. Given a sequence of numbers and a fixed subset size, the first element of the moving average sequence is obtained by taking the average of the initial fixed subset of the number sequence. Then the subset is modified by excluding the first number of the series and including the next number following the original subset in the series. This creates a new averaged subset of numbers. More mathematically formulated: given a sequence $\{a_i\}_{i=1}^{N}$, an $n$-moving average is a new sequence $\{s_i\}_{i=1}^{N-n+1}$ defined from the $a_i$ sequence by taking the mean of subsequences of $n$ terms:

$$s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j$$

The sequences $S_n$ giving the $n$-moving averages are then:

$$S_2 = \tfrac{1}{2}\,(a_1 + a_2,\; a_2 + a_3,\; \ldots,\; a_{N-1} + a_N)$$
$$S_3 = \tfrac{1}{3}\,(a_1 + a_2 + a_3,\; a_2 + a_3 + a_4,\; \ldots,\; a_{N-2} + a_{N-1} + a_N)$$

and so on. An example of different moving averages can be seen in Figure 2.2.

Figure 2.2: The 2- (red), 3- (green), and 4- (blue) moving averages for 20 data points.
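As a small illustration, the n-moving average above can be computed with a sliding window sum; the following is a minimal Java sketch (illustrative only, not code from the thesis):

public final class MovingAverage {
    /** Returns the n-moving average sequence s of the input sequence a,
     *  where s[i] is the mean of a[i] .. a[i+n-1]. Assumes 1 <= n <= a.length. */
    static double[] movingAverage(double[] a, int n) {
        double[] s = new double[a.length - n + 1];
        double windowSum = 0;
        for (int j = 0; j < n; j++) {
            windowSum += a[j];                    // sum of the first window
        }
        s[0] = windowSum / n;
        for (int i = 1; i < s.length; i++) {
            windowSum += a[i + n - 1] - a[i - 1]; // slide the window one step
            s[i] = windowSum / n;
        }
        return s;
    }
}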

An exponentially weighted moving average (EWMA), instead of using the average of a fixed subset of data points, applies weighting factors to the data points. The weighting for each older data point decreases exponentially, never reaching zero. The EWMA for a series $Y$ can be calculated as:

$$S_1 = Y_1$$
$$S_t = \alpha\,Y_t + (1-\alpha)\,S_{t-1} \quad \text{for } t > 1$$

where $\alpha$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A higher value of $\alpha$ discounts older observations faster. $Y_t$ is the value at a time period $t$, and $S_t$ is the value of the EWMA at a time period $t$.

2.1.9 RAID

RAID³ is a virtualization technology for data storage which combines multiple disk drives into one logical unit. Data is distributed across the drives in different ways called RAID levels, depending on the specific level of redundancy and performance wanted. The different schemes are named by the word RAID followed by a number (e.g. RAID 0, RAID 1). Each scheme provides a different balance between the key goals: reliability, availability, performance and capacity. RAID 10, or RAID 1+0, is a scheme where throughput and latency are prioritized, and it is therefore the preferable RAID level for I/O intense applications such as databases.

³ Originally redundant array of inexpensive disks, now commonly redundant array of independent disks.

2.1.10 Apache Cassandra

Apache Cassandra, born at Facebook [18] and built on ideas from Amazon's Dynamo [13] and Google's BigTable [3], is an open source NoSQL distributed database system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

DataStax

DataStax is a computer software company whose business model centers around selling an enterprise distribution of the Cassandra project which includes extensions to Cassandra, analytics and search functionality. DataStax also employ more than ninety percent of the Cassandra committers.

Replication

To ensure fault tolerance and reliability, Cassandra stores copies of data, called replicas, on multiple nodes. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is one copy of each row on one node. A factor of two means two copies of each row, where each copy is on a different node [7].

When a client read or write request is issued, it can go to any node in the cluster since all nodes in Cassandra are peers. When a client connects to a node, that node serves as the coordinator for that particular client operation. What the coordinator then does is to act as a proxy between the client application and the nodes that own the requested data. The coordinator is responsible for determining which node should get the request based on the cluster configuration and replica placement strategy.

Partitioners and tokens

A partitioner in Cassandra determines how the data is distributed across the nodes in a cluster, including replicas. In essence, the partitioner is a hash function for deriving a token, representing a row, from its partition key⁴ [9]. The basic idea is that each node in the Cassandra cluster is assigned a token that determines what data in the cluster it is responsible for [2]. The tokens assigned to the nodes need to be distributed throughout the entire possible range of tokens. As a simple example, consider a cluster with four nodes and a possible token range of, say, 0 to 79. Then you would want the tokens for the nodes to be 0, 20, 40, 60, making each node responsible for an equal portion of the data.

⁴ The partition key is the first column declared in the PRIMARY KEY definition. Each row of data is uniquely identified by the partition key.

Data consistency

As Cassandra sacrifices consistency for availability and partition tolerance, making it an AP system in the CAP theorem sense, replicas may not always be synchronized. Cassandra extends the concept of eventual consistency by offering tunable consistency, meaning that the client application can decide how consistent the requested data must be. In the context of read requests, the consistency level specifies how many replicas must respond to a read request before data is returned to the client application. Examples of consistency levels can be seen in Table 2.1.

Level     Description
ALL       Returns the data after all replicas have responded. The read operation fails if a replica does not respond.
QUORUM    Returns the data once a quorum, i.e. a majority, of replicas has responded.
ONE       Returns the data from the closest replica.
TWO       Returns the data from the two closest replicas.

Table 2.1: Examples of read consistency levels.

To minimize the amount of data sent over the network when doing reads with a consistency level above ONE, Cassandra makes use of digest requests. A digest request is just like a regular read request except that instead of the node actually sending the data it only returns a digest, i.e. a hash of the data. The intent is to discover whether two or more nodes agree on what the current data is, without actually sending the data over the network and thereby save bandwidth. Cassandra sends one data request to one replica and digest requests to the remaining replicas. Note that the digest queried nodes still will do all the work of fetching data, they will just not return it.

Replica selection

In order for the coordinator node to route requests efficiently it makes use of a snitch. A snitch informs Cassandra about the network topology and determines which data centers and racks nodes belong to. This information allows Cassandra to distribute replicas according to the replication strategy [11] by grouping machines into data centers and racks.

In addition, all snitches also use a dynamic snitch layer that provides an adaptive behaviour when performing reads [24]. It uses an accrual failure detection mechanism, based on the φ failure detector discussed in section 2.1.7, to calculate a per node threshold that takes into account network performance, workload and historical latency conditions. This information is used to detect failing or slow nodes, but also for calculating the best host in terms of latency, i.e. selecting the best replica.

However, calculating the best host is expensive. If too much CPU time is spent on calculations it would become counterproductive as it would sacrifice overall read throughput. The dynamic snitch therefore adopts two separate operations. One is receiving the updates, which is cheap, and the other is calculating scores for each host, which is more expensive. In the update part latencies of the hosts are sampled and weighted with EWMAs. The calculation part in turn iterates through the recorded latencies of each host to

find the worst latency as a measure for the scoring. After finding the worst latency it makes a second pass over the hosts and scores them against the maximum value. This calculation has been configured to only run once every 100 ms to reduce the cost. As hosts can not inform the system of their recovery once put on probation, all computed scores are also reset once every ten minutes.

Client drivers and token awareness

To enable communication between client applications and a Cassandra cluster, multiple client drivers for Cassandra exist. Cassandra supports two communication protocols, the legacy Thrift interface [22], and the newer native binary protocol that enables use of the Cassandra Query Language (CQL) [6], resembling SQL. Different drivers can therefore use different protocols. Popular drivers include Astyanax, which uses the Thrift interface, and the Java driver from DataStax which only supports CQL. As these drivers can get the token information from the nodes during initialization, they can be configured to be token aware. This means that the client driver can make a qualified choice about which nodes to issue requests to, based on the data requested.

2.2 Load balancing techniques in distributed systems

There exist numerous ideas and techniques to improve load balancing in distributed systems. The problem is often to decide on a good trade-off between exchanging a lot of communication between servers and clients and making guesses and approximations on the traffic. Intuitively, more information makes it easier to make good decisions, but information passing can be costly. This section briefly discusses previous work, ideas and algorithms for load balancing techniques in distributed systems, not necessarily with focus on improving the tail latency.

2.2.1 The power of d choices

Consider a system with $n$ requests and $n$ servers to serve them. If each request is dispatched independently and uniformly at random to a server then the maximum load, or the largest number of requests at any server, is approximately $\frac{\log n}{\log \log n}$. Suppose instead that each request gets placed sequentially onto the least loaded (in terms of number of requests enqueued on a server) of $d \geq 2$ servers chosen independently and uniformly at random. It has then been shown that with high probability⁵ the maximum load is instead only $\frac{\log \log n}{\log d} + C$, where $C$ is a constant factor [1] [20]. This means that getting two choices instead of just one leads to an exponential improvement in the maximum load.

⁵ High probability means here at least $1 - \frac{1}{n}$, where $n$ is the number of requests.
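To make the dispatch rule concrete, the following is a minimal Java sketch of choosing the least loaded of d randomly sampled servers (illustrative only, not tied to any particular system; queueLengths is an assumed in-memory view of the servers' current queue lengths):

import java.util.concurrent.ThreadLocalRandom;

public final class PowerOfDChoices {
    /** Picks the index of the server to dispatch to: sample d servers uniformly
     *  at random and return the one with the fewest queued requests. */
    static int pickServer(int[] queueLengths, int d) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int best = rnd.nextInt(queueLengths.length);
        for (int i = 1; i < d; i++) {
            int candidate = rnd.nextInt(queueLengths.length);
            if (queueLengths[candidate] < queueLengths[best]) {
                best = candidate;
            }
        }
        return best;
    }
}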

This result demonstrates the power of two choices, which is a commonly used property in load balancing strategies. When referring to this idea the common way to denote it is by SQ(d), meaning shortest-queue-of-d-choices.

2.2.2 Join-Shortest-Queue

The Join-Shortest-Queue (JSQ) algorithm is a popular routing policy used in processor sharing server clusters. In JSQ, an incoming request gets dispatched to the server with the least number of currently active requests. Ties are broken by choosing randomly between the tied servers. JSQ therefore tries to load balance across servers by reducing the chance of one server having multiple jobs while another server has none. This is a greedy policy since the incoming request prefers sharing a server with as few jobs as possible. Figure 2.3 illustrates the algorithm, with the clients at the top, A-C being servers with their respective queues and pending jobs.

An interesting result that was shown by Gupta et al. [15] is that the performance of JSQ on a processor sharing system shows near insensitivity to differences in the job size distribution. This is different from similar routing policies like Least-Work-Left (send the job to the host with the least total work) or Round-Robin, which are highly sensitive to the job size distribution. JSQ is not optimal⁶, but was still shown to have great performance in comparison to algorithms with much higher complexity. A potential drawback with JSQ though, is that as the system grows, the amount of communication over the network between dispatchers and servers could get overwhelming, given that each of the distributed dispatchers will need to obtain the number of jobs at every server before every job assignment.

⁶ In the optimal solution, each incoming job is assigned so as to minimize the mean response time for all jobs currently in the system, assuming there are 0 future arrivals.

2.2.3 Join-Idle-Queue

The Join-Idle-Queue (JIQ) algorithm, described in [19], tries to decouple detection of lightly loaded servers from the job assignment. The idea is to have idle processors inform the dispatchers as they become idle, without interfering with job arrivals. This removes the load balancing work from request processing.

JIQ consists of two parts, the primary and the secondary load balancing problem, which communicate via a data structure called an I-queue. An I-queue is a list of processors that have reported themselves as idle. When a processor becomes idle it joins an I-queue based on a load balancing algorithm. Two load balancing algorithms for this purpose were considered in [19]: Random and SQ(d).

Figure 2.3: The join-shortest-queue algorithm. Clients prefer the server with the shortest queue.

With JIQ-Random an idle processor joins an I-queue uniformly at random, and with JIQ-SQ(d) an idle processor chooses d random I-queues and joins the one with the shortest queue length. If a client does not have any servers in its I-queue it will in turn make a choice based on the SQ(d) algorithm. Figure 2.4 illustrates the algorithm, again with the clients at the top with their respective I-queues, A-F being servers with their respective queues and pending jobs.

It is worth noting that JIQ-Random has the additional advantage of having one-way communication, without requiring messages from the I-queues. Lu et al. showed three interesting results:

JIQ-Random outperforms traditional SQ(2) with respect to mean response time.
JIQ-SQ(2) achieves close to the minimum possible mean response time.
Both JIQ-Random and JIQ-SQ(2) are near-insensitive to job size distribution with processor sharing in a finite system.

2.2.4 Speculative retries

Speculative retries, also denoted eager retries and hedged requests [12], is the process of sending requests to several servers and using the one that responds first. The client initially sends one request to the server that is believed to perform the best, but falls back on sending a secondary request after a delay. The client cancels remaining outstanding requests once a result is received.

Figure 2.4: The join-idle-queue algorithm. Servers join an I-queue based on the power of d choices algorithm. If a client does not have any servers in its I-queue it will in turn make a choice based on the power of d choices algorithm.

Implementing speculative retries adds some overhead, but can still give latency-reduction effects while increasing load only modestly. A way to achieve this is by waiting to send a second request until the first one has been outstanding for more than the 95th or 99th percentile expected latency for that type of request. This limits the additional load to only a couple of percent (~1-5%) while substantially reducing the tail latency, since the pending request might for example end up in a several second timeout. Speculative retries were implemented in Cassandra with the default of sending the next request at the 99th percentile [10].

2.2.5 Tied requests

Dean and Barroso [12] stated that instead of letting the client choose according to the SQ(d) algorithm, you should let the request be sent to multiple servers simultaneously while making sure that the servers are allowed to communicate updates on the status of the request with each other. These requests where servers use status updates are called tied requests. As soon as one server starts processing a request, it sends a cancellation message to the other servers ("ties"), which keeps the client out of the loop for the cancel logic. The corresponding requests, if still enqueued on the other servers, can then be aborted immediately or be deprioritized.
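As a concrete illustration of the speculative retry idea from section 2.2.4, the following is a minimal Java sketch (illustrative only, not code from Cassandra): the backup request is only fired if the primary has not responded within a chosen hedge delay, for example the observed 99th percentile latency.

import java.util.concurrent.*;
import java.util.function.Supplier;

public final class HedgedRead {

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Sends the primary request immediately and, if no response has arrived
     *  after hedgeDelayMillis, also sends a backup request. The returned future
     *  completes with whichever response arrives first. */
    static <T> CompletableFuture<T> hedged(Supplier<T> primary,
                                           Supplier<T> backup,
                                           long hedgeDelayMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();

        CompletableFuture.supplyAsync(primary).whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else {
                result.completeExceptionally(error);
            }
        });

        SCHEDULER.schedule(() -> {
            if (!result.isDone()) {                        // primary is late: hedge
                CompletableFuture.supplyAsync(backup).whenComplete((value, error) -> {
                    if (error == null) {
                        result.complete(value);            // first successful reply wins
                    }
                });
            }
        }, hedgeDelayMillis, TimeUnit.MILLISECONDS);

        return result;
    }
}

A real implementation would additionally cancel the still-outstanding request once a result is received, as described above.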

2.3 The C3 algorithm

The C3 algorithm, described in [23], is a replica selection algorithm for Cassandra usage. Suresh et al. argue that replica selection is an overlooked process which should be a cause for concern. They argue that putting mechanisms such as speculative retries on top of bad replica selection may increase system utilization for little benefit.

C3 tries to solve the problem by using two concepts. Firstly it uses additional feedback from server nodes in order for the clients to rank them and prefer faster ones. Secondly, the clients implement a rate control mechanism to prevent nodes from being overwhelmed. A note worth making is that a client in the C3 design is actually the coordinator node in Cassandra, so the entire algorithm is implemented server side. The current implementation is in Cassandra version 2.0.0.

2.3.1 Replica ranking

In the C3 replica ranking, the clients rank the server nodes using a scoring function, just like the dynamic snitch, with the score working as a measure of the latency to expect from the node in question. Clients prefer lower scores, which correspond to faster nodes, for each request.

Instead of only using the latency, the C3 scoring function tries to minimize the product of the job queue size⁷ $q_s$ and the service time $1/\mu_s$ (the time to fetch the requested rows) across every server $s$. Along with each response to a client, the servers send back additional information about their queue sizes and service times. The queue size is recorded after a request has been served and when the response is about to be returned. To make a better forecast, the values are smoothed with EWMAs, denoting the new values $\bar{q}_s$ and $\bar{\mu}_s$. In addition to these values, the response time $R_s$ (i.e. the difference between the latency for the entire request and the service time) is also recorded and smoothed.

⁷ The job queue size refers to the number of pending requests at a server.

To account for other clients in the system as well as ongoing requests, each client also maintains, for each server $s$, an instantaneous count of its outstanding requests $os_s$ (requests for which a response is yet to be received). It is assumed that each client knows how many other clients there are in the system ($n$). The clients then make an estimate of the queue size of each server as:

$$\hat{q}_s = \frac{os_s}{n} + \bar{q}_s + 1 \qquad (2.1)$$

where the $os_s/n$ term is referred to as the concurrency compensation. The idea behind the concurrency compensation is that clients will account for the scenario of multiple clients concurrently issuing requests to the same server. Clients with a higher value of $os_s$ will therefore give a higher estimate of the queue size at $s$ and rank it lower than a client with fewer requests to $s$. The result is that clients with a higher demand will be more likely to rank $s$ lower than clients with a lighter demand.

Using this estimation, clients compute the queue size to service rate ratio ($\hat{q}_s / \bar{\mu}_s$) of each server and rank them accordingly. However, a function linear in $\hat{q}_s$ is not sufficient, as it would demand a rather large increase in queue size in order for a client to switch back to a slower server again, which could result in accumulation of jobs at the faster nodes. Instead, C3 penalizes longer queue lengths by raising the $\hat{q}_s$ term to a higher power, $b$: $(\hat{q}_s)^b / \bar{\mu}_s$. For higher values of $b$, clients are less greedy about preferring a server with a lower service time as the $(\hat{q}_s)^b$ term will dominate the scoring function more strongly. In C3, $b$ is set to 3, yielding a cubic function. This results in a final scoring function:

$$\Psi_s = \bar{R}_s + (\hat{q}_s)^3 / \bar{\mu}_s \qquad (2.2)$$

where $\bar{R}_s$ and $\bar{\mu}_s$ are the EWMAs of the response time and service rate and $\hat{q}_s$ is the queue size estimate described in equation 2.1.

2.3.2 Rate control

To prevent exceeding server capacity, clients incorporate a rate limiting mechanism inspired by the congestion control in the CUBIC TCP implementation [16]. This mechanism is decentralized as clients do not inform each other of their demands of a server. Every client uses a rate limiter for each server which limits the number of requests sent within a configured time window (measured in milliseconds). The limit is referred to as the sending rate (srate). By letting the clients track the number of responses received from a server in the same window (the receive rate, rrate) the rate limiter adapts and adjusts srate to match the rrate of the server.

When a client receives a response from a server $s$, the client compares the current srate and rrate for $s$. If srate is found to be lower than rrate, the client increases its rate according to a cubic function:

$$srate \leftarrow \gamma \left( T - \sqrt[3]{\frac{\beta R_0}{\gamma}} \right)^3 + R_0 \qquad (2.3)$$

where $T$ is the elapsed time since the last rate decrease, and $R_0$ is the rate at the time of the last rate decrease. If the rrate is lower than the srate, the client instead decreases its srate multiplicatively by $\beta$, in C3 set to 0.2. The value $\gamma$ represents a scaling factor and is used to set the desired duration of the saddle region. Additionally a cap for the step size is set by a parameter $s_{max}$. The scaling factor $\gamma$ in C3 is set to 100 milliseconds and the cap size is set to 10.

To get a better understanding of the properties of the rate controlling function, consider Figure 2.5. The proposed benefit of using this function is mostly the configurable saddle region. While the sending rate is significantly lower than the saturation rate, the client will increase the rate aggressively (low rate region). When the sending rate is close to the perceived saturation point of the server, that is, $R_0$, the client stabilizes its sending rate and increases it conservatively (saddle region). Lastly, when the client has spent enough time in the stable region, it will again increase its rate aggressively, probing for more capacity (optimistic probing region).

Figure 2.5: Growth curve for the rate control function, with a low rate region, a saddle region around $R_0$, and an optimistic probing region.

2.3.3 Notes on the C3 implementation

Some notes are worth making regarding the C3 algorithm. Firstly, C3 will always route requests solely based on the replica scoring. This means that if the coordinator already has the requested data locally, it might route the request to a remote node, if that node has a better score than the coordinator itself.

Secondly, although all replicas get sorted, C3 will stop processing as soon as it has found the best replica that is not rate limited and put it first in queue for request processing. This means that when using consistency level QUORUM, i.e. sending multiple requests, only the data request will be rate limited, leaving the digest requests unaffected by the rate limiting part of C3.
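To summarize the mechanics of sections 2.3.1-2.3.3, the following is a minimal Java sketch of the scoring, the cubic rate increase and the stop-at-first-non-limited selection (illustrative only; names, the single smoothing constant and the fallback when all replicas are limited are assumptions and do not mirror the actual Cassandra/C3 source):

import java.util.Comparator;
import java.util.List;

public final class C3Sketch {

    static final double ALPHA = 0.9;        // assumed EWMA smoothing factor

    static final class ReplicaState {
        double ewmaQueueSize;               // smoothed queue size reported by the server
        double ewmaServiceRate;             // smoothed service rate, requests per ms
        double ewmaResponseTime;            // smoothed response time, in ms
        int outstanding;                    // requests still in flight to this replica

        /** Folds the feedback piggy-backed on a response into the EWMAs. */
        void onResponse(double queueSize, double serviceRate, double responseTimeMs) {
            ewmaQueueSize = ALPHA * queueSize + (1 - ALPHA) * ewmaQueueSize;
            ewmaServiceRate = ALPHA * serviceRate + (1 - ALPHA) * ewmaServiceRate;
            ewmaResponseTime = ALPHA * responseTimeMs + (1 - ALPHA) * ewmaResponseTime;
            outstanding--;
        }

        /** Score of equation (2.2); lower is better. n is the number of clients. */
        double score(int n) {
            double qHat = (double) outstanding / n + ewmaQueueSize + 1;  // equation (2.1)
            return ewmaResponseTime + Math.pow(qHat, 3) / ewmaServiceRate;
        }

        boolean rateLimiterAllows() { return true; }   // placeholder for the rate limiter
    }

    /** Cubic srate growth of equation (2.3): t is the time since the last rate
     *  decrease and r0 the sending rate at that point. */
    static double cubicIncrease(double gamma, double beta, double t, double r0) {
        double k = Math.cbrt(beta * r0 / gamma);       // inflection point of the curve
        return gamma * Math.pow(t - k, 3) + r0;
    }

    /** Ranks replicas by score and returns the best one whose rate limiter still
     *  has budget; stops at the first non-limited replica as described above. */
    static ReplicaState pick(List<ReplicaState> replicas, int n) {
        replicas.sort(Comparator.comparingDouble((ReplicaState r) -> r.score(n)));
        for (ReplicaState r : replicas) {
            if (r.rateLimiterAllows()) {
                return r;
            }
        }
        return replicas.get(0);                        // all limited: fall back to best score
    }
}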


Chapter 3

Method

Evaluating performance is not a trivial task. While the focus in this thesis was on improving the tail latency, it was important not to achieve this by sacrificing the average case performance, i.e. the average latency of a request. A good starting point was to implement C3 in the Cassandra version that Spotify uses, to try and verify whether the performance gains seen by the C3 authors in version 2.0.0 could also be seen in the newer version, despite the version gap.

3.1 Tools for testing

This section describes tools used while implementing the algorithm and evaluating Cassandra performance. In the process of benchmarking, guidelines and advice from DataStax [8] were adhered to.

3.1.1 The cassandra-stress tool

The cassandra-stress tool is a stress testing utility for Cassandra clusters written in Java which is included in the Cassandra installation [5]. It has three modes of operation: inserting data, reading data and indexed range slicing. For the purpose of this thesis the read mode is what was used for analysis. During a run, the cassandra-stress tool reports information at a configurable interval. Example output can be seen below:

    total, interval_op_rate, interval_key_rate, latency, 95th, 99.9th, elapsed_time
    ..., 1057, 1057, 15.4, 36.4, 571.6, ...
    ..., 1620, 1620, 10.5, 32.9, 475.8, ...
    ..., 2071, 2071, 4.0, 29.4, 380.6, ...
    ..., 2436, 2436, 2.5, 27.1, 378.1, ...

Here, each line reports data for the interval between the last elapsed time and the current elapsed time (default is 10 seconds). The columns of particular interest are latency, 95th and 99.9th. The latency column describes the average latency in milliseconds for each operation during that interval. The 95th and 99.9th columns describe the percentiles, i.e. 95% and 99.9% of the time the latency was less than the number displayed. The cassandra-stress tool is highly configurable, for example it is possible to specify the number of threads, read and write consistency and the size of the records.

3.1.2 The Yahoo Cloud Serving Benchmark

The Yahoo Cloud Serving Benchmark (YCSB) is a framework for benchmarking various cloud serving systems [4]. The YCSB client is a workload generator, and the core workloads included in the installation are a set of workload scenarios to be executed by the generator. Just like the cassandra-stress tool, the YCSB client is highly configurable. For example it is possible to specify the number of threads, read and write consistency, size of the records and format of the output. Below is example output where the format is a time series:

    [READ], 40, ...
    [READ], 50, ...
    [READ], 60, ...
    [READ], 70, ...
    [READ], 80, ...

Here, each line reports the average read latency (in microseconds) at an interval of ten milliseconds.

3.1.3 The Java driver stress tool

The Java driver stress tool is a simple example application that uses the DataStax Java driver to stress test Cassandra - which also stress tests the Java driver as a result. The example tool is by no means a complete stress application and supports only a very limited number of stress scenarios.

3.1.4 Darkloading

To test new versions of Cassandra, Spotify makes use of Darkloading. Darkloading is the process of duplicating the traffic of a certain system and replaying it on another system, to compare the performance.

This is done by snooping on the traffic to the original system and then making a duplicate request to another system.

3.2 Test environment setup

In the process of evaluating the performance of different Cassandra versions, the task was divided into two parts. The first was evaluating performance by using stress tools such as cassandra-stress, YCSB and the Java driver stress tool, which generate the workload and traffic by themselves. The other part was evaluating performance on production workload and traffic, which was obtained with the Darkloading strategy.

Testing on dedicated hardware is preferable as it removes the uncertainty of skewed results due to resource sharing. Therefore, dedicated hardware was used for both cases. For the Cassandra cluster, machines suited for databases were provisioned, with 16 cores, 32 GB of RAM and spinning disks in a RAID 10 configuration. For the machines which send the traffic, dedicated service machines with 32 cores and 64 GB of RAM were used instead.

When different benchmarks were conducted it was deemed interesting to test both consistency level ONE and QUORUM. Testing with speculative retries both enabled and disabled was tried, but as this did not yield any interesting results¹ it was omitted as a testing parameter.

¹ A slight improvement could be seen in the higher percentiles, but as this improved performance equally across different versions, it was deemed irrelevant.

3.2.1 Testing on generated load

When testing on generated workload there were two things in particular desirable to achieve. The first was that enough data was inserted to ensure that the entire dataset does not fit in memory. The other part was running the test long enough, since a cluster has very bad performance at the start of a run (due to the Java Virtual Machine warming up). Due to this the first 15% of all recorded values were discarded, to only record values when the cluster performance had stabilized. The 15% breakpoint was not thoroughly analyzed, but was simply decided appropriate when looking at the raw output from test runs.

3.2.2 Testing on production load

To try and make the comparison between different Cassandra versions as fair as possible, the same production traffic was used in each test run. The data was sampled from the real service and saved to file, making it possible to replay the same data multiple times.

As the traffic was replayed at a fixed rate (in production the rate varies over the day) it only made sense to compare test runs against each other and not against the real production cluster performance.

Chapter 4

Implementation

4.1 Implementing C3 in Cassandra

As Spotify uses a newer version of Cassandra for new applications, their development environment is also suited for that version. Due to the fact that Cassandra 2.0.0 and the newer version are incompatible, C3 was instead implemented directly in the version used at Spotify, making the comparisons and cluster setup easier in the Spotify environment. The implementation did not need much additional reworking of the newer code¹, making the process simple.

¹ To make C3 work in the newer version, moving some method calls was sufficient.

4.2 Implementing C3 in the DataStax Java driver

As previously mentioned, the entire C3 algorithm is implemented server side. However, a client implementation may be preferable as many newer Cassandra client drivers are token aware, meaning that the coordinator node will be able to serve the requested data directly. By implementing C3 in the client, we can send the request to the best replica in the first step, removing the need of going through the coordinator node just to rank the replicas.

With that in mind, the C3 algorithm was implemented in the DataStax Java driver. The Java driver was chosen since it is actively maintained, uses the newer communication protocol and also since it has good support for implementing new load balancing policies.

There were some impediments along the way though. Firstly, the queue size and service time as recorded by the server could not be used as this is an extension in the C3 server code. This means that the replica scoring only used metrics as seen by the clients, which might have had a significant impact on the performance.

Secondly, as the driver code is substantially different from the server code, the parameters set in the C3 server code might not have been suitable values for the client.

4.2.1 Naive implementation

To decide which hosts to send a request to, the driver makes use of a load balancing policy. For each request, the load balancing policy is responsible for returning an iterator containing the hosts to query. This served as a suitable place to implement the replica scoring part of C3. Therefore a new policy called HostScoringPolicy was implemented, responsible for the logic of ranking hosts.

As mentioned earlier, the scoring function was simplified as the metrics from the servers used in the original C3 version were not available. The metrics used in the client-side ranking are the latency for the entire request ($L_s$), the queue size ($q_s$), and the outstanding requests to a host ($os_s$), all as seen by the client. Just like the server implementation, EWMAs were used to smooth the values. The client version of $\hat{q}_s$ is therefore defined just as before:

$$\hat{q}_s = \frac{os_s}{n} + \bar{q}_s + 1 \qquad (4.1)$$

but with the difference that the queue size here is recorded from the client perspective and not by the server itself as in the original C3 implementation. This results in the final client scoring function:

$$\Psi_s = \bar{L}_s + (\hat{q}_s)^3\,\bar{L}_s \qquad (4.2)$$

Here we can notice the big difference that we do not have the service time metric, leaving us with the entire latency of the request as the only measure. The rate limiting part of C3 was however easily plugged in, as the functionality is self contained and not relying on external metrics.

4.3 Benchmarking with YCSB

To confirm that C3 performs as suggested, as well as to verify that the implementation worked as intended, it was desirable to reproduce the results presented

by Suresh et al. in [23]. To achieve this, the YCSB framework was used, just like in the original paper. The test scenario with a read-heavy workload (95% reads, 5% writes) was chosen to be reproduced.

In the original experiment 15 Cassandra nodes were used, with a replication factor of 3, and 500 million records of 1KB each were inserted across the nodes, yielding approximately 100 GB of data per machine. Since the test setup only had 8 Cassandra nodes the record count was modified to be similar to the load in the original experiment. Therefore 250 million records of 1KB each were inserted, yielding close to 100 GB of data per machine.

Just like the original test scenario three YCSB instances were used (running on separate machines), each running 40 threads, yielding a total of 120 generators. Then for each Cassandra version and consistency level, just like in the original test, two million rows were read, five times. The duration of a read run was on the order of minutes, depending on consistency level.

4.4 Benchmarking with cassandra-stress

As the cassandra-stress tool already comes packaged together with the Cassandra installation, C3 was also tested with this tool, to gain further confidence about the performance of C3.

The deployment again consisted of the 8 Cassandra nodes, and one separate service machine running the cassandra-stress tool with the default of 50 threads. 250 million records of 1KB each were inserted across the cluster. Due to a design choice in the cassandra-stress tool², as many rows were read as had been inserted. The duration of a read run was about 5-7 hours depending on consistency level.

² For example, inserting a given number of rows will write rows with key values in the corresponding range, meaning that if you try to read rows of a different magnitude, the keys will not match and the read will fail.

4.5 Benchmarking with the java-driver stress tool

As creating a custom stress tool for the purpose of client evaluation is outside the scope of this thesis, the stress application that comes together with the Java driver was used to evaluate the client implementation of C3. By making some small modifications in the source code of the stress application it was possible to test the different load balancing policies with different consistency levels.

The deployment again consisted of the 8 Cassandra nodes, and 6 service machines, each running 100 threads. 250 million records of 1KB each were inserted across the cluster. For each Cassandra version and consistency level, 100 million rows were read. The duration of a read run was about 5-7 hours depending on consistency level.

4.6 Darkloading

In order to benchmark the performance of C3 under production load, a cluster had to be duplicated. A suitable cluster was decided on with recommendations from Jimmy Mårdell at Spotify. The chosen cluster consists of 8 Cassandra nodes with approximately 130 GB of data per node and 6 service machines sending traffic to the cluster. The read/write ratio of the incoming requests to the service is approximately 97% reads and 3% writes.

To send traffic to the test cluster, two versions of the service client were used. The first version was token aware and used consistency level QUORUM, just like the original service. In the other version the token awareness was replaced by plain round robin, and the consistency level was set to ONE, to try and match the settings that the original C3 was developed with. Due to the service client using the Astyanax client and not the Java driver, it was unfortunately not possible to Darkload the C3 client. Although Astyanax supports a beta version that uses the Java driver under the hood, it only does so for older versions of the Java driver.

For each setup, the sampled traffic was replayed at a configured rate which resulted in a disk I/O utilization of around 50-60%, making sure that the cluster had as much traffic as possible without choking the disks. Note however that even though the same traffic was replayed, writes altered the data in the cluster, potentially affecting some reads, but given the low amount of writes this was deemed to be negligible.

In the Darkloading setup an extension to C3 was also tried, where the coordinator node would always serve the data locally if possible (while using round robin in the client), but as this showed no difference in performance, that particular test was omitted.

Chapter 5

Results

Here we present the results from our different benchmarks. The standard deviation for each measure is marked in all charts. In some charts, where the difference was small, we have omitted the average latencies as the focus lies on improving the tail latency. All exact numbers, including averages, are available in Appendix A.

5.1 Benchmarking with YCSB

Here we present the results from the YCSB runs. The results are the averages of the combined values outputted from the three YCSB instances. In Figure 5.1 we have consistency level ONE to the left and QUORUM to the right.

Figure 5.1: Benchmark of C3 with YCSB, showing mean, 95th, 99th and 99.9th percentile latencies for consistency levels ONE and QUORUM.

5.2 Benchmarking with cassandra-stress

Here we present the results from the cassandra-stress runs. The results are the averages from the single cassandra-stress instance. In Figure 5.2 we have the results for the 95th and 99.9th percentile latencies, with consistency level ONE to the left and QUORUM to the right.

Figure 5.2: Benchmark of C3 with cassandra-stress, showing 95th and 99.9th percentile latencies for consistency levels ONE and QUORUM.

5.3 Benchmarking with the java-driver stress tool

Here we present the results from the java-driver stress runs. The default we compare against is the java-driver with the default LoadBalancingPolicy, which is token aware.

5.3.1 Performance of the C3 client

In Figure 5.3 we have the results for the mean, 95th and 99th percentile latencies, with consistency level ONE to the left and QUORUM to the right. For both the default and the C3 client, the same Cassandra version was running server side.

Figure 5.3: Benchmark of client C3 with the default server-side Cassandra, showing mean, 95th and 99th percentile latencies for consistency levels ONE and QUORUM.

5.4 Darkloading

Here we present the results from the Darkloading runs. First we present the performance with token awareness in the client, followed by the performance with plain round robin.

5.4.1 Performance with token awareness

In Figure 5.4 we have the results for the 95th, 98th, 99th and 99.9th percentile latencies.

Figure 5.4: Darkloading with token awareness, consistency level QUORUM.

5.4.2 Performance with round robin

In Figure 5.5 we have the results for the 95th, 98th, 99th and 99.9th percentile latencies.

Figure 5.5: Darkloading with round robin, consistency level ONE.


Chapter 6

Discussion

6.1 Performance of server side C3

6.1.1 YCSB vs. cassandra-stress

The YCSB stress runs confirm the results from the original experiment, that C3 is superior to the original dynamic snitch. Furthermore we found that regardless of using consistency level ONE or QUORUM (in the original experiment only consistency level ONE was evaluated), C3 proved to reduce both latency and variance across all percentiles.

In our cassandra-stress runs, results were again positive but not at all with the same confidence as in the YCSB runs. Although it would have been reassuring to get more similar results between tools, we want to emphasize the differences between setups. The cassandra-stress runs were read only, whereas the YCSB runs were read heavy. We had a different number of instances running, as well as a different thread count. We also do not have any control over the read patterns in cassandra-stress, which also could contribute to the differing results. Additional YCSB runs similar to the cassandra-stress setup are desirable to see if the difference between results decreases, but due to time constraints we leave this to future work.

6.1.2 Darkloading

When evaluating C3 on production load, results were a bit different from the stress tool results. In the case of the token aware client, we actually saw a little bit (around 100 µs) of overhead in the average case. Not until the 98th percentile did we see an actual improvement, and it was only by a couple of ms, which is not strong enough to suggest an actual performance gain.

We believe that one reason for not seeing much improvement in this case is the fact that the client is token aware. The client will therefore already send the request to a node that has the data, meaning that C3 in some cases will not be able to improve the routing. Darkloading C3 with round robin in the client (and consistency level ONE) actually did improve the results, supporting this claim. Although still having the small 100 µs overhead in the average case, we could now see an improvement already at the 95th percentile, with the 99.9th percentile having improved by about 20%.

Even though we did see this improvement, there is still a big gap in performance gain compared to the stress tool results. This could have several reasons. Firstly, when generating workload, all the records were of equal size (1KB), meaning that all read requests are equally large. In the case of production load, some rows might contain more data than others due to the nature of the Darkloaded service. This means that some reads will have higher latencies, not due to slow servers but due to how the data is structured. The result of this would be that C3 might rank fast servers as slow ones just because they happen to get heavier reads.

Another point worth making is that problems such as garbage collections, where C3 really could improve the performance, commonly do not occur until the cluster has been running for a couple of weeks, which makes it a hard scenario to simulate in the scope of this thesis.

6.2 Performance of client side C3

Although not having the exact metrics like the server C3, the C3 client implementation did lower the tail latency. However, the benchmark showed a lot of variance, making the results inconclusive. Since the variance was present in both the default java-driver version and the C3 implementation, we deem this to be a fault in the benchmark setup and not in the implementation. We suggest that making repeated benchmarks and perhaps tweaking the parameters could give a more conclusive result. However, we are under the impression that C3 in the client could work well, and perhaps be a substitute for token aware clients.

6.3 Conclusion

Given the right conditions the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We would recommend the current implementation in systems where row sizes are homogeneous, as variable size records are not taken into account in the scoring function.

However, we see no problem with extending the algorithm to take into account

6.3 Conclusion

Given the right conditions, the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We recommend the current implementation for systems where row sizes are homogeneous, since variable-size records are not taken into account in the scoring function. However, we see no problem with extending the algorithm to handle variable-size rows: given that one can obtain the size of the data requested, it should be possible to construct a weighted scoring function, but this is outside the scope of this thesis.

We would also argue that C3 will be most effective when the client is not token aware. A client implementation of C3 could resolve this, but the results found in this thesis were too inconclusive to support this claim, and further testing is needed.
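As one possible shape for the weighted scoring function suggested above, the sketch below normalizes each latency sample by the number of bytes the read returned before it enters the EWMAs used by the score. Both the normalization and the parameter names are our own assumptions, not something that was evaluated in this thesis.

/**
 * Sketch of a size-weighted latency sample: instead of feeding raw response
 * times into the replica score, feed a per-byte cost so that a replica is not
 * penalized merely for serving larger rows.
 */
final class SizeWeightedSample {

    private static final double MIN_BYTES = 1.0; // guard against division by zero

    /**
     * @param latencyMs     measured response time for the read
     * @param bytesReturned size of the result for this read
     * @return a size-normalized latency to feed into the latency EWMA
     */
    static double normalize(double latencyMs, long bytesReturned) {
        return latencyMs / Math.max(MIN_BYTES, (double) bytesReturned);
    }
}

The open question is where to obtain bytesReturned cheaply: on the server it is known once the read has completed, whereas a client would have to measure the size of the decoded result, which adds overhead of its own.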


Appendix A

Results from benchmarks

A.1 YCSB

                        System              C3
Average latency (ms)    11.59, σ =          , σ =
   th percentile (ms)   21.22, σ =          , σ =
   th percentile (ms)   30.28, σ =          , σ =
   th percentile (ms)   54.85, σ =          , σ = 2.32

Table A.1: YCSB read latencies with consistency level ONE.

                        System              C3
Average latency (ms)    16.11, σ =          , σ =
   th percentile (ms)   28.46, σ =          , σ =
   th percentile (ms)   40.69, σ =          , σ =
   th percentile (ms)   80.45, σ =          , σ = 4.37

Table A.2: YCSB read latencies with consistency level QUORUM.

A.2 cassandra-stress

                        System              C3
Average latency (ms)    3.93, σ =           , σ =
   th percentile (ms)   27.12, σ =          , σ =
   th percentile (ms)   , σ =               , σ = 23.89

Table A.3: cassandra-stress read latencies with consistency level ONE.

                        System              C3
Average latency (ms)    8.44, σ =           , σ =
   th percentile (ms)   34.54, σ =          , σ =
   th percentile (ms)   , σ =               , σ = 32.18

Table A.4: cassandra-stress read latencies with consistency level QUORUM.

A.3 java-driver stress

                        System (java-driver client)    C3
Average latency (ms)    8.75, σ =                      , σ =
   th percentile (ms)   75.05, σ =                     , σ =
   th percentile (ms)   , σ =                          , σ = 55.80

Table A.5: java-driver stress read latencies with consistency level ONE.

                        System (java-driver client)    C3
Average latency (ms)    14.95, σ =                     , σ =
   th percentile (ms)   , σ =                          , σ =
   th percentile (ms)   , σ =                          , σ = 42.77

Table A.6: java-driver stress read latencies with consistency level QUORUM.

A.4 Darkloading

A.4.1 Token aware

                        System              C3
50th percentile (ms)    0.90, σ =           , σ =
   th percentile (ms)   1.12, σ =           , σ =
   th percentile (ms)   14.59, σ =          , σ =
   th percentile (ms)   27.31, σ =          , σ =
   th percentile (ms)   36.97, σ =          , σ =
   th percentile (ms)   70.61, σ =          , σ = 7.47

Table A.7: Darkloading read latencies with consistency level QUORUM.

A.4.2 Round robin

                        System              C3
50th percentile (ms)    0.88, σ =           , σ =
   th percentile (ms)   1.11, σ =           , σ =
   th percentile (ms)   12.27, σ =          , σ =
   th percentile (ms)   23.25, σ =          , σ =
   th percentile (ms)   31.85, σ =          , σ =
   th percentile (ms)   61.96, σ =          , σ = 3.95

Table A.8: Darkloading read latencies with consistency level ONE.


