University of Warsaw January 12, 2011
Key issues
- The growing CPU vs. I/O gap
- Contemporary systems must serve millions of users
- Electricity consumption adds up to significant costs
Key issues
Is there a way to exploit the CPU vs. I/O gap to the user's advantage?
Observations
- Many industry problems exhibit massive data parallelism with relatively small computational demands
- A fair number of real-life problems depend heavily on efficient, distributed key-value stores spanning several gigabytes
- Such stores often contain millions of small items (on the order of kilobytes each)
A motivating example: Twitter
A wonderfully popular service, Twitter has all the above-mentioned properties. Each tweet is limited to 140 B. Fairly little processing is performed on the tweets, yet the search system alone is stressed by an average of 12,000 queries per second, and a stream of over a thousand tweets per second enters the system. A high-performance key-value store is crucial to the operation. At the same time, the cost of running a conventional cluster capable of meeting this demand is extremely high.
Disclaimer: to my knowledge, FAWN is not being used at Twitter, but it would probably make a lot of sense if it were.
The problem, defined
To engineer a fast, scalable key-value store for small (hundreds to thousands of bytes) items. This store is expected to:
- respond to thousands of random queries per second (QPS) and upwards
- conserve power as much as possible
- meet service-level agreements regarding latency
- scale up well as the system grows
- scale down well as demand fluctuates during operating hours
Possible solutions (1)
A cluster of traditional servers with HDDs as storage. Problems:
- very poor performance for random accesses, unless RAID or a similar disk array is used
- if RAID is used, both the initial price and the total cost of ownership skyrocket
- most of the power consumption is fixed, so not much power is conserved during low-load periods
Possible solutions (2)
A cluster of traditional servers with RAM as storage (think memcached). Problems:
- very high cost in terms of $/GB
- robustness is lost unless additional systems are employed
- power consumption is just as bad as before
Possible solutions (3)
A cluster of traditional servers with SSDs as storage. Problems:
- while random reads are great, random writes are terrible (BerkeleyDB running on an SSD averages just 0.07 MBps)
- power consumption is just as bad as before
Possible solutions (4)
A combination of the above. Problems: a combination of the above :)
Introducing FAWN
A slightly different approach: let's use energy-efficient, wimpy processors coupled with fast SSD storage, and design a custom key-value store exploiting the characteristics of flash storage. That way, power consumption can be kept to a minimum while retaining high performance and robustness. The resulting system has a lower total cost of ownership and good scalability.
Anatomy of a key-value data store
- A request is either a get, a put, or a delete
- Keys are 160-bit integers
- Values are small blobs (typically between 256 B and 1 KB)
- Each request pertains to a single key-value pair; there is no relational overlay at this level
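The slides do not show the client-facing interface, so the following is only a minimal sketch of the request types listed above; the class, the method signatures, and the use of SHA-1 to derive 160-bit keys are illustrative assumptions.

```python
import hashlib

KEY_BITS = 160  # keys are 160-bit integers

def make_key(app_key: bytes) -> int:
    # One way to obtain a 160-bit key: SHA-1 of an application-level key
    # (an assumption made for illustration).
    return int.from_bytes(hashlib.sha1(app_key).digest(), "big")

class KeyValueStore:
    """The three request types supported by the store (hypothetical interface)."""

    def get(self, key: int) -> bytes | None:
        raise NotImplementedError

    def put(self, key: int, value: bytes) -> None:
        raise NotImplementedError

    def delete(self, key: int) -> None:
        raise NotImplementedError
```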
Overview
- The cluster is composed of front-ends and back-ends
- Front-ends forward requests to the appropriate back-ends and return responses to clients
- The front-ends are responsible for maintaining order in the cluster
- Back-ends run the data stores (one per key range)
- Together the machines form a single key-value store
Front-end
Responsibilities:
- passing requests and responses
- keeping track of back-ends' Virtual IDs and their mapping to key ranges
- managing joins and leaves
Example configuration used for evaluation: Intel Atom CPU (27 W)
Back-end
A back-end runs one data store per key range. Each data store supports the basic key-value requests as well as maintenance operations (Split, Merge, Compact).
Example configuration used for evaluation:
- AMD Geode LX CPU (500 MHz)
- 256 MB DDR SDRAM (400 MHz)
- 100 Mbps Ethernet
- SanDisk Extreme IV CompactFlash (4 GB)
Back-ends, cont.
- Back-ends are organized in a logical ring which coincides with the key space (mod 2^160)
- Each back-end is assigned a fixed number of Virtual IDs in hopes of maintaining balance
- A Virtual ID is the lowest key a node handles
- This allows for a well-defined successor relation on keys and virtual nodes (see the sketch below); more on this later
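Below is a minimal sketch of the ring bookkeeping implied by these points, assuming (as stated above) that a Virtual ID is the lowest key its node handles; the class and method names are hypothetical.

```python
import bisect

RING = 2 ** 160  # the key space: 160-bit integers, arithmetic mod 2^160

class VirtualRing:
    """Sketch of the logical ring of virtual IDs (hypothetical names)."""

    def __init__(self) -> None:
        self.vids: list[int] = []        # sorted virtual IDs
        self.owner: dict[int, str] = {}  # virtual ID -> physical back-end

    def add_vid(self, vid: int, node: str) -> None:
        bisect.insort(self.vids, vid)
        self.owner[vid] = node

    def owning_vid(self, key: int) -> int:
        """Largest virtual ID <= key, since a VID is the lowest key its node
        handles; keys below every VID wrap around to the highest VID."""
        i = bisect.bisect_right(self.vids, key % RING) - 1
        return self.vids[i]              # i == -1 wraps to the last (highest) VID

    def successor(self, vid: int) -> int:
        """Next virtual ID clockwise on the ring (iterated, e.g., by replication chains)."""
        i = bisect.bisect_right(self.vids, vid) % len(self.vids)
        return self.vids[i]

    def node_for(self, key: int) -> str:
        return self.owner[self.owning_vid(key)]
```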
Peculiarities of flash storage
Flash media differ from traditional HDDs in a number of ways, some of which seriously impact persistent data store designs.
- Random reads are nearly as fast as sequential reads
- Random writes are very inefficient (a whole erase block needs to be erased and rewritten)
- Sequential writes perform admirably
- On modern devices, semi-random writes (random appends to a small number of files) are nearly as fast as sequential writes
These features can be exploited by using a log-structured data store.
FAWN-DS
To take advantage of the properties of flash storage, the FAWN-DS data store is structured as follows:
- The key-value mappings are stored in a Data Log on the flash medium. This log is append-only.
- To provide fast random access, a hash index mapping into the data log is kept in RAM.
- In order to reduce the memory footprint, only a small fragment of each key is kept in the index, at the cost of a (configurable) chance of needing more than one flash access.
- To reclaim unused storage space, a Compact operation is introduced. It is designed to be as efficient as possible on flash, using only bulk sequential writes.
- To facilitate reconstruction of the in-memory index, checkpointing is used.
Lookup (figure from the FAWN paper omitted)
Lookup, cont.
Two smaller numbers are extracted from the key:
- the index bits: the lowest i bits
- the key fragment: the next lowest k bits
The index bits serve as an index into the in-memory hash index. If the bucket pointed to by the index bits is valid and the key fragments match, the data log entry is retrieved and the full keys are compared. If the keys match, the record is returned; otherwise the next bucket in the hash chain is examined as above. If nothing is found, an appropriate "not found" response is generated.
Lookup, now in pseudocode! (the original listing is not reproduced here; a sketch follows below)
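The following is a runnable sketch of the lookup path just described. The entry format (20-byte key, 4-byte length, value), the constants and the chained-bucket layout are illustrative assumptions, not FAWN-DS's exact design, and a bytearray stands in for the append-only data log kept on flash.

```python
from dataclasses import dataclass, field

I_BITS = 16   # index bits: select a slot of the in-memory hash index
K_BITS = 15   # key-fragment bits stored in each bucket

def split_key(key: int) -> tuple[int, int]:
    index_bits = key & ((1 << I_BITS) - 1)            # lowest i bits
    fragment = (key >> I_BITS) & ((1 << K_BITS) - 1)  # next lowest k bits
    return index_bits, fragment

@dataclass
class Bucket:
    valid: bool
    fragment: int   # K_BITS-wide key fragment
    offset: int     # byte offset of the entry in the data log

@dataclass
class FawnDSSketch:
    log: bytearray = field(default_factory=bytearray)   # stand-in for the on-flash data log
    # one chain of buckets per index slot
    index: list = field(default_factory=lambda: [[] for _ in range(1 << I_BITS)])

    def _read_entry(self, offset: int) -> tuple[int, bytes]:
        full_key = int.from_bytes(self.log[offset:offset + 20], "big")
        length = int.from_bytes(self.log[offset + 20:offset + 24], "big")
        value = bytes(self.log[offset + 24:offset + 24 + length])
        return full_key, value

    def get(self, key: int) -> bytes | None:
        slot, fragment = split_key(key)
        for bucket in self.index[slot]:            # walk the hash chain
            if bucket.valid and bucket.fragment == fragment:
                full_key, value = self._read_entry(bucket.offset)
                if full_key == key:                # fragments can collide; verify the full key
                    return value
        return None                                # key not present: "not found" response
```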
Store and Delete
When a value is inserted into the store, it is simply appended to the data log and the corresponding bucket is changed to point to the new record; its valid bit is set to true. When a record is to be deleted, a delete entry is appended to the log (for fault tolerance) and the valid bit in the corresponding bucket is set to false. Actual storage space is not reclaimed until a Compact is performed.
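Building directly on the FawnDSSketch class from the lookup sketch above (so it is not standalone), here is how put and delete might look under the same assumed format: put appends to the log and re-points the matching bucket, delete appends a tombstone and clears the valid bit.

```python
class FawnDSSketchRW(FawnDSSketch):
    def _append(self, key: int, value: bytes) -> int:
        offset = len(self.log)
        self.log += key.to_bytes(20, "big") + len(value).to_bytes(4, "big") + value
        return offset                                  # sequential append is the only write pattern

    def put(self, key: int, value: bytes) -> None:
        slot, fragment = split_key(key)
        offset = self._append(key, value)
        for bucket in self.index[slot]:                # overwrite: re-point an existing bucket
            if bucket.fragment == fragment and self._read_entry(bucket.offset)[0] == key:
                bucket.valid, bucket.offset = True, offset   # the old log entry is now orphaned
                return
        self.index[slot].append(Bucket(True, fragment, offset))

    def delete(self, key: int) -> None:
        slot, fragment = split_key(key)
        self._append(key, b"")                         # delete entry logged for fault tolerance
        for bucket in self.index[slot]:
            if bucket.valid and bucket.fragment == fragment and self._read_entry(bucket.offset)[0] == key:
                bucket.valid = False                   # space is reclaimed only by Compact
                return
```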
Maintenance operations
- Split is issued when a key range is divided as a new virtual node joins the ring. It scans the data log sequentially and writes the appropriate entries out into a new log.
- Merge merges two data stores into one encompassing the combined key range. It achieves this by copying entries from one log into the other.
- Compact copies the valid data store entries into a new log, skipping those that have been orphaned by later puts and those that were explicitly deleted.
Owing to the append-only design, these operations can be performed concurrently with normal requests, locking only to switch data stores while finalizing maintenance.
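Continuing the same sketch, Compact can be expressed as a sequential scan of the old log that copies only entries still referenced by a valid bucket; Split and Merge would be analogous scans that filter by key range or concatenate two logs. This is an illustration under the assumed entry format, not the paper's implementation.

```python
class FawnDSSketchCompact(FawnDSSketchRW):
    def compact(self) -> None:
        # map: old offset -> bucket, for every entry still referenced by a valid bucket
        live = {b.offset: b for chain in self.index for b in chain if b.valid}
        new_log, pos = bytearray(), 0
        while pos + 24 <= len(self.log):                  # sequential scan of the old log
            length = int.from_bytes(self.log[pos + 20:pos + 24], "big")
            end = pos + 24 + length
            if pos in live:                               # skip orphaned and deleted entries
                live[pos].offset = len(new_log)           # re-point the bucket at the new log
                new_log += self.log[pos:end]              # bulk sequential write
            pos = end
        self.log = new_log
        # drop buckets whose entries were deleted
        self.index = [[b for b in chain if b.valid] for chain in self.index]
```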
FAWN-KV
In order to provide a robust, scalable service, the back-ends running FAWN-DS instances are joined together and managed by front-end nodes, which in turn, in industry deployments, would be connected to a master node.
- Fault tolerance is introduced via replication
- Each front-end is ideally responsible for some 80 back-ends and manages joins and leaves, exposing a simple put/get/delete interface
- Additionally, front-ends can route requests between themselves and cache responses, leaving the master node as an optimization and a convenience rather than a single point of failure
Life-cycle of a request (figure from the FAWN paper omitted)
Life-cycle of a request, elaborated
- Each front-end is assigned a contiguous portion of the key space
- Upon receiving a request, it either processes it using its managed back-ends or forwards it if the key belongs to a different front-end
- Front-ends maintain a list of virtual nodes and their corresponding addresses, and thus can instantly translate the request into the appropriate calls
- While the request is processed by the back-ends, the front-end ensures replication is maintained
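A small sketch of the forward-or-handle decision described above; the peer map, the callbacks and all names are assumptions made for illustration.

```python
from typing import Callable

def handle_request(key: int,
                   my_lower: int, my_upper: int,          # this front-end's contiguous key range
                   peers: dict[int, str],                 # key-range lower bound -> front-end address
                   process_locally: Callable[[int], bytes],
                   forward_to: Callable[[str, int], bytes]) -> bytes:
    if my_lower <= key < my_upper:
        return process_locally(key)                  # dispatch to the managed back-end chain
    owned = [lb for lb in peers if lb <= key]
    owner_lb = max(owned) if owned else max(peers)   # wrap around the key space
    return forward_to(peers[owner_lb], key)          # hand off to the owning front-end
```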
Replication in Chains (figure from the FAWN paper omitted)
Replication in Chains, cont.
- Each key defines a chain in the virtual node ring
- A fixed number of nodes maintain copies of the mapping
- These nodes are obtained by iterating the successor function starting from the key
- The first node holding a replica is the head of the chain; the last node is the tail
- Every put request is issued to the head of the chain and waits for an acknowledgement from the tail; every get is passed to the tail
This ensures consistency and proper ordering of changes throughout the chain.
Replication of a put
After receiving the put request, the head forwards the put along the chain and waits for an acknowledgement. If all goes well, the tail acknowledges both to the front-end and, recursively, to its predecessor.
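The following runnable sketch imitates that propagation with direct method calls standing in for network messages; all names are assumptions, and the real system acknowledges to the front-end from the tail rather than returning values up a call stack.

```python
from dataclasses import dataclass, field

@dataclass
class ChainNode:
    name: str
    successor: "ChainNode | None" = None           # next replica in the chain; None at the tail
    store: dict = field(default_factory=dict)

    def put(self, key: int, value: bytes) -> str:
        self.store[key] = value                    # apply the update locally
        if self.successor is None:                 # tail: acknowledge (also to the front-end)
            return f"ack({key}) from tail {self.name}"
        ack = self.successor.put(key, value)       # forward the put down the chain
        return ack                                 # the ack propagates back toward the head

    def get(self, key: int) -> bytes | None:
        return self.store.get(key)                 # gets are served by the tail

# Example: a chain of three replicas; the front-end sends the put to the head
tail = ChainNode("C")
middle = ChainNode("B", successor=tail)
head = ChainNode("A", successor=middle)
ack = head.put(42, b"value")                       # returns once the tail has acknowledged
assert tail.get(42) == b"value"
```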
How a join is handled
When a (virtual) node joins the ring, precisely one key range is split in two. To maintain replication, the following happens:
- the current tail transmits its whole log to the new node (pre-copy)
- the front-end informs the nodes in the chain of the join via a chain membership message
- in response to this message, nodes flush updates received during pre-copy down the chain
Please refer to the paper for details on how updates arriving during the flush are handled, as well as the special cases of joining as head or tail.
What happens when a node leaves
When a node leaves the ring, each node that is supposed to take over its replicas in essence joins the replica chain at a different position in the key space, so the protocol is largely the same as for a join. At this stage, failure detection is achieved by heartbeats: if a node misses a set number of heartbeat signals, the front-end initiates a leave and appropriate action is taken.
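A minimal sketch of heartbeat-based failure detection as described above; the threshold, period and bookkeeping are illustrative assumptions.

```python
import time

MISS_THRESHOLD = 3        # heartbeats a node may miss before it is declared failed
HEARTBEAT_PERIOD = 1.0    # seconds between expected heartbeats

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self) -> list[str]:
        """Nodes that have missed MISS_THRESHOLD consecutive heartbeats; the
        front-end would initiate a leave for each of these."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > MISS_THRESHOLD * HEARTBEAT_PERIOD]
```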
Procedure description
FAWN's performance was evaluated under a number of criteria:
- single-node efficiency (compared to baseline hardware capabilities)
- cluster performance (tested on a 21 back-end / 1 front-end system)
- energy efficiency
The results were then compared with a number of more traditional configurations.
Single node performance

Baseline:
  Seq. read    Rand. read   Seq. write   Rand. write
  28.5 MBps    1424 QPS     24 MBps      110 QPS

FAWN:
  Data size   Rand. read (1 KB)   Rand. read (256 B)
  125 MB      51968 QPS           65412 QPS
  1 GB        1595 QPS            1964 QPS
  3.5 GB      1150 QPS            1298 QPS
Gets vs Puts (plot from the FAWN paper omitted)
Cluster performance and power consumption (plot from the FAWN paper omitted)
Important points on power consumption
- The plot displayed does not take into account the front-end (a further 27 W)
- The networking hardware used takes 20 W to operate (included in the plotted figures)
- Even factoring in the front-end, the system achieved 330 queries per Joule; a desktop computer can provide about 50 queries per Joule using an SSD
CDF of Query Latency (plot from the FAWN paper omitted)
Comparison with alternative approaches (projected)
(table from the FAWN paper omitted)
Important point: the FAWN entries in this table are projected performance figures for systems built using state-of-the-art components.
Solution space for system builders (projected) (figure from the FAWN paper omitted)
Conclusions
- FAWN is demonstrated to be a viable approach to providing cost-efficient data stores
- Using wimpy processors in an array can reduce power consumption while retaining performance
- Barring breakthrough discoveries, FAWN-like technologies are expected to deliver the lowest TCO for a large portion of the problem space
- Larger-scale testing is necessary to establish the correctness of these claims and to demonstrate scalability
References
[FAWN] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of ACM SOSP 2009, Big Sky, MT, USA, October 2009.
All images are taken from the FAWN paper.