University of Warsaw January 12, 2011
Key issues
- The growing CPU vs. I/O gap
- Contemporary systems must serve millions of users
- Electricity consumption adds up to significant costs
Key issues
Is there a way to exploit the CPU vs. I/O gap to the user's advantage?
Observations
- Many industry problems exhibit massive data parallelism with relatively small computational demands
- A fair number of real-life problems depend heavily on efficient, distributed key-value stores spanning several gigabytes
- Such stores often contain millions of small items (on the order of kilobytes each)
A motivating example: Twitter
A wonderfully popular service, Twitter has all the above-mentioned properties. Each tweet is limited to 140 B. Fairly little processing is performed on the tweets, yet the search system alone is stressed by an average of 12,000 queries per second, and a stream of over a thousand tweets per second enters the system. A high-performance key-value store is crucial to the operation. At the same time, the cost of running a conventional cluster capable of meeting this demand is extremely high.
Disclaimer: to my knowledge, FAWN is not being used at Twitter, but it would probably make a lot of sense if it were.
The problem, defined
To engineer a fast, scalable key-value store for small (hundreds to thousands of bytes) items. This store is expected to:
- respond to thousands of random queries per second (QPS) and upwards
- conserve power as much as possible
- meet service-level agreements regarding latency
- scale up well as the system grows
- scale down well as demand fluctuates during operating hours
Possible solutions (1)
A cluster of traditional servers with HDDs as storage. Problems:
- very poor performance for random accesses, unless RAID or a similar disk array is used
- if RAID is used, both the initial price and the total cost of ownership skyrocket
- most of the power consumption is fixed, so not much power is conserved during low-load periods
Possible solutions (2)
A cluster of traditional servers with RAM as storage (think memcached). Problems:
- very high cost in terms of $/GB
- robustness is lost unless additional systems are employed
- power consumption is just as bad as before
Possible solutions (3)
A cluster of traditional servers with SSDs as storage. Problems:
- while random reads are great, random writes are terrible (BerkeleyDB running on an SSD averages just 0.07 MBps)
- power consumption is just as bad as before
Possible solutions (4)
A combination of the above. Problems: a combination of the above :)
Introducing FAWN
A slightly different approach: let's use energy-efficient, wimpy processors coupled with fast SSD storage, and design a custom key-value store exploiting the characteristics of flash storage. That way, power consumption can be kept to a minimum while retaining high performance and robustness. The resulting system has a lower total cost of ownership and good scalability.
Anatomy of a key-value data store
- A request is either a get, a put, or a delete
- Keys are 160-bit integers
- Values are small blobs (typically between 256 B and 1 KB)
- Each request pertains to a single key-value pair; there is no relational overlay at this level
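The slides do not show the client-facing interface, so the following is only a minimal sketch of the request types listed above; the class, the method signatures, and the use of SHA-1 to derive 160-bit keys are illustrative assumptions.

```python
import hashlib

KEY_BITS = 160  # keys are 160-bit integers

def make_key(app_key: bytes) -> int:
    # One way to obtain a 160-bit key: SHA-1 of an application-level key
    # (an assumption made for illustration).
    return int.from_bytes(hashlib.sha1(app_key).digest(), "big")

class KeyValueStore:
    """The three request types supported by the store (hypothetical interface)."""

    def get(self, key: int) -> bytes | None:
        raise NotImplementedError

    def put(self, key: int, value: bytes) -> None:
        raise NotImplementedError

    def delete(self, key: int) -> None:
        raise NotImplementedError
```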
Overview
- The cluster is composed of front-ends and back-ends
- Front-ends forward requests to the appropriate back-ends and return responses to clients
- The front-ends are responsible for maintaining order in the cluster
- Back-ends run the data stores (one per key range)
- Together the machines form a single key-value store
Front-end
Responsibilities:
- passing requests and responses
- keeping track of back-ends' Virtual IDs and their mapping to key ranges
- managing joins and leaves
Example configuration used for evaluation: Intel Atom CPU (27 W)
Back-end
A back-end runs one data store per key range. Each data store supports the basic key-value requests as well as maintenance operations (Split, Merge, Compact).
Example configuration used for evaluation:
- AMD Geode LX CPU (500 MHz)
- 256 MB DDR SDRAM (400 MHz)
- 100 Mbps Ethernet
- SanDisk Extreme IV CompactFlash (4 GB)
Back-ends, cont.
- Back-ends are organized in a logical ring which coincides with the key space (mod 2^160)
- Each back-end is assigned a fixed number of Virtual IDs in hopes of maintaining balance
- A Virtual ID is the lowest key a node handles
- This allows for a well-defined successor relation on keys and virtual nodes (see the sketch below); more on this later
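Below is a minimal sketch of the ring bookkeeping implied by these points, assuming (as stated above) that a Virtual ID is the lowest key its node handles; the class and method names are hypothetical.

```python
import bisect

RING = 2 ** 160  # the key space: 160-bit integers, arithmetic mod 2^160

class VirtualRing:
    """Sketch of the logical ring of virtual IDs (hypothetical names)."""

    def __init__(self) -> None:
        self.vids: list[int] = []        # sorted virtual IDs
        self.owner: dict[int, str] = {}  # virtual ID -> physical back-end

    def add_vid(self, vid: int, node: str) -> None:
        bisect.insort(self.vids, vid)
        self.owner[vid] = node

    def owning_vid(self, key: int) -> int:
        """Largest virtual ID <= key, since a VID is the lowest key its node
        handles; keys below every VID wrap around to the highest VID."""
        i = bisect.bisect_right(self.vids, key % RING) - 1
        return self.vids[i]              # i == -1 wraps to the last (highest) VID

    def successor(self, vid: int) -> int:
        """Next virtual ID clockwise on the ring (iterated, e.g., by replication chains)."""
        i = bisect.bisect_right(self.vids, vid) % len(self.vids)
        return self.vids[i]

    def node_for(self, key: int) -> str:
        return self.owner[self.owning_vid(key)]
```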
Peculiarities of flash storage
Flash media differ from traditional HDDs in a number of ways, some of which seriously impact persistent data store designs.
- Random reads are nearly as fast as sequential reads
- Random writes are very inefficient (a whole erase block needs to be erased and rewritten)
- Sequential writes perform admirably
- On modern devices, semi-random writes (random appends to a small number of files) are nearly as fast as sequential writes
These features can be exploited by using a log-structured data store.
FAWN-DS
To take advantage of the properties of flash storage, the FAWN-DS data store is structured as follows:
- The key-value mappings are stored in a Data Log on the flash medium. This log is append-only.
- To provide fast random access, a hash index mapping into the data log is kept in RAM.
- In order to reduce the memory footprint, only a small fragment of each key is kept in the index, at the cost of a (configurable) chance of needing more than one flash access.
- To reclaim unused storage space, a Compact operation is introduced. It is designed to be as efficient as possible on flash, using only bulk sequential writes.
- To facilitate reconstruction of the in-memory index, checkpointing is used.
Lookup (figure from the FAWN paper omitted)
Lookup, cont.
Two smaller numbers are extracted from the key:
- the index bits: the lowest i bits
- the key fragment: the next lowest k bits
The index bits serve as an index into the in-memory hash index. If the bucket pointed to by the index bits is valid and the key fragments match, the data log entry is retrieved and the full keys are compared. If the keys match, the record is returned; otherwise the next bucket in the hash chain is examined as above. If nothing is found, an appropriate "not found" response is generated.
Lookup, now in pseudocode! (the original listing is not reproduced here; a sketch follows below)
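The following is a runnable sketch of the lookup path just described. The entry format (20-byte key, 4-byte length, value), the constants and the chained-bucket layout are illustrative assumptions, not FAWN-DS's exact design, and a bytearray stands in for the append-only data log kept on flash.

```python
from dataclasses import dataclass, field

I_BITS = 16   # index bits: select a slot of the in-memory hash index
K_BITS = 15   # key-fragment bits stored in each bucket

def split_key(key: int) -> tuple[int, int]:
    index_bits = key & ((1 << I_BITS) - 1)            # lowest i bits
    fragment = (key >> I_BITS) & ((1 << K_BITS) - 1)  # next lowest k bits
    return index_bits, fragment

@dataclass
class Bucket:
    valid: bool
    fragment: int   # K_BITS-wide key fragment
    offset: int     # byte offset of the entry in the data log

@dataclass
class FawnDSSketch:
    log: bytearray = field(default_factory=bytearray)   # stand-in for the on-flash data log
    # one chain of buckets per index slot
    index: list = field(default_factory=lambda: [[] for _ in range(1 << I_BITS)])

    def _read_entry(self, offset: int) -> tuple[int, bytes]:
        full_key = int.from_bytes(self.log[offset:offset + 20], "big")
        length = int.from_bytes(self.log[offset + 20:offset + 24], "big")
        value = bytes(self.log[offset + 24:offset + 24 + length])
        return full_key, value

    def get(self, key: int) -> bytes | None:
        slot, fragment = split_key(key)
        for bucket in self.index[slot]:            # walk the hash chain
            if bucket.valid and bucket.fragment == fragment:
                full_key, value = self._read_entry(bucket.offset)
                if full_key == key:                # fragments can collide; verify the full key
                    return value
        return None                                # key not present: "not found" response
```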
Store and Delete
When a value is inserted into the store, it is simply appended to the data log and the corresponding bucket is changed to point to the new record; its valid bit is set to true. When a record is to be deleted, a delete entry is appended to the log (for fault tolerance) and the valid bit in the corresponding bucket is set to false. Actual storage space is not reclaimed until a Compact is performed.
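Building directly on the FawnDSSketch class from the lookup sketch above (so it is not standalone), here is how put and delete might look under the same assumed format: put appends to the log and re-points the matching bucket, delete appends a tombstone and clears the valid bit.

```python
class FawnDSSketchRW(FawnDSSketch):
    def _append(self, key: int, value: bytes) -> int:
        offset = len(self.log)
        self.log += key.to_bytes(20, "big") + len(value).to_bytes(4, "big") + value
        return offset                                  # sequential append is the only write pattern

    def put(self, key: int, value: bytes) -> None:
        slot, fragment = split_key(key)
        offset = self._append(key, value)
        for bucket in self.index[slot]:                # overwrite: re-point an existing bucket
            if bucket.fragment == fragment and self._read_entry(bucket.offset)[0] == key:
                bucket.valid, bucket.offset = True, offset   # the old log entry is now orphaned
                return
        self.index[slot].append(Bucket(True, fragment, offset))

    def delete(self, key: int) -> None:
        slot, fragment = split_key(key)
        self._append(key, b"")                         # delete entry logged for fault tolerance
        for bucket in self.index[slot]:
            if bucket.valid and bucket.fragment == fragment and self._read_entry(bucket.offset)[0] == key:
                bucket.valid = False                   # space is reclaimed only by Compact
                return
```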
Maintenance operations
- Split is issued when a key range is divided as a new virtual node joins the ring. It scans the data log sequentially and writes the appropriate entries out into a new log.
- Merge merges two data stores into one encompassing the combined key range. It achieves this by copying entries from one log into the other.
- Compact copies the valid data store entries into a new log, skipping those that have been orphaned by later puts and those that were explicitly deleted.
Owing to the append-only design, these operations can be performed concurrently with normal requests, locking only to switch data stores while finalizing maintenance.
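Continuing the same sketch, Compact can be expressed as a sequential scan of the old log that copies only entries still referenced by a valid bucket; Split and Merge would be analogous scans that filter by key range or concatenate two logs. This is an illustration under the assumed entry format, not the paper's implementation.

```python
class FawnDSSketchCompact(FawnDSSketchRW):
    def compact(self) -> None:
        # map: old offset -> bucket, for every entry still referenced by a valid bucket
        live = {b.offset: b for chain in self.index for b in chain if b.valid}
        new_log, pos = bytearray(), 0
        while pos + 24 <= len(self.log):                  # sequential scan of the old log
            length = int.from_bytes(self.log[pos + 20:pos + 24], "big")
            end = pos + 24 + length
            if pos in live:                               # skip orphaned and deleted entries
                live[pos].offset = len(new_log)           # re-point the bucket at the new log
                new_log += self.log[pos:end]              # bulk sequential write
            pos = end
        self.log = new_log
        # drop buckets whose entries were deleted
        self.index = [[b for b in chain if b.valid] for chain in self.index]
```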
FAWN-KV
In order to provide a robust, scalable service, the back-ends running FAWN-DS instances are joined together and managed by front-end nodes, which in turn, in industry deployments, would be connected to a master node.
- Fault tolerance is introduced via replication
- Each front-end is ideally responsible for some 80 back-ends and manages joins and leaves, exposing a simple put/get/delete interface
- Additionally, front-ends can route requests between themselves and cache responses, leaving the master node as an optimization and a convenience rather than a single point of failure
Life-cycle of a request (figure from the FAWN paper omitted)
Life-cycle of a request, elaborated
- Each front-end is assigned a contiguous portion of the key space
- Upon receiving a request, it either processes it using its managed back-ends or forwards it if the key belongs to a different front-end
- Front-ends maintain a list of virtual nodes and their corresponding addresses, and thus can instantly translate the request into the appropriate calls
- While the request is processed by the back-ends, the front-end ensures replication is maintained
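A small sketch of the forward-or-handle decision described above; the peer map, the callbacks and all names are assumptions made for illustration.

```python
from typing import Callable

def handle_request(key: int,
                   my_lower: int, my_upper: int,          # this front-end's contiguous key range
                   peers: dict[int, str],                 # key-range lower bound -> front-end address
                   process_locally: Callable[[int], bytes],
                   forward_to: Callable[[str, int], bytes]) -> bytes:
    if my_lower <= key < my_upper:
        return process_locally(key)                  # dispatch to the managed back-end chain
    owned = [lb for lb in peers if lb <= key]
    owner_lb = max(owned) if owned else max(peers)   # wrap around the key space
    return forward_to(peers[owner_lb], key)          # hand off to the owning front-end
```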
Replication in Chains (figure from the FAWN paper omitted)
Replication in Chains, cont.
- Each key defines a chain in the virtual node ring
- A fixed number of nodes maintain copies of the mapping
- These nodes are obtained by iterating the successor function starting from the key
- The first node holding a replica is the head of the chain; the last node is the tail
- Every put request is issued to the head of the chain and waits for an acknowledgement from the tail; every get is passed to the tail
This ensures consistency and proper ordering of changes throughout the chain.
Replication of a put
After receiving the put request, the head forwards the put along the chain and waits for an acknowledgement. If all goes well, the tail acknowledges both to the front-end and, recursively, to its predecessor.
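The following runnable sketch imitates that propagation with direct method calls standing in for network messages; all names are assumptions, and the real system acknowledges to the front-end from the tail rather than returning values up a call stack.

```python
from dataclasses import dataclass, field

@dataclass
class ChainNode:
    name: str
    successor: "ChainNode | None" = None           # next replica in the chain; None at the tail
    store: dict = field(default_factory=dict)

    def put(self, key: int, value: bytes) -> str:
        self.store[key] = value                    # apply the update locally
        if self.successor is None:                 # tail: acknowledge (also to the front-end)
            return f"ack({key}) from tail {self.name}"
        ack = self.successor.put(key, value)       # forward the put down the chain
        return ack                                 # the ack propagates back toward the head

    def get(self, key: int) -> bytes | None:
        return self.store.get(key)                 # gets are served by the tail

# Example: a chain of three replicas; the front-end sends the put to the head
tail = ChainNode("C")
middle = ChainNode("B", successor=tail)
head = ChainNode("A", successor=middle)
ack = head.put(42, b"value")                       # returns once the tail has acknowledged
assert tail.get(42) == b"value"
```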
How a join is handled
When a (virtual) node joins the ring, precisely one key range is split in two. To maintain replication, the following happens:
- the current tail transmits its whole log to the new node (pre-copy)
- the front-end informs the nodes in the chain of the join via a chain membership message
- in response to this message, nodes flush updates received during pre-copy down the chain
Please refer to the paper for details on how updates arriving during the flush are handled, as well as the special cases of joining as head or tail.
What happens when a node leaves
When a node leaves the ring, each node that is supposed to take over its replicas in essence joins the replica chain at a different position in the key space, so the protocol is largely the same as for a join. At this stage, failure detection is achieved by heartbeats: if a node misses a set number of heartbeat signals, the front-end initiates a leave and appropriate action is taken.
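A minimal sketch of heartbeat-based failure detection as described above; the threshold, period and bookkeeping are illustrative assumptions.

```python
import time

MISS_THRESHOLD = 3        # heartbeats a node may miss before it is declared failed
HEARTBEAT_PERIOD = 1.0    # seconds between expected heartbeats

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self) -> list[str]:
        """Nodes that have missed MISS_THRESHOLD consecutive heartbeats; the
        front-end would initiate a leave for each of these."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > MISS_THRESHOLD * HEARTBEAT_PERIOD]
```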
Procedure description
FAWN's performance was evaluated under a number of criteria:
- single-node efficiency (compared to baseline hardware capabilities)
- cluster performance (tested on a 21 back-end / 1 front-end system)
- energy efficiency
The results were then compared with a number of more traditional configurations.
Single node performance

Baseline:
  Seq. read    Rand. read   Seq. write   Rand. write
  28.5 MBps    1424 QPS     24 MBps      110 QPS

FAWN:
  Data size   Rand. read (1 KB)   Rand. read (256 B)
  125 MB      51968 QPS           65412 QPS
  1 GB        1595 QPS            1964 QPS
  3.5 GB      1150 QPS            1298 QPS
Gets vs Puts (plot from the FAWN paper omitted)
Cluster performance and power consumption (plot from the FAWN paper omitted)
Important points on power consumption
- The plot displayed does not take into account the front-end (a further 27 W)
- The networking hardware used takes 20 W to operate (included in the plotted figures)
- Even factoring in the front-end, the system achieved 330 queries per Joule; a desktop computer can provide about 50 queries per Joule using an SSD
CDF of Query Latency (plot from the FAWN paper omitted)
Comparison with alternative approaches (projected)
(table from the FAWN paper omitted)
Important point: the FAWN entries in this table are projected performance figures for systems built using state-of-the-art components.
Solution space for system builders (projected) (figure from the FAWN paper omitted)
Conclusions
- FAWN is demonstrated to be a viable approach to providing cost-efficient data stores
- Using wimpy processors in an array can reduce power consumption while retaining performance
- Barring breakthrough discoveries, FAWN-like technologies are expected to deliver the lowest TCO for a large portion of the problem space
- Larger-scale testing is necessary to establish the correctness of these claims and to demonstrate scalability
References
[FAWN] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of ACM SOSP 2009, Big Sky, MT, USA, October 2009.
All images are taken from the FAWN paper.