RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

Similar documents

Web Technologies: RAMCloud and Fiz. John Ousterhout Stanford University

1 The RAMCloud Storage System

Fast Crash Recovery in RAMCloud

Fast Crash Recovery in RAMCloud

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

A Study of Application Performance with Non-Volatile Main Memory

A Deduplication File System & Course Review

MinCopysets: Derandomizing Replication In Cloud Storage

Benchmarking Hadoop & HBase on Violin

A1 and FARM scalable graph database on top of a transactional memory layer

DURABILITY AND CRASH RECOVERY IN DISTRIBUTED IN-MEMORY STORAGE SYSTEMS

Distributed File Systems

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Memory and Object Management in RAMCloud. Steve Rumble December 3 rd, 2013

Hadoop Architecture. Part 1

Benchmarking Cassandra on Violin

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Datacenter Operating Systems

D1.2 Network Load Balancing

ioscale: The Holy Grail for Hyperscale

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Enabling High performance Big Data platform with RDMA

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

COS 318: Operating Systems

Rackspace Cloud Databases and Container-based Virtualization

CS 153 Design of Operating Systems Spring 2015

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Chapter 10: Mass-Storage Systems

Lecture 18: Reliable Storage

Flash-Friendly File System (F2FS)

Big Data With Hadoop

CSE 120 Principles of Operating Systems

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

SALSA Flash-Optimized Software-Defined Storage

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Can High-Performance Interconnects Benefit Memcached and Hadoop?

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Caching Mechanisms for Mobile and IOT Devices

Promise of Low-Latency Stable Storage for Enterprise Solutions

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Chapter 12: Mass-Storage Systems

Berkeley Ninja Architecture

Ground up Introduction to In-Memory Data (Grids)

Configuring Apache Derby for Performance and Durability Olav Sandstå

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

FPGA-based Multithreading for In-Memory Hash Joins

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000

Big Data Processing with Google s MapReduce. Alexandru Costan

Indexing on Solid State Drives based on Flash Memory

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

FAWN - a Fast Array of Wimpy Nodes

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Massive Data Storage

Amazon Cloud Storage Options

GraySort on Apache Spark by Databricks

Outline. Failure Types

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Zadara Storage Cloud A

Speeding Up Cloud/Server Applications Using Flash Memory

LEVERAGING FLASH MEMORY in ENTERPRISE STORAGE. Matt Kixmoeller, Pure Storage

COS 318: Operating Systems. Virtual Memory and Address Translation

NV-DIMM: Fastest Tier in Your Storage Strategy

Software-defined Storage Architecture for Analytics Computing

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Storage Systems Autumn Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Next Generation Operating Systems

Storage at a Distance; Using RoCE as a WAN Transport

PARALLELS CLOUD STORAGE

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

WHITE PAPER FUJITSU PRIMERGY SERVER BASICS OF DISK I/O PERFORMANCE

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

Price/performance Modern Memory Hierarchy

The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production

Overview: X5 Generation Database Machines

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

AIX NFS Client Performance Improvements for Databases on NAS

Bigdata High Availability (HA) Architecture

Violin: A Framework for Extensible Block-level Storage

Quantum StorNext. Product Brief: Distributed LAN Client

Distribution transparency. Degree of transparency. Openness of distributed systems

SMB Direct for SQL Server and Private Cloud

Towards Rack-scale Computing Challenges and Opportunities

Transcription:

RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University

Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction RAMCloud: new class of storage for low-latency datacenters Potential impact: enable new class of applications September 16, 2014 RAMCloud/SDC Slide 2

How Many Datacenters? Suppose we capitalize IT at the same level as other infrastructure (power, water, highways, telecom): $1-10K per person? 0.5-5 datacenter servers/person? U.S. World Servers 0.15-1.5B 3.5-35B Datacenters 1500-15,000 35,000-350,000 (assumes 100,000 servers/datacenter) Computing in 10 years: Most non-mobile computing (i.e. Intel processors) will be in datacenters Devices provide user interfaces September 16, 2014 RAMCloud/SDC Slide 3

Evolution of Datacenters Phase 1: manage scale 10,000-100,000 servers within 50m radius 1 PB DRAM 100 PB disk storage Challenge: how can one application harness thousands of servers? Answer: MapReduce, etc. But, communication latency high: 300-500µs round-trip times Must process data sequentially to hide latency (e.g. MapReduce) Interactive applications limited in functionality September 16, 2014 RAMCloud/SDC Slide 4

Evolution of Datacenters, cont d Phase 2: low latency Speed-of-light limit: 1µs Round-trip time achievable today: 5-10µs Practical limit (5-10 years): 2-3µs September 16, 2014 RAMCloud/SDC Slide 5

Why Does Latency Matter? Traditional Application Web Application UI App. Logic Single machine Data Structures Application Servers UI App. Logic Datacenter << 1µs latency 0.5-10ms latency Storage Servers Large-scale apps struggle with high latency Random access data rate has not scaled! Facebook: can only make 100-150 internal requests per page September 16, 2014 RAMCloud/SDC Slide 6

MapReduce Computation..................... Data Sequential data access high data access rate Not all applications fit this model Offline September 16, 2014 RAMCloud/SDC Slide 7

Goal: Scale and Latency Traditional Application Web Application UI App. Logic Data Structures Single machine Application Servers << 1µs latency 0.5-10ms latency 5-10µs UI App. Logic Datacenter Storage Servers Enable new class of applications: Large-scale graph algorithms (machine learning?) Collaboration at scale? September 16, 2014 RAMCloud/SDC Slide 8

Large-Scale Collaboration Region of Consciousness Gmail: email for one user Facebook: 50-500 friends Data for one user Morning commute: 10,000-100,000 cars September 16, 2014 RAMCloud/SDC Slide 9

Getting To Low Latency Network switches (currently 10-30µs per switch): 10Gbit switches: 500ns per switch Radical redesign: 30ns per switch Must eliminate buffering Software (currently 60µs total): Kernel bypass (direct NIC access from applications): 2µs New protocols, threading architectures: 1µs NIC (currently 2-32µs per transit): Optimize current architectures: 0.75µs per transit New NIC CPU integration: 50ns per transit September 16, 2014 RAMCloud/SDC Slide 10

Achievable Round-Trip Latency Component Actual Today Possible Today 5-10 Years Switching fabric 100-300µs 5µs 0.2µs Software 60µs 2µs 1µs NIC 8-128µs 3µs 0.2µs Propagation delay 1µs 1µs 1µs Total 200-400µs 11µs 2.4µs September 16, 2014 RAMCloud/SDC Slide 11

RAMCloud Storage system for low-latency datacenters: General-purpose All data always in DRAM (not a cache) Durable and available Scale: 1000+ servers, 100+ TB Low latency: 5-10µs remote access September 16, 2014 RAMCloud/SDC Slide 12

RAMCloud Architecture 1000 100,000 Application Servers Appl. Library Appl. Library Appl. Library Appl. Library High-speed networking: 5 µs round-trip Full bisection bwidth Commodity Servers Datacenter Network Coordinator Master Backup Master Backup Master Backup Master Backup 64-256 GB per server 1000 10,000 Storage Servers September 16, 2014 RAMCloud/SDC Slide 13

Example Configurations 2012 2015-2020 # servers 2000 2000 GB/server 128 GB 512 GB Total capacity 256 TB 1 PB Total server cost $7M $7M $/GB $27 $7 For $100K today: One year of Amazon customer orders (5 TB?) One year of United flight reservations (2 TB?) September 16, 2014 RAMCloud/SDC Slide 14

Data Model: Key-Value Store Basic operations: read(tableid, key) => blob, version write(tableid, key, blob) => version delete(tableid, key) Other operations: cwrite(tableid, key, blob, version) => version Enumerate objects in table Efficient multi-read, multi-write Atomic increment (Only overwrite if version matches) Not currently available: Atomic updates of multiple objects Secondary indexes Tables Object Key ( 64KB) Version (64b) Blob ( 1MB) September 16, 2014 RAMCloud/SDC Slide 15

Using Infiniband networking (24 Gb/s, kernel bypass) Other networking also supported, but slower Reads: 100B objects: 5µs 10KB objects: 10µs Single-server throughput (100B objects): 700 Kops/sec. Small-object multi-reads: 1-2M objects/sec. Durable writes: RAMCloud Performance 100B objects: 16µs 10KB objects: 40µs Small-object multi-writes: 400-500K objects/sec. September 16, 2014 RAMCloud/SDC Slide 16

Data Durability Objects (eventually) backed up on disk or flash Logging approach: Each master stores its objects in a log Log divided into segments Segments replicated on multiple backups Segment replicas scattered across entire cluster For efficiency, updates buffered on backups Assume nonvolatile buffers (flushed during power failures) Segment 1 Segment 2 Segment 3 B3 B24 B19 B45 B7 B11 B12 B3 B28 September 16, 2014 RAMCloud/SDC Slide 17

1-2 Second Crash Recovery Each master scatters segment replicas across entire cluster ~1000 Backups ~100 Recovery Masters On crash: Coordinator partitions dead master s tablets. Partitions assigned to different recovery masters Crashed Master Backups read disks in parallel Log data shuffled from backups to recovery masters Recovery masters replay log entries, incorporate objects into their logs Fast recovery: 300 MB/s per recovery master Recover 40 GB in 1.8 seconds (80 nodes, 160 SSDs) September 16, 2014 RAMCloud/SDC Slide 18

Log-Structured Memory Don t use malloc for memory management Wastes 50% of memory Instead, structure memory as a log Allocate by appending Log cleaning to reclaim free space Control over pointers allows incremental cleaning Can run efficiently at 80-90% memory utilization September 16, 2014 RAMCloud/SDC Slide 19

New Projects Data model: Can RAMCloud support higher-level features? Secondary indexes Multi-object transactions Graph operations Impact on scale/latency? System architecture for datacenter RPC: Goals: large scale, low latency Current implementation works, but: Too slow (2µs software overhead per RPC) Sacrifices throughput for latency Too much state per connection (1M connections/server in future?) September 16, 2014 RAMCloud/SDC Slide 20

Threats to Latency Layering Great for software structuring Bad for latency E.g. RAMCloud threading structure costs 200-300ns/RPC Virtualization is potential problem Buffering Network buffers are the enemy of latency TCP will fill them, no matter how large Facebook measured 10 s of ms RPC delay because of buffering Need new networking architectures with no buffers Substitute switching bandwidth for buffers September 16, 2014 RAMCloud/SDC Slide 21

Datacenter revolution < half over: Scale is here Low latency is coming Next steps: New networking architectures New storage systems Ultimate result: Amazing new applications Conclusion September 16, 2014 RAMCloud/SDC Slide 22

Extra Slides

The Datacenter Opportunity Exciting combination of features: Concentrated compute power (~100,000 machines) Large amounts of storage: 1 Pbyte DRAM 100 Pbytes disk Potential for fast communication: Low latency (speed-of-light delay < 1µs) High bandwidth Homogeneous Controlled environment enables experimentation: E.g. new network protocols Huge Petri dish for innovation over next decade September 16, 2014 RAMCloud/SDC Slide 24