The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production

Sudipta Sengupta
Joint work with Justin Levandoski and David Lomet (Microsoft Research), and Microsoft product group partners across SQL Server, Azure DocumentDB, and Bing ObjectStore
The B-Tree

- Key-ordered access to records
- Separator keys in internal nodes (to guide search); full records in leaf nodes
- Efficient point and range lookups
- Balanced tree via page split and merge mechanisms

[Figure: B-tree with internal nodes in memory and data pages on disk]
Design Tenets for a New B-Tree

- Lock-free operations for high concurrency
  - Exploit modern multi-core processors
- Log-structured storage organization
  - Exploit the fast random-read property of flash and work around inefficient random writes
- Delta updates to pages
  - Reduce cache invalidation in the memory hierarchy
  - Reduce garbage creation and write amplification on flash, increasing device lifetime
Bw-Tree Architecture

- B-Tree layer (access method): exposes the API; B-tree search/update logic; operates on in-memory pages only
- Cache layer (LLAMA): logical page abstraction for the B-tree layer; moves pages between memory and flash as necessary
- Flash layer: reads/writes from/to storage; storage management
The Mapping Table

- Exposes logical pages to the access method layer
- Translates a logical page ID to a physical address
- Helps to isolate updates to a single page
- Central data structure for multithreaded concurrency control
- Updated in a lock-free manner using compare-and-swap (CAS)
- Also used for the log-structured store mapping

[Figure: mapping table from page ID to physical address; each entry packs a 1-bit flash/memory flag and a 63-bit address, pointing to pages in RAM or on flash]
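The mapping-table indirection can be sketched as follows. This is a minimal single-process illustration (all names are assumptions, not the real Bw-Tree API): a real implementation packs the flag and address into one 64-bit word and updates it with a hardware atomic compare-and-swap; here a lock emulates the CAS semantics.

```python
import threading

class MappingTable:
    """Illustrative sketch of the Bw-Tree mapping table (hypothetical API).
    Maps a logical page ID to the physical "address" of the page's current
    state (here, a plain Python object). A lock emulates the atomic CAS."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def get(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        """Install `new` only if the slot still holds `expected`.
        Returns True on success, False if another updater won the race."""
        with self._lock:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False

table = MappingTable()
table.cas(1, None, {"records": {10: "a", 20: "b"}})  # install a base page
page = table.get(1)
print(page["records"][10])  # -> a
```

Because every reference between pages goes through the table, installing a new page state is a single CAS on one slot, which is what isolates updates to a single page.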
Delta Updates: Insert Record on Page P

[Figure: updates such as "insert record 60", "update record 35", "insert record 50", and "delete record 48" are prepended to page P as delta records, each installed by a CAS on P's mapping-table entry; periodically the delta chain is consolidated into a new page P]
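The delta-update mechanism above can be sketched as a linked chain of delta records in front of a base page. This is a simplified, single-threaded sketch with assumed names; in the real Bw-Tree the pointer swap that publishes a new chain head is the CAS on the mapping-table slot.

```python
# Sketch of delta updating: each update prepends a delta record to the
# page's chain; consolidation replays the chain onto the base page and
# installs the result, leaving the old chain for garbage collection.

BASE, INSERT, UPDATE, DELETE = "base", "insert", "update", "delete"

def make_base(records):
    return {"kind": BASE, "records": dict(records), "next": None}

def prepend_delta(page, kind, key, value=None):
    # In the real Bw-Tree this swap is an atomic CAS on the mapping table;
    # here we just return the new chain head.
    return {"kind": kind, "key": key, "value": value, "next": page}

def consolidate(page):
    """Replay deltas onto the base page, oldest delta first."""
    deltas = []
    node = page
    while node["kind"] != BASE:
        deltas.append(node)
        node = node["next"]
    records = dict(node["records"])
    for d in reversed(deltas):
        if d["kind"] == DELETE:
            records.pop(d["key"], None)
        else:  # insert or update
            records[d["key"]] = d["value"]
    return make_base(records)

p = make_base({35: "old", 48: "x"})
p = prepend_delta(p, INSERT, 50, "v50")
p = prepend_delta(p, DELETE, 48)
p = prepend_delta(p, INSERT, 60, "v60")
p = prepend_delta(p, UPDATE, 35, "new")
print(consolidate(p)["records"])  # -> {35: 'new', 50: 'v50', 60: 'v60'}
```

Readers traverse the chain from the newest delta down to the base page, so a lookup sees all published updates without any in-place modification of the page.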
[Figure: node split via a split delta and an index-entry delta; the mapping table holds logical pointers (PIDs) while pages link with physical pointers]
Flash SSDs: Log-Structured Storage

- Use flash in a log-structured manner

[Figure: IOPS on a FusionIO 160GB ioDrive — seq-reads 134,725; rand-reads 134,723; seq-writes 49,059; rand-writes 17,492; sequential writes are ~3x faster than random writes]
LLAMA Log-Structured Store

- Suitable for flash, with other benefits
- Amortizes the cost of writes over many page updates
  - Aggregates large amounts of new/changed data and appends it to the log in a single I/O
- Multiple random reads to fetch a logical page
  - Okay for flash: on the order of a few tens of microseconds each
- Works well for hard disks also
  - Same benefit of amortizing page write cost
  - Random reads incur seek latency, but this is mitigated by capturing the working set of pages in RAM

[Figure: B-Tree layer above the cache layer above the storage layer (disk, flash, or other NVM)]
[Figure: mapping table in RAM pointing to base pages and Δ-records laid out in write order in a sequential log on flash]
Departure from Tradition: Page Layout on Flash

- Logical pages are formed by linking together records on possibly different physical pages
  - Logical pages do not correspond to whole physical pages on flash
  - Physical pages on flash contain records from multiple logical pages
- Exploits the random-access nature of flash media
  - No disk-like seek overhead when reading the records of a logical page spread across multiple physical pages on flash
- Adapted from SkimpyStash (ACM SIGMOD 2011)
[Figure: base pages and Δ-records in RAM are written, in log order, through a latch-free 8MB flush buffer to the log-structured store on SSD]
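The write-amortization idea behind the flush buffer can be sketched as follows: small page deltas accumulate in a RAM buffer and are appended to the log in one large sequential I/O, instead of one random write per page update. A minimal sketch with assumed names (the real flush buffer is latch-free and 8MB; here it is a tiny single-threaded stand-in):

```python
class FlushBuffer:
    """Hypothetical sketch of a flush buffer for a log-structured store."""

    def __init__(self, capacity, write_io):
        self.capacity = capacity      # e.g. 8MB in the real system
        self.buf = bytearray()
        self.write_io = write_io      # callback performing one append I/O
        self.log_offset = 0           # next append position in the log

    def add(self, record: bytes) -> int:
        """Buffer a record; returns its eventual offset in the log."""
        if len(self.buf) + len(record) > self.capacity:
            self.flush()
        offset = self.log_offset + len(self.buf)
        self.buf += record
        return offset

    def flush(self):
        if self.buf:
            self.write_io(bytes(self.buf))   # one large sequential write
            self.log_offset += len(self.buf)
            self.buf.clear()

log = []  # stands in for the SSD log; each entry is one I/O
fb = FlushBuffer(capacity=16, write_io=log.append)
offsets = [fb.add(b"delta-%d" % i) for i in range(4)]
fb.flush()
print(len(log), offsets)  # -> 2 [0, 7, 14, 21]
```

Four buffered records reach the log in two I/Os here; with an 8MB buffer the amortization over many small deltas is far larger. Returning the future log offset at `add` time is what lets the mapping table point at records before they are durable.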
- Reading a logical page may involve reading delta records from multiple physical pages
  - Probably okay because of the fast random-access property of flash
  - Mitigated by capturing the working set of pages in memory
- But we can reduce read I/Os further
  - Multiple delta records, when flushed together, are packed into a contiguous unit on flash (C-delta)
  - Pages consolidated periodically in memory also get consolidated on flash when they are flushed

[Figure: base page followed by multiple delta records written together as a C-delta on flash]
- Two types of record units in the log
  - Valid: reachable from the flash offset in the mapping table
  - Orphaned: not reachable
- Garbage collection starts from the oldest portion of the log
  - The earliest-written record on a logical page (the base page) is encountered first
  - To avoid cascaded pointer updates up the chain, relocate the entire logical page at a time, and use this opportunity to consolidate

[Figure: GC point and write point in the log; units on flash in write order, reachable from the mapping table]
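The GC pass over the oldest log segment can be sketched as below. All names are assumptions: `relocate_page` stands for the step that consolidates a logical page and appends it at the write point (updating the mapping table), which is how cascaded pointer updates are avoided.

```python
# Sketch of log garbage collection: scan the oldest segment in write
# order; a unit is valid if the mapping table still reaches its offset.
# Valid units cause the whole logical page to be relocated (once);
# orphaned units are simply dropped with the segment.

def gc_segment(segment, valid_offsets, relocate_page):
    """segment: list of (offset, page_id) units in write order.
    valid_offsets: offsets still reachable from the mapping table.
    relocate_page: hypothetical helper that consolidates the logical
    page and rewrites it at the write point."""
    relocated = set()
    for offset, page_id in segment:
        if offset in valid_offsets and page_id not in relocated:
            relocate_page(page_id)   # move the entire page, one time
            relocated.add(page_id)
    # after the scan, nothing in the segment is reachable; reclaim it
    return relocated

moved = []
seg = [(0, "P"), (8, "Q"), (16, "P")]
out = gc_segment(seg, valid_offsets={0, 16}, relocate_page=moved.append)
print(sorted(out), moved)  # -> ['P'] ['P']
```

Note that page P has two reachable units in the segment but is relocated only once, since relocation rewrites the whole consolidated page.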
LLAMA: Cache Layer

- Provides the abstraction of logical pages to the access method layer
  - Mapping table containing RAM pointers or flash offsets
- Reads pages into RAM from stable storage
- Flushes pages to stable storage
  - Writes to flash are ordered through flush buffers
- Swaps out pages to reduce memory usage

[Figure: B-Tree layer above the cache layer above the storage layer (disk, flash, or other NVM)]
LLAMA: Page Swapout

- Attempts to swap out pages when memory usage exceeds a configurable threshold
- Uses a variant of the CLOCK algorithm
- Parallel page-swapping functionality
  - Each accessor to the Bw-Tree does a small amount of page-swapping work (a CLOCK sweep) if needed
- The RAM pointer is replaced by a flash offset in the mapping table
- The page structure is deallocated using epoch-based memory garbage collection

[Figure: pages moving between RAM and flash]
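A CLOCK (second-chance) sweep of the kind referenced above can be sketched as follows. This is a generic illustration, not the LLAMA variant: each page carries a reference bit set on access; the sweep hand clears set bits and evicts the first page whose bit is already clear. In the Bw-Tree, "evict" means replacing the RAM pointer in the mapping table with a flash offset.

```python
def clock_sweep(pages, ref_bits, hand):
    """pages: circular list of resident page IDs.
    ref_bits: page ID -> reference bit (set on access).
    hand: current sweep position. Returns (victim, new_hand)."""
    n = len(pages)
    while True:
        pid = pages[hand]
        if ref_bits[pid]:
            ref_bits[pid] = False          # clear: give a second chance
            hand = (hand + 1) % n
        else:
            return pid, (hand + 1) % n     # evict this page

pages = ["A", "B", "C"]
bits = {"A": True, "B": False, "C": True}
victim, hand = clock_sweep(pages, bits, hand=0)
print(victim)  # -> B
```

Because each caller advances the hand only a little before finding a victim, the work parallelizes naturally across accessor threads, matching the "each accessor does a small amount of swapping work" design.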
Bw-Tree/LLAMA Checkpointing

- B-Tree layer checkpointing (for durability)
  - Flush pages to the flush buffer and subsequently to storage
- LLAMA checkpointing (for fast recovery)
  - Write the mapping table to flash
    - When an entry contains a RAM address, obtain the flash address from the in-memory page
    - Unused entries are written as zeroes
  - Record the write position in the log when the checkpoint started
  - Alternate between two fixed regions on flash for each checkpoint

[Figure: mapping table of page ID to flash offset; GC point and RSP marked on the flash log]
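The LLAMA checkpoint step can be sketched as below. All structures are assumptions for illustration: entries holding an in-memory page are translated to the flash offset the page was last flushed to, the current log write position is recorded alongside, and the snapshot alternates between two fixed regions.

```python
def take_checkpoint(mapping_table, log_write_pos, regions, turn):
    """mapping_table: pid -> flash offset (int) or an in-memory page
    (here a dict carrying the offset it was last flushed to — an
    assumed representation). regions: the two fixed checkpoint slots.
    turn: which region to overwrite this time."""
    snapshot = {}
    for pid, addr in mapping_table.items():
        # RAM entry: read the flash address out of the in-memory page
        snapshot[pid] = addr if isinstance(addr, int) else addr["flash_offset"]
    regions[turn] = {"table": snapshot, "rsp": log_write_pos}
    return 1 - turn                       # alternate regions next time

regions = [None, None]
turn = take_checkpoint({1: 100, 2: {"flash_offset": 250}}, 999, regions, 0)
print(regions[0])  # -> {'table': {1: 100, 2: 250}, 'rsp': 999}
```

Alternating between two regions means a crash mid-checkpoint always leaves the previous complete checkpoint intact.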
Bw-Tree Fast Recovery

- Restore the mapping table from the latest checkpoint region
- Scan from the log position recorded in the checkpoint to the end of the log
  - Read the page ID from each C-delta on the log and update its flash offset in the mapping table
- Restore the Bw-Tree root page
- Optimizations for fast cache warm-up

[Figure: mapping table in RAM restored to point at base pages and Δ-records in the sequential log on flash]
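The recovery scan can be sketched as follows, under the same assumed structures: pick the checkpoint region with the later recorded position, start from its mapping-table snapshot, and replay the log tail in write order so that the latest unit for each page wins.

```python
def recover(regions, log_tail):
    """regions: the two checkpoint regions (either may be None).
    log_tail: list of (offset, page_id) units, in write order, for the
    portion of the log at or after the recorded checkpoint position."""
    ckpt = max((r for r in regions if r), key=lambda r: r["rsp"])
    table = dict(ckpt["table"])
    for offset, page_id in log_tail:
        if offset >= ckpt["rsp"]:
            table[page_id] = offset   # later unit wins: scan is in order
    return table

regions = [{"table": {1: 100, 2: 250}, "rsp": 300},
           {"table": {1: 100}, "rsp": 120}]
print(recover(regions, [(300, 2), (340, 3)]))  # -> {1: 100, 2: 300, 3: 340}
```

Only the log tail after the checkpoint is scanned, which is what makes recovery fast: the bulk of the mapping table comes straight from the snapshot.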
Deuteronomy Architecture: Bw-Tree Support for Transactions

- An app needing a transactional key-value store uses the Transactional Component layered over the Data Component
- An app needing an atomic key-value store uses the Data Component's access methods directly: the Bw-Tree (latch-free ordered index) or latch-free linear hashing
- An app needing a high-performance log-structured page store uses LLAMA, the page storage engine, directly
End-to-End Crash Recovery

- Data Component (DC) recovery
  - Bw-Tree fast recovery as described
- Transactional Component (TC) recovery
  - Helps to recover unflushed data at the DC up to the end of the stable log (WAL) at the time of the crash
  - Requires the DC to recover to a logically consistent state first

[Figure: record operations and control operations flow between the TC and the DC; the DC sits above storage]
Bw-Tree in Production

- Key-sequential index in SQL Server Hekaton
  - Lock-free for high concurrency, consistent with Hekaton's overall non-blocking main-memory architecture
- Indexing engine in Azure DocumentDB
  - Rich query processing over a schema-free JSON model, with automatic indexing
  - Sustained document ingestion at high rates
- Sorted key-value store in Bing ObjectStore
  - Supports range queries
  - Optimized for flash SSDs
DocumentDB

- Relational stores: fully schematized, relational queries, transactions (e.g., SQL Azure, Amazon RDS, SQL IaaS)
- Key-value / column-family stores: schema-less with opaque values, lookups on keys (e.g., Azure Tables, HBase, BigTable, LevelDB, Cassandra)
- Document stores: schema-less, rich hierarchical queries over (JSON) documents, (JavaScript) sprocs/triggers/UDFs (e.g., MongoDB, CouchDB, Espresso)

[Figure: a sample JSON document with nested "location", "main", and "exports" fields; its tree-structured index over paths such as country/city; and a path query over /"exports"/?/"city" evaluated by a JavaScript function]

DocumentDB highlights:
- Formal query model optimized for queries over schema-less documents at scale
- Support for relational and hierarchical projections
- Consistent indexing in the face of rapid, sustained high-volume writes (optimized for flash SSDs)
- Developer-tunable consistency/availability tradeoffs with SLAs
- Low-latency, (JavaScript) language-integrated, transactional CRUD on storage partitions
- Elastic scale, resource-governed, multi-tenant PaaS
Consistent indexing over schema-less documents is a heavily constrained design space:
- Sustained high-volume writes without any term locality
- Extremely high concurrency
- Queries should honor various consistency levels
- Multi-tenancy with strict, reservation-based, sub-process-level resource governance
Key Challenges

- Index update: no key locality; cannot afford a read to do the write; low write amplification
- Queries: low read amplification
- Frugal resource budget

[Figure: a read-modify-write update must fetch the base page {d1, d2, d3} before writing {d1, d2, d3, d4}; blind updates instead append merge deltas {d4+} and {d2-} to a page stub without reading, and consolidation later produces {d1, d3, d4}]
Bw-Tree Resource Governance

- CPU resource governance
  - Threads calling into the Bw-Tree do not block (on I/O or memory page access)
  - A top-level scheduler controls the thread budget per replica
- Memory resource governance
  - Dynamically configurable buffer pool limit
- IOPS resource governance
  - Check resource usage before issuing an I/O; retry after a dynamically computed timeout interval
- Storage resource governance
  - The LLAMA log-structured store can grow/shrink dynamically
  - Self-adjusting based on logical data size
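The IOPS check-before-issue pattern can be sketched as a simple per-window admission test. All names and the fixed one-second window are assumptions for illustration; the real governor computes its retry interval dynamically from the replica's reservation.

```python
import time

class IopsGovernor:
    """Hypothetical sketch: admit an I/O only if the replica is within
    its reserved IOPS budget for the current accounting window."""

    def __init__(self, budget_per_sec):
        self.budget = budget_per_sec
        self.window_start = time.monotonic()
        self.issued = 0

    def try_issue(self):
        """Returns (admitted, suggested_retry_delay_seconds)."""
        now = time.monotonic()
        if now - self.window_start >= 1.0:        # new accounting window
            self.window_start, self.issued = now, 0
        if self.issued < self.budget:
            self.issued += 1
            return True, 0.0
        # over budget: caller retries after the window resets
        return False, 1.0 - (now - self.window_start)

gov = IopsGovernor(budget_per_sec=2)
results = [gov.try_issue()[0] for _ in range(3)]
print(results)  # -> [True, True, False]
```

Because callers never block inside the store, a rejected I/O returns a retry delay to the top-level scheduler instead of stalling the calling thread.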
Bringing Up a Bw-Tree Replica

- Obtain the Bw-Tree physical state stream from the primary
  - Most recent LLAMA checkpoint file
  - Valid portion of the LLAMA log (between the GC and write points)
- Bring up the Bw-Tree using fast recovery
- Catch up with the primary
  - Replay logical operations from the primary with LSNs above the last (contiguous) LSN in the recovered Bw-Tree

[Figure: the primary replica handles reads/writes and document replication + indexing; an existing secondary and a new secondary each serve reads from a Bw-Tree index over the log-structured store]
Bw-Tree: Summary

- Classical B-tree redesigned from the ground up for modern hardware and the cloud
  - Lock-free for high concurrency on multi-core processors
  - Delta updating of pages in memory for cache efficiency
  - Log-structured storage organization for flash SSDs
  - Flexible resource governance in a multi-tenant setting
  - A transactional component can be layered above as part of the Deuteronomy architecture
- Shipping in Microsoft's server/cloud offerings
  - Key-sequential index in SQL Server Hekaton
  - Indexing engine in Azure DocumentDB
  - Sorted key-value store in Bing ObjectStore
- Going forward
  - Layer a transactional component on top, per the Deuteronomy architecture (CIDR 2015, VLDB 2016)
  - Open-source the codebase