Flash Memory Aware Software Architectures and Applications


Flash Memory Aware Software Architectures and Applications. Sudipta Sengupta and Jin Li, Microsoft Research, Redmond, WA, USA. Includes joint work with Biplob Debnath (Univ. of Minnesota).

Flash Memory. Used for more than a decade in consumer-device storage applications; very recent use in desktops and servers. New access patterns (e.g., random writes) pose new challenges for delivering sustained high throughput and low latency, along with higher requirements in reliability, performance, and data life. These challenges are being addressed at different layers of the storage stack: flash device vendors in the device driver and inside the device, and system builders in the OS and application layers, which is the focus of this talk. Slide 2

Flash Aware Applications. System builders: don't just treat flash as a disk replacement; make the OS/application layer aware of flash, exploit its benefits, embrace its peculiarities and design around them, and identify applications that can exploit the sweet spot between cost and performance. Device vendors: you can help by exposing more APIs to the software layer for managing storage on flash, which can help squeeze better performance out of flash with application knowledge. Slide 6

Flash for Speeding Up Cloud/Server Applications. FlashStore [VLDB 2010]: high-throughput, low-latency persistent key-value store using flash as a cache above HDD. ChunkStash [USENIX ATC 2010]: efficient index design on flash for high-throughput data deduplication. BloomFlash [ICDCS 2011]: Bloom filter design for flash. SkimpyStash [ACM SIGMOD 2011]: key-value store with ultra-low RAM footprint, about 1 byte per k-v pair. Also: flash as a block-level cache above HDD (either application managed or OS managed), SSD buffer pool extension in a database server, SSD caching tier in cloud storage. Slide 7

Flash Memory: Random Writes. The storage stack needs to be optimized to make the best use of flash: random writes are not efficient on flash media, and the Flash Translation Layer (FTL) cannot hide or abstract away device constraints. [Chart: IOPS of a FusionIO 160 GB ioDrive — seq-reads 134,725; rand-reads 134,723; seq-writes 49,059; rand-writes 17,492, about 3x lower than sequential writes.] Slide 8

FlashStore: High Throughput Persistent Key-Value Store

Design Goals and Guidelines. Support low-latency, high-throughput operations as a key-value store. Exploit flash memory properties and work around its constraints: fast random (and sequential) reads, reduced random writes, and the non-volatile property. Low RAM footprint per key, independent of key-value pair size. Slide 10

FlashStore Design: Flash as Cache. Low-latency, high-throughput operations: use flash memory as a cache between RAM and hard disk. [Diagram: the current setup, RAM over disk, is bottlenecked by hard disk seek times of ~10 ms; FlashStore places flash between RAM and disk, with flash access times on the order of 10-100 µs.]

FlashStore Design: Flash Awareness. Flash-aware data structures and algorithms: random writes and in-place updates are expensive on flash memory, and the Flash Translation Layer (FTL) cannot hide or abstract away device constraints; sequential writes and random/sequential reads are great, so use flash in a log-structured manner. [Chart: the same FusionIO 160 GB ioDrive IOPS numbers as above — seq-reads 134,725; rand-reads 134,723; seq-writes 49,059; rand-writes 17,492.]

FlashStore Architecture. RAM write buffer for aggregating writes into flash; key-value pairs organized on flash in a log-structured manner; recently unused key-value pairs destaged to HDD; RAM read cache for recently accessed key-value pairs; key-value pairs on flash indexed in RAM using a specialized space-efficient hash table.
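To make the interplay of these components concrete, here is a minimal Python sketch of the get/set path (class and method names are mine, not FlashStore's): writes are buffered in RAM and flushed to the flash log as large sequential writes, while reads consult the RAM write buffer, the RAM read cache, the RAM index into the flash log, and finally the HDD. Destaging from flash to HDD and crash recovery are omitted.

```python
class FlashStoreSketch:
    """Toy model of the FlashStore read/write path (not the real implementation)."""

    def __init__(self, write_buffer_limit=1024):
        self.ram_write_buffer = {}     # recent writes, not yet on flash
        self.ram_read_cache = {}       # recently read key-value pairs
        self.ram_index = {}            # key -> offset of its record in the flash log
        self.flash_log = []            # append-only log of (key, value) records on flash
        self.hdd = {}                  # recently unused pairs destaged to hard disk (not modeled)
        self.write_buffer_limit = write_buffer_limit

    def set(self, key, value):
        self.ram_write_buffer[key] = value
        if len(self.ram_write_buffer) >= self.write_buffer_limit:
            self._flush_write_buffer()           # one large sequential write to flash

    def _flush_write_buffer(self):
        for key, value in self.ram_write_buffer.items():
            self.ram_index[key] = len(self.flash_log)   # RAM index points into the flash log
            self.flash_log.append((key, value))
        self.ram_write_buffer.clear()

    def get(self, key):
        if key in self.ram_write_buffer:         # 1. freshest value still in the RAM write buffer
            return self.ram_write_buffer[key]
        if key in self.ram_read_cache:           # 2. RAM read cache
            return self.ram_read_cache[key]
        if key in self.ram_index:                # 3. one flash read, located via the RAM index
            _, value = self.flash_log[self.ram_index[key]]
            self.ram_read_cache[key] = value
            return value
        return self.hdd.get(key)                 # 4. fall back to the hard disk
```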

FlashStore Design: Low RAM Usage. High hash table load factors while keeping lookup times fast. Collisions are resolved using cuckoo hashing: a key can be in one of K candidate positions, and later-inserted keys can relocate earlier keys to their other candidate positions. The K candidate positions for key x are obtained using K hash functions h_1(x), ..., h_K(x); in practice, two hash functions can simulate the K hash functions via h_i(x) = g_1(x) + i*g_2(x). The system uses K = 16 and targets a 90% hash table load factor.

Low RAM Usage: Compact Key Signatures. Compact key signatures are stored in the hash table: a 2-byte key signature instead of the full key (key-length bytes). A key x stored at its candidate position i derives its signature from h_i(x); the resulting false flash read probability is < 0.01%. Total of 6-10 bytes per entry (including a 4-8 byte flash pointer). Related work on key-value stores on flash media: MicroHash, FlashDB, FAWN, BufferHash. Slide 18
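A hedged sketch of the RAM hash table described on the last two slides, assuming SHA-1-derived hash functions and illustrative table sizes: K candidate slots are simulated from two hashes via h_i(x) = g_1(x) + i*g_2(x), cuckoo insertion relocates earlier keys, and a 2-byte signature derived from h_i(x) filters out almost all unnecessary flash reads. The toy table also keeps the key in RAM so relocation can recompute signatures; the real design stores only the signature and flash pointer and re-reads keys from flash when needed.

```python
import hashlib
import random

K = 16                      # candidate positions per key (the system uses K = 16)
TABLE_SIZE = 1 << 20        # hash table slots; sized to run at ~90% load factor
MAX_KICKS = 128             # cuckoo relocations before giving up (or resizing)

def _g(key: bytes, salt: bytes) -> int:
    return int.from_bytes(hashlib.sha1(salt + key).digest()[:8], "little")

def candidate_slots(key: bytes):
    """K candidate slots from two hashes: h_i(x) = g_1(x) + i*g_2(x)."""
    g1, g2 = _g(key, b"1"), _g(key, b"2") | 1     # odd g2 keeps the K slots distinct
    return [(g1 + i * g2) % TABLE_SIZE for i in range(K)]

def signature(key: bytes, i: int) -> int:
    """2-byte signature derived from h_i(x), the hash that placed the key in its slot."""
    g1, g2 = _g(key, b"1"), _g(key, b"2") | 1
    return ((g1 + i * g2) >> 32) & 0xFFFF         # bits not used for the slot index

# Each slot holds (signature, flash_offset, key); real entries are only ~6-10 bytes.
table = [None] * TABLE_SIZE

def insert(key: bytes, flash_offset: int) -> bool:
    cur_key, cur_off = key, flash_offset
    for _ in range(MAX_KICKS):
        slots = candidate_slots(cur_key)
        for i, s in enumerate(slots):
            if table[s] is None:                  # empty candidate slot: place the key
                table[s] = (signature(cur_key, i), cur_off, cur_key)
                return True
        j = random.randrange(K)                   # all candidates full: evict one occupant
        _, victim_off, victim_key = table[slots[j]]
        table[slots[j]] = (signature(cur_key, j), cur_off, cur_key)
        cur_key, cur_off = victim_key, victim_off # and re-insert the evicted key
    return False

def lookup(key: bytes, read_key_from_flash):
    for i, s in enumerate(candidate_slots(key)):
        entry = table[s]
        if entry is None or entry[0] != signature(key, i):   # cheap RAM-only rejection
            continue
        if read_key_from_flash(entry[1]) == key:  # confirm on flash; <0.01% wasted reads
            return entry[1]                       # flash offset of the key-value record
    return None
```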

FlashStore Performance Evaluation. Hardware platform: Intel processor, 4 GB RAM, 7200 RPM disk, FusionIO SSD; cost without flash = $1200, cost of the FusionIO 80 GB SLC SSD = $2200 (circa 2009). Traces: Xbox LIVE Primetime, Storage Deduplication.

FlashStore Performance Evaluation. How much better is it than a simple hard disk replacement with flash, and what is the impact of the flash-aware data structures and algorithms in FlashStore compared with a flash-unaware key-value store? Compared configurations: FlashStore-SSD, BerkeleyDB-HDD, BerkeleyDB-SSD (BerkeleyDB used as the flash-unaware index on HDD/SSD), and FlashStore-SSD-HDD (to evaluate the impact of flash recycling activity). Slide 22

[Chart: throughput (get-set ops/sec) for the compared configurations, with annotated speedups of 5x, 8x, 24x, and 60x. Slide 23]

Performance per Dollar. From BerkeleyDB-HDD to FlashStore-SSD: throughput improvement of ~40x. Flash investment = 50% of HDD capacity (example) = 5x the HDD cost (assuming flash costs 10x per GB), so the total system cost is about 6x. Throughput/dollar improvement of about 40/6 ≈ 7x. Slide 24

SkimpyStash: Ultra-Low RAM Footprint Key-Value Store on Flash

Aggressive Design Goal for RAM Usage. Target ~1 byte of RAM usage per key-value pair on flash; trade off against key access time (#flash reads per lookup); preserve the log-structured storage organization on flash. Slide 28

SkimpyStash: Base Design. Resolve hash table collisions using linear chaining: multiple keys resolving to a given hash table bucket are chained in a linked list, and the linked lists are stored on flash itself. This preserves the log-structured organization, with later-inserted keys pointing to earlier keys in the log. Each hash table bucket in RAM contains a pointer to the beginning of its linked list on flash. Slide 29

[Diagram: hash table directory in RAM, each bucket holding a pointer into a sequential log of key-value records on flash memory; records in a chain point back to earlier records, and keys are ordered by write time in the log. Slide 30]
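A minimal sketch of the base design, under the simplifying assumption that the flash log is modeled as an in-memory list: RAM holds one 4-byte head-of-chain pointer per bucket, while the collision chains live entirely in the append-only log, with each new record pointing back to the previous head of its bucket.

```python
NUM_BUCKETS = 1 << 16       # hash table directory size (illustrative)

class SkimpyStashSketch:
    """Toy model of the base design: RAM holds only one pointer per bucket;
    the collision chains themselves live in the append-only flash log."""

    def __init__(self):
        self.directory = [None] * NUM_BUCKETS    # RAM: head-of-chain pointer per bucket
        self.flash_log = []                      # flash: (key, value, next_ptr) records

    def _bucket(self, key: bytes) -> int:
        return hash(key) % NUM_BUCKETS

    def set(self, key: bytes, value):
        b = self._bucket(key)
        # the new record points back to the previous head of this bucket's chain
        self.flash_log.append((key, value, self.directory[b]))
        self.directory[b] = len(self.flash_log) - 1     # the 4-byte pointer kept in RAM

    def get(self, key: bytes):
        ptr = self.directory[self._bucket(key)]
        while ptr is not None:                   # walk the chain; one flash read per hop
            k, v, ptr = self.flash_log[ptr]
            if k == key:
                return v                         # the newest version is reached first
        return None
```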

SkimpyStash: Page Layout on Flash. Logical pages are formed by linking together records on possibly different physical pages; hash buckets do not correspond to whole physical pages on flash but to logical pages, and physical pages on flash contain records from multiple hash buckets. This exploits the random-access nature of flash media: there is no disk-like seek overhead when reading the records of a hash bucket spread across multiple physical pages on flash. Slide 31

Base Design: RAM Space Usage. k = average #keys per bucket, the critical design parameter. (4/k) bytes of RAM per k-v pair: a pointer to the chain on flash (4 bytes) per slot. Example: k = 10 gives an average of 5 flash reads per lookup (~50 µs) and ~0.5 bytes of RAM per k-v pair on flash. Slide 33

The Tradeoff Curve Slide 34

Base Design: Room for Improvement? Large variations in average lookup times across buckets, due to the skewed distribution of the number of keys in each bucket chain. Lookups on non-existing keys require the entire bucket (linked list) to be searched on flash. Slide 35

Improvement Idea 1: Load Balancing across Buckets. Two-choice-based load balancing across buckets: hash each key to two buckets and insert it into the least-loaded bucket (1-byte counter per bucket). But lookup times double, since both buckets need to be searched during a lookup. Fix? Slide 36

Improvement Idea 2: Bloom Filter per Bucket. A lookup checks the bucket's Bloom filter before searching its linked list on flash; sized for ~k keys => k bytes per hash table directory slot. Other benefits: lookups on non-existing keys are faster (almost always no flash access), and it benefits from load balancing, since balanced chains help improve Bloom filter accuracy (fewer false positives). A symbiotic relationship! Slide 37

Enhanced Design: RAM Space Usage. k = average #keys per bucket. (1 + 5/k) bytes of RAM per k-v pair; per hash table directory slot: pointer to the chain on flash (4 bytes), bucket size counter (1 byte), Bloom filter (k bytes). Example: k = 10 gives an average of 5 flash reads per lookup (~50 µs) and 1.5 bytes of RAM per k-v pair on flash. Slide 38
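A sketch of the enhanced design, with illustrative bucket counts and Bloom filter sizing: each key hashes to two candidate buckets and is inserted into the less loaded one, and a per-bucket Bloom filter is consulted before any chain on flash is walked. The per-bucket RAM cost noted in the class docstring mirrors the (1 + 5/k) accounting above.

```python
import hashlib

NUM_BUCKETS = 1 << 16
BLOOM_BITS = 80             # ~k bytes (8k bits) per bucket for k ~ 10 keys, as in the talk
BLOOM_HASHES = 4

def _hash2(key: bytes):
    """Two bucket choices per key, derived from one digest."""
    d = hashlib.sha1(key).digest()
    return (int.from_bytes(d[:4], "little") % NUM_BUCKETS,
            int.from_bytes(d[4:8], "little") % NUM_BUCKETS)

def _bloom_positions(key: bytes):
    d = hashlib.sha1(b"bf" + key).digest()
    return [int.from_bytes(d[2*i:2*i+2], "little") % BLOOM_BITS for i in range(BLOOM_HASHES)]

class SkimpyStashEnhanced:
    """Per bucket in RAM: 4-byte chain pointer + 1-byte counter + k-byte Bloom filter,
    i.e. (4 + 1 + k)/k = 1 + 5/k bytes of RAM per key at an average of k keys per bucket."""

    def __init__(self):
        self.directory = [None] * NUM_BUCKETS    # head-of-chain pointer per bucket
        self.counts = [0] * NUM_BUCKETS          # 1-byte chain-length counter per bucket
        self.blooms = [0] * NUM_BUCKETS          # BLOOM_BITS-bit Bloom filter per bucket
        self.flash_log = []                      # append-only (key, value, next_ptr) records

    def set(self, key: bytes, value):
        b1, b2 = _hash2(key)
        b = b1 if self.counts[b1] <= self.counts[b2] else b2   # two-choice: least-loaded bucket
        self.flash_log.append((key, value, self.directory[b]))
        self.directory[b] = len(self.flash_log) - 1
        self.counts[b] += 1
        for p in _bloom_positions(key):
            self.blooms[b] |= 1 << p

    def get(self, key: bytes):
        for b in _hash2(key):
            if not all(self.blooms[b] >> p & 1 for p in _bloom_positions(key)):
                continue                         # Bloom filter says "not here": no flash reads
            ptr = self.directory[b]
            while ptr is not None:               # walk this bucket's chain on flash
                k, v, ptr = self.flash_log[ptr]
                if k == key:
                    return v
        return None
```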

Compaction to Improve Read Performance. When enough records accumulate in a bucket to fill a flash page, place them contiguously on one or more flash pages (m records per page); the average #flash reads per lookup then becomes k/(2m). Garbage is created in the log by compaction and by updated or deleted records. Slide 39
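A sketch of compaction over the same toy log model as the SkimpyStash sketches above (the page size is illustrative): once a bucket's chain reaches a page worth of records, the chain is rewritten as one contiguous run at the tail of the log, so a lookup reads roughly k/(2m) pages on average instead of up to k single-record hops. A fuller implementation would also drop superseded or deleted records at this point.

```python
RECORDS_PER_PAGE = 64       # m: records that fit on one flash page (illustrative)

def maybe_compact(store, bucket: int):
    """Rewrite a long chain contiguously at the log tail; the bypassed single-record
    copies become garbage in the log, to be reclaimed by a later cleaning pass."""
    records, ptr = [], store.directory[bucket]
    while ptr is not None:                       # gather the bucket's chain from flash
        key, value, ptr = store.flash_log[ptr]
        records.append((key, value))
    if len(records) < RECORDS_PER_PAGE:
        return                                   # not yet a page worth of records
    prev = None
    for key, value in reversed(records):         # re-append contiguously, oldest first
        store.flash_log.append((key, value, prev))
        prev = len(store.flash_log) - 1
    store.directory[bucket] = prev               # bucket now points at the contiguous run
```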

ChunkStash: Speeding Up Storage Deduplication using Flash Memory

Deduplication of Storage. Detect and remove duplicate data in storage systems, e.g., across multiple full backups. Benefits: storage space savings and faster backup completion (disk I/O and network bandwidth savings). It is a feature offering in many storage system products (Data Domain, EMC, NetApp). Backups need to complete within windows of a few hours, so throughput (MB/sec) is an important performance metric. High-level techniques: content-based chunking with detection and storage of unique chunks only, object/file-level dedup, differential encoding.

Impact of Dedup Savings Across Full Backups Source: Data Domain white paper

Content-based Chunking. Calculate a Rabin fingerprint hash for each 16-byte sliding window; if the hash matches a particular pattern, declare a chunk boundary. [Animation: the sliding window advances across the data, boundaries are declared wherever the hash matches the pattern, yielding 3 chunks.]
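A minimal sketch of content-defined chunking. A production system computes a Rabin fingerprint over the 16-byte sliding window; here a simple polynomial rolling hash stands in for it, and a boundary is declared when the low 13 bits of the hash are zero, giving roughly 8 KB average chunks. The minimum/maximum chunk bounds are a common practical addition, not something stated on the slide.

```python
WINDOW = 16                        # sliding window size in bytes
BOUNDARY_MASK = (1 << 13) - 1      # boundary when low 13 bits are 0 => ~8 KB average chunks
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, (1 << 61) - 1
REMOVE = pow(BASE, WINDOW, MOD)    # weight of the byte sliding out of the window

def chunk_boundaries(data: bytes):
    """Yield chunk end offsets chosen by content: an insertion early in the stream only
    shifts boundaries locally, so unchanged regions still produce identical chunks."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                       # bring the new byte into the hash
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * REMOVE) % MOD     # drop the byte that slid out
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
            yield i + 1                                   # declare a chunk boundary after byte i
            start = i + 1
    if start < len(data):
        yield len(data)                                   # final partial chunk
```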

Index for Detecting Duplicate Chunks. A chunk hash index identifies duplicate chunks: key = 20-byte SHA-1 hash (or 32-byte SHA-256), value = chunk metadata (e.g., length, location on disk); key + value is about 64 bytes. Essential operations: Lookup (Get) and Insert (Set). A high-performance indexing scheme is needed: the chunk metadata is too big to fit in RAM, disk IOPS become the bottleneck for a disk-based index, and duplicate chunk detection is bottlenecked by hard disk seek times (~10 ms).
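A sketch of the dedup loop against such an index, assuming an in-memory dict and an illustrative metadata record; in ChunkStash the index lives on flash and is reached through a RAM hash table.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ChunkMeta:                # value stored in the index; field names are illustrative
    length: int
    container_id: int
    offset: int                 # location of the chunk's data on disk

chunk_index = {}                # 20-byte SHA-1 digest -> ChunkMeta (on flash in ChunkStash)

def dedup_store(chunk: bytes, container_id: int, offset: int) -> bool:
    """Return True if the chunk is new and its data must be written to the container."""
    key = hashlib.sha1(chunk).digest()          # 20-byte content hash of the chunk
    if key in chunk_index:                      # Lookup (Get): duplicate, store a reference only
        return False
    chunk_index[key] = ChunkMeta(len(chunk), container_id, offset)   # Insert (Set)
    return True
```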

Disk Bottleneck for Identifying Duplicate Chunks. 20 TB of unique data at an average 8 KB chunk size requires 160 GB of storage for the full index (2.5 × 10^9 unique chunks at 64 bytes of metadata per chunk); it is not cost effective to keep all of this huge index in RAM. Backup throughput is limited by disk seek times for index lookups: a 10 ms seek time => 100 chunk lookups per second => 800 KB/sec backup throughput. There is no locality in the key space for chunk hash lookups, but prefetching into RAM the index mappings for an entire container exploits the sequential predictability of lookups during the 2nd and subsequent full backups (Zhu et al., FAST 2008).
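The back-of-envelope numbers above, reproduced as a small calculation (decimal units, as used on the slide):

```python
TB, GB, KB = 10**12, 10**9, 10**3       # decimal units, as used on the slide

chunks      = 20 * TB // (8 * KB)       # 2.5e9 unique chunks at an 8 KB average chunk size
index_bytes = chunks * 64               # 64 B of metadata per chunk => 160 GB of index
lookups_s   = 1 / 0.010                 # ~10 ms disk seek per index lookup => 100 lookups/sec
throughput  = lookups_s * 8 * KB        # => 800 KB/sec backup throughput when index-bound

print(chunks, index_bytes / GB, throughput / KB)   # 2500000000 160.0 800.0
```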

[Schematic: storage deduplication process — incoming data is chunked; chunks of the currently open container are buffered in RAM; the chunk index is kept on flash (ChunkStash) or on HDD; chunk data is stored in containers on HDD.]

ChunkStash: Chunk Metadata Store on Flash. RAM write buffer for chunk mappings in the currently open container. Chunk metadata is organized on flash in a log-structured manner, in groups of 1023 chunks => a 64 KB logical page (at 64 bytes of metadata per chunk). A prefetch cache for chunk metadata in RAM exploits the sequential predictability of chunk lookups. Chunk metadata is indexed in RAM using a specialized space-efficient hash table. Slide 68
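A sketch of how the prefetch cache can exploit that sequential predictability, with illustrative cache sizing and hypothetical helper callbacks (flash_index, read_container_metadata): a hit in the flash-resident index triggers prefetching of the whole container's chunk metadata into RAM.

```python
from collections import OrderedDict

PREFETCH_CONTAINERS = 100           # containers' worth of metadata kept in RAM (illustrative)
CHUNKS_PER_CONTAINER = 1023         # matches the 64 KB logical page of chunk metadata

prefetch_cache = OrderedDict()      # chunk hash -> metadata, filled one container at a time

def lookup_chunk(chunk_hash, flash_index, read_container_metadata):
    """flash_index maps chunk hash -> container id (via the RAM hash table into the flash log);
    read_container_metadata(cid) returns all (hash, metadata) pairs of that container."""
    if chunk_hash in prefetch_cache:                    # hit in the RAM prefetch cache
        return prefetch_cache[chunk_hash]
    container_id = flash_index.get(chunk_hash)
    if container_id is None:
        return None                                     # new chunk: must be stored
    # On a flash-index hit, prefetch the whole container's chunk metadata, betting that the
    # backup stream will next ask about the chunks that followed this one in the last backup.
    for h, meta in read_container_metadata(container_id):
        prefetch_cache[h] = meta
        prefetch_cache.move_to_end(h)
    while len(prefetch_cache) > PREFETCH_CONTAINERS * CHUNKS_PER_CONTAINER:
        prefetch_cache.popitem(last=False)              # evict oldest metadata first
    return prefetch_cache.get(chunk_hash)
```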

Performance Evaluation. Comparison with a disk-index-based system: disk-based index (Zhu08-BDB-HDD), SSD replacement (Zhu08-BDB-SSD), SSD replacement + ChunkStash (ChunkStash-SSD), and ChunkStash on hard disk (ChunkStash-HDD). Prefetching of chunk metadata is used in all systems. Three datasets, 2 full backups for each; BerkeleyDB is used as the index on HDD/SSD.

[Chart: performance evaluation on Dataset 2, with annotated speedups of 1.2x, 65x, 3.5x, 1.8x, 25x, and 3x across the compared systems and backups. Slide 75]

Performance Evaluation Disk IOPS Slide 77

Flash Memory Cost Considerations. Chunks occupy an average of 4 KB on hard disk (compressed chunks are stored on hard disk, with a typical compression ratio of 2:1). Flash storage is 1/64th of the hard disk storage: 64 bytes of metadata on flash per 4 KB of occupied space on hard disk. The flash investment is about 16% of the hard disk cost: 1/64th additional storage at 10x cost per GB = 16% additional cost. Performance/dollar improvement of 22x: 25x the performance at 1.16x the cost. Further cost reduction is possible by amortizing flash across datasets: store chunk metadata on HDD and preload it to flash. Slide 80
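The cost arithmetic, spelled out (ratios as quoted on the slide):

```python
hdd_bytes_per_chunk = 4 * 1024          # a compressed chunk occupies ~4 KB on hard disk
flash_bytes_per_chunk = 64              # chunk metadata kept on flash

flash_fraction = flash_bytes_per_chunk / hdd_bytes_per_chunk    # 1/64 of HDD capacity
extra_cost = flash_fraction * 10        # flash at ~10x the cost per GB => ~0.16 (16% extra)

speedup = 25                            # ChunkStash-SSD vs. the disk-index baseline
perf_per_dollar = speedup / (1 + extra_cost)                    # ~22x

print(round(extra_cost, 2), round(perf_per_dollar, 1))          # 0.16 21.6
```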

Summary. System builders: don't just treat flash as a disk replacement; make the OS/application layer aware of flash, exploit its benefits, embrace its peculiarities and design around them, and identify applications that can exploit the sweet spot between cost and performance. Device vendors can help by exposing more APIs to the software layer for managing storage on flash, which can help squeeze better performance out of flash with application knowledge, e.g., Trim(), the newly proposed ptrim(), and exists() from FusionIO. Slide 82

Thank You! Email: {sudipta, jinl}@microsoft.com Slide 83