Freezing Exabytes of Data at Facebook's Cold Storage. Kestutis Patiejunas (kestutip@fb.com)




1990 vs. 2014: Seagate 94171-327 (300 MB) vs. iPhone 5 (16 GB)
Seagate 94171-327 specs:
Form factor: 3.5"
Platters: 5
Heads: 9
Capacity: 300 MB
Interface: SCSI
Seek time: 17 ms
Data transfer rate: 1 MB/sec

History of Hard Drive data transfer rates

Building Facebook HDD Cold Storage Distinct goals and principles (otherwise we will get another HDFS)

Goals and non-goals
Goals:
1. Durable
2. High efficiency
3. Reasonable availability
4. Scale
5. Support evolution
6. Gets better as it gets bigger
Non-goals:
1. Low latency for write/read operations
2. High availability
3. Efficiency for small objects
4. Efficiency for objects with short lifetimes

Principles
#1. Durability comes from eliminating single points of failure and the ability to recover the full system from the remaining portions.
#2. High efficiency comes from batching and from trading latency for efficiency. We spend mostly on storing the data, not the metadata.
#3. Simplicity leads to reliability. Trade complexity and features for simplicity; gain durability and reliability.
#4. Handle failures from day one. Distributed systems fail even on a sunny day; we learn about our mistakes when we find that the intended recovery doesn't work.

Architecture from 36,000 feet

Facebook HDD Cold Storage: HW parts of the solution
1/3 the cost of conventional storage servers
1/5 the cost of conventional data centers

Software architecture when we started in 2013

Raid rebuild vs Distributed volume rebuild
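The contrast between the two rebuild strategies can be made concrete with a back-of-the-envelope calculation. The numbers below are illustrative assumptions (4 TB disks sustaining ~100 MB/s, 100-way rebuild parallelism), not figures from the talk:

```python
# Back-of-the-envelope rebuild time comparison (hypothetical numbers).

DISK_TB = 4       # capacity of the failed disk, TB
DISK_MBPS = 100   # sustained sequential throughput per disk, MB/s

def hours(tb, mbps):
    """Time to move `tb` terabytes at `mbps` MB/s, in hours."""
    return tb * 1e6 / mbps / 3600

# Classic RAID: the rebuild is bottlenecked by writing the single
# replacement disk at full speed.
raid_rebuild_h = hours(DISK_TB, DISK_MBPS)

# Distributed volume rebuild: the failed disk's data is re-protected in
# parallel across many disks, each contributing a slice of bandwidth.
PARALLEL_DISKS = 100
distributed_rebuild_h = hours(DISK_TB, DISK_MBPS * PARALLEL_DISKS)

print(f"RAID rebuild:        {raid_rebuild_h:.1f} h")        # ~11.1 h
print(f"Distributed rebuild: {distributed_rebuild_h:.2f} h")  # ~0.11 h
```

The single-disk write bottleneck is why a classic RAID rebuild takes hours while the array stays degraded, whereas a distributed volume rebuild re-protects the same data in minutes by spreading the work over many disks.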

Gets better as it gets bigger

The Real Software architecture after 9 months in production

Raw disk storage
Problem: it takes 12h to fill a 4TB HDD, yet XFS can be formatted/erased in ~1 sec.
Solution: a custom raw disk storage format.
Metadata stored in 3 places; metadata is distributed.
A full-disk overwrite is required to erase data.
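A minimal sketch of the replicated-metadata idea, assuming a raw device with the volume index copied at three fixed offsets. This is my own illustration (the offsets, block size, and JSON encoding are invented), not Facebook's actual on-disk format:

```python
import json

BLOCK = 4096
# Three fixed offsets for the metadata copies (hypothetical values).
META_OFFSETS = [0, 1 * 2**30, 2 * 2**30]

def write_metadata(dev_path, index):
    """Replicate the volume index at three fixed locations on the raw device."""
    blob = json.dumps(index).encode()
    assert len(blob) <= BLOCK, "index must fit in one metadata block"
    with open(dev_path, "r+b") as dev:
        for off in META_OFFSETS:          # metadata stored in 3 places
            dev.seek(off)
            dev.write(blob.ljust(BLOCK, b"\0"))

def read_metadata(dev_path):
    """Return the first readable metadata copy; any one survivor is enough."""
    with open(dev_path, "rb") as dev:
        for off in META_OFFSETS:
            dev.seek(off)
            blob = dev.read(BLOCK).rstrip(b"\0")
            try:
                return json.loads(blob)
            except ValueError:
                continue                  # this copy is corrupt; try the next
    raise IOError("all metadata copies are unreadable")
```

With no filesystem in the way, there is no one-second "format" that destroys everything; only a deliberate full-disk overwrite erases the data, and a corrupted metadata copy is masked by the two replicas.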

Is this good enough? What if we had a simple roof water leak? A disgruntled employee? A software bug? Fire? Storm? Earthquake?

What if we use Reed-Solomon across datacenters?

Conclusion: trade between storage, network and CPU
Like RAID systems do
Like HDFS and similar systems do
Just do this at the datacenter level (can mix Cold and Regular datacenters)
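The storage/network/CPU trade can be quantified with simple fragment arithmetic. A sketch with illustrative parameters; the 20/28 figure on the later write-path slide is assumed here to mean a 20-of-28 code:

```python
# Illustrative erasure-coding arithmetic (my own sketch; the k/m choices
# are examples, not necessarily Facebook's production parameters).

def rs_scheme(k, m):
    """k data fragments + m parity fragments, Reed-Solomon style."""
    storage_overhead = (k + m) / k   # raw bytes stored per logical byte
    failures_survived = m            # any m fragment losses are recoverable
    return storage_overhead, failures_survived

# Full replication across 3 datacenters: 3.0x storage, survives 2 DC losses.
# RS(2+1) across 3 datacenters: 1.5x storage, survives 1 DC loss, but every
# recovery now costs cross-datacenter network traffic and CPU for decoding;
# that is exactly the storage/network/CPU trade described above.
assert rs_scheme(2, 1) == (1.5, 1)

# An assumed 20-of-28 code within a datacenter, for comparison:
overhead, survived = rs_scheme(20, 8)
print(f"RS(20+8): {overhead:.2f}x storage, survives {survived} fragment losses")
```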

So was Jim Gray 100% right about the future?

Questions and possibilities for the mass storage industry
Hard drives:
Hit the density wall with PMR at 1TB/platter
Adding more platters (4-8TB)
Adding SMR (only a 15-20% increase)
Waiting for HAMR!
Optical:
100GB/disc is cheap
300GB within 18 months
500GB in 2-3 years
Sony and Panasonic have 1TB/disc on the roadmap
Going back to the 5x factor?

Questions and possibilities for the mass storage industry
Hard drives: less demand from IOPS-intensive workloads (either shifting to SSD, or unable to use all of the capacity)
Optical: 4K or 8K movies will need lots of storage; small demand from consumers for large capacity

Facebook Blu-Ray storage rack

BluRay rack software/hardware components (rack diagram)
BluRay Storage rack boundary: control commands and data to read or write arrive from the Cold Storage service over 2 x 10 Gbps public Ethernet.
Head node computer (software fully controlled by Facebook): the BluRay Rack agent receives/sends data over the network, reads/writes discs through an HBA over SATA, and calls the robotics for load/unload.
Robotics control unit (software/firmware provided by HIT), reached over a private, commands-only Ethernet network.
12 BluRay burners/readers.

Details: data write path into BluRay racks (diagram)
1. Public Thrift interface: Migrate to Blu-Ray.
2. A Blu-Ray Data Migrator Worker gets a portion of the Migration Plan from the Data Migration Plane (Thrift interface: Get portion of Migration Plan, FindVolumeForWrite) and works on moving data.
3. Management calls (inventory, etc.): the Blu-Ray Pool Manager tracks empty/full media volumes in the Blu-Ray racks and updates the DB with inventory.
4. The worker calls Storage Nodes to read Chunk #1, #2, ... #N (20/28); the data is RS-decoded from Cold Storage and RS-encoded for the Blu-Ray nodes.
5-6. Calls to the Blu-Ray rack to write Chunk #1, #2, ... #N (maybe 4/5).
7. The worker updates the Data Migration Plane with info about the completed migration.
Control/Front End plane: CreateFile, CreateBlock; Files and Blocks sharded per user per bucket; a single CDB for one deployment.
Storage nodes (Cold Storage Knox nodes: StorageNode #1 ... #N) and Blu-Ray nodes (Blu-Ray Node #1 ... #N) expose a similar Thrift API: Write chunk, Read chunk, Verify chunk, Get inventory.

Limits of BluRay storage, pros and cons vs HDD storage
1. Load time: ~30s to load media for read/write; loadings should be amortized by reading/writing big data sets (full discs).
2. IO speed: current time to read/write a 100GB disc is 1.3 hours (about 5-6x longer than on HDDs).
3. Small fraction of active storage: a BluRay rack has only 12 burners, or 1.2TB of active storage; an HDD rack has 32 active HDDs, or 128TB of active storage. The write/read strategy is to cluster the data across time and spread it across multiple racks.
4. Efficiency and longevity: optical has a big edge vs. HDD.
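The amortization argument in points 1 and 3 follows directly from the slide's own numbers; a quick check:

```python
# Sanity-checking the Blu-ray vs HDD figures (all inputs from the slide).

bluray_active_tb = 12 * 0.1   # 12 burners, one 100 GB disc loaded in each
hdd_active_tb = 32 * 4        # 32 active 4 TB HDDs

# A ~30 s media load amortized over a 1.3 h full-disc write:
load_s = 30
burn_s = 1.3 * 3600
load_fraction = load_s / (load_s + burn_s)

print(f"Active storage: Blu-ray {bluray_active_tb:.1f} TB vs HDD {hdd_active_tb} TB")
print(f"Media load is only {load_fraction:.1%} of a full-disc write")
```

With the load taking well under 1% of a full-disc write, the cost of robotics is negligible as long as the system batches whole discs, which is exactly the stated clustering strategy.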

Conclusions on Cold Storage
When data amounts are massive, efficiency is important.
Specialization makes it possible to achieve efficiency.
If we are approaching the end of Moore's and Kryder's laws, which of the storage media has more iterations left: silicon, magnetic or optical?
If we can't see the future, can we hedge our bets, and how far can we push unexplored technologies to extract extra efficiency?