Finding a needle in Haystack: Facebook's photo storage. IBM Haifa Research Storage Systems




Some Numbers (2010)
- Over 260 billion images (20 PB): 65 billion photos, each stored in 4 different sizes.
- 1 billion photos (60 TB) are uploaded each week.
- Serves over 1 million images per second at peak.

Motivation
- Started with a traditional NFS-based system fronted by a CDN.
- The long-tail access pattern of photos leaves much of the traffic to the NFS system.
- Key observation: the NFS system cannot sustain the request rate because of the excessive number of metadata disk operations.

The NFS Based Design
- 80% CDN hit rate; the remaining 20% follow a long-tail distribution, which is not cacheable.
- Picture taken from "Finding a needle in Haystack: Facebook's photo storage", OSDI'10 Proceedings.

NFS and The Metadata Bottleneck
1. Starting point: more than 10 disk operations to retrieve a single image (thousands of images per directory).
2. Reducing directory size to hundreds of images brought this down to 3 disk operations: read the directory metadata, load the inode, read the file content.
3. Caching inodes: caching all inodes is an expensive requirement for current filesystems, and a Least Recently Used approach does not improve much.

The New Approach
- Reduce the amount of per-image filesystem metadata so it can all fit into main memory.
- Aggregate about 100 GB worth of images into one single file, or volume.
- Given an image id, looking up its offset and size can then be done entirely in memory.

The Usual Design Goals 1/2
- A storage system for data that is written once, read often, never modified, and rarely deleted.
- High throughput and low latency: needed for a good user experience; measurements show up to 12 ms per read (measured on the storage machine). Achieved by keeping all metadata in main memory (a la GFS) and by log-structured, append-only write operations.
- Fault tolerance: replication in geographically distinct locations; when a replica is lost, a new one is created; the replication unit is fixed (~100 GB).

The Usual Design Goals 2/2
- Cost effective. Compared with their previous NFS solution: the cost per usable terabyte of storage is 28% less, and the application-layer read rate per terabyte of usable storage is 4x higher.
- Simple.

Overview of the Haystack Architecture
The Directory maintains the logical-to-physical mapping; an additional CDN layer sits in front.
1. 10 TB of server capacity is organized as 100 physical volumes of 100 GB of storage each.
2. Physical volumes are grouped into logical volumes.
3. When a photo is stored in a logical volume, it is written to all corresponding physical volumes.
Picture taken from "Finding a needle in Haystack: Facebook's photo storage", OSDI'10 Proceedings.
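The replicated write in step 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class and function names (`VolumeMapping`, `write_photo`, `store_write`) are hypothetical.

```python
class VolumeMapping:
    """Maps each logical volume id to the physical volumes that replicate it."""

    def __init__(self):
        # logical volume id -> list of (store machine, physical volume id)
        self._map = {}

    def add_replica(self, logical_id, machine, physical_id):
        self._map.setdefault(logical_id, []).append((machine, physical_id))

    def replicas(self, logical_id):
        return self._map[logical_id]


def write_photo(mapping, logical_id, photo_id, data, store_write):
    """A photo written to a logical volume is appended to every
    corresponding physical volume (step 3 above)."""
    for machine, physical_id in mapping.replicas(logical_id):
        store_write(machine, physical_id, photo_id, data)
```

Because every physical replica receives the same append, losing one replica leaves the others able to serve reads while a new copy is created.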

Serving A Photo
URL format: http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
Picture taken from "Finding a needle in Haystack: Facebook's photo storage", OSDI'10 Proceedings.
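Each path component tells the corresponding tier how to handle the request: the CDN and Cache strip their own prefix, and the store machine uses the final component to locate the photo. A hedged sketch of splitting such a path (the exact separators here are illustrative, not the production format):

```python
def parse_photo_url(path):
    """Split a Haystack photo path of the form
    /<cdn>/<cache>/<machine id>/<logical volume>,<photo>
    into its components."""
    cdn, cache, machine_id, tail = path.strip("/").split("/")
    logical_volume, photo = tail.split(",")
    return cdn, cache, machine_id, logical_volume, photo
```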

Uploading A Photo
Picture taken from "Finding a needle in Haystack: Facebook's photo storage", OSDI'10 Proceedings.

The Haystack Directory
- Maintains the mapping from logical volumes to physical volumes (the placement table). What about the photo id to logical volume mapping?
- Identifies read-only logical volumes: those that have reached their storage capacity, or are read-only for operational reasons.
- Load balances writes across the write-enabled logical volumes.
- Decides whether a request should be served from the CDN or from the Cache.
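The write load-balancing decision above can be sketched minimally. The paper does not specify the balancing policy; picking uniformly at random among write-enabled volumes is an assumption for illustration.

```python
import random

def pick_write_volume(volumes):
    """Choose a logical volume for a new write.
    `volumes`: logical volume id -> {'writable': bool}.
    Sketch only: uniform random choice among write-enabled volumes."""
    writable = [vid for vid, v in volumes.items() if v["writable"]]
    if not writable:
        raise RuntimeError("no write-enabled logical volumes")
    return random.choice(writable)
```

Volumes that reach capacity flip to read-only and simply drop out of the candidate list, which matches the Directory's role of tracking read-only volumes.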

The Haystack Store
- Each store machine manages multiple physical volumes.
- Each physical volume can be thought of as a large file (~100 GB) saved as /hay/haystack_<logical volume id>.
- Keeps an open file descriptor for each managed physical volume (on XFS).
- Keeps an in-memory mapping: <photo id> -> <file, offset, size>.
- No metadata disk operations are necessary to serve a read.
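The combination of pre-opened descriptors and a purely in-memory index is what makes a read cost exactly one disk operation. A sketch under those assumptions (class and attribute names are illustrative, not the paper's):

```python
import os

class StoreMachine:
    """Sketch of a store machine: one open descriptor per physical
    volume, plus an in-memory photo index, so a read touches the disk
    only once and needs no metadata disk operations."""

    def __init__(self):
        self.volume_fds = {}   # physical volume id -> open file descriptor
        self.index = {}        # photo id -> (physical volume id, offset, size)

    def open_volume(self, volume_id, path):
        self.volume_fds[volume_id] = os.open(path, os.O_RDONLY)

    def read_photo(self, photo_id):
        volume_id, offset, size = self.index[photo_id]   # memory-only lookup
        # single positioned read of the photo bytes
        return os.pread(self.volume_fds[volume_id], size, offset)
```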

The Physical Volume Structure
On disk: a sequence of needles. In memory: a mapping from photo id to <file, offset, size>.
Picture taken from "Finding a needle in Haystack: Facebook's photo storage", OSDI'10 Proceedings.
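A needle bundles a photo's bytes with just enough self-describing metadata (key, cookie, size, checksum) to verify a read. The layout below is a simplified, hypothetical subset of the real needle format, shown only to make the idea concrete:

```python
import struct
import zlib

# Simplified on-disk needle layout (illustrative subset of fields):
#   magic | key | cookie | data size | data | checksum
HEADER = struct.Struct(">IQQI")  # magic, key, cookie, data size

def pack_needle(key, cookie, data, magic=0xFACEB00C):
    checksum = struct.pack(">I", zlib.crc32(data))
    return HEADER.pack(magic, key, cookie, len(data)) + data + checksum

def unpack_needle(buf):
    magic, key, cookie, size = HEADER.unpack_from(buf)
    data = buf[HEADER.size:HEADER.size + size]
    (checksum,) = struct.unpack_from(">I", buf, HEADER.size + size)
    assert zlib.crc32(data) == checksum, "integrity check failed"
    return key, cookie, data
```

Because each needle carries its own size, the in-memory index only needs <offset, size> to recover and verify a photo with a single read.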

Store Basic Operations
Read
- Gets <logical volume id, key, alternate key, cookie> from a Cache machine.
- Looks up the in-memory metadata; if the photo exists and is not marked as deleted, seeks and reads the entire needle (data + metadata).
- Verifies the cookie and the data integrity, then returns the data to the Cache machine.
Write
- Gets <logical volume id, key, alternate key, cookie, data> from a web server.
- Synchronously appends a needle to the appropriate physical volume.
- Updates the in-memory structure.
Modify (e.g. when a photo is rotated)
- The new version is either written to a new logical volume, requiring a metadata update by the Directory,
- or written to the same physical volume at a higher offset.
Delete and Compact
- Sets the delete flag both in memory and on disk, synchronously.
- Compaction writes the whole logical file into a new one, skipping deleted photos; about 25% of the photos get deleted.
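Since deletes only set a flag, compaction is what actually reclaims space. A minimal sketch of the copy-and-skip pass (the function and its in-memory representation of needles are illustrative, not the paper's code):

```python
def compact(needles, deleted):
    """Copy a volume's needles into a new volume, skipping those
    marked deleted, and rebuild the offset index as we go.
    `needles`: ordered list of (photo_id, data); `deleted`: set of ids."""
    new_volume = []
    new_index = {}
    offset = 0
    for photo_id, data in needles:
        if photo_id in deleted:
            continue  # reclaim this needle's space
        new_volume.append((photo_id, data))
        new_index[photo_id] = (offset, len(data))
        offset += len(data)
    return new_volume, new_index
```

With roughly 25% of photos deleted, a pass like this recovers a quarter of the volume while keeping the surviving needles contiguous and append-ordered.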

Recovery from Failures Arsenal
- Replication.
- RAID-6.
- A pitchfork background process that tests connections to store machines, checks the availability of volume files, and attempts to read data from store machines.
- Diagnosis and fixing are done offline.
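The pitchfork checks above amount to a periodic probe loop that only flags suspect machines, since diagnosis and repair happen offline. A hedged sketch, with the three probe callables as stand-ins for the real checks:

```python
def pitchfork_check(machines, test_connection, check_volumes, read_sample):
    """For every store machine run the three pitchfork probes:
    connectivity, volume-file availability, and a sample read.
    Failing machines are only flagged (e.g. marked read-only);
    actual diagnosis and fixing are done offline."""
    suspect = []
    for m in machines:
        if not (test_connection(m) and check_volumes(m) and read_sample(m)):
            suspect.append(m)
    return suspect
```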

Reference
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.