Finding a needle in Haystack: Facebook's photo storage
IBM Haifa Research, Storage Systems
Some Numbers (2010)
- Over 260 billion images (20 PB): 65 billion photos, each stored in 4 different sizes.
- 1 billion photos (60 TB) are uploaded each week.
- Serves over 1 million images per second at peak.
Motivation
- Started with a traditional NFS-based system fronted by a CDN.
- The long-tail pattern of access to photos leaves much of the traffic to the NFS system itself.
- Key observation: the NFS system cannot sustain this request rate because of the excessive number of metadata disk operations per photo read.
The NFS-Based Design
- 80% CDN hit rate.
- The remaining 20% of requests follow a long-tail distribution, which is not cacheable.
(Picture taken from Finding a needle in Haystack: Facebook's photo storage, OSDI '10 Proceedings)
NFS and the Metadata Bottleneck
1. Starting point: more than 10 disk operations to retrieve a single image (thousands of images per directory).
2. Reducing directory sizes to hundreds of images led to 3 disk operations per read:
   - read the directory metadata,
   - load the file's inode,
   - read the file contents.
3. Caching inodes:
   - Caching all inodes is an expensive requirement for current filesystems.
   - A least-recently-used (LRU) approach does not help much, given the long-tail access pattern.
The New Approach
- Reduce the per-image filesystem metadata so that all of it fits into main memory.
- Aggregate ~100 GB worth of images into one single file, or volume.
- Given an image id, looking up its offset and size can be done entirely in memory, leaving a single disk operation to read the data (see the sketch below).
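A minimal sketch of this idea, assuming a toy in-memory index (photo id -> offset and size) and a single volume file; the names index, volume_path, and read_photo are illustrative, not Facebook's code.

```python
index = {}                         # photo id -> (offset, size), kept in RAM
volume_path = "/hay/haystack_1"    # one large volume file holding many photos

def read_photo(photo_id):
    offset, size = index[photo_id]     # pure in-memory metadata lookup
    with open(volume_path, "rb") as f:
        f.seek(offset)                 # one seek ...
        return f.read(size)            # ... and one read: a single disk operation
```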
The Usual Design Goals 1/2
- A storage system for data that is written once, read often, never modified, and rarely deleted.
- High throughput and low latency: needed for a good user experience.
  - Measurements show up to 12 ms per read (measured on the storage machine).
  - Achieved by keeping all metadata in main memory (a la GFS) and by log-structured, append-only multi-writes.
- Fault tolerance: replication in geographically distinct locations.
  - When a replica is lost, a new one is created; the replication unit is fixed (~100 GB).
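A rough sketch of the log-structured, append-only write path, using the same toy index as above; append_photo is an assumed helper, and the fsync models the synchronous append described on a later slide.

```python
import os

def append_photo(volume_file, index, photo_id, data):
    volume_file.seek(0, os.SEEK_END)       # writes always go to the tail
    offset = volume_file.tell()
    volume_file.write(data)
    volume_file.flush()
    os.fsync(volume_file.fileno())         # persist before acknowledging
    index[photo_id] = (offset, len(data))  # in-memory metadata update
```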
The Usual Design Goals 2/2
- Cost effective, compared to their previous NFS solution:
  - cost per usable terabyte of storage is 28% less,
  - 4x the application-layer read rate per terabyte of usable storage.
- Simple.
Overview of the Haystack Architecture
- The Directory maintains the logical-to-physical mapping; an additional CDN layer sits in front.
1. 10 TB of server capacity is organized as 100 physical volumes of 100 GB of storage each.
2. Physical volumes are grouped into logical volumes.
3. When a photo is stored in a logical volume, it is written to all corresponding physical volumes.
(Picture taken from Finding a needle in Haystack: Facebook's photo storage, OSDI '10 Proceedings)
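Point 3 can be illustrated with a small sketch (assumed data structures; write_to_replica stands in for the actual store-machine call):

```python
logical_volumes = {                # logical volume id -> its physical replicas
    42: ["store-a:/hay/haystack_42",
         "store-b:/hay/haystack_42",
         "store-c:/hay/haystack_42"],
}

def store_photo(logical_volume_id, photo_id, data, write_to_replica):
    # a write to a logical volume fans out to every physical volume in it
    for replica in logical_volumes[logical_volume_id]:
        write_to_replica(replica, photo_id, data)
```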
Serving a Photo
- The Directory constructs a URL of the form: http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
- Each layer strips its own component off the URL and forwards the request to the next layer.
(Picture taken from Finding a needle in Haystack: Facebook's photo storage, OSDI '10 Proceedings)
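A hedged sketch of how a hop might peel components off such a URL; the example URL and helper name are made up, but the components follow the format above:

```python
from urllib.parse import urlparse

def parse_photo_url(url):
    # e.g. http://cdn.example.com/cache17/machine3/vol42/photo99
    cache, machine_id, logical_volume, photo = \
        urlparse(url).path.strip("/").split("/")
    return cache, machine_id, logical_volume, photo
```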
Uploading a Photo
- The web server asks the Directory for a write-enabled logical volume, then writes the photo to each of its physical volumes.
(Picture taken from Finding a needle in Haystack: Facebook's photo storage, OSDI '10 Proceedings)
The Haystack Directory
- Maintains the mapping from logical volumes to physical volumes (the placement table).
  - What about the photo id to logical volume mapping?
- Identifies read-only logical volumes, either because they have reached their storage capacity or for operational reasons.
- Load-balances writes across the write-enabled logical volumes (see the sketch below).
- Decides whether a request should be served from the CDN or from the Cache.
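A toy sketch of the placement table and write load balancing; the structures and the deliberately simple random policy are assumptions, and the real Directory is a replicated service:

```python
import random

placement = {                      # logical volume -> physical volumes
    7:  ["store-a:/hay/haystack_7",  "store-b:/hay/haystack_7"],
    42: ["store-a:/hay/haystack_42", "store-c:/hay/haystack_42"],
}
read_only = {7}                    # e.g. volume 7 has reached capacity

def pick_write_volume():
    writable = [lv for lv in placement if lv not in read_only]
    return random.choice(writable) # spread writes across writable volumes
```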
The Haystack Store
- Each store machine manages multiple physical volumes.
- Each physical volume can be thought of as a large file (~100 GB) saved as /hay/haystack_<logical volume id>.
- Keeps an open file descriptor for each managed physical volume (stored on XFS).
- Keeps the in-memory mapping <photo id> -> <file, offset, size>, so no metadata disk operations are necessary.
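The store machine's state can be sketched as below (an assumed StoreMachine class; note the long-lived file descriptors, one per physical volume):

```python
class StoreMachine:
    def __init__(self, volume_paths):
        # keep one open file descriptor per managed physical volume
        self.volumes = {vid: open(path, "rb+")
                        for vid, path in volume_paths.items()}
        # photo id -> (volume id, offset, size), rebuilt at startup
        self.index = {}

    def read(self, photo_id):
        vid, offset, size = self.index[photo_id]
        f = self.volumes[vid]
        f.seek(offset)              # no metadata disk operations needed
        return f.read(size)
```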
The Physical Volume Structure
- On disk: a superblock followed by a sequence of needles, one per photo.
- In memory: photo id -> (file, offset, size).
(Picture taken from Finding a needle in Haystack: Facebook's photo storage, OSDI '10 Proceedings)
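A sketch of one needle record following the paper's field list (header magic, cookie, key, alternate key, flags, size, data, footer magic, data checksum, padding); the byte widths and magic values here are assumptions for illustration:

```python
import struct
import zlib

HEADER_MAGIC = 0xFACE0001          # assumed magic values
FOOTER_MAGIC = 0xFACE0002
HEADER_FMT = "<IQQIBI"             # magic, cookie, key, alt key, flags, size
FOOTER_FMT = "<II"                 # magic, CRC32 of the data

def pack_needle(cookie, key, alt_key, flags, data):
    header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key,
                         alt_key, flags, len(data))
    footer = struct.pack(FOOTER_FMT, FOOTER_MAGIC, zlib.crc32(data))
    needle = header + data + footer
    return needle + b"\0" * (-len(needle) % 8)   # pad to 8-byte alignment
```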
Store Basic Operations
- Read:
  - Gets <logical volume id, key, alternate key, cookie> from the Cache.
  - Looks up the in-memory metadata; if the photo exists and is not marked as deleted, seeks and reads the entire needle (data + metadata).
  - Verifies the cookie and the data integrity, then returns the data to the Cache machine.
- Write:
  - Gets <logical volume id, key, alternate key, cookie, data> from the web server.
  - Synchronously appends needle images to the appropriate physical volume and updates the in-memory structure.
- Modify (e.g., when a photo is rotated):
  - The new version is either written to a new logical volume, requiring a metadata update by the Directory,
  - or written to the same physical volume at a higher offset.
- Delete and Compact:
  - Delete sets the deleted flag both in memory and on disk, synchronously.
  - Compaction writes the whole volume file into a new one, skipping deleted photos (about 25% of photos get deleted); see the sketch below.
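Compaction can be sketched under the toy structures used earlier: copy every live needle into a fresh file and rebuild the index with the new offsets, reclaiming the space of deleted photos. The function and parameter names are illustrative.

```python
def compact(old_path, new_path, index, deleted):
    new_index = {}
    with open(old_path, "rb") as old, open(new_path, "wb") as new:
        # walk entries in on-disk order
        for photo_id, (offset, size) in sorted(index.items(),
                                               key=lambda kv: kv[1][0]):
            if photo_id in deleted:
                continue                      # skip deleted photos
            old.seek(offset)
            new_index[photo_id] = (new.tell(), size)
            new.write(old.read(size))
    return new_index                          # swap in atomically afterwards
```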
Recovery from Failures: The Arsenal
- Replication.
- RAID-6.
- pitchfork, a background process that:
  - tests connections to store machines,
  - checks the availability of volume files,
  - attempts to read data from store machines.
- When a check fails, the affected logical volumes are marked read-only; diagnosis and fixing are done offline.
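A toy pitchfork-style loop; the check and repair callbacks and the probe interval are assumptions standing in for the real health-check service:

```python
import time

def pitchfork(stores, check_conn, check_volumes, check_read, mark_read_only):
    while True:
        for store in stores:
            ok = (check_conn(store)        # connection test
                  and check_volumes(store) # volume files available?
                  and check_read(store))   # can we actually read data?
            if not ok:
                mark_read_only(store)      # humans diagnose and fix offline
        time.sleep(60)                     # assumed probe interval
```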
Reference
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10). USENIX Association, Berkeley, CA, USA, 1-8.