1 ECE 7650 Scalable and Secure Internet Services and Architecture: A Systems Perspective. Part II: Data Center Software Architecture. Topic 1: Distributed File Systems. Finding a needle in Haystack: Facebook's photo storage
2 Overview Haystack is an object store designed for sharing photos on Facebook, where data is written once, read often, never modified, and rarely deleted. Using a network-attached storage (NAS) appliance mounted over NFS has disadvantages in how metadata is handled. For the Photos application most of this metadata is unused, and the more significant cost is that a file's metadata must be read from disk into memory in order to find the file itself. Multiplied over billions of photos, accessing metadata becomes the throughput bottleneck: disk I/Os spent on metadata are the limiting factor for read throughput. Haystack serves photos quickly using at most one disk operation per read, which is possible because it keeps all metadata in main memory. Acknowledgment: A few slides are adapted from slides made by Ewa Syta.
3 Facebook Photos in Numbers The biggest photo-sharing website in the world. For each photo, Facebook generates and stores four images. Currently stores over 260 billion images, which is over 20 petabytes of data. One billion new photos are uploaded each week. Serves over one million images per second at peak.
4 How Are Facebook Photos Used? Profile pictures and recently uploaded pictures: very frequently accessed right after being uploaded; likely to be accessed by different users; more likely to be deleted; likely to be cached. Long tail (album photos and older photos): less popular but still frequently accessed; often requested in a sequence by the same user; likely not to be in cache and to be retrieved from the storage hosts.
5 Typical Design How web servers, content delivery networks (CDNs), and storage systems interact to serve photos on a popular site. An HTTP request is sent to a web server. The web server generates the markup for the browser to render; for each image it constructs a URL directing the browser to a location from which to download the data. This URL points to a CDN; if the CDN has the image cached, the CDN responds with the data. Otherwise the CDN, via information embedded in the URL, retrieves the photo from the site's storage system. The CDN then updates its cache and sends the image to the user's browser.
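The hit-or-miss behavior of the CDN in this flow can be sketched in a few lines of Python. This is a toy model, not how any real CDN is implemented: the `CDN` class, the `fetch_from_storage` callback, and the example URL are all hypothetical names introduced for illustration.

```python
class CDN:
    """Toy CDN: serve from cache, otherwise fetch from the origin storage."""

    def __init__(self, fetch_from_storage):
        self.cache = {}
        self.fetch_from_storage = fetch_from_storage  # hypothetical origin callback
        self.origin_fetches = 0

    def get(self, url):
        if url in self.cache:               # cache hit: respond directly
            return self.cache[url]
        self.origin_fetches += 1            # cache miss: go to the site's storage
        data = self.fetch_from_storage(url)
        self.cache[url] = data              # update the cache ...
        return data                         # ... and send the image to the browser


storage = {"/photos/42_small.jpg": b"jpeg-bytes"}  # stand-in for the storage system
cdn = CDN(fetch_from_storage=storage.__getitem__)
first = cdn.get("/photos/42_small.jpg")    # miss: fetched from storage
second = cdn.get("/photos/42_small.jpg")   # hit: served from the CDN cache
```

Only the first request reaches the storage system; every request for a photo that is not cached (the long tail) pays the full storage cost.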
6 NFS-based Design Stores each photo in its own file on a set of commercial NAS appliances. Photo Store servers mount all the volumes exported over NFS. A Photo Store server: processes HTTP requests for images; extracts the volume and full path to the file from an image's URL; reads the data over NFS; returns the result to the CDN.
7 NFS Design Issues Thousands of files are stored in each directory of an NFS volume. Reading even a single image takes an excessive number of disk operations because of metadata lookups. Most of the metadata is not used for photos, which wastes storage capacity. Disk read operations are required just to find the file itself. It was common to incur more than 10 disk operations to retrieve a single image. The key problem: accessing metadata is the throughput bottleneck.
8 NFS Design Issues (contd) Two attempts to reduce disk operations: First, reducing directory sizes to hundreds of images per directory. This decreases the disk operations to 3: I. read the directory metadata into memory, II. load the inode into memory, III. read the file contents. Second, letting the Photo Store servers explicitly cache file handles returned by the NAS appliances. Focusing only on caching has limited impact on reducing disk operations: the storage system ends up processing the long tail of requests for less popular photos, which are not available in the CDN and are thus likely to miss in the Photo Store servers' cache as well. Using disk I/Os for metadata remains the limiting factor for read throughput.
9 Why Build a Custom Storage System? Traditional filesystems perform poorly under Facebook's workload. Existing systems lack the right RAM-to-disk ratio. Would having enough main memory to cache all the filesystem metadata help? No, it is not cost-effective in this approach: one photo corresponds to one file, and each file requires at least one inode, which is hundreds of bytes large. Serving photo requests in the long tail therefore remains a problem. Facebook decided to build a custom storage system that reduces the amount of filesystem metadata per photo, so that keeping all of it in main memory becomes dramatically more cost-effective.
10 Haystack Haystack is designed to achieve four main goals. High throughput and low latency: keep up with users' requests and facilitate a good user experience by serving photos quickly. Fault-tolerant: users should not experience errors despite the inevitable server crashes and hard drive failures. Cost-effective: measured by cost per terabyte of usable storage and by read rate normalized for each terabyte of usable storage. Simple: straightforward design.
11 Design Use a CDN to serve popular images. Leverage Haystack to respond efficiently to photo requests in the long tail. Reduce the memory used for filesystem metadata: store multiple photos in a single file and therefore maintain very large files. Two kinds of metadata: Application metadata: information needed to construct a URL for the browser. Filesystem metadata: data necessary to retrieve a photo from a host's disk.
12 Design (contd) Three core components: Haystack Directory, Haystack Cache, and Haystack Store. (The three components fit into the canonical interactions between a user's browser, web server, CDN, and storage system.) The browser can be directed to either the CDN or the Cache. Having an internal caching infrastructure reduces the dependence on external CDNs.
13 Haystack Directory Provides a mapping from logical volumes to physical volumes. Web servers use this mapping when uploading photos and when constructing the image URLs for a page request. id>/<logical volume, Photo> The URL contains several pieces of information, each piece corresponding to a step in the sequence from when a user's browser contacts the CDN (or Cache) to ultimately retrieving a photo from a machine in the Store. Load-balances writes across logical volumes and reads across physical volumes. Determines whether a photo request should be handled by the CDN or by the Cache. Identifies read-only logical volumes: a machine's volumes are marked read-only when it exhausts its capacity or for operational reasons. The Directory stores its information in a replicated database accessed via a PHP interface that leverages memcache to reduce latency.
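A minimal sketch of the Directory's bookkeeping, assuming uniform random choice as the balancing policy (the slides do not say how the balancing is actually done). The `Directory` class and all of its method names are hypothetical.

```python
import random


class Directory:
    """Toy Directory: logical volume -> replica machines, plus read-only flags."""

    def __init__(self, seed=None):
        self.replicas = {}       # logical volume id -> list of Store machine ids
        self.read_only = set()   # logical volume ids that no longer accept writes
        self.rng = random.Random(seed)

    def add_volume(self, logical_id, machines):
        self.replicas[logical_id] = list(machines)

    def mark_read_only(self, logical_id):
        self.read_only.add(logical_id)

    def pick_volume_for_write(self):
        """Spread uploads across the write-enabled logical volumes (assumed policy)."""
        writable = [v for v in self.replicas if v not in self.read_only]
        if not writable:
            raise RuntimeError("no write-enabled logical volume")
        return self.rng.choice(writable)

    def pick_machine_for_read(self, logical_id):
        """Spread reads across the physical replicas of one logical volume."""
        return self.rng.choice(self.replicas[logical_id])


directory = Directory(seed=0)
directory.add_volume(1, ["store-a", "store-b", "store-c"])
directory.add_volume(2, ["store-d", "store-e", "store-f"])
directory.mark_read_only(1)                      # e.g. volume 1 is full
write_target = directory.pick_volume_for_write()
read_source = directory.pick_machine_for_read(1)  # read-only volumes still serve reads
```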
14 Haystack Cache Functions as an internal CDN. Organized as a distributed hash table that uses a photo's id as the key to locate cached data. It caches a photo only if two conditions hold. First, the request comes directly from a user and not from the CDN, because post-CDN caching is ineffective. Second, the photo is fetched from a write-enabled Store machine, because photos are most heavily accessed soon after they are uploaded and Store machines perform better when doing either reads or writes, but not both.
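The two caching conditions reduce to a single predicate. The function and parameter names below are hypothetical; only the rule itself comes from the slide.

```python
def should_cache(request_from_cdn: bool, store_is_write_enabled: bool) -> bool:
    """Cache only direct user requests served by a write-enabled Store machine."""
    return (not request_from_cdn) and store_is_write_enabled


# Enumerate all four combinations of the two conditions.
decisions = {
    (from_cdn, writable): should_cache(from_cdn, writable)
    for from_cdn in (False, True)
    for writable in (False, True)
}
```

Only the combination "direct request, write-enabled machine" maps to `True`.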
15 Haystack Store Encapsulates the storage system for photos and manages the filesystem metadata for photos. Organized by physical volumes: for example, 10 terabytes of physical storage can be divided into 100 physical volumes of 100 gigabytes each. Physical volumes on different machines are grouped into logical volumes to mitigate data loss. Manages three operations: read, write, and delete.
16 Haystack Store (contd) Accesses a photo quickly using only the id of the corresponding logical volume and the file offset at which the photo resides. Keystone of the Haystack design: retrieving the filename, offset, and size for a particular photo without needing disk operations. A Store machine keeps open file descriptors for each physical volume and also an in-memory mapping of photo ids to the filesystem metadata critical for retrieving that photo.
17 Physical Volume Layout A Store machine represents a physical volume as a large file consisting of a superblock followed by a sequence of needles. Each needle represents a photo stored in Haystack. Think of a physical volume as a very large file saved as /hay/haystack/<logical volume id>. To retrieve needles quickly, each Store machine maintains an in-memory data structure for each of its volumes, which maps pairs of (key, alternate key) to the corresponding needle's flags, size in bytes, and volume offset.
18 Physical Volume Layout
19 Photo Read When a Cache machine requests a photo, it supplies the logical volume id, key, alternate key, and cookie. The cookie's value is randomly assigned by and stored in the Directory at the time the photo is uploaded; it is used to defeat attacks aimed at guessing valid URLs for photos. The Store machine: looks up the relevant metadata in its in-memory mappings; checks that the photo has not been deleted; seeks to the appropriate offset in the volume file; reads the entire needle from disk; verifies the cookie and the integrity of the data; returns the photo if the checks pass.
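The read steps above can be sketched against an in-memory stand-in for a volume file. The needle encoding here is a simplified assumption made for illustration (a packed header with cookie, key, alternate key, flags, and size, then the data, then a CRC32 of the data); it is not Haystack's actual on-disk format, and `NeedleMeta`, `encode_needle`, and `read_photo` are hypothetical names.

```python
import struct
import zlib
from dataclasses import dataclass

# Assumed, simplified header: cookie, key, alternate key, flags, data size.
HEADER = struct.Struct("<QQIBI")
CHECKSUM = struct.Struct("<I")
FLAG_DELETED = 1


@dataclass
class NeedleMeta:
    flags: int
    size: int     # size of the photo data in bytes
    offset: int   # byte offset of the needle inside the volume


def encode_needle(cookie, key, alt_key, data, flags=0):
    """Serialize one needle: header, photo data, CRC32 of the data."""
    header = HEADER.pack(cookie, key, alt_key, flags, len(data))
    return header + data + CHECKSUM.pack(zlib.crc32(data))


def read_photo(volume, mapping, key, alt_key, cookie):
    """Return the photo bytes, or None if missing, deleted, or failing a check."""
    meta = mapping.get((key, alt_key))            # in-memory lookup, no disk I/O
    if meta is None or meta.flags & FLAG_DELETED:
        return None
    # One seek plus one read covers the entire needle.
    end = meta.offset + HEADER.size + meta.size + CHECKSUM.size
    needle = volume[meta.offset:end]
    stored_cookie, _, _, _, size = HEADER.unpack_from(needle, 0)
    data = needle[HEADER.size:HEADER.size + size]
    (checksum,) = CHECKSUM.unpack_from(needle, HEADER.size + size)
    if stored_cookie != cookie or zlib.crc32(data) != checksum:
        return None                               # bad cookie or corrupted data
    return bytes(data)


volume = bytearray(b"SUPERBLK")                   # stand-in for the superblock
mapping = {}
photo = b"jpeg-bytes"
offset = len(volume)
volume += encode_needle(cookie=0xC0FFEE, key=42, alt_key=1, data=photo)
mapping[(42, 1)] = NeedleMeta(flags=0, size=len(photo), offset=offset)
result = read_photo(volume, mapping, key=42, alt_key=1, cookie=0xC0FFEE)
```

Because the cookie is checked only after the needle has been read, a guessed URL with the wrong cookie still costs a disk read but never returns data.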
20 Photo Write Web servers provide the logical volume id, key, alternate key, cookie, and data to Store machines. Each machine synchronously appends needle images to its physical volume files and updates its in-memory mappings as needed. Volumes are append-only, so a photo can only be modified by adding an updated needle with the same key and alternate key. If the new needle is written to a different logical volume, the Directory updates its application metadata and future requests never fetch the older version. If it is written to the same logical volume, duplicates are distinguished by their offsets: the needle at the highest offset is the latest version.
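The "highest offset wins" rule for updates within one logical volume falls out naturally from an append-only buffer plus a mapping that is overwritten on each append. The sketch below is a toy model with needle headers omitted for brevity; `AppendOnlyVolume` is a hypothetical name.

```python
class AppendOnlyVolume:
    """Toy physical volume: a growing byte buffer plus an in-memory mapping."""

    def __init__(self):
        self.data = bytearray()
        self.mapping = {}   # (key, alt_key) -> (offset, size)

    def append(self, key, alt_key, photo):
        offset = len(self.data)          # needles are only ever added at the end
        self.data += photo
        # A later needle with the same keys has a higher offset and
        # replaces the older entry in the mapping.
        self.mapping[(key, alt_key)] = (offset, len(photo))
        return offset

    def read(self, key, alt_key):
        offset, size = self.mapping[(key, alt_key)]
        return bytes(self.data[offset:offset + size])


volume = AppendOnlyVolume()
first_offset = volume.append(7, 1, b"original")
second_offset = volume.append(7, 1, b"rotated")   # "modify" by appending
latest = volume.read(7, 1)
```

The older bytes remain in the buffer after the update; nothing is rewritten in place, which is why the space has to be reclaimed separately.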
21 Photo Write (contd) Uploading a Photo
22 Photo Delete Very straightforward: sets the delete flag both in the in-memory mapping and, synchronously, in the volume file. Space occupied by deleted needles is lost for some time and reclaimed later via compaction, an online operation that reclaims the space used by deleted and duplicate needles. Needles are copied into a new file, skipping deleted and duplicate entries, and the new file then replaces the current file. The pattern for deletes is similar to that for photo views: young photos are a lot more likely to be deleted. Over the course of a year, about 25% of the photos get deleted.
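Delete and compaction can be sketched over a list of needles kept in append order. This is an illustrative model only: the `Needle` dataclass and the `delete` and `compact` functions are hypothetical, and a Python list stands in for the volume file.

```python
from dataclasses import dataclass


@dataclass
class Needle:
    key: int
    alt_key: int
    data: bytes
    deleted: bool = False


def delete(needles, key, alt_key):
    """Flag the needles of this photo as deleted; their bytes stay in place."""
    for needle in needles:
        if (needle.key, needle.alt_key) == (key, alt_key):
            needle.deleted = True


def compact(needles):
    """Copy live needles into a new list, skipping deleted and duplicate ones."""
    latest = {}
    for position, needle in enumerate(needles):
        latest[(needle.key, needle.alt_key)] = position   # last append wins
    return [
        needle
        for position, needle in enumerate(needles)
        if not needle.deleted and latest[(needle.key, needle.alt_key)] == position
    ]


volume = [
    Needle(1, 1, b"a-v1"),
    Needle(2, 1, b"b"),
    Needle(1, 1, b"a-v2"),   # newer duplicate of the first needle
    Needle(3, 1, b"c"),
]
delete(volume, 2, 1)          # space is not reclaimed yet
compacted = compact(volume)   # new "file" that would replace the old one
```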
23 Index File Store machines maintain an index file for each of their volumes. The index file is a checkpoint of the in-memory data structures used to locate needles efficiently on disk. It allows a Store machine to build its in-memory mappings quickly, shortening restart time.
24 Index File (contd) Layout of Haystack Index file
25 Index File (contd) The index is updated asynchronously, which allows write and delete operations to return faster. Two side effects: First, needles can exist without corresponding index records (orphans). The last record in the index corresponds to the last non-orphan needle in the volume file, so on restart a Store machine only needs to examine the needles after that point. Second, index records do not reflect deleted photos. If a needle turns out to be marked as deleted when it is read, the Store machine updates its in-memory mapping and notifies the Cache.
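A restart that handles orphans might look like the sketch below: load the checkpointed records, then pick up every needle that lies past the last indexed offset. The record shape `(key, alt_key, offset, size)` and the function name are assumptions made for illustration, and a list of needle descriptors stands in for scanning the tail of the volume file.

```python
def rebuild_mapping(index_records, volume_needles):
    """
    Rebuild the in-memory mapping after a restart.

    index_records:  list of (key, alt_key, offset, size) from the index file.
    volume_needles: list of (key, alt_key, offset, size) for every needle in
                    the volume file, in offset order.
    Returns the mapping and the list of orphans that were found.
    """
    mapping = {}
    for key, alt_key, offset, size in index_records:
        mapping[(key, alt_key)] = (offset, size)

    # The last index record marks the last non-orphan needle.
    last_indexed = index_records[-1][2] if index_records else -1
    orphans = [needle for needle in volume_needles if needle[2] > last_indexed]

    for key, alt_key, offset, size in orphans:
        mapping[(key, alt_key)] = (offset, size)
        index_records.append((key, alt_key, offset, size))  # bring the index up to date
    return mapping, orphans


index_file = [(1, 1, 8, 100), (2, 1, 108, 50)]
volume_file = [(1, 1, 8, 100), (2, 1, 108, 50), (3, 1, 158, 70)]  # last needle is an orphan
mapping, orphans = rebuild_mapping(index_file, volume_file)
```

In this model only the needles after the last indexed offset are examined, which is what keeps restart time short.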
26 Filesystem Store machines use XFS, an extent-based filesystem. It does not need much memory to perform random seeks within a large file quickly. The block maps for several contiguous large files can be small enough to be stored in main memory. XFS provides efficient file preallocation, which mitigates fragmentation.
27 Recovery from Failures Haystack needs to tolerate a variety of failures: faulty hard drives, misbehaving RAID controllers, bad motherboards, etc. Two straightforward techniques to tolerate failures: detection and repair. Detection: a background task, pitchfork, periodically checks the availability of each volume file, tests the connection to each Store machine, and attempts to read data. If a check fails, the logical volumes on that Store machine are marked read-only. Repair: occasionally (a few times each month) fixing the problem requires a bulk sync operation, in which the data of a Store machine is reset using the volume files supplied by a replica.
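One pass of a pitchfork-style checker can be sketched as follows. The health check is passed in as a callback because the slides describe what is checked but not how; `pitchfork_pass`, `volumes_on`, and `is_healthy` are hypothetical names.

```python
def pitchfork_pass(machines, volumes_on, read_only, is_healthy):
    """
    Run one health-check pass over the Store machines.

    machines:   iterable of Store machine ids.
    volumes_on: dict mapping machine id -> set of logical volume ids it hosts.
    read_only:  set of logical volume ids, updated in place.
    is_healthy: callback(machine) -> bool; stands in for the connection test,
                volume-file availability check, and trial read.
    Returns the list of machines that failed the check.
    """
    failed = []
    for machine in machines:
        if not is_healthy(machine):
            failed.append(machine)
            read_only.update(volumes_on[machine])   # stop sending writes there
    return failed


machines = ["store-1", "store-2"]
volumes_on = {"store-1": {10, 11}, "store-2": {11, 12}}
read_only = set()
failed = pitchfork_pass(
    machines, volumes_on, read_only, is_healthy=lambda m: m != "store-2"
)
```

Logical volume 11 becomes read-only even though its replica on `store-1` is healthy, since a write must reach every physical volume of a logical volume.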
28 Haystack Optimizations Haystack is an object store designed for sharing photos on Facebook, where data is written once, read often, never modified, and rarely deleted. It keeps all metadata in main memory and requires at most one disk operation per read. The per-photo metadata needed to find a photo on disk is reduced. It replicates each photo in geographically distinct locations. Each usable terabyte costs ~28% less and processes ~4x more reads per second than an equivalent terabyte on a NAS appliance.
29 Haystack Optimizations (contd) Store machines reduce their main memory footprint by 20%: for deleted photos the offset is set to 0, so no in-memory flags are needed, and the supplied cookie is checked after reading a needle from disk, so cookies are not kept in memory. Haystack uses on average 10 bytes of main memory per photo. Each image is scaled to four photos that share the same key (64 bits) and have different alternate keys (32 bits) and different data sizes (16 bits). In addition, there are about 2 bytes per image of overhead due to hash tables, bringing the total for the four scaled photos of the same image to 40 bytes. For comparison, an xfs_inode_t structure in Linux is 536 bytes.
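The 40-byte and 10-byte figures can be reproduced with a short calculation. It assumes that the 64-bit key is stored once per uploaded image and shared by its four scaled photos, which is the reading under which the quoted numbers add up. The aggregate comparison uses the 260 billion stored images quoted earlier in the deck and ignores replication, so it is a rough illustration rather than a measurement.

```python
KEY_BYTES = 64 // 8             # shared by the four scaled photos (assumption)
ALT_KEY_BYTES = 32 // 8         # one per scaled photo
SIZE_BYTES = 16 // 8            # one per scaled photo
HASH_OVERHEAD_BYTES = 2         # hash-table overhead per scaled photo
SCALED_SIZES = 4
XFS_INODE_BYTES = 536           # xfs_inode_t size quoted on the slide
STORED_IMAGES = 260 * 10**9     # stored images quoted earlier in the deck

per_upload = KEY_BYTES + SCALED_SIZES * (
    ALT_KEY_BYTES + SIZE_BYTES + HASH_OVERHEAD_BYTES
)
per_photo = per_upload // SCALED_SIZES
inode_ratio = XFS_INODE_BYTES / per_photo

haystack_total_bytes = STORED_IMAGES * per_photo
inode_total_bytes = STORED_IMAGES * XFS_INODE_BYTES

print(f"per upload: {per_upload} B, per photo: {per_photo} B")
print(f"an inode is {inode_ratio:.1f}x larger than a Haystack entry")
print(f"Haystack metadata: {haystack_total_bytes / 10**12:.1f} TB")
print(f"inode metadata:    {inode_total_bytes / 10**12:.2f} TB")
```

Under these assumptions the in-memory metadata drops from roughly 139 TB of inodes to about 2.6 TB, which is what makes holding all of it in RAM affordable.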
30 Traffic Volume The number of Haystack photos written is 12 times the number of photos uploaded, because each upload is scaled to 4 sizes and each size is saved in 3 locations. Haystack responds to approximately 10% of all photo requests from CDNs. The smaller images account for most of the photos viewed. Reading smaller images is typically a more latency-sensitive operation for Facebook.
31 Evaluation: Directory The Directory balances writes across Store machines very effectively, yielding well-balanced behavior.
32 Evaluation: Cache The Cache is effective in dramatically reducing the read request rate for the machines that would be most affected. The cached photos are relatively recent, which explains the high hit rates of ~80%.
33 Evaluation: Store Two benchmarks: Randomio and Haystress. Haystack delivers 85% of the raw throughput of the device while incurring only 17% higher latency. The Store delivers high read throughput even in the presence of writes.
34 Evaluation: Store (contd) The latency of multi-write operations is fairly low and stable even as the volume of traffic varies dramatically.
35 Questions Haystack sets its goal as having at most one disk operation per read. To this end, it must keep all metadata in main memory. What are the barriers to achieving this objective? (Section 1) What does "long tail" refer to regarding Haystack's workload and access pattern? Why does the long tail make caches in Photo Store servers deployed in front of NAS servers less effective? (Section 2) To retrieve needles quickly, what does the in-memory mapping data structure include? Imagine that each photo is stored as a file in a conventional file system; what metadata is required to locate the photo (or the data)? (Section 3.4) Haystack evidently keeps much smaller metadata. Do you think it is necessary to introduce Haystack's technique into conventional, general-purpose file systems to improve performance? Why? If it is indeed adopted, what is the disadvantage?