Quantcast: Petabyte Storage at Half Price with QFS
Presented by Silvius Rus, Director, Big Data Platforms
September 2013
Quantcast File System (QFS)
- A high-performance alternative to the Hadoop Distributed File System (HDFS).
- Manages multi-petabyte Hadoop workloads with significantly faster I/O than HDFS while using only half the disk space.
- Offers major cost savings to large-scale Hadoop users (fewer disks = fewer machines).
- Production hardened at Quantcast under massive processing loads (multiple exabytes).
- Fully compatible with Apache Hadoop.
- 100% open source.
Quantcast Technology Innovation Timeline (2006-2013)
Milestones: Quantcast Measurement launched; Quantcast Advertising launched; started using Hadoop; began using and sponsoring KFS; launched QFS; turned off HDFS.
Data received grew from 1 TB/day to 10, 20, then 40 TB/day; daily processing grew from 1 PB/day to 10 and then 20 PB/day.
Architecture
Client: implements the high-level file interface (read/write/delete). On write, it Reed-Solomon (RS) encodes chunks and distributes stripes to nine chunk servers; on read, it collects RS stripes from six chunk servers and recomposes the chunk. Clients read and write RS-encoded data directly from/to chunk servers.
Metaserver: maps /file/paths to chunk IDs, manages chunk locations, directs clients to chunk servers (locating or allocating chunks), and sends chunk replication and rebalancing instructions to chunk servers, which copy or recover chunks accordingly.
Chunk server: handles I/O to locally stored 64 MB chunks, monitors host file system health, and replicates and recovers chunks as the metaserver directs. Chunk servers are spread across racks (Rack 1, Rack 2 in the diagram).
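Because the client speaks the standard Hadoop FileSystem API, applications can reach QFS the same way they reach HDFS. A minimal sketch, assuming the QFS Hadoop bindings are on the classpath; the plugin class name, property key, metaserver host, and port 20000 used below are illustrative assumptions to adjust for a real deployment:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed plugin class and property key from the QFS Hadoop bindings.
            conf.set("fs.qfs.impl", "com.quantcast.qfs.hadoop.QuantcastFileSystem");

            // The metaserver resolves paths to chunks; chunk servers do the actual I/O.
            FileSystem fs = FileSystem.get(URI.create("qfs://metaserver.example.com:20000/"), conf);

            Path path = new Path("/tmp/qfs-hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                // The client RS-encodes and stripes this write across nine chunk servers.
                out.write("hello, qfs".getBytes("UTF-8"));
            }

            try (FSDataInputStream in = fs.open(path)) {
                // The client reassembles the data from six chunk servers on read.
                byte[] buf = new byte[16];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, "UTF-8"));
            }
            fs.close();
        }
    }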
QFS vs. HDFS
Broadly comparable feature set, with significant storage efficiency advantages.

    Feature                                                                 QFS                     HDFS
    Scalable, distributed storage designed for efficient batch processing   yes                     yes
    Open source                                                             yes                     yes
    Hadoop compatible                                                       yes                     yes
    Unix-style file permissions                                             yes                     yes
    Error recovery mechanism                                                Reed-Solomon encoding   Multiple data copies
    Disk space required (as a multiple of raw data)                         1.5x                    3x
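The disk-space row follows directly from the two redundancy schemes; a quick derivation, using the six-data-plus-three-parity stripe layout detailed on the next slide:

    \text{QFS (RS 6+3):}\quad \frac{\text{data} + \text{parity}}{\text{data}} = \frac{6 + 3}{6} = 1.5\times
    \qquad
    \text{HDFS (3-way replication):}\quad \frac{3\,\text{copies}}{1\,\text{copy}} = 3\times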
Reed-Solomon Error Correction: Leveraging High-Speed Modern Networks
HDFS optimizes for data locality, a design suited to older, slower networks. 10 Gbps networks are now common, which makes disk I/O the more critical bottleneck; QFS leverages the faster network for better parallelism and encoding efficiency.
1. Break the original data into 64 KB stripes.
2. Reed-Solomon generates three parity stripes for every six data stripes.
3. Write those nine stripes to nine different drives.
4. Up to three stripes can become unreadable...
5. ...yet the original data can still be recovered.
Every write is parallelized across 9 drives, every read across 6.
Result: higher error tolerance and faster performance with half the disk space.
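A minimal sketch of the 6+3 striping layout described in the steps above, assuming a 64 KB stripe size and a 64 MB chunk; the Galois-field arithmetic that produces real Reed-Solomon parity is intentionally left as a placeholder (buildParity), and all names here are illustrative:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class StripeLayoutSketch {
        static final int STRIPE_SIZE = 64 * 1024; // 64 KB stripes
        static final int DATA_STRIPES = 6;        // six data stripes per RS block
        static final int PARITY_STRIPES = 3;      // plus three parity stripes

        // Split a chunk into fixed-size stripes, zero-padding the tail.
        static List<byte[]> toStripes(byte[] chunk) {
            List<byte[]> stripes = new ArrayList<>();
            for (int off = 0; off < chunk.length; off += STRIPE_SIZE) {
                stripes.add(Arrays.copyOfRange(chunk, off, off + STRIPE_SIZE));
            }
            return stripes;
        }

        // Placeholder: real Reed-Solomon parity is computed with GF(256) arithmetic.
        static byte[][] buildParity(List<byte[]> dataStripes) {
            return new byte[PARITY_STRIPES][STRIPE_SIZE];
        }

        public static void main(String[] args) {
            byte[] chunk = new byte[64 * 1024 * 1024]; // one 64 MB chunk
            List<byte[]> stripes = toStripes(chunk);

            // Every group of 6 data stripes gets 3 parity stripes; the 9 resulting
            // stripes go to 9 different drives, so any 3 losses are recoverable.
            for (int i = 0; i < stripes.size(); i += DATA_STRIPES) {
                List<byte[]> group = stripes.subList(i, Math.min(i + DATA_STRIPES, stripes.size()));
                byte[][] parity = buildParity(group);
                System.out.printf("RS block %d: %d data + %d parity stripes -> 9 drives%n",
                        i / DATA_STRIPES, group.size(), parity.length);
            }
        }
    }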
MapReduce on 6+3 Erasure-Coded Files versus 3x-Replicated Files
Positives:
- Writing costs half as much, in both space and time.
- Any 3 broken or slow devices are tolerated, versus any 2 with 3-way replication.
- Re-executed stragglers run faster because striping lets them read from multiple devices.
Negatives:
- There is no data locality; reads always go over the network.
- On read failure, recovery is needed, but it is lightning fast on modern CPUs (about 2 GB/s per core).
- Writes don't achieve network line rate because the original plus parity data is written by a single client.
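A back-of-the-envelope sketch of the write-cost tradeoff per GB of user data, assuming the HDFS client pushes one copy while the replication pipeline forwards the other two between datanodes, and the QFS client computes and ships the parity itself, as described above:

    public class WriteCostSketch {
        public static void main(String[] args) {
            double userGb = 1.0;

            // 3-way replication: every byte lands on disk three times, but the client
            // only pushes one copy; datanodes forward the replicas down the pipeline.
            double hdfsDisk   = 3.0 * userGb;
            double hdfsClient = 1.0 * userGb;

            // RS 6+3: six data stripes plus three parity stripes, all written by the client.
            double qfsDisk   = (6 + 3) / 6.0 * userGb;  // 1.5 GB on disk
            double qfsClient = qfsDisk;                 // 1.5 GB leaves the client NIC

            System.out.printf("HDFS 3x : %.1f GB on disk, %.1f GB sent by client%n", hdfsDisk, hdfsClient);
            System.out.printf("QFS 6+3 : %.1f GB on disk, %.1f GB sent by client%n", qfsDisk, qfsClient);
            // Half the disk and write time cluster-wide, but the single writing client
            // moves 1.5x the user data, which is why writes don't reach network line rate.
        }
    }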
Read/Write Benchmarks
End-to-end 20 TB write and read tests, run as Hadoop MapReduce jobs with 8,000 workers at 2.5 GB each. The chart compares end-to-end times in minutes, for writes and reads, across three configurations: HDFS 64 MB, HDFS 2.5 GB, and QFS 64 MB.
Host network behavior during the tests:
- QFS write = 1/2 the disk I/O of an HDFS write
- QFS write: network/disk ratio = 8/9
- HDFS write: network/disk ratio = 6/9
- QFS read: network/disk ratio = 1
- HDFS read: network/disk ratio = very small
Metaserver Performance
Benchmark on an Intel E5-2670 with 64 GB RAM and a namespace of 70 million directories, comparing QFS and HDFS on stat, rmdir, mkdir, and ls operations. (Chart axis: operations per second, in thousands, from 0 to 300.)
Production Hardening for Petascale
Continuous I/O balancing: a full feedback loop; the metaserver knows the I/O queue size of every device, and activity is biased toward under-loaded chunk servers.
Optimization: direct I/O = a short feedback loop; direct I/O plus fixed buffer space = predictable RAM and storage-device usage; C++ with its own memory allocation and layout; vector instructions for Reed-Solomon coding.
Operations: hibernation, evacuation through recovery, continuous space/consistency rebalancing, monitoring and alerts.
Use Case: Quantsort, All I/O over QFS (http://qc.st/qcquantsort)
Concurrent append: 10,000 writers append to the same file at once.
Largest sort = 1 PB. Daily volume = 1 to 2 PB, max = 3 PB.
Use Case: Fast Broadcast through Wide Striping
Broadcast time by configuration:
- HDFS default: 94.5 s
- HDFS small blocks: 16.7 s
- QFS on disk: 8.5 s
- QFS in RAM: 4.8 s
Refreshingly Fast Command-Line Tool
hadoop fs -ls / versus qfs ls /
HDFS time: 700 ms. QFS time: 7 ms.
How Well Does It Work
Reliable at scale:
- Hundreds of days of metaserver uptime are common.
- The Quantcast MapReduce sorter uses QFS as a distributed, virtualized store instead of local disk.
- 8 petabytes of compressed data, close to 1 billion chunks, 7,500 I/O devices.
Fast and large:
- Ran a petabyte sort last weekend.
- Direct I/O does not hurt fast scans; Sawzall query performance on Turbo/QFS is similar to Presto/HDFS:

                  Presto/HDFS   Turbo/QFS
      Seconds     16            16
      Rows        920 M         970 M
      Bytes       31 G          294 G
      Rows/sec    57.5 M        60.6 M
      Bytes/sec   2.0 G         18.4 G

Easy to use:
- 1 ops engineer for QFS and MapReduce on a 1,000+ node cluster.
- Neustar set up a multi-petabyte instance without help from Quantcast.
- Migrate from HDFS using hadoop distcp.
- Hadoop MapReduce just works on QFS (see the sketch after this list).
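A driver-only sketch of "Hadoop MapReduce just works on QFS", using Hadoop's default identity Mapper and Reducer so only the qfs:// configuration matters; the binding class, property names, metaserver host, and port below are assumptions to adapt, not Quantcast's actual job setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class QfsJobSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.qfs.impl", "com.quantcast.qfs.hadoop.QuantcastFileSystem"); // assumed binding class
            conf.set("fs.defaultFS", "qfs://metaserver.example.com:20000");          // assumed metaserver host:port

            // Default (identity) Mapper/Reducer: the point is the qfs:// input/output paths.
            Job job = Job.getInstance(conf, "mapreduce-on-qfs");
            job.setJarByClass(QfsJobSketch.class);
            FileInputFormat.addInputPath(job, new Path("/data/in"));   // resolves against fs.defaultFS
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Migration works the same way: once the bindings are in place, hadoop distcp can copy data between hdfs:// and qfs:// paths like any other Hadoop file systems.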
Metaserver Statistics in Production
QFS metaserver statistics over Quantcast production file systems in July 2013.
High availability is nice to have but not a must-have for MapReduce; there are certainly other use cases where high availability is a must.
Federation may be needed to support file systems beyond 10 PB, depending on file size.
Who Will Find QFS Valuable?
Likely to benefit from QFS:
- Existing Hadoop users with large-scale data clusters.
- Data-heavy, tech-savvy organizations for whom performance and efficient use of hardware are high priorities.
May find HDFS a better fit:
- Small or new Hadoop deployments, as HDFS has been deployed in a broader variety of production environments.
- Clusters with slow or unpredictable network connectivity.
- Environments needing specific HDFS features such as head-node federation or hot standby.
Summary: Key Benefits of QFS
- A stable, high-performance alternative to HDFS in a production-hardened 1.0 release.
- High-performance management of multi-petabyte workloads.
- Faster I/O than HDFS with half the disk space.
- Fully compatible with Apache Hadoop.
- 100% open source.
Future Work: What QFS Doesn't Have Just Yet
- Kerberos security: under development.
- High availability (HA): no strong case at Quantcast, but nice to have.
- Federation: not a strong case at Quantcast either.
Contributions welcome.
Thank You. Questions?
Download QFS for free at: github.com/quantcast/qfs
San Francisco: 201 Third Street, San Francisco, CA 94103
New York: 432 Park Avenue South, New York, NY 10016
London: 48 Charlotte Street, London, W1T 2NS