Comparison of Data Durability in Parity Declustered RAID & HDFS Replicated Systems
Dimitar Vlassarev
Cloud Modeling & Data Analytics, Seagate Technology
RMACC 4th High Performance Computing Symposium, Boulder, CO, August 12-13, 2014
Introduction
- HPC simulations generate a lot of data that are amenable to analysis in the Hadoop ecosystem
- This requires data migration from the HPC file system (FS) to HDFS (the Hadoop Distributed File System)
- Can data migration be avoided and Hadoop run on the HPC FS?
- Meet Lustre, a high-performance parallel file system that is the mainstay of many HPC systems
- Hadoop on a Lustre FS should make it possible to avoid data migration
- Main question of interest: how do the performance and data durability of Lustre compare with those of HDFS for Hadoop applications?
- This talk focuses on data durability
HDFS Replication and Re-Replication
- HDFS breaks each file into blocks of a certain size (128MB is the current default; earlier 64MB) and stores replicas (three by default) of each block on different nodes
- Two replicas are stored in the same rack (using rack-awareness) so as to enhance read performance; the remaining replica helps ensure availability
- If a node is down and doesn't send a heartbeat to the NameNode, re-replication of the lost blocks is triggered and copies are made on the remaining DataNodes (a parallel rebuild that scales with increasing cluster size)
Pictures courtesy of Apache HDFS & bradheadlund.com
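The rack-aware placement described above can be sketched as a toy function. This is an illustration of the slide's "two replicas in one rack, one elsewhere" rule, not the actual HDFS BlockPlacementPolicy; the function name and signature are invented for the sketch.

```python
import random

def place_replicas(racks, writer_rack, rng=None):
    # Toy rack-aware placement: one replica stays on the writer's rack,
    # the other two go together onto a single remote rack, so two of the
    # three copies share a rack (read locality) and one is elsewhere
    # (availability). Not the real HDFS policy class.
    rng = rng or random.Random(0)
    remote = rng.choice([r for r in racks if r != writer_rack])
    return [writer_rack, remote, remote]
```

Under this rule, losing a whole rack can remove at most two of the three copies of any block.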
Parity Declustered (PD) RAID with Spare Blocks for Lustre

RAID-5:

  Disk 1  Disk 2  Disk 3  Disk 4  Disk 5
  D0.0    D0.1    D0.2    P0      S0
  D1.0    D1.1    P1      D1.2    S1
  D2.0    P2      D2.1    D2.2    S2
  P3      D3.0    D3.1    D3.2    S3

Parity Declustered RAID-5:

  Disk 1  Disk 2  Disk 3  Disk 4  Disk 5
  D0.0    D0.1    D0.2    P0      S0
  D1.0    D1.1    P1      S1      D1.2
  D2.0    P2      S2      D2.2    D2.1
  P3      S3      D3.1    D3.2    D3.0

- Involves stretching the record size across many drives while permuting the blocks of the original RAID configuration (N+M, where N is the number of data blocks and M the number of parity blocks)
- Failure of a drive leads to independent rebuild of its blocks, rather than rebuild of all blocks onto a single device as in RAID; it involves partial data extraction from all remaining drives and writes to dispersed spare blocks, decreasing rebuild time drastically (parallel rebuild)
- The layouts above show rebuild examples for RAID-5 and parity-declustered RAID-5 (3+1+1), with spare blocks S_i spread out among all disks
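The random-placement assumption used later in the durability model can be illustrated with a minimal layout generator. This is a sketch, not a real declustering algorithm: each (data + parity + spare)-wide stripe is simply mapped onto a random subset of the disks.

```python
import random

def declustered_layout(n_stripes, stripe_width, n_disks, seed=0):
    # Map each stripe onto a random set of `stripe_width` distinct disks,
    # so a single drive failure touches only a fraction of the stripes
    # and every surviving drive can contribute to the rebuild.
    rng = random.Random(seed)
    return [rng.sample(range(n_disks), stripe_width)
            for _ in range(n_stripes)]
```

With enough stripes, the blocks of any one failed disk are spread roughly evenly over all the others, which is what makes the parallel rebuild fast.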
Markov Model Framework
States and transitions:
  Healthy -> (1 HDD failed) at rate T*λ
  (1 HDD failed) -> (2 HDDs failed) at rate (T-1)*λ
  (2 HDDs failed) -> (3 HDDs failed) at rate (T-2)*λ
  (3 HDDs failed) -> (4 HDDs failed) at rate (T-3)*λ
  From each degraded state i: repair back to state i-1 at rate µ*(1-D_i), or data loss at rate µ*D_i
Definitions:
  λ - HDD failure rate (hrs^-1)
  µ - HDD repair rate (hrs^-1)
  T - number of HDDs
  D_i - probability of repair failure (data loss) with i failed drives
* Repair of hard drives occurs simultaneously, in parallel
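The chain above can be integrated numerically to get a system-level survival probability. The sketch below is not the authors' code: it uses simple Euler integration, states 0..n count failed HDDs, and one extra absorbing state collects data loss.

```python
def survival_probability(T, lam, mu, D, hours, dt=1.0):
    # T: number of HDDs; lam, mu: failure/repair rates (per hour);
    # D: list of repair-failure probabilities D_1..D_n for the
    # degraded states. Returns P(no data loss) after `hours`.
    n = len(D)                       # number of degraded states
    p = [1.0] + [0.0] * (n + 1)      # start Healthy; last entry = loss
    for _ in range(int(hours / dt)):
        q = p[:]
        for i in range(n):           # failures: state i -> i+1
            flow = (T - i) * lam * p[i] * dt
            q[i] -= flow
            q[i + 1] += flow
        for i in range(1, n + 1):    # repairs and repair failures
            ok = mu * (1.0 - D[i - 1]) * p[i] * dt
            loss = mu * D[i - 1] * p[i] * dt
            q[i] -= ok + loss
            q[i - 1] += ok
            q[n + 1] += loss
        p = q
    return 1.0 - p[n + 1]
```

With D_i = 0 everywhere the chain can never reach the loss state, so the survival probability is exactly 1; raising any D_i lowers it.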
Repair Failure Probability for Parity Declustered (PD) RAID
D(N, M, j), the probability of data loss for a particular block when j drives fail, for an underlying RAID (N+M) with T total drives in the PD-RAID set, is:

  D(N, M, j) = Σ_{k=0}^{min(j, N+M)} p_k * D_k

where the probability of losing a block due to placement is given by:

  p_k = C(N+M, k) * C(T-(N+M), j-k) / C(T, j)

the probability of losing a block due to bit-rot is given by:

  D_br = 1 - (1 - UER)^(block_size in bits)

and D_k = D_br^(M-k+1) for k ≤ M, D_k = 1 for k > M.

D_j, the probability of data loss (at each PD-RAID level) due to j failed drives, is:

  D_j = 1 - (1 - D(N, M, j))^(num_blocks_per_grid)

- Random placement of blocks is assumed here
- All blocks are of uniform size
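The PD-RAID formulas above can be sketched directly in code; this is a reading of the slide's equations, not vendor code, and it assumes random uniform placement of (N+M)-wide stripes over T drives.

```python
from math import comb

def d_bitrot(uer, block_size_bits):
    # D_br: probability a single block is unreadable due to bit-rot
    return 1.0 - (1.0 - uer) ** block_size_bits

def d_block(N, M, T, j, d_br):
    # D(N, M, j): loss probability for one stripe when j drives fail
    total = 0.0
    for k in range(0, min(j, N + M) + 1):
        # p_k: exactly k of the j failed drives hit this stripe
        # (hypergeometric placement probability)
        p_k = comb(N + M, k) * comb(T - (N + M), j - k) / comb(T, j)
        # with k drive failures, M - k + 1 further bit-rot errors
        # exhaust the parity; for k > M the stripe is already lost
        d_k = d_br ** (M - k + 1) if k <= M else 1.0
        total += p_k * d_k
    return total

def d_level(N, M, T, j, d_br, num_blocks_per_grid):
    # D_j: probability of any data loss at this PD-RAID level
    return 1.0 - (1.0 - d_block(N, M, T, j, d_br)) ** num_blocks_per_grid
```

With D_br = 0, an 8+2 stripe survives any 2 drive failures, and a 3-drive failure loses a stripe only if all three hit the same stripe.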
Rebuild Time for Parity Declustered (PD) RAID
- When a drive fails, all remaining drives are used to rebuild the blocks of data (nblocks) in the failed drive
- This involves three operations:
  - Reading (N * nblocks) blocks from the remaining hard drives
  - Reconstructing the lost nblocks
  - Writing the nblocks to all the remaining hard drives
- So the time to repair is T_repair = max(T_read, T_reconstruct, T_write), where:
  - T_read = (N * nblocks * block_size) / (read_speed * remaining_hdds)
  - T_write = (nblocks * block_size) / (write_speed * remaining_hdds)
  - T_reconstruct needs to be modeled (set to zero here)
- read_speed and write_speed are the speeds to read from / write to a disk (assumed constant here)
- So the repair rate is µ = 1 / T_repair
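The repair-time model above is a one-liner per phase; the sketch below follows the slide, sets T_reconstruct to zero, and assumes consistent units (e.g. MB and MB/s).

```python
def pd_repair_rate(N, nblocks, block_size, read_speed, write_speed,
                   remaining_hdds):
    # T_read: pulling N * nblocks blocks spread over all surviving drives
    t_read = N * nblocks * block_size / (read_speed * remaining_hdds)
    # T_write: scattering the rebuilt nblocks to dispersed spare blocks
    t_write = nblocks * block_size / (write_speed * remaining_hdds)
    t_repair = max(t_read, t_write, 0.0)   # 0.0 stands in for T_reconstruct
    return 1.0 / t_repair                  # mu = 1 / T_repair
```

Because both phases divide by remaining_hdds, the repair rate µ grows roughly linearly with the number of surviving drives, which is the parallel-rebuild benefit.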
Repair Failure Probability for HDFS
D(N, j), the probability of data loss for a particular block when j drives fail, with T total drives and an N-replica strategy in HDFS, is:

  D(N, j) = Σ_{k=0}^{min(j, N)} p_k * D_k

where the probability of losing a block due to placement is given by:

  p_k = C(N, k) * C(T-N, j-k) / C(T, j)

the probability of losing a block due to bit-rot is given by:

  D_br = 1 - (1 - UER)^(block_size in bits)

and D_k = D_br^(N-k).

D_j, the probability of data loss due to j failed drives, is:

  D_j = 1 - (1 - D(N, j))^(total_num_blocks)
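The HDFS counterpart mirrors the PD-RAID computation; again this is a reading of the slide's equations rather than reference code. A block is lost only when all N copies are gone: k on failed drives plus N - k to bit-rot.

```python
from math import comb

def hdfs_d_block(N, T, j, d_br):
    # D(N, j): loss probability for one N-way replicated block
    total = 0.0
    for k in range(0, min(j, N) + 1):
        # p_k: exactly k of the block's N replicas sit on failed drives
        p_k = comb(N, k) * comb(T - N, j - k) / comb(T, j)
        # the remaining N - k replicas must all be lost to bit-rot
        total += p_k * d_br ** (N - k)
    return total

def hdfs_d_j(N, T, j, d_br, total_num_blocks):
    # D_j: probability of any data loss given j failed drives
    return 1.0 - (1.0 - hdfs_d_block(N, T, j, d_br)) ** total_num_blocks
```

With D_br = 0 and N = 3, a 2-drive failure can never lose a block, while a 3-drive failure loses a block only if all three replicas land on the failed drives.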
Rebuild Time for HDFS

  Repair_Speed = (2/3) * (2 racks * n_DNs) * min(nwbw_DN-TOR * 0.93, (1/2) * HDDspeed * n_HDD/DN)
               + (1/3) * n_Racks * min(nwbw_TOR-TOR * 0.93, (1/2) * HDDspeed * n_HDD/Rack)

Assumptions for the repair-speed calculation:
- 2/3 of the blocks in a drive have one remaining copy in the same rack and one in another rack
- 1/3 of the blocks in a drive have both remaining copies in another rack
- This analysis is limited to systems on the order of 100,000 drives; beyond that scale, durability will be affected
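The repair-speed expression can be transcribed as a function; parameter names below mirror the slide (DN = DataNode, TOR = top-of-rack switch), and the 0.93 factor is the slide's network-efficiency assumption.

```python
def hdfs_repair_speed(n_dns_per_rack, n_racks, bw_dn_tor, bw_tor_tor,
                      hdd_speed, n_hdd_per_dn, n_hdd_per_rack):
    # 2/3 of blocks: a surviving copy in the same rack (2 racks involved);
    # bounded by either DN-to-TOR bandwidth or node disk throughput
    intra = (2.0 / 3.0) * (2 * n_dns_per_rack) * min(
        bw_dn_tor * 0.93, 0.5 * hdd_speed * n_hdd_per_dn)
    # 1/3 of blocks: both surviving copies on another rack;
    # bounded by either TOR-to-TOR bandwidth or rack disk throughput
    inter = (1.0 / 3.0) * n_racks * min(
        bw_tor_tor * 0.93, 0.5 * hdd_speed * n_hdd_per_rack)
    return intra + inter
```

The min() terms capture the bottleneck choice: rebuilds are network-bound when the switch links are slower than the aggregate disk throughput behind them, and disk-bound otherwise.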
Results: 1st Year System Level Durability Table

  No. of Drives (Capacity) | RAID6 (8+2)             | RAID6 (4+2)              | PD-RAID (8+2)            | HDFS (3 Replicas)
  Storage Overhead         | 25%                     | 50%                      | 25%                      | 200%
  82 (~0.3 PB)             | 4 nines, ~0.2 PB usable | 5 nines, ~0.15 PB usable | 10 nines, ~0.2 PB usable | 7 nines, ~0.1 PB usable
  492 (~1.9 PB)            | 3 nines, ~1.4 PB usable | 4 nines, ~0.9 PB usable  | 9 nines, ~1.4 PB usable  | 6 nines, ~0.6 PB usable
  1386 (~5.5 PB)           | 3 nines, ~4.1 PB usable | 3 nines, ~2.7 PB usable  | 8 nines, ~4.1 PB usable  | 6 nines, ~1.8 PB usable
  69300 (~277 PB)          | 1 nine, ~207 PB usable  | 2 nines, ~138 PB usable  | 7 nines, ~207 PB usable  | 6 nines, ~92 PB usable

- System-level durability means the probability of not losing any data in 1 year
- Parameter choices: MTBF = 1.4 million hours, speed = 100 MB/s, UER = 1e-15, disk size = 4 TB, block size = 128 MB
- PD-RAID delivers better or comparable system reliability relative to HDFS, but HDFS would beat PD-RAID at scales beyond several hundred PB
Conclusions
- Parity-declustered RAID helps achieve HDFS-level data durability
- At scales beyond several hundred PB, HDFS durability comes out ahead, but the smaller scale is the relevant regime for almost all current HPC systems
- Switching to RAID6 (4+2) could provide a marginal increase in durability, at a 25% increase in overhead over RAID6 (8+2) (still better than the HDFS overhead)
- Bottom line: Parity Declustered RAID is a must to make Lustre attractive for Hadoop applications
- Plus, it also enables a reduction in storage overhead by a factor of ~8x
THANK YOU!!
Synopsis
Comparison of Data Durability in Parity Declustered RAID and HDFS Replicated Systems
Dimitar Vlassarev
Cloud Modeling and Data Analytics, Seagate Technology, 389 Disc Dr., Longmont, CO 80503

Data stored on large-scale high-performance computing systems is in many situations suitable for analysis within the Hadoop ecosystem. To facilitate that analysis efficiently, avoiding data migration from RAID-backed storage systems to the replicated Hadoop File System (HDFS) is essential. For these applications, high-performance parallel file systems like Lustre can offer an appealing alternative to HDFS. Two important considerations in comparing the two systems are their performance and data durability. Here we compare the data durability of HDFS's replication strategy to that of a Parity Declustered RAID backed Lustre system. A Continuous Time Markov Chains model for the data durability of the two systems suggests that the Parity Declustered RAID backed Lustre solution can be as resilient as the replicated HDFS solution.

QUESTIONS??