Comparison of Data Durability in Parity Declustered RAID & HDFS Replicated Systems




Comparison of Data Durability in Parity Declustered RAID & HDFS Replicated Systems
Dimitar Vlassarev, Cloud Modeling & Data Analytics, Seagate Technology
RMACC 4th High Performance Computing Symposium, Boulder, CO, August 12-13, 2014

Introduction
HPC simulations generate a lot of data that is amenable to analysis in the Hadoop ecosystem
That analysis requires data migration from the HPC file system (FS) to HDFS (the Hadoop Distributed File System)
Can data migration be avoided, with Hadoop run directly on the HPC FS?
Meet Lustre, a high-performance parallel file system that is the mainstay of many HPC systems
Running Hadoop on a Lustre FS should make data migration unnecessary
Main question of interest: how do the performance and data durability of Lustre compare with those of HDFS for Hadoop applications?
This talk focuses on data durability

HDFS Replication and Re-Replication
HDFS breaks each file into blocks of a certain size (128 MB is the current default; earlier it was 64 MB) and stores replicas (three by default) of each block on different nodes
Two replicas are stored in the same rack (using rack-awareness) to enhance read performance; the remaining replica helps ensure availability
If a node goes down and stops sending heartbeats to the NameNode, re-replication of the lost blocks is triggered and copies are made on any of the remaining DataNodes (a parallel rebuild that scales with increasing cluster size)
Pictures courtesy of Apache HDFS & bradheadlund.com

Parity Declustered (PD) RAID with Spare Blocks for Lustre
[Figure: block layouts of a conventional RAID-5 array and a parity-declustered RAID-5 (3+1+1) array across five disks, showing data blocks D_i.j, parity blocks P_i, and spare blocks S_i permuted across Disk 1 through Disk 5]
Involves stretching the stripe out across many drives while permuting the blocks of the original RAID configuration (N+M, where N is the number of data blocks and M the number of parity blocks)
Failure of a drive leads to independent rebuilds of its blocks, instead of rebuilding every block onto a single device as in conventional RAID; the rebuild extracts partial data from all remaining drives and writes to dispersed spare blocks, decreasing rebuild time drastically (parallel rebuild)
The figure shows example rebuilds for RAID-5 and parity-declustered RAID-5 (3+1+1), with the spare blocks S_i spread out among all disks

Markov Model Framework
[Figure: Markov chain with states Healthy, 1 HDD failed, 2 HDDs failed, 3 HDDs failed, 4 HDDs failed; failure transitions with rates Tλ, (T-1)λ, (T-2)λ, (T-3)λ; repair transitions back to Healthy with rates µ(1-D_1) … µ(1-D_4); repair-failure transitions to data loss with rates µD_i]
Definitions:
λ — HDD failure rate (hrs^-1)
µ — HDD repair rate (hrs^-1)
T — number of HDDs
D_i — probability of repair failure (data loss) with i drives failed
* Repair of hard drives occurs simultaneously, in parallel
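The chain above can be integrated numerically to get a first-year durability figure. A minimal pure-Python sketch, under assumptions of mine: the state layout (states 0-4 count failed drives, state 5 absorbs data loss), the 12-hour mean repair time, and the illustrative D_i values are not the talk's parameters.

```python
# Sketch of the slide's CTMC (assumed state layout): states 0-4 count
# failed drives, state 5 is an absorbing data-loss state.  From state i
# a repair returns to Healthy with rate mu*(1-D[i]) or fails with rate
# mu*D[i]; the next drive failure occurs with rate (T-i)*lam.
def one_year_durability(T, lam, mu, D, hours=24 * 365, dt=0.5):
    n = 6
    Q = [[0.0] * n for _ in range(n)]          # transition-rate matrix
    for i in range(1, 5):
        Q[i][0] = mu * (1.0 - D[i])            # successful parallel repair
        Q[i][5] = mu * D[i]                    # repair failure -> data loss
    for i in range(0, 4):
        Q[i][i + 1] = (T - i) * lam            # next drive failure
    for i in range(n):
        Q[i][i] = -sum(Q[i])                   # rows of Q sum to zero
    P = [1.0] + [0.0] * 5                      # start in the Healthy state
    for _ in range(int(hours / dt)):           # forward-Euler dP/dt = P.Q
        P = [P[j] + dt * sum(P[i] * Q[i][j] for i in range(n))
             for j in range(n)]
    return 1.0 - P[5]                          # P(no data loss in one year)

# 82 drives, MTBF = 1.4e6 h, 12 h mean repair, illustrative D_i values
durability = one_year_durability(82, 1.0 / 1.4e6, 1.0 / 12.0,
                                 [0.0, 1e-9, 1e-7, 1e-5, 1e-3])
```

With these rates the drive-failure terms are tiny relative to the repair rate, so the chain spends almost all its time in the Healthy state and the loss probability stays small.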

Repair Failure Probability for Parity Declustered (PD) RAID
D(N,M,j), the probability of data loss for a particular stripe when j drives fail, for an underlying RAID (N+M) with T total drives in the PD-RAID set, is:

D(N,M,j) = sum_{k=0}^{M} D_p(k) * D_br^(M-k+1) + sum_{k=M+1}^{N+M} D_p(k)

where the probability of losing k blocks of a stripe due to placement is given by:

D_p(k) = C(j,k) * C(T-j, N+M-k) / C(T, N+M)

and the probability of losing a block due to bit-rot is given by:

D_br = 1 - (1 - UER)^block_size (in bits)

D_j, the probability of data loss (at the PD-RAID level) due to j failed drives, is:

D_j = 1 - (1 - D(N,M,j))^num_blocks_per_grid

Random placement of blocks is assumed here; all blocks are of uniform size
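The reconstructed formulas above can be evaluated directly. A sketch in Python: the random-placement assumption and the hypergeometric D_p(k) follow the slide, but the function names and the example parameters are mine.

```python
# Hedged sketch of the PD-RAID block-loss probability.  Assumes random
# placement: the N+M blocks of a stripe land on distinct drives chosen
# uniformly from T (hypergeometric overlap with the j failed drives).
from math import comb

def d_br(uer, block_size_bits):
    """D_br: probability one whole block is unreadable due to bit-rot."""
    return 1.0 - (1.0 - uer) ** block_size_bits

def stripe_loss_prob(N, M, j, T, Dbr):
    """D(N,M,j): loss probability for one stripe when j of T drives fail."""
    total = 0.0
    for k in range(0, N + M + 1):
        if k > j or N + M - k > T - j:
            continue                            # impossible placement
        p_k = comb(j, k) * comb(T - j, N + M - k) / comb(T, N + M)
        if k > M:
            total += p_k                        # more erasures than parity
        else:
            total += p_k * Dbr ** (M - k + 1)   # plus M-k+1 bit-rot errors
    return total

def pd_raid_loss_prob(N, M, j, T, Dbr, num_blocks_per_grid):
    """D_j: probability of any data loss when j drives have failed."""
    return 1.0 - (1.0 - stripe_loss_prob(N, M, j, T, Dbr)) ** num_blocks_per_grid

# Example: RAID (8+2) inside an 82-drive PD-RAID set, UER = 1e-15,
# 128 MB blocks
Dbr = d_br(1e-15, 128 * 8 * 2**20)
```

With one failed drive and bit-rot ignored, a (8+2) stripe cannot be lost; loss terms only appear once the number of co-located failures exceeds the parity count M.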

Rebuild Time for Parity Declustered (PD) RAID
When a drive fails, all remaining drives are used to rebuild the blocks of data (nblocks) that were on the failed drive
This involves three operations:
Reading N*nblocks blocks from the remaining hard drives
Reconstructing the lost nblocks blocks
Writing the nblocks blocks across all the remaining hard drives
So the time to repair is T_repair = max(T_read, T_reconstruct, T_write), where
T_read = (N * nblocks * block_size) / (read_speed * remaining_hdds)
T_write = (nblocks * block_size) / (write_speed * remaining_hdds)
T_reconstruct needs to be modeled (set to zero here)
read_speed and write_speed are the speeds of reading from / writing to a disk (assumed constant here)
So the repair rate is µ = 1 / T_repair
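This repair-time model transcribes directly into code. A sketch (function and parameter names are mine; the example drive and speed figures come from the parameter slide later in the talk):

```python
# Direct transcription of the slide's repair-time model (T_reconstruct
# set to zero, as on the slide; constant per-disk speeds assumed).
def pd_raid_repair_rate(N, nblocks, block_size, read_speed, write_speed,
                        remaining_hdds):
    """Repair rate mu = 1 / T_repair for one failed drive."""
    t_read = (N * nblocks * block_size) / (read_speed * remaining_hdds)
    t_write = (nblocks * block_size) / (write_speed * remaining_hdds)
    t_reconstruct = 0.0                     # reconstruction not modeled
    return 1.0 / max(t_read, t_reconstruct, t_write)

# Example: 4 TB drive in 128 MB blocks (31250 blocks), RAID (8+2),
# 81 surviving drives, 100 MB/s per-disk read and write speed
mu = pd_raid_repair_rate(N=8, nblocks=31250, block_size=128e6,
                         read_speed=100e6, write_speed=100e6,
                         remaining_hdds=81)
```

Reads dominate (N blocks must be read for every block rebuilt), so T_repair here is set by T_read, on the order of an hour for these numbers.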

Repair Failure Probability for HDFS
D(N,j), the probability of data loss for a particular block when j drives fail, with T total drives and an N-way replication strategy in HDFS, is:

D(N,j) = sum_{k=0}^{N} D_p(k) * D_br^(N-k)

where the probability of losing k replicas of a block due to placement is given by:

D_p(k) = C(j,k) * C(T-j, N-k) / C(T, N)

and the probability of losing a block due to bit-rot is given by:

D_br = 1 - (1 - UER)^block_size (in bits)

D_j, the probability of data loss due to j failed drives, is:

D_j = 1 - (1 - D(N,j))^total_num_blocks
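The HDFS case is the same machinery with M = 0 parity and N replicas: a block is lost only if every replica is either on a failed drive or hit by bit-rot. A companion sketch (random placement of the N replicas on distinct drives assumed; names are mine):

```python
# Hedged sketch of the reconstructed HDFS block-loss formulas.
from math import comb

def hdfs_block_loss_prob(N, j, T, Dbr):
    """D(N,j): loss probability for one block when j of T drives fail."""
    total = 0.0
    for k in range(0, N + 1):
        if k > j or N - k > T - j:
            continue                        # impossible placement
        p_k = comb(j, k) * comb(T - j, N - k) / comb(T, N)
        total += p_k * Dbr ** (N - k)       # surviving replicas must bit-rot
    return total

def hdfs_loss_prob(N, j, T, Dbr, total_num_blocks):
    """D_j: probability of any data loss when j drives have failed."""
    return 1.0 - (1.0 - hdfs_block_loss_prob(N, j, T, Dbr)) ** total_num_blocks
```

Sanity check: with bit-rot switched off (D_br = 0), only the k = N term survives, i.e. data is lost exactly when all N replicas sit on failed drives.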

Rebuild Time for HDFS
Repair Speed = (2/3) * (2 racks' n_DNs) * min(NWBW_DN-TOR * 0.93, (1/2) * HDD_speed * n_HDD/DN) + (1/3) * n_Racks * min(NWBW_TOR-TOR * 0.93, (1/2) * HDD_speed * n_HDD/Rack)
Assumptions for the repair speed calculation:
2/3 of the blocks on a drive have one remaining copy in the same rack and one in another rack
1/3 of the blocks on a drive have both remaining copies in another rack
This analysis is limited to clusters of up to on the order of 100,000 drives; beyond that, durability will be affected
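The repair-speed expression above is my reading of a garbled slide, so the sketch below is heavily hedged: every variable name is an assumption, and all bandwidths share one unit (e.g. MB/s).

```python
# Sketch of the slide's HDFS repair-speed estimate (assumed reading of
# the formula; NWBW = network bandwidth, TOR = top-of-rack switch).
def hdfs_repair_speed(n_dn_per_rack, n_racks, n_hdd_per_dn,
                      hdd_speed, nwbw_dn_tor, nwbw_tor_tor):
    n_hdd_per_rack = n_dn_per_rack * n_hdd_per_dn
    # 2/3 of a drive's blocks keep a replica in the same rack: repair is
    # bounded by DataNode-to-TOR links (at 93% efficiency) or by half the
    # aggregate disk bandwidth of a DataNode
    intra = (2.0 / 3.0) * 2 * n_dn_per_rack * min(
        nwbw_dn_tor * 0.93, 0.5 * hdd_speed * n_hdd_per_dn)
    # 1/3 of the blocks must be pulled from other racks: bounded by the
    # TOR-to-TOR links or half the aggregate disk bandwidth of a rack
    inter = (1.0 / 3.0) * n_racks * min(
        nwbw_tor_tor * 0.93, 0.5 * hdd_speed * n_hdd_per_rack)
    return intra + inter
```

The structure, rather than the exact coefficients, is the point: HDFS repair bandwidth aggregates over many DataNodes and racks, which is why its rebuild scales with cluster size.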

Results: 1st Year System Level Durability Table

No of Drives / Capacity | RAID6 (8+2)              | RAID6 (4+2)              | PD-RAID (8+2)            | HDFS (3 replicas)
Storage overhead        | 25%                      | 50%                      | 25%                      | 200%
82 (~0.3 PB)            | 4 nines, ~0.2 PB usable  | 5 nines, ~0.15 PB usable | 10 nines, ~0.2 PB usable | 7 nines, ~0.1 PB usable
492 (~1.9 PB)           | 3 nines, ~1.4 PB usable  | 4 nines, ~0.9 PB usable  | 9 nines, ~1.4 PB usable  | 6 nines, ~0.6 PB usable
1386 (~5.5 PB)          | 3 nines, ~4.1 PB usable  | 3 nines, ~2.7 PB usable  | 8 nines, ~4.1 PB usable  | 6 nines, ~1.8 PB usable
69300 (~277 PB)         | 1 nine, ~207 PB usable   | 2 nines, ~138 PB usable  | 7 nines, ~207 PB usable  | 6 nines, ~92 PB usable

System-level durability means the probability of not losing any data in 1 year
Choice of parameters: MTBF = 1.4 million hours, speed = 100 MB/sec, UER = 1e-15, disk size = 4 TB, block size = 128 MB
PD-RAID delivers better or comparable system reliability compared to HDFS, but HDFS would beat PD-RAID at scales beyond several hundred PBs
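For reference, the "nines" entries in the table are just the number of leading nines in the 1-year durability probability. A trivial helper (mine, not from the talk) makes the conversion explicit:

```python
import math

def nines(durability):
    """Count the leading nines of a durability figure, e.g. 0.99999 -> 5."""
    # small epsilon guards against floating-point log10 landing just
    # below an integer for exact powers of ten
    return math.floor(-math.log10(1.0 - durability) + 1e-9)
```

So "7 nines" corresponds to a probability of data loss below 1e-7 in the first year.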

Conclusions
Parity-declustered RAID helps achieve HDFS-level data durability
At scales beyond several hundred PB, HDFS durability thrives, but the smaller scale is relevant for almost all current HPC systems
Switching to RAID6 (4+2) could provide a marginal increase in durability, at the cost of a 25% increase in overhead relative to RAID6 (8+2) (still better than the HDFS overhead)
Bottom line: Parity Declustered RAID is a must to make Lustre attractive for Hadoop applications
Plus, it also enables a storage overhead reduction by a factor of ~8x
THANK YOU!!

Synopsis
Comparison of Data Durability in Parity Declustered RAID and HDFS Replicated Systems
Dimitar Vlassarev, Cloud Modeling and Data Analytics, Seagate Technology, 389 Disc Dr., Longmont, CO 80503
Data stored on large-scale high-performance computing systems is in many situations suitable for analysis within the Hadoop ecosystem. To facilitate that analysis efficiently, avoiding data migration from RAID-backed storage systems to the replicated Hadoop File System (HDFS) is essential. For these applications, high-performance parallel file systems like Lustre can offer an appealing alternative to HDFS. Two important considerations in comparing the two systems are their performance and data durability. Here we compare the data durability of HDFS's replication strategy to that of a Parity Declustered RAID backed Lustre system. A Continuous Time Markov Chains model for the data durability of the two systems suggests that the Parity Declustered RAID backed Lustre solution can be as resilient as the replicated HDFS solution. QUESTIONS??