CSE-E5430 Scalable Cloud Computing P Lecture 5

CSE-E5430 Scalable Cloud Computing P Lecture 5
Keijo Heljanko
Department of Computer Science, School of Science, Aalto University
keijo.heljanko@aalto.fi
12.10.2015

Fault Tolerance Strategies for Storage
- Hard disks are a common cause of failures in large-scale systems, with field reports giving around 3% average yearly replacement rates
- There are many ways to build more reliable storage out of unreliable hard disks. Some examples, from smaller scale to larger scale:
  - RAID - Redundant Array of Independent (originally: Inexpensive) Disks. Commonly used levels for improving reliability: RAID 1, RAID 5, RAID 6, RAID 10. Used inside servers and in specialized SAN/NAS hardware
  - Replication and/or erasure coding of data inside a datacenter. Example: the default three-way replication in HDFS, with data residing on at least two racks
  - Geographical replication across datacenters. Example: the Amazon S3 storage service offers geographic replication
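To get a feel for what a ~3% yearly replacement rate means at scale, here is a back-of-the-envelope estimate; the cluster size is a hypothetical figure, not from the lecture:

```python
# Rough estimate of yearly disk replacements in a cluster (illustrative numbers).
annual_replacement_rate = 0.03   # ~3% of drives replaced per year (field reports)
disks_in_cluster = 5000          # hypothetical cluster size

expected_replacements = disks_in_cluster * annual_replacement_rate
print(f"Expected disk replacements per year: {expected_replacements:.0f}")
# -> about 150 drives, i.e. roughly one failed disk every 2-3 days
```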

RAID - Redundant Array of Independent Disks
- The most commonly used fault tolerance mechanism at small scale
- RAID 0: Striping data over many disks - no fault tolerance, improves performance
- Commonly used levels for fault tolerance:
  - RAID 1: Mirroring - all data is stored on two hard disks
  - RAID 5: Block-level striping with distributed parity
  - RAID 6: Block-level striping with double distributed parity
  - RAID 10: A nested RAID level: a stripe of mirrored disk pairs
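As a quick summary of the trade-offs among these levels, the sketch below computes usable capacity and worst-case disk-failure tolerance for one example array; the array size and disk capacity are illustrative assumptions:

```python
# Illustrative capacity / worst-case fault tolerance comparison (not from the slides).
n, disk_tb = 8, 4.0          # hypothetical array: 8 disks of 4 TB each
total = n * disk_tb

levels = {
    "RAID 0 (striping)":      (total,             0),
    "RAID 1 (2-disk mirror)": (disk_tb,           1),   # uses only 2 disks
    "RAID 5 (single parity)": ((n - 1) * disk_tb, 1),
    "RAID 6 (double parity)": ((n - 2) * disk_tb, 2),
    "RAID 10 (mirror pairs)": (total / 2,         1),   # worst case; best case n/2
}

for name, (usable_tb, tolerated) in levels.items():
    print(f"{name:24s} usable = {usable_tb:5.1f} TB, tolerates >= {tolerated} failure(s)")
```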

RAID - Redundant Array of Independent Disks (cont.)
- For additional information, see http://en.wikipedia.org/wiki/Standard_RAID_levels, from where the RAID images on the following slides are obtained
- Terminology used: IOPS - I/O operations per second - the number of small random reads/writes per second

RAID 0 - Striping
- Stripes data over a large number of disks to improve sequential and random reads and writes
- A very bad choice for fault tolerance: use it only for scratch data that can easily be regenerated
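To make the striping layout concrete, a minimal sketch of the usual round-robin block-to-disk mapping; the function name and parameters are illustrative:

```python
# Minimal RAID 0 address mapping sketch: logical block -> (disk, block on that disk).
def raid0_location(logical_block: int, n_disks: int) -> tuple[int, int]:
    disk = logical_block % n_disks       # stripe round-robin over the disks
    offset = logical_block // n_disks    # position within the chosen disk
    return disk, offset

# Logical blocks 0..7 on a 4-disk stripe land on disks 0,1,2,3,0,1,2,3.
print([raid0_location(b, 4) for b in range(8)])
```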

RAID 1 - Mirroring
- Each data block is written to two hard disks: the first is the master copy, and a mirror (slave) copy is written after the master
- Reads can be served by either of the two hard disks
- Loses half of the storage space and halves write bandwidth/IOPS compared to using the two drives independently

RAID 1 - Mirroring (cont.)
- Data is available if either hard disk is available
- Easy repair: replace the failed hard disk and copy all of the data over to the replacement drive

RAID 5 - Block-level striping with distributed parity
- Data is stored on n + 1 hard disks, where one parity checksum block is stored for every n blocks of data. The parity blocks are distributed over all of the disks.
- Tolerates one hard disk failure

RAID 5 - Properties
- Sequential reads and writes are striped over all disks
- Loses only one hard disk's worth of storage to parity
- Sequential read and write speeds are good, and random read IOPS are good
- A random small write requires reading one data block + the parity block, and writing back the modified data block + the modified parity block, i.e. 4x the IOPS in the worst case (battery-backed caches try to minimize this overhead); see the sketch below
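A minimal sketch of the read-modify-write parity update behind the 4x figure, using XOR parity over equal-sized blocks; the block contents and sizes are illustrative:

```python
# RAID 5 small-write sketch: new_parity = old_parity XOR old_data XOR new_data.
# The four I/Os: read old data, read old parity, write new data, write new parity.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes([0x11] * 8)   # block being overwritten (read: I/O 1)
old_parity = bytes([0xA5] * 8)   # current parity block    (read: I/O 2)
new_data   = bytes([0x42] * 8)   # new contents            (write: I/O 3)

new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)  # write: I/O 4
print(new_parity.hex())
```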

RAID 5 - Properties (cont.)
- Rebuilding a disk requires reading all of the contents of the other n disks to regenerate the missing disk using parity - a slow process
- Slow during rebuild: when one disk has failed, each read of a missing data block requires reading n blocks of data while the rebuild is in progress
- The vulnerability to a second hard disk failure during an array rebuild, together with long rebuild times, makes most vendors recommend RAID 6 instead (see two slides ahead) with large-capacity SATA drives
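For completeness, a minimal sketch of how a missing block is regenerated during rebuild: with XOR parity, the lost block is the XOR of all surviving blocks in the same stripe (the blocks below are illustrative):

```python
# RAID 5 rebuild sketch: the lost block equals the XOR of the surviving
# data blocks and the parity block of the same stripe.
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

stripe_data = [bytes([i] * 8) for i in (1, 2, 3, 4)]   # 4 data blocks
parity = reduce(xor_blocks, stripe_data)               # parity computed at write time

lost = stripe_data[2]                                  # the disk holding block 2 fails
surviving = stripe_data[:2] + stripe_data[3:] + [parity]
rebuilt = reduce(xor_blocks, surviving)
assert rebuilt == lost
```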

RAID 5 - Properties (cont.)
- RAID 5 write hole: a power failure during writes to a stripe (remember, in the worst case n + 1 disks need to be updated together for large writes) can leave the array in a corrupted state - battery-backed memory caches (and UPSs) are used to try to avoid this problem
- RAID 5 does not protect the array from being corrupted by a single drive with bad blocks if that disk returns bad data during parity calculations
- The use of hot spares is recommended to speed up the start of recovery

RAID 6 - Block-level striping with double distributed parity
- Data is stored on n + 2 hard disks, where two parity checksum blocks are stored for every n blocks of data. The parity blocks are distributed over all of the disks.
- Tolerates two hard disk failures

RAID 6 - Properties
- Sequential reads and writes are striped over all disks
- Loses two hard disks' worth of storage to parity
- Sequential read and write speeds are good, and random read IOPS are good
- A random small write requires reading one data block + two parity blocks, and writing back the modified data block + two modified parity blocks, i.e. 6x the IOPS in the worst case (battery-backed caches try to minimize this overhead)
- Slow during rebuild: when 1-2 disks have failed, each read of a missing data block requires reading n blocks of data while the rebuild is in progress
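To see what the 2x/4x/6x worst-case write penalties mean in practice, a rough throughput estimate; the disk count and per-disk IOPS below are hypothetical:

```python
# Rough effective random-write IOPS of an array under the worst-case write
# penalties discussed in these slides (illustrative numbers).
disks, iops_per_disk = 8, 150          # e.g. 8 nearline SATA drives
raw_iops = disks * iops_per_disk

write_penalty = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}
for level, penalty in write_penalty.items():
    print(f"{level}: ~{raw_iops // penalty} random writes/s "
          f"({raw_iops} raw IOPS / penalty {penalty})")
```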

RAID 6 - Properties (cont.)
- Can tolerate one additional disk failure during a rebuild: the contents of any two missing disks can be rebuilt using the remaining n disks
- RAID 6 write hole: a power failure during writes to a stripe (remember, in the worst case n + 2 disks need to be updated together for large writes) can leave the array in a corrupted state - battery-backed memory caches (and UPSs) are used to try to avoid this problem
- With two disks failed: RAID 6 does not protect the array from being corrupted by a single drive with bad blocks if that disk returns bad data during parity calculations

RAID 10 - Stripe of Mirrors
- Data is stored on 2n hard disks: each mirror pair consists of two hard disks, and the pairs are striped together into a single logical device (see: http://en.wikipedia.org/wiki/Nested_RAID_levels)

RAID 10 - Properties
- Tolerates only one hard disk failure in the worst case (n failures - one disk from each mirror pair - in the best case)
- Loses half of the storage space
- Sequential reads and writes are striped over all disks
- Sequential read and write speeds are good, and random read and write IOPS are good

RAID 10 - Properties (cont.)
- Each data block is written to two hard disks. A random small write requires 2x the IOPS in the worst case
- Quicker during rebuild: missing data is simply copied over from the other disk of the mirror. The use of hot spares is recommended to speed up the start of recovery
- Cannot tolerate the failure of both disks in the same mirror pair

RAID 10 - Properties (cont.)
- On an unclean shutdown, missing changes should be copied from the first (master) drive to the second (slave) drive - there is no write hole problem if such write ordering can be guaranteed (example: Solaris RAID 1 with write policy "serial"); a minimal sketch of this ordering follows below. If writes are unordered (for performance reasons), the write hole problem is re-introduced.
- RAID 10 does not protect the array from being corrupted during a rebuild by a single drive with bad blocks
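A minimal sketch of the ordered mirror write and the corresponding recovery rule described above; the in-memory dicts stand in for two physical drives and are purely illustrative, not a real driver:

```python
# Ordered RAID 1 write sketch: complete the master write before the slave write,
# so that after a crash the master is always the authoritative copy.
master: dict[int, bytes] = {}
slave: dict[int, bytes] = {}

def mirrored_write(block_no: int, data: bytes) -> None:
    master[block_no] = data     # step 1: master copy made durable first
    slave[block_no] = data      # step 2: slave copy written after the master

def recover_after_unclean_shutdown() -> None:
    # Because of the write ordering, recovery only ever copies master -> slave.
    for block_no, data in master.items():
        if slave.get(block_no) != data:
            slave[block_no] = data

# Simulate a crash between the two writes of a mirrored update:
master[7] = b"new"              # master write completed ...
# ... crash before the slave write; the slave still holds stale data (or nothing)
recover_after_unclean_shutdown()
assert slave[7] == b"new"
```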

RAID Usage Scenarios
- RAID 0: temporary space for data that can be discarded
- RAID 1: small-scale installations, server boot disks, etc.
- RAID 5: some fault tolerance with minimal storage overhead; small random writes can be problematic; suffers from the RAID 5 write hole problem without special hardware (battery-backed cache / UPS)

RAID Usage Scenarios (cont.)
- RAID 6: fairly good fault tolerance with reasonable storage overhead; small random writes can be even more problematic than in RAID 5; suffers from the RAID 6 write hole problem without special hardware (battery-backed cache / UPS)
- RAID 10: less fault tolerant than RAID 6 but more fault tolerant than RAID 5. Loses half of the storage capacity. Good small random write performance makes it often the choice for database workloads. Can avoid the write hole problem under write ordering guarantees.
- Remember: RAID is no substitute for backups!

RAID Hardware Issues
- With proper configuration, RAID 1 and RAID 10 can be made safe to use without special hardware support
- RAID 5 and RAID 6 are more storage efficient than RAID 10 but require specialized hardware (battery-backed cache / UPS) to protect against corruption due to an unclean shutdown

More on the RAID Write Hole Problem
- RAID 5 and RAID 6 need to update all the drives in a stripe atomically to remain consistent
- We know a way to do this from the world of databases: atomic transactions
- How are transactions implemented in databases? By using a persistent log of the modifications to be performed, and by replaying any missing modifications after a shutdown in the middle of an atomic sequence of modifications
- Thus RAID 5 and RAID 6 need a persistent transaction log in order to offer atomic updates of a stripe (a minimal sketch follows below)
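A minimal write-ahead-logging sketch of the idea above: durably log the intended stripe update, apply it to the disks, then mark it committed; on recovery, replay any logged but uncommitted updates. The log format and in-memory stand-ins are illustrative, not any particular controller's implementation:

```python
# Write-ahead logging sketch for atomic stripe updates (illustrative).
log: list[dict] = []                       # stand-in for a persistent intent/commit log
disks: dict[tuple[int, int], bytes] = {}   # (disk, stripe) -> block contents

def update_stripe(stripe_no: int, new_blocks: dict[int, bytes]) -> None:
    log.append({"type": "intent", "stripe": stripe_no, "blocks": new_blocks})
    for disk, data in new_blocks.items():          # a crash anywhere in here ...
        disks[(disk, stripe_no)] = data
    log.append({"type": "commit", "stripe": stripe_no})

def recover() -> None:
    committed = {r["stripe"] for r in log if r["type"] == "commit"}
    for r in log:
        if r["type"] == "intent" and r["stripe"] not in committed:
            for disk, data in r["blocks"].items():  # ... is repaired by replaying the log
                disks[(disk, r["stripe"])] = data

update_stripe(0, {0: b"D0", 1: b"D1", 2: b"P"})     # normal, committed stripe update
recover()                                           # no-op here; replays after a crash
```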

More on the RAID Write Hole Problem (cont.)
- A battery-backed cache is one way to implement a persistent transaction log
- Another way is to implement a log-structured filesystem that is aware of the underlying storage subsystem. Example: Sun (Oracle) ZFS: http://en.wikipedia.org/wiki/ZFS
- ZFS implements RAID 5 and RAID 6 equivalent fault tolerance without specialized hardware
- It is a filesystem that is given full control over the raw disk devices, and it uses a copy-on-write transactional object model to achieve this (see the sketch below)
- ZFS also has data integrity checksums
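A minimal sketch of the copy-on-write idea (not ZFS's actual on-disk format): live blocks are never overwritten; the new version is written to a fresh location and a single root pointer is switched last, so a crash leaves either the old or the new consistent state:

```python
# Copy-on-write update sketch (illustrative, not the real ZFS object model).
blocks: dict[int, bytes] = {1: b"old contents"}   # block store, keyed by address
root = 1                                          # pointer to the live version
next_free = 2

def cow_update(new_data: bytes) -> None:
    global root, next_free
    new_addr = next_free
    blocks[new_addr] = new_data      # write the new version to a fresh block first
    next_free += 1
    root = new_addr                  # single pointer switch commits the update;
                                     # a crash before this line keeps the old root

cow_update(b"new contents")
print(blocks[root])                  # -> b'new contents'; the old block is untouched
```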

Scaling Up Storage
- There are quite a few problems when scaling up storage to thousands of hard disks
- One can buy large NAS/SAN systems handling thousands of hard disks (scaling up vs. scaling out), but high-end NAS/SAN systems are not a commodity:
  - Their pricing is very high compared to their storage capacity
  - The performance is often not as good as with a scale-out solution
- They are often still the best solution for modest storage needs

Scaling Out Storage
- There are also quite a few problems when scaling out storage to thousands of hard disks over hundreds of small servers:
  - Standard hardware RAID will handle hard disk failures but does not protect against storage server / RAID controller failures
  - With hundreds of servers and RAID controllers, failures are inevitable
  - To protect against data unavailability during server / RAID controller repair, data must be replicated on several servers

Scaling Out Storage
- If we need to replicate data over several servers anyway, why spend money on specialized hardware RAID controllers?
- We just need software that creates a distributed, fault-tolerant storage subsystem
- HDFS is a prime example of such a system

HDFS Design Decisions
- HDFS replication at the block level is essentially RAID 10 (plus sub-block CRC-32 data integrity checksums)
- For small installations, replicating each block twice in HDFS is enough, and we have exactly the RAID 10 storage layout in HDFS
- For large installations, each HDFS block is replicated three times; Linux RAID 10 can also be configured to use three hard disks per mirror set, matching the HDFS default replication and allowing two hard disk failures per mirror set

What about RAID 6 for HDFS?
- RAID 6 can improve storage efficiency significantly over RAID 10, as only two disks' worth of capacity are lost to parity, and stripes can be made quite wide
- The HDFS default three-way replication is even worse than RAID 10 in this respect, as it has a 3x storage overhead

Erasure Codes
- Many RAID 6 implementations use an erasure code, for example a Reed-Solomon code, to implement the parity calculations and the recovery of any two missing disks
- An erasure code can be extended to recover any (fixed) number of missing hard disks. Basically this is a so-called forward error correction code
- Reed-Solomon codes are also used in CD error correction
- See also: http://en.wikipedia.org/wiki/Erasure_code and http://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction

Erasure Codes in Storage
- Windows Azure Storage uses erasure coding: Cheng Huang et al.: Erasure Coding in Windows Azure Storage, Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC '12). https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
- Example: in HDFS-RAID an erasure code can be devised that stores 10 data blocks plus 4 parity blocks (1.4x storage overhead) and tolerates four disk failures (see the overhead comparison below)
- See http://wiki.apache.org/hadoop/HDFS-RAID for an erasure code implementation for HDFS employed at Facebook
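A quick comparison of the storage-overhead figures mentioned above; this is just the arithmetic behind the slide, and the helper function is illustrative:

```python
# Storage overhead = total blocks stored / useful data blocks (illustrative helper).
def overhead(data_blocks: int, redundant_blocks: int) -> float:
    return (data_blocks + redundant_blocks) / data_blocks

schemes = {
    "HDFS 3-way replication":      (1, 2),    # 1 data block stored as 3 copies
    "RAID 10 / 2-way replication": (1, 1),
    "RAID 6 (e.g. 8+2)":           (8, 2),
    "HDFS-RAID RS(10,4)":          (10, 4),   # tolerates any 4 lost blocks
}
for name, (d, r) in schemes.items():
    print(f"{name:28s} overhead = {overhead(d, r):.2f}x")
# -> 3.00x, 2.00x, 1.25x, 1.40x respectively
```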

Erasure Codes in Storage
- See also an approach with better performance under failures: Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris S. Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, Dhruba Borthakur: XORing Elephants: Novel Erasure Codes for Big Data. PVLDB 6(5): 325-336 (2013). http://www.vldb.org/pvldb/vol6/p325-sathiamoorthy.pdf
- Xorbas Hadoop project web page: http://smahesh.com/hadoopusc/

HDFS-RAID
- Basically implements an erasure encoding of data: RAID 6 style, but with larger stripes and more parity blocks, allowing more simultaneous hard disk failures
- The file block allocation policy also needs to be changed so that blocks of the same stripe are never allocated on the same physical hardware (see the placement sketch below)
- HDFS-RAID requires very large files to get a full stripe of data with the large HDFS block size
- Exactly the same rebuild efficiency (RAID 10) vs. storage efficiency (RAID 6) considerations apply when deciding whether to use erasure codes at scale
- Another erasure coding approach is presented in HDFS-7285 under the name Native HDFS erasure coding
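A minimal sketch of the kind of placement constraint described above: every block of a stripe lands on a different node. The node names are hypothetical, and real HDFS placement also considers racks, load, and locality:

```python
# Stripe-aware placement sketch: never put two blocks of the same stripe
# on the same node (illustrative; not the real HDFS block placement policy).
def place_stripe(stripe_blocks: list[str], nodes: list[str]) -> dict[str, str]:
    if len(nodes) < len(stripe_blocks):
        raise ValueError("need at least as many nodes as blocks in the stripe")
    return dict(zip(stripe_blocks, nodes))   # one distinct node per block

stripe = [f"d{i}" for i in range(10)] + [f"p{i}" for i in range(4)]   # 10 data + 4 parity
nodes = [f"node-{i:02d}" for i in range(14)]
placement = place_stripe(stripe, nodes)
assert len(set(placement.values())) == len(stripe)   # all blocks on distinct nodes
```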

Avoiding the RAID Hole Problem
- One should avoid update-in-place, for example by using the write-once-read-many (WORM) pattern, never modifying data blocks once they have been written
- In general, the only hope of avoiding the RAID 6 write hole problem is to have database-style transactions built into the RAID 6 implementation: either in the hardware RAID controller or in a filesystem that does database-style transaction logging
- One nice thing about a centralized HDFS NameNode is that it can be used as the central place for atomic transactions on the HDFS filesystem state, logged using database-style transaction logs

Avoiding the RAID Hole Problem (cont.)
- More examples: the ZFS and btrfs filesystems: like HDFS, they never update data in place in disk arrays, but always make a fresh copy of the data before modifying it
- Jim Gray: "Update in place is a poison apple", from: Jim Gray: The Transaction Concept: Virtues and Limitations (Invited Paper). VLDB 1981: 144-154
- Bottom line: to do reliable distributed updates, one must adopt database techniques (transaction logging) to implement atomic updates of a RAID stripe reliably
- HDFS tries to minimize the number of transactions by only doing them at file close time. No transactions for stripe updates are needed, because stripe updates are disallowed!