Disk Array Data Organizations and RAID

Similar documents

Lecture 36: Chapter 6

How To Write A Disk Array

Definition of RAID Levels

Why disk arrays? CPUs speeds increase faster than disks. - Time won t really help workloads where disk in bottleneck

Why disk arrays? CPUs improving faster than disks

Using RAID6 for Advanced Data Protection

Reliability and Fault Tolerance in Storage

How To Create A Multi Disk Raid

RAID Made Easy By Jon L. Jacobi, PCWorld

Data Storage - II: Efficient Usage & Errors

Dependable Systems. 9. Redundant arrays of. Prof. Dr. Miroslaw Malek. Wintersemester 2004/05

What is RAID? data reliability with performance

How To Improve Performance On A Single Chip Computer

W4118: RAID. Instructor: Junfeng Yang

Fault Tolerance & Reliability CDA Chapter 3 RAID & Sample Commercial FT Systems

RAID. Contents. Definition and Use of the Different RAID Levels. The different RAID levels: Definition Cost / Efficiency Reliability Performance

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

California Software Labs

CS161: Operating Systems

technology brief RAID Levels March 1997 Introduction Characteristics of RAID Levels

Solving Data Loss in Massive Storage Systems Jason Resch Cleversafe

CS420: Operating Systems

CS 61C: Great Ideas in Computer Architecture. Dependability: Parity, RAID, ECC

RAID Level Descriptions. RAID 0 (Striping)

An Introduction to RAID. Giovanni Stracquadanio

RAID Basics Training Guide

RAID Technology Overview

RAID: Redundant Arrays of Independent Disks

RAID Levels and Components Explained Page 1 of 23

1 Storage Devices Summary

Introduction. What is RAID? The Array and RAID Controller Concept. Click here to print this article. Re-Printed From SLCentral

Review. Lecture 21: Reliable, High Performance Storage. Overview. Basic Disk & File System properties CSC 468 / CSC /23/2006

RAID. Storage-centric computing, cloud computing. Benefits:

CSE-E5430 Scalable Cloud Computing P Lecture 5

RAID Rebuilding. Objectives CSC 486/586. Imaging RAIDs. Imaging RAIDs. Imaging RAIDs. Multi-RAID levels??? Video Time

IBM ^ xseries ServeRAID Technology

RAID Technology. RAID Overview

Storage Technologies - 2

Best Practices RAID Implementations for Snap Servers and JBOD Expansion

RAID 5 rebuild performance in ProLiant

RAID Overview: Identifying What RAID Levels Best Meet Customer Needs. Diamond Series RAID Storage Array

PIONEER RESEARCH & DEVELOPMENT GROUP

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Storage node capacity in RAID0 is equal to the sum total capacity of all disks in the storage node.

Operating Systems. RAID Redundant Array of Independent Disks. Submitted by Ankur Niyogi 2003EE20367

RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

Module 6. RAID and Expansion Devices

an analysis of RAID 5DP

Outline. Database Management and Tuning. Overview. Hardware Tuning. Johann Gamper. Unit 12

RAID Technology White Paper

VERY IMPORTANT NOTE! - RAID

Filing Systems. Filing Systems

DELL RAID PRIMER DELL PERC RAID CONTROLLERS. Joe H. Trickey III. Dell Storage RAID Product Marketing. John Seward. Dell Storage RAID Engineering

CS 6290 I/O and Storage. Milos Prvulovic

RAID technology and IBM TotalStorage NAS products

Improving Lustre OST Performance with ClusterStor GridRAID. John Fragalla Principal Architect High Performance Computing

COMP 7970 Storage Systems

How To Understand And Understand The Power Of Aird 6 On Clariion

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Storing Data: Disks and Files

Sistemas Operativos: Input/Output Disks

Striped Set, Advantages and Disadvantages of Using RAID

RAID Implementation for StorSimple Storage Management Appliance

IBM System x GPFS Storage Server

Chapter 6 External Memory. Dr. Mohamed H. Al-Meer

BrightStor ARCserve Backup for Windows

NVIDIA RAID Installation Guide

Surviving Two Disk Failures. Introducing Various RAID 6 Implementations

How to choose the right RAID for your Dedicated Server

RAID-DP: NetApp Implementation of Double- Parity RAID for Data Protection

RAID Utility User s Guide Instructions for setting up RAID volumes on a computer with a MacPro RAID Card or Xserve RAID Card.

GENERAL INFORMATION COPYRIGHT... 3 NOTICES... 3 XD5 PRECAUTIONS... 3 INTRODUCTION... 4 FEATURES... 4 SYSTEM REQUIREMENT... 4

Technical White paper RAID Protection and Drive Failure Fast Recovery

User s Manual. Home CR-H BAY RAID Storage Enclosure

SSDs and RAID: What s the right strategy. Paul Goodwin VP Product Development Avant Technology

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

An Introduction to RAID 6 ULTAMUS TM RAID

Price/performance Modern Memory Hierarchy

Reliability of Data Storage Systems

HP Smart Array Controllers and basic RAID performance factors

Hard Disk Drives and RAID

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

Storage. The text highlighted in green in these slides contain external hyperlinks. 1 / 14

Xserve G5 Using the Hardware RAID PCI Card Instructions for using the software provided with the Hardware RAID PCI Card

RAID-DP : NETWORK APPLIANCE IMPLEMENTATION OF RAID DOUBLE PARITY FOR DATA PROTECTION A HIGH-SPEED IMPLEMENTATION OF RAID 6

Summer Student Project Report

VIA RAID Installation Guide

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

Intel RAID Software User s Guide:

Database Management Systems

Transcription:

Guest Lecture for 15-440 Disk Array Data Organizations and RAID October 2010, Greg Ganger 1

Plan for today Why have multiple disks? Storage capacity, performance capacity, reliability Load distribution problem and approaches disk striping Fault tolerance replication parity-based protection RAID and the Disk Array Matrix Rebuild October 2010, Greg Ganger 2

Why multi-disk systems? A single storage device may not provide enough storage capacity, performance capacity, reliability So, what is the simplest arrangement? October 2010, Greg Ganger 3

Just a bunch of disks (JBOD) A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 Yes, it s a goofy name industry really does sell JBOD enclosures October 2010, Greg Ganger 4

Disk Subsystem Load Balancing I/O requests are almost never evenly distributed Some data is requested more than other data Depends on the apps, usage, time, October 2010, Greg Ganger 5

Disk Subsystem Load Balancing I/O requests are almost never evenly distributed Some data is requested more than other data Depends on the apps, usage, time, What is the right data-to-disk assignment policy? Common approach: Fixed data placement Your data is on disk X, period! For good reasons too: you bought it or you re paying more Fancy: Dynamic data placement If some of your files are accessed a lot, the admin (or even system) may separate the hot files across multiple disks In this scenario, entire files systems (or even files) are manually moved by the system admin to specific disks October 2010, Greg Ganger 6

Disk Subsystem Load Balancing I/O requests are almost never evenly distributed Some data is requested more than other data Depends on the apps, usage, time, What is the right data-to-disk assignment policy? Common approach: Fixed data placement Your data is on disk X, period! Fancy: Dynamic data placement If some of your files are accessed a lot, we may separate the hot files across multiple disks In this scenario, entire files systems (or even files) are manually moved by the system admin to specific disks Alternative: Disk striping Stripe all of the data across all of the disks October 2010, Greg Ganger 7

Disk Striping Interleave data across multiple disks File Foo: " Large file streaming can enjoy parallel transfers High throughput requests can enjoy thorough load balancing If blocks of hot files equally likely on all disks (really?) stripe unit or block Stripe" October 2010, Greg Ganger 8

Disk striping details How disk striping works Break up total space into fixed-size stripe units Distribute the stripe units among disks in round-robin Compute location of block #B as follows disk# = B % N (%=modulo, N = # of disks) LBN# = B / N (computes the LBN on given disk) October 2010, Greg Ganger 9

Now, What If A Disk Fails? In a JBOD (independent disk) system one or more file systems lost In a striped system a part of each file system lost Backups can help, but backing up takes time and effort (later in term) backup doesn t help recover data lost during that day any data loss is a big deal to a bank or stock exchange October 2010, Greg Ganger 10

And they do fail October 2010, Greg Ganger 11

Sidebar: Reliability metric Mean Time Between Failures (MTBF) Usually computed by dividing a length of time by the number of failures during that time (averaged over a large population of items) Basically, we divide the time by Example: 10 Devices measured the number over 1000 of failures days per device. This gives us the Day 0 average time per failure per Day 999 device. MTBF 1000 Days 5 failures / 10 devices 2000 Days [per device] Note: NOT a guaranteed lifetime for a particular item! October 2010, Greg Ganger 12

Sidebar: Availability metric Fraction of time that server is able to handle requests Computed from MTBF and MTTR (Mean Time To Repair) Availability MTBF _ MTBF + MTTR Installed Fixed Fixed Fixed TBF 1 TTR 1 TBF 2 TTR 2 TBF 3 TTR 3 Available during these 3 periods of time. October 2010, Greg Ganger 13

How often are failures? MTBF (Mean Time Between Failures) MTBF disk ~ 1,200,00 hours (~136 years, <1% per year) pretty darned good, if you believe the number MTBF mutli-disk system = mean time to first disk failure which is MTBF disk / (number of disks) For a striped array of 200 drives MTBF array = 136 years / 200 drives = 0.65 years October 2010, Greg Ganger 14

Tolerating and masking disk failures If a disk fails, it s data is gone may be recoverable, but may not be To keep operating in face of failure must have some kind of data redundancy Common forms of data redundancy replication erasure-correcting codes error-correcting codes October 2010, Greg Ganger 15

Redundancy via replicas Two (or more) copies mirroring, shadowing, duplexing, etc. Write both, read either 0 0 2 2 1 1 3 3 October 2010, Greg Ganger 16

Mirroring & Striping Mirror to 2 virtual drives, where each virtual drive is really a set of striped drives Provides reliability of mirroring Provides striping for performance (with write update costs) October 2010, Greg Ganger 17

Implementing Disk Mirroring Mirroring can be done in either software or hardware Software solutions are available in most OS s Windows2000, Linux, Solaris Hardware solutions Could be done in Host Bus Adaptor(s) Could be done in Disk Array Controller October 2010, Greg Ganger 18

Lower Cost Data Redundancy Single failure protecting codes general single-error-correcting code is overkill General code finds error and fixes it disk failures are self-identifying (a.k.a. erasures) Don t have to find the error fact: N-error-detecting code is also N-erasure-correcting Error-detecting codes can t find an error, just know its there But if you independently know where error is, allows repair Parity is single-disk-failure-correcting code recall that parity is computed via XOR it s like the low bit of the sum October 2010, Greg Ganger 19

Simplest approach: Parity Disk One extra disk All writes update parity disk potential bottleneck A B C A B C A B C A B C Ap Bp Cp D D D D Dp October 2010, Greg Ganger 20

Aside: What s Parity? Parity count number of 1 s in a byte and store a parity bit with each byte of data Solve equation XOR-sum( data bits, parity ) = 0 parity bit is computed as If the number of 1 s is odd, store a 1 If the number of 1 s is even, store a 0 This is called even parity (# of ones is even) Example: 0x54 == 0101 0100 2 (Three 1ʼs --> parity bit is set to 1 )" Store 9 bits: 0101 0100 1 Enables: Detection of single-bit errors (equation won t work if one bit flipped) Reconstruction of single-bit erasures (reverse equation for unknown) October 2010, Greg Ganger 21

Aside: What s Parity (con t) Example 0x54 == 0101 0100 2 (Three 1ʼs --> parity bit is set to 1 )" Store 9 bits: 0101 0100 1 What if we want to update bit 3 from 1 to 0 Could completely recompute the parity 0x54 0x44 0x44 == 0100 0100 2 (Two 1ʼs --> parity bit is set to 0 )" Store 9 bits: 0100 0100 0" Or, we could subtract out old data, then add in new data" How do we subtract out old data?" olddata XOR oldparity" 1 XOR 1 == 0 " this is parity w/out databit3" How do we add in new data" newdata XOR newparity" 0 XOR 0 == 0 " " this is parity w/new databit3" Therefore, updating new data doesnʼt require one to re-read all of the data" Or, for each data bit that toggles, toggle the parity bit" October 2010, Greg Ganger 22

Updating and using the parity Fault-Free Read Fault-Free Write 2 1 3 4 D D D P D D D P Degraded Read Degraded Write D D D P D D D P October 2010, Greg Ganger 23

The parity disk bottleneck Reads go only to the data disks But, hopefully load balanced across the disks All writes go to the parity disk And, worse, usually result in Read-Modify-Write sequence So, parity disk can easily be a bottleneck October 2010, Greg Ganger 24

Solution: Striping the Parity Removes parity disk bottleneck A A A A Ap B B B Bp B C C Cp C C D Dp D D D October 2010, Greg Ganger 25

RAID Taxonomy Redundant Array of Inexpensive Independent Disks Constructed by UC-Berkeley researchers in late 80s (Garth) RAID 0 Course-grained Striping with no redundancy RAID 1 Mirroring of independent disks RAID 2 Fine-grained data striping plus Hamming code disks Uses Hamming codes to detect and correct multiple errors Originally implemented when drives didn t always detect errors Not used in real systems RAID 3 Fine-grained data striping plus parity disk RAID 4 Course-grained data striping plus parity disk RAID 5 Course-grained data striping plus striped parity RAID 6 Course-grained data striping plus 2 striped codes October 2010, Greg Ganger 26

RAID 6 P+Q Redundancy Protects against multiple failures using Reed-Solomon codes Uses 2 parity disks P is parity Q is a second code It s two equations with two unknowns, just make bigger bits Group bits into nibbles and add different co-efficients to each equation (two independent equations in two unknowns) Similar to parity striping De-clusters both sets of parity across all drives For small writes, requires 6 I/Os Read old data, old parity1, old parity2 Write new data, new parity1, new parity2 October 2010, Greg Ganger 27

Disk array subsystems Sets of disks managed by a central authority e.g., file system (within OS) or disk array controller Data distribution squeezing maximum performance from the set of disks several simultaneous considerations intra-access parallelism: parallel transfer for large requests inter-access parallelism: concurrent accesses for small requests load balancing for heavy workloads Redundancy scheme achieving fault tolerance from the set of disks several simultaneous considerations space efficiency number/type of faults tolerated performance October 2010, Greg Ganger 28

The Disk Array Matrix Independent Fine Striping Course Striping None JBOD RAID0 Replication Mirroring RAID1 RAID0+1 Parity Disk RAID3 RAID4 Striped Parity Gray90 RAID5 October 2010, Greg Ganger 29

Back to Mean Time To Data Loss (MTTDL) MTBF (Mean Time Between Failures) MTBF disk ~ 1,200,00 hours (~136 years) pretty darned good, if you believe the number MTBF mutli-disk system = mean time to first disk failure which is MTBF disk / (number of disks) For a striped array of 200 drives MTBF array = 136 years / 200 drives = 0.65 years October 2010, Greg Ganger 30

Reliability without rebuild 200 data drives with MTBF drive MTTDL array = MTBF drive / 200 Add 200 drives and do mirroring MTBF pair = (MTBF drive / 2) + MTBF drive = 1.5 * MTBF drive MTTDL array = MTBF pair / 200 = MTBF drive / 133 Add 50 drives, each with parity across 4 data disks MTBF set = (MTBF drive / 5) + (MTBF drive / 4) = 0.45 * MTBF drive MTTDL array = MTBF set / 50 = MTBF drive / 111 October 2010, Greg Ganger 31

Rebuild: restoring redundancy after failure After a drive failure data is still available for access but, a second failure is BAD So, should reconstruct the data onto a new drive on-line spares are common features of high-end disk arrays reduce time to start rebuild must balance rebuild rate with foreground performance impact a performance vs. reliability trade-offs How data is reconstructed Mirroring: just read good copy Parity: read all remaining drives (including parity) and compute October 2010, Greg Ganger 32

Reliability consequences of adding rebuild No data loss, if fast enough That is, if first failure fixed before second one happens New math is MTTDL array = MTBF firstdrive * (1 / prob of 2 nd failure before repair) which is MTTR drive / MTBF seconddrive For mirroring MTBF pair = (MTBF drive / 2) * (MTBF drive / MTTR drive ) For 5-disk parity-protected arrays MTBF set = (MTBF drive / 5) * (MTBF drive / 4 / MTTR drive ) October 2010, Greg Ganger 33

Three modes of operation Normal mode everything working; maximum efficiency Degraded mode some disk unavailable must use degraded mode operations October 2010, Greg Ganger 34

Three modes of operation Normal mode everything working; maximum efficiency Degraded mode some disk unavailable must use degraded mode operations Rebuild mode reconstructing lost disk s contents onto spare degraded mode operations plus competition with rebuild October 2010, Greg Ganger 35

Mechanics of rebuild Background process use degraded mode read to reconstruct data then, write it to replacement disk Implementation issues Interference with foreground activity and controlling rate rebuild is important for reliability foreground activity is important for performance Using the rebuilt disk for rebuilt part, reads can use replacement disk must balance performance benefit with rebuild interference October 2010, Greg Ganger 36