Theoretical Aspects of Storage Systems Autumn 2009 Chapter 2: Double Disk Failures André Brinkmann
Data Corruption in the Storage Stack
What are Latent Sector Errors?
What is Silent Data Corruption?
  Checksum Mismatches, Identity Discrepancies, and Parity Inconsistencies
Experiences concerning Silent Data Corruption
  How to detect these errors and how do they occur?
  Are there any differences between nearline disks and enterprise class disks?
  Is the age of a disk significant?
  Is there any temporal / spatial locality?
  Is the block number important?
What are the influences of silent data corruption and latent sector errors on RAID recovery?
Outline Importance of multi-error correcting codes RAID 6 strategies Reed-Solomon Codes and Galois fields RAID with Double Parity-Encodings Disk Arrays Row Diagonal Parity EVENODD
Double Disk Failures
Assumptions for storage cluster environment:
  1 PByte of data stored on 2000 computers
  Environment is grouped into 200 RAID 5 sets with 10 disks each
  MTBF of each computer (including disks) is 1000 days
  Recovery of a computer is 1 day
$\mathrm{MTTDL} = \frac{(1000\,\mathrm{d})^2}{200 \cdot 10 \cdot 9 \cdot 1\,\mathrm{d}} \approx 55\,\mathrm{d}$
Protection against single disk failures not enough in large scale environments
RAID 6 is mandatory in large environments
Recovery time has to be minimized
Example taken from Lustre Manual v1.6, August 2007
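The arithmetic behind the 55-day figure can be reproduced with a few lines of Python. This is only a sketch of the usual approximation (data is lost when a second member of a set fails during the recovery window of the first); the function name and parameter names are illustrative and do not come from the Lustre manual.

```python
def mttdl_raid5_sets(num_sets, members_per_set, mtbf_days, recovery_days):
    """Approximate mean time to data loss for independent RAID 5 sets.

    A set loses data when one of its remaining (members - 1) devices
    fails while the first failed device is still being recovered.
    """
    n = members_per_set
    return mtbf_days ** 2 / (num_sets * n * (n - 1) * recovery_days)


# Values from the slide: 200 sets, 10 members each, MTBF 1000 d, recovery 1 d
mttdl = mttdl_raid5_sets(200, 10, 1000.0, 1.0)
print(f"MTTDL ~ {mttdl:.1f} days")
```

Doubling the recovery time halves the result, which is why the slide stresses that recovery time has to be minimized.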
RAID 6
Any form of RAID that can continue to execute read and write requests to all of an array's virtual disks in the presence of two concurrent disk failures. Both dual check data computations (parity and Reed-Solomon) and orthogonal dual parity have been proposed for RAID Level 6.
SNIA: The Dictionary of Storage Networking Terminology
Coding / Implementation is not standardized
Terms and Definitions Number of data disks: n Number of coding disks: m Rate of a code: R = n/(n+m) Identifiable Failure: Erasure
Issues with Erasure Coding
Failure Coverage - Four ways to specify
  Specified by a threshold (e.g. 3 erasures always tolerated)
  Specified by an average (e.g. can recover from an average of 11.84 erasures)
  Specified as MDS (Maximum Distance Separable): Threshold = average = m. Space optimal.
  Specified by Overhead Factor f: f = factor from MDS = m/average. f is always >= 1; f = 1 is MDS.
J. Plank: Erasure Codes for Storage Applications. Tutorial given at FAST-2005
Problem Definition
Partitioning of (n+m) disks into n data disks $d_1, \ldots, d_n$ and m checksum devices $c_1, \ldots, c_m$
Every disk can store k bytes
Aim: Up to m disks can fail without data loss
Capacity of each disk is given in chunks of the size of a word:
$l = (k\ \mathrm{bytes}) \cdot \frac{8\ \mathrm{bits}}{1\ \mathrm{byte}} \cdot \frac{1\ \mathrm{word}}{w\ \mathrm{bits}} = \frac{8k}{w}$
Encoding function $F_i$ for chunk j on checksum disk i is only based on the corresponding words on the data disks:
$c_{i,j} = F_i(d_{1,j}, d_{2,j}, \ldots, d_{n,j})$
Update from data word $d_{x,j}$ to $d'_{x,j}$ on disk x only requires an update for the checksums of the same row:
$c'_{i,j} = G_i(d_{x,j}, d'_{x,j}, c_{i,j})$
J. Plank: A Tutorial on Reed-Solomon Coding for Fault-Tolerant RAID-like Systems
RAID 5 properties
For RAID 5, the number of checksum devices is m=1 and word length is w=1
The checksum is computed as follows:
$c_1 = F_1(d_1, \ldots, d_n) = d_1 \oplus d_2 \oplus \cdots \oplus d_n$
$c_1$ can be recalculated from the parity of its old value and the old and new data word:
$c'_1 = G_{1,j}(d_j, d'_j, c_1) = c_1 \oplus d_j \oplus d'_j$
Each word of a failed device can be restored as the parity of the corresponding words on the remaining devices:
$d_j = d_1 \oplus \cdots \oplus d_{j-1} \oplus d_{j+1} \oplus \cdots \oplus d_n \oplus c_1$
J. Plank: A Tutorial on Reed-Solomon Coding for Fault-Tolerant RAID-like Systems
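The three RAID 5 operations (encode, small-write update, reconstruction) can be sketched in Python as follows. The slide uses w=1; the sketch works on bytes instead, which is the same computation carried out on 8 bits in parallel. All function names are illustrative.

```python
from functools import reduce
from operator import xor


def parity(words):
    """c_1 = d_1 xor d_2 xor ... xor d_n"""
    return reduce(xor, words, 0)


def update_parity(c_old, d_old, d_new):
    """c_1' = c_1 xor d_j xor d_j' (no other data word is read)."""
    return c_old ^ d_old ^ d_new


def reconstruct(words, c):
    """Restore the single missing word (marked None) from survivors and c_1."""
    return reduce(xor, (w for w in words if w is not None), c)


data = [0x3A, 0xC5, 0x0F, 0x77]
c1 = parity(data)

# Small write: replace d_3 and patch the checksum incrementally
c1 = update_parity(c1, data[2], 0x99)
data[2] = 0x99

# Device 2 fails: rebuild its word from the remaining devices
survivors = [data[0], None, data[2], data[3]]
rebuilt = reconstruct(survivors, c1)
print(hex(rebuilt))
```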
Reed-Solomon Codes
The only MDS coding technique for arbitrary n and m
  This means that m erasures are always tolerated
Have been around for decades
Operate on binary words of data, composed of w bits, where $2^w > n+m$
Expensive
J. Plank: Erasure Codes for Storage Applications. Tutorial given at FAST-2005
Multi-error correcting Reed-Solomon codes
Standard approach for disk protection
Use matrices A, E with … and the following properties:
  The system stays linearly independent after the elimination of m rows in A or E
  Data can be reconstructed using the Gaussian elimination method
Derive from a Vandermonde Matrix
J. Plank: A Tutorial on Reed-Solomon Coding for Fault-Tolerant RAID-like Systems
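To make the matrix view concrete, here is a minimal sketch for the special case m = 2, not the general Vandermonde-derived construction. The assumptions are mine: words are bytes in GF(2^8) with the reduction polynomial 0x11D, and the two checksum rows are (1, 1, ..., 1) and (1, 2, ..., n). For m = 2 any two lost data words lead to a 2x2 system with determinant g_x xor g_y, which is nonzero, so elimination always succeeds. All names are illustrative.

```python
PRIM_POLY = 0x11D  # assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Multiply in GF(2^8) by shift-and-add with polynomial reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_inv(a):
    """Inverse via a^254, since a^255 = 1 for every nonzero a."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def encode(data):
    """Checksums c_1 = sum d_j and c_2 = sum j * d_j (j = 1..n)."""
    c1 = c2 = 0
    for j, d in enumerate(data, start=1):
        c1 ^= d
        c2 ^= gf_mul(j, d)
    return c1, c2


def recover_two(data, c1, c2, x, y):
    """Solve for the data words at 0-based positions x and y."""
    s1, s2 = c1, c2
    for j, d in enumerate(data, start=1):
        if j - 1 in (x, y):
            continue  # erased word, never read
        s1 ^= d
        s2 ^= gf_mul(j, d)
    gx, gy = x + 1, y + 1
    # s1 = d_x + d_y and s2 = gx*d_x + gy*d_y; eliminate d_y
    dx = gf_mul(s2 ^ gf_mul(gy, s1), gf_inv(gx ^ gy))
    return dx, s1 ^ dx


data = [0x12, 0x34, 0x56, 0x78, 0x9A]
c1, c2 = encode(data)
damaged = [0x12, None, 0x56, None, 0x9A]
print(recover_two(damaged, c1, c2, 1, 3) == (0x34, 0x78))
```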
Challenges of Reed-Solomon Codes
It has to hold that $2^w > n + m$
Word size is an issue:
  If n+m < 256, we can use bytes as words
  If n+m < 65,536, we can use shorts as words
Arithmetic has to be closed under addition and multiplication (and has to contain the corresponding inverses)
  Holds without problems for infinite-precision real numbers, but not for fixed-size words or even integers
Frequent error: calculating with integers modulo $2^w$
  Division is not defined for all elements, so the Gaussian elimination method is not possible
Use Galois Fields with $2^w$ elements: GF($2^w$)
J. Plank: A Tutorial on Reed-Solomon Coding for Fault-Tolerant RAID-like Systems
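The "frequent error" is easy to check by brute force. The sketch below, with illustrative names and w = 8, lists the nonzero integers modulo 2^w that have no multiplicative inverse; every even value is among them, so division fails for about half of all elements.

```python
W = 8
MOD = 1 << W  # integers modulo 2^w


def has_inverse(a, mod=MOD):
    """True if some b satisfies a * b = 1 (mod 2^w)."""
    return any((a * b) % mod == 1 for b in range(mod))


non_invertible = [a for a in range(1, MOD) if not has_inverse(a)]
print(len(non_invertible))   # 127: exactly the even values 2, 4, ..., 254
print(non_invertible[:5])    # [2, 4, 6, 8, 10]
```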
Galois Field Arithmetic
GF($2^w$) has elements 0, 1, 2, …, $2^w - 1$
Addition = XOR
  Easy to implement
  Nice and fast
Multiplication is hard to explain
  If w is small ($\le 8$), use a multiplication table
  If w is bigger ($\le 16$), use log/anti-log tables
  Otherwise, use an iterative process
J. Plank: Erasure Codes for Storage Applications. Tutorial given at FAST-2005
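The options listed above can be sketched for w = 8 as follows. The reduction polynomial 0x11D and the generator 2 are assumptions of this sketch (a common choice, not something the slide specifies), and the function names are illustrative. The iterative variant is included so that the table-based variant can be cross-checked against it.

```python
PRIM_POLY = 0x11D   # assumed: x^8 + x^4 + x^3 + x^2 + 1
ORDER = 255         # number of nonzero elements in GF(2^8)

# Build anti-log (EXP) and log (LOG) tables from powers of the generator 2.
EXP = [0] * (2 * ORDER)   # doubled so LOG[a] + LOG[b] needs no modulo
LOG = [0] * 256
value = 1
for power in range(ORDER):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(ORDER, 2 * ORDER):
    EXP[power] = EXP[power - ORDER]


def gf_add(a, b):
    """Addition (and subtraction) is XOR."""
    return a ^ b


def gf_mul(a, b):
    """Table-based multiplication: add the logs, look up the anti-log."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Division: subtract the logs modulo 255."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % ORDER]


def gf_mul_iter(a, b):
    """Iterative shift-and-add multiplication, usable for any w."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


print(hex(gf_mul(2, 128)))   # 0x1d: the product overflows and is reduced
```

Unlike integers modulo 2^w, every nonzero element here has an inverse, so `gf_div` is defined for all nonzero divisors.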
Orthogonal Parity RAID
Disks are organized as a 2-dimensional matrix
Parity is computed for
  Each row
  Each column
Advantages
  Allows failure of many (at least 2) disks
Disadvantages
  More parity blocks needed
  Slow writes: Each write requires 3 read and 4 write operations
[Figure: 4 x 4 grid of blocks numbered 0 to 15; each block represents a dedicated disk]
Single disk failure handled similar to RAID 5
All double disk failures can be handled by row and/or column parity
Row-Diagonal Parity (RDP) RAID
Dedicated RAID Double Parity implementation from NetApp
Reduces the number of necessary parity disks at the cost that not every failure can be resolved directly
First parity dimension is performed as RAID 5 over each row, and XOR parities are calculated for each stripe
Stripe consists of p disks, where p is a prime number, and disks 0 … p-1 include parity disks
Data blocks are stored on disks 0 … p-2
Block (i,k) belongs to parity set (i+k) mod p
First or last diagonal cannot be used to build its own parity
[Figure: RDP stripe layout; DP marks diagonal parity blocks]
Corbett et al.: Row-Diagonal Parity for Double Disk Failure Correction, FAST 2004
Erasures and RDP
Possible cases:
  Only a single disk fails: No difference compared to RAID 4
  Two disks fail:
    Block (0,1) can be restored based on a diagonal
    Block (0,0) can be reconstructed AFTERWARDS based on a stripe
    Block (3,0) will be restored based on a diagonal
    ...
[Figure: RDP stripe layout; DP marks diagonal parity blocks]
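The alternation between diagonal and row steps can be sketched as a small "peeling" decoder. The layout is an assumption based on the RDP paper rather than on the figure: p-1 rows per stripe, data on disks 0 to p-2, row parity on disk p-1, and a separate diagonal parity disk p; block (i,k) with k < p lies on diagonal (i+k) mod p, and diagonal p-1 is not stored. The block numbering may therefore differ from the example on the slide, and all names are illustrative.

```python
import random


def rdp_encode(data, p):
    """data: (p-1) rows x (p-1) data disks -> stripe with p+1 disks."""
    stripe = [list(row) + [0, 0] for row in data]
    for i in range(p - 1):
        for k in range(p - 1):
            stripe[i][p - 1] ^= stripe[i][k]        # row parity
    for i in range(p - 1):
        for k in range(p):                          # data and row parity
            d = (i + k) % p
            if d != p - 1:                          # diagonal p-1 is skipped
                stripe[d][p] ^= stripe[i][k]        # diagonal parity
    return stripe


def rdp_parity_sets(p):
    """Each set is a list of blocks whose XOR is zero."""
    sets = [[(i, k) for k in range(p)] for i in range(p - 1)]
    for d in range(p - 1):
        diagonal = [(i, (d - i) % p) for i in range(p - 1)]
        sets.append(diagonal + [(d, p)])
    return sets


def rdp_recover(stripe, p, failed):
    """Rebuild the blocks of the failed disks in place."""
    unknown = {(i, k) for i in range(p - 1) for k in failed}
    sets = rdp_parity_sets(p)
    progress = True
    while unknown and progress:
        progress = False
        for cells in sets:
            missing = [c for c in cells if c in unknown]
            if len(missing) != 1:
                continue                            # not solvable yet
            value = 0
            for i, k in cells:
                if (i, k) != missing[0]:
                    value ^= stripe[i][k]
            i, k = missing[0]
            stripe[i][k] = value
            unknown.remove(missing[0])
            progress = True
    if unknown:
        raise ValueError("erasure pattern not recoverable")
    return stripe


p = 5
rng = random.Random(1)
data = [[rng.randrange(256) for _ in range(p - 1)] for _ in range(p - 1)]
original = rdp_encode(data, p)
damaged = [[None if k in (0, 1) else v for k, v in enumerate(row)]
           for row in original]
print(rdp_recover(damaged, p, failed=(0, 1)) == original)
```

The decoder never reads a block that is still unknown: a parity set is used only once exactly one of its blocks is missing.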
RDP and Parity distribution
Original layout of RDP has the same bottleneck as RAID 4
A rotation of the parity disk after each stripe seems difficult at best
If the RDP pattern consists of m rows, then it is possible to rotate the roles of the disks every m rows
EVENODD
EVENODD was proposed in 1994
First MDS code that is based solely on parities and corrects two erasures
Can be seen as the foundation of RDP
Higher number of operations for update and reconstruction than RDP
Every storage node forms one column
Number of data nodes: m, where m is prime
Number of nodes: m + 2
Blaum, Brady et al.: EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures
EVENODD Codes
Definition of the blocks as a matrix of dimension (m-1) x (m+2) with elements $a_{i,j}$
Element $a_{i,j}$ with $0 \le i \le m-2$ and $0 \le j \le m+1$ is symbol i on disk j
Disks m and m+1 store redundant information
Imaginary zero row $a_{m-1,j}$ as last row
Calculation of stripe parities:
Calculation of diagonal parities:
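For reference, the two parity rules can be written as follows, using the definition from Blaum et al.: i is the row (symbol) index, t runs over the data disks, the imaginary row satisfies $a_{m-1,t} = 0$, $\langle x \rangle_m$ denotes x mod m, and S is the adjuster taken from the diagonal through the imaginary row.

```latex
\begin{align*}
a_{i,m}   &= \bigoplus_{t=0}^{m-1} a_{i,t}
            && 0 \le i \le m-2 \\
S         &= \bigoplus_{t=1}^{m-1} a_{m-1-t,\,t} \\
a_{i,m+1} &= S \oplus \bigoplus_{t=0}^{m-1} a_{\langle i-t \rangle_m,\,t}
            && 0 \le i \le m-2
\end{align*}
```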
EVENODD Codes
[Figure: EVENODD layout showing parity I, parity II, and the adjuster]
Parity node 1: Simple horizontal parity
Parity node 2: Diagonal parity including additional adjuster
Slide is based on: Huang, Xu: STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures, FAST 2005
EVENODD Decoding
Note that S is the parity of all symbols in the two parity columns
There will be at least one diagonal that is missing just one data word: Decode it / them
Then there will be at least one row missing just one data word: Decode it / them
Continue this process until all the data words are decoded
J. Plank: Erasure Codes for Storage Applications. Tutorial given at FAST-2005
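The decoding loop can be sketched in Python for the case that two data disks fail. The sketch assumes the layout of Blaum et al. (m an odd prime, m-1 rows, data on disks 0 to m-1, row parity on disk m, diagonal parity on disk m+1, imaginary zero row m-1) and recovers S as the XOR of all symbols on both parity disks, which holds because m-1 is even. The function names are illustrative, and the other failure cases (one or both parity disks lost) are left out.

```python
import random


def evenodd_encode(data, m):
    """data: (m-1) rows x m data disks -> stripe with m+2 disks."""
    a = [list(row) for row in data] + [[0] * m]     # imaginary zero row
    s = 0
    for t in range(1, m):
        s ^= a[m - 1 - t][t]                        # adjuster S
    stripe = []
    for i in range(m - 1):
        row_parity, diag_parity = 0, s
        for t in range(m):
            row_parity ^= a[i][t]
            diag_parity ^= a[(i - t) % m][t]
        stripe.append(list(data[i]) + [row_parity, diag_parity])
    return stripe


def evenodd_recover_data(stripe, m, x, y):
    """Rebuild data disks x and y; both parity disks must be intact."""
    rows = m - 1
    s = 0
    for i in range(rows):
        s ^= stripe[i][m] ^ stripe[i][m + 1]        # S from both parity disks
    a = [row[:m] for row in stripe] + [[0] * m]
    unknown = {(i, c) for i in range(rows) for c in (x, y)}
    # Each entry: (data cells, XOR these cells must add up to)
    sets = [([(i, t) for t in range(m)], stripe[i][m]) for i in range(rows)]
    for d in range(m):
        cells = [((d - t) % m, t) for t in range(m)]
        target = s if d == m - 1 else s ^ stripe[d][m + 1]
        sets.append((cells, target))
    progress = True
    while unknown and progress:
        progress = False
        for cells, target in sets:
            missing = [c for c in cells if c in unknown]
            if len(missing) != 1:
                continue                            # decode only single gaps
            value = target
            for i, t in cells:
                if (i, t) != missing[0]:
                    value ^= a[i][t]
            i, t = missing[0]
            a[i][t] = value
            unknown.remove(missing[0])
            progress = True
    if unknown:
        raise ValueError("erasure pattern not recoverable")
    return [a[i] + [stripe[i][m], stripe[i][m + 1]] for i in range(rows)]


m = 5
rng = random.Random(7)
data = [[rng.randrange(256) for _ in range(m)] for _ in range(m - 1)]
original = evenodd_encode(data, m)
damaged = [[None if k in (0, 1) else v for k, v in enumerate(row)]
           for row in original]
print(evenodd_recover_data(damaged, m, 0, 1) == original)
```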
STAR: 3-error correcting code
Extension of EVENODD
Additional parity row (parity III)
Slide from: Huang, Xu: STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures, FAST 2005
Decoding Complexity
Comparison between STAR, EVENODD, additional codes from Blaum, and a purely XOR-based RS code from Blömer et al.
Slide based on Huang, Xu: STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures, FAST 2005
Decoding Performance
Slide based on Huang, Xu: STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures, FAST 2005