Reliability of Systems and Components




GNS Systems GmbH, Am Gaußberg 2, 38114 Braunschweig / Germany
Fri 13 Apr 2012, Revision 0.1.1
Please send comments about this document to: norman.krebs@gns-systems.de

Contents

1 Business Continuity
2 Down Times
3 Availability
4 Reliability of RAID
  4.1 Annualized Failure Rate
  4.2 RAID 0 - Stripe
  4.3 RAID 1 / RAID 5
  4.4 RAID 6
  4.5 RAID 01
  4.6 RAID 10
  4.7 RAID 01 vs. RAID 10
  4.8 Estimation of Failure
  4.9 Bit Error Rate
5 RPO and RTO
  5.1 Sample Recovery Levels

Reliability theory and practice covers several objectives. This document mentions a few of them to introduce the key terms and calculations briefly.

1 Business Continuity

Business continuity consists of:

- High availability - keep the IT infrastructure running in case of minor outages
- Continuous operations - find a way to carry out maintenance and backups with little or no disturbance of the running processes
- Disaster recovery - keep the IT infrastructure running, or get it running again quickly, after major outages

2 Down Times

availability    down time/year
98%             7.3 days
99%             3.65 days
99.5%           1.83 days
99.9%           8.76 hours
99.99%          52.6 minutes
99.999%         5.26 minutes
99.9999%        31.5 seconds
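The downtime values in the table follow directly from the availability. As a quick cross-check, here is a minimal Python sketch (the 365-day year is an assumption, not part of the original table):

```python
# Downtime per year for a given availability, assuming a 365-day year.
HOURS_PER_YEAR = 365 * 24

def downtime_per_year(availability: float) -> float:
    """Return the expected downtime per year in hours."""
    return (1.0 - availability) * HOURS_PER_YEAR

for a in (0.98, 0.99, 0.995, 0.999, 0.9999, 0.99999, 0.999999):
    hours = downtime_per_year(a)
    print(f"{a * 100:g}% available -> {hours:10.3f} h/year ({hours * 60:.1f} min)")
```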

3 Availability

[Figure: timeline showing a failure (EE), the following recovery (U), and the intervals MTTF, MTTR and MTBF]

short   unit          description
EE                    failure
U                     recovery
MTTR    [h]           mean time to return
MTTF    [h]           mean time to failure
MTBF    [h]           mean time between failures
A       probability   availability
A_p     probability   availability - parallel components
A_s     probability   availability - serial components

Availability as a function of these times:

A = MTTF / (MTTF + MTTR)    (1)

Figure 1: parallel (C_1/C_2) and serial (C_3/C_4) components

The components C_1 and C_2 with the availabilities A_1 and A_2 are installed in parallel and have the combined availability A_p:

A_p = 1 - (1 - A_1) * (1 - A_2)    (2)

The components C_3 and C_4 with the availabilities A_3 and A_4 are installed in series and have the combined availability A_s:

A_s = A_3 * A_4    (3)
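A small Python sketch of equations (2) and (3); the component availabilities of 0.99 are illustrative values, not taken from the document:

```python
def availability_parallel(a1: float, a2: float) -> float:
    """Combined availability of two components in parallel, eq. (2)."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

def availability_serial(a3: float, a4: float) -> float:
    """Combined availability of two components in series, eq. (3)."""
    return a3 * a4

# Illustrative values (assumed): two components, each 99 % available.
print(availability_parallel(0.99, 0.99))  # 0.9999  - redundancy helps
print(availability_serial(0.99, 0.99))    # 0.9801  - a chain is weaker than its parts
```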

4 Reliability of RAID

This section contains the equations to calculate the MTTF rates for several RAID levels. These are needed to calculate the probability of failure within a given time interval.

short    unit   description
MTTF     [h]    mean time to failure; this value is given by the hardware vendor, who derives it from a mix of knowledge, experience and the ruthless lies of the marketing. It refers to the failure of the whole disk and is nowadays in the range of 500,000 hours.
MTTR     [h]    mean time to return; the whole time needed to rebuild a proper array: MTTR = MFDT + MHRT + MTDR + MTRA
MFDT     [h]    mean failure detection time; the time needed to detect a failure
MHRT     [h]    mean human reaction time; the time needed to bring in a skilled person to handle the problem
MTDR     [h]    mean time to replace the disk
MTRA     [h]    mean time to rebuild the array
A(x)            probability of failure within x years
N               number of drives in the array
Poh(x)   [h]    power-on hours per x years
AFR      [%]    annualized failure rate (note: [%] = [1 / (100 * year)] - keep this in mind)
K_1             correction factor for the MTTF of aged disks - first failure
K_2             correction factor for the MTTF of aged disks - second failure

4.1 Annualized Failure Rate

The AFR is the probability of failure of a component within one year. It is the most important value for the user.

AFR_disk = 100 * Poh(1)_disk / MTTF_disk    (4)

MTTF_disk = 100 * Poh(1)_disk / AFR_disk    (5)
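For illustration, equation (4) evaluated with assumed values: a vendor MTTF of 500,000 h and continuous operation, i.e. Poh(1) = 8760 h:

```python
MTTF_DISK_H = 500_000     # vendor MTTF in hours (assumed example value)
POH_PER_YEAR = 365 * 24   # power-on hours per year, continuous operation

# Eq. (4): AFR in percent per year
afr_percent = 100 * POH_PER_YEAR / MTTF_DISK_H
print(f"AFR = {afr_percent:.2f} % per year")  # ~1.75 % per year
```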

4.2 RAID 0 - Stripe

A RAID 0 is defined as a set of at least two striped disks. There is no redundancy and no safety, but it is the fastest level.

MTTF_R0 = MTTF_disk / N    (6)

A(1)_R0 = Poh(1)_R0 / MTTF_R0 = Poh(1)_R0 * N / MTTF_disk    (7)

4.3 RAID 1 / RAID 5

RAID 1 is defined as a set of two mirrored drives. RAID 5 is defined as a set of N disks with parity blocks distributed over all disks. The volume size is N - 1 disks. One RAID stripe consists of N blocks distributed over the disks; one block contains the parity. RAID 5 writes slowly: the change of a block requires

- N - 2 reads to get the information needed to recalculate the parity
- 2 writes: the data block itself + the parity block

The disks in a RAID are usually bought from one production batch and may show similar failure behaviour. They have been exposed to the same influences of transportation, temperature, drops etc. Thus the probability is high that a second failure follows the first. To represent this fact, the correction factors reduce the remaining MTTF, e.g. K_1 = 0.1 and K_2 = 0.01.

MTTF_R15 = (MTTF_disk / N) * (MTTF_disk * K_1 / ((N - 1) * MTTR))     [1st failure * 2nd failure]
         = MTTF_disk^2 * K_1 / (N * (N - 1) * MTTR)    (8)

A(1)_R15 = Poh(1)_R15 / MTTF_R15    (9)
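A minimal sketch of equations (6)-(9). The values for MTTF, MTTR and the array size N are assumptions for illustration; K_1 = 0.1 is the example value from the text:

```python
MTTF_DISK = 500_000.0   # hours (assumed vendor value)
MTTR = 24.0             # hours to detect, replace and rebuild (assumed)
K1 = 0.1                # correction factor for aged disks (example from the text)
POH_PER_YEAR = 8760.0   # continuous operation
N = 8                   # drives in the array (assumed)

# RAID 0, eqs. (6)/(7)
mttf_r0 = MTTF_DISK / N
a1_r0 = POH_PER_YEAR / mttf_r0

# RAID 1 / RAID 5, eqs. (8)/(9)
mttf_r15 = MTTF_DISK**2 * K1 / (N * (N - 1) * MTTR)
a1_r15 = POH_PER_YEAR / mttf_r15

print(f"RAID 0 : MTTF = {mttf_r0:12.0f} h, A(1) = {a1_r0:.4f}")
print(f"RAID 5 : MTTF = {mttf_r15:12.0f} h, A(1) = {a1_r15:.6f}")
```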

4.4 RAID 6

RAID 6 is defined as a set of N disks. The redundant information, or parity, is stored on all disks. The volume size is N - 2 disks. The change of a block requires:

- N - 3 reads to get the information needed to recalculate the parity
- 3 writes: the data block itself + the two parity blocks

MTTF_R6 = (MTTF_disk / N) * (MTTF_disk * K_1 / ((N - 1) * MTTR)) * (MTTF_disk * K_2 / ((N - 2) * MTTR))     [1st * 2nd * 3rd failure]
        = MTTF_disk^3 * K_1 * K_2 / (N * (N - 1) * (N - 2) * MTTR^2)    (10)

A(1)_R6 = Poh(1)_R6 / MTTF_R6    (11)

4.5 RAID 01

A RAID 01 is defined as a mirror of striped disks.

MTTF_R01 = (2 * MTTF_disk / N) * (2 * MTTF_disk * K_1 / (N * MTTR))     [1st failure * 2nd failure]
         = 4 * MTTF_disk^2 * K_1 / (N^2 * MTTR)    (12)

4.6 RAID 10

A RAID 10 is defined as a stripe of mirrored disks.

MTTF_R10 = (MTTF_disk / 2) * (MTTF_disk * K_1 / MTTR) * (2 / N)     [1st failure * 2nd failure, combined over the N/2 mirror pairs]
         = MTTF_disk^2 * K_1 / (N * MTTR)    (13)
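Continuing with the same assumed numbers as above, equations (10), (12) and (13) side by side; K_1 and K_2 are the example values from the text, everything else is an assumption:

```python
MTTF_DISK = 500_000.0   # hours (assumed)
MTTR = 24.0             # hours (assumed)
K1, K2 = 0.1, 0.01      # correction factors (example values from the text)
N = 8                   # drives in the array (assumed)

# RAID 6, eq. (10)
mttf_r6 = MTTF_DISK**3 * K1 * K2 / (N * (N - 1) * (N - 2) * MTTR**2)

# RAID 01, eq. (12)
mttf_r01 = 4 * MTTF_DISK**2 * K1 / (N**2 * MTTR)

# RAID 10, eq. (13)
mttf_r10 = MTTF_DISK**2 * K1 / (N * MTTR)

for name, mttf in (("RAID 6", mttf_r6), ("RAID 01", mttf_r01), ("RAID 10", mttf_r10)):
    print(f"{name:8s} MTTF = {mttf:.3e} h")
```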

4.7 RAID 01 vs. RAID 10

MTTF_R10 ∝ 1 / N    (14)

MTTF_R01 ∝ 4 / N^2    (15)

The table shows, depending on the number of disks, the increasing probability of failure of a RAID 01 compared with a RAID 10 (relative units):

N     A_R10   A_R01
4     4       4
6     6       9
8     8       16
10    10      25
12    12      36
20    20      100

Avoid RAID 01!
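The table can be reproduced from the proportionalities (14) and (15); a small sketch (the common scaling constant is dropped, so the numbers are relative units only):

```python
# Relative failure probabilities per eqs. (14)/(15):
# A_R10 grows like N, A_R01 like N^2 / 4 (same arbitrary unit).
for n in (4, 6, 8, 10, 12, 20):
    a_r10 = n
    a_r01 = n * n / 4
    print(f"N={n:2d}  A_R10={a_r10:3.0f}  A_R01={a_r01:4.0f}")
```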

4.8 Estimation of Failure

short   unit          description
r       [pcs]         number of failing drives
N       [pcs]         number of drives
r̄       [%]           relative number of failing drives
f(t)    [1/h]         relative number of failing drives per time interval
λ       [1/h]         failure rate
F(t)    probability   probability of failure
R(t)    probability   probability of survival

Since the failure mechanism can be described by an exponential or a Weibull distribution, we can use the following expressions.

The relative number of failing drives is:

r̄ = r / N    (16)

The number of failing drives per time interval, i.e. the density function, is:

f(t) = dr̄/dt    (17)

The probability of failure is:

F(t) = ∫ f(t) dt    (18)

The probability of survival is:

R(t) = 1 - F(t)    (19)

The failure rate λ is:

λ = f(t) / R(t)    (20)

With the Weibull form (shape parameter b), the percentage of drives surviving up to time t is:

R(t) = e^(-(t / MTTF)^b)    (21)

R(MTTF) = e^(-MTTF / MTTF) = e^(-1) = 1/e ≈ 0.368    (22)

This tells us that only 36.8% of the drives will survive the MTTF period.
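A brief numeric check of equation (22), assuming the Weibull form of (21) with the MTTF as scale parameter; the MTTF value itself is an arbitrary assumption and cancels out at t = MTTF:

```python
import math

def survival(t: float, mttf: float, b: float = 1.0) -> float:
    """Weibull survival probability R(t) = exp(-(t / MTTF)^b), eq. (21)."""
    return math.exp(-((t / mttf) ** b))

MTTF = 500_000.0  # hours (assumed)

# At t = MTTF the exponent is 1 regardless of the shape parameter b:
print(survival(MTTF, MTTF))      # 0.3678... -> ~36.8 % of the drives survive
print(1 - survival(MTTF, MTTF))  # ~63.2 % have already failed by then
```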

4.9 Bit Error Rate

short    unit     description
N        [pcs]    number of disks
M_disk   [bit]    amount of bits on one disk
S_disk   [byte]   hard disk size
M_RAID   [bit]    amount of bits in the RAID
BER               probability of one uncorrectable bit read error per bit read; it is in the range of 10^-14 to 10^-15, i.e. one bit error every 10-100 TB read

The BER is a value given by the vendor for one hard drive. It covers all sources of bit failures: the disks themselves, the electronic and mechanical components of the hard drive, and the assumed influences from the surroundings of the hard drive.

The number of bits in the RAID:

M_RAID = S_disk * 8 * N    (23)

If one drive fails in a RAID, all data have to be read to restore the parity. The probability of reading all bits in the RAID without error during a restore after losing one drive is:

A_rrx = (1 - BER)^(M_disk * N)    (24)

The probability of losing bits is:

A_be = 1 - (1 - BER)^(M_disk * N)    (25)

The probability of losing bits depends on the amount of bits read and written, not on time. So we cannot calculate the probability of bit loss within a time interval without knowing the amount of data shoveled around in this time.

Example: BER = 10^-14, M_RAID = 10^14

A_rrx ≈ 0.36
A_be ≈ 0.64
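The example above can be reproduced with equations (24)/(25); the sketch assumes the whole RAID content (M_RAID = M_disk * N bits) is read during the rebuild, as in the example:

```python
import math

BER = 1e-14    # uncorrectable bit error rate (value from the example)
M_RAID = 1e14  # bits read during the rebuild (value from the example)

# Eqs. (24)/(25); log1p keeps the tiny per-bit error probability numerically stable.
a_rrx = math.exp(M_RAID * math.log1p(-BER))  # probability of an error-free rebuild
a_be = 1.0 - a_rrx                           # probability of losing bits

print(f"A_rrx = {a_rrx:.3f}")  # ~0.368
print(f"A_be  = {a_be:.3f}")   # ~0.632
```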

5 RPO and RTO

[Figure: timeline with the last backup (P), the failure (EE) and the recovery (U); RPO is the interval from P to EE, RTO the interval from EE to U]

short   description
P       last backup
EE      failure
U       recovery
RPO     Recovery Point Objective
RTO     Recovery Time Objective

RPO: maximum tolerable data loss (time since the last backup)
RTO: time needed from failure until recovery and resumption of business

Both objectives have to be considered separately.

5.1 Sample Recovery Levels

                        RTO [h]   RPO [h]
Immediate Recovery      4         0
Rapid Recovery          48        < 24
Medium Recovery         120       < 24
Best-Effort Recovery    > 120     < 48
