EECS 262a Advanced Topics in Computer Systems, Lecture 4: Today's Papers, Array Reliability, RAID Basics (two optional papers)




EECS 262a Advanced Topics in Computer Systems
Lecture 4: Filesystems (Con't)
September 15th, 2014
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs262

Today's Papers
- "The HP AutoRAID Hierarchical Storage System" (2-up version), John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. Appears in ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996, pages 108-136.
- "Finding a needle in Haystack: Facebook's photo storage", Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel. Appears in Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI), 2010.
- A system design paper and a system analysis paper. Thoughts?

Array Reliability
- Reliability of N disks = reliability of 1 disk / N
  » 50,000 hours / 70 disks = ~700 hours
- Disk system MTTF drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
- Hot spares support reconstruction in parallel with access: very high media availability can be achieved

RAID Basics (two optional papers)
Levels of RAID (those in RED are actually used):
- RAID 0 (JBOD): striping with no parity (just bandwidth)
- RAID 1: mirroring (simple, fast, but requires 2x storage)
  » 1/n space; reads faster (1x to Nx), writes slower (1x). Why?
- RAID 2: bit-level interleaving with Hamming error-correcting codes (ECC)
- RAID 3: byte-level striping with dedicated parity disk
  » Dedicated parity disk is the write bottleneck, since every write also writes parity
- RAID 4: block-level striping with dedicated parity disk
  » Same bottleneck problems as RAID 3
- RAID 5: block-level striping with rotating parity
  » Most popular; spreads out the parity load; space 1-1/N, read/write (N-1)x
- RAID 6: RAID 5 with two parity blocks (tolerates two drive failures)
Use RAID 6 with today's drive sizes! Why?
- Correlated drive failures (2x expected in a 10-hour recovery) [Schroeder and Gibson, FAST '07]
- Failures during multi-hour/day rebuilds in high-stress environments
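The array-reliability arithmetic on the slide can be checked directly. A minimal sketch, assuming independent disk failures so an N-disk array's MTTF is one disk's MTTF divided by N (the function name is mine, not from the slides):

```python
# Sketch: back-of-envelope array reliability math from the slide,
# assuming independent, exponentially distributed disk failures so
# that MTTF(array of N disks) = MTTF(one disk) / N.

def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an N-disk array with no redundancy."""
    return disk_mttf_hours / n_disks

mttf = array_mttf_hours(50_000, 70)
print(f"70-disk array MTTF: {mttf:.0f} hours (~{mttf / 24:.0f} days)")
```

This reproduces the slide's numbers: 50,000 hours per disk over 70 disks gives roughly 700 hours, i.e. about a month, which is why unprotected arrays are unusable.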

Redundant Arrays of Disks: RAID 1, Disk Mirroring/Shadowing
- Each disk is fully duplicated onto its "shadow" in a recovery group
- Very high availability can be achieved
- Bandwidth sacrifice on write: one logical write = two physical writes
- Reads may be optimized
- Most expensive solution: 100% capacity overhead
- Targeted for high I/O rate, high availability environments

Redundant Arrays of Disks: RAID 5+, High I/O Rate Parity
- A logical write becomes four physical I/Os
- Independent writes possible because of interleaved parity
- Reed-Solomon codes ("Q") for protection during reconstruction
- [Figure: data blocks D0-D23 and parity blocks P laid out across the disk columns; each stripe holds one parity segment and several data segments, with the parity position rotating across disks; logical disk addresses increase along the stripes]
- Targeted for mixed applications

Problems of Disk Arrays: Small Writes
RAID-5 small-write algorithm: 1 logical write = 2 physical reads + 2 physical writes
1. Read old data D0
2. Read old parity P
3. Write new data D0'
4. Write new parity P' = P XOR D0 XOR D0'

System Availability: Orthogonal RAIDs
- Array redundant support components: fans, power supplies, controller, cables
- Data recovery group: unit of data redundancy
- End-to-end data integrity: internal parity-protected data paths
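The small-write update can be sketched in a few lines. This is an illustration of the parity identity P' = P XOR D_old XOR D_new, not a real controller (which would operate on whole sectors, not single bytes):

```python
# Sketch of the RAID-5 small-write algorithm: update one data block
# and recompute parity from the old data and old parity alone,
# without touching the other data disks in the stripe.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(stripe: list, parity: bytes, idx: int, new_data: bytes) -> bytes:
    old_data = stripe[idx]                              # (1) read old data
    new_parity = xor(xor(parity, old_data), new_data)   # (2) read old parity, XOR
    stripe[idx] = new_data                              # (3) write new data
    return new_parity                                   # (4) write new parity

# Parity must remain the XOR of all data blocks after the update.
stripe = [b'\x01', b'\x02', b'\x04', b'\x08']
parity = xor(xor(xor(stripe[0], stripe[1]), stripe[2]), stripe[3])
parity = small_write(stripe, parity, 1, b'\x10')
assert parity == xor(xor(xor(stripe[0], stripe[1]), stripe[2]), stripe[3])
```

The point of the identity is exactly the slide's cost accounting: two reads (old data, old parity) and two writes (new data, new parity) per logical write, independent of stripe width.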

System-Level Availability
- Goal: no single points of failure
- [Figure: fully dual-redundant configuration: hosts, I/O paths, and array controllers, down to the recovery group]
- With duplicated paths, higher performance can be obtained when there are no failures

How to get to RAID 6?
- One option: Reed-Solomon codes (non-systematic):
  » Use of Galois fields (the finite-field equivalent of the real numbers)
  » Data as coefficients, code space as values of a polynomial: P(x) = a_0 + a_1 x + ... + a_4 x^4
  » Coded: P(1), P(2), ..., P(6), P(7)
  » Advantage: can add as much redundancy as you like: 5 disks?
- Problems with Reed-Solomon codes: decoding gets complex quickly, even to add a second disk
- Alternatives: lots of them; I've posted one possibility
  » Idea: use a prime number of columns, with diagonal as well as straight XOR

HP AutoRAID: Motivation
- Goals: automate the efficient replication of data in a RAID
  » RAIDs are hard to set up and optimize
  » Mix fast mirroring (2 copies) with slower, more space-efficient parity disks
  » Automate the migration between these two levels
- RAID small-write problem: overwriting part of a block requires 2 reads and 2 writes!
  » Read data, read parity, write data, write parity
- Each kind of replication has a narrow range of workloads for which it is best...
  » A mistake means 1) poor performance, and 2) changing the layout is expensive and error prone
- Also difficult to add storage: a new disk means changing the layout and rearranging data...

HP AutoRAID: Key Ideas
- Key idea: mirror active (hot) data, use RAID 5 for cold data
  » Assumes only part of the data is in active use at one time
  » Working set changes slowly (to allow migration)
- How to implement this idea?
  » Sys-admin: make a human move the files around... BAD. Painful and error prone
  » File system: best choice, but hard to implement/deploy; can't work with existing systems
  » Smart array controller ("magic disk"): block-level device interface
    - Easy to deploy because there is a well-defined abstraction
    - Enables easy use of NVRAM (why?)
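The non-systematic Reed-Solomon idea on the slide can be made concrete. This is an illustrative sketch only: real RAID-6 codes work over GF(2^8), while here I use the prime field GF(251) so the arithmetic is readable with plain modular operations; all names and the sample data are mine:

```python
# Sketch: data symbols are coefficients of a degree-4 polynomial P(x);
# the codeword is P(1)..P(7).  Any 5 of the 7 values determine P
# (degree <= 4), so the code survives any two erasures (two failed
# drives).  GF(251) stands in for the GF(2^8) a real code would use.

P_MOD = 251  # small prime field, chosen for readability

def encode(data, points=range(1, 8)):
    """Evaluate P(x) = sum(a_i * x^i) at each codeword point."""
    return [sum(a * pow(x, i, P_MOD) for i, a in enumerate(data)) % P_MOD
            for x in points]

def interpolate_at(pairs, x0):
    """Lagrange interpolation over GF(251): P(x0) from k known points."""
    total = 0
    for xi, yi in pairs:
        num, den = 1, 1
        for xj, _ in pairs:
            if xj != xi:
                num = num * (x0 - xj) % P_MOD
                den = den * (xi - xj) % P_MOD
        # pow(den, P_MOD - 2, P_MOD) is den^-1 by Fermat's little theorem
        total = (total + yi * num * pow(den, P_MOD - 2, P_MOD)) % P_MOD
    return total

data = [7, 11, 13, 17, 19]          # five data symbols -> five "disks"
code = encode(data)                  # seven stored values
# Simulate losing drives 4 and 7; rebuild them from the survivors.
survivors = [(x, code[x - 1]) for x in (1, 2, 3, 5, 6)]
assert interpolate_at(survivors, 4) == code[3]
assert interpolate_at(survivors, 7) == code[6]
```

This also shows the slide's complaint: even this toy decoder needs field inverses and interpolation, which is why simpler XOR-based schemes (straight plus diagonal parity over a prime number of columns) are attractive for adding just a second parity disk.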

HP AutoRAID Features
- Block map: a level of indirection so that blocks can be moved around among the disks
  » Implies you only need one zero block (all zeroes); a variation of copy-on-write
  » In fact, could generalize this to have one real block for each unique block
- Mirroring of active blocks
- RAID 5 for inactive blocks or large sequential writes (why?)
  » Start out fully mirrored, then move to 10% mirrored as the disks fill
- Promote/demote in 64 KB chunks (8-16 blocks)
- Hot swap disks, etc. (a hot swap is just a controlled failure)
- Add storage easily (it goes into the mirror pool)
  » Useful to allow different-size disks (why?)
- No need for an active hot spare (per se); just keep enough working space around
- Log-structured RAID 5 writes
  » Nice big streams, no need to read old parity for partial writes

AutoRAID Details
- PEX (Physical Extent): 1 MB chunk of disk space
- PEG (Physical Extent Group): size depends on # disks
  » A group of PEXes assigned to one storage class
- Stripe: size depends on # disks
  » One row of parity and data segments in a RAID 5 storage class
- Segment: 128 KB
  » Stripe unit (RAID 5) or half of a mirroring unit
- Relocation Block (RB): 64 KB
  » Client-visible space unit

Closer Look: [Figure: AutoRAID storage layout]

Questions
- When to demote? When there is too much mirrored storage (>10%)
- Demotion leaves a hole (64 KB). What happens to it? Moved to a free list and reused
- Demoted RBs are written to the RAID 5 log: one write for data, a second for parity
- Why is the RAID 5 log better than update-in-place?
  » Updating in place requires reading all the old data to recalculate parity
  » The log ignores old data (which becomes garbage) and writes only new data/parity stripes
- When to promote? When a RAID 5 block is written...
  » Just write it to mirrored storage; the old version becomes garbage
- How big should an RB be?
  » Bigger: less mapping information, fewer seeks
  » Smaller: finer-grained mapping information
- How do you find where an RB is?
  » Convert addresses to (LUN, offset), then look up the RB in a table keyed by this pair
  » Map size = number of RBs, so it must be proportional to the size of total storage
- How to handle thrashing (too much active write data)?
  » Automatically revert to writing RBs directly to RAID 5!
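The indirection described above can be sketched as a small map. This is a hypothetical structure of my own (the class, field names, and location strings are not the paper's), meant only to show the lookup path and how promotion/demotion is just a map update:

```python
# Sketch (hypothetical, not AutoRAID's actual data structures):
# host addresses become (LUN, offset); the offset is rounded down to
# a 64 KB Relocation Block; the map records each RB's storage class
# and physical location.  A write to a RAID 5 RB promotes it to the
# mirrored class; the old copy becomes garbage for the log cleaner.

RB_SIZE = 64 * 1024  # 64 KB relocation block

class BlockMap:
    def __init__(self):
        self.map = {}  # (lun, rb_number) -> (storage_class, location)

    def lookup(self, lun, offset):
        return self.map.get((lun, offset // RB_SIZE))

    def write(self, lun, offset, location):
        # Newly written data is hot: it always lands in the mirrored class.
        self.map[(lun, offset // RB_SIZE)] = ("mirrored", location)

    def demote(self, lun, rb_number, raid5_location):
        # Migration moves cold RBs to RAID 5 when mirrored space
        # exceeds its target (~10% of capacity).
        self.map[(lun, rb_number)] = ("raid5", raid5_location)

bm = BlockMap()
bm.write(lun=0, offset=3 * RB_SIZE + 100, location="peg7/seg2")
assert bm.lookup(0, 3 * RB_SIZE) == ("mirrored", "peg7/seg2")
bm.demote(0, 3, "peg9/stripe1")
assert bm.lookup(0, 3 * RB_SIZE + 4096)[0] == "raid5"
```

Note how the map entry count tracks the number of RBs, which is the slide's point about map size being proportional to total storage, and its argument for the 64 KB RB granularity trade-off.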

Issues
- Disk writes go to two disks (since newly written data is "hot"); must wait for both to complete. Why?
  » Does the host have to wait for both? No, just for NVRAM
  » Uses the cache for reads; uses NVRAM for fast commit, then moves data to the disks
- What if NVRAM is full? Block until NVRAM is flushed to disk, then write to NVRAM
- What happens in the background? 1) compaction, 2) migration, 3) balancing
- Compaction: clean RAID 5 and plug holes in the mirrored disks
  » Do mirrored disks get cleaned? Yes, when a PEG is needed for RAID 5; i.e., pick a disk with lots of holes and move its used RBs to other disks. The resulting empty PEG is now usable by RAID 5
  » What if there aren't enough holes? Write the excess RBs to RAID 5, then reclaim the PEG
- Migration: which RBs to demote? Least-recently-written (not LRU)
- Balancing: make sure data is spread evenly across the disks (most important when you add a new disk)

Is this a good paper?
- What were the authors' goals?
- What about the performance metrics? Did they convince you that this was a good system?
- Were there any red flags? What mistakes did they make?
- Does the system meet the "Test of Time" challenge?
- How would you review this paper today?

Finding a Needle in Haystack
This is a systems-level solution:
- Takes into account the specific application (photo sharing)
  » Large files! Many files!
  » 260 billion images, 20 petabytes (10^15 bytes!)
  » One billion new photos a week (60 terabytes)
  » Each photo scaled to 4 sizes and replicated (3x)
- Takes into account the environment (presence of a Content Delivery Network, CDN)
  » High cost for NAS and CDN
- Takes into account usage patterns:
  » New photos are accessed a lot (cache well)
  » Old photos are accessed little, but likely to be requested at any time: NEEDLES
- [Figure: cumulative graph of accesses as a function of photo age]

Old Solution: NFS
Issues with this design?
- Long tail: caching does not work for most photos
  » Every access to back-end storage must be fast, without the benefit of caching!
- Linear directory scheme works badly with many photos per directory
  » Many disk operations to find even a single photo (10 I/Os!)
  » Directory's block map too big to cache in memory
  » Fixed by reducing directory size; however, still not great (10 -> 3 I/Os)
- FFS metadata requires 3 disk accesses per lookup (directory, inode, picture)
  » Caching all inodes in memory might help, but inodes are big
- Fundamentally, photo storage is different from other storage:
  » Normal file systems are fine for developers, databases, etc.
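The migration policy above ("least-recently-written, not LRU") is easy to state as code. A minimal sketch under my own naming, showing only the victim-selection rule: a read must not keep an RB in the expensive mirrored class, so only write timestamps matter:

```python
# Sketch (hypothetical structure): demotion picks the RBs written
# longest ago -- least-recently-WRITTEN, deliberately not LRU, since
# reads say nothing about whether a block still needs fast writes.

import heapq

def pick_demotion_victims(last_write_time, n):
    """last_write_time: {rb_id: timestamp of last write}.
    Return the n RBs whose last write is oldest."""
    return heapq.nsmallest(n, last_write_time, key=last_write_time.get)

writes = {"rb1": 50, "rb2": 10, "rb3": 99, "rb4": 30}
assert pick_demotion_victims(writes, 2) == ["rb2", "rb4"]
```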

Solution: Finding a Needle (Old Photo) in Haystack
- Differentiate between old and new photos
  » How? By looking at writeable vs. read-only volumes
  » New photos go to writeable volumes
- Directory: helps locate photos
  » The name (URL) of a photo has the volume and photo ID embedded in it
  » Let the CDN or the Haystack Cache serve new photos rather than forwarding requests to the writeable volumes
- Haystack Store: multiple physical volumes
  » A physical volume is a large file (100 GB) which stores millions of photos
  » Data is accessed by volume ID with an offset into the file
  » Since physical volumes are large files, use XFS, which is optimized for large files
- DRAM usage per photo: 40 bytes vs. a 536-byte inode
- Cheaper/faster: ~28% less expensive, ~4x more reads/s than NAS

What about these results?
Workloads:
- A: random reads of 64 KB images; 85% of raw throughput, 17% higher latency
- B: same as A, but 70% of reads are of 8 KB images
- C, D, E: write throughput with 1, 4, 16 writes batched (30% and 78% throughput gain)
- F, G: mixed workloads (98% R / 2% MW and 96% R / 4% MW, where an MW is a multi-write of 16 images)
Are these good benchmarks? Why or why not?
Are these good results? Why or why not?

Discussion of Haystack
- Did their design address their goals? Why or why not?
- Were they successful? Is this a different question?
- What about the benchmarking? Good performance metrics?
- Did they convince you that this was a good system?
- Were there any red flags? What mistakes did they make?
- Will this system meet the "Test of Time" challenge?

Is this a good paper?
- What were the authors' goals?
- What about the performance metrics? Did they convince you that this was a good system?
- Were there any red flags? What mistakes did they make?
- Does the system meet the "Test of Time" challenge?
- How would you review this paper today?
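The core of the Store design above is the in-memory index over a large append-only volume file: locating a photo touches only memory, and serving it costs a single disk read. A simplified sketch with hypothetical names and no needle header/checksum machinery (the real format stores more per needle):

```python
# Sketch (simplified; field layout and names are mine, not Haystack's):
# a volume is one large append-only file plus an in-memory index of
# photo_id -> (offset, size).  Lookup is pure memory; a get is one
# seek + one read, vs. the several metadata I/Os of an NFS directory
# per-photo-file design.

import os, tempfile

class Volume:
    def __init__(self, path):
        self.path = path
        self.index = {}  # photo_id -> (offset, size); tens of bytes per photo

    def put(self, photo_id, data: bytes):
        with open(self.path, "ab") as f:
            offset = f.tell()        # append at the current end of file
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id) -> bytes:
        offset, size = self.index[photo_id]   # memory only: no disk I/O
        with open(self.path, "rb") as f:      # exactly one disk read
            f.seek(offset)
            return f.read(size)

path = os.path.join(tempfile.mkdtemp(), "volume0.dat")
v = Volume(path)
v.put(42, b"jpeg-bytes-for-photo-42")
v.put(43, b"another-photo")
assert v.get(42) == b"jpeg-bytes-for-photo-42"
```

The 40-bytes-per-photo figure on the slide is what makes this viable: the index for millions of photos per 100 GB volume fits comfortably in DRAM, which a 536-byte inode per photo would not.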