Sonexion GridRAID Characteristics




Mark Swan, Performance Team, Cray Inc., Saint Paul, Minnesota, USA, mswan@cray.com

Abstract - This paper presents performance characteristics of the Sonexion declustered parity product, known as GridRAID. Comparisons to MDRAID are made for the normal operational state of Object Storage Targets (OSTs) as well as for the degraded states (reconstruct and rebalance).

Keywords - Sonexion; GridRAID; declustered parity

I. INTRODUCTION

The Cray Sonexion GridRAID product is an implementation of declustered parity [1][2][3] by Seagate/Xyratex. The product has several technically interesting features: reduced time spent in degraded modes, lower OST counts per file system, fewer active streams required to reach peak performance, and per-building-block performance similar to its MDRAID counterpart.

II. CRAY SONEXION OVERVIEW

The Cray Sonexion 1600 is composed of a single Metadata Management Unit (MMU) and one or more Scalable Storage Units (SSU). The MMU consists of four servers and either a 2U24 or 5U84 drive enclosure. The four servers are two cluster management servers and two file system metadata servers (i.e., the MGS and MDS). The SSU consists of two Object Storage Servers (OSS), each with 32 GiB of memory, and a 5U84 drive enclosure.

In the MDRAID configuration, the 84 drives in the 5U84 enclosure comprise two SSDs, 80 spinning disks arranged into eight Object Storage Targets (OSTs), each a RAID 6 8+2 array, and two spinning disks serving as global hot spares. Each OSS has primary responsibility for four OSTs.

In the GridRAID configuration, the 84 drives in the 5U84 enclosure comprise two SSDs and 82 spinning disks arranged into two OSTs. Each OSS has primary responsibility for one OST.

As a general guideline, the performance of a single SSU is 5 GB/s sustained and 6 GB/s peak. The SSU is the building block of the file system.

III. RAID 6 8+2 BACKGROUND

In the Sonexion MDRAID configuration, there are ten physical drives associated with each OST. The generic notation for this type of configuration is P drive (N + K), where P is the number of disks in the array, N is the number of data blocks per stripe, and K is the number of parity blocks per stripe. The Sonexion MDRAID is therefore 10 drive (8+2). The data and parity blocks are not strictly confined to any one of the ten drives; rather, data and parity are distributed among the ten drives. The key feature of RAID 6 is that writing and reading of data can continue even when two drives fail concurrently.

Rebuilding a RAID 6 device after the loss of a drive requires regenerating the data for the missing drive and completely writing the contents of the newly replaced drive. As such, the time to complete this rebuild phase is limited by the speed of a single drive. Given a nominal drive write rate of 50 MB/s, writing an entire 1 TB drive takes approximately 5.6 hours (1,000,000 / 50 / 3600 = 5.55). Consequently, an OST created from ten 4 TB drives would take approximately 22.2 hours to rebuild. During this time, I/O operations are degraded. Furthermore, the loss of a second drive causes the OST to enter what is called critical mode, since there is no parity protection when two drives are lost.
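The rebuild arithmetic above can be restated as a short calculation. The following Python sketch simply reproduces the numbers quoted in this section; the 50 MB/s write rate is the nominal figure assumed above, not a measured value.

# MDRAID rebuild time: limited by the write rate of the single replacement drive.
def md_rebuild_hours(drive_tb, write_mb_s=50):
    """Hours to rewrite one replaced drive at a sustained write rate (MB/s)."""
    return drive_tb * 1_000_000 / write_mb_s / 3600   # TB -> MB, seconds -> hours

print(md_rebuild_hours(1))   # ~5.6 hours for a 1 TB drive
print(md_rebuild_hours(4))   # ~22.2 hours for a 4 TB drive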
IV. GRIDRAID OST CREATION

As mentioned previously, in the Sonexion MDRAID product there are eight OSTs in an SSU and each OSS in the SSU has primary responsibility for four OSTs.

In the GridRAID product, an OST is created using 41 of the 82 spinning disks in the standard 5U84 disk enclosure. Therefore, there are two OSTs per SSU and each OSS has primary responsibility for one OST. The generic notation for the GridRAID device is P drive (N + K + A), where P is the number of disks in the array, N is the number of data blocks in a stripe, K is the number of parity blocks in the stripe, and A is the number of distributed spare blocks per tile. The specific notation for GridRAID, then, is 41 drive (8+2+2). Since the RAID 6 stripes are logical, they are not bound to fixed sets of physical devices; all data blocks, parity blocks, and spare blocks are distributed among the physical blocks of the 41 drives. The actual algorithms used to implement the layout of data blocks are considered proprietary information by Seagate/Xyratex and will not be presented in this paper.
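Although the real mapping is proprietary, the general idea of declustered placement can be illustrated with a toy model. The following Python sketch maps each logical 8+2 stripe, together with its distributed spare space, onto a pseudo-randomly chosen subset of the 41 drives, in the spirit of the nearly-random-permutation layouts in the literature [2]. It is purely illustrative and is not the Seagate/Xyratex algorithm.

import random

P, N, K, A = 41, 8, 2, 2          # 41 drive (8+2+2), as described above

def toy_layout(num_stripes, seed=0):
    """Assign each logical stripe N+K drives for data+parity plus A spare
    drives, chosen pseudo-randomly (a stand-in for the proprietary mapping)."""
    rng = random.Random(seed)
    layout = []
    for _ in range(num_stripes):
        drives = rng.sample(range(P), N + K + A)
        layout.append({"data+parity": drives[:N + K], "spare": drives[N + K:]})
    return layout

# With no drive dedicated to data, parity, or spares, any single drive
# appears in only about (N+K)/P of the stripes (~24% here).
stripes = toy_layout(100_000)
failed = 7
hit = sum(failed in s["data+parity"] for s in stripes)
print(f"stripes touching drive {failed}: {hit / len(stripes):.1%}")

The printed fraction (~24%) foreshadows the single-drive-failure discussion in Section VI: with a declustered layout, a failed drive degrades only the stripes that happen to touch it.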

V. PERFORMANCE CHARACTERISTICS

The version of NEO software that supports GridRAID provides the same hooks for viewing and changing the OST allocation position pointer as NEO 1.2.3 [4]. Fig. 1 shows the results of running the obdfilter-survey tool from the fast zones to the slow zones of a GridRAID OST.

Fig. 1, GridRAID OST speed zone differences

Considering that there are four MDRAID OSTs per OSS and only a single GridRAID OST per OSS, we might wonder whether it takes four times as many active streams for the performance of a GridRAID OST to match that of four MDRAID OSTs. Fig. 2 and Fig. 3 show the IOR performance of a single GridRAID OST with 32 MiB transfers. Fig. 4 and Fig. 5 show comparisons of write and read performance of an MDRAID OSS versus a GridRAID OSS. As we can see, GridRAID write performance ramps up to its peak with fewer active streams than MDRAID. GridRAID buffered reads take a few more active streams than MDRAID to reach their peak. GridRAID direct I/O reads are, overall, not as good as MDRAID, but this can be improved with some server-side tuning.

Fig. 2, Performance of buffered I/O, 32 MiB transfers
Fig. 3, Performance of direct I/O, 32 MiB transfers
Fig. 4, Comparison of MDRAID and GridRAID write performance, 32 MiB transfers

Lustre OSTs can be configured to pre-allocate differently sized chunks of data as files are created [4]. This pre-allocation size can be used to create more contiguous data within files as they are written to the OST. In turn, this larger amount of contiguous data can greatly improve read performance, sometimes at the cost of write performance. Fig. 6 and Fig. 7 show the effects of modifying the OST pre-allocation size in order to improve buffered read rates. As we can see, a pre-allocation size of 8 MiB maximizes buffered read rates and causes only a slight decrease in buffered write performance.
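To see why a larger pre-allocation size helps buffered reads, consider a simple back-of-envelope model: a file laid down in longer contiguous runs costs fewer seeks to read back. The Python sketch below uses illustrative numbers (a 1024 MB file, 150 MB/s streaming rate, 8 ms per seek), not measured Sonexion values.

# Effective read rate of a file stored in fixed-size contiguous runs:
# streaming time plus one seek per run.  Larger pre-allocation -> longer
# runs -> fewer seeks -> higher effective rate.
FILE_MB   = 1024     # file size, MB (illustrative)
STREAM_MB = 150.0    # assumed sequential disk rate, MB/s
SEEK_S    = 0.008    # assumed seek + rotational delay per run, seconds

for run_mb in (1, 2, 4, 8, 16):
    extents   = FILE_MB / run_mb
    read_time = FILE_MB / STREAM_MB + extents * SEEK_S
    print(f"{run_mb:>2} MB runs: {extents:4.0f} extents, "
          f"effective read rate {FILE_MB / read_time:6.1f} MB/s")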

Fig. 5, Comparison of MDRAID and GridRAID read performance, 32 MiB transfers
Fig. 6, OST pre-allocation effects, buffered write
Fig. 7, OST pre-allocation effects, buffered read

VI. GRIDRAID DEGRADED MODES

There are two degraded modes of operation for a GridRAID device: reconstruct and rebalance. Like the Sonexion MDRAID product, the Sonexion GridRAID product allows customization of the bandwidth allocated to the recovery phases when Lustre client I/O is occurring. In the absence of I/O from Lustre clients, however, the recovery phases will consume the maximum amount of bandwidth available.

A. Reconstruct

When a single drive fails in the 41-drive OST, the data from the failed drive must be reconstructed and distributed to the spare blocks that exist on the remaining 40 drives. There is no single drive that needs to be written, since all data for the logical RAID 6 stripes is distributed across all drives. Therefore, this reconstruct phase is not limited by the write rate of a single drive; instead, the write bandwidth of the remaining 40 drives is available. Given the same nominal 50 MB/s drive write rate used previously, the data placement is such that the OSS now has 40 * 50 MB/s / 9 (approximately 222 MB/s) of bandwidth available to it for reconstructing the OST. For an OST based on 1 TB drives, this takes approximately 1.25 hours. For an OST based on 4 TB drives, the time from losing a drive to having a fully functional and redundant OST is a little more than 5 hours. Compare this to the 22.2 hours for the MDRAID product.

Fig. 8 shows the bandwidth over time as an idle GridRAID OST (based on 3 TB drives) goes through the reconstruct phase. Total time is approximately 2.5 hours. The reader might recognize the performance curve as matching the performance differences from the fast zones to the slow zones of the disks. According to the data found in /proc/mdstat on the OSS, the bandwidth ranged from approximately 98 MB/s per drive at the beginning to approximately 58 MB/s per drive at the end.

Fig. 9 shows two OSTs going through the reconstruct phase at the same time. OST 1 (black) was idle while OST 0 (green) had a series of IOR jobs run against it. The minimum bandwidth allocated to performing the reconstruct was set to 50 MB/s per disk. Fig. 10 superimposes LMT data from the series of IOR jobs along with the reconstruct rates and times. As IOR begins to move data, the rate of reconstruct for OST 0 decreases to the configured value of 50 MB/s per disk.

Fig. 8, Reconstruct of idle GridRAID OST
Fig. 9, Reconstruct of two GridRAID OSTs
Fig. 10, Superimposing LMT data with reconstruct
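The reconstruct arithmetic above follows the same pattern as the MDRAID rebuild estimate in Section III, except that the aggregate write bandwidth of the 40 surviving drives is available. The Python sketch below reproduces the numbers quoted in this subsection; the divisor of 9 is the factor stated above for regenerating each missing block from the surviving members of its stripe.

# GridRAID reconstruct time: the failed drive's data is rebuilt into the
# distributed spare space on the 40 surviving drives.
def gridraid_reconstruct_hours(drive_tb, survivors=40, write_mb_s=50, factor=9):
    agg_mb_s = survivors * write_mb_s / factor      # ~222 MB/s of reconstruct bandwidth
    return drive_tb * 1_000_000 / agg_mb_s / 3600

print(gridraid_reconstruct_hours(1))   # ~1.25 hours for 1 TB drives
print(gridraid_reconstruct_hours(4))   # ~5 hours for 4 TB drives, vs ~22.2 hours for MDRAID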

B. Rebalance

The rebalance phase occurs when a replacement drive is introduced into an OST. The term rebalance implies that all the data for the OST (which was spread over 40 drives after the reconstruct) needs to be rebalanced among the 41 drives that now exist. Similar to the recovery phase of an MDRAID device, this rebalance phase must write the complete contents of the new drive and, as such, is limited by the bandwidth of that single device. However, during this period, the OST is completely functional and redundant due to the reconstruct phase that has already completed.

Fig. 11 shows the bandwidth over time as an idle GridRAID OST (based on 3 TB drives) goes through the rebalance phase. Total time is approximately 6.5 hours. Fig. 12 shows two OSTs going through the rebalance phase at the same time. OST 1 (black) was idle while OST 0 (green) had a series of IOR jobs run against it. The minimum bandwidth allocated to performing the rebalance was set to 50 MB/s per disk. Fig. 13 superimposes LMT data from the series of IOR jobs along with the rebalance rates and times. As IOR begins to move data, the rate of rebalance for OST 0 decreases to the configured value of 50 MB/s per disk. Finally, Fig. 14 zooms in on the superimposed LMT data to show how the rebalance bandwidth of OST 0 changes as IOR moves data. It is interesting to note the lag with which the rebalance bandwidth increases as IOR activity decreases and then decreases again as IOR activity increases.

Fig. 11, Rebalance of idle GridRAID OST
Fig. 12, Rebalance of two GridRAID OSTs
Fig. 13, Superimposing LMT data with rebalance
Fig. 14, Superimposing LMT data with rebalance, zoomed in to show fluctuation
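Since both recovery phases expose their progress through the md layer, their rates can be watched over time. The Python sketch below assumes the GridRAID devices report status in the familiar /proc/mdstat format that Subsection A consults for per-drive bandwidth; the exact wording of the status line on a Sonexion OSS may differ, so the pattern below is an assumption.

import re
import time

# Matches the progress line of a standard /proc/mdstat, e.g.
#   [==>..........]  recovery = 12.6% (...) finish=123.4min speed=102272K/sec
PROGRESS = re.compile(
    r"(?P<phase>recovery|resync|reshape|check)\s*=\s*(?P<pct>[\d.]+)%.*?"
    r"finish=(?P<finish>[\d.]+)min.*?speed=(?P<speed>\d+)K/sec")

def poll_mdstat(interval_s=60):
    """Print phase, percent complete, ETA, and aggregate speed once per interval."""
    while True:
        with open("/proc/mdstat") as f:
            for m in PROGRESS.finditer(f.read()):
                mb_s = int(m.group("speed")) / 1024
                print(f"{m.group('phase')}: {m.group('pct')}% done, "
                      f"finish in {m.group('finish')} min, {mb_s:.0f} MB/s")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_mdstat()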

C. Single Drive Failure

With the MDRAID implementation, losing a single drive causes one of the four OSTs of that OSS to be in a degraded state, that is, 25% of the OSTs of that particular OSS. Due to the distributed nature of data in a GridRAID OST, losing a single drive causes approximately 25% of the OST to be in a degraded mode and, since there is only a single GridRAID OST per OSS, this is equivalent to MDRAID. However, as previously pointed out, recovering from a single drive loss takes roughly four times longer with MDRAID than with GridRAID.

D. Double Drive Failure

When an MDRAID OST loses a second drive, that entire OST enters what is referred to as critical state. The OST is critical because there is no longer any parity protection. Therefore, 25% of the user data for that OSS is at risk. Losing a third drive will cause complete loss of data for that OST. The MDRAID OST remains in critical state for as long as it takes to rebuild the data on a replacement disk (i.e., limited by the speed of a single drive). In the proprietary GridRAID implementation, a second drive loss puts only approximately 5.5% of the OST into a critical state and, therefore, only 5.5% of the data for that OSS. The OST remains in critical state only as long as the reconstruct phase takes which, again, is about one-fourth of the time of an MDRAID recovery.

E. Periodic RAID Check

While the periodic RAID check is not strictly a degraded mode of operation, it can have a significant impact on Lustre client performance. The RAID check is an integrity check performed by the OSS on a scheduled basis; it can also be invoked manually. Fig. 15 shows two OSTs going through the RAID check phase at the same time. OST 1 (black) was idle while OST 0 (green) had a series of IOR jobs run against it. The minimum bandwidth allocated to performing the RAID check was set to 10 MB/s per disk. The data for OST 0 begins when its RAID check was 26% complete and the data for OST 1 begins when its RAID check was 46% complete. Fig. 16 superimposes LMT data from the series of IOR jobs along with the RAID check rates and times. As IOR begins to move data, the rate of RAID check for OST 0 decreases to the configured value of 10 MB/s per disk.

Fig. 15, RAID check of two GridRAID OSTs
Fig. 16, Superimposing LMT data with RAID check

VII. SUMMARY

The Cray Sonexion GridRAID product, due to the distributed nature of data and parity among 41 drives, offers several benefits over the traditional MDRAID implementation. The most important of these benefits appears to be the reduced time spent in degraded and critical states. Performance characteristics of the GridRAID product are on par with, and sometimes surpass, those of MDRAID.
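The degraded and critical fractions quoted above can be checked with a back-of-envelope calculation, assuming the logical 8+2 stripes are spread uniformly over the 41 drives (the proprietary layout is presumably at least this well balanced). The Python sketch below is a consistency check, not the vendor's own derivation.

P, W = 41, 10      # 41 drives; each logical stripe touches 10 of them (8 data + 2 parity)

one_down = W / P                           # fraction of stripes that include one given drive
two_down = W * (W - 1) / (P * (P - 1))     # fraction that include both of two given drives
print(f"single failure: {one_down:.1%} of stripes degraded")   # ~24.4%, the ~25% above
print(f"double failure: {two_down:.1%} of stripes critical")   # ~5.5%, as stated above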

REFERENCES

[1] M. Holland, "On-Line Data Reconstruction in Redundant Disk Arrays," Carnegie Mellon University, 1994.
[2] A. Merchant and P. S. Yu, "Analytic modeling of clustered RAID with mapping based on nearly random permutation," IEEE Transactions on Computers, vol. 45, no. 3, pp. 367-373, 1996.
[3] S. H. Dau et al., "Parity Declustering for Fault-Tolerant Storage Systems via t-designs," CoRR abs/1209.6152, 2012.
[4] M. Swan, "Tuning and Analyzing Sonexion Performance," CUG 2014.