Stripped mirroring RAID architecture




Journal of Systems Architecture 46 (2000) 543-550
www.elsevier.com/locate/sysarc

Hai Jin a,b,*, Kai Hwang b
a The University of Hong Kong, Pokfulam Road, Hong Kong
b Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA

Received 10 September 1998; received in revised form 17 February 1999; accepted 1 July 1999
* Corresponding author. E-mail addresses: hjin@ceng.usc.edu, hjin@eee.hku.hk (H. Jin), kaihwang@usc.edu (K. Hwang).
1383-7621/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S1383-7621(99)00027-2

Abstract

Redundant arrays of independent disks (RAID) provide an efficient, stable storage system for parallel access and fault tolerance. The most common fault-tolerant RAID architectures are RAID-1 and RAID-5. The disadvantage of RAID-1 lies in its excessive redundancy, while the write performance of RAID-5 is only 1/4 of that of RAID-0. In this paper, we propose a high-performance and highly reliable disk array architecture, called stripped mirroring disk array (SMDA). It is a new solution to the small-write problem for disk arrays. SMDA stores the original data in two ways, one copy on a single disk and the other on a plurality of disks in RAID-0 fashion by stripping. The reliability of the system is as good as RAID-1, but with a high throughput approaching that of RAID-0. Because SMDA omits the parity generation procedure when writing new data, it avoids the write performance loss often experienced in RAID-5. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Disk array architecture; Mirroring; Parallel I/O; Fault tolerance; Performance evaluation

1. Introduction

Redundant arrays of independent disks (RAID) [2,5] deliver higher throughput, capacity and availability than can be achieved by a single large disk by hooking together arrays of small disks. RAID technology is an efficient way to relieve the bottleneck between CPU processing speed and I/O processing speed [4]. The tremendous growth of RAID technology has been driven by three factors. First, the growth in processor speed has outstripped the growth in disk data rate. This imbalance transforms traditionally compute-bound applications into I/O-bound applications, so I/O system throughput must be increased by increasing the number of disks. Second, arrays of small-diameter disks often have substantial cost, power and performance advantages over larger drives. Third, such systems can be made highly reliable by storing a small amount of redundant information in the array. Without this redundancy, large disk arrays have unacceptably low data reliability because of their large number of component disks.

Fig. 1 presents an overview of the RAID systems considered in this paper. The figure shows only the first few units on each disk in the different RAID levels. "D" represents a block, or unit, of user data (of unspecified size, but some multiple of one sector) and "Px-y" a parity unit computed over user data units x through y. The numbers on the left indicate the offset into the raw disk, expressed in data units. Shaded blocks represent redundant information, and nonshaded blocks represent user data.
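To make this notation concrete, the following Python sketch prints the kind of layout Fig. 1 depicts for RAID-0 and for RAID-5 with rotating parity. The disk counts and the direction of parity rotation here are illustrative assumptions and need not match the figure exactly.

```python
# Illustrative striped layouts in the "D"/"Px-y" notation used for Fig. 1.
# Disk counts and the parity rotation order are assumptions, not the figure itself.

def raid0_layout(num_disks, stripes):
    """RAID-0: data units D0, D1, ... striped round-robin, no redundancy."""
    return [[f"D{s * num_disks + d}" for d in range(num_disks)]
            for s in range(stripes)]

def raid5_layout(num_disks, stripes):
    """RAID-5: each stripe holds num_disks-1 data units plus one parity unit
    P{x}-{y}; the parity unit rotates one disk per stripe (assumed order)."""
    rows = []
    for s in range(stripes):
        first = s * (num_disks - 1)
        last = first + num_disks - 2
        parity_disk = (num_disks - 1 - s) % num_disks   # rotating parity
        row, d = [], first
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append(f"P{first}-{last}")
            else:
                row.append(f"D{d}")
                d += 1
        rows.append(row)
    return rows

if __name__ == "__main__":
    for name, layout in (("RAID-0", raid0_layout(4, 4)),
                         ("RAID-5", raid5_layout(6, 4))):
        print(name)
        for offset, row in enumerate(layout):
            print(f"  unit {offset}: " + "  ".join(row))
```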

Fig. 1. Data layout in RAID-0, 1, 5 and 10.

RAID-0 is nonredundant and does not tolerate faults. RAID-1 is simple mirroring, in which two copies of each data unit are maintained. RAID-5 exploits the fact that failed disks are self-identifying, achieving fault tolerance with a simple parity (exclusive-or) code and lowering the capacity overhead to only one disk out of six in this example. In RAID-5, the parity blocks rotate through the array rather than being concentrated on a single disk, avoiding a parity-access bottleneck [12]. RAID-10 [11,19] combines RAID-0 and RAID-1 in a single array. It provides data reliability through RAID-1 and enhanced I/O performance through disk stripping.

While RAID-5 disk arrays offer performance and reliability advantages for a wide variety of applications, they have at least one critical limitation: their throughput is penalized by a factor of four relative to RAID-0 for workloads of small writes. This penalty arises because a small-write request may require that the old value of the user's targeted data be read (pre-read) before it is overwritten with new user data, and that the old value of the corresponding parity be pre-read before that second disk block is overwritten with the updated parity. In contrast, systems based on mirrored disks simply write the user's data on two separate disks and are therefore penalized only by a factor of two. This disparity, four accesses per small write instead of two, has been termed the small-write problem [21].

Small-write performance is important. The performance of an on-line transaction processing (OLTP) system is largely determined by its small-write performance. A single read-modify-write of an account record requires five disk accesses on RAID-5, while the same operation requires three accesses on mirrored disks and only two on RAID-0. Because of this limitation, many OLTP systems continue to employ the much more expensive option of mirrored disks.
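The access counts quoted above can be tallied directly; the small sketch below is only a bookkeeping aid (the function names are ours), not a timing model.

```python
# Disk-access counts for one small user write, per the discussion above.
# These are simple tallies, not a simulation.

def small_write_ios(scheme):
    if scheme == "RAID-0":
        return 1                      # write the data block
    if scheme == "RAID-1":
        return 2                      # write both copies
    if scheme == "RAID-5":
        return 4                      # pre-read data, pre-read parity,
                                      # write data, write parity
    raise ValueError(scheme)

def read_modify_write_ios(scheme):
    # One record read followed by a small write of the same record.
    return 1 + small_write_ios(scheme)

for scheme in ("RAID-0", "RAID-1", "RAID-5"):
    print(scheme, small_write_ios(scheme), read_modify_write_ios(scheme))
    # prints 1/2, 2/3 and 4/5 accesses, matching the text above
```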
In this paper, we propose a new RAID architecture with high reliability and high performance, called stripped mirroring disk array (SMDA). It is a new solution to the small-write problem for disk arrays. SMDA stores the original data in two ways, one copy on a single disk and the other on a plurality of disks in the manner of RAID-0.

Section 2 reviews related work. Section 3 discusses the stripped mirroring disk array mechanism. Section 4 analyzes the read/write performance of different RAID architectures. Section 5 closes with conclusions and future work.

2. Related works

Many techniques have been proposed to deal with the RAID write penalty. In this section, we give a brief review of the related work.

Parity stripping [3] stripes the parity across the disks, but does not stripe the data. It builds on the fact that mirrored disks have higher availability and higher throughput: small random requests can achieve high transfer throughput without striping large data into small pieces. The disk utilization and throughput of parity stripping are similar to those of mirrored disks, with cost per gigabyte comparable to RAID-5. Parity stripping also has preferable fault-containment and operational features compared with RAID-5.

The floating data/parity structure [18] shortens the read-modify-write time when modifying data or parity information, so as to ease the small-write problem. Data blocks and parity blocks are assigned to different cylinders, and in each cylinder one track is reserved to hold the modification result of a data block or parity block. This method reduces the response time of small writes by floating the physical address space on the disk, but it is less convenient for large requests: after many small writes, many logically contiguous blocks become physically separated, which enlarges the rotational delay of accessing logically contiguous data.

Parity logging [21,22] uses a large cache or buffer to combine several small writes into one large write, improving the data transfer rate and reducing the time spent updating parity information. All modifications of parity information are stored as a log in the logging cache. When the logging cache fills up, the parity information is written in large blocks to a parity log disk sequentially. When the parity log disk fills up, all the information in the parity log disk, together with the information on the parity disk or data disks, is read back to reconstruct the parity. One disadvantage of parity logging is that when the parity log disk is full, I/O requests must be blocked while the parity is reconstructed, and these operations are carried out in the foreground. Besides, it only reduces the parity-block access time and has no influence on the data blocks.

Disk caching disk (DCD) [7] uses a small log disk, referred to as the cache disk, as a secondary disk cache to optimize write performance. While the cache disk and the normal data disk have the same physical properties, the access speed of the former differs dramatically from that of the latter because of different data units and different ways in which the data are accessed. DCD exploits this speed difference by using the log disk as a cache to build a reliable and smooth disk hierarchy. A small RAM buffer collects small-write requests to form a log that is transferred onto the cache disk whenever the cache disk is idle. Because of temporal locality, the DCD system shows write performance close to that of a RAM of the same size, for the cost of a disk.

The floating-location technique improves the efficiency of writes by eliminating the static association between logical disk blocks and fixed locations in the disk array. When a disk block is written, a new location is chosen in a manner that minimizes the disk-arm time devoted to the write, and a new physical-to-logical mapping is established. An example of this approach is the log-structured file system (LFS) [13,14]. It writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on the disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, LFS divides the log into segments and uses a segment cleaner to compress the live information from heavily fragmented segments. However, because logically nearby blocks may not be physically nearby, the performance of LFS on read-intensive workloads may degrade if the read and write access patterns differ widely. The distorted-mirror approach [20] uses the 100% storage overhead of mirroring to avoid this problem: one copy of each block is stored in a fixed location, while the other copy is maintained in floating storage, achieving higher write throughput while preserving data sequentiality.
However, all floating-location techniques require substantial host or controller storage for mapping information and buffered data.

Write buffering [16,24] delays users' write requests in a large disk or file cache to build a deep queue, which can then be scheduled to substantially reduce seek and rotational positioning overheads. Data loss on a single failure is possible in these systems unless fault-tolerant caches are used [17].

In a traditional mirrored system, all disks storing a mirrored collection are functional, but each may offer a different throughput over time to any individual reader. To avoid this performance inconsistency, the graduated declustering approach [1] fetches data from all available mirrors instead of picking a single disk to read a partition from. In the case where data are replicated on two disks, disk 0 and disk 1, the client alternately sends a request for block 0 to disk 0 and for block 1 to disk 1; as each disk responds, another request is sent to it for the next desired block.
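A minimal sketch of this dispatch pattern for the two-mirror case is given below; the queueing is deliberately simplified (every disk is assumed to respond in round-robin order), so it only illustrates the alternation of requests, not real device timing.

```python
from collections import deque

# Graduated-declustering-style dispatch for two mirrors (disk 0 and disk 1):
# seed one request per mirror, then keep each disk busy by handing it the
# next outstanding block as soon as it responds (toy model).
def dispatch(blocks, num_mirrors=2):
    pending = deque(blocks)
    schedule = {d: [] for d in range(num_mirrors)}
    in_flight = {}
    for d in range(num_mirrors):          # block 0 to disk 0, block 1 to disk 1, ...
        if pending:
            in_flight[d] = pending.popleft()
    while in_flight:                      # disks "respond" in round-robin order here
        for d in list(in_flight):
            schedule[d].append(in_flight.pop(d))
            if pending:
                in_flight[d] = pending.popleft()
    return schedule

print(dispatch(list(range(8))))   # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```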

3. Stripped mirroring disk array architecture

This section discusses the architecture of the stripped mirroring disk array (SMDA). Our approach is motivated by the fact that RAID-0 has the highest data transfer rate and the maximum I/O rate for both reads and writes, while RAID-1 has the highest reliability among all the RAID levels. SMDA stores the original data in two ways: the original data are stored on one disk drive, and the duplicated data are distributed over the other disk drives in the manner of RAID-0.

Fig. 2. Data layout in stripped mirroring disk array (SMDA).

Fig. 2 shows a typical data layout of an SMDA composed of five disk drives. The figure shows only the first few units (here 16 units) on each disk. The numbers on the left indicate the offset into the raw disk, expressed in data units. Nonshaded blocks represent the original information, and shaded blocks represent the duplicated data, which are distributed among all the other disks in the array.

An SMDA comprises a plurality of disk drives; the number of disk drives in the array is at least 3, i.e., if the number of disks is N, then N ≥ 3. The disk array controller controls the writing of data to and the reading of data from each of the disks in the array. The disk drives connected to the array controller form logical groups spanning a plurality of disk drives; in our example, there are two logical groups. The length of each logical group on each disk drive is 2(N - 1) blocks. Each logical group stores both original data and the duplicated copies of those data. A logical group is divided into two sub-groups: the original data are stored in the first sub-group of each disk drive, and the duplicated data are stored in the second sub-group, distributed among all the other disk drives. In our example, the original data blocks D0, D1, D2 and D3 are stored in the first sub-group of the first logical group on disk 0, and the duplicated data blocks D0, D1, D2 and D3 are stored in the second sub-group of the first logical group on disks 1, 2, 3 and 4, respectively.

We call the locations storing the original data in the first sub-group, together with the locations storing the duplicated data distributed in the second sub-group among all the other disk drives, an area pair. In the example, the locations of the first sub-group of disk 0 and the locations of the fifth data block on disks 1, 2, 3 and 4 belong to one area pair. The locations of the first sub-group of disk 1, the fifth data block on disk 0 and the sixth data block on disks 2, 3 and 4 belong to another area pair. The area for original data and the area for duplicated data belonging to the same area pair are located on different disk drives. For a set of n area pairs (where 2 ≤ n ≤ N - 1) whose areas for original data lie on a common disk drive, the corresponding areas for duplicated data are distributed in one-to-one correspondence across n other disk drives. In this way, the original data and the duplicated data are always stored on different drives.
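One consistent reading of this layout can be written as a block-mapping function. The sketch below reconstructs the placement rule from the example above (N = 5); the exact ordering inside the second sub-group is our interpretation and should be checked against Fig. 2.

```python
# A minimal sketch of the SMDA block mapping, reconstructed from the worked
# example in the text (N = 5 disks). The placement rule inside the second
# sub-group is our reading of Fig. 2 and may differ in detail.

def smda_locations(block, n_disks):
    """Return ((disk, offset) of the original copy,
               (disk, offset) of the duplicated copy) for a logical block."""
    assert n_disks >= 3
    per_group = n_disks * (n_disks - 1)        # data blocks per logical group
    group, b = divmod(block, per_group)
    base = group * 2 * (n_disks - 1)           # group offset on every disk
    home, pos = divmod(b, n_disks - 1)         # disk and slot of the original
    original = (home, base + pos)
    others = [d for d in range(n_disks) if d != home]
    dup_disk = others[pos]                     # striped over the other disks
    dup_slot = home if home < dup_disk else home - 1
    duplicate = (dup_disk, base + (n_disks - 1) + dup_slot)
    return original, duplicate

# Reproduce the example: D0..D3 originals on disk 0, duplicates on disks 1..4.
for blk in range(4):
    print(f"D{blk}:", smda_locations(blk, 5))
```

With N = 5 this gives 20 logical blocks per logical group and 8 physical blocks per disk per group, which matches the 16 units (two groups) shown per disk in Fig. 2.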
When data stored in the disk array are to be read, either the original data in one area or the duplicated data distributed in the same area pair among all the other disks can be read in parallel from different drives. In the example above, if data blocks D0, D1, D2 and D3 are to be read from the array, they can be read from disks 1, 2, 3 and 4 in parallel. If only one data block, say D0, is to be read, it can be read from disk 0 and disk 1 in parallel. Because the duplicated data are spread among all the other disk drives in the array in the fashion of RAID-0, SMDA attains high I/O performance by reading the duplicated data in parallel from several disk drives.

When data are written to the disk array, they are written in parallel to the area for original data and to the area for duplicated data belonging to the same area pair, on different disk drives in the array. In the example above, if data block D0 is to be written, it can be written to disk 0 and disk 1 in parallel. Because SMDA writes only the original data and the duplicated data to the disk drives, without keeping any parity information, it avoids the write performance loss of parity-based schemes. Because the parity generation procedure is omitted when writing new data, the overall performance of SMDA is the same as that of RAID-0.

The fault tolerance of the SMDA architecture is realized by keeping the original data and the duplicated data on different disks in the array. If one disk drive in an SMDA crashes, the original data it held can be read from all the other disk drives in the array (via their duplicated copies), and the duplicated data it held can be read from the disk drives storing the corresponding original data in the same area pairs. In the example above, suppose disk 2 fails, and consider the data blocks in the first logical group. The original data blocks D8, D9, D10 and D11 can be read from disks 0, 1, 3 and 4, respectively. The duplicated data blocks D1, D5, D14 and D18 can be read from disks 0, 1, 3 and 4, respectively.
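Degraded-mode reads then amount to falling back to whichever copy survives. The sketch below repeats the illustrative mapping reconstructed in Section 3 and reproduces the disk-2 failure example; the helper names are ours.

```python
# Degraded-mode read path for SMDA (sketch). The mapping below repeats the
# illustrative layout reconstructed from the example in Section 3.
def smda_locations(block, n):
    group, b = divmod(block, n * (n - 1))
    base = group * 2 * (n - 1)
    home, pos = divmod(b, n - 1)
    dup_disk = [d for d in range(n) if d != home][pos]
    dup_slot = home if home < dup_disk else home - 1
    return (home, base + pos), (dup_disk, base + (n - 1) + dup_slot)

def read_location(block, n, failed_disk):
    """Return a (disk, offset) holding a live copy of the block."""
    for loc in smda_locations(block, n):
        if loc[0] != failed_disk:
            return loc
    raise RuntimeError("both copies lost (cannot happen with one failure)")

# Disk 2 fails: originals D8..D11 are served from their duplicates on
# disks 0, 1, 3 and 4, matching the example in the text.
for blk in range(8, 12):
    print(f"D{blk} ->", read_location(blk, 5, failed_disk=2))
```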
4. Modeling and performance evaluation

In this section we present a utilization-based analytical model to compare the I/O access performance of RAID-0, RAID-1, RAID-5, RAID-10 and SMDA. RAID-0 is included for comparison only; it would not be used on its own in a disk array system because it lacks fault tolerance. The model predicts saturated array performance in terms of achieved disk utilization. The variables used in the model are defined as follows:

B_s  Amount of data to be accessed
N    Number of disks in the array
D    Data units per track
T    Tracks per cylinder
S    Average seek time
M    Single-track seek time
R    Average rotational delay (1/2 disk rotation time)
H    Head switch time

We define a unit read as a read access to a single data block in the array, and a unit write as a write access to a single data block in the array. The unit read time (r) and unit write time (w) do not include the start-up mechanical delay, which may comprise seek time, head switch time and rotational delay.

For read accesses the situation is simple: all the RAID architectures have the same unit read time, namely 2R/D. For write accesses the situation for RAID-5 differs from the other RAID architectures. For RAID-5, a small write requires four I/Os: data pre-read, data write, parity read and parity write. These can be combined into two read-rotate-write accesses, each of which is an I/O that reads the block, waits for the disk to spin around once, then updates the block. Each unit write time for RAID-5 is therefore 2R/D + (2R - 2R/D) + 2R/D = 2R + 2R/D, and each small write consists of two such unit writes. For RAID-0, RAID-1, RAID-10 and SMDA, no pre-read is required, so the unit write time is 2R/D.

Next, we discuss the start-up mechanical delay for the different RAID architectures. There are three types of mechanical start-up delay for each I/O access: seek time, head switch time and rotational delay. A seek occurs when the head moves between cylinders to reach user data. A head switch occurs when the head changes tracks within the same cylinder. Rotational delay occurs while the head waits for the data to rotate under it. Because the data layout differs in each RAID architecture, the head switch times (m1) and cylinder switch times (m2) differ as well.
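For illustration, the unit access times can be evaluated numerically; the disk parameters below are arbitrary example values, not values taken from this paper.

```python
# Unit read/write times from the model above, for illustrative parameters.
# R = average rotational delay (half a rotation), D = data units per track.
R_ms = 4.17        # e.g. half a rotation of a 7200 RPM disk, in ms (assumed)
D = 64             # data units per track (assumed)

r = 2 * R_ms / D                   # unit read time, identical for all levels
w_simple = 2 * R_ms / D            # unit write: RAID-0, -1, -10 and SMDA
w_raid5 = 2 * R_ms / D + (2 * R_ms - 2 * R_ms / D) + 2 * R_ms / D
                                   # read, wait for one rotation, write back
print(f"unit read         r = {r:.3f} ms")
print(f"unit write        w = {w_simple:.3f} ms (RAID-0/1/10, SMDA)")
print(f"RAID-5 unit write   = {w_raid5:.3f} ms, i.e. 2R + 2R/D; "
      f"two such accesses per small write")
```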

All these values are listed in Tables 1 and 2, which compare the read and write access times of RAID-0, RAID-1, RAID-5, RAID-10 and SMDA. From this comparison we can see that the SMDA architecture greatly improves the I/O throughput of the disk array. Because of the small-write problem, RAID-5 has the lowest I/O throughput among the five architectures. RAID-1 has limited throughput because only one pair of disks can be accessed in parallel. RAID-10 reaches half the peak throughput, since only half of the disks in the array can be accessed in parallel. For SMDA, since up to N - 1 disks can be accessed in parallel, the total throughput can reach (N - 1)/N of the peak throughput, where we assume that RAID-0 achieves the peak throughput.

SMDA also retains high throughput in degraded mode and during rebuild. In degraded mode there is no need to update extra data to keep the array consistent, as there is in RAID-5; only one copy is written, original or duplicated depending on where the data block lands. Thus, even in degraded mode, SMDA delivers the same I/O performance as in normal mode. During rebuild, the lost data can be restored to the newly replaced disk drive by simple copy operations, avoiding the RAID-5 procedure of reading all the other data and parity information and recomputing the exclusive-or. This greatly reduces the rebuild time and the MTTR. The reliability of SMDA is the same as that of RAID-1 and RAID-10, which is the highest among all the RAID architectures.

Table 1. Comparison of read access time for different RAID architectures

Architecture | Head switch times (m1) | Cylinder switch times (m2) | Unit read time (r) | Total read time
RAID-0  | B_s/(ND)      | B_s/(NDT)      | 2R/D | (S + R + r) + (m2 - 1)(M + R + r) + (m1 - m2 - 1)(H + R + r)
RAID-1  | B_s/(2D)      | B_s/(2DT)      | 2R/D | (S + R + r) + (m2 - 1)(M + R + r) + (m1 - m2 - 1)(H + R + r)
RAID-5  | B_s/((N-1)D)  | B_s/((N-1)DT)  | 2R/D | (S + R + r) + (m2 - 1)(M + R + r) + (m1 - m2 - 1)(H + R + r)
RAID-10 | B_s/((N/2)D)  | B_s/((N/2)DT)  | 2R/D | (S + R + r) + (m2 - 1)(M + R + r) + (m1 - m2 - 1)(H + R + r)
SMDA    | B_s/((N-1)D)  | B_s/((N-1)DT)  | 2R/D | (S + R + r) + (m2 - 1)(M + R + r) + (m1 - m2 - 1)(H + R + r)

Table 2. Comparison of write access time for different RAID architectures

Architecture | Head switch times (m1) | Cylinder switch times (m2) | Unit write time (w) | Total write time
RAID-0  | B_s/(ND)      | B_s/(NDT)      | 2R/D                        | (S + R + w) + (m2 - 1)(M + R + w) + (m1 - m2 - 1)(H + R + w)
RAID-1  | B_s/(2D)      | B_s/(2DT)      | 2R/D                        | (S + R + w) + (m2 - 1)(M + R + w) + (m1 - m2 - 1)(H + R + w)
RAID-5  | B_s/((N-1)D)  | B_s/((N-1)DT)  | 2R/D + (2R - 2R/D) + 2R/D   | 2[(S + R + w) + (m2 - 1)(M + R + w) + (m1 - m2 - 1)(H + R + w)]
RAID-10 | B_s/((N/2)D)  | B_s/((N/2)DT)  | 2R/D                        | (S + R + w) + (m2 - 1)(M + R + w) + (m1 - m2 - 1)(H + R + w)
SMDA    | B_s/((N-1)D)  | B_s/((N-1)DT)  | 2R/D                        | (S + R + w) + (m2 - 1)(M + R + w) + (m1 - m2 - 1)(H + R + w)
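The throughput comparison above can be restated as fractions of the RAID-0 peak; the sketch below simply encodes those statements (RAID-5 is omitted because no single fraction is given for it, and the RAID-1 fraction reflects the statement that only one mirrored pair is active at a time).

```python
# Fraction of RAID-0 peak throughput attainable, per the discussion above,
# as a function of the number of disks N in the array.
def peak_fraction(n_disks):
    return {
        "RAID-0":  1.0,                       # reference: all N disks in parallel
        "RAID-1":  2.0 / n_disks,             # only one mirrored pair active
        "RAID-10": 0.5,                       # half of the disks in parallel
        "SMDA":    (n_disks - 1) / n_disks,   # up to N-1 disks in parallel
    }

for n in (4, 8, 16):
    print(n, peak_fraction(n))
```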

5. Conclusions and future work

This paper presents a new solution to the small-write problem and to high-I/O-load applications on disk arrays. We store the original copy of the data on one disk drive while distributing the duplicated copies over the other drives in the array. The proposed technique achieves substantially higher performance than conventional RAID-5 arrays. No pre-read of old data is required, and there is no need to keep parity information, since the scheme does not rely on a parity-encoded fault-tolerance algorithm. Compared with other RAID architectures, the stripped mirroring RAID architecture (SMDA) can achieve nearly the peak throughput ((N - 1)/N of it). Although the reliability of SMDA is the same as that of RAID-1 and RAID-10, SMDA may deliver higher throughput than RAID-1 and RAID-10.

One application of the SMDA architecture is in the design of I/O systems for clusters of computers. Clusters of workstations [9] are often used in I/O-intensive applications, especially in the business world. High availability in cluster operation demands both high bandwidth and fault tolerance in the distributed disk arrays, and different distributed RAID architectures have been proposed to enhance the reliability of clusters [6,15,23]. We previously proposed a hierarchical checkpointing scheme using a mirroring architecture to build highly available clusters of workstations [8,10]. To improve the throughput of the mirroring architecture, we use the SMDA architecture to store the mirrored checkpoints. We hope the SMDA architecture can be adopted by the RAB [19] and by as many manufacturers as possible as an extension of the standard RAID levels.

References

[1] R.H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D.E. Culler, J.M. Hellerstein, D. Patterson, K. Yelick, Cluster I/O with River: making the fast case common, in: Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS'99), Atlanta, Georgia, May 1999.
[2] P.M. Chen, E.K. Lee, G.A. Gibson, R.H. Katz, D.A. Patterson, RAID: high-performance, reliable secondary storage, ACM Computing Surveys 26 (2) (1994) 145-185.
[3] S. Chen, D. Towsley, The design and evaluation of RAID 5 and parity stripping disk array architectures, Journal of Parallel and Distributed Computing 17 (1993) 58-74.
[4] C.L. Elford, D.A. Reed, Technology trends and disk array performance, Journal of Parallel and Distributed Computing 46 (1997) 136-147.
[5] G. Gibson, Redundant Disk Arrays: Reliable, Parallel Secondary Storage, MIT Press, Cambridge, MA, 1992.
[6] G.A. Gibson, D.F. Nagle, K. Amiri, F.W. Chang, E.M. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, J. Zelenka, File server scaling with network-attached secure disks, in: Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '97), June 1997.
[7] Y. Hu, Q. Yang, DCD - disk caching disk: a new approach for boosting I/O performance, in: Proceedings of the 23rd International Symposium on Computer Architecture, 1996, pp. 169-177.
[8] K. Hwang, H. Jin, E. Chow, C.-L. Wang, Z. Xu, Designing SSI clusters with hierarchical checkpointing and single I/O space, IEEE Concurrency 7 (1) (1999) 60-69.
[9] K. Hwang, Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, WCB/McGraw-Hill, New York, 1998.
[10] H. Jin, K. Hwang, Reconfigurable RAID-5 and mirroring architectures for building high-availability clusters of workstations, Technical Report, Internet and Cluster Computing Laboratory, University of Southern California, Los Angeles, CA, 1999.
[11] V. Jumani, Redundant arrays of inexpensive disks (RAID): technology description, characterization, comparisons, usages and cost benefits, Journal of Magnetics Society of Japan 18 (S1) (1994) 53-58.
[12] E. Lee, R. Katz, The performance of parity placement in disk arrays, IEEE Transactions on Computers C-42 (6) (1993) 651-664.
[13] J.N. Matthews, D. Roselli, A.M. Costello, R.Y. Wang, T.E. Anderson, Improving the performance of log-structured file systems with adaptive methods, in: Proceedings of the 16th Symposium on Operating Systems Principles, October 1997.
[14] B. McNutt, Background data movement in a log-structured disk subsystem, IBM Journal of Research and Development 38 (1) (1994) 47-58.
[15] D.A. Menasce, O.I. Pentakalos, Y. Yesha, An analytic model of hierarchical mass storage systems with network-attached storage devices, in: Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '96), May 1996, pp. 180-189.
[16] J. Menon, Performance of RAID 5 disk arrays with read and write caching, Distributed and Parallel Databases 2 (3) (1994) 261-293.
[17] J. Menon, J. Cortney, The architecture of a fault-tolerant cached RAID controller, in: Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, CA, May 1993, pp. 76-86.

[18] J. Menon, J. Roche, J. Kason, Floating parity and data disk arrays, Journal of Parallel and Distributed Computing (1993).
[19] RAID Advisory Board, The RAIDbook, 7th ed., The RAID Advisory Board, December 1998.
[20] J.A. Solworth, C.U. Orji, Distorted mapping techniques for high performance mirrored disk systems, Distributed and Parallel Databases: An International Journal 1 (1) (1993) 81-102.
[21] D. Stodolsky, G. Gibson, M. Holland, Parity logging: overcoming the small write problem in redundant disk arrays, in: Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, CA, May 1993, pp. 64-75.
[22] D. Stodolsky, M. Holland, W.V. Courtright II, G.A. Gibson, Parity-logging disk arrays, ACM Transactions on Computer Systems 12 (3) (1994) 206-235.
[23] M. Stonebraker, G.A. Schloss, Distributed RAID - a new multiple copy algorithm, in: Proceedings of the Sixth International Conference on Data Engineering, February 1990, pp. 430-443.
[24] K. Treiber, J. Menon, Simulation study of cached RAID5 designs, in: Proceedings of the First International Conference on High-Performance Computer Architecture, January 1995, pp. 186-197.

Kai Hwang received the Ph.D. degree in electrical engineering and computer science from the University of California at Berkeley in 1972. He is a Professor of Computer Engineering at the University of Southern California. Prior to joining USC, he taught at Purdue University for many years. An IEEE Fellow, he specializes in computer architecture, digital arithmetic, parallel processing, and distributed computing. He has published over 150 scientific papers and six books in computer science and engineering. He has served as a distinguished visitor of the IEEE Computer Society, on the ACM SigArch Board of Directors, and as the founding Editor-in-Chief of the Journal of Parallel and Distributed Computing. He has chaired the international conferences ARITH-7 in 1985, ICPP '86, IPPS '96, and HPCA-4 in 1998. His current interests focus on fault tolerance and single system image in multicomputer clusters, and on integrated information technology for multi-agent, Java, Internet, and multimedia applications.

Hai Jin is a Professor of computer science at Huazhong University of Science and Technology, Wuhan, China. He obtained his B.S. and M.S. degrees in computer science from Huazhong University of Science and Technology in 1988 and 1991, respectively, and his Ph.D. degree in electrical and electronic engineering from the same university in 1994. He is the associate dean of the School of Computer Science and Technology at Huazhong University of Science and Technology. In 1996, he received a scholarship from the German Academic Exchange Service (DAAD) for academic research at the Technical University of Chemnitz-Zwickau in Chemnitz, Germany. He was a postdoctoral research fellow in the Department of Electrical and Electronic Engineering at the University of Hong Kong, where he participated in the HKU Pearl Cluster project. Presently, he is a visiting scholar at the Internet and Cluster Computing Laboratory at the University of Southern California, where he is engaged in the USC Trojan Cluster project. He has served as a program committee member of PDPTA'99 and IWCC'99. He has co-authored three books and published nearly 30 papers in international journals and conferences. His research interests cover computer architecture, parallel I/O, RAID architecture design, high-performance storage systems, cluster computing, benchmarking and performance evaluation, and fault tolerance.