Rebuild Strategies for Clustered Redundant Disk Arrays


Gang Fu, Alexander Thomasian, Chunqi Han and Spencer Ng
Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA

Abstract

RAID5 tolerates single disk failures by recreating lost data blocks on demand, but this results in a doubling of the load of the surviving disks for a pure read workload. This increase may be unacceptable if the original load was high. Clustered RAID (CRAID) with a parity group size G smaller than the number of disks (G < N) was proposed so that the increase in load is α = (G-1)/(N-1) < 1, but at the cost of a higher parity overhead 1/G. There have been two implementations of CRAID: (i) the balanced incomplete block design (BIBD) data layout and (ii) the nearly random permutation (NRP) data layout. In this study we consider the latter implementation, since it provides more flexibility in varying α for a fixed N. Rebuild is a systematic reconstruction of the contents of the failed disk on a spare disk, which involves reading rebuild units (say tracks) from a subset of the surviving disks. We compare the effect of processing rebuild requests using the vacationing server model (VSM) and the permanent customer model (PCM), which process rebuild requests at a lower or the same priority as user requests, respectively. We also investigate the effect of a control policy to ensure the progress of the rebuild process, since the spare disk may become a bottleneck in this case. The effect of various parameters on the completion time of rebuild processing and on the mean disk response time is also explored.

Authors supported by NSF through a Grant in Computer Systems Architecture.
Hitachi Global Storage Technologies, San Jose Research Center, San Jose, CA.

1 Introduction

RAID5 with rotated parity is a popular design, which tolerates single disk failures and balances disk loads via striping. Striping partitions a file into stripe units (SUs) and allocates them in a round-robin manner on N disks, inserting a parity SU, which is the exclusive-or (XOR) of the contents of the N-1 SUs in the same row or stripe as the parity. A capacity equal to the capacity of one disk is dedicated to parity. The parity blocks are kept up to date as data is updated, which is especially costly when small, randomly placed data blocks are updated, hence the small write penalty. Caching of modified blocks in a non-volatile RAM allows the destaging of dirty blocks to be deferred, so that both the data and the associated parity block can be written to disk at a lower priority than read requests.

When a single disk fails, a requested data block on the failed disk can be reconstructed by XORing the contents of the surviving N-1 disks, which are accessed by an (N-1)-way fork-join request. There is an increase in load, since each surviving disk needs to process the fork-join requests in addition to its own load. At worst, the load is doubled when all requests are reads. The increase in load may be unacceptable when the disk utilization was already high in normal mode, so that the increase results in disk saturation in degraded mode. This is more likely to occur with an infinite source model, where the arrival rate of requests is not reduced by a slowdown of the disk request rate. Given that disk requests belong to various categories, some with lower priorities than others, one solution to deal with overload is to shed low-priority loads up to the point that the overload is alleviated. This can be accomplished by a front-end such as Facade [9].
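
The following minimal Python sketch (our own illustration, not taken from the paper) shows the reconstruction just described: the lost stripe unit is recreated by XORing the corresponding stripe units of the surviving N-1 disks. The stripe layout and the SU size are assumptions made for the example.

    from functools import reduce

    SU_SIZE = 4096  # assumed stripe unit size in bytes (illustrative only)

    def xor_blocks(blocks):
        """XOR a list of equal-sized byte blocks."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def reconstruct_lost_su(stripe, failed_disk):
        """Recreate the SU of the failed disk from the N-1 surviving SUs of the stripe.

        stripe: list of N byte blocks (data SUs plus the parity SU), indexed by disk.
        """
        survivors = [su for d, su in enumerate(stripe) if d != failed_disk]
        return xor_blocks(survivors)

    # Example: 4 data SUs plus 1 parity SU; disk 2 fails.
    data = [bytes([i]) * SU_SIZE for i in range(4)]
    parity = xor_blocks(data)
    stripe = data + [parity]
    assert reconstruct_lost_su(stripe, failed_disk=2) == data[2]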

Clustered RAID, proposed in [12], trades disk capacity against the load increase. While the number of SUs over which the parity is computed, referred to as the parity group size, is G = N in RAID5, clustered RAID allows G < N. To reconstruct a single block, only a fraction α = (G-1)/(N-1) of the surviving disks needs to be accessed, whereas previously the declustering ratio was α = 1. Balanced incomplete block designs (BIBD) [8, 13] and nearly random permutation (NRP) data layouts [11] are two approaches to balance disk loads from the viewpoint of parity updates, while reducing the increase in disk load in degraded mode. The NRP data layout provides higher flexibility than the BIBD layout. We briefly describe the two methods in the next section and point out their similarity.

The paper is organized as follows. In Sections 2.1 and 2.2 we briefly describe the BIBD and NRP layouts. Rebuild processing is discussed in Section 3. The simulation model is described in Section 4. Simulation results are given in Section 5, which is followed by conclusions in Section 6.

2 Clustered RAID

Complete block designs are unacceptable, since given N disks and parity groups of size G, the number of combinations C(N, G), e.g., with N = 20 and G = 10, is too large to yield a balanced load with finite capacity disks. In what follows we describe two data layouts that work.
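
As a quick illustration of the capacity-versus-load tradeoff (a sketch, not taken from the paper), the lines below tabulate the declustering ratio α = (G-1)/(N-1) and the parity overhead 1/G for a few group sizes; N = 20 and the G values are arbitrary choices for the example.

    N = 20  # assumed number of disks

    for G in (5, 10, 15, 20):
        alpha = (G - 1) / (N - 1)   # fraction of surviving disks hit per reconstruction
        overhead = 1 / G            # fraction of capacity devoted to parity
        print(f"G={G:2d}  alpha={alpha:.2f}  parity overhead={overhead:.2f}")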

2.1 BIBD Data Layout

A BIBD data layout is a grouping of N distinct objects into b blocks, such that each block contains G objects, each object occurs in exactly r blocks, and each pair of objects appears in exactly L blocks [6]. Only three of the five variables are free, since bG = Nr and r(G-1) = L(N-1). Consider the BIBD data layout in [13]: N = 10, G = 4, the number of parity groups is b = 15, the number of domains (different parity groups) per disk is r = 6, and the number of parity groups common to any pair of disks is L = 2. (The table listing the parity group numbers assigned to each disk number is not reproduced here.) Block designs exist only for certain values of N and G, e.g., a layout for N = 33 with G = 12 does not exist, but G = 11 or G = 13 can be used instead [8].
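
A small sanity check (ours, not part of the paper) of the two counting identities bG = Nr and r(G-1) = L(N-1) for the layout cited above; the helper simply verifies that a proposed parameter set is consistent.

    def bibd_consistent(N, G, b, r, L):
        """Check the two counting identities every BIBD must satisfy."""
        return b * G == N * r and r * (G - 1) == L * (N - 1)

    # Layout from [13]: N=10 disks, groups of G=4, b=15 parity groups,
    # r=6 groups per disk, any pair of disks shares L=2 groups.
    assert bibd_consistent(N=10, G=4, b=15, r=6, L=2)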

2.2 Nearly Random Permutation Data Layout

This method, proposed in [11], can be summarized as follows:

1. The whole disk array space is organized as a matrix, where the N columns correspond to disks and there are M rows or stripes consisting of N SUs (stripe units). The SUs in the N columns are numbered 0 : N-1, and the NM SUs in the array are numbered 0 : NM-1.

2. The parity group size is G < N. Parity groups are placed sequentially on the NM SUs in the array, so that parity group i occupies SUs iG through iG + G - 1. We refer to this as the initial logical allocation. With N = 10 and G = 4 the initial data layout is shown below (Pi-j stands for the parity of SUs Di through Dj). It can be seen that parity SUs appear on only half of the disks, so that the write load would not be balanced.

   Initial allocation (disks 0-9):
   D0   D1     D2   P0-2    D3   D4     D5   P3-5    D6   D7
   D8   P6-8   D9   D10     D11  P9-11  D12  D13     D14  P12-14
   D15  D16    D17  P15-17  D18  D19    D20  P18-20  D21  D22
   D23  P21-23 D24  D25     D26  P24-26 D27  D28     D29  P27-29

3. For row I, generate a random permutation of {0, 1, ..., N-1}, given as P_I = {P_0, P_1, ..., P_(N-1)}. I is used as a seed to the pseudo-random number generator that produces the permutation; Algorithm 235 [4] is used in our study. (Algorithm 235 shuffles the array a[i], i = 0, ..., n-1: for i := n step -1 until 2 do { j := entier(i * random + 1); b := a[i]; a[i] := a[j]; a[j] := b }.) Thus, given a block number, N, and the stripe unit size, we can compute the row of the block and its disk number. If mod(N, G) = 0 then step 3 is repeated M times, i.e., 1 ≤ I ≤ M. If mod(N, G) ≠ 0 then the random permutation is reused K = LCM(N, G)/N times, where LCM(N, G) is the least common multiple of N and G.

For example, assume the random permutation P_1 = {0, 9, 7, 6, 2, 1, 5, 3, 4, 8} for row one. Then we have the following data allocation for the first two rows, since K = LCM(10, 4)/10 = 2. Note that the permutations generated on successive rows ensure an (approximately) equal number of parity blocks per disk.

   Final allocation (disks 0-9):
   D0   D4     D3   P3-5    D6   D5     P0-2  D2   D7      D1
   D8   P9-11  D11  D13     D14  D12    D10   D9   P12-14  P6-8

Note that the SUs of a parity group straddling two rows are in this case mapped onto different columns, so that (i) the parity SU resides on a different disk than its data SUs, and (ii) all SUs can be accessed in parallel. These requirements are discussed more formally in the next section.
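
The sketch below (our own illustration, with the seeding and the permutation generator as assumptions) follows the NRP recipe: a row index seeds a pseudo-random generator, a Fisher-Yates shuffle stands in for Algorithm 235 to produce that row's permutation, and a logical SU number is then mapped to its physical (row, disk) location. The permutation reuse required when mod(N, G) ≠ 0 is omitted for brevity.

    import random

    def row_permutation(row, N):
        """Permutation of {0,...,N-1} for a given row, seeded by the row index."""
        rng = random.Random(row)
        perm = list(range(N))
        rng.shuffle(perm)   # Fisher-Yates, in the spirit of Algorithm 235 [4]
        return perm

    def su_location(logical_su, N):
        """Map a logical SU number (parity groups laid out sequentially) to (row, disk)."""
        row = logical_su // N      # stripe/row in the initial logical allocation
        col = logical_su % N       # column before the permutation is applied
        return row, row_permutation(row, N)[col]

    # Example: where does logical SU 23 land on a 10-disk array?
    print(su_location(23, N=10))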

2.3 Discussion and Other Designs

Six properties of ideal layouts are given in [8]: (i) Single failure correcting: the SUs of the same stripe are mapped to different disks. (ii) Balanced load due to parity: all disks have the same number of parity SUs mapped onto them. (iii) Balanced load in failed mode: the reconstruction workload should be balanced across all disks. (iv) Large write optimization: each stripe should contain G-1 contiguous data SUs, where G is the parity group size. (v) Maximal read parallelism: reading n ≤ N disk blocks entails accessing n disks. (vi) Efficient mapping: the function that maps physical to logical addresses is easily computable.

The Permutation Development Data Layout (PDDL) is a mapping function described in [16], which has excellent properties and good performance at light loads (like the PRIME data layout [2]) and at heavy loads (like the DATUM data layout [1]).

3 Rebuild Processing

The rebuild process is a systematic reconstruction of the contents of the failed disk, which is started immediately after a disk fails, provided a hot spare is available. The smallest unit of reconstruction is called the rebuild unit (RU), which is usually a stripe unit or a fraction of it. An RU size equal to one or multiple disk tracks was considered in earlier studies, but this is not appropriate for zoned disks, since the number of (512 byte) sectors per track varies from track to track, which would complicate buffer allocation and the control logic for rebuild.

Of interest are the time to complete the rebuild, T_rebuild(ρ), and the response time of user requests versus time, R(t), 0 < t < T_rebuild(ρ), where ρ is the disk utilization before the disk failure occurred (a.k.a. normal mode). When the disk is otherwise idle, the rebuild time equals the product of the number of tracks and the disk rotation time (plus delays due to track and cylinder skews). After a disk failure a RAID5 disk array operates at a higher utilization ρ' = βρ, where β > 1; in the worst case, when all requests are reads, β = 2. ρ is specified explicitly, since it has a first-order effect on rebuild time. Other factors affecting rebuild time are discussed below.

A distinction is made in [8] between stripe-oriented, or more appropriately rebuild-unit (RU) oriented, and disk-oriented rebuild. In stripe-oriented rebuild each RU is rebuilt by a dedicated process, which accesses RUs from all surviving disks, XORs them, and writes the resulting value to the spare disk. Denoting the maximum number of concurrent rebuild processes by P, when P = 1 there is a synchronization after reconstructing each rebuild unit, but as P is increased the effect of this inefficiency is diminished. Disk-oriented rebuild dedicates a process to each disk, which reads RUs from its surviving disk asynchronously. The number of RUs read from surviving disks that have contributed to an RU yet to be written to the spare disk varies, so this policy has a higher buffer requirement than stripe-oriented rebuild with a small number of processes. It is shown in [8] that disk-oriented rebuild outperforms stripe-oriented rebuild, so the stripe-oriented rebuild policy is not considered further in this study.

Regarding buffer requirements, we consider two types of buffers: (i) B_disk, i.e., temporary buffers dedicated to each disk, and (ii) B_spare, i.e., buffers dedicated to writing to the spare disk. Since the unit of buffering is an RU (rebuild unit), the two types of buffers are interchangeable and there is no reason to unnecessarily move data from one buffer to the other. Once an RU is read into one of the B_disk buffers, it is XORed as soon as possible with the appropriate RU in B_spare, so that B_spare eventually holds the XOR of the RUs from all appropriate disks, at which point it can be written to the spare disk. The time to XOR an RU and free its temporary buffer is a function of the memory bandwidth and the parallelism in the XOR unit. We assume that the XOR operation is fast enough that the B_disk buffers are never exhausted, so that the reading of RUs in disk-oriented rebuild is never stalled for this reason. We will, however, consider the effect of B_spare on rebuild performance.

Rebuild requests can be processed at the same priority as user requests, which is the case with the permanent customer model (PCM) [11]. In PCM one rebuild request is processed at a time, i.e., a new rebuild request is inserted at the end of the disk queue as soon as the previous one is completed. Alternatively, rebuild processing can be started as soon as a disk becomes idle, i.e., completes the processing of pending user requests, and is stopped after a user request arrives. This rebuild policy corresponds to the vacationing server model (VSM) in queueing theory and has been investigated in [8, 18, 20]. In effect, an idle server (resp. disk) takes successive vacations (resp. reads successive RUs), but returns from vacation (resp. stops reading RUs) when a user request arrives at the disk, so that rebuild requests are processed at a lower (nonpreemptive) priority than user requests.
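
To make the buffer handling described above concrete, here is a minimal sketch (our illustration; the data structure and the in-memory representation are assumptions): each RU read by a per-disk rebuild process is folded into the corresponding B_spare slot, and the slot is written to the spare disk once all contributing surviving disks have been folded in.

    class SpareSlot:
        """Accumulates the XOR of the surviving RUs of one rebuild unit."""
        def __init__(self, ru_size, contributors):
            self.value = bytearray(ru_size)
            self.remaining = contributors      # surviving disks still to be XORed in

        def fold(self, ru_bytes):
            for i, b in enumerate(ru_bytes):
                self.value[i] ^= b
            self.remaining -= 1
            return self.remaining == 0         # True -> ready to write to the spare disk

    # Usage: with G = 4, each lost RU needs G-1 = 3 surviving contributions.
    slot = SpareSlot(ru_size=512, contributors=3)
    for disk in range(3):
        ready = slot.fold(bytes([disk + 1]) * 512)
    print("write to spare disk:", ready)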

Rebuild options in RAID5 are classified in [12] as follows:

Baseline rebuild: Materialized blocks on the spare disk are updated, but not used to satisfy read requests.

Read redirection: Materialized blocks on the spare disk are updated and also used to satisfy read requests. This provides a faster response for read requests intended for the failed disk, and it reduces the utilization of the surviving disks, resulting in improved response times at these disks and faster rebuild.

Piggy-backing at the block level: Reconstructed data are materialized onto the spare disk as a side effect of a read targeted to the failed disk. Rebuild processing cost is not reduced unless all of the blocks on a track have been rebuilt; in fact it results in degraded performance, as shown by the simulation results in [8]. Piggy-backing at the track level is proposed and evaluated in [5]. It is shown to reduce rebuild time when the initial disk load is low enough to tolerate the increased load incurred when the reading of a block, requiring half a disk rotation on average, is replaced by a full rotation to read the whole track with a zero-latency read capability.

Several policies that improve the response time of user requests by allowing them to preempt rebuild requests are proposed in [19]. (i) Split-seek option: after the seek to read a track is completed, interrupt the rebuild process if user disk requests are pending; otherwise keep reading consecutive tracks until a user request arrives. (ii) Split-latency/transfer option: allows preemption even after the search and transfer have started. (iii) Preemptable seeks: not considered, because this requires intimate knowledge of disk characteristics. While there is some improvement in response time, it comes at the cost of an increased rebuild time; in effect a larger number of user requests are affected because the rebuild time is elongated.

The few studies dealing with rebuild processing [8, 11, 20] leave several questions unanswered.

A recent study evaluates the relative performance of VSM and PCM in (unclustered) RAID5 disk arrays and investigates the effect of buffer size, piggybacking, read redirection, etc. [5].

4 Performance Evaluation of Clustered RAID

Analytic modeling has been used in the past to evaluate RAID5 performance in normal, degraded, and rebuild modes, e.g., [3, 10, 18, 11, 20]. In normal mode there are no disk failures; in degraded mode the system operates with a single failed disk. All of the above studies with one exception ([10]) utilize the M/G/1 queueing model, which allows general rather than exponential service times. It has been shown by validation against detailed simulation results that an M/G/1 model provides very accurate estimates of the mean response time, even for disks with zoning, when rebuilding a RAID5 disk array according to the VSM policy [20]. More recently, two-disk-failure-tolerant arrays such as RAID6, EVENODD, and RM2 have been analyzed in normal and degraded modes using the vacationing server model [7].

Analytical modeling has several shortcomings, e.g., there are no analytical models for the SATF disk scheduling policy, and it is difficult to incorporate the effect of passive resources, such as buffers. Another reason for adopting simulation rather than an analytic solution method is the set of approximations required for analysis, which would have required validation by simulation anyway.
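
For reference, the M/G/1 mean response time that the analytic studies cited above rely on follows the Pollaczek-Khinchine formula; the helper below is a generic sketch (not the models in [3, 10, 18, 11, 20]) computed from the arrival rate and the first two moments of the disk service time, with the example numbers as assumptions.

    def mg1_mean_response_time(lam, x_mean, x_second_moment):
        """Pollaczek-Khinchine: R = E[X] + lam * E[X^2] / (2 * (1 - rho))."""
        rho = lam * x_mean
        assert rho < 1, "queue is unstable"
        return x_mean + lam * x_second_moment / (2.0 * (1.0 - rho))

    # Example with assumed values: 11 ms mean access time, E[X^2] = 150 ms^2,
    # 60 requests per second (times in seconds).
    print(mg1_mean_response_time(lam=60.0, x_mean=0.011, x_second_moment=0.000150))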

In this study we use simulation to study CRAID5 performance with the NRP data layout at varying declustering ratios. Our RAID5 simulator utilizes a detailed simulator of single disks, which can handle different disk drives whose characteristics are given at [14]. Here we present results for the IBM 18ES disk drive, which has a 9 GB capacity and rotates at 7200 RPM, so that the rotation time is 8.33 ms. It has 11 zones, with the number of sectors per track varying by zone. The average seek time is 7.16 ms and the average access time is over 11 ms.

While our simulator can handle traces, such as those available at [17], for efficiency reasons we utilize a characterization of OLTP workloads given in [15], which shows that in an OLTP environment 96% of requests are to 4 KB blocks and 4% to 24 KB blocks, and that these blocks are randomly placed. We simplify this model and assume that all disk requests are to 4 KB blocks, because positioning time dominates the service time. We also assume that the arrival process is Poisson, since this allows us to vary the arrival rate of requests and, for example, obtain the mean response time characteristic of the system in normal and degraded modes. In degraded mode, as discussed in the previous section, we are interested in the rebuild time, assuming rebuild is started as soon as a disk fails. The user response time R(t) in rebuild mode is plotted as a function of time. The results provided are based on repeated simulation runs.
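
The synthetic workload just described can be generated along the following lines (a sketch under the stated simplifications: Poisson arrivals of 4 KB requests to uniformly random blocks; the request rate, duration and seed are assumptions).

    import random

    def generate_requests(rate, duration, num_blocks, seed=1):
        """Yield (arrival_time, block_number) pairs: Poisson arrivals of 4 KB
        requests to uniformly random block addresses."""
        rng = random.Random(seed)
        t = 0.0
        while True:
            t += rng.expovariate(rate)     # exponential inter-arrival times
            if t > duration:
                return
            yield t, rng.randrange(num_blocks)

    # Example: 100 requests/s for 10 simulated seconds over a 9 GB disk of 4 KB blocks.
    num_blocks = 9 * 10**9 // 4096
    workload = list(generate_requests(rate=100.0, duration=10.0, num_blocks=num_blocks))
    print(len(workload), workload[:2])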

5 Experimental Results

The parameter space to be investigated includes: (i) VSM versus PCM; (ii) the effect of read redirection and dynamic control; (iii) the impact of buffer size; (iv) the impact of array size; (v) the impact of parity de-clustering; (vi) the impact of rebuild unit size. In this paper we mainly focus on the rebuild time and on the response time of user requests, rather than of rebuild requests.

5.1 VSM versus PCM

Figure 1: Mean user response time and rebuild time with VSM and PCM. (a) Declustering ratio α = 0.75. (b) Declustering ratio α = 0.25.

In both VSM and PCM, preemption is not allowed, so that the disk will not process user requests until the current rebuild request is completed. Figures 1(a) and (b) show the mean user response time and the rebuild progress versus elapsed time in VSM and PCM at different declustering ratios. The following observations can be made: (i) the user response time in VSM is always lower than in PCM; (ii) the rebuild time in VSM is always shorter than in PCM.

It is natural that the user response time in VSM is lower than in PCM, because in VSM rebuild requests are processed at a lower priority and hence have much less impact on user requests. However, it is somewhat counter-intuitive that the rebuild time in VSM is shorter than in PCM. The reason is that during rebuild the utilization of the bottleneck disk(s), either the surviving disks or the spare disk, is approximately 100%, and the disk utilization due to user requests is approximately the same regardless of which rebuild policy is applied. Therefore, the utilization remaining for rebuild is a constant, and the key to improving rebuild efficiency is to reduce the mean rebuild service time per RU. In VSM, rebuild requests are processed only when the disk is idle, and it is more likely that the disk is still idle when the first rebuild request finishes, so that a second rebuild request can follow without incurring an additional seek. Therefore, the mean service time per RU in VSM is shorter than in PCM. In other words, the probability that consecutive rebuild reads are interrupted is P_interrupt(VSM) = 1 - exp(-λ X_RU) in VSM and P_interrupt(PCM) = 1 - exp(-λ (X_RU + W_RU)) in PCM, where X_RU is the mean service time for reading/writing an RU and W_RU is the mean waiting time of a rebuild request in PCM. Since W_RU > 0, P_interrupt(VSM) < P_interrupt(PCM).

Since VSM is superior to PCM in both rebuild time and user response time, we only consider VSM hereafter.
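
The comparison of interruption probabilities can be spelled out numerically; the helper below is illustrative, with the values of λ, X_RU and W_RU chosen as assumptions for the example.

    import math

    def p_interrupt(lam, x_ru, w_ru=0.0):
        """Probability that a user request arrives before the next rebuild read
        can start, i.e. that consecutive rebuild reads are interrupted."""
        return 1.0 - math.exp(-lam * (x_ru + w_ru))

    # Assumed values: 50 user requests/s, 10 ms to read an RU, 8 ms mean wait under PCM.
    lam, x_ru, w_ru = 50.0, 0.010, 0.008
    print("VSM:", p_interrupt(lam, x_ru))          # ~0.39
    print("PCM:", p_interrupt(lam, x_ru, w_ru))    # ~0.59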

Figure 2: The Effect of Read Redirection.

5.2 The effect of read redirection and dynamic control

Read redirection allows the spare disk to process user read requests that access the part of the spare disk that has already been rebuilt. Read redirection can lower the user response time and shorten the rebuild process significantly. Figure 2 shows the overall mean user response time and rebuild time with and without read redirection. However, read redirection may retard the rebuild progress when the spare disk becomes a bottleneck, since it increases the load on the spare disk. This typically happens when the array has a low declustering ratio (α), so that the surviving disks feed the spare disk with data faster than it can absorb it. When this happens, in order to shorten the rebuild time, we can control the fraction of read requests that are redirected to the spare disk so that the write rate on the spare disk matches the read rate on the surviving disks. Hence the idea of dynamic read redirection control, which was first presented in [12].
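
The paper does not spell out a control law, so the following is only one plausible sketch of dynamic read redirection control: the fraction of reads redirected to the spare disk is throttled whenever the spare disk's rebuild (write) backlog grows, so that its write rate keeps up with the read rate on the surviving disks. All thresholds and the adjustment step are assumptions.

    def adjust_redirect_fraction(fraction, spare_backlog, high=8, low=2, step=0.1):
        """Lower the redirected fraction when the spare disk's rebuild backlog
        (pending RU writes) is high; raise it again when the backlog drains."""
        if spare_backlog > high:
            fraction = max(0.0, fraction - step)
        elif spare_backlog < low:
            fraction = min(1.0, fraction + step)
        return fraction

    # Example: start fully redirected, then the spare disk falls behind.
    f = 1.0
    for backlog in (1, 3, 9, 12, 10, 4, 1):
        f = adjust_redirect_fraction(f, backlog)
        print(backlog, round(f, 1))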

Figure 3: The Effect of Dynamic Control on Read Redirection at various declustering ratios: (a) α = 1, (b) α = 0.75, (c) α = 0.5, (d) α = 0.25.

The effect of dynamic control of read redirection at various declustering ratios is shown in Figures 3(a)-(d). It can be observed that the improvement due to dynamic control is more noticeable when α is small and disk utilizations are high.

5.3 The impact of buffer size

Figure 4 shows the user response time and rebuild time versus buffer size in VSM. In the simulation we use a shared buffer for all disks. After a rebuild unit is read from a surviving disk, it is immediately XORed onto the rebuild working buffer, which will be written to the spare disk. When all surviving RUs of a parity group have been read, the working buffer contains the reconstructed data and can be materialized to the spare disk. Dedicated buffers for each disk are required before an RU can be XORed onto the working buffer, but their sizes are negligibly small and are therefore ignored.

Figure 4: The impact of buffer size.

It can be observed that: (i) A larger buffer size leads to a shorter rebuild time. Due to temporary load imbalance, disks may go out of sync: several disks are too busy and the rebuild requests on those disks lag far behind the rebuild requests on the others. As a result, the rebuild buffer fills up and even the idle disks cannot read RUs anymore; the rebuild process is then suspended until the bottleneck disks get time to process rebuild requests. Obviously, the probability of such a suspension is smaller with larger buffer sizes. As with caching, the improvement from increasing the buffer size is significant at small sizes, but drops off quickly once the buffer is large enough. A proper rebuild buffer size is related to the RU size and the workload variance across disks. (ii) The user response times are not sensitive to the buffer size. In VSM, rebuild requests are processed at a lower priority, so the processing rate of rebuild requests does not affect user requests significantly.

5.4 The impact of rebuild unit size

Figure 5: The impact of rebuild unit size.

Figure 5 plots the user response time and rebuild time versus the rebuild unit (RU) size. It shows that a larger RU size leads to a higher user response time but a shorter rebuild time. The reason is that a larger RU takes a longer service time per rebuild request, and a user request is typically delayed by a rebuild request by half of its service time on average. On the other hand, efficiency is gained since each seek reads more blocks; consequently, the rebuild time is shortened.
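
The tradeoff can be made explicit with a back-of-the-envelope sketch (ours, with assumed positioning-time and transfer-rate parameters): the extra delay seen by a user request is roughly half the RU service time, while the rebuild data moved per seek grows with the RU size.

    SEEK_PLUS_LATENCY = 0.011   # assumed mean positioning time, seconds
    TRANSFER_RATE = 20e6        # assumed transfer rate, bytes/second

    def ru_service_time(ru_bytes):
        return SEEK_PLUS_LATENCY + ru_bytes / TRANSFER_RATE

    for ru_kb in (16, 64, 256, 1024):
        x = ru_service_time(ru_kb * 1024)
        user_penalty = x / 2.0                    # mean residual delay seen by a user request
        rebuild_rate = ru_kb * 1024 / x           # rebuild data moved per unit of busy time
        print(f"RU={ru_kb:5d} KB  penalty={user_penalty*1000:5.1f} ms  "
              f"rebuild={rebuild_rate/1e6:5.1f} MB/s")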

5.5 The impact of array size

Figure 6: The impact of array size.

Figure 6 shows the user response time and rebuild time versus the array size, with the declustering ratio α held fixed. We can see that the array size does not affect the user response time and rebuild time significantly. The array size may affect the rebuild speed through the fork-join effect: a larger array size results in a higher variance for fork-join requests. However, this variance is absorbed by the buffer as long as the buffer size is large enough.

5.6 The impact of parity de-clustering

Figure 7: The impact of parity de-clustering on user response time.

Figure 7 shows disk utilization versus the user response time just after the rebuild process starts (i.e., the worst-case user response time). It is clear that the degraded-mode user response time suffers a rise compared to normal mode, and that this rise is smaller for smaller α.

Figure 8: The impact of parity de-clustering on rebuild time.

The declustering ratio (α) can substantially affect the rebuild process in two ways. Firstly, the overhead brought to each surviving disk by the rebuild process depends on the declustering ratio: the smaller the α, the lower the rebuild overhead, since only a small fraction of the disks needs to be accessed to rebuild a block. This overhead in turn determines the bandwidth left for rebuild requests. Secondly, for each rebuild request, a smaller α implies a weaker fork-join effect and therefore makes rebuilding easier. As shown in Figure 8, a smaller α results in a lower user response time and a shorter rebuild time.

6 Conclusions and Future Work

We first study the effect of the declustering ratio (α) on the performance of a clustered RAID array. We vary the arrival rate of disk requests, which are all reads, until we get close to a disk utilization of one in normal mode. When a single disk fails, the maximum throughput drops to one half of the throughput in normal mode, while with α = 0.5 it drops by a factor of 1.5.

Also shown in the figure is the response time at the beginning of the rebuild process, before read redirection takes effect. The increase in response time with respect to degraded mode is equal to the mean residual reading time of rebuild units, i.e., this difference increases with larger rebuild units.

Disk-oriented rebuild using the vacationing server model (VSM) outperforms rebuild using the permanent customer model (PCM) both in terms of response time and rebuild time. The former is attributable to the fact that user requests are processed at a higher priority than rebuild requests, while the latter is due to the fact that with VSM the chances are higher that multiple rebuild accesses can be processed uninterrupted, thus minimizing the number of seeks incurred for this purpose.

Read redirection has a positive effect: with read redirection the response time of read requests improves gradually until it reaches the normal mode response time, while there is no improvement without read redirection. The effect of dynamic control of read redirection is that the rebuild time decreases while the response time increases. This is not a fair comparison, since we need to take into account the fact that fewer requests will be processed in degraded mode when the rebuild is completed earlier.

A buffer size that is selected too small results in a significant degradation in rebuild time, since the reading of further rebuild units is blocked when the buffer fills. Increasing the rebuild unit size results in a lower rebuild time, accompanied by a small increase in response time. Array size has no effect on rebuild time and response time as long as the other parameters remain fixed.

References

[1] G. A. Alvarez, W. A. Burkhard, and F. Cristian. Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering, Proc. 24th ISCA, 1997.

[2] G. A. Alvarez, W. A. Burkhard, L. J. Stockmeyer, and F. Cristian. Declustered disk array architectures with optimal and near optimal parallelism, Proc. 25th ISCA, 1998.

[3] S.-Z. Chen and D. F. Towsley. The design and evaluation of RAID5 and parity striping disk array architectures, J. Parallel and Distributed Computing 10(1/2), 1993.

[4] R. Durstenfeld. Algorithm 235: Random Permutation, Comm. ACM 7(7): 420, 1964.

[5] G. Fu, A. Thomasian, C. Han, and S. Ng. Rebuild strategies for redundant disk arrays, Proc. IEEE/NASA Conf. on Mass Storage Systems and Technologies.

[6] M. Hall. Combinatorial Theory, Wiley.

[7] C. Han and A. Thomasian. Performance of two-disk failure tolerant disk arrays, Proc. Symp. on Performance Evaluation of Computer and Telecomm. Systems - SPECTS 03.

[8] M. C. Holland, G. A. Gibson, and D. P. Siewiorek. Architectures and algorithms for on-line failure recovery in redundant disk arrays, Distributed and Parallel Databases 11(3), 1994.

[9] C. R. Lumb, A. Merchant, and G. Alvarez. Facade: Virtual storage devices with performance guarantees, Proc. File and Storage Technologies - FAST Conf.

[10] J. Menon. Performance of RAID5 disk arrays with read and write caching, Distributed and Parallel Databases 11(3), 1994.

[11] A. Merchant and P. S. Yu. Analytic modeling of clustered RAID with mapping based on nearly random permutation, IEEE Trans. Computers 45(3), 1996.

[12] R. Muntz and J. C. S. Lui. Performance analysis of disk arrays under failure, Proc. 16th Int'l VLDB Conf., 1990.

[13] S. W. Ng and R. L. Mattson. Uniform parity distribution in disk arrays with multiple failures, IEEE Trans. Computers 43(4), 1994.

[14]

[15] K. K. Ramakrishnan, P. Biswas, and R. Karedla. Analysis of file I/O traces in commercial computing environments, Proc. Joint ACM SIGMETRICS/Performance 92 Conf., 1992.

[16] T. J. E. Schwarz, J. Steinberg, and W. A. Burkhard. Permutation development data layout (PDDL) disk array declustering, Proc. 5th IEEE Symp. on High Performance Computer Architecture - HPCA, 1999.

[17]

[18] A. Thomasian and J. Menon. Performance analysis of RAID5 disk arrays with a vacationing server model, Proc. 10th ICDE Conf., 1994.

[19] A. Thomasian. Rebuild options in RAID5 disk arrays, Proc. 7th IEEE Symp. on Parallel and Distributed Systems, San Antonio, TX, Oct. 1995.

[20] A. Thomasian and J. Menon. RAID5 performance with distributed sparing, IEEE Trans. Parallel and Distributed Systems 8(6), June 1997.


RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1 RAID HARDWARE On board SATA RAID controller SATA RAID controller card RAID drive caddy (hot swappable) Anne Watson 1 RAID The word redundant means an unnecessary repetition. The word array means a lineup.

More information

How To Virtualize A Storage Area Network (San) With Virtualization

How To Virtualize A Storage Area Network (San) With Virtualization A New Method of SAN Storage Virtualization Table of Contents 1 - ABSTRACT 2 - THE NEED FOR STORAGE VIRTUALIZATION 3 - EXISTING STORAGE VIRTUALIZATION METHODS 4 - A NEW METHOD OF VIRTUALIZATION: Storage

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Fall 2004 Lecture 13: FFS, LFS, RAID Geoffrey M. Voelker Overview We ve looked at disks and file systems generically Now we re going to look at some example file

More information

The idea behind RAID is to have a number of disks co-operate in such a way that it looks like one big disk.

The idea behind RAID is to have a number of disks co-operate in such a way that it looks like one big disk. People often ask: Should I RAID my disks? The question is simple, unfortunately the answer is not. So here is a guide to help you decide when a RAID array is advantageous and how to go about it. This guide

More information

Performance Comparison of Assignment Policies on Cluster-based E-Commerce Servers

Performance Comparison of Assignment Policies on Cluster-based E-Commerce Servers Performance Comparison of Assignment Policies on Cluster-based E-Commerce Servers Victoria Ungureanu Department of MSIS Rutgers University, 180 University Ave. Newark, NJ 07102 USA Benjamin Melamed Department

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Comparing Dynamic Disk Pools (DDP) with RAID-6 using IOR

Comparing Dynamic Disk Pools (DDP) with RAID-6 using IOR Comparing Dynamic Disk Pools (DDP) with RAID-6 using IOR December, 2012 Peter McGonigal petermc@sgi.com Abstract Dynamic Disk Pools (DDP) offer an exciting new approach to traditional RAID sets by substantially

More information

Capacity Planning Process Estimating the load Initial configuration

Capacity Planning Process Estimating the load Initial configuration Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting

More information

Dynamic Disk Pools Technical Report

Dynamic Disk Pools Technical Report Dynamic Disk Pools Technical Report A Dell Technical White Paper Dell PowerVault MD3 Dense Series of Storage Arrays 9/5/2012 THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

More information

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi

More information

Hitachi Path Management & Load Balancing with Hitachi Dynamic Link Manager and Global Link Availability Manager

Hitachi Path Management & Load Balancing with Hitachi Dynamic Link Manager and Global Link Availability Manager Hitachi Data System s WebTech Series Hitachi Path Management & Load Balancing with Hitachi Dynamic Link Manager and Global Link Availability Manager The HDS WebTech Series Dynamic Load Balancing Who should

More information

Moving Beyond RAID DXi and Dynamic Disk Pools

Moving Beyond RAID DXi and Dynamic Disk Pools TECHNOLOGY BRIEF Moving Beyond RAID DXi and Dynamic Disk Pools NOTICE This Technology Brief contains information protected by copyright. Information in this Technology Brief is subject to change without

More information

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion

More information

Supplemental File of S 2 -RAID: Parallel RAID Architecture for Fast Data Recovery

Supplemental File of S 2 -RAID: Parallel RAID Architecture for Fast Data Recovery JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO. 1, JANUARY 27 1 Supplemental File of S 2 -RAID: Parallel RAID Architecture for Fast Data Recovery Jiguang Wan, Jibin Wang, Changsheng Xie, and Qing Yang, Fellow,

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information