Exploring RAID Configurations

J. Ryan Fishel
Florida State University
August 6, 2008

Abstract

To address the limits of today's slow mechanical disks, we explored a number of data layouts to improve RAID performance. We found that by dedicating more disks to hosting popular data and fewer disks to hosting less popular data, RAID was able to handle 20% more concurrent streams. However, we have not found a conclusive explanation for this phenomenon.

1. Introduction

Improvements in modern computers have enabled CPU, memory, and network speeds to increase exponentially. However, the speed of mechanical disks has not improved nearly as much, resulting in a system-wide bottleneck. Although memory-based storage is becoming a strong contender for disks, disk-based storage will remain important for high-capacity mass storage in the foreseeable future because it is 100 times cheaper per byte than memory-based storage. To overcome the performance problem of disks, researchers have invented ways to spread I/O loads across multi-disk RAID devices to exploit hardware parallelism, and conventional wisdom suggests that balancing the load across disks in a RAID is essential for good performance. However, we observe that perfect load balancing does not equate to maximum performance. The intuition is that if each request is served by all disks whenever possible, a disk needs to multiplex often when serving many simultaneous request streams. Alternatively, each disk can focus on serving a subset of request streams to reduce the per-disk multiplexing overhead. To investigate this hypothesis, we explored techniques to reduce the level of multiplexing for each disk in RAIDs. Our results show that for a few concurrent request streams, performance can be improved. For a higher number of concurrent streams, reducing the multiplexing within each disk yields little difference in performance.
However, we found that by dedicating more disks to hosting frequently accessed data and fewer disks to hosting less frequently accessed data, the system can handle 20% more concurrent request streams. Our results suggest several plausible explanations, but the exact cause remains inconclusive.

2. Background

Disks: Disks are mechanical storage devices consisting of a comb of disk heads for reads and writes and rotating platters coated with magnetic recording material. Each platter surface consists of concentric tracks, which are divided into the minimum
storage unit of sectors. Therefore, the disk access time consists of seek time (to move the head over a track), rotational delay (to rotate the sector under a head), and transfer time (to access the requested data). According to Seagate, their top-of-the-line Savvio 15K RPM drive achieves average read/write seek times of 2.9 msec/3.3 msec, respectively, an average rotational delay of 2 msec, and a transfer rate of up to 112 MB/s [6].

RAIDs: RAID stands for redundant array of inexpensive disks [4]. The basic idea is to spread, or stripe, a piece of data uniformly across disks (typically in units called chunks), so that a large request can be served by multiple disks in parallel. RAIDs mainly reduce the time to transfer data, while the disk in a RAID that incurs the longest seek and rotational delays can slow down the completion of a striped request. Since the use of multiple disks reduces the reliability of a RAID, different RAID configurations, or levels, have been invented to overcome this constraint. Popular RAID levels are RAID 0, RAID 1, and RAID 5. RAID 0 involves simple striping of data across disks; therefore, a single disk failure can result in data loss. In RAID 1, the content of each disk is replicated to another disk, so data loss occurs only when two disks holding the same data fail simultaneously. RAID 5 reduces the overhead of storing redundant information to the equivalent of one disk, but it can survive only a single disk failure. Since this research focuses on the performance aspect, we explored only the RAID 0 configuration in our experiments.

RAID optimizations: Knowing that the slowest disk can slow down the entire RAID, load-balancing techniques [5] try to migrate popular items away from the slowest disk to improve the overall RAID performance. However, a balanced load does not automatically equate to maximum performance, since there are many ways to achieve balanced loads across disks.
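To put the Savvio figures above in perspective, seek and rotational delays dominate the service time of small requests, which is why per-disk positioning overhead matters so much. A back-of-envelope sketch (the 64 KB request size is our assumption, matching the chunk size used in our experiments):

```python
# Back-of-envelope service time for a single 64 KB request, using the
# Seagate Savvio 15K figures quoted above. The 64 KB size is assumed.
avg_seek_ms = 2.9              # average read seek time
avg_rotation_ms = 2.0          # average rotational delay
transfer_rate_mb_s = 112.0     # maximum sustained transfer rate

request_kb = 64
transfer_ms = request_kb / 1024 / transfer_rate_mb_s * 1000  # ~0.56 ms

# Positioning (seek + rotation) is ~4.9 ms, roughly 9x the transfer time.
access_ms = avg_seek_ms + avg_rotation_ms + transfer_ms      # ~5.46 ms
print(f"transfer {transfer_ms:.2f} ms, total {access_ms:.2f} ms")
```

For requests this small, nearly 90% of the service time is spent positioning the head rather than transferring data.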
For example, each disk in an N-disk RAID can focus on serving one request stream. Alternatively, each disk can multiplex among N request streams by serving only 1/N-th of each stream. Both schemes can plausibly achieve load balancing, but the one without multiplexing can save more time on seeks. Clustering and declustering are techniques in which data is combined or separated in ways that aim to maximize parallel transfers and minimize seek and rotational delays. In the Shekhar et al. paper [7], declustering was done through the analysis of spatial data, meaning that they required an on-disk distance metric to analyze the data and separate it in a way that would minimize concurrent requests to the same disk. One drawback of this approach is developing an accurate on-disk distance metric, which is very difficult considering the complexity of modern disks [1]. Studies on RAID configurations explore the relationships among stripe granularities, the number of disks in a stripe, and the number of concurrent request streams. Under artificial workloads, Chen et al. [2] recommend a striping chunk size of ½ × the average access time × the transfer rate of a disk. Although the one-size-fits-all
recommendation eases system administration, the question remains whether approaches based on nonuniform treatment of files will lead to better performance. The limits of various approaches prompted us to investigate the possibility of reducing the level of per-disk multiplexing to increase RAID performance.

3. Feasibility Studies

To quantify the potential benefits of reducing per-disk multiplexing, we conducted experiments with 10 concurrent request streams to five disks. In the first RAID 0 configuration, we created an ext2 file system on a RAID 0 device with a chunk size of 64 KB. We then populated the file system with 250 files of the same size. We also varied the file size from 10 KB to 100 MB to see the effect of file sizes. Therefore, each disk potentially needs to multiplex up to ten concurrent requests. In the second configuration, files are not striped. We created an ext2 file system on each drive and placed 50 files of the same size on each drive. Therefore, on average, each disk multiplexes among two concurrent request streams (Figure 3.1).

Figure 3.1: Feasibility test data layouts.

We ran each experiment 5 times, and the numbers were analyzed at the 90% confidence interval. For all experiments, the machine was rebooted between runs to clear the system and disk caches. A script would then read 100 files randomly, and the elapsed time of each read request was recorded.

Server:
  Processor: Intel Xeon 2.80 GHz, 16 KB L1 cache, 2 MB L2 cache
  Memory: 1 GB DDR2 400
  Disks: Fujitsu MAW3073NC, 73.5 GB, 10K RPM, SCSI Ultra 320 [3], 8 MB on-board cache; 1 disk for booting, 5 disks for experiments
  Operating System: Linux 2.6.20

Table 3.1: Experimental settings.

Figure 3.2 shows that at a concurrency level of 10, multiplexing reduction would yield a bandwidth improvement for file sizes > 100 KB. Therefore, for our subsequent data layout schemes, we explored the use of video-on-demand workloads.

Figure 3.2: Concurrency test results.
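The measurement procedure described above can be sketched as follows. This is our reconstruction of the script, not the original; the list of file paths is assumed to be supplied by the caller.

```python
import random
import time


def read_random_files(paths, n_reads=100, seed=None):
    """Read n_reads randomly chosen files in full, recording the
    elapsed wall-clock time of each whole-file read (the metric
    analyzed at the 90% confidence interval above)."""
    rng = random.Random(seed)
    timings = []
    for _ in range(n_reads):
        path = rng.choice(paths)
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1 << 20):   # drain the file in 1 MB slices
                pass
        timings.append((path, time.perf_counter() - start))
    return timings
```

In the actual experiments the machine was rebooted between runs, so these timings reflect cold-cache disk behavior rather than page-cache hits.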
4. Multimedia Tests

For the multimedia experiments, five disks were populated with 20 GB of 30 MB media files. A multimedia request stream is represented and emulated by a process, which reads 512 KB per second from a chosen file. Once the request reaches the end of the file, the process chooses another file. The elapsed time of each 512 KB read request is recorded. For each experiment, the system began with one request stream process. A new request stream process was spawned after each second, until one of the processes missed five one-second deadlines. A message would then be sent to all the processes signifying the end of the experiment. The experiments were run on the same system as used in the feasibility tests (Table 3.1). The data read from the five-disk device was written to an ext2 partition stored on a separate disk, and the system was then rebooted to clear the disk and system caches. Each experiment set was run five times; we then averaged and reported the results up to the lowest number of concurrent streams supported. The primary metrics were the number of concurrent streams a system can support and the aggregate bandwidth, computed by dividing the number of bytes accessed by the busy time of a disk or a RAID.

4.1. Test Set 1: Initial Layout Tests

RAID 0: The base-case configuration was the original RAID 0 disk layout, with 64 KB chunks (Figure 4.1.1). We emulated the common skewed access frequency distributions by having 80% of references uniformly distributed to 20% of files (also referred to as hot files), and 20% of references to 80% of files (also referred to as cold files).

Figure 4.1.1: RAID 0. Hot/cold blocks are randomly spread across the entire array.

1Hot4Cold: To explore the idea of reducing per-disk multiplexing, we used one disk to hold hot files (4 GB total), and a four-disk RAID 0 to hold cold files (4 GB per disk).
The idea is that the hot disk only needs to multiplex among the 20% of files that are hot; the cold disks only need to multiplex among the 80% of files that are cold.
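A single request stream and its skewed reference pattern, as described above, can be sketched as follows. This is our reconstruction under stated assumptions: the 0.8 hot-reference probability matches the 80/20 distribution, each call emulates one stream, and a stream stops after five missed deadlines (the real experiment ran one process per stream and signaled all processes to stop).

```python
import random
import time

CHUNK = 512 * 1024   # each stream reads 512 KB per one-second period
MAX_MISSES = 5       # the experiment ends after five missed deadlines


def pick_file(hot_files, cold_files, rng):
    """80% of references go to hot files, 20% to cold files."""
    pool = hot_files if rng.random() < 0.8 else cold_files
    return rng.choice(pool)


def run_stream(hot_files, cold_files, seed=None, max_reads=None, period=1.0):
    """Emulate one multimedia request stream: read 512 KB per period
    from the chosen file; at end-of-file, choose another file. Returns
    the elapsed time of every read, the metric recorded above."""
    rng = random.Random(seed)
    timings = []
    misses = 0
    f = open(pick_file(hot_files, cold_files, rng), "rb")
    while misses < MAX_MISSES and (max_reads is None or len(timings) < max_reads):
        start = time.perf_counter()
        if len(f.read(CHUNK)) < CHUNK:          # reached end of file
            f.close()
            f = open(pick_file(hot_files, cold_files, rng), "rb")
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        if elapsed > period:
            misses += 1                          # missed the deadline
        else:
            time.sleep(period - elapsed)
    f.close()
    return timings
```

The `max_reads` and `period` parameters are conveniences we added for testing; the experiments fixed the period at one second and ran until deadlines were missed.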
Figure 4.1.2: 1Hot4Cold. Cold files are stored on the RAID 0 device, while hot files are stored on a single drive.

2Partitions: The third configuration divides each of the five disks into two partitions (Figure 4.1.3). A RAID 0 is formed from the partitions nearest the outer edge of the platters to store hot files. Another RAID 0, formed from the adjacent partitions, stores cold files. This configuration aims to keep the disk head within the hot partition most of the time to reduce seek times.

Figure 4.1.3: 2Partitions. Data is separated into two partitions. The partition near the outer edge of the disk holds the hot files, while the adjacent partition holds the cold files.

Figure 4.1.4: Test results for RAID 0, 1Hot4Cold, and 2Partitions.
Figure 4.1.4 shows two surprising findings. (1) The 2Partitions case suggests that the seek time reduction contributed little to the bandwidth difference. One explanation is that we did not use a big enough working set. For 20 GB of files spread across five disks, each disk stores only 4 GB of data. Out of 73.5 GB of storage capacity, the average seek time over a 4 GB region is about 0.16 msec. With hot files stored in a separate partition, the average seek time is reduced to about 0.05 msec, which is rather insignificant compared to the average rotational delay of 2 msec. We will try a bigger working set as future work. (2) Reducing the level of multiplexing actually hurt performance. Based on the achieved aggregate bandwidth of 45 MB/sec vs. the 250 MB/sec peak bandwidth of RAID 0, the disk containing hot files clearly became a bottleneck.

1Cold4Hot: Out of curiosity, we tried 1Cold4Hot, where cold files reside on the independent drive while hot files are on the four-disk RAID 0 (Figure 4.1.5). This arrangement achieves a number of effects: (1) The per-disk load becomes balanced, as each disk receives 20% of references on average. (2) The level of multiplexing is reduced for hot files as well; each disk in the RAID 0 stored about 1 GB of files, while the cold disk stored 16 GB. (3) The average seek distance for the RAID 0 is reduced further. (4) Serving hot files is isolated from the interference of serving cold files.

Figure 4.1.5: 1Cold4Hot. Hot files reside on a RAID 0 device, while cold files populate a single drive.
Figure 4.1.6: Test results for RAID 0, 1Hot4Cold, 2Partitions, and 1Cold4Hot.

Figure 4.1.6 superimposes 1Cold4Hot over prior results and shows a surprising finding. Although 1Cold4Hot does not achieve the same peak bandwidth, the scheme can support 20% more concurrent streams compared to RAID 0. Therefore, it became clear that bandwidth was not the primary performance attribute for comparison, since a peak aggregate bandwidth of 250 MB/sec (or 8.3 MB/sec per stream) would help little in improving request streams that need only 512 KB/sec. This finding also led us to question why 1Cold4Hot obtained a 20% scaling improvement, which is not explainable with simple models of seek times, rotational delays, and transfer times. The plausible causes were changes in seek times, the level of multiplexing, load balancing, and the separation of hot and cold files. The following test sets were designed to test the sensitivity to each cause in turn. We used the 1Cold4Hot results as the baseline.

4.2. Test Set 2: Increased Seek

This test set was designed to observe the effect of an increased seek distance on a layout similar to the 1Cold4Hot experiment. We increased the size of hot files by 10% to 33 MB and left the number of hot files the same (Figure 4.2.1). By doing so, we forced the disk heads of the hot disks to seek further.
Figure 4.2.1: Increased Seek. This layout is similar to the 1Cold4Hot layout, but the hot file size on the RAID 0 device is increased by 10%.

As can be seen in Figure 4.2.2, increasing the seek distance decreased the scaling performance slightly and created little deviation in the bandwidth performance. From these results, we ruled out reduced seek distance as the main cause of 1Cold4Hot's performance gains. This finding is also consistent with the 2Partitions case.

Figure 4.2.2: Test results for increased seek distance.

4.3. Test Set 3: Reduced Multiplexing

Another possible reason for the scaling increase could be the reduction of multiplexing across the entire system. By separating the hot from the cold files, we were able to decrease the level of multiplexing by 80% for the four-disk RAID 0 device. This test was designed to reduce the level of multiplexing even further to see if the system could serve more concurrent requests. The disk layout is identical to the 1Cold4Hot experiment, except that only a fraction of hot files (10% and 50%) were
available to be accessed. Therefore, by limiting accesses to fewer hot files, we effectively reduced the level of multiplexing for the RAID 0.

Figure 4.3.1: Further reduced multiplexing for hot files by 10%.

Figure 4.3.2: Further reduced multiplexing for hot files by 50%.

Figure 4.3.3 shows that decreasing multiplexing does not increase scaling; it seemed to have little effect on either bandwidth or scaling performance.

Figure 4.3.3: Test results for further reduced multiplexing for hot files by 10%/50%.

4.4. Test Set 4: Hot/Cold File Separation

The next experiment set investigated various probabilities of accessing hot and cold files. For prior experiments, we used 80% of references to hot files and 20% of
references to cold files. By changing the probabilities (Table 4.4.1), we hoped to see a definite change in scaling performance. For all probability changes, 20% of files are hot, and 80% of files are cold.

Probabilities Tested
  Hot    Cold
  90%    10%
  70%    30%
  50%    50%

Table 4.4.1: Tested probabilities to reference hot and cold files.

1Mixed4Mixed: In addition, we altered the original 1Cold4Hot configuration so that hot and cold files were mixed (Figure 4.4.1). This test would show the effect of not separating hot and cold files.

Figure 4.4.1: 1Mixed4Mixed. Hot and cold files are spread between a 4-disk RAID 0 device and an independent device according to the 1Cold4Hot distribution.

Figure 4.4.2: Test results for hot/cold file separation.

Figure 4.4.2 illustrates that the skewness of file popularity has a significant effect on the performance of the system. By changing the probability
ratio of hot and cold file references from 80/20 to 70/30, we can see a considerable negative effect from redirecting 10% of traffic from hot to cold files. It would seem that this change is enough to overload the independent drive and degrade scaling. In addition, evening out the probability to 50/50 degraded scaling by nearly half in comparison to the 1Cold4Hot results. Alternatively, increasing the probability of referencing hot files from 80% to 90% did not increase scaling performance, but rather decreased it slightly. From these results, it would seem that the 80% hot file probability was a sweet spot for data separation. For the 1Mixed4Mixed case, the single disk would receive about 80% of references, which is proportional to the amount of data stored on the disk due to the lack of hot and cold separation. This layout yields the lowest performance in both bandwidth and scaling.

4.5. Test Set 5: Load Balancing

As for load balancing, the original RAID 0 was well balanced in the sense that each disk would receive 20% of references. Thus, load balance alone does not explain the scaling of the 1Cold4Hot case.

2Cold3Hot Imbalanced: As a sanity check, we also tried using a two-disk RAID 0 to store cold files and a three-disk RAID 0 to store hot files. We anticipated a performance decrease due to load imbalance. We maintained the 80/20 request ratio for hot and cold files.

2Cold3Hot Balanced: We also ran another instance of this test with a 60/40 request ratio for hot and cold files to balance the per-disk load. We expected the load-balanced setting of 2Cold3Hot to outperform the imbalanced version.

Figure 4.5.1: 2Cold3Hot. Hot and cold files reside on a 3-disk RAID 0 and a 2-disk RAID 0, respectively.
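The arithmetic behind the balanced variant is worth making explicit. A minimal sketch, where per-disk load means the fraction of total references a disk receives, assuming references spread evenly within each file group:

```python
# Per-disk reference load for the 2Cold3Hot layouts: hot references are
# shared by 3 disks, cold references by 2. An 80/20 ratio overloads the
# hot disks (~26.7% each vs. 10% per cold disk); a 60/40 ratio balances
# all five disks at 20% each.
def per_disk_load(hot_ratio, hot_disks=3, cold_disks=2):
    return hot_ratio / hot_disks, (1 - hot_ratio) / cold_disks

hot_80, cold_80 = per_disk_load(0.8)   # imbalanced case
hot_60, cold_60 = per_disk_load(0.6)   # balanced case
print(hot_80, cold_80, hot_60, cold_60)
```

This is why the 60/40 ratio, not 50/50, balances a 3-disk/2-disk split.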
Figure 4.5.2: Load balancing test results.

As expected, Figure 4.5.2 shows that the balanced 2Cold3Hot outperforms the imbalanced 2Cold3Hot, and their performance falls between the RAID 0 and 1Cold4Hot cases.

5. Future Work

Due to time constraints, we have not tracked down definitive explanations for why separating hot and cold files leads to better scaling. In the future, we plan to explore the factors of larger working sets, I/O scheduling queues, and real-world workloads, and to develop explanations for our various findings.

6. Conclusions

Through exploring various RAID configurations, we discovered many surprising findings. We were unable to show that reduced multiplexing could lead to better performance with many concurrent streams. On the other hand, we found that the separation of hot and cold files could lead to 20% better scaling. Although we have conducted extensive experiments, we have yet to find simple explanations. Overall, we found that modern storage subsystems are complex. Our lack of understanding suggests many open opportunities for future explorations.

7. Acknowledgements
I want to thank Dr. Andy Wang for his support and guidance on this project and the work leading to it, and for his friendship. I would also like to thank Dr. Ted Baker and Dr. Piyush Kumar for serving on the project committee, as well as for contributing to an excellent educational experience at Florida State University. I thank my family for all of their support, both financial and emotional. I thank my friends for keeping my spirits high when times looked dire. I thank the amazing OS and Storage Systems group for their camaraderie and support. Finally, I would like to send a special thanks to my Macaroni for giving me the motivation and support to finish this project.

References

[1] Anderson D. You Don't Know Jack about Disks. Queue 1(4), pp. 20-30, 2003.

[2] Chen PM, Lee EK. Striping in a RAID Level 5 Disk Array. Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1995.

[3] Fujitsu. MAW3 NC Series Obsolete Product Information Specifications, http://www.fujitsu.com/global/services/computing/storage/hdd/archive/maw3073nc maw3300nc.html, 2008.

[4] Patterson DA, Gibson G, Katz RH. A Case for Redundant Arrays of Inexpensive Disks (RAID). ACM SIGMOD International Conference on Management of Data, 1988.

[5] Scheuermann P, Weikum G, Zabback P. Data Partitioning and Load Balancing in Parallel Disk Systems. The International Journal on Very Large Data Bases, 7(1):48-66, February 1998.

[6] Seagate Technology LLC. Savvio 15K Data Sheet, http://www.seagate.com/docs/pdf/datasheet/disc/ds_savvio_15k.pdf, 2008.

[7] Shekhar S, Ravada S, Kumar V, Chubb D, Turner G. Load Balancing in High Performance GIS: Declustering Polygonal Maps. Proceedings of the 4th International Symposium on Large Spatial Databases, 1995.