Performance Impact on Exchange Latencies During EMC CLARiiON CX4 RAID Rebuild and Rebalance
Applied Technology

Abstract
This white paper discusses the results of tests conducted in a Microsoft Exchange 2007 environment. These tests examined the effects of single- and multiple-drive failures on Exchange's performance when using RAID 5 or RAID 6 technology.

March 2010
Copyright © 2010 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part Number h6944
Table of Contents
Executive summary
Introduction
  Audience
  Terminology
Overview of test results
Overview of tests
  About the tests
    Testing steps with no load
    Testing steps with load
  RAID group and LUN configuration
    Database RAID groups
  Testing plan and schedule
  Storage processor events
Test results for baseline RAID 5
  Test 001: RAID 5 with no activity and one drive pull
  Test 001: RAID 5 rebalance with one drive replacement
  Test 002: RAID 5, eight-hour Jetstress, with one drive pull
  Test 002: RAID 5 rebalance after Jetstress with one drive replacement
Test results for RAID 6
  Test 004: RAID 6 with no activity and two drive pulls
  Test 004: RAID 6 rebalance with two drive replacements
  Test 005: RAID 6, eight-hour Jetstress, with two drive pulls
  Test 005: RAID 6 rebalance after Jetstress with two drive replacements
Jetstress comparison between the RAID 6 baseline and RAID 6 during hot spare rebuild
Jetstress comparison between RAID 5 and RAID 6 during hot spare rebuild
Conclusion
References
Executive summary
The Total Customer Experience (TCE) program, which is driven by Lean Six Sigma methodologies, demonstrates EMC's commitment to maintaining and improving the quality of EMC's products. In keeping with this philosophy, EMC designed Customer Integration Labs in its Global Solutions Centers and Partner Engineering Labs, where we conduct rigorous tests that reflect real-world environments. In these tests, we design and execute TCE use cases and carefully measure performance. These TCE use cases provide us with insight into the challenges currently facing our customers, allowing us to provide the highest quality products.

This white paper describes how the EMC CLARiiON implementations of RAID 5 and RAID 6 were tested. These tests determined the impact that RAID rebuilding (due to drive failures) and rebalancing have on the performance of a CLARiiON CX4-480, when the CX4-480 is operating with or without a load. The results of these tests demonstrated the superior ability of CLARiiON's RAID 5 and RAID 6 technologies to fail over to a hot spare when faced with a drive failure.

Introduction
This white paper summarizes the results of testing RAID 5 and RAID 6 sets with hot spare disks on an EMC CLARiiON CX4-480. These tests were conducted with and without application activity. This paper discusses the test use-case scenarios, objectives, expected results, and actual results.

The configurations for these tests included:
- CLARiiON CX4-480 storage configuration for a RAID 5 4+1 RAID group set
- CLARiiON CX4-480 storage configuration for a RAID 6 4+2 RAID group set
- CLARiiON CX4-480 storage configuration for hot spare disks and a RAID group

These tests did not include:
- Installation and configuration of Microsoft Exchange 2007
- Installation and configuration of Microsoft Jetstress 2007
- Installation and configuration of Microsoft Windows 2008
- Creation of RAID 5 or RAID 6 CLARiiON RAID groups

Audience
The intended audience for this white paper is:
- Internal EMC personnel
- EMC Partners

The audience should have a firm understanding of the following:
- CLARiiON CX4 RAID technology
- Navisphere UI or NaviSECCLI for the creation of RAID 5 and/or RAID 6 RAID groups
- Navisphere UI or NaviSECCLI for the creation of RAID 5 and/or RAID 6 LUNs
- Navisphere UI or NaviSECCLI for the creation of RAID 5 and/or RAID 6 metaLUNs
- Navisphere UI or NaviSECCLI for the creation of hot spare RAID groups and hot spare disk(s)
Terminology
RAID: Redundant Array of Independent Disks.

RAID 5: RAID 5 uses block-level striping with parity data distributed across all member disks. RAID 5 can recover from a single drive failure, allowing the set to remain functional until the failed drive is replaced. The usable capacity of a RAID 5 group is (n-1) drives, where n is the total number of drives and 1 represents the capacity of one drive given up to parity. RAID 5 will lose data upon the loss of more than one drive.

RAID 6: RAID 6 extends RAID 5 by adding a second parity block, using block-level striping with two parity blocks distributed across all member disks. Unlike RAID 5, RAID 6 can remain online while recovering from two drive failures, giving it an added level of fault tolerance over RAID 5. The usable capacity of a RAID 6 group is (n-2) drives, where n is the total number of drives and 2 represents the capacity of two drives given up to parity. The loss of capacity is offset by the added fault tolerance: three drives must fail before the set fails completely.

Hot spare: The CLARiiON CX4 allows you to create hot spare disks; hot spare disks are used for the short-term replacement of failed disks. Hot spares allow additional time for user/vendor replacement of failed drives. The use of hot spares reduces the possibility of total failure of RAID 5 and RAID 6 sets by replacing failed drive(s) without administrator/user intervention. Hot spare drives are not intended to be a permanent replacement for failed drives; failed drives should be replaced as soon as possible. A recommended rule of thumb is one hot spare for every 30 drives on CLARiiON arrays.

Rebuild: This process occurs when a hard drive fails (or is marked bad) and a hot spare drive is available. The parity data on the remaining good drives is used to rebuild the failed drive's data. The rebuild process has a greater impact on response times than the disk-replacement rebalancing process. During the rebuild process, each good drive queues eight 512 KB reads (4 MB), and 4 MB of data is written to the hot spare drive (as eight 512 KB writes) through the back-end bus.

Rebalance: This process occurs when a failed hard drive is replaced within a RAID group. Data is copied from the hot spare drive to the replacement drive through the back-end bus.

Overview of test results
During testing, all databases remained online without corruption. For both RAID 5 and RAID 6, database latencies rose slightly during the rebuild-to-hot-spare process. Array utilization barely registered above normal during the rebalancing process. (During the rebalancing process, data is copied from the hot spare back to the replaced drive.)

The rebuild priority can be changed based on application requirements; these settings affect the performance of the storage processor. During testing, the rebuild priority was left at its default setting of High. Increasing the priority from High to ASAP reduced the time for the hot spare rebuild but increased the impact on storage processor performance. Reducing the setting to Medium or Low lowered the impact on storage processor performance during the hot spare rebuild, but increased the amount of time it took to rebuild to the hot spare drive. Customers wishing to minimize the possibility of a second drive failure during the longer rebuild that results from a lower setting may wish to consider RAID 6, which allows for the loss of two drives within the same RAID group.
Figure 1. Snapshot of the Navisphere UI setting for Rebuild Priority options

Overview of tests
EMC conducted the following tests to measure the performance impact of RAID 5 and RAID 6 hot spare drive replacement and rebalancing, with and without a load. We used Microsoft Jetstress to simulate real-life activity. We did this to obtain data points (for customers) that show how Microsoft Exchange will react and how their user base will be affected during rebuild and rebalance operations.

RAID 5 was tested by removing a single drive; the drive was physically pulled from the array without preparation on the array or server. Hot spare drives were set up on the array according to EMC recommendations. This test measured the impact a failed drive had on the storage processors and Exchange servers. It measured the impact during the hot spare rebuild. It also showed what happened when the (simulated) repaired drive was placed back into the array, which was not done until the storage processor event logs showed that the array had stabilized and the rebuild/sync functions had completed.

RAID 6 was tested the same way, except that in this test we removed two drives at the same time to prove that this technology can function without disruption with two failed drives. In this case, databases remained online. The testing was not done to see if Jetstress would fail; instead, it showed how database latencies increased during the rebuild and/or rebalance processes.

We conducted baseline Jetstress testing on the RAID 5 and RAID 6 configurations. These results were compared to the results during the drive failure tests to determine the impact of rebuild and rebalance operations.

About the tests
Testing steps with no load
1. Clear all logging to ensure only current test data is within saved logs:
   - Navisphere Analyzer (NAR)
   - Navisphere event logs on each storage processor
   - Windows event logs (system/application)
2. Start all logging:
   - Navisphere Analyzer
   - Windows Performance Monitor (PerfMon)
3. Wait 15 minutes.
4. Remove a predetermined drive in a predetermined RAID group housing Exchange database LUNs. For example: 1_0_1 in the RAID 5 configuration, and 1_0_0 and 1_0_1 in the RAID 6 configuration.
5. Monitor event logs on the storage processors until events show that:
   - All rebuilds for a FRU have completed.
   - CRU Unit rebuild is complete.
6. Monitor Navisphere Analyzer to confirm SP utilization has normalized.
7. Stop all monitoring.
8. Save all logs:
   - NAR
   - Storage processor event logs
   - PerfMon
   - Windows event logs (system/application)

Testing steps with load
1. Clear all logging to ensure only current test data is within saved logs:
   - Navisphere Analyzer (NAR)
   - Navisphere event logs on each storage processor
   - Windows event logs (system/application)
2. Start all logging:
   - Navisphere Analyzer
   - Windows Performance Monitor (PerfMon)
3. Start Microsoft Jetstress.
4. Wait 15 minutes (this is to get information in the logs before the failure).
5. Remove a predetermined drive in a predetermined RAID group housing Exchange database LUNs. For example: 1_0_1 in the RAID 5 configuration, and 1_0_0 and 1_0_1 in the RAID 6 configuration.
6. Monitor event logs on the storage processors until events show that all rebuilds for a FRU have completed:
   - CRU Unit Rebuild Complete
7. Monitor Navisphere Analyzer to confirm SP utilization has normalized.
8. Stop all monitoring.
9. Save all logs:
   - NAR
   - Storage processor event logs
   - Windows event logs (system/application)

RAID group and LUN configuration
For both the RAID 5 and RAID 6 tests, the usable capacity yield is four disks of total capacity (n=4); see the capacity sketch that follows these configuration lists.

Database RAID groups
This configuration resulted in a total of eight DB metaLUNs and eight log file metaLUNs for eight Exchange storage groups (ESGs):
- RAID 5: two RAID groups at 4+1
  - 16 LUNs in each RAID group
  - 8 metaLUNs were created from the 16 component LUNs

- RAID 6: two RAID groups at 4+2
  - 16 LUNs in each RAID group
  - 8 metaLUNs were created from the 16 component LUNs

Log file RAID groups
- RAID 10 at 2+2
  - 16 LUNs in each RAID group
  - 8 metaLUNs created from the 16 component LUNs
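As a quick illustration of the capacity arithmetic behind these configurations, the sketch below (Python, illustrative only) applies the (n-1) and (n-2) formulas from the Terminology section to the 4+1 and 4+2 groups used here, along with the one-hot-spare-per-30-drives rule of thumb. The 300 GB per-disk figure is an assumption for illustration; the paper does not state the drive capacity used.

```python
# Illustrative sketch of the RAID capacity formulas from the Terminology section.
# The 300 GB per-disk size is an assumption; the paper does not state the
# capacity of the drives used in these tests.

def usable_disks(total_disks: int, parity_disks: int) -> int:
    """RAID 5 usable capacity is (n - 1) disks; RAID 6 is (n - 2) disks."""
    return total_disks - parity_disks

def hot_spares_needed(total_drives: int, ratio: int = 30) -> int:
    """Rule of thumb from this paper: roughly one hot spare per 30 drives."""
    return max(1, -(-total_drives // ratio))  # ceiling division, at least one spare

DISK_GB = 300  # assumed per-disk capacity, for illustration only

raid5 = usable_disks(total_disks=5, parity_disks=1)   # 4+1 group -> 4 data disks
raid6 = usable_disks(total_disks=6, parity_disks=2)   # 4+2 group -> 4 data disks

print(f"RAID 5 4+1 usable capacity: {raid5} disks (~{raid5 * DISK_GB} GB)")
print(f"RAID 6 4+2 usable capacity: {raid6} disks (~{raid6 * DISK_GB} GB)")
print(f"Hot spares for a 30-drive pool: {hot_spares_needed(30)}")
```

Both layouts yield four data disks per group (n=4), matching the configuration above; RAID 6 spends one additional disk on the second parity block in exchange for tolerating two drive failures.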
Testing plan and schedule
Please note that:
- The test plan shown in Table 1 outlines the test, RAID configuration, length of time, activity, and description of each test.
- Baseline testing using Microsoft Jetstress was also done for each of the RAID configurations with no drive loss/pull.
- Test 001 and Test 004 show a length of time as to-be-determined (TBD). These tests determined how long it took to rebuild a RAID group to a hot spare.

Table 1. Test plan

Test | RAID type | Hot spares | Length | Activity | Description
Baseline RAID 5 | RAID 5 | 1 | TBD | Jetstress | Baseline RAID 5: Run Jetstress with no failures to gather data on I/O and latencies for future tests. Gather all logs upon completion.
Test 001 | RAID 5 | 1 | TBD | None | Pull 1 drive. Allow the hot spare to rebuild. Monitor the event viewer on the SP for completion. Gather all logs upon completion.
Test 002 | RAID 5 | 1 | 8 hours | Jetstress | Start an 8-hour Jetstress performance test. Pull 1 drive. Allow the hot spare to rebuild. Monitor the event viewer on the SP for completion. Gather all logs upon completion.
Test 004 | RAID 6 | 2 | TBD | None | Pull 2 drives. Allow the hot spares to rebuild. Monitor the event viewer on the SP for completion. Gather all logs upon completion.
Test 005 | RAID 6 | 2 | 8 hours | Jetstress | Start an 8-hour Jetstress performance test. Pull 2 drives. Allow the hot spares to rebuild. Monitor the event viewer on the SP for completion. Gather all logs upon completion.

Storage processor events
The messages in the SP event logs were used to determine the following:
- Exactly when the drive failure was noted, through the event: "Disk (Bus 1 Enclosure 0 Disk 1) failed or was physically removed"
- Exactly when the hot spare(s) began replacing the failed drive(s): "Hot Spare is now replacing a failed drive"
- When the rebuild completed: "All rebuilds for a FRU have completed"
- When the failed drive was replaced: "Drive was physically inserted into the Slot"
- When the replacement drive had synced and the hot spare was no longer in use: "Hot Spare is no longer replacing a failed drive"
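When post-processing the saved storage processor logs, the messages listed above can be matched programmatically to timestamp each phase of a test. The sketch below is a minimal illustration of that idea; the log file name and the assumption of a plain-text export are ours, not an EMC-provided tool.

```python
# Minimal sketch: scan an exported SP event log for the messages listed above
# to locate each phase of a test. The file name and the plain-text export
# format are assumptions for illustration; adjust to the actual log export.

MARKERS = {
    "drive_failed": "failed or was physically removed",
    "hot_spare_in": "Hot Spare is now replacing a failed drive",
    "rebuild_done": "All rebuilds for a FRU have completed",
    "drive_inserted": "was physically inserted into the Slot",
    "hot_spare_out": "Hot Spare is no longer replacing a failed drive",
}

def find_events(log_path: str) -> dict:
    """Return the first log line that contains each marker message."""
    found = {}
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for name, text in MARKERS.items():
                if name not in found and text in line:
                    found[name] = line.strip()
    return found

if __name__ == "__main__":
    # "sp_a_event_log.txt" is a placeholder name for an exported SP event log.
    for phase, line in find_events("sp_a_event_log.txt").items():
        print(f"{phase}: {line}")
```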
Test results for baseline RAID 5
The Jetstress report shows the same averages: about 16 ms database read latencies, achieving 1476 IOPS in the RAID 5 4+1 configuration.
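To relate the 1476 host IOPS above to per-disk load, one common estimate uses the standard RAID write penalty (four back-end I/Os per random write for RAID 5, six for RAID 6). The sketch below is a rough illustration only; the 1:1 read/write mix and the even spread across disks are assumptions, since the Jetstress I/O profile is not quoted in this paper.

```python
# Rough back-end I/O estimate using the standard RAID write penalties
# (RAID 5: 4 back-end I/Os per random write; RAID 6: 6). The 1:1 read/write
# mix and the even spread across disks are assumptions for illustration;
# the actual Jetstress profile is not quoted in this paper.

def backend_iops(host_iops: float, read_fraction: float, write_penalty: int) -> float:
    reads = host_iops * read_fraction
    writes = host_iops * (1.0 - read_fraction)
    return reads + writes * write_penalty

HOST_IOPS = 1476      # baseline achieved IOPS from the Jetstress report
READ_FRACTION = 0.5   # assumed 1:1 read/write mix

for name, penalty, disks in (("RAID 5 (2 x 4+1)", 4, 10), ("RAID 6 (2 x 4+2)", 6, 12)):
    total = backend_iops(HOST_IOPS, READ_FRACTION, penalty)
    print(f"{name}: ~{total:.0f} back-end IOPS, ~{total / disks:.0f} per disk")
```

The point of this estimate is simply that, for the same host load, the RAID 6 groups carry more back-end write I/O than the RAID 5 groups.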
Test 001: RAID 5 with no activity and one drive pull
Utilization
This chart shows RAID 5 utilization with no activity and one drive pull:

Response time
RAID group response time remained a steady 30 ms for RAID Group 20 (the RAID group affected by the drive pull):
Storage processor utilization
The storage processor utilization increased to approximately 10 percent throughout the rebuild:

This read size (KB) chart shows the rebuild process reading 512 KB from all disks in the RAID group:
The read bandwidth (measured in MB/s) held steady at approximately 48 MB/s:

Test 001: RAID 5 rebalance with one drive replacement
Utilization
The Navisphere chart shows that RAID group utilization rose to 8-10 percent during the rebalance to the replacement drive; the rebalance took 6.7 hours to complete.
Response time

Storage processor utilization
Read size was, as expected, at 512 KB for the single drive accessed during the rebalance copy process. The image below shows a single drive being read from during this process.

The read bandwidth in this test was 13 MB/s.
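As a rough consistency check, the 13 MB/s copy rate and the 6.7-hour completion time reported for this rebalance imply how much data was copied back to the replacement drive. The sketch below works the arithmetic; the comparison against a nominal 300 GB drive is an assumption for illustration, since the paper does not state the drive capacity used.

```python
# Back-of-envelope check: how much data does a 6.7-hour rebalance at ~13 MB/s copy?
# The 300 GB comparison figure is an assumption; the paper does not state the
# capacity of the drives used in these tests.

COPY_RATE_MB_S = 13    # reported read bandwidth during the rebalance
DURATION_HOURS = 6.7   # reported time to complete

copied_gb = COPY_RATE_MB_S * DURATION_HOURS * 3600 / 1024
est_hours_300gb = 300 * 1024 / COPY_RATE_MB_S / 3600

print(f"Data copied: ~{copied_gb:.0f} GB")                            # roughly 306 GB
print(f"Estimated hours for a 300 GB drive: {est_hours_300gb:.1f}")   # roughly 6.6 hours
```

The two figures agree within a few percent, which is consistent with the rebalance copying roughly one drive's worth of data back from the hot spare.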
Test 002: RAID 5, eight-hour Jetstress, with one drive pull
Utilization
This chart shows several characteristics of utilization during the RAID rebuild:
- When the drive was pulled, utilization of the affected RAID group was about 30 percent higher than that of the other RAID group, but did not reach its maximum.
- As the rebuild continued to completion, the second RAID group's utilization increased.
- Utilization for both RAID groups was identical until the drive was pulled. Utilization for the affected RAID group then increased by almost 40 percent for the rebuild, while the second RAID group's utilization increased slightly at the beginning and dropped back to the same level (about 48 percent) for the rest of the test.
Response time
This chart shows that, before the rebuild started, the storage processors were about 7 percent active. RG20 increased as expected during the rebuild, but by only 5 percent during the rebuild process. RG21 appears to remain at about 7 percent throughout the test.

Storage processor utilization
This chart clearly shows when the rebuild begins and ends, along with spikes in storage processor utilization during the CRU rebuild processes for this drive:
From the time the drive is pulled until the rebuild is complete, the read size goes from a low of 64 KB up to 100 KB. The read bandwidth, in MB/s, is shown below:
Total bandwidth (MB/s) for the RAID group was approximately 58 MB/s:
Jetstress testing shows the effects of the RAID rebuild compared to the baseline tests:
- Achieved IOPS were 1279, or 13 percent lower than baseline.
- RG20 IOPS were 641.226, or 13 percent lower than baseline.
- DB read latencies for RG20 were 19.25 ms, or 22 percent higher than baseline.
- DB read latencies for RG21 were 12.75 ms, or 12 percent lower than baseline (this would be expected due to the lower IOPS).
- Write latencies did not change for either RAID group's database or log files.

(A short arithmetic check of these comparisons appears at the end of this section.)

Test 002: RAID 5 rebalance after Jetstress with one drive replacement
Because the RAID group utilization and response times registered at almost 0 (or 1 to 2.5 percent, which is attributable to normal array processes), the following chart shows RAID group utilization, RAID group response time, and storage processor utilization. This chart shows that, after a failed drive is replaced, a rebalance does not affect the RAID groups or storage processors. The time to complete was 6.7 hours.
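The percentage comparisons in the Jetstress results above can be reproduced from the figures quoted in this paper. A minimal check follows; the per-RAID-group baseline latencies are approximated by the roughly 16 ms average from the baseline Jetstress report, since the per-group baseline values are not quoted here, so small differences from the stated percentages are expected.

```python
# Reproduce the Test 002 baseline-versus-rebuild comparison from figures quoted
# in this paper. The 16 ms baseline latency is the approximate average from the
# baseline Jetstress report; per-RAID-group baselines likely differed slightly.

def pct_change(during_rebuild: float, baseline: float) -> float:
    """Percent change of the rebuild-time value relative to baseline."""
    return (during_rebuild - baseline) / baseline * 100.0

print(f"Achieved IOPS:        {pct_change(1279, 1476):+.1f}%")   # about -13%
print(f"RG20 DB read latency: {pct_change(19.25, 16.0):+.1f}%")  # about +20% (stated: +22%)
print(f"RG21 DB read latency: {pct_change(12.75, 16.0):+.1f}%")  # about -20% (stated: -12%)
```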
Test results for RAID 6
Test 004: RAID 6 with no activity and two drive pulls
For the following tests, two drives were pulled to show the ability of RAID 6 to recover from two drive failures.

Utilization

Response time
Response time increased during the parity rebuild process, but by no more than it did during the RAID 5 tests.
Storage processor utilization
Storage processor utilization rose slightly at the beginning of the parity rebuild process, but then dropped quickly.

Test 004: RAID 6 rebalance with two drive replacements
Utilization
RAID group utilization increased slightly during the rebalance process:
Response time
RAID group response time increased briefly at the start and remained at approximately 12 ms throughout the rebalance. The spike at the end was not related to these tests; it was caused by the polling of data.

Storage processor utilization
Storage processor utilization during the rebalance process was unremarkable; the additional load placed on the storage processors was insignificant compared to normal operations.
Test 005: RAID 6, eight-hour Jetstress, with two drive pulls
Utilization
This test showed the effects of the parity rebuild on the storage processors and RAID groups during a heavy load:
- Before the drives were pulled, RAID group utilization for both groups was at 80 percent.
- Upon the simulated drive failure (the drives being pulled), the affected RAID group increased to 90 percent utilization, while the second RAID group's utilization dropped to 60 percent. This is expected, as the array gives more priority to the faulted RAID group.
- The time it took to complete, even during a heavy load, was the same 6.7 hours.
Response time
Before the drives were pulled, response times with Jetstress running were about 8 ms. Upon the simulated failure, the affected RAID group's response time increased slightly to 12 ms (still below Microsoft best practices), while the other RAID group's response times dropped slightly.

Storage processor utilization
Test 005: RAID 6 rebalance after Jetstress with two drive replacements
Utilization
RAID group utilization was almost identical to the rebalance without activity.

Response time
RAID group response times were the same as in the other rebalance tests, remaining at about 12 ms and rising slightly before completion.
Storage processor utilization
As with the other rebalance tests, storage processor utilization remained insignificant, rising briefly to 2 percent but not much higher than normal storage processor utilization.

With a read size of 512 KB, using Release 22 plus RAID 5, 4 MB of data was read from the hot spare in 512 KB chunks and written to the repaired drive in the same manner, as shown below:

Read bandwidth during the rebalance after Jetstress averaged about 12 MB/s:
Jetstress comparison between the RAID 6 baseline and RAID 6 during hot spare rebuild
In this test:
- The achieved IOPS was 1170.996, or 40 percent lower than baseline.
- RG20 IOPS was 586.472, or 41 percent lower than baseline.
- The DB read latency for RG20 was 21 ms, or 32 percent higher than baseline.
- The DB read latency for RG21 was 14.5 ms, or 4 percent higher than baseline.
Jetstress comparison between RAID 5 and RAID 6 during hot spare rebuild
In this test:
- The achieved IOPS was 1170, or 8 percent lower than RAID 5.
- The RG20 IOPS was 586.472, or 8 percent lower than RAID 5.
- The DB read latency for RG20 was 21 ms, or 8 percent higher than RAID 5.
- The DB read latency for RG21 was 14.5 ms, or 12 percent higher than RAID 5.
- Write latencies did not change for either RAID group's database or log files.
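As with the earlier comparison, the figures quoted above can be cross-checked directly. The short sketch below repeats the arithmetic using the Test 002 (RAID 5) and Test 005 (RAID 6) numbers from this paper; it shows the achieved IOPS coming out lower for RAID 6 during rebuild, while the read latencies come out higher, close to the percentages stated above.

```python
# Cross-check of the RAID 5 versus RAID 6 comparison during hot spare rebuild,
# using only figures quoted earlier in this paper (Test 002 and Test 005).

raid5 = {"achieved IOPS": 1279, "RG20 IOPS": 641.226,
         "RG20 read latency (ms)": 19.25, "RG21 read latency (ms)": 12.75}
raid6 = {"achieved IOPS": 1170.996, "RG20 IOPS": 586.472,
         "RG20 read latency (ms)": 21.0, "RG21 read latency (ms)": 14.5}

for metric in raid5:
    delta = (raid6[metric] - raid5[metric]) / raid5[metric] * 100.0
    print(f"{metric}: RAID 6 versus RAID 5 = {delta:+.1f}%")
# IOPS come out roughly 8 to 9 percent lower for RAID 6; read latencies come
# out roughly 9 and 14 percent higher.
```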
Conclusion
These tests demonstrate the superior ability of CLARiiON's RAID 5 and RAID 6 technologies to fail over to a hot spare when faced with a drive failure. RAID 5 was tested with four data drives and the capacity equivalent of the one parity drive that RAID 5 technology requires. RAID 6 was tested with four data drives and the capacity equivalent of the two parity drives that RAID 6 technology requires.

In a Microsoft Exchange environment with RAID 5, with one drive failure, the CLARiiON failed over to a hot spare and the following occurred:
- IOPS dropped slightly and latencies increased slightly during the rebuild process.
- No database was dismounted or corrupted.
- No server lost connectivity to the array.

In a Microsoft Exchange environment with RAID 6, the CLARiiON recovered from two drive failures; IOPS dropped slightly and latencies increased slightly during the rebuild process, and the following occurred:
- No database was dismounted or corrupted.
- No server lost connectivity to the array.
- The array was able to recover from multiple drive failures.
- The array was able to rebuild to a hot spare in 6.7 hours.

Both RAID technologies had outstanding performance during the rebalancing from the hot spare to the replacement drives. Additionally, both technologies took under 7 hours to complete the rebalance, and storage processor utilization, RAID group utilization, and response times barely registered usage while rebalancing.

References
EMC CLARiiON Best Practices for Performance and Availability: Release 29.0 Firmware Update, Applied Best Practices