VERY IMPORTANT NOTE! - RAID


Disk drives are an integral part of any computing system. They are usually where the operating system and all of an enterprise's or individual's data are stored. They are also one of the weakest links in a computing system. Disk platters rotate at 7.2k to 15k RPM, read and write heads move over the rotating platters at high speed, and the data is written magnetically to the platter. A disk drive is subject to failure from shock, wear and tear on bearings, accidental erasure by magnetic fields, and damage from electrostatic discharge or power surges.

For an individual, company, or enterprise that cannot tolerate system downtime, it makes sense to plan for the near-certainty that a hard drive is going to fail. When properly implemented, RAID can help maintain data availability when one or more disk drives fail.

RAID is an acronym for Redundant Array of Independent Disks. In the early days of RAID, it was sometimes called Redundant Array of Inexpensive Disks. Both terms describe the same thing: a method of utilizing a number of disks for data storage.

Why use multiple disks? There are three primary reasons:

1) Increased storage capacity - The maximum size of hard drives keeps increasing; every 8 months to a year the technology advances and it becomes possible to build a hard drive that holds more data. Even so, capacity is barely keeping up with demand, driven in part by virtualized computing.

2) Increased performance when striping - Data can be striped across multiple drives. Consider this example: suppose you have data that takes T seconds to write to a single disk. If you stripe it across two disks, it ideally takes only T/2, since each disk writes half of the data at the same time. This is multiplexing across a stripe. Striping across n drives would ideally take T/n. The more drives you have in a stripe, the faster you can write, since mechanical drive speeds are orders of magnitude slower than processor or RAID hardware speeds.

3) Data availability (fault tolerance) - If you have only one drive and it fails, you are down until you can install a new drive and copy your backup data to it. With any RAID level other than RAID 0, it is possible to create a scheme in which one or more drive failures can be tolerated. That means the system will stay up and running and data will not be lost.

Of course, these reasons for using multiple drives can be combined, and there are different types of RAID, or RAID levels, to address these requirements.

VERY IMPORTANT NOTE! - RAID is NOT a substitute for backup! RAID offers a level of defense against a certain amount of failure. BUT - if an unlikely, unforeseen catastrophic event occurs, data can be lost in spite of even the most thorough implementation of RAID. Always formulate a solid backup and disaster recovery plan that will enable recovery from a worst-case scenario, such as complete destruction from fire, flood, tornado, etc. NAS (Network Attached Storage) devices with frequent snapshots are a good method of disaster recovery.
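The T/n estimate given for striping can be sketched as a tiny idealized model. This is only an illustration: the function name ideal_stripe_write_time is hypothetical, and the model assumes that every disk writes its share in parallel and that the controller adds no overhead, which real arrays only approximate.

```python
def ideal_stripe_write_time(single_disk_seconds, num_disks):
    """Idealized time to write data striped evenly across num_disks.

    Assumption (not from a real controller): all disks write their share
    fully in parallel and striping itself costs nothing.
    """
    if num_disks < 1:
        raise ValueError("a stripe needs at least one disk")
    return single_disk_seconds / num_disks


if __name__ == "__main__":
    T = 12.0  # seconds to write the data to a single disk (made-up value)
    for n in (1, 2, 4, 6):
        print(f"{n} disk(s): {ideal_stripe_write_time(T, n):.1f} s")
```

In practice the measured speedup is smaller than this model suggests, because the controller, the bus, and the slowest disk in the stripe all limit throughput.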

Let's look at different RAID levels and examine the advantages and weaknesses.

RAID 0 - Striping

Disk 0   Disk 1
A1       A2
B1       B2

RAID 0 requires a minimum of 2 disks, and it stripes the data across the disks. The system sees the RAID array as one volume. When using two disks, the RAID controller simultaneously places one part of the file on disk 0 and the next part of the file on disk 1. The diagram above shows two write operations, A and B. Part of write operation A is written to disk 0 and part of write operation A is written to disk 1, and write operation B is striped in the same fashion. Since the write operation to each disk involves half of the file, writing the file takes about half the time it would take if there were a single disk.

RAID 0 has a number of positive aspects: it speeds up read and write operations, all of the disk space is usable, and it is the least expensive form of RAID to implement. It does have a major shortcoming in that it leaves out one part of the RAID acronym: redundancy. With RAID 0, parts of files are stored on multiple drives. If one of the drives fails, all data is lost. There is no fault tolerance. In fact, since there are multiple disks, there are more opportunities for failure.

RAID 1 - Mirroring

Disk 0   Disk 1
A1       A1
A2       A2
B1       B1
B2       B2

In its simplest form, RAID 1 consists of two drives that store duplicate data but appear as a single volume to the system. It provides fault tolerance: if one drive fails, the remaining drive contains all of the data that was on the failed drive. RAID 1 is the only RAID level that offers redundancy with only two disks. RAID 1 degrades write performance because the data has to be written twice. RAID 1 also cuts the storage capacity in half. If two 1TB drives are used in a RAID 1 array, their total capacity is 2TB, but only 1TB is available; the remaining 1TB is duplicate data.
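The two diagrams above can be reproduced with a short sketch of how blocks are placed on the disks. The function names raid0_layout and raid1_layout are hypothetical, and the round-robin placement is a simplification: a real controller works with fixed-size stripe units rather than named blocks.

```python
def raid0_layout(blocks, num_disks):
    """Distribute blocks round-robin across the disks (striping)."""
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


def raid1_layout(blocks, num_disks=2):
    """Write a full copy of every block to every disk (mirroring)."""
    return [list(blocks) for _ in range(num_disks)]


if __name__ == "__main__":
    blocks = ["A1", "A2", "B1", "B2"]
    print(raid0_layout(blocks, 2))  # each disk holds half of the blocks
    print(raid1_layout(blocks))     # each disk holds all of the blocks
```

With two disks, the striped layout puts A1 and B1 on disk 0 and A2 and B2 on disk 1, while the mirrored layout puts all four blocks on both disks, which is why only half of the raw capacity is usable.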

RAID 10 - Striping and Mirroring

            RAID 1
    RAID 0          RAID 0
Disk 0  Disk 1  Disk 2  Disk 3
A1      A2      A1      A2
B1      B2      B1      B2

RAID 10 is a combination of RAID 0 and RAID 1. It is also a nested RAID because it nests RAID 0 arrays under a RAID 1 array. In the diagram above, notice that the data on disk 0 and disk 1 is mirrored to disk 2 and disk 3. RAID 10 takes a RAID 0 stripe and mirrors the data across the different spans. This offers a level of redundancy. If any one drive fails, functionality will continue. If both drives on one side of the mirror fail (disks 0 and 1, or disks 2 and 3), functionality will continue. Note that disk 0 and disk 2 are identical, and disk 1 and disk 3 are identical. If disks 0 and 2 both fail, the system will fail because the data they contain is not available anywhere else. The same is true of disks 1 and 3. As with any RAID scheme that uses mirroring, only half of the total disk space is available for storage; the other half is devoted to the mirror image.

RAID 5 - Striping with Parity

Disk 0  Disk 1  Disk 2  Disk 3
P1      A1      A2      A3
B1      P2      B2      B3
C1      C2      P3      C3
D1      D2      D3      P4

RAID 5 utilizes data striping like RAID 0, but introduces parity to the scheme. Parity is extra information that is computed from the data and written along with it. In data transmission, it can be checked later to confirm that all data arrived as it was sent. When coupled with RAID, it can be used to rebuild the contents of a failed drive on the fly.

RAID 5 requires a minimum of 3 drives. When data is written to a RAID 5 array, the data is striped across all members. In addition, a parity block is written to a different drive with each striping cycle. This is sometimes referred to as rotating parity, and it gives RAID 5 the ability to tolerate the failure of any one drive in the array. The failure of a second drive will result in the loss of all data. When a single drive fails, the array keeps running in a degraded state that, like RAID 0, has no fault tolerance.
When the failed drive is replaced, the system does not instantly regain the fault tolerance it had before the failure. The new drive must be rebuilt and re-integrated before fault tolerance is restored. This means that in the period from the replacement of the drive until it is fully re-integrated, if another drive fails, all data will be lost. The larger the drive array, the more time this can take. Depending on the size of the drive array, rebuilding could take hours or even days, weeks or months. That is a lot of time to leave data vulnerable. That is why backup and disaster recovery strategies are so important.
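How parity lets a RAID 5 array rebuild a failed drive can be illustrated with XOR, the operation commonly used for single parity. The sketch below is a simplified model, not a controller implementation: the helper name xor_blocks and the sample byte strings are made up for illustration, and a real array applies the same idea per stripe with rotating parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (sample values) and their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]: XOR the surviving data
# blocks with the parity block to recover the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

if __name__ == "__main__":
    print(rebuilt == data[1])
```

The same calculation shows why a second failure is fatal: with two blocks of a stripe missing, the single parity block no longer contains enough information to recover either one.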

When a RAID 5 array is healthy with no failed disks, read operations are relatively fast, for the same reason RAID 0 is fast: data is being read simultaneously from multiple disks. For write operations, writing the parity blocks adds some overhead, so write performance is not as good as with RAID levels that do not use parity. In many cases, the fault tolerance and disk utilization achieved with RAID 5 make the performance tradeoff worthwhile. It is important to note that parity takes up some of the usable disk space, but significantly less than is used by RAID levels that mirror drives. RAID 5 requires one disk's worth of space for parity, so adding more disks to the array reduces this overhead as a fraction of total capacity.

RAID 50 - Nesting RAID 5 Arrays in a RAID 0 Striping Array

                            RAID 0
            RAID 5                          RAID 5
Disk 0  Disk 1  Disk 2  Disk 3  Disk 4  Disk 5  Disk 6  Disk 7
P01     A01     A02     A03     P11     A11     A12     A13
B01     P02     B02     B03     B11     P12     B12     B13
C01     C02     P03     C03     C11     C12     P13     C13
D01     D02     D03     P04     D11     D12     D13     P14

RAID 50 nests a number of RAID 5 arrays under a RAID 0 array. Remember that RAID 5 uses striping and parity, and RAID 0 uses striping but no parity. RAID 0 has no fault tolerance, while RAID 5 can tolerate the failure of one disk. As we have seen previously, if a RAID 0 array loses a member, all data is lost. This is still true of a RAID 50 array, but now each member of the RAID 0 array is not a single disk but a RAID 5 array, and each RAID 5 array can tolerate the failure of one disk. RAID 50 could withstand one drive failure in each RAID 5 array (up to 2 disks in the example above, one per array) and keep functioning. A second drive failure in any one of the individual RAID 5 arrays would cause all data to be lost.

When a disk fails in a RAID 50 array, performance is degraded, and it remains so until the disk is replaced AND the new disk is rebuilt and integrated into the array. A degraded RAID 50 array is slower than a healthy one, but its degraded performance is better than RAID 5's degraded performance, because only the affected RAID 5 array has to reconstruct data from parity.
It is important to note that a RAID 50 array is not immediately healed and is NOT fully functional at the moment the failed drive is replaced. Time is required to rebuild the contents of the failed drive onto the new drive. The larger the drives, the more time is required. For extremely large drive arrays, this could be hours or even days, weeks or months - a lot of time for ALL of your data to be vulnerable. For this reason, it is good practice to break up the data array into multiple LUNs rather than one single large volume.
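The failure rule for a nested array such as RAID 50 can be expressed as a small check: the array survives as long as no sub-array loses more disks than it can tolerate. This is a sketch under stated assumptions; the function name array_survives and the disk numbering (disks 0-3 in one RAID 5 array, disks 4-7 in the other, as in the RAID 50 diagram) are chosen for illustration.

```python
def array_survives(failed_disks, sub_arrays, tolerance_per_sub_array=1):
    """Return True if every sub-array has lost no more disks than it tolerates.

    sub_arrays is a list of lists of disk numbers; a tolerance of 1 models
    RAID 5 sub-arrays striped together by RAID 0.
    """
    failed = set(failed_disks)
    return all(
        len(failed.intersection(members)) <= tolerance_per_sub_array
        for members in sub_arrays
    )


# Two 4-disk RAID 5 arrays, numbered as in the RAID 50 diagram.
RAID50_EXAMPLE = [[0, 1, 2, 3], [4, 5, 6, 7]]

if __name__ == "__main__":
    print(array_survives([0], RAID50_EXAMPLE))     # True: one failure
    print(array_survives([0, 4], RAID50_EXAMPLE))  # True: one failure per sub-array
    print(array_survives([0, 1], RAID50_EXAMPLE))  # False: two failures in one sub-array
```

The check makes the point that what matters is where the failures land, not only how many there are: two failed disks are survivable when they are in different sub-arrays and fatal when they are in the same one.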

RAID 6 - Striping with Dual Parity

Disk 0  Disk 1  Disk 2  Disk 3
P1      P2      A       B
C       D       P3      P4
P5      P6      E       F
G       H       P7      P8

RAID 6 is also a striping and parity scheme. The difference between RAID 5 and RAID 6 is that RAID 6 performs two parity calculations per write operation, whereas RAID 5 performs one. RAID 6 requires a minimum of four drives, and it will tolerate the failure of any two drives. Upon the failure of a third drive, all data is lost. This additional fault tolerance comes at a cost. During write operations, a RAID 6 controller must calculate and write two parity blocks for each write operation. This takes time, making RAID 6 write performance slower than that of the other RAID levels described here.

RAID 60 - Nesting RAID 6 Arrays in a RAID 0 Striping Array

                            RAID 0
            RAID 6                          RAID 6
Disk 0  Disk 1  Disk 2  Disk 3  Disk 4  Disk 5  Disk 6  Disk 7
P01     P02     A0      B0      P11     P12     A1      B1
C0      D0      P03     P04     C1      D1      P13     P14
P05     P06     E0      F0      P15     P16     E1      F1
G0      H0      P07     P08     G1      H1      P17     P18

RAID 60 takes multiple RAID 6 arrays and nests them in a RAID 0 array, in the same way that RAID 50 nests RAID 5 arrays. It has higher fault tolerance than RAID 50 and can survive two disk failures per RAID 6 array.

RAID levels compared in a 12 disk NAS system

RAID level  Method                                        Fault tolerance                            Disk utilization (efficiency)  Read performance  Write performance
RAID 0      Striping across disks                         None                                       100%                           Highest           Highest
RAID 1      Mirroring disks                               1 disk                                     50%                            High              Low
RAID 10     Stripe mirrored arrays                        Up to 1 disk failure in each sub-array     50%                            High              High
RAID 5      Striping and single parity across disks       1 disk                                     91.67%                         High              Low
RAID 50     Striping across RAID 5 arrays                 Up to 1 disk failure in each sub-array     83.33%                         High              Medium
RAID 6      Striping and dual parity across disks         2 disks                                    83.33%                         High              Lowest
RAID 60     Striping across RAID 6 arrays                 Up to 2 disk failures in each sub-array    66.67%                         High              Medium

Disk utilization for RAID 50 and RAID 60 depends on how the 12 disks are divided, because each sub-array gives up one (RAID 5) or two (RAID 6) disks to parity. The figures shown assume two 6-disk sub-arrays.
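The disk utilization column follows from simple arithmetic on the number of disks given up to mirroring or parity. The sketch below shows that calculation for a 12 disk system. The function name usable_fraction is hypothetical, and for the nested levels it assumes the disks are split into two equal sub-arrays; a different split gives different figures, since each sub-array gives up one disk (RAID 50) or two disks (RAID 60) to parity.

```python
# Disks' worth of capacity used for parity in each parity-based array.
PARITY_DISKS = {"RAID 5": 1, "RAID 6": 2, "RAID 50": 1, "RAID 60": 2}


def usable_fraction(level, total_disks, sub_arrays=2):
    """Fraction of raw capacity available for data at a given RAID level.

    sub_arrays is only used for the nested levels RAID 50 and RAID 60;
    two sub-arrays is an assumption made for this example.
    """
    if level == "RAID 0":
        return 1.0
    if level in ("RAID 1", "RAID 10"):
        return 0.5  # half of the space holds the mirror image
    if level in ("RAID 5", "RAID 6"):
        return (total_disks - PARITY_DISKS[level]) / total_disks
    if level in ("RAID 50", "RAID 60"):
        return (total_disks - PARITY_DISKS[level] * sub_arrays) / total_disks
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    for level in ("RAID 0", "RAID 1", "RAID 10", "RAID 5",
                  "RAID 50", "RAID 6", "RAID 60"):
        print(f"{level:8s} {usable_fraction(level, 12):.2%}")
```

For example, a single 12 disk RAID 5 array keeps 11 of 12 disks for data (91.67%), while RAID 50 built from two 6-disk RAID 5 arrays keeps 10 of 12 (83.33%).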