Chapter 7. Disk subsystem


Ultimately, all data must be retrieved from and stored to disk. Disk accesses are usually measured in milliseconds, whereas memory and PCI bus operations are measured in nanoseconds or microseconds. Disk operations are typically thousands of times slower than PCI transfers, memory accesses, and LAN transfers. For this reason the disk subsystem can easily become the major bottleneck for any server configuration. Disk subsystems are also important because the physical orientation of data stored on disk has a dramatic influence on overall server performance. A detailed understanding of disk subsystem operation is critical for effectively solving many server performance bottlenecks.

A disk subsystem consists of the physical hard disk and the controller. A disk is made up of multiple platters coated with magnetic material to store data. The entire platter assembly, mounted on a spindle, revolves around the central axis. A head assembly mounted on an arm moves to and fro (linear motion) to read the data stored on the magnetic coating of the platters. The linear movement of the head is referred to as the seek. The time it takes to move to the exact track where the data is stored is called seek time. The rotational movement of the platter to the correct sector to present the data under the head is called latency. The ability of the disk to transfer the requested data is called the data transfer rate.

The most widely used drive technology today in servers is SCSI (Small Computer System Interface). IBM's flagship SCSI controller is the ServeRAID-4H adapter. Besides SCSI, other storage technologies are available, such as:

- SSA (Serial Storage Architecture)
- FC-AL (Fibre Channel Arbitrated Loop)
- EIDE (Enhanced Integrated Drive Electronics)

Using EIDE in servers

For performance reasons, do not use EIDE disks in your server. The EIDE interface does not handle multiple simultaneous I/O requests very efficiently and so is not suited to a server environment. The EIDE interface also uses more server CPU capacity than SCSI. We recommend you limit EIDE use to CD-ROM and tape devices.

In this redbook we will focus only on SCSI and Fibre Channel.

7.1 SCSI bus overview

The SCSI bus has evolved into the predominant server disk connection technology. Several different versions of SCSI exist. The table below contains all versions covered by the current SCSI specification.

Table 11. SCSI specifications

SCSI standard   Bus clock speed   Transfer rate, 50-pin narrow (8-bit) / 68-pin wide (16-bit)   Maximum cable length
SCSI            5 MHz             5 MBps                                                        6 meters
SCSI-2 Fast     10 MHz            10 MBps / 20 MBps                                             3 meters
Ultra SCSI      20 MHz            20 MBps / 40 MBps                                             1.5 meters
Ultra2 SCSI     40 MHz            40 MBps / 80 MBps                                             12 meters (LVD)
Ultra3 SCSI     80 MHz            80 MBps / 160 MBps                                            12 meters (LVD)

7.1.1 SCSI

First implemented as an ANSI standard in 1986, the Small Computer System Interface defines an 8-bit interface with a burst-transfer rate of 5 MBps with a 5 MHz clock (that is, 1 byte transferred per clock cycle). SCSI cable lengths are limited to 6 meters.

7.1.2 SCSI-2

The SCSI-2 standard was released by ANSI in 1996 and allowed for better performance than the original SCSI interface. It defines extensions that allow for 16-bit transfers and twice the data transfer rate due to a 10 MHz clock. The 8-bit interface is called SCSI-2 Fast and the 16-bit interface is called SCSI-2 Fast/Wide. In addition to the faster speed, SCSI-2 also introduced new command sets to improve performance when multiple requests are issued from the server. The trade-off with increased speed was shorter cable length: the 10 MHz SCSI-2 interface supports a maximum cable length of 3 meters.
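The burst rates in Table 11 follow directly from the bus clock and the bus width: one byte per clock cycle on the narrow (8-bit) bus, two bytes on the wide (16-bit) bus. A minimal sketch of that arithmetic, using illustrative values from the table:

```python
# Illustrative sketch: SCSI burst rate = bus clock (MHz) x bus width (bytes per clock).
# Values taken from Table 11; the function itself is not part of any SCSI specification.

def burst_rate_mbps(clock_mhz: float, wide: bool) -> float:
    """Return the theoretical burst rate in MBps for a parallel SCSI bus."""
    bytes_per_clock = 2 if wide else 1   # 16-bit (wide) vs 8-bit (narrow)
    return clock_mhz * bytes_per_clock

print(burst_rate_mbps(10, wide=False))  # SCSI-2 Fast:       10 MBps
print(burst_rate_mbps(10, wide=True))   # SCSI-2 Fast/Wide:  20 MBps
print(burst_rate_mbps(40, wide=True))   # Ultra2 SCSI wide:  80 MBps
```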

7.1.3 Ultra SCSI

Ultra SCSI is an update to the SCSI-2 interface offering faster data transfer rates. It is a subset of the SCSI-3 parallel interface (SPI) standard currently under development within the X3T10 SCSI committee. The clock speed was doubled again to 20 MHz, providing a data transfer speed of up to 40 MBps with a 16-bit data width, while maintaining backward compatibility with SCSI and SCSI-2. Although the data transfer can be done at 20 MHz (that is, 40 MBps wide), all SCSI commands are issued at 10 MHz to maintain compatibility. This means that the maximum bandwidth is less than 31 MBps, even with 64 KB blocks. Once again, with the increased speed, cable lengths were halved, to 1.5 meters maximum.

7.1.4 Ultra2 SCSI

Ultra2 SCSI uses Low Voltage Differential (LVD) signalling, which is designed to improve SCSI bus signal quality, enabling faster transfer rates and longer cable lengths. Ultra2 SCSI doubles the clock speed to 40 MHz. It employs the same concept as the older Differential SCSI specification, where two signal lines are used to transmit each of the 8 or 16 bits, one signal the negative of the other. See Figure 34. At the receiver, one signal, A+, is subtracted from the other, A- (that is, the differential is taken), which effectively removes spikes and other noise from the original signal. The result is shown in Figure 34.

Figure 34. Differential SCSI

Differential components tend to be more expensive than similar single-ended SCSI components, and differential termination requires a lot of power, generating significant heat levels. Because of the large voltage swings (20 Volts) and high power requirements, current differential transceivers cannot be integrated onto the SCSI chip, but must be additional external components.

LVD has differential's advantages of long cables and noise immunity without the power and integration problems. Because LVD uses a small (1.1 Volt) voltage swing, LVD transceivers can be implemented in CMOS, allowing them to be built into the SCSI chip, reducing cost, board area, power requirements, and heat. The use of LVD allows cable lengths of up to 12 meters.

7.1.5 Ultra3 SCSI

The maximum theoretical throughput of Ultra3 160/m SCSI can reach 160 MBps on each SCSI channel. Ultra3 160/m uses the same clock frequency as Ultra2 SCSI, but data transfers occur on both the rising and falling edges of the clock signal, effectively doubling the throughput. This feature is called double transition clocking. Note: double transition clocking requires LVD signalling. On a single-ended SCSI bus, clocking will revert to single transition mode.

If you use a mixture of Ultra3 and Ultra2 devices on an LVD-enabled SCSI bus, there is no need for all devices to run at Ultra2 speed: the Ultra3 SCSI devices will still operate at the Ultra3 (160 MBps) speed.

Additionally, Ultra3 160/m SCSI can use CRC to ensure data integrity and is therefore far more reliable than older SCSI implementations, which only support parity control. Domain validation is another feature of Ultra3 160/m SCSI. It is performed during SCSI bus initialization, and its intent is to ensure that the devices on the SCSI bus (the domain) can reliably transfer data at the negotiated speed. Only Ultra3-capable devices can use domain validation.

Note: Ultra3 160/m is a subset of Ultra3 SCSI. It supports double transition clocking, CRC and domain validation, but does not include all Ultra3 SCSI features, such as packetization or quick arbitration.

7.1.6 SCSI controllers and devices

There are two basic types of SCSI controller designs: array and non-array. A standard non-array SCSI controller allows connection of SCSI disk drives to the PCI bus. Each drive is presented to the operating system as an individual, physical drive.

Figure 35 shows a typical non-array controller. The SCSI bus (typically an internal cable) is terminated on both ends. The SCSI controller (or host adapter) normally has one of the end terminators integrated within its electronics, so only one physical terminator is required. The SCSI bus can contain different device types, such as disk, CD-ROM and tape, all on the same bus. However, most non-disk devices conform to the slower SCSI and SCSI-2 Fast standards. So, if I/O to a CD-ROM or tape drive is required, the entire SCSI bus would have to switch to the slower speed during that access, which dramatically affects performance. This would not be much of a problem if the CD-ROM is not used for production purposes (that is, the CD-ROM is not a LAN resource available to users) and the tape drive is only accessed after hours, when performance is not critical. If at all possible, we recommend you do not attach CD-ROMs or tape drives to the same SCSI bus as disk devices. Fortunately, most Netfinity servers have the standard CD-ROM on the EIDE bus.

Figure 35. Non-array SCSI configuration

The array controller, a more advanced design, contains hardware designed to combine multiple SCSI disk drives into a larger single logical drive. Combining multiple SCSI drives into a larger logical drive greatly improves I/O performance compared to single-drive performance. Most array controllers employ fault-tolerant techniques to protect valuable data in the event of a disk drive failure. Array controllers are installed in almost all servers because of these advantages.

Note: Although there are many array controller technologies, each possessing unique characteristics, this redbook includes details and tuning information specific to the IBM ServeRAID array controller.

7.2 SCSI IDs

With the introduction of SCSI-2, a total of 16 devices can be connected to a single SCSI bus. To uniquely identify each device, each is assigned a SCSI ID from 0 to 15. One of these is the SCSI controller itself, and it is assigned ID 7. Because the 16 devices share a single data channel, only one device can use the bus at a time. When two SCSI devices attempt to control the bus, the SCSI IDs determine who wins according to a priority scheme, as shown in Figure 36. The highest priority ID is that of the controller (ID 7). Next are the low-order IDs from 6 to 0, and then the high-order IDs from 15 to 8.

Figure 36. SCSI ID priority

Although this priority scheme allows backward compatibility, it can result in negative system performance if your devices are configured incorrectly. Narrow (8-bit) devices with lower IDs will automatically preempt use of the bus by the faster F/W devices with addresses greater than 7. This is especially important when CD-ROMs and tape drives are placed on the same SCSI bus as F/W disk drives.

Note: With the use of hot-swap drives, the SCSI ID is automatically set by the hot-swap backplane. Typically, the only change is whether the backplane assigns high-order IDs or low-order IDs.
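The arbitration rule above is easy to capture in a few lines. The following is a minimal sketch of the priority ordering, not controller firmware:

```python
# Sketch of the SCSI arbitration priority described above:
# ID 7 (the controller) wins over everything, then 6 down to 0, then 15 down to 8.

def scsi_priority(scsi_id: int) -> int:
    """Lower return value = higher arbitration priority."""
    if scsi_id == 7:
        return 0
    if 0 <= scsi_id <= 6:
        return 7 - scsi_id          # 6 -> 1 ... 0 -> 7
    if 8 <= scsi_id <= 15:
        return 8 + (15 - scsi_id)   # 15 -> 8 ... 8 -> 15
    raise ValueError("SCSI IDs range from 0 to 15")

contenders = [3, 10, 14]
winner = min(contenders, key=scsi_priority)
print(winner)   # 3 - a narrow device at ID 3 preempts wide devices at IDs 10 and 14
```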

7.3 Disk array controller architecture

Almost all server disk controllers implement SCSI communication between the disk controller and the disk drives. SCSI is an intelligent interface that allows simultaneous processing of multiple I/O requests. This is the single most important advantage of using SCSI controllers on servers. Servers must process multiple independent requests for I/O, and SCSI's ability to concurrently process many different I/O operations makes it the optimal choice for servers.

SCSI array controllers consist of the following primary components:

- PCI bus interface/controller
- SCSI bus controller(s) and SCSI bus(es)
- Microprocessor
- Memory (microprocessor code and cache buffers)
- Internal bus (connects PCI interface, microprocessor, and SCSI controllers)

Figure 37. Architecture of a disk array controller

7.4 Disk array controller operation

The SCSI-based disk array controller is a PCI busmaster initiator with the capability to master the PCI bus to gain direct access to server main memory. The following sequence outlines the fundamental operations that occur when a disk-read operation is performed:

1. The server operating system generates a disk I/O read operation by building an I/O control block command in memory. The I/O control block contains the READ command, a disk address called a Logical Block Address (LBA), a block count or length, and the main memory address where the read data from disk is to be placed (destination address).

2. The operating system generates an interrupt to tell the disk array controller that it has an I/O operation to perform. This interrupt initiates execution of the disk device driver. The disk device driver (executing on the server's CPU) addresses the disk array controller and sends it the address of the I/O control block and a command instructing the disk array controller to fetch the I/O control block from memory.

3. The disk array controller initiates a PCI bus transfer to copy the I/O control block from server memory into its local adapter memory. The on-board microprocessor executes instructions to decode the I/O control block command, to allocate buffer space in adapter memory to temporarily store the read data, and to program the SCSI controller chip to initiate access to the SCSI disks containing the read data. The SCSI controller chip is also given the address of the adapter memory buffer that will be used to temporarily store the read data.

4. At this point, the SCSI controller arbitrates for the SCSI bus, and when bus access is granted, a read command, along with the length of data to be read, is sent to the SCSI drives that contain the read data. The SCSI controller disconnects from the SCSI bus and waits for the next request.

5. The target SCSI drive begins processing the read command by initiating the disk head movement to the track containing the read data (called a seek operation). The average seek time for current high-performance SCSI drives is about 5 to 7 milliseconds. This time is derived by measuring the average amount of time it takes to position the head randomly from any track to any other track on the drive. The actual seek time for each operation can be significantly longer or shorter than the average. In practice, the seek time depends upon the distance the disk head must move to reach the track containing the read data.

6. After the seek time elapses, and the head reaches its destination track, the head begins to read a servo track (adjacent to the data track). A servo track is used to direct the disk head to accurately follow the minute variations of the data signal encoded within the disk surface. The disk head also begins to read the sector address information to identify the rotational position of the disk surface. This allows the head to know when the requested data is about to rotate underneath the head. The time that elapses between the point when the head settles and is able to read the data track, and the point when the read data arrives, is called the rotational latency. Most disk drives have a specified average rotational latency, which is half the time it takes to traverse one complete revolution. It is half the rotational time because, on average, the head will have to wait half a revolution to access any block of data on a track. The average rotational latency of a 7200 RPM drive is about 4 milliseconds, whereas the average rotational latency of a 10,000 RPM drive is about 3 milliseconds. The actual latency depends upon the angular distance to the read data when the seek operation completes and the head can begin reading the requested data track.

7. When the read data becomes available to the read head, it is transferred from the head into a buffer contained on the disk drive. Usually this buffer is large enough to contain a complete track of data.

8. The disk drive has the ability to be a SCSI bus initiator or a SCSI bus target (similar terminology to that used for PCI). Now the controller logic in the disk drive arbitrates to gain access to the SCSI bus, as an initiator. When the bus becomes available, the disk drive begins to burst the read data into buffers on the adapter SCSI controller chip. The adapter SCSI controller chip then initiates a DMA (direct memory access) operation to move the read data into a cache buffer in array controller memory.

9. When the transfer of read data into disk array cache memory is complete, the disk controller becomes an initiator and arbitrates to gain access to the PCI bus. Using the destination address that was supplied in the original I/O control block as the target address, the disk array controller performs a PCI data transfer (memory write operation) of the read data into server main memory.

10. When the entire read transfer to server memory has completed, the disk array controller generates an interrupt to communicate completion status to the disk device driver. This interrupt informs the operating system that the read operation has completed.

7.5 RAID summary

Most of us have heard of RAID (redundant array of independent disks) technology. Unfortunately, there is still significant confusion about how RAID actually works and about the performance implications of each RAID strategy. Therefore, this section presents a brief overview of RAID and the performance issues as they relate to commercial server environments.

RAID was created by computer scientists at the University of California at Berkeley to address the huge gap between computer I/O requirements and single disk drive latency and throughput. RAID is a collection of techniques that treat multiple, inexpensive disk drives as a unit, with the object of improving performance and/or reliability. IBM and the IT industry have also introduced more RAID levels to meet industry demand. The following RAID strategies are defined by the Berkeley scientists, IBM and the IT industry:

Table 12. RAID summary

RAID level   Fault tolerant?   Description
RAID-0       No                All data evenly distributed across all drives.
RAID-1       Yes               A mirrored copy of one drive to another drive (2 disks).
RAID-1E      Yes               Mirrored copies of each drive.
RAID-3       Yes               Single checksum drive. Bits of data are striped across N-1 drives.
RAID-4       Yes               Single checksum drive. Blocks of data are striped across N-1 drives.
RAID-5       Yes               Distributed checksum. Both data and parity are striped across all drives.
RAID-5E      Yes               Distributed checksum and hot-spare. Data, parity and hot-spare space are striped across all drives.
RAID-10      Yes               Mirrored copies of RAID-0 arrays.

RAID-3 is useful for scientific applications that require increased byte throughput. It has very poor random access characteristics and is not generally used in commercial applications. RAID-4 uses a single checksum drive that becomes a significant bottleneck in random commercial applications. It is not likely to be used by a significant number of customers because of its slow performance.

The RAID strategies supported by the IBM ServeRAID adapter are:

- RAID-0
- RAID-1
- RAID-1E
- RAID-5
- RAID-5E
- Composite RAID levels, such as RAID-10 and RAID-50

7.5.1 RAID-0

RAID-0 is a technique that stripes data evenly across all disk drives in the array. Strictly, it is not a RAID level, as no redundancy is provided. On average, accesses will be random, thus keeping each drive equally busy.

SCSI has the ability to process multiple, simultaneous I/O requests, and I/O performance is improved because all drives can contribute to system I/O throughput. Since RAID-0 has no fault tolerance, when a single drive fails, the entire array becomes unavailable. RAID-0 offers the fastest performance of any RAID strategy for random commercial workloads. RAID-0 also has the lowest cost of implementation because no redundant drives are required.

Figure 38. RAID-0: All data evenly distributed across all drives, but there is no fault tolerance

7.5.2 RAID-1

RAID-1 provides fault tolerance by mirroring one drive to another drive. The mirror drive ensures access to data should a drive fail. RAID-1 also has good I/O throughput performance compared to single-drive configurations because read operations can be performed on any data record on any drive contained within the array. Most array controllers (including the ServeRAID family) do not attempt to optimize read latency by issuing the same read request to both drives in the mirrored pair. Instead, the drive in the pair that is least busy is issued the read command, leaving the other drive free to perform another read operation. This technique ensures maximum read throughput. Write performance is somewhat reduced because both drives in the mirrored pair must complete the write operation; that is, two physical write operations must occur for each write command generated by the operating system.
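The least-busy dispatch idea can be sketched as follows. The queue-depth model and drive names are illustrative simplifications, not the actual ServeRAID firmware logic:

```python
# Minimal sketch of least-busy read dispatch in a mirrored pair, as described above.
# The queue-depth model is a simplification, not the actual ServeRAID firmware logic.

queue_depth = {"drive A": 0, "drive B": 0}   # outstanding commands per drive in the pair

def dispatch_read() -> str:
    """Send a read to whichever drive of the pair has fewer outstanding I/Os."""
    drive = min(queue_depth, key=queue_depth.get)
    queue_depth[drive] += 1
    return drive

def dispatch_write() -> list[str]:
    """A write must be performed by both drives, which is why write throughput is lower."""
    for drive in queue_depth:
        queue_depth[drive] += 1
    return list(queue_depth)

print([dispatch_read() for _ in range(4)])   # ['drive A', 'drive B', 'drive A', 'drive B']
print(dispatch_write())                      # ['drive A', 'drive B'] - two physical writes
```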

RAID-1 offers significantly better I/O throughput performance than RAID-5. However, RAID-1 is somewhat slower than RAID-0.

Figure 39. RAID-1: Fault-tolerant. A mirrored copy of one drive to another drive

7.5.3 RAID-1E

RAID-1 Enhanced (referred to as RAID-1E throughout the rest of this document) is only implemented by the IBM ServeRAID adapter and allows a RAID-1 array to consist of three or more disk drives. Regular RAID-1 consists of exactly two drives. The data stripe is spread across all disks in the array to maximize the number of spindles involved in an I/O request and so achieve maximum performance. RAID-1E is also called mirrored stripe, as a complete stripe of data is mirrored to another stripe within the set of disks. Like RAID-1, only half of the total disk space is usable; the other half is used by the mirror.

Figure 40. RAID-1E: Mirrored copies of each drive

Because you can have more than two drives (up to 16), RAID-1E will outperform RAID-1. The only situation where RAID-1 will perform better than RAID-1E is the reading of sequential data. When RAID-1E reads sequential data, the data is striped across multiple drives; because RAID-1E interleaves data on different drives, seek operations occur more frequently during sequential I/O. In RAID-1, data is not interleaved, so fewer seek operations occur for sequential I/O.
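Figure 40 corresponds to a layout in which each stripe unit goes to the next drive in turn, and its mirror copy to the drive after that. The following sketch reproduces that layout; it illustrates the mirrored-stripe concept and is not necessarily the exact ServeRAID placement algorithm:

```python
# Illustrative RAID-1E "mirrored stripe" layout for N drives (N >= 3), matching Figure 40:
# a data stripe is written across all drives, then the same stripe is written again,
# shifted by one drive, so every unit has a copy on a different spindle.

def raid1e_layout(num_units: int, num_drives: int) -> list[list[str]]:
    drives = [[] for _ in range(num_drives)]
    for unit in range(1, num_units + 1):
        data_drive = (unit - 1) % num_drives
        mirror_drive = (data_drive + 1) % num_drives
        drives[data_drive].append(str(unit))        # primary copy
        drives[mirror_drive].append(f"{unit}'")     # mirror copy on the next drive
    return drives

for i, contents in enumerate(raid1e_layout(6, 3)):
    print(f"drive {i}: {contents}")
# drive 0: ['1', "3'", '4', "6'"]
# drive 1: ["1'", '2', "4'", '5']
# drive 2: ["2'", '3', "5'", '6']
```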

7.5.4 RAID-5

RAID-5 offers an optimal balance between price and performance for most commercial server workloads. RAID-5 provides single-drive fault tolerance by implementing a technique called single equation, single unknown. This technique says that if any single term in an equation is unknown, the equation can be solved to exactly one solution. The RAID-5 controller calculates a checksum (the parity stripe in Figure 41) using a logic function known as an exclusive-or (XOR) operation. The checksum is the XOR of all data elements in a row. The XOR can be performed quickly by the RAID controller hardware and is used to solve for the unknown data element. In Figure 41, addition is used instead of XOR to illustrate the technique: stripe 1 + stripe 2 + stripe 3 = parity stripe 1-3. Should drive one fail, stripe 1 becomes unknown and the equation becomes X + stripe 2 + stripe 3 = parity stripe 1-3. The controller solves for X and returns stripe 1 as the result.

A significant benefit of RAID-5 is the low cost of implementation, especially for configurations requiring a large number of disk drives. To achieve fault tolerance, only one additional disk is required. The checksum information is evenly distributed over all drives, and checksum update operations are evenly balanced within the array.

Figure 41. RAID-5: Both data and parity are striped across all drives

However, RAID-5 yields lower I/O throughput than RAID-0 and RAID-1. This is due to the additional checksum calculation and write operations required. In general, I/O throughput with RAID-5 is 30-50% lower than with RAID-1. (The actual result depends upon the percentage of write operations.) A workload with a greater percentage of write requests generally has lower RAID-5 throughput. RAID-5 will provide I/O throughput performance similar to RAID-0 when the workload does not require write operations (read only). For more information on RAID-5 performance, see 7.6, "ServeRAID RAID-5 algorithms" on page 136.
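The XOR arithmetic behind RAID-5 can be shown in a few lines. This is a minimal sketch of parity generation and single-drive reconstruction, not controller code:

```python
# Minimal sketch of RAID-5 parity: the parity block is the XOR of the data blocks in a
# stripe, and any single missing block can be recovered by XORing everything that is left.

from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    return bytes(reduce(lambda a, b: a ^ b, pair) for pair in zip(*blocks))

stripe1 = b"\x11\x22\x33\x44"
stripe2 = b"\x55\x66\x77\x88"
stripe3 = b"\x99\xaa\xbb\xcc"

parity = xor_blocks(stripe1, stripe2, stripe3)   # written to the parity drive

# Drive holding stripe1 fails: solve the "single unknown" from the surviving blocks.
rebuilt = xor_blocks(stripe2, stripe3, parity)
print(rebuilt == stripe1)                        # True
```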

7.5.5 RAID-5E

Figure 42. RAID-5E: The hot spare is integrated into all disks, instead of a separate disk

RAID-5E was invented by IBM Research and is a technique that distributes the hot-spare drive space over the N+1 drives comprising the RAID-5 array plus the standard hot-spare drive. It was first implemented in ServeRAID firmware V3.5.

Adding a hot-spare drive to a server protects data by reducing the time spent in the critical state. However, this technique does not make maximum use of the hot-spare drive because it sits idle until a failure occurs. Often many years can elapse before the hot-spare drive is ever used. IBM invented a method to utilize the hot-spare drive to increase performance of the RAID-5 array during typical processing while preserving the hot-spare recovery technique. This method of incorporating the hot spare into the RAID array is called RAID-5E.

RAID-5E is designed to increase the normal operating performance of a RAID-5 array in two ways:

- The hot-spare drive contains data that can be accessed during normal operation. The RAID-5 array now has an extra drive to contribute to the throughput of read and write operations. Standard 10,000 RPM drives can perform more than 100 I/O operations per second, so the RAID-5 array throughput is increased by this extra I/O capability.
- The data in RAID-5E is distributed over N+1 drives instead of N as is done for RAID-5. As a result, the data occupies fewer tracks on each drive. This has the effect of physically utilizing less space on each drive, keeping the head movement more localized and reducing seek times.

Together, these improvements yield a typical system-level performance gain of about 10-20%.

Another benefit of RAID-5E is the faster rebuild time needed to reconstruct a failed drive. In a standard RAID-5 hot-spare configuration, the rebuild of a failed drive requires serialized write operations to the single hot-spare drive. Using RAID-5E, the hot-spare drive space is evenly distributed across all drives, so the rebuild operations are evenly distributed to all remaining drives in the array.

Rebuild times with RAID-5E can be dramatically faster than rebuild times using a standard hot-spare configuration.

The only downside of RAID-5E is that the hot-spare drive cannot be shared across multiple physical arrays, as can be done with standard RAID-5 plus hot-spare. The standard RAID-5 plus hot-spare technique is more cost efficient for multiple arrays because it allows a single hot-spare drive to provide coverage for multiple physical arrays. This reduces the cost of using a hot-spare drive, but the sacrifice is the inability to handle separate drive failures within different arrays. IBM ServeRAID adapters offer increased flexibility by providing the choice to use either standard RAID-5 with hot-spare or the newer integrated hot-spare provided with RAID-5E.

While RAID-5E provides a performance improvement for most operating environments, there is a special case where its performance can be slower than RAID-5. Consider a three-drive RAID-5 with hot-spare configuration, as shown in Figure 43. This configuration employs a total of four drives, but the hot-spare drive is idle, so for a performance comparison it can be ignored. A four-drive RAID-5E configuration would have data and checksum on four separate drives.

Figure 43. Writing a 16 KB block to a RAID-5 array with an 8 KB stripe size

Referring to Figure 43, whenever a write operation is issued to the controller that is two times the stripe size (for example, a 16 KB I/O request to an array with an 8 KB stripe size), a three-drive RAID-5 configuration would not require any reads, because the write operation would contain all the data needed for each of the two data drives. The checksum would be generated by the array controller (step 2) and immediately written to the corresponding drive (step 4) without the need to read any existing data or checksum.

This entire series of events would require two data writes, one to each of the drives storing the data stripe (step 3), and one write to the drive storing the checksum (step 4), for a total of three write operations.

Contrast these events with the operation of a comparable RAID-5E array, which contains four drives, as shown in Figure 44. In this case, in order to calculate the checksum, a read must be performed of the data stripe on the extra drive (step 2). This extra read was not performed with the three-drive RAID-5 configuration, and it slows the RAID-5E array for write operations that are twice the stripe size.

Figure 44. Writing a 16 KB block to a RAID-5E array with an 8 KB stripe size

This problem with RAID-5E can be avoided with proper stripe size selection. By monitoring the average I/O size in bytes, or knowing the I/O size generated by the application, a large enough stripe size can be selected so that this performance degradation rarely occurs.

7.5.6 Composite RAID levels

The ServeRAID-4 adapter family supports composite RAID levels. This means that it supports RAID arrays that are joined together to form larger RAID arrays. For example, RAID-10 is the result of forming a RAID-0 array from two or more RAID-1 arrays.

With four SCSI channels each supporting 15 drives, this means you can theoretically have up to 60 drives in one array.

With the EXP200, the limit is 40 disks, and with the EXP300, the limit is 56 disks.

A ServeRAID RAID-10 array is shown in Figure 45.

Figure 45. RAID-10: A striped set of RAID-1 arrays

Likewise, a striped set of RAID-5 arrays is shown in Figure 46.

Figure 46. RAID-50: A striped set of RAID-5 arrays
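Usable capacity for the simple and composite levels discussed here follows from how much space each level gives up to mirroring, parity or hot-spare space. A small sketch, expressed in drives' worth of usable space for n equally sized drives (illustrative only; the RAID-50 figure assumes two RAID-5 sub-arrays):

```python
# Sketch of usable data capacity (in drives' worth of space) for the RAID levels discussed
# in this chapter, given n equally sized drives. Illustrative, not a configuration tool.

def usable_drives(level: str, n: int) -> float:
    if level == "RAID-0":
        return n            # striping only, no redundancy
    if level in ("RAID-1", "RAID-1E", "RAID-10"):
        return n / 2        # everything is mirrored
    if level == "RAID-5":
        return n - 1        # one drive's worth of space holds distributed parity
    if level == "RAID-5E":
        return n - 2        # distributed parity plus distributed hot-spare space
    if level == "RAID-50":
        return n - 2        # assumes two RAID-5 sub-arrays, each giving up one drive to parity
    raise ValueError(level)

for level in ("RAID-0", "RAID-10", "RAID-5", "RAID-5E"):
    print(level, usable_drives(level, 8))   # 8, 4.0, 7, 6
```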

The ServeRAID-4 family supports the following combinations:

Table 13. Composite RAID levels supported by ServeRAID-4 adapters

RAID level   Sub-logical array   Spanned array
RAID-00      RAID-0              RAID-0
RAID-10      RAID-1              RAID-0
RAID-1E0     RAID-1E             RAID-0
RAID-50      RAID-5              RAID-0

Table 14 shows a summary of the performance characteristics of the RAID levels commonly used in array controllers. A comparison is also made between small and large I/O data transfers.

Table 14. Summary of RAID performance characteristics

RAID level    Data capacity (1)
Single disk   n
RAID-0        n
RAID-1        n/2
RAID-1E       n/2
RAID-5        n-1
RAID-5E       n-2
RAID-10       n/2

(The original table also rates each level for sequential and random I/O performance, read and write, and for data availability with and without a hot spare.)

Notes:
1. In the data capacity column, n refers to the number of equally sized disks in the array.
2. Ratings are relative (1 = worst). You should only compare values within each column; comparisons between columns are not valid for this table.
3. With the write-back setting enabled.

7.6 ServeRAID RAID-5 algorithms

The IBM ServeRAID adapter uses one of two algorithms for the calculation of RAID-5 parity. These algorithms ensure the best performance of RAID-5 write operations, regardless of the number of drives in the array:

- Use read/modify write for RAID-5 arrays of five drives or more.
- Use full XOR for RAID-5 arrays of three or four drives.

This section compares these two algorithms.

7.6.1 Read/modify write algorithm

The read/modify write algorithm is optimized for configurations that use more than four drives. The RAID-5 read/modify write algorithm is described in Figure 47. This algorithm always requires four disk operations to be performed for each write command, regardless of the number of drives in the RAID-5 array. As per Figure 47, the steps that occur are:

1. Read old data (data1)
2. Read old checksum (CS6)
3. Calculate the new checksum from the old data, new data and old checksum
4. Write new data (data4)
5. Write new checksum (CS9)

Figure 47. Read/modify write algorithm: four I/O operations for every write command

Regardless of the number of drives, with the read/modify write algorithm, a write command will always require four I/O operations: two reads and two writes. The algorithm is called read/modify write because it reads the checksum, modifies the checksum, then writes the checksum.
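The checksum arithmetic in step 3 can be written down compactly: the new parity is the old parity XORed with both the old and the new data. A sketch, using single-byte blocks for brevity:

```python
# Sketch of the parity arithmetic used by the read/modify write algorithm above.
# Single-byte "blocks" keep the example short; real stripe units are 8-64 KB.

def rmw_parity(old_data: int, new_data: int, old_parity: int) -> int:
    """New checksum = old checksum XOR old data XOR new data (steps 1-3 above)."""
    return old_parity ^ old_data ^ new_data

data1, data2, data3 = 0x0A, 0x0B, 0x0C
cs6 = data1 ^ data2 ^ data3           # parity as originally written

data4 = 0x5A                          # update data1 -> data4
cs9 = rmw_parity(data1, data4, cs6)   # two reads (data1, CS6), then two writes (data4, CS9)

print(cs9 == data4 ^ data2 ^ data3)   # True: same result as recomputing parity from scratch
```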

7.6.2 Full XOR algorithm

A different method can be used to generate RAID-5 checksum information for a write operation that modifies data1 to be data4. This method is called the full exclusive-or algorithm (full XOR algorithm). It involves disk read operations of data2 and data3. The full XOR algorithm then creates a new checksum from data4 + data2 + data3, writes the modified data (data4), and overwrites the old checksum with the new checksum (CS9). In this case, four disk operations are performed. The following operations (as per Figure 48) show the steps involved in the full XOR algorithm:

1. Read data2
2. Read data3
3. Calculate the new checksum (CS9) from the new data (data4), data2 and data3
4. Write data4
5. Write checksum (CS9)

Figure 48. Full XOR algorithm

In this case, four disk operations are performed: two reads and two writes. If the number of disks in the array increases, then the number of read operations also increases:

- Five disks: five I/O operations (three reads and two writes)

- Six disks: six I/O operations (four reads and two writes)
- n disks: n I/O operations (n-2 reads and two writes)

The extra read operations required by this algorithm cause the performance of write commands to degrade as the number of drives increases. The algorithm is called full XOR because of the way the checksum is calculated: the checksum is calculated from all the data, and then the calculated checksum is written to disk. The original checksum is not used in the calculation. However, for three disks, only three I/O operations are required: one read and two writes. Thus the following conclusions can be reached:

- For 3-drive RAID-5 arrays, full XOR is faster.
- For 4-drive RAID-5 arrays, the algorithms are the same.
- For 5+ drive RAID-5 arrays, read/modify write is faster.

For a four-drive configuration, the full XOR algorithm requires the same number of disk operations as the read/modify write algorithm. A RAID-5 configuration using five drives would require four disk operations for the read/modify write algorithm, but five disk operations for the full XOR algorithm. Consequently, the number of disk operations for the full XOR algorithm increases as the number of drives configured in a RAID-5 array increases, and the extra read operations cause the performance of write operations to degrade. To take advantage of this, Version 2.3 of the ServeRAID firmware introduced a technique which uses the better of these two algorithms depending on the number of drives in the array: it uses full XOR when the adapter is configured with three or four drives in a RAID-5 array, and read/modify write when the adapter is configured with five or more drives.

7.6.3 Sequential write commands

The behavior of these two algorithms also affects sequential write commands. When the ServeRAID adapter is configured for RAID-5 and the server I/O consists of sequential write operations (for example, when copying files to the server or when building a database), additional performance benefits can be achieved by using the full XOR algorithm together with a write-back cache policy. (The benefits of write-back cache are discussed in 7.7.7, "Disk cache write-back versus write-through" on page 153.)

The ServeRAID firmware V2.7 has the intelligence to detect this type of I/O and switches to full XOR. This causes each data element, data1, data2, data3, and the checksum, to be stored in the ServeRAID adapter cache after the first operation to update data1 to data4. In write-back mode, the updates to data2, data3 and the successive updates to the checksum can all be accomplished in cache memory. After the entire group of stripe elements is sequentially updated in cache memory, only three disk operations are required to store the updated data2, data3, and checksum information on disk. This feature of the ServeRAID can improve database load times in RAID-5 mode by up to eight times over earlier ServeRAID firmware levels.

7.7 Factors affecting disk array controller performance

Many factors affect array controller performance. The most important considerations (in order of importance) for configuring the IBM ServeRAID adapter are:

- RAID strategy
- Number of drives
- Drive performance
- Logical drive configuration
- Stripe size
- SCSI bus organization and speed
- Disk cache write-back versus write-through
- RAID adapter cache size
- Device drivers
- Firmware

7.7.1 RAID strategy

Your RAID strategy should be carefully selected because it significantly affects disk subsystem performance. Figure 49 illustrates the performance differences between RAID-0, RAID-1E and RAID-5 for a server configured with 10,000 RPM Fast/Wide SCSI-2 drives and the IBM ServeRAID-3HB adapter with v3.6 code. The chart shows the RAID-0 configuration delivering about 97% greater throughput than RAID-5 and 35% greater throughput than RAID-1E. RAID-0 has no fault tolerance and is, therefore, best utilized for read-only environments where downtime for possible backup recovery is acceptable. RAID-1E or RAID-5 should be selected for applications requiring fault tolerance.

RAID-1E is usually selected when the number of drives is low (fewer than six) and the price of purchasing additional drives is acceptable. RAID-1E offers about 45% more throughput than RAID-5. These performance considerations should be understood before selecting a fault-tolerant RAID strategy.

Figure 49. Comparing RAID levels (I/O operations per second against number of drives; configuration: Windows NT Server 4.0, ServeRAID-3HB with firmware/driver v3.6, maximum number of drives, 10,000 RPM drives, 8 KB I/O size, random I/O mix of 67/33 read/write)

In many cases, RAID-5 is the best choice because it provides the best price and performance combination for configurations requiring a capacity of five or more disk drives. RAID-5 performance approaches RAID-0 performance for workloads where the frequency of write operations is low. Servers executing applications that require fast read access to data and high availability in the event of a drive failure should employ RAID-5. For more information about RAID-5 performance, see 7.6, "ServeRAID RAID-5 algorithms" on page 136.

7.7.2 Number of drives

The number of disk drives significantly affects performance because each drive contributes to total system throughput. Capacity requirements are often the only consideration used to determine the number of disk drives configured in a server. Throughput requirements are usually not well understood or are completely ignored. Capacity is used because it is easily estimated and is often the only information available.
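A rough throughput-based sizing is still possible. The sketch below divides a target random I/O rate by what a single spindle can deliver, after allowing for the extra physical operations each RAID level generates on writes. The 100 I/O operations per second per 10,000 RPM drive figure and the write penalties (two physical writes per logical write for mirroring, four disk operations for a RAID-5 read/modify write) come from elsewhere in this chapter; treat the result as a starting point only:

```python
import math

# Back-of-the-envelope spindle-count estimate from a target random I/O rate.
# Per-drive IOPS and write penalties are taken from the discussion in this chapter.

WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-1E": 2, "RAID-10": 2, "RAID-5": 4}

def drives_needed(target_iops: float, write_fraction: float,
                  raid_level: str, per_drive_iops: float = 100.0) -> int:
    reads = target_iops * (1 - write_fraction)
    writes = target_iops * write_fraction * WRITE_PENALTY[raid_level]
    return math.ceil((reads + writes) / per_drive_iops)

# 1000 logical I/Os per second with a 67/33 read/write mix (the mix used in Figure 49):
for level in ("RAID-0", "RAID-1E", "RAID-5"):
    print(level, drives_needed(1000, 0.33, level))   # RAID-0 10, RAID-1E 14, RAID-5 20
```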

The result of sizing by capacity alone is a server configured with sufficient disk space, but insufficient disk performance to keep users working efficiently. High-capacity drives have the lowest price per byte of available storage and are usually selected to reduce total system price. This often results in disappointing performance, particularly if the total number of drives is insufficient.

It is difficult to accurately specify server application throughput requirements when attempting to determine the disk subsystem configuration. Disk subsystem throughput measurements are complex. Expressing a user requirement in terms of bytes per second would be meaningless because the disk subsystem's byte throughput changes as the database grows and becomes fragmented, and as new applications are added. The best way to understand disk I/O and users' throughput requirements is to monitor an existing server. Tools such as the Windows 2000 Performance console can be used to examine the logical drive queue depth and disk transfer rate (as described in Chapter 11, "Windows 2000 Performance console" on page 221). Logical drives that have an average queue depth much greater than the number of drives in the array are very busy. This indicates that performance would be improved by adding drives to the array.

Adding drives

In general, adding drives is one of the most effective changes that can be made to improve server performance. Measurements show that server throughput for most server application workloads increases as the number of drives configured in the server is increased. As the number of drives is increased, performance is usually improved for all RAID strategies. Server throughput continues to increase each time drives are added to the server. This can be seen in Figure 50.

Figure 50. Improving performance by adding drives to arrays (transactions per second for RAID-0 arrays with increasing numbers of drives, for example 6 and 8 drives; configuration: Windows NT 4.0, SQL Server 6.5, ServeRAID II, 4.5 GB 7200 RPM drives)

This trend will continue until another server component becomes the bottleneck. In general, most servers are configured with an insufficient number of disk drives. Therefore, performance increases as drives are added. Similar gains can be expected for all I/O-intensive server applications such as office-application file serving, Lotus Notes, Oracle, DB2 and Microsoft SQL Server.

Rule of thumb

For most server workloads, when the number of drives in the active logical array is doubled, server throughput will improve by about 50%, until other bottlenecks occur.

If you are using one of the IBM ServeRAID family of RAID adapters, you can use the logical drive migration feature to add drives to existing arrays without disrupting users or losing data.

7.7.3 Drive performance

Drive performance contributes to overall server throughput because faster drives perform disk I/O in less time. There are four major components to the time it takes a disk drive to execute and complete a user request:

Command overhead
This is the time it takes for the drive's electronics to process the I/O request. The time depends on whether it is a read or write request and whether the command can be satisfied from the drive's buffer. This value is of the order of 0.1 ms for a buffer hit to 0.5 ms for a buffer miss.

Seek time
This is the time it takes to move the drive head from its current cylinder location to the target cylinder. As the radius of the drives has been decreasing, and drive components have become smaller and lighter, so too has the seek time been decreasing. Average seek time is usually 5-7 ms for most current SCSI-2 drives used in servers today.

Rotational latency
Once the head is at the target cylinder, the time it takes for the target sector to rotate under the head is called the rotational latency. Average latency is half the time it takes the drive to complete one rotation, so it is inversely proportional to the RPM value of the drive:

- 5400 RPM drives have a 5.6 ms latency
- 7200 RPM drives have a 4.2 ms latency
- 10,000 RPM drives have a 3.0 ms latency

Data transfer time
This value depends on the media data rate, which is how fast data can be transferred from the magnetic recording media, and the interface data rate, which is how fast data can be transferred between the disk drive and the disk controller (that is, the SCSI transfer rate). The media data rate improves as a result of greater recording density and faster rotational speeds. A typical value is 0.8 ms. The interface data rate for SCSI-2 F/W is 20 MBps. With 4 KB I/O transfers (which are typical for Windows NT Server and Windows 2000), the interface data transfer time is 0.2 ms. Hence the data transfer time is approximately 1 ms.

As you can see, the significant values that affect performance are the seek time and the rotational latency. For random I/O (which is normal for a multi-user server) this is true, and seek time will continue to improve as the physical dimensions of the drive components become smaller. For sequential I/O (such as servers with small numbers of users requesting large amounts of data) or for I/O requests with large block sizes (for example 64 KB), the data transfer time does become important when compared to seek and latency, so the use of Ultra SCSI, Ultra2 SCSI or Ultra3 SCSI can have a significant positive effect on overall subsystem performance.
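Putting these four components together gives a rough per-request service time. The sketch below uses the typical figures quoted in this section; real drives and workloads vary:

```python
# Sketch of the service-time arithmetic described above, using the typical component
# values quoted in this section. Real drives vary; these figures are illustrative only.

def avg_rotational_latency_ms(rpm: int) -> float:
    """Half of one full revolution, in milliseconds."""
    return 60_000 / rpm / 2

def service_time_ms(seek_ms: float, rpm: int,
                    transfer_ms: float = 1.0, overhead_ms: float = 0.5) -> float:
    """Command overhead + seek + rotational latency + data transfer."""
    return overhead_ms + seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms

print(round(avg_rotational_latency_ms(7200), 2))    # 4.17
print(round(avg_rotational_latency_ms(10_000), 2))  # 3.0

# A random 4 KB request on a 10,000 RPM drive with a 5 ms seek and ~1 ms transfer time:
print(round(service_time_ms(seek_ms=5.0, rpm=10_000), 2))   # 9.5 - dominated by seek and latency
```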

Similarly, when caching and read-ahead are employed on the drives themselves, the seek and rotation times are eliminated, so the data transfer time becomes very significant.

The easiest way to improve disk performance is to increase the number of accesses that can be made simultaneously. This is achieved by using many drives in a RAID array and spreading the data requests across all drives, as described in 7.7.2, "Number of drives" on page 141.

Table 15 shows the seek and latency values and buffer sizes for three of IBM's high-end drives.

Table 15. Comparing 10,000 and 7200 RPM drives

Disk             Capacity   RPM      Seek      Latency    Buffer size
Ultrastar 36LP   18.3 GB    7200     -         4.17 ms    4 MB
Ultrastar 36LZ   18.3 GB    10,000   4.9 ms    2.99 ms    4 MB
Ultrastar 72ZX   73.4 GB    10,000   5.3 ms    2.99 ms    16 MB

(The original table also lists each drive's media data transfer rate in MBps.)

7.7.4 Logical drive configuration

Using multiple logical drives on a single physical array is convenient for managing the location of different file types. However, depending on the configuration, it can significantly reduce server performance. When you use multiple logical drives, you are physically spreading the data across different sections of the array disks. If I/O is performed to each of the logical drives, the disk heads have to seek further across the disk surface than when the data is stored on one logical drive. Using multiple logical drives greatly increases seek time and can slow performance by as much as 25%. An example of this is creating two logical drives in the one RAID array and putting a database on one logical drive and the transaction log on the other. Because heavy I/O is being performed on both, the performance will be poor. If the two logical drives are configured with the operating system on one and data on the other, then there should be little I/O to the operating system code once the server has booted, so this type of configuration would be acceptable.

It is best to put the page file on the same drive as the data when using one large physical array. This is counterintuitive: most think the page file should be put on the operating system drive, since the operating system will not see much I/O during runtime. However, this causes long seek operations as the head swings over the two partitions.

Putting the data and page file on the data array keeps the I/O localized and reduces seek time. Of course, this is not the optimal case, especially for applications with heavy paging. Ideally, the page drive would be a separate device that can be formatted with the correct stripe size to match paging. In general, most applications will not page when given sufficient RAM, so usually this is not a problem.

The fastest configuration is a single logical drive for each physical RAID array. Instead of using logical drives to manage files, you should create directories and store each type of file in a different directory. This will significantly improve disk performance by reducing seek times, because the data will be as physically close together as possible. If you really want or need to partition your data and you have a sufficient number of disks, you should configure multiple RAID arrays instead of configuring multiple logical drives in one RAID array. This will improve disk performance; seek time will be reduced because the data will be physically closer together on each drive.

Note: If you plan to use RAID-5E arrays, you can only have one logical drive per array.

7.7.5 Stripe size

With RAID technology, data is striped across an array of hard disk drives. Striping is the process of storing data across all the disk drives that are grouped in an array. The granularity at which data from one file is stored on one drive of the array before subsequent data is stored on the next drive of the array is called the stripe unit (also referred to as the interleave depth). For the ServeRAID adapter family, the stripe unit can be set to a size of 8 KB, 16 KB, 32 KB, or 64 KB. With Netfinity Fibre Channel, a stripe unit is called a segment, and segment sizes can also be 8 KB, 16 KB, 32 KB, or 64 KB. The collection of these stripe units, from the first drive of the array to the last drive of the array, is called a stripe. The stripe and stripe unit are shown in Figure 51.

Figure 51. RAID stripes and stripe units

Note: The term stripe size should really be stripe unit size, since it refers to the length of the stripe unit (the piece of space on each drive in the array).

Using stripes of data balances the I/O requests within the logical drive. On average, each disk will perform an equal number of I/O operations, thereby contributing to overall server throughput. Stripe size has no effect on the total capacity of the logical disk drive.

Selecting the correct stripe size

The selection of stripe size affects performance. In general, the stripe size should be at least as large as the median disk I/O request size generated by server applications. Selecting too small a stripe size can reduce performance: the server application requests data that is larger than the stripe size, which results in two or more drives being accessed for each I/O request. Ideally, only a single disk I/O occurs for each I/O request. Selecting too large a stripe size can also reduce performance, because a larger-than-necessary disk operation might constantly slow each request. This is a problem particularly with RAID-5, where the complete stripe must be read from disk to calculate a checksum. Use too large a stripe, and extra data must be read each time the checksum is updated.

Selecting the correct stripe size is a matter of understanding the predominant request size performed by a particular application. Few applications use a single request size for each and every I/O request. Therefore, it is not possible to always have the ideal stripe size. However, there is always a best-compromise stripe size that will result in optimal I/O performance.
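The effect of stripe unit size on the number of drives touched per request can be sketched as follows. The mapping is a simplified RAID-0-style layout that ignores parity placement:

```python
# Sketch of how a logical I/O maps onto stripe units in an array, illustrating why a
# request larger than the stripe unit size touches more than one drive.

def drives_touched(offset_kb: int, length_kb: int,
                   stripe_unit_kb: int, num_drives: int) -> set[int]:
    first_unit = offset_kb // stripe_unit_kb
    last_unit = (offset_kb + length_kb - 1) // stripe_unit_kb
    return {unit % num_drives for unit in range(first_unit, last_unit + 1)}

# A 16 KB request against a 4-drive array:
print(drives_touched(0, 16, stripe_unit_kb=8, num_drives=4))    # {0, 1} - two physical I/Os
print(drives_touched(0, 16, stripe_unit_kb=16, num_drives=4))   # {0}    - one physical I/O
print(drives_touched(0, 16, stripe_unit_kb=64, num_drives=4))   # {0}    - one I/O, but a larger-than-needed operation
```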

There are two ways to determine the best stripe size:

- Use a rule of thumb as per Table 16.
- Monitor the I/O characteristics of an existing server.

The first and simplest way to choose a stripe size is to use Table 16. This table is based on tests performed by the Netfinity Performance Lab.

Table 16. Stripe size settings for various applications

Application                                        Stripe size
Groupware (Lotus Domino, Exchange, etc.)           16 KB
Database server (Oracle, SQL Server, DB2, etc.)    16 KB
File server (Windows 2000, Windows NT)             16 KB
Web server                                         8 KB
Video file server                                  64 KB
Other                                              8 KB

Notes about Table 16:

- SQL Server 7.0 uses 8 KB I/O blocks, but experiments have shown that performance can usually be improved by using double the I/O block size (that is, 16 KB).
- Oracle uses multiple block sizes: 2 KB, 4 KB or 8 KB. While a 16 KB stripe size is not the optimum for all cases, neither is it significantly slower. Further I/O analysis of specific customer data may determine that 8 KB or 16 KB block sizes produce better performance.
- In general, the stripe size only needs to be at least as large as the I/O size. A smaller stripe size implies multiple physical I/O operations for each logical I/O, which will cause a drop in performance. A larger stripe size implies a read-ahead function, which may or may not improve performance.

Table 16 offers rule-of-thumb settings; there is no way to offer the precise stripe size that will always give the best performance for every environment without doing extensive analysis on the specific workload.

The second way to determine the correct stripe size involves observing the application while it is running, using the Windows 2000 Performance console. The key is to determine the average data transfer size being requested by the application and select a stripe size that best matches it. Unfortunately, this method requires the system to be running, so it either requires another system running the same application or the reconfiguration of the existing disk array once the measurement has been made (and therefore backup, reformat and restore operations).

The Windows 2000 Performance console or Windows NT 4.0 Performance Monitor can help you determine the proper stripe size. Select:

- Object: PhysicalDisk
- Counter: Avg. Disk Bytes/Transfer
- Instance: the drive that is receiving the majority of the disk I/O

Monitor this value. As an example, the trend value for this counter is shown as the thick line in Figure 52, and the running average is shown as indicated.

Figure 52. Average I/O size (data drive average disk bytes per transfer: a range of 20 KB to 64 KB, with a maximum of 64 KB, and the running average indicated)

Figure 52 represents an actual server application. It can be seen that the application request size (represented by Avg. Disk Bytes/Transfer) varies from a peak of 64 KB down to about 20 KB for the two run periods. As we said at the beginning of this section, in general, the stripe size should be at least as large as the median disk I/O request size generated by the server application.

This particular server was configured with an 8 KB stripe size, which produced very poor performance. Increasing the stripe size to 16 KB would improve performance, and increasing the stripe size to 32 KB would increase performance even more. The simplest technique is to place the time window around the run period and select a stripe size that is at least as large as the average size shown by the running average counter.

Activating disk performance counters

If you wish to monitor disk activity, you need to enable the physical disk counters. In Windows NT 4.0, physical disk counters are disabled by default. To enable them, issue the command DISKPERF -Y and then restart the computer. In Windows 2000, physical disk counters are enabled by default. Keeping this setting on all the time draws about 2-3% CPU, but if your CPU is not a bottleneck, this is irrelevant and can be ignored. Type DISKPERF /? for more help on the DISKPERF command.

Page file drive

Windows NT and Windows 2000 perform page transfers at up to 64 KB per operation, so the paging drive stripe size can be as large as 64 KB. However, in practice, it is usually closer to 32 KB, because the application might not make demands for large blocks of memory, which limits the size of the paging I/O. Monitor average bytes per transfer as described in "Selecting the correct stripe size" on page 147. Setting the stripe size to this average size can produce a significant increase in performance by reducing the amount of physical disk I/O that occurs due to paging. For example, if the stripe size is 8 KB and the page manager is doing 32 KB I/O transfers, then four physical disk reads or writes must occur for each page/sec you see in the Performance console. If the system is paging 10 pages/sec, then the disk will actually be doing 40 disk transfers per second.

7.7.6 SCSI bus organization and speed

Concern often exists over the performance effects caused by the number of drives on the SCSI bus, or the speed at which the SCSI bus runs. Yet, in almost all modern server configurations, the SCSI bus is rarely the bottleneck.

SCSI bus organization and speed

Concern often exists over the performance effects caused by the number of drives on the SCSI bus, or the speed at which the SCSI bus runs. Yet, in almost all modern server configurations, the SCSI bus is rarely the bottleneck. In most cases, optimal performance can be obtained by simply configuring 10 drives per Ultra SCSI bus. If the application is byte-I/O-intensive (as is the case with video or audio), five drives on one Ultra2 SCSI bus can be used for a moderate (10-20%) increase in system performance. In general, it is rare that SCSI bus configuration or increasing SCSI bus speed can significantly improve overall server system performance.

Consider that servers must access data stored on disk for each of the attached users. Each user is requesting access to different data stored in a unique location on the disk drives. Disk accesses are almost always random because the server must multiplex access to disk data for each user. This means that most server disk accesses require a seek and rotational latency before data is transferred across the SCSI bus.

SCSI bus speed

As described in 7.7.3, "Drive performance" on page 143, total disk seek and latency times average about 8-12 ms, depending on the speed of the drive. Transferring a 2 KB block over a 40 MBps SCSI bus takes about 0.5 ms, or approximately 1/20th of the total disk access time. It is easy to see that increasing the bus speed to 80 MBps will only shrink that 1/20 portion to roughly 1/40 of the total time, resulting in a small fractional gain in overall performance.

Tests have shown that for random I/O, drive throughput usually does not approach the limits of the SCSI bus. In some cases, Ultra2 SCSI (80 MBps) can be shown to offer measurable performance improvements over Ultra SCSI (40 MBps). This usually occurs when measurements are made with a few drives (four to eight) running server benchmarks that transfer large blocks of data to and from the disk drives. System performance gains can be in the range of 5-10%. Typical examples are file-serving benchmarks (and applications), where data transfer time becomes a larger component of the total disk access time; these workloads transfer relatively large blocks of data (12 KB to 64 KB), which increases SCSI bus utilization.

More importantly, however, these benchmarks usually build a relatively small set of data files, resulting in artificially reduced disk seek times. In production environments, disk drives are usually filled to at least 30-50% of their capacity, causing longer seek times compared to benchmark files that might use only 2-3% of the disk capacity. After all, building a 2 GB database for a benchmark might seem like a large data set, but on a disk array containing five 9 GB drives that database occupies less than 1/20th of the total space. This greatly reduces seek times, thereby inflating the performance contribution of the SCSI bus.
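The fraction-of-time argument above can be sketched in a few lines of Python. This is purely illustrative: the per-command overhead figure is a hypothetical placeholder standing in for arbitration, selection, command and status phases, not a measured value from this chapter.

    def bus_time_ms(block_bytes, bus_mbps, per_command_overhead_ms):
        """Time spent on the SCSI bus for one transfer: raw wire time plus an
        assumed fixed per-command protocol overhead (hypothetical value)."""
        wire_ms = block_bytes / (bus_mbps * 1_000_000) * 1000.0
        return wire_ms + per_command_overhead_ms

    seek_plus_latency_ms = 10.0      # middle of the 8-12 ms range quoted above
    block = 2 * 1024                 # a 2 KB random I/O

    for bus_mbps in (40, 80):        # Ultra SCSI vs. Ultra2 SCSI
        t = bus_time_ms(block, bus_mbps, per_command_overhead_ms=0.4)
        share = t / (seek_plus_latency_ms + t)
        print(f"{bus_mbps} MBps bus: {t:.2f} ms on the bus, "
              f"{share:.1%} of the total I/O time")

Doubling the bus speed halves only the wire time, so the total time per random I/O barely changes.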

Most IBM SCSI drive enclosures offer the ability to split the backplane to provide dual SCSI bus capability. This effectively provides the same performance as Ultra2 by using dual Ultra SCSI buses in one drive package. In the case of the EXP200, which supports Ultra2, the backplane can be split to provide throughput benefits similar to Ultra3 SCSI, provided that the other components in the system can handle this level of throughput.

PCI bus

Don't forget that all this data must travel through the PCI bus. The peak data transfer rate for 32-bit 33 MHz PCI is 132 MBps, but the maximum sustained rate is lower. The ServeRAID-2 adapter used three 40 MBps Ultra SCSI buses, which had the potential to sustain peak rates of 120 MBps. This transfer rate was well matched to 32-bit 33 MHz PCI.

Ultra2 SCSI only offers the possibility of improving maximum sustainable performance when both the adapter and the system support faster data transfer rates. The ServeRAID-3HB adapter offers Ultra2 SCSI transfer performance and can transfer data over 64-bit PCI, which is better matched to the transfer requirements of its three Ultra2 SCSI buses. The adapter must be plugged into a 64-bit slot for its Ultra2 SCSI capabilities to reach their full potential.

Ultra3 SCSI's 160 MBps rate has similar issues: even faster PCI-to-memory performance is required before maximum throughput can be achieved. A three-channel Ultra3 SCSI RAID adapter can potentially deliver peak rates of 480 MBps. For the moment, no PCI interface can offer such throughput. In fact, most memory subsystems cannot offer that much bandwidth for all of the PCI slots combined.

None of this is to say that Ultra2 and Ultra3 SCSI performance have no place in servers; it simply means that these are important considerations when configuring a balanced server. Spec-driven technologies, such as SCSI, are often motivated by desktop environments. In the desktop environment, applications tend to be more sequential, and the system usually has a single SCSI adapter that can monopolize much of the PCI-to-memory bandwidth; there, SCSI-3 provides significant performance gains. Because of the more random nature of server applications, these benefits often do not translate to server environments. The entire delivery path from memory through the PCI bus, over the adapter, and out to the drive must be optimized before faster SCSI bus speeds will realize any appreciable system performance gain for any workload.
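The balance argument in this section comes down to comparing the aggregate bandwidth of the adapter's SCSI channels with what the PCI slot can carry. The short Python sketch below does that comparison using the per-channel figures quoted in this chapter; the 264 MBps figure for 64-bit 33 MHz PCI is a general PCI number and an assumption here, not one quoted in the text.

    # Sketch: compare aggregate SCSI channel bandwidth against PCI peak bandwidth.
    def aggregate_scsi_mbps(channels, mbps_per_channel):
        return channels * mbps_per_channel

    adapters = {
        "ServeRAID-2 (3 x Ultra SCSI, 40 MBps)":    aggregate_scsi_mbps(3, 40),   # 120
        "ServeRAID-3HB (3 x Ultra2 SCSI, 80 MBps)": aggregate_scsi_mbps(3, 80),   # 240
        "3-channel Ultra3 SCSI (160 MBps)":         aggregate_scsi_mbps(3, 160),  # 480
    }

    pci_peak = {
        "32-bit/33 MHz PCI": 132,   # peak figure quoted in the text
        "64-bit/33 MHz PCI": 264,   # general PCI figure (assumption, not from the text)
    }

    for adapter, scsi_mbps in adapters.items():
        for slot, pci_mbps in pci_peak.items():
            verdict = "fits within" if scsi_mbps <= pci_mbps else "exceeds"
            print(f"{adapter}: {scsi_mbps} MBps {verdict} {slot} ({pci_mbps} MBps peak)")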

Multiple SCSI buses

The SCSI bus organization of drives on a multi-bus controller (such as ServeRAID) does not significantly affect performance for most server workloads. For example, in a four-drive configuration, it doesn't matter whether you attach all drives to a single SCSI bus or attach two drives each to two different SCSI buses. Both configurations will usually deliver identical disk subsystem performance.

This applies to applications such as database transaction processing, which generate random disk operations of 2 KB or 4 KB. The SCSI bus does not contribute significantly to the total time required for each I/O operation. Each I/O operation usually requires drive seek and latency time; therefore, the sustainable number of operations per second is reduced, keeping SCSI bus utilization low. For a configuration that runs applications accessing image data or large sequential files, a performance improvement can be achieved by using a balanced distribution of drives across the three SCSI buses of the ServeRAID adapter.

Disk cache write-back versus write-through

Most people think that write-back mode is always faster because it allows data to be written to the disk controller cache without waiting for the disk I/O to complete. This is usually the case when the server is lightly loaded. However, as the server becomes busy, the cache fills completely, causing data writes to wait for space in the cache before being written to the disk. When this happens, data write operations slow to the speed at which the disk drives empty the cache. If the server remains busy, the cache is flooded by write requests, resulting in a bottleneck. This happens regardless of the size of the adapter's cache.

In write-through mode, write operations do not wait in cache memory that must be managed by the processor on the RAID adapter. When the server is lightly loaded (the lightly loaded region on the left in Figure 53), write operations take longer because they cannot be quickly stored in the cache. Instead, they must wait for the actual disk operation to complete. Thus, when the server is lightly loaded, throughput in write-through mode is generally lower than in write-back mode.

Figure 53. Comparing write-through and write-back modes under increasing load (I/Os per second, IBM ServeRAID-3HB, 8 KB random I/O)

However, when the server becomes very busy (the heavily loaded region on the right in Figure 53), I/O operations do not have to wait for available cache memory. They go straight to disk, and throughput is usually greater in write-through mode than in write-back mode.

Write-through is also relatively faster when a battery-backup cache is installed, partly because the write-back cache is then mirrored: data in the primary cache has to be copied to the memory on the battery-backup cache card. This copy operation eliminates a single point of failure, thereby increasing the reliability of the controller in write-back mode, but it takes time and slows writes, especially when the workload floods the adapter with write operations.

Rule of thumb

Based on Figure 53, the following rule of thumb is appropriate:
- If the disk subsystem is very busy, use write-through mode.
- If the disks are configured correctly and the server is not heavily loaded, use write-back mode.

7.7.8 RAID adapter cache size

IBM performance tests show that the ServeRAID-3H adapter with 32 MB of cache typically outperforms other RAID adapters with 64 MB for most real-world application workloads. Once the cache size is above the minimum required for the job, extra cache usually offers little additional performance benefit.

The cache increases performance by providing data that would otherwise have to be read from disk. However, in real-world applications the total data space is so much larger than the disk cache that, for random operations, there is very little statistical chance of finding the requested data in the cache. For example, a 50 GB database would not be considered very large by today's standards. A typical database of this size might be placed on an array consisting of seven or more 9 GB drives. For random accesses to such a database, the probability of finding a record in the cache would be the ratio of 32 MB to 50 GB, or approximately 1 in 1,600 operations. Double the cache size and the odds improve only to about 1 in 800; still a very discouraging hit rate. You can easily see that it would take a very large cache to raise the hit rate to the point where caching becomes advantageous for random accesses.

In RAID-5 mode, significant performance gains from write-back mode are derived from the ability of the disk controller to merge multiple write commands into a single disk write operation. In RAID-5 mode the controller must update the checksum information for each data update. Write-back mode allows the disk controller to keep the checksum data in adapter cache and perform multiple updates before completing the update to the checksum information contained on the disk. This does not require a large amount of RAM.

In most cases, disk array caches can provide high hit rates only when I/O requests are sequential. In this case, the controller can prefetch data into the cache so that the next sequential I/O request results in a cache hit. Prefetching for sequential I/O requires only enough buffer space or cache memory to stay a few steps ahead of the sequential I/O requests; this can be done with a small circular buffer. The cache size needs to increase in proportion to the number of concurrent I/O streams supported by the array controller. The earlier ServeRAID adapters supported up to 32 concurrent I/O streams, so 32 MB of cache was deemed enough to provide a high hit rate for sequential I/O. For newer RAID adapters, the number of outstanding I/O requests can be as high as 128; thus, these adapters will have proportionally larger caches. (Note that it is a coincidence that the number of I/O streams matches the size of the cache in MB.)
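The cache-hit arithmetic above is easy to reproduce. The following Python sketch (illustrative only) estimates the expected hit rate for purely random accesses as the ratio of cache size to data size.

    MB = 1024 ** 2
    GB = 1024 ** 3

    def random_hit_rate(cache_bytes, data_bytes):
        """For uniformly random accesses, the chance that a requested block is
        already cached is roughly cache size / data size (read-ahead ignored)."""
        return cache_bytes / data_bytes

    for cache_mb in (32, 64, 128):
        rate = random_hit_rate(cache_mb * MB, 50 * GB)
        print(f"{cache_mb} MB cache over a 50 GB database: "
              f"about 1 hit in {round(1 / rate):,} operations")
    # 32 MB  -> about 1 in 1,600
    # 64 MB  -> about 1 in 800
    # 128 MB -> about 1 in 400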

A large cache means more memory to manage when the workload is heavy, while during light loads very little cache memory is required. Most people don't invest the time to think about how a cache actually works; without much thought, it is easy to conclude that bigger is always better. The drawback is that larger caches take longer to search and manage. This can slow I/O performance, especially for random operations, since there is a very low probability of finding the data in the cache.

Benchmarks often do not reflect a customer production environment. In general, most retail benchmark results are run with very small amounts of data stored on the disk drives. In these environments, a very large cache will have a high hit rate that is artificially inflated compared to the hit rate of a production workload. In a production environment, an overly large cache can actually slow performance as the adapter continuously searches the cache for data that is never found before it starts the required disk I/O. This is the reason many array controllers turn off the cache when hit rates fall below an acceptable threshold. On identical hardware it takes more CPU overhead to manage 64 MB of cache than 32 MB, and even more for 128 MB. The point is that bigger caches do not always translate into better performance.

The ServeRAID-4H has a 266 MHz PowerPC 750 with 1 MB of L2 cache; this CPU is approximately 5-7 times faster than the 80 MHz CPU used on the ServeRAID-3H. Therefore the ServeRAID-4H can manage its larger cache without running slower than the ServeRAID-3H. Furthermore, the amount of cache should be proportional to the number of drives attached. Cache hits are typically generated by sequential read-ahead, and you do not need to read ahead very far to achieve close to 100% hits; more drives simply mean more I/O streams to prefetch. The ServeRAID-4H has four SCSI buses that support up to 40 drives, compared to 30 for the ServeRAID-3H.

7.7.9 Device drivers

Device drivers play a major role in the performance of the subsystem with which the driver is associated. A device driver is software written to recognize a specific device. Most device drivers are vendor-specific and are supplied by the hardware vendor (such as IBM in the case of ServeRAID).

Most device drivers can be downloaded from the Web. Choosing the correct device driver for specific hardware is very important. Device drivers are also specific to the operating system. Some device drivers are supplied on the Windows NT CD-ROM, and some are supplied with the hardware on diskette. A technically competent person should select the proper driver during installation; selecting an incorrect device driver for a device (if it works at all) can cause poor performance or data loss.

Most Netfinity systems have integrated SCSI adapters. We recommend you refer to the technical manual of the specific Netfinity model to select the correct driver. Windows NT can automatically detect many SCSI adapters and load the appropriate device driver. For other adapters, such as ServeRAID, you need to instruct Windows NT to copy the device driver from the supplied diskette. The same applies to Windows 2000. The Windows 2000 CD-ROM contains a version of the ServeRAID driver that is equivalent to the v3.5-level driver and will allow you to install the operating system onto ServeRAID-attached disks. We recommend you install Windows 2000 using that driver and then, once the installation is complete, upgrade to the latest driver.

It should also be noted that the latest driver is often not the best or correct driver to use. This is especially important with specific hardware configurations that are certified by an application vendor; an example is Microsoft's Cluster Server. You must check the certified configuration to determine what driver level is supported.

Before installing the latest version, check the IBM support Web site for the latest hints and tips:
1. Go to the IBM support Web site URL.
2. Select Servers.
3. In the Family pull-down list, select ServeRAID.
4. Click Hints and Tips in the navigation bar on the left side of the page.
5. In the category pull-down list, select Service Bulletins.
You may also want to examine the other hints and tips categories.

Firmware

Version 3.5 provides a significant improvement in performance compared to earlier versions of the ServeRAID software. Version 3.5 included many firmware optimizations that resulted in system-level performance gains of as much as 20-25% for typical server workloads. This firmware version also introduced RAID-5E to the ServeRAID-3 family of adapters.

Version 3.5 of the firmware and device driver introduced automatic read-ahead algorithms that turned the read-ahead function on and off based upon the demands of the active workload. Whenever the adapter firmware detected transfers that would benefit from read-ahead, the option was dynamically turned on; if the I/O workload changed so that read-ahead reduced overall performance, it was turned off. This feature reduced the complexity of configuring an array for maximum performance by automating the setting of this parameter.

Version 3.5 also improved performance by optimizing I/O for RAID-1E, better balancing physical I/O operations between the mirrored drive pairs. The net gain in performance for RAID-1E was as much as 66%.

Version 3.6 of the software for the ServeRAID-3 family of adapters introduced additional performance enhancements, including:

- Refined instruction path length
  This feature significantly improves the performance of the ServeRAID-3 family of adapters when executing cache-hit operations. Since many customers base purchase decisions on small-data-size benchmarks, the design lab could not ignore the performance obtained while accessing the majority of data from adapter cache. Version 3.6 offers greater performance by restructuring the executed code for a better fit, so that it stays resident in the L1 processor cache. The on-board CPU now runs significantly faster because most key instructions remain resident in the L1 cache, reducing CPU waits for slower memory accesses.

- Greater concurrent I/O
  This feature enables the ServeRAID-3 family of adapters to have up to 128 concurrent outstanding I/O operations. This change increases performance for configurations that use large numbers of disk drives. Allowing a larger number of concurrent outstanding I/O operations enables the disk drives to optimize I/O by reordering seek operations.

- Removed the 8-drive limitation for 32 KB and 64 KB stripe sizes
  Removing the 8-physical-drive limitation for 32 KB and 64 KB stripe sizes lets you build configurations of up to 16 physical disks for applications that require large block transfers. Applications such as video and image serving can now use larger arrays, which provide both greater capacity and increased throughput for a single logical drive.

In general, customers can expect to see as much as 20-25% improvement in throughput for average business applications from these modifications. Figure 54 shows the gains obtained for typical random I/O server applications; the specific configuration is 8 KB blocks with 67% read and 33% write random transactions.

Figure 54. Maximum ServeRAID family RAID-0 throughput performance (I/Os per second in thousands; RAID-0 write-back with the maximum number of drives attached, 8 KB random 67% read/33% write operations; compares ServeRAID-2 code v2.4, ServeRAID-3L code v3.0, and ServeRAID-3H code v2.7, v3.5 and v3.6)

Firmware levels

Always upgrade the firmware on the ServeRAID card to the latest level.

7.8 Fibre Channel

Fibre Channel introduces new techniques for attaching storage to servers and, as a result, it has unique performance issues that affect the overall performance of a server. The purpose of this section is to provide a brief introduction to the motivation behind Fibre Channel, to explain how Fibre Channel affects server performance, and to identify important issues for configuring Fibre Channel for optimal performance.

SCSI has been the standard for server disk attachment for the last ten years. However, SCSI technology has recently been under stress as it attempts to satisfy the I/O demands of current high-performance 4-way and 8-way servers. Some of the fundamental problems with SCSI stem from its parallel cable design, which limits cable length, transfer speed, and the maximum number of drives that can be attached to the cable. Another significant limitation is that a maximum of two systems can share devices attached to one SCSI bus, which matters when using SCSI for server clustering configurations.

Fibre Channel was designed to be a transport for both network traffic and an I/O channel for attaching storage. In fact, the Fibre Channel specification provides for many protocols, such as 802.2, IP (Internet Protocol) and SCSI. Our discussion in this redbook is limited to its use for disk storage attachment. Fibre Channel provides low latency and high throughput, and as a result it is rapidly becoming the next-generation I/O technology used to connect servers and high-speed storage. Fibre Channel addresses many of the shortcomings of SCSI with improvements in the following areas:

- Cable distance
- Bandwidth
- Reliability
- Scalability

The parallel cable used for Ultra, Ultra2, and Ultra3 SCSI limits cable distances to 25 meters or shorter. This is due to electromagnetic effects impacting signal integrity as cable length increases. Parallel cables such as those used by SCSI tend to have signal interference problems because of electromagnetic coupling between the parallel signals traversing the wires. Serial technologies use fewer signals, typically two or four, compared to as many as 68 for SCSI. Fewer signal lines mean less electromagnetic energy emitted and less total signal interference from coupling of that energy into adjacent wires. Lower signal interference allows a serial cable to transfer data at much higher rates than is possible over a parallel connection.

Fibre Channel provides the capability to use either a serial copper or a fiber optic link to connect the server with storage devices. Fiber optic technology allows storage to be located up to 10 kilometers away from the attaching server.

The same electromagnetic noise problems that limit SCSI cable length also limit the speed at which data can traverse the SCSI bus. First-generation Fibre Channel is capable of transmitting data at 1 Gbit per second in both the transmit and receive directions. The most popular version of SCSI, Ultra2, is limited to 80 MBps (bytes), or 640 Mbps (bits). This difference (1 Gb vs. 0.64 Gb) does not appear to be significant; however, Fibre Channel offers a full-duplex communication path while SCSI is half-duplex. This means that Fibre Channel can achieve up to 2 Gb of throughput by transferring data on both the send and receive paths at the same time. Therefore, the maximum bandwidth of current Fibre Channel implementations is actually 2 Gb, while it is only 640 Mb for Ultra2 SCSI. However, while maximum bandwidth is often touted as an important specification, in actual use sustainable bandwidth may be much less.

Another significant advantage of Fibre Channel is its ability to provide redundant paths between storage and one or more servers. Redundant Fibre Channel paths improve server availability: a cable or connector failure does not cause server down time, because storage can still be accessed via the remaining path. In addition, both Fibre Channel and SCSI throughput can scale by using multiple channels or buses between the servers and storage.

In addition to a simpler cable scheme, Fibre Channel offers improved scalability because it offers several very flexible connection topologies. Basic point-to-point connections can be made between a server and storage devices, providing a low-cost, simple stand-alone connection. Fibre Channel can also be used in both loop and switch topologies, which increase server-to-storage connection flexibility. The Fibre Channel loop allows up to 127 devices to share the same Fibre Channel connection; a device can be a server or a storage subsystem. Fibre Channel switch topologies provide the most flexible configuration scheme by theoretically allowing the connection of up to 16 million devices.

The Fibre Channel specification allows many possible configurations, but we will confine our discussion to the implementation of the IBM Netfinity Fibre Channel RAID Controller. The IBM Fibre Channel RAID Controller's operation can be conceptualized as a combination of LAN and disk array controller operations.

Figure 55 below illustrates the primary components in the IBM Fibre Channel configuration. The key performance consideration is that the RAID controller and storage are attached to the server by a Fibre Channel link. This introduces two factors that contribute to overall Fibre Channel performance:

- The throughput of the Fibre Channel links, shown as the FC bandwidth arrow
- The aggregate throughput of the RAID controller and link combination, shown as the FC-to-disk bandwidth arrow

Figure 55. IBM Netfinity Fibre Channel RAID organization (a Netfinity server with a Netfinity Fibre Channel host adapter and an optional second FC adapter, connected over FC-AL to a Netfinity Fibre Channel RAID controller with an optional second controller, attaching EXP-15 enclosures with up to 60 (6x10) disk drives per RAID controller pair)

In March 2000, IBM introduced Netfinity Fibre Array Storage Technology (FAStT). This new Fibre Channel technology employs second-generation Fibre Channel integrated circuits, which greatly improve throughput performance. In addition, the device drivers and firmware are optimized to enhance throughput. FAStT uses the same Fibre Channel protocols as the first-generation Netfinity Fibre Channel products, but four host-connection and four drive-connection Fibre Channel links are supported per controller pair, significantly improving the total available Fibre Channel bandwidth.

7.8.1 Fibre Channel performance issues

Let's look at what happens when a read I/O operation is issued to a Fibre Channel subsystem and the requested data is not in the RAID controller's disk cache:

1. A read command is generated by the Netfinity server; the read command contains the logical block address of the data being requested.
2. The command is transmitted by the Fibre Channel host adapter to the RAID controller over the Fibre Channel link.
3. The RAID controller parses the read command and uses the logical block address to issue the disk read command to the correct drive.
4. The disk drive performs the read operation and returns the data to the RAID controller.
5. The Fibre Channel electronics within the RAID controller format the data into the Fibre Channel protocol format. The data is transferred to the Netfinity server over the Fibre Channel link.
6. Once in the Fibre Channel adapter, the data is transferred over the PCI bus into the memory of the Netfinity server.

Of course, a large amount of detail has been left out, but this level of observation is sufficient to understand the most important performance implication of Fibre Channel. The Fibre Channel link, like most network connections, sustains a data transfer rate that is largely determined by the payload of the frame. Stated another way, the throughput of Fibre Channel is a function of the disk I/O size being transferred. This is because Fibre Channel frames have a maximum data payload of 2112 bytes; larger data transfers require multiple Fibre Channel frames.

Figure 56 illustrates the effect of disk request size on Fibre Channel throughput. At small disk request sizes such as 2 KB, the maximum Fibre Channel throughput is about 20 MBps, or about 20% of the maximum transfer rate of Fibre Channel. This is critical information, as many people assume the maximum 1 Gbps throughput is obtained for all operations.

Figure 56. Fibre Channel throughput vs. disk I/O size (IBM Fibre Channel solution; throughput in MBps and protocol overhead plotted against transfer size in KB)

Only when the disk I/O size is as large as 64 KB does Fibre Channel reach its maximum sustainable throughput, which in this case is about 82 MBps. But isn't Fibre Channel supposed to have 1 Gigabit of throughput? One Gigabit is roughly 100 MBps (taking into account a 2-bit serial overhead for every byte transmitted). The difference between this measured result of 82 MBps and the theoretical maximum of 1 Gbps (100 MBps) is explained by the overhead of the command and control bits that accompany each Fibre Channel frame. This is discussed in the following sections.

Fibre Channel protocol layers

We can get a better appreciation for this overhead by taking a brief look at the Fibre Channel layers and the Fibre Channel frame composition. The Fibre Channel specification defines five independent protocol layers. These layers are structured so that each layer has a specific function, enabling reliable communications for all of the protocols supported by the Fibre Channel standard.

Figure 57. Fibre Channel functional levels (FC-4 mapping protocol carrying SCSI, HiPPI, IPI, SBCCS and IP over the FC-3 common services protocol, FC-2 signaling and framing protocol, FC-1 transmission protocol, and FC-0 physical layer)

Figure 57 illustrates the five independent layers:

- FC-0 is the physical layer. This is the actual wire or optical fibre over which data travels.
- FC-1 is the transmission protocol. The transmission layer is responsible for encoding the bits on the physical medium, for data transmission error detection, and for signal clock generation.
- FC-2 is important from a performance perspective because this is the layer responsible for building the data frames that flow over the Fibre Channel link. FC-2 is also responsible for segmenting large transfer requests into multiple Fibre Channel frames.
- FC-3 defines the common services layer. This layer is responsible for defining the common services that are accessible across all Fibre Channel ports. One of these services is the Name Server, which provides a directory of all the Fibre Channel nodes accessible on the connection. For example, a Fibre Channel switch acts as a name server and maintains a directory of all the ports attached to that switch; other Fibre Channel nodes can query the switch to determine which node addresses are accessible through it.
- FC-4 defines the protocol standards that can be used to transport data over Fibre Channel. Some of these protocols include:
  - SCSI (Small Computer Systems Interface)
  - HiPPI (High Performance Parallel Interface)
  - IPI (Intelligent Peripheral Interface)
  - SBCCS (Single Byte Command Code Set), to support ESCON
  - IP (Internet Protocol)

Our discussion is limited to SCSI, because the Netfinity Fibre Channel RAID controller products are based upon the SCSI protocol. Fibre Channel allows SCSI protocol commands to be encapsulated and transmitted over Fibre Channel to SCSI devices connected to the RAID controller unit. This is significant because it allowed Fibre Channel products to be developed quickly and to function with existing SCSI devices and software.

The importance of the I/O size

Regarding the shape of the throughput chart in Figure 56 on page 164, the throughput of Fibre Channel is clearly sensitive to the disk access size: small disk access sizes have low throughput, while larger blocks achieve greater overall throughput. The reason can be seen by looking at the read command example discussed in 7.8.1, "Fibre Channel performance issues" on page 163. In the case of a 2 KB read operation, the sequence is:

1. A SCSI read command is issued by the device driver to the Fibre Channel host adapter at level FC-4.
2. On the Netfinity host side, the SCSI read command must flow down from FC-4 to FC-0 before it is transferred over the Fibre Channel link to the external RAID controller.
3. The RAID controller also has a Fibre Channel interface, which receives the read command at FC-0 and sends it up through FC-1, FC-2 and FC-3 to the SCSI layer at FC-4.
4. The SCSI layer then sends the read command to the Fibre Channel RAID controller.
5. The SCSI read command is issued to the correct disk drive.
6. When the read operation completes, data is transferred from the drive to SCSI layer FC-4 of the Fibre Channel interface within the RAID controller.
7. Now the read data must make the return trip down layers FC-4, FC-3, FC-2 and FC-1 on the RAID controller side and onto the Fibre Channel link.
8. When the data arrives on the Fibre Channel link, it is transmitted to the host adapter in the Netfinity server.
9. Again it must travel up the layers to FC-4 on the Netfinity side before the SCSI device driver responds with data to the requesting process.

Contrast the 2 KB read command with a 64 KB read command and the answer becomes clear. Like the 2 KB read command, the 64 KB read command travels down FC-4, FC-3, FC-2 and FC-1 on the Netfinity side, and up the same layers on the RAID controller side. But here is where things differ. After the 64 KB read completes, the data is sent to FC-4 of the Fibre Channel interface on the RAID controller side and travels down through FC-4 and FC-3 to FC-2. At layer FC-2 the data is formatted into 2112-byte payloads to be sent over the link. But 64 KB does not fit into a single 2112-byte payload, so layer FC-2 performs segmentation and breaks the 64 KB of disk data into 32 separate Fibre Channel frames to be sent to the Netfinity Fibre Channel controller. 31 of these frames never had to traverse layers FC-4 and FC-3 on the RAID controller side, and none of them required a separate read command to be generated; they were all transmitted in response to one read command.

Thus, reading data in large blocks introduces significant efficiencies because much of the protocol overhead is eliminated: any transfer exceeding the 2112-byte payload is shipped as additional low-cost frames back to the host. This explains why throughput at small I/O sizes (Figure 56 on page 164) is so low and why throughput improves as the disk I/O size increases. The overhead of the FC-4 and FC-3 layers and the additional SCSI read or write commands limits throughput for small transfers.
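The FC-2 segmentation arithmetic above is easy to verify. The following Python sketch (illustrative only; it counts only the 2112-byte data payloads and ignores frame headers and inter-frame gaps) shows how the number of frames, and therefore the share of "low-cost" frames, grows with the I/O size.

    import math

    FC_PAYLOAD = 2112      # maximum data payload of one Fibre Channel frame, in bytes

    def frames_for_transfer(io_bytes):
        """Number of Fibre Channel frames FC-2 must generate for one disk I/O."""
        return math.ceil(io_bytes / FC_PAYLOAD)

    for size_kb in (2, 8, 64):
        n = frames_for_transfer(size_kb * 1024)
        print(f"{size_kb} KB I/O -> {n} frame(s), "
              f"{n - 1} of them needing no extra SCSI command or FC-4/FC-3 processing")
    # 2 KB  -> 1 frame
    # 8 KB  -> 4 frames
    # 64 KB -> 32 frames (31 of them low-cost)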

Configuring Fibre Channel for performance

The important point is to understand that throughput degrades at smaller I/O sizes, and to use that information to better configure your Fibre Channel subsystem. One way to do this is to profile an existing server to get an idea of the average disk transfer size. This can easily be obtained using the Performance Monitor and examining the following physical disk counters:

- Average disk bytes/transfer
  This counter can be graphed over time to tell you the predominant transfer size for the particular application. The value can be compared with Figure 56 on page 164 to determine the maximum level of throughput a single Fibre Channel link can sustain for that application.

- Disk bytes/second
  This counter tells you what the current disk subsystem is able to sustain for this particular application. This value can also be compared with the maximum throughput from Figure 56 on page 164 to determine whether multiple links should be used to reach the level of throughput demanded by the target number of users.

As a rule of thumb, all other things being equal, doubling the number of users requires double the amount of disk I/O. For example, if the current server is doing 8 KB transfers and supporting 100 users, and you are asked to build a Fibre Channel based server configuration to support 300 users, the analysis is fairly straightforward:

- At 8 KB, Fibre Channel can sustain about 52 MBps (from Figure 56 on page 164).
- If the current server with 100 users is sustaining 10 MBps, then a single Fibre Channel link will be sufficient to handle 300 users at 30 MBps (which is less than the 52 MBps maximum).
- If the server were sustaining 20 MBps (a total requirement of 60 MBps), then it would be best to configure dual Fibre Channel adapters connecting the host to the Fibre Channel RAID controller.

As well as adding a second PCI host adapter, you can also improve performance by adding a second Fibre Channel controller module to the Netfinity Fibre Channel RAID Controller Unit. With both the first-generation Netfinity Fibre Channel and the FAStT technology, throughput nearly doubles for all transfer sizes when a second controller is added to the system, as shown in Figure 58.
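The sizing analysis above can be captured in a few lines of Python. This is only a sketch: it assumes disk I/O scales linearly with the user count (the rule of thumb above), and the 52 MBps per-link figure for 8 KB transfers is simply read off Figure 56.

    import math

    def links_needed(current_mbps, current_users, target_users, link_max_mbps):
        """Scale current throughput linearly with users, then count FC links."""
        projected_mbps = current_mbps * target_users / current_users
        return projected_mbps, max(1, math.ceil(projected_mbps / link_max_mbps))

    link_max_8kb = 52.0    # sustainable MBps per link at 8 KB transfers (Figure 56)

    print(links_needed(10.0, 100, 300, link_max_8kb))   # (30.0, 1)  -> one link
    print(links_needed(20.0, 100, 300, link_max_8kb))   # (60.0, 2)  -> dual adapters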

Figure 58. Comparing single vs. dual controller throughputs (Netfinity FAStT throughput in MBps versus transfer size in KB, for a single controller and for dual controllers)

Rules of thumb

- Doubling the number of users requires double the amount of disk I/O.
- Use Figure 56 on page 164 to determine the maximum sustainable throughput. If your expected throughput exceeds this value, add a second host adapter.
- Adding a second RAID controller module doubles the throughput.

The rest of the challenge of optimizing Fibre Channel is similar to configuring a standard RAID controller. Disk layout and organization, such as the RAID strategy, stripe size, and number of disks, all affect the performance of the IBM Fibre Channel RAID controller in much the same way as they do for ServeRAID, and the same techniques used to determine these settings for ServeRAID can be used to optimize the IBM Fibre Channel RAID controller solution. Figure 59 shows a comparison between ServeRAID and Fibre Channel.

Figure 59. Throughput comparisons (maximum controller-to-disk I/O operations per second; RAID-5 arrays, 8 KB I/O size, random I/O with a 67/33 read/write mix, arrays 8% full; maximum number of drives attached: 30 disks for ServeRAID, 60 disks for Fibre Channel)

Figure 59 compares:

- ServeRAID-3HB with Version 3.6 of the firmware, BIOS and driver
- ServeRAID-4H with Version 4 of the firmware, BIOS and driver
- Netfinity Fibre Channel with one module in the RAID controller unit
- Netfinity Fibre Channel with two modules in the RAID controller unit
- Netfinity Fibre Array Storage Technology (FAStT) with one RAID controller module
- Netfinity Fibre Array Storage Technology with two RAID controller modules

Fibre Channel offers improved performance over SCSI, as well as the ability to configure a larger number of drives per RAID controller. With ServeRAID-4H you can have up to 56 drives connected to the adapter (14 drives per channel using the new Netfinity EXP300 enclosure), while with Netfinity FAStT up to 220 drives can be connected. In addition, Fibre Channel offers benefits related to high availability, such as fault tolerance and greater distance between the server and the disk enclosures.

Using a large number of drives in an array is the best way to increase throughput for applications that have high I/O demands. These applications include database transaction processing, decision support, e-commerce, video serving, and groupware such as Lotus Notes and Microsoft Exchange.

Tuning with Netfinity FAStT Storage Manager

Netfinity FAStT Storage Manager Version 7 is the software that lets you manage the Netfinity FAStT RAID Controller. It includes its own performance monitoring tool, the Subsystem Management Performance Monitor, which gives you information about the performance of your Fibre Channel subsystem.

Note: This performance monitor tool is not related to the Windows NT Performance Monitor tool.

Figure 60. Subsystem Management Performance Monitor

This section describes how to use data from the Subsystem Management Performance Monitor and what tuning options are available in the Storage Manager for optimizing the Fibre Channel subsystem's performance.

Use the Subsystem Management Performance Monitor to monitor storage subsystem performance in real time and to save performance data to a file for later analysis. You can specify the logical drives and/or controllers to monitor and the polling interval. You can also receive storage subsystem totals, which combine the statistics for both controllers in an active-active controller pair.

Table 17 describes the data that is displayed for the selected devices.

Table 17. Subsystem Management Performance Monitor parameters

- Total I/Os
  Total I/Os performed by this device since the beginning of the polling session. For more information, see "Balancing the I/O load".

- Read percentage
  The percentage of Total I/Os that are read operations for this device. The write percentage can be calculated as 100 minus this value. For more information, see "Optimizing the I/O request rate".

- Cache hit percentage
  The percentage of reads that are satisfied with data from the cache rather than requiring a read from disk. For more information, see "Optimizing the I/O request rate".

- Current KB per second
  The average transfer rate during the polling session. The transfer rate is the amount of data, in kilobytes, that can be moved through the I/O data connection in one second (also called throughput). For more information, see "Optimizing the transfer rate".

- Maximum KB per second
  The maximum transfer rate achieved during the Performance Monitor polling session. For more information, see "Optimizing the transfer rate".

- Current I/O per second
  The average number of I/O requests serviced per second during the current polling interval (also called the I/O request rate). For more information, see "Optimizing the I/O request rate".

- Maximum I/O per second
  The maximum number of I/O requests serviced during a one-second interval over the entire polling session. For more information, see "Optimizing the I/O request rate".

Balancing the I/O load

The Total I/Os data field is useful for monitoring the I/O activity to a specific controller and a specific logical drive, and helps you identify possible I/O hot spots. Identify the actual I/O patterns to the individual logical drives and compare them with the expectations based on the application. If a particular controller has considerably more I/O activity than expected, consider moving an array to the other controller in the storage subsystem using the Array > Change Ownership option.

Since I/O loads are constantly changing, it can be difficult to balance the I/O load perfectly across controllers and logical drives. The logical drives and the data accessed during your polling session depend on which applications and users were active during that time period.
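As a rough sketch of the kind of hot-spot check described above, the following Python fragment looks at per-logical-drive Total I/Os figures and flags imbalances. The drive names and counts are made-up sample data; in practice the numbers would come from the Performance Monitor (which can also export them to a comma-delimited file, as noted below).

    # Hypothetical Total I/Os per (controller, logical drive) from one polling session.
    total_ios = {
        ("Controller A", "LogicalDrive1"): 212_000,
        ("Controller A", "LogicalDrive2"): 154_000,
        ("Controller B", "LogicalDrive3"):  21_000,
        ("Controller B", "LogicalDrive4"):  17_000,
    }

    grand_total = sum(total_ios.values())

    # Per-controller share: a large imbalance suggests moving an array to the
    # other controller of the active-active pair (Array > Change Ownership).
    by_controller = {}
    for (controller, _drive), ios in total_ios.items():
        by_controller[controller] = by_controller.get(controller, 0) + ios
    for controller, ios in by_controller.items():
        print(f"{controller}: {ios / grand_total:.0%} of all I/Os")

    # Individual hot spots: drives carrying far more than an even share.
    even_share = grand_total / len(total_ios)
    for (controller, drive), ios in total_ios.items():
        if ios > 2 * even_share:
            print(f"Possible hot spot: {drive} on {controller} ({ios:,} Total I/Os)")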

It is important to monitor performance during different time periods and to gather data at regular intervals so you can identify performance trends. The Performance Monitor tool allows you to save data to a comma-delimited file so you can import it into a spreadsheet for further analysis. If you notice that the workload across the storage subsystem (the Storage Subsystem Totals Total I/Os statistic) continues to increase over time while application performance decreases, this can indicate the need to add additional storage subsystems to your enterprise so you can continue to meet application needs at an acceptable performance level.

Optimizing the transfer rate

As described in 7.8.1, "Fibre Channel performance issues" on page 163, the transfer rates of the controller are determined by the application I/O size and the I/O request rate. In general, a small application I/O request size results in a lower transfer rate but provides a faster I/O request rate and a shorter response time. With larger application I/O request sizes, higher throughput rates are possible. Understanding your typical application I/O patterns can give you an idea of the maximum I/O transfer rates that are possible for a given storage subsystem.

Because of the dependency on I/O size and transmission media, the only technique you can use to improve transfer rates is to improve the I/O request rate. Use the Windows 2000 Performance console (or Windows NT Performance Monitor) to gather I/O size data so you understand the maximum transfer rates possible, then use the tuning options available in Storage Manager to optimize the I/O request rate so you can reach the maximum possible transfer rate.

Optimizing the I/O request rate

The factors that affect the I/O request rate include:

- I/O access pattern (random or sequential) and I/O size
- Whether write caching is enabled
- Cache hit percentage
- RAID level
- Segment size
- Number of drives in the arrays or storage subsystem
- Fragmentation of files
- Logical drive modification priority

Note: Fragmentation affects logical drives with sequential I/O access patterns, not random I/O access patterns.

To determine whether your I/O has sequential characteristics, try enabling a conservative cache read-ahead multiplier (4, for example) using the Logical drive > Properties option, then examine the logical drive cache hit percentage to see if it has improved. An improvement indicates your I/O has a sequential pattern. Use the Windows 2000 Performance console (or Windows NT Performance Monitor) to determine the typical I/O size for a logical drive.

Higher write I/O rates are experienced with write caching enabled than with it disabled, especially for sequential I/O access patterns. Regardless of your I/O pattern, it is recommended that you enable write caching to maximize the I/O rate and shorten application response time.

Optimizing the cache hit percentage

A higher cache hit percentage is also desirable for optimal application performance and is positively correlated with the I/O request rate. If the cache hit percentage of all logical drives is low or trending downward, and you do not have the maximum amount of controller cache memory installed, this could indicate the need to install more memory.

If an individual logical drive is experiencing a low cache hit percentage, consider enabling cache read-ahead (or prefetch) for that logical drive. Cache read-ahead can increase the cache hit percentage for a sequential I/O workload. If cache read-ahead is enabled, the cache reads the data from the disk; in addition to the requested data, the cache also fetches more data, usually from adjacent data blocks on the drive. This feature increases the chance that a future request for data can be satisfied from the cache rather than requiring a disk access.

The cache read-ahead multiplier values specify the multiplier used to determine how many additional data blocks are read into the cache. Choosing a higher cache read-ahead multiplier can increase the cache hit percentage. If you have determined that your I/O has sequential characteristics, try enabling an aggressive cache read-ahead multiplier (8, for example) using the Logical drive > Properties option, then examine the logical drive cache hit percentage to see if it has improved. Continue to tune the logical drive cache read-ahead to arrive at the optimal multiplier. (In the case of a random I/O pattern, the optimal multiplier is zero.)

Choosing an appropriate RAID level

Use the read percentage for a logical drive to determine actual application behavior. Applications with a high read percentage do very well using RAID-5 logical drives because of the outstanding read performance of the RAID-5 configuration. However, applications with a low read percentage (write-intensive applications) do not perform as well on RAID-5 logical drives because of the way a controller writes data and redundancy data to the drives in a RAID-5 array. If there is a low percentage of read activity relative to write activity, you might consider changing the RAID level of an array from RAID-5 to RAID-1 for faster performance.

Choose an optimal logical drive modification priority

The modification priority defines how much processing time is allocated to logical drive modification operations versus system performance. The higher the priority, the faster logical drive modification operations complete, but the slower system I/O is serviced. Logical drive modification operations include reconstruction, copyback, initialization, media scan, defragmentation, change of RAID level, and change of segment size. The modification priority is set for each logical drive using a slider bar on the Logical drive > Properties dialog. There are five relative settings on the reconstruction rate slider bar, ranging from Low to Highest; the actual speed of each setting is determined by the controller. Choose the Low setting to maximize the I/O request rate. If the controller is idle (not servicing any I/O), it ignores the individual logical drive rate settings and processes logical drive modification operations as fast as possible.

Choosing an optimal segment size

A segment is the amount of data, in kilobytes, that the controller writes on a single drive in a logical drive before writing data on the next drive. With ServeRAID, this is called the stripe unit size or stripe size. Data blocks store 512 bytes of data and are the smallest units of storage. The size of a segment determines how many data blocks it contains; for example, an 8 KB segment holds 16 data blocks and a 64 KB segment holds 128 data blocks.

Note: The segment size was expressed as a number of data blocks in previous versions of this storage management software. It is now expressed in KB.
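To illustrate the block arithmetic just described, and the full-stripe sizing that matters for the large single-user I/O case discussed next, here is a small Python sketch (illustrative only; the four-drive example is hypothetical).

    BLOCK = 512          # bytes per data block
    KB = 1024

    def blocks_per_segment(segment_kb):
        """How many 512-byte data blocks fit in one segment."""
        return segment_kb * KB // BLOCK

    def full_stripe_kb(segment_kb, data_drives):
        """Data held by one full stripe: segment size times the number of
        drives in the array used for data."""
        return segment_kb * data_drives

    print(blocks_per_segment(8))     # ->  16 blocks in an 8 KB segment
    print(blocks_per_segment(64))    # -> 128 blocks in a 64 KB segment

    # With a 64 KB segment and four data drives (hypothetical array), one full
    # stripe holds 256 KB, so a single 256 KB request touches each disk once.
    print(full_stripe_kb(64, 4))     # -> 256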

When you create a logical drive, the default segment size is a good choice for the expected logical drive usage. The default segment size can be changed using the Logical drive > Change Segment Size option. If your typical I/O size is larger than your segment size, increase the segment size in order to minimize the number of drives needed to satisfy an I/O request. If you are using the logical drive in a single-user, large-I/O environment such as multimedia application storage, performance is optimized when a single I/O request can be serviced with a single array data stripe (the segment size multiplied by the number of drives in the array used for I/O). In this case, multiple disks are used for the same request, but each disk is accessed only once.

Minimize disk accesses by defragmentation

Each access of the drive to read or write a file results in spinning of the drive platters and movement of the read/write heads. Make sure the files on your array are defragmented. When the files are defragmented, the data blocks making up each file are next to each other, so the read/write heads do not have to travel all over the disk to retrieve the separate parts of the file. Fragmented files are detrimental to the performance of a logical drive with sequential I/O access patterns.

7.9 Disk subsystem rules of thumb

A performance relationship can be developed for the disk subsystem. This relationship is based upon the RAID strategy, the number of drives, and the disk drive model. The disk subsystem rules of thumb are stated in Table 18.

Table 18. Disk subsystem rules of thumb

  Performance of this configuration   Is equivalent to
  RAID-0                              % more throughput than RAID-1 (same number of drives)
  RAID-1E                             33-50% more throughput than RAID-5 (same number of drives)
  RAID-5E                             10-20% more throughput than RAID-5
  Doubling the number of drives       50% increase in drive throughput (until the disk controller becomes a bottleneck)
  One 10,000 RPM drive                % improvement over 7200 RPM drives (50% when considering RPM only, 10% when comparing with 7200 RPM drives with rotational positioning optimization)
  Ultra2 SCSI                         5-10% more throughput than Ultra SCSI for typical server environments
  Ultra3 SCSI                         5-10% more throughput than Ultra2 SCSI for typical server environments
  Single logical drive                25% increase in throughput compared to a multiple logical drive configuration

A ServeRAID-3HB can support:
- Up to about 30 10K RPM drives before a performance bottleneck occurs.
- Up to about RPM drives before a performance bottleneck occurs.

A ServeRAID-4H can support:
- Up to about 60 10K RPM drives before a performance bottleneck occurs.
- Up to about RPM drives before a performance bottleneck occurs.
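The rules of thumb in Table 18 can be chained together for quick capacity-planning estimates. The sketch below is purely illustrative: the 100 ops/sec baseline is an arbitrary number, and mid-points of the quoted ranges are used where the table gives a range.

    # Sketch: chain some Table 18 rules of thumb into a quick estimator.
    baseline_raid5 = 100.0           # hypothetical RAID-5 throughput, ops/sec

    raid1e = baseline_raid5 * 1.40   # RAID-1E: 33-50% more than RAID-5 (mid-point ~40%)
    raid5e = baseline_raid5 * 1.15   # RAID-5E: 10-20% more than RAID-5 (mid-point ~15%)

    doubled = raid5e * 1.50          # doubling the drives: about 50% more throughput,
                                     # until the controller itself becomes the bottleneck

    print(f"RAID-1E estimate:            {raid1e:.0f} ops/sec")
    print(f"RAID-5E estimate:            {raid5e:.0f} ops/sec")
    print(f"RAID-5E with drives doubled: {doubled:.0f} ops/sec")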


More information

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1 Introduction - 8.1 I/O Chapter 8 Disk Storage and Dependability 8.2 Buses and other connectors 8.4 I/O performance measures 8.6 Input / Ouput devices keyboard, mouse, printer, game controllers, hard drive,

More information

technology brief RAID Levels March 1997 Introduction Characteristics of RAID Levels

technology brief RAID Levels March 1997 Introduction Characteristics of RAID Levels technology brief RAID Levels March 1997 Introduction RAID is an acronym for Redundant Array of Independent Disks (originally Redundant Array of Inexpensive Disks) coined in a 1987 University of California

More information

RAID Technology. RAID Overview

RAID Technology. RAID Overview Technology In the 1980s, hard-disk drive capacities were limited and large drives commanded a premium price. As an alternative to costly, high-capacity individual drives, storage system developers began

More information

RAID technology and IBM TotalStorage NAS products

RAID technology and IBM TotalStorage NAS products IBM TotalStorage Network Attached Storage October 2001 RAID technology and IBM TotalStorage NAS products By Janet Anglin and Chris Durham Storage Networking Architecture, SSG Page No.1 Contents 2 RAID

More information

Benefits of Intel Matrix Storage Technology

Benefits of Intel Matrix Storage Technology Benefits of Intel Matrix Storage Technology White Paper December 2005 Document Number: 310855-001 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,

More information

RAID Level Descriptions. RAID 0 (Striping)

RAID Level Descriptions. RAID 0 (Striping) RAID Level Descriptions RAID 0 (Striping) Offers low cost and maximum performance, but offers no fault tolerance; a single disk failure results in TOTAL data loss. Businesses use RAID 0 mainly for tasks

More information

RAID Overview: Identifying What RAID Levels Best Meet Customer Needs. Diamond Series RAID Storage Array

RAID Overview: Identifying What RAID Levels Best Meet Customer Needs. Diamond Series RAID Storage Array ATTO Technology, Inc. Corporate Headquarters 155 Crosspoint Parkway Amherst, NY 14068 Phone: 716-691-1999 Fax: 716-691-9353 www.attotech.com [email protected] RAID Overview: Identifying What RAID Levels

More information

RAID 5 rebuild performance in ProLiant

RAID 5 rebuild performance in ProLiant RAID 5 rebuild performance in ProLiant technology brief Abstract... 2 Overview of the RAID 5 rebuild process... 2 Estimating the mean-time-to-failure (MTTF)... 3 Factors affecting RAID 5 array rebuild

More information

Distribution One Server Requirements

Distribution One Server Requirements Distribution One Server Requirements Introduction Welcome to the Hardware Configuration Guide. The goal of this guide is to provide a practical approach to sizing your Distribution One application and

More information

RAID EzAssist Configuration Utility Quick Configuration Guide

RAID EzAssist Configuration Utility Quick Configuration Guide RAID EzAssist Configuration Utility Quick Configuration Guide DB15-000277-00 First Edition 08P5520 Proprietary Rights Notice This document contains proprietary information of LSI Logic Corporation. The

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Sistemas Operativos: Input/Output Disks

Sistemas Operativos: Input/Output Disks Sistemas Operativos: Input/Output Disks Pedro F. Souto ([email protected]) April 28, 2012 Topics Magnetic Disks RAID Solid State Disks Topics Magnetic Disks RAID Solid State Disks Magnetic Disk Construction

More information

Intel RAID Software User s Guide:

Intel RAID Software User s Guide: Intel RAID Software User s Guide: Intel Embedded Server RAID Technology II Intel Integrated Server RAID Intel RAID Controllers using the Intel RAID Software Stack 3 Revision 9.0 December, 2008 Intel Order

More information

Outline. Database Management and Tuning. Overview. Hardware Tuning. Johann Gamper. Unit 12

Outline. Database Management and Tuning. Overview. Hardware Tuning. Johann Gamper. Unit 12 Outline Database Management and Tuning Hardware Tuning Johann Gamper 1 Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 12 2 3 Conclusion Acknowledgements: The slides are provided

More information

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer Disks and RAID Profs. Bracy and Van Renesse based on slides by Prof. Sirer 50 Years Old! 13th September 1956 The IBM RAMAC 350 Stored less than 5 MByte Reading from a Disk Must specify: cylinder # (distance

More information

Intel RAID Software User s Guide:

Intel RAID Software User s Guide: Intel RAID Software User s Guide: Intel Embedded Server RAID Technology II Intel Integrated Server RAID Intel RAID Controllers using the Intel RAID Software Stack 3 Revision 11.0 July, 2009 Intel Order

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Mass Storage Structure

Mass Storage Structure Mass Storage Structure 12 CHAPTER Practice Exercises 12.1 The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast, floppy disks (and many hard disks manufactured before

More information

WebBIOS Configuration Utility Guide

WebBIOS Configuration Utility Guide Dell PowerEdge Expandable RAID Controller 3/QC, 3/DC, 3/DCL and 3/SC WebBIOS Configuration Utility Guide www.dell.com support.dell.com Information in this document is subject to change without notice.

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer. Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds

More information

Lecture 36: Chapter 6

Lecture 36: Chapter 6 Lecture 36: Chapter 6 Today s topic RAID 1 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for

More information

Dependable Systems. 9. Redundant arrays of. Prof. Dr. Miroslaw Malek. Wintersemester 2004/05 www.informatik.hu-berlin.de/rok/zs

Dependable Systems. 9. Redundant arrays of. Prof. Dr. Miroslaw Malek. Wintersemester 2004/05 www.informatik.hu-berlin.de/rok/zs Dependable Systems 9. Redundant arrays of inexpensive disks (RAID) Prof. Dr. Miroslaw Malek Wintersemester 2004/05 www.informatik.hu-berlin.de/rok/zs Redundant Arrays of Inexpensive Disks (RAID) RAID is

More information

SCSI vs. Fibre Channel White Paper

SCSI vs. Fibre Channel White Paper SCSI vs. Fibre Channel White Paper 08/27/99 SCSI vs. Fibre Channel Over the past decades, computer s industry has seen radical change in key components. Limitations in speed, bandwidth, and distance have

More information

UK HQ RAID Chunk Size T F www.xyratex.com ISO 14001

UK HQ RAID Chunk Size T F www.xyratex.com ISO 14001 RAID Chunk Size Notices The information in this document is subject to change without notice. While every effort has been made to ensure that all information in this document is accurate, Xyratex accepts

More information

How To Create A Multi Disk Raid

How To Create A Multi Disk Raid Click on the diagram to see RAID 0 in action RAID Level 0 requires a minimum of 2 drives to implement RAID 0 implements a striped disk array, the data is broken down into blocks and each block is written

More information

CS 6290 I/O and Storage. Milos Prvulovic

CS 6290 I/O and Storage. Milos Prvulovic CS 6290 I/O and Storage Milos Prvulovic Storage Systems I/O performance (bandwidth, latency) Bandwidth improving, but not as fast as CPU Latency improving very slowly Consequently, by Amdahl s Law: fraction

More information

Price/performance Modern Memory Hierarchy

Price/performance Modern Memory Hierarchy Lecture 21: Storage Administration Take QUIZ 15 over P&H 6.1-4, 6.8-9 before 11:59pm today Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Last Time Exam discussion

More information

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections Chapter 6 Storage and Other I/O Topics 6.1 Introduction I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Chapter 9: Peripheral Devices: Magnetic Disks

Chapter 9: Peripheral Devices: Magnetic Disks Chapter 9: Peripheral Devices: Magnetic Disks Basic Disk Operation Performance Parameters and History of Improvement Example disks RAID (Redundant Arrays of Inexpensive Disks) Improving Reliability Improving

More information

Using RAID6 for Advanced Data Protection

Using RAID6 for Advanced Data Protection Using RAI6 for Advanced ata Protection 2006 Infortrend Corporation. All rights reserved. Table of Contents The Challenge of Fault Tolerance... 3 A Compelling Technology: RAI6... 3 Parity... 4 Why Use RAI6...

More information

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations A Dell Technical White Paper Database Solutions Engineering By Sudhansu Sekhar and Raghunatha

More information

HARDWARE GUIDE. MegaRAID SCSI 320-2 RAID Controller

HARDWARE GUIDE. MegaRAID SCSI 320-2 RAID Controller HARDWARE GUIDE MegaRAID SCSI 320-2 RAID Controller November 2002 This document contains proprietary information of LSI Logic Corporation. The information contained herein is not to be used by or disclosed

More information

QuickSpecs. HP Smart Array 5312 Controller. Overview

QuickSpecs. HP Smart Array 5312 Controller. Overview Overview Models 238633-B21 238633-291 (Japan) Feature List: High Performance PCI-X Architecture High Capacity Two Ultra 3 SCSI channels support up to 28 drives Modular battery-backed cache design 128 MB

More information

Chapter 12: Mass-Storage Systems

Chapter 12: Mass-Storage Systems Chapter 12: Mass-Storage Systems Chapter 12: Mass-Storage Systems Overview of Mass Storage Structure Disk Structure Disk Attachment Disk Scheduling Disk Management Swap-Space Management RAID Structure

More information

Optimizing LTO Backup Performance

Optimizing LTO Backup Performance Optimizing LTO Backup Performance July 19, 2011 Written by: Ash McCarty Contributors: Cedrick Burton Bob Dawson Vang Nguyen Richard Snook Table of Contents 1.0 Introduction... 3 2.0 Host System Configuration...

More information

Assessing RAID ADG vs. RAID 5 vs. RAID 1+0

Assessing RAID ADG vs. RAID 5 vs. RAID 1+0 White Paper October 2001 Prepared by Industry Standard Storage Group Compaq Computer Corporation Contents Overview...3 Defining RAID levels...3 Evaluating RAID levels...3 Choosing a RAID level...4 Assessing

More information

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/)

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/) COS 318: Operating Systems Storage Devices Kai Li Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Today s Topics Magnetic disks Magnetic disk performance

More information

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant

More information

Storage Technologies for Video Surveillance

Storage Technologies for Video Surveillance The surveillance industry continues to transition from analog to digital. This transition is taking place on two fronts how the images are captured and how they are stored. The way surveillance images

More information

RAID Performance Analysis

RAID Performance Analysis RAID Performance Analysis We have six 500 GB disks with 8 ms average seek time. They rotate at 7200 RPM and have a transfer rate of 20 MB/sec. The minimum unit of transfer to each disk is a 512 byte sector.

More information

RAID Technology Overview

RAID Technology Overview RAID Technology Overview HP Smart Array RAID Controllers HP Part Number: J6369-90050 Published: September 2007 Edition: 1 Copyright 2007 Hewlett-Packard Development Company L.P. Legal Notices Copyright

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

HARDWARE GUIDE. MegaRAID SCSI 320-0 Zero-Channel RAID Controller

HARDWARE GUIDE. MegaRAID SCSI 320-0 Zero-Channel RAID Controller HARDWARE GUIDE MegaRAID SCSI 320-0 Zero-Channel RAID Controller September 2002 This document contains proprietary information of LSI Logic Corporation. The information contained herein is not to be used

More information

HARDWARE GUIDE. MegaRAID SCSI 320-1 RAID Controller

HARDWARE GUIDE. MegaRAID SCSI 320-1 RAID Controller HARDWARE GUIDE MegaRAID SCSI 320-1 RAID Controller September 2002 This document contains proprietary information of LSI Logic Corporation. The information contained herein is not to be used by or disclosed

More information

RAID 6 with HP Advanced Data Guarding technology:

RAID 6 with HP Advanced Data Guarding technology: RAID 6 with HP Advanced Data Guarding technology: a cost-effective, fault-tolerant solution technology brief Abstract... 2 Introduction... 2 Functions and limitations of RAID schemes... 3 Fault tolerance

More information

3PAR Fast RAID: High Performance Without Compromise

3PAR Fast RAID: High Performance Without Compromise 3PAR Fast RAID: High Performance Without Compromise Karl L. Swartz Document Abstract: 3PAR Fast RAID allows the 3PAR InServ Storage Server to deliver higher performance with less hardware, reducing storage

More information

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Chapter 2: Computer-System Structures Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Operating System Concepts 2.1 Computer-System Architecture

More information

Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI. Article for InfoStor November 2003 Paul Griffith Adaptec, Inc.

Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI. Article for InfoStor November 2003 Paul Griffith Adaptec, Inc. Filename: SAS - PCI Express Bandwidth - Infostor v5.doc Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI Article for InfoStor November 2003 Paul Griffith Adaptec, Inc. Server

More information

Getting Started With RAID

Getting Started With RAID Dell Systems Getting Started With RAID www.dell.com support.dell.com Notes, Notices, and Cautions NOTE: A NOTE indicates important information that helps you make better use of your computer. NOTICE: A

More information

Chapter 13 Selected Storage Systems and Interface

Chapter 13 Selected Storage Systems and Interface Chapter 13 Selected Storage Systems and Interface Chapter 13 Objectives Appreciate the role of enterprise storage as a distinct architectural entity. Expand upon basic I/O concepts to include storage protocols.

More information

Introduction. What is RAID? The Array and RAID Controller Concept. Click here to print this article. Re-Printed From SLCentral

Introduction. What is RAID? The Array and RAID Controller Concept. Click here to print this article. Re-Printed From SLCentral Click here to print this article. Re-Printed From SLCentral RAID: An In-Depth Guide To RAID Technology Author: Tom Solinap Date Posted: January 24th, 2001 URL: http://www.slcentral.com/articles/01/1/raid

More information

Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server

Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server Symantec Backup Exec 10d System Sizing Best Practices For Optimizing Performance of the Continuous Protection Server Table of Contents Table of Contents...2 Executive Summary...3 System Sizing and Performance

More information

Models Smart Array 6402A/128 Controller 3X-KZPEC-BF Smart Array 6404A/256 two 2 channel Controllers

Models Smart Array 6402A/128 Controller 3X-KZPEC-BF Smart Array 6404A/256 two 2 channel Controllers Overview The SA6400A is a high-performance Ultra320, PCI-X array controller. It provides maximum performance, flexibility, and reliable data protection for HP OpenVMS AlphaServers through its unique modular

More information

Database Management Systems

Database Management Systems 4411 Database Management Systems Acknowledgements and copyrights: these slides are a result of combination of notes and slides with contributions from: Michael Kiffer, Arthur Bernstein, Philip Lewis, Anestis

More information

Introduction to I/O and Disk Management

Introduction to I/O and Disk Management Introduction to I/O and Disk Management 1 Secondary Storage Management Disks just like memory, only different Why have disks? Memory is small. Disks are large. Short term storage for memory contents (e.g.,

More information

PIONEER RESEARCH & DEVELOPMENT GROUP

PIONEER RESEARCH & DEVELOPMENT GROUP SURVEY ON RAID Aishwarya Airen 1, Aarsh Pandit 2, Anshul Sogani 3 1,2,3 A.I.T.R, Indore. Abstract RAID stands for Redundant Array of Independent Disk that is a concept which provides an efficient way for

More information

Enterprise-class versus Desktopclass

Enterprise-class versus Desktopclass Enterprise-class versus Desktopclass Hard Drives April, 2008 Enterprise Platforms and Services Division - Marketing Revision History Date Revision Number April, 2008 1.0 Initial Release Modifications Disclaimers

More information

HARD DRIVE CHARACTERISTICS REFRESHER

HARD DRIVE CHARACTERISTICS REFRESHER The read/write head of a hard drive only detects changes in the magnetic polarity of the material passing beneath it, not the direction of the polarity. Writes are performed by sending current either one

More information

William Stallings Computer Organization and Architecture 7 th Edition. Chapter 6 External Memory

William Stallings Computer Organization and Architecture 7 th Edition. Chapter 6 External Memory William Stallings Computer Organization and Architecture 7 th Edition Chapter 6 External Memory Types of External Memory Magnetic Disk RAID Removable Optical CD-ROM CD-Recordable (CD-R) CD-R/W DVD Magnetic

More information

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage Applied Technology Abstract This white paper describes various backup and recovery solutions available for SQL

More information

Best Practices RAID Implementations for Snap Servers and JBOD Expansion

Best Practices RAID Implementations for Snap Servers and JBOD Expansion STORAGE SOLUTIONS WHITE PAPER Best Practices RAID Implementations for Snap Servers and JBOD Expansion Contents Introduction...1 Planning for the End Result...1 Availability Considerations...1 Drive Reliability...2

More information

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems 1 Any modern computer system will incorporate (at least) two levels of storage: primary storage: typical capacity cost per MB $3. typical access time burst transfer rate?? secondary storage: typical capacity

More information

Best practices for Implementing Lotus Domino in a Storage Area Network (SAN) Environment

Best practices for Implementing Lotus Domino in a Storage Area Network (SAN) Environment Best practices for Implementing Lotus Domino in a Storage Area Network (SAN) Environment With the implementation of storage area networks (SAN) becoming more of a standard configuration, this paper describes

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

Difference between Enterprise SATA HDDs and Desktop HDDs. Difference between Enterprise Class HDD & Desktop HDD

Difference between Enterprise SATA HDDs and Desktop HDDs. Difference between Enterprise Class HDD & Desktop HDD In order to fulfil the operational needs, different web hosting providers offer different models of hard drives. While some web hosts provide Enterprise HDDs, which although comparatively expensive, offer

More information

RAID Levels and Components Explained Page 1 of 23

RAID Levels and Components Explained Page 1 of 23 RAID Levels and Components Explained Page 1 of 23 What's RAID? The purpose of this document is to explain the many forms or RAID systems, and why they are useful, and their disadvantages. RAID - Redundant

More information

Data Storage - I: Memory Hierarchies & Disks

Data Storage - I: Memory Hierarchies & Disks Data Storage - I: Memory Hierarchies & Disks W7-C, Spring 2005 Updated by M. Naci Akkøk, 27.02.2004 and 23.02.2005, based upon slides by Pål Halvorsen, 11.3.2002. Contains slides from: Hector Garcia-Molina,

More information

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Chapter 13 Disk Storage, Basic File Structures, and Hashing. Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files

More information

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1 Slide 13-1 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible

More information

IncidentMonitor Server Specification Datasheet

IncidentMonitor Server Specification Datasheet IncidentMonitor Server Specification Datasheet Prepared by Monitor 24-7 Inc October 1, 2015 Contact details: [email protected] North America: +1 416 410.2716 / +1 866 364.2757 Europe: +31 088 008.4600

More information

Intel RAID Software User s Guide:

Intel RAID Software User s Guide: Intel RAID Software User s Guide: Intel Embedded Server RAID Technology II Intel Integrated Server RAID Intel RAID Controllers using the Intel RAID Software Stack 3 July, 2007 Intel Order Number: D29305-005

More information

BrightStor ARCserve Backup for Windows

BrightStor ARCserve Backup for Windows BrightStor ARCserve Backup for Windows Tape RAID Option Guide r11.5 D01183-1E This documentation and related computer software program (hereinafter referred to as the "Documentation") is for the end user's

More information

How To Understand And Understand The Power Of Aird 6 On Clariion

How To Understand And Understand The Power Of Aird 6 On Clariion A Detailed Review Abstract This white paper discusses the EMC CLARiiON RAID 6 implementation available in FLARE 26 and later, including an overview of RAID 6 and the CLARiiON-specific implementation, when

More information