Understanding the root cause of the I/O bottleneck

November 2010
Introduction

Many data centers have performance bottlenecks that impact application performance and service delivery to users. These bottlenecks exist across data center locations, including servers (application, web, file, email, and database), networks, application software, and storage systems, as shown in Figure 1. Resolving performance problems is challenging and requires the analysis and understanding of complex, interdependent system environments. Server bottlenecks due to lack of CPU processing power, memory, or undersized I/O interfaces can result in poor performance or, in worst-case scenarios, application instability.

Figure 1: Data center performance bottleneck locations

Application bottlenecks due to excessive database locking, poor query design, and data contention result in poor user response time. Storage and I/O performance bottlenecks can occur due to lack of I/O interconnect bandwidth, storage device
contention, and lack of available storage system I/O throughput and response time.

Impact On Application Performance

These areas, and I/O performance bottlenecks in particular, affect most enterprise applications. Many applications across different industries are sensitive to timely data access and are impacted by common data center performance bottlenecks. For example, as more users access a popular file, database table, or other stored data item, resource contention increases. One way resource contention manifests itself is in the form of database deadlock, which translates into slower response times and lost productivity. Given the rise and popularity of internet search engines and online price shopping, some businesses have been forced to create expensive read-only copies of databases. These read-only copies are used to support additional queries and prevent the extra workload from impacting time-sensitive transaction databases.

The direct impact of data center performance bottlenecks includes:

- Additional IT staff attention to troubleshoot, analyze, reconfigure, and react to application delays and service disruptions
- Poor quality of service (QoS), causing missed service level agreements (SLAs)
- Premature infrastructure upgrades combined with increased management and operating costs
- Inability to meet peak and seasonal workload demands, resulting in lost business opportunities

The indirect impact of data center I/O performance issues includes:

- General slowing of systems and applications
- Lost productivity for users of IT services
- Lost business and unhappy customers
I/O Performance Metrics: Response Time And Throughput

There are two main I/O performance metrics: I/O response time and I/O throughput. I/O response time is the time it takes from initiating an I/O operation through its completion. I/O throughput refers to the amount of data (number of bytes) transferred per unit of time. Depending on the application profile, one of these metrics becomes more relevant and causes the main bottleneck. Applications such as databases with many small I/O transactions are sensitive to response time issues, while applications with large I/O operations are prone to suffer from an I/O throughput bottleneck, as shown in Figure 2.

Figure 2: I/O performance metrics and impact on typical applications

The I/O throughput problem frequently delays image and file access, resulting in lost productivity. The I/O response time problem leads to application and database contention, including deadlock conditions, due to slow transactions.
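To make the two metrics concrete, the short Python sketch below relates them for a simple workload. It is illustrative only; the IOPS figures and I/O sizes are assumptions for demonstration, not measurements from this paper.

    # Illustrative only: relating the two I/O metrics with assumed workload numbers.

    def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
        """Throughput = completed I/O operations per second * size of each I/O."""
        return iops * io_size_kb / 1024.0

    def avg_response_time_ms(total_latency_s: float, completed_ios: int) -> float:
        """Average response time = total time spent completing I/Os / number of I/Os."""
        return total_latency_s / completed_ios * 1000.0

    # A transactional database profile: many small I/Os, response-time sensitive.
    print(throughput_mb_per_s(iops=5000, io_size_kb=8))     # ~39 MB/s
    # An imaging/file-serving profile: fewer, larger I/Os, throughput sensitive.
    print(throughput_mb_per_s(iops=200, io_size_kb=1024))   # ~200 MB/s
    # If those 5,000 small I/Os together took 12.5 seconds, each averaged 2.5 ms.
    print(avg_response_time_ms(total_latency_s=12.5, completed_ios=5000))  # 2.5 ms

The point of the comparison is that a latency-sensitive database can hit its response time limit long before it approaches the raw bandwidth that a large-block workload would achieve on the same storage.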
Server-Storage Performance Gap

In the future this problem will worsen exponentially because of the Server-Storage Performance Gap. Historically, different computer system components have advanced at different relative rates. Although disk capacity has improved, disk performance ranks at the bottom, with no significant improvement compared to the million-fold boosts of other system components. Figure 3 outlines the varying growth rates of CPU and disk performance.

Figure 3: Server-Storage Performance Gap

For example, CPU performance has progressed at an impressive clip, driven by Moore's Law, multi-core processors, and threading technology, increasing 2,000,000 times since 1987. In comparison, disk performance has improved only 11 times. The net impact is that bottlenecks associated with this server-to-storage performance shortfall result in lost productivity for IT personnel and customers, who must wait for transactions, queries, and data access requests to be resolved. This has created a significant and growing Server-Storage Performance Gap, a widespread issue across most data centers that affects many applications and industries.
The Root Cause: Disk Drive Shortfall

The root cause of the server-storage performance gap is the mechanical process of accessing disk data. Moving physical parts (rotating the magnetic platter and repositioning the actuator) introduces significant delay, or latency. As additional activity or application workload increases, subsequent I/O requests are put on hold, causing the I/O request queue shown in Figure 4.

Figure 4: Disk Drive Shortfall: Disk Latency and Queue Wait Time

There are two primary disk access problems: disk latency and queue wait time.

1. Disk latency: for each disk access, the magnetic platter has to rotate and the actuator has to seek to the requested data block.
2. Queue wait time: a mechanical disk serves accesses one at a time, so additional I/O requests lead to the development of an I/O request queue.
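The sketch below puts rough numbers on these two delays. The drive characteristics (a 7,200 RPM spindle with an 8.5 ms average seek) and the queue depths are assumptions chosen for illustration, not figures from this paper.

    # Illustrative only: assumed drive characteristics, not measured values.

    def rotational_latency_ms(rpm: float) -> float:
        """On average the platter must rotate half a revolution to reach the data."""
        return 0.5 * 60_000.0 / rpm

    def disk_latency_ms(rpm: float, avg_seek_ms: float) -> float:
        """1. Disk latency: actuator seek time plus rotational delay for each access."""
        return avg_seek_ms + rotational_latency_ms(rpm)

    def queue_wait_ms(requests_ahead: int, per_access_ms: float) -> float:
        """2. Queue wait time: the disk serves one request at a time, so every
        request already queued adds a full access latency of waiting."""
        return requests_ahead * per_access_ms

    per_access = disk_latency_ms(rpm=7200, avg_seek_ms=8.5)                         # ~12.7 ms
    print(per_access + queue_wait_ms(requests_ahead=0, per_access_ms=per_access))   # idle disk: ~12.7 ms
    print(per_access + queue_wait_ms(requests_ahead=8, per_access_ms=per_access))   # busy disk: ~114 ms

Even a modest queue multiplies the effective response time, which is why the curves in the following sections climb so steeply once a disk becomes busy.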
As more workload is added to a system with existing I/O issues, response time will correspondingly increase, as shown in Figure 5. The more severe the bottleneck, the faster response time will deteriorate (i.e., increase) beyond acceptable levels.

Figure 5: Effect of the Disk Drive Shortfall on Response Times
The Disk Drive Shortfall Creates The I/O Bottleneck

With most performance metrics, more is better; however, in the case of response time or latency, less is better. Figure 6 shows the impact of additional workload resulting in I/O bottlenecks that negatively impact performance by increasing response times (grey curve) above acceptable levels. The specific acceptable response time threshold will vary by application and SLA requirements. The acceptable threshold level, based on performance plans, testing, SLAs, and real-world experience, serves as the dividing line between acceptable and poor application performance.

Figure 6: Response Times Compared to Throughput

As more workload is added to a system with existing I/O issues, response times correspondingly increase. The more severe the bottleneck, the faster response times deteriorate (i.e., increase) beyond acceptable levels. Eliminating bottlenecks enables more work to be performed while keeping response times below acceptable service level threshold limits.
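The shape of the curve in Figure 6 can be approximated with a textbook single-server (M/M/1) queueing model; this is a simplified stand-in for the paper's argument, and the service time and SLA threshold below are assumed values, not figures from the paper.

    # Simplified M/M/1 approximation of response time versus offered workload.
    # SERVICE_MS and SLA_THRESHOLD_MS are assumptions for illustration.

    SERVICE_MS = 12.7        # per-access disk service time (assumed)
    SLA_THRESHOLD_MS = 30.0  # acceptable response time limit (assumed)

    def response_time_ms(offered_iops: float) -> float:
        utilization = offered_iops * SERVICE_MS / 1000.0
        if utilization >= 1.0:
            return float("inf")  # past the knee of the curve, the queue grows without bound
        return SERVICE_MS / (1.0 - utilization)

    for iops in (10, 30, 50, 60, 70, 75):
        rt = response_time_ms(iops)
        status = "OK" if rt <= SLA_THRESHOLD_MS else "exceeds SLA"
        print(f"{iops:3d} IOPS -> {rt:6.1f} ms  ({status})")

Response time stays nearly flat at low utilization and then climbs steeply as the disk approaches saturation, matching the grey curve crossing the acceptable threshold in Figure 6.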
A Makeshift Approach Is Insufficient

The various approaches to address I/O bottlenecks range from doing nothing (incurring and dealing with the service disruptions) to over-provisioning by throwing more hardware and software at the problem. A makeshift approach to compensate for lack of I/O performance, and to counter the resulting negative impact on IT users, is to add more hardware to mask or move the problem. The simple idea of cutting the I/O queue in half by adding another disk does not work, because it does not change the per-access disk latency, which is the root cause. However, it often leads to extra storage capacity being added to make up for a shortfall in I/O performance. By over-configuring to support peak workloads and prevent loss of business revenue, excess storage capacity must be managed throughout the non-peak periods, adding to data center and management costs. The resulting ripple effect is that more storage now needs to be managed, including allocating storage network ports, configuring, tuning, and backing up data.

Conclusions

The Server-Storage Performance Gap is caused by the shortfall of disk drives and worsens exponentially every year. Specifically, I/O operations per unit of capacity are decreasing, a bad sign compared to the massive performance improvements elsewhere in the data center. Today, however, most approaches are makeshift, based on adding more hardware or addressing only bandwidth or throughput issues. These approaches do not address the Server-Storage Performance Gap but rather move or hide the bottleneck. They do not help applications that depend on low response times as workload, including throughput, increases. The key to removing data center I/O bottlenecks is to address the problem directly instead of simply moving or hiding it with more hardware. Eliminating the root cause (slow I/O response times), rather than just improving I/O throughput, is the only way out of the vicious circle.
Violin Memory accelerates storage and delivers real-time application performance with vcache NFS caching. Deployed in the data center, Violin Memory vcache caching systems provide scalable and transparent acceleration for existing storage infrastructures to speed up applications, eliminate peak load disruptions, and simplify enterprise configurations.

© 2010 Violin Memory. All rights reserved. All other trademarks and copyrights are property of their respective owners. Information provided in this paper may be subject to change. For more information, visit www.violin-memory.com

Contact Violin
Violin Memory, Inc. USA
2700 Garcia Ave, Suite 100, Mountain View, CA 94043
33 Wood Ave South, 3rd Floor, Iselin, NJ 08830
(888) 9-VIOLIN Ext 10 or (888) 984-6546 Ext 10
Email: sales@violin-memory.com
www.violin-memory.com