don't blame disks for every storage subsystem failure
Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky

Weihang Jiang is a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is particularly interested in system mining, which applies data mining techniques to improve system performance, dependability, and manageability. [email protected]

Chongfeng Hu spent his college years at Peking University, China, and is currently a graduate student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His interests include operating systems, software reliability, storage systems reliability, and data mining. [email protected]

Yuanyuan Zhou is an associate professor at the University of Illinois, Urbana-Champaign. Her research spans the areas of operating systems, storage systems, software reliability, and power management. [email protected]

Arkady Kanevsky is a senior research engineer in the NetApp Advanced Technology Group. Arkady has done extensive research on RDMA technology, storage resiliency, scalable storage systems, and parallel and distributed computing. He received a PhD in computer science from the University of Illinois and was a faculty member at Dartmouth College and Texas A&M University before joining industry. Arkady has written over 60 publications and chairs the DAT Collaborative and MPI-RT standards. [email protected]

Trademark Notice: NetApp and the Network Appliance logo are registered trademarks and Network Appliance is a trademark of Network Appliance, Inc., in the U.S. and other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.

Disks are the key components of storage systems. Researchers at CMU and NetApp have demonstrated a trend of an increasing gap between the capacity of individual disks and disk access time [11]; hence the probability that a secondary failure happens during RAID reconstruction becomes too high for comfort. This led to RAID-6 [4] and RAID-DP [5]. Interestingly, even though disk reliability is critical to storage system reliability, we found that disks themselves were not the component most likely to fail in storage systems. In this work, we looked at other components in storage systems beyond disks to answer the following questions:

- Are disk failures the main source of storage system failures?
- Can enterprise disks help to build a more reliable storage system than SATA disks?
- Are disk failures independent? What about other storage component failures?
- Are techniques that went into RAID design, such as redundant interconnects between disks and storage controllers, really helpful in increasing the reliability of storage systems?

Reliability is a critically important issue for storage systems because storage failures not only can cause service downtime but can also lead to data loss. Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems grows to an unprecedented level. For example, the EMC Symmetrix DMX-4 can be configured with up to 2,400 disks [6], the Google File System cluster is composed of 1,000 storage nodes [7], and the NetApp FAS6000 series can support more than 1,000 disks per node, with up to 24 nodes in a system [9]. To make things even worse, disks are not the only component in storage systems.
To connect and access disks, modern storage systems also contain many other components, including shelf enclosures, cables and host adapters, and complex software protocol stacks. Failures in these components can lead to downtime and/or data loss for the storage system. Hence, in complex storage systems, component failures are very common and critical to storage system reliability. Although we are interested in failures of the whole storage system, this study concentrates on its core part: the storage subsystem, which contains
disks and all components providing connectivity and usage of disks to the entire storage system.

We analyzed the NetApp AutoSupport logs collected from about 39,000 storage systems commercially deployed at various customer sites. The data set covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. Our study reveals many interesting findings, providing useful guidelines for designing reliable storage systems. Some of our major findings include:

- Physical interconnect failures make up the largest part (27%–68%) of storage subsystem failures, and disk failures make up the second largest part (19%–56%). Choices of disk types, shelf enclosure models, and other components of storage subsystems contribute to the variability.
- Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlations.
- Storage subsystems configured with redundant interconnects experience 30%–40% lower failure rates than those with a single interconnect.

Data on latent sector errors from the same AutoSupport database was first analyzed by Bairavasundaram et al. [2], and data on data corruption was further analyzed by Bairavasundaram et al. [3].

Background

In this section, we describe the typical architecture of the NetApp storage systems we study, the definitions and terminology used in this article, and the source of the data studied in this work.

storage system architecture

Figure 1 shows the typical architecture of a NetApp storage system node. A NetApp storage system can be composed of several storage system nodes.

Figure 1: Storage system node architecture

From the customer's perspective, a storage system is a virtual device that is attached to customers' systems and provides the desired storage capacity with high reliability, good performance, and flexible management. Looking from the inside, a storage system node is composed of storage subsystems, resiliency mechanisms, a storage head/controller, and other
higher-level system layers. The storage subsystem is the core part of a storage system node and provides connectivity and usage of disks to the entire storage system node. It contains various components, including disks, shelf enclosures, cables and host adapters, and complex software protocol stacks. Shelf enclosures provide a power supply, a cooling service, and a prewired backplane for the disks mounted in them. Cables initiated from host adapters connect one or multiple shelf enclosures to the network. Each shelf enclosure can optionally be connected to a secondary network for redundancy. In the Results section we will show the impact of this redundancy mechanism on failures of the storage subsystem. Usually, on top of the storage subsystem, resiliency mechanisms, such as RAID, are used to tolerate failures in storage subsystems.

Figure 2: I/O request path in a storage subsystem

terminology

We use the following terms in this article:

- Disk family: A particular disk product. The same product may be offered in different capacities. For example, Seagate Cheetah 10k.7 is a disk family.
- Disk model: The combination of a disk family and a particular disk capacity. For example, a Seagate Cheetah 10k.7 of a particular capacity is a disk model. For disk family and disk model, we use the same naming convention as in Bairavasundaram et al. [2, 3].
- Failure types: The four types of storage subsystem failures: disk failure, physical interconnect failure, protocol failure, and performance failure.
- Shelf enclosure model: A particular shelf enclosure product. All shelf enclosure models studied in this work can host at most 14 disks.
- Storage subsystem failure: A failure that prevents the storage subsystem from providing storage service to the whole storage system node. Not all storage subsystem failures are experienced by customers, however, since some of the failures can be handled by resiliency mechanisms on top of storage subsystems (e.g., RAID) and other mechanisms at higher layers.
- Storage system class: The capability and usage of storage systems. There are four storage system classes studied in this work: nearline systems (mainly used as secondary storage) and low-end, midrange, and high-end systems (mainly used as primary storage).
- Other terms in the article are used as defined by SNIA [12].
definition and classification of storage subsystem failures

Figure 2 shows the steps and components that are involved in fulfilling an I/O request in a storage subsystem. As the figure shows, for the storage layer to fulfill an I/O request, the request is first processed and transformed by protocols and then delivered to disks through networks initiated by host adapters. Storage subsystem failures are the failures that break the I/O request path; they can be caused by hardware failures, software bugs, and protocol incompatibilities along the path. To better understand storage subsystem failures, we categorize them into four types along the I/O request path (a short classification sketch in code follows the list):

- Disk failure: This type of failure is triggered by failure mechanisms of disks. Imperfect media, media scratches caused by loose particles, rotational vibration, and many other factors internal to a disk can lead to this type of failure. Sometimes the storage layer proactively fails disks based on statistics collected by on-disk health monitoring mechanisms (e.g., a disk has experienced too many sector errors [1]). These incidents are also counted as disk failures.
- Physical interconnect failure: This type of failure is triggered by errors in the networks connecting disks and storage heads. It can be caused by host adapter failures, broken cables, shelf enclosure power outages, shelf backplane errors, and/or errors in shelf FC drivers. When physical interconnect failures happen, the affected disks appear to be missing from the system.
- Protocol failure: This type of failure is caused by incompatibility between protocols in disk drivers or shelf enclosures and storage heads, and by software bugs in the disk drivers. When this type of failure happens, disks are visible to the storage layer but I/O requests are not correctly responded to by disks.
- Performance failure: This type of failure happens when the storage layer detects that a disk cannot serve I/O requests in a timely manner while none of the previous three types of failures is detected. It is mainly caused by partial failures, such as unstable connectivity or disks heavily loaded with disk-level recovery (e.g., broken sector remapping).

The occurrences of these four types of failures are recorded in the AutoSupport logs collected by NetApp.
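To make this classification concrete, here is a minimal sketch of how incidents could be binned into the four types by walking the I/O request path. It is an illustration only, not NetApp's actual AutoSupport tooling; the event fields (disk_reported_fault, disk_visible, and so on) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class FailureType(Enum):
    DISK = "disk failure"
    PHYSICAL_INTERCONNECT = "physical interconnect failure"
    PROTOCOL = "protocol failure"
    PERFORMANCE = "performance failure"

@dataclass
class FailureEvent:
    # Hypothetical fields a log parser might extract for each incident.
    disk_reported_fault: bool     # the disk's own failure mechanisms fired
    disk_visible: bool            # disk is still enumerated by the storage layer
    io_responded_correctly: bool  # requests receive correct responses
    io_timely: bool               # requests are served in a timely manner

def classify(event: FailureEvent) -> FailureType:
    """Bin a failure along the I/O request path, mirroring the four
    failure types described above."""
    if event.disk_reported_fault:
        return FailureType.DISK
    if not event.disk_visible:
        # Missing disks point at cables, HBAs, shelf power, or backplanes.
        return FailureType.PHYSICAL_INTERCONNECT
    if not event.io_responded_correctly:
        # Disk is visible but requests are mishandled: a driver/protocol issue.
        return FailureType.PROTOCOL
    if not event.io_timely:
        # Partial failures: unstable connectivity, disk busy with recovery.
        return FailureType.PERFORMANCE
    raise ValueError("event does not describe a storage subsystem failure")
```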
Results

frequency of storage subsystem failures

Having categorized storage subsystem failures into four types based on their root causes, a natural question is: What is the relative frequency of each failure type? To answer this question, we study the NetApp AutoSupport logs collected from 39,000 storage systems. Figure 3 presents the breakdown of the average failure rate (AFR) for storage subsystems by failure type, for all four system classes studied in this work.

Finding (1): Physical interconnect failures make up the largest part (27%–68%) of storage subsystem failures; disk failures make up the second largest part (20%–55%). Protocol failures and performance failures both make up noticeable fractions.

Implications: Disk failures are not always a dominant factor in storage subsystem failures, and a reliability study for storage subsystems cannot focus only on disk failures. Resiliency mechanisms should target all failure types.

Figure 3: AFR for storage subsystems in four system classes and the breakdown based on failure types

As Figure 3 shows, across all system classes, disk failures do not always dominate storage subsystem failures. For example, in low-end storage systems, the AFR for storage subsystems is about 4.6%, whereas the AFR for disks is only 0.9%, about 20% of the overall AFR. Physical interconnect failures, however, account for a significant fraction of storage subsystem failures, ranging from 27% to 68%. The other two failure types, protocol failures and performance failures, contribute 5%–10% and 4%–8% of storage subsystem failures, respectively.

Finding (2): For disks, nearline storage systems show a higher AFR (1.9%) than low-end storage systems (0.9%). But for the whole storage subsystem, nearline storage systems show a lower AFR (3.4%) than low-end primary storage systems (4.6%).

Implications: Disk failure rate is not indicative of the storage subsystem failure rate.

Figure 3 also shows that nearline systems, which mostly use SATA disks, experience about 1.9% AFR for disks, whereas for low-end, mid-range, and high-end systems, which mostly use FC disks, the AFR for disks is under 0.9%. This observation is consistent with the common belief that enterprise (FC) disks are more reliable than nearline (SATA) disks. However, the AFR for storage subsystems does not follow the same trend. The storage subsystem AFR of nearline systems is about 3.4%, lower than that of low-end systems (4.6%). This indicates that other factors, such as shelf enclosure model and network configuration, strongly affect storage subsystem reliability. The impacts of these factors are examined next.
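Before moving on, here is a concrete reading of the AFR metric used above: a minimal sketch that computes an annualized failure rate per failure type from incident counts and total system-years of observation. The numbers and field names are illustrative, not values from the AutoSupport data set, and the study's exact normalization may differ.

```python
from collections import Counter

def afr_percent(failure_counts: Counter, system_years: float) -> dict:
    """Annualized failure rate (%) per failure type: incidents observed,
    normalized by the total system-years in the field."""
    return {ftype: 100.0 * n / system_years
            for ftype, n in failure_counts.items()}

# Illustrative only: 1,000 systems observed for one year each.
counts = Counter({"disk": 9, "physical_interconnect": 25,
                  "protocol": 4, "performance": 3})
print(afr_percent(counts, system_years=1000.0))
# {'disk': 0.9, 'physical_interconnect': 2.5, 'protocol': 0.4, 'performance': 0.3}
```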
impact of system parameters on storage subsystem failures

As we have seen, storage subsystems of different system classes show different AFRs. Although these storage subsystems are architecturally similar, the characteristics of their components, such as shelves, and their redundancy mechanisms, such as multipathing, differ. We now explore the impact of these factors on storage subsystem failures.

shelf enclosure model

Shelf enclosures contain power supplies, cooling devices, and prewired backplanes that carry power and I/O bus signals to the disks mounted in them. Different shelf enclosure models differ in design and have different mechanisms for providing these services; it is therefore interesting to see how the shelf enclosure model affects storage subsystem failures.

Figure 4: AFR for storage subsystems by shelf enclosure model, using the same disk model. The error bars show confidence intervals for physical interconnect failures.

Finding (3): The shelf enclosure model has a strong impact on storage subsystem failures.

Implications: To build a reliable storage subsystem, hardware components other than disks (e.g., the shelf enclosure) should also be carefully selected.

Figure 4 shows the AFR for storage subsystems when configured with different shelf enclosure models but the same disk models. As expected, the shelf enclosure model primarily impacts physical interconnect failures, with little impact on other failure types. To confirm this observation, we tested the statistical significance using a T-test [10]. As Figure 4 shows, the physical interconnect failure rates with different shelf enclosure models are quite different. A T-test shows that the difference is significant at the 99.9% confidence level, indicating that the hypothesis that physical interconnect failures are impacted by shelf enclosure models is very strongly supported by the data.
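For readers who want to reproduce this kind of significance check, a minimal version using SciPy might look like the following. The per-system failure-rate samples are invented for illustration; the study's actual rates are not reproduced here.

```python
from scipy import stats

# Hypothetical annual physical-interconnect failure rates (%), one value
# per system, grouped by shelf enclosure model. Illustrative values only.
model_a = [1.8, 2.1, 1.9, 2.3, 2.0, 1.7, 2.2]
model_b = [3.1, 2.8, 3.4, 3.0, 3.3, 2.9, 3.2]

# Two-sample T-test: do the two shelf models differ in mean failure rate?
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
# A p-value below 0.001 corresponds to significance at the 99.9%
# confidence level reported above.
```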
network redundancy mechanism

As we have seen, physical interconnect failures contribute a significant fraction (27%–68%) of storage subsystem failures. Since physical interconnect failures are mainly caused by network connectivity issues in storage subsystems, it is important to understand the impact of network redundancy mechanisms on storage subsystem failures.

For the mid-range and high-end systems studied in this work, FC drivers support a network redundancy mechanism commonly called active/passive multipathing. This mechanism connects shelves to two independent FC networks, redirecting I/O requests through the redundant FC network when one FC network experiences component failures (e.g., broken cables). To study the effect of this mechanism, we look at the data collected from mid-range and high-end storage systems and group the systems based on whether the network redundancy mechanism is turned on. Owing to space limitations, we show results only for the mid-range storage systems here. In our data set, about one-third of storage subsystems use the network redundancy mechanism, whereas the other two-thirds do not. We call these two groups dual-path systems and single-path systems, respectively.

Finding (4): Storage subsystems configured with network redundancy mechanisms experience much lower (30%–40% lower) AFR than other systems. The AFR for physical interconnects is reduced by 50%–60%.

Implications: Network redundancy mechanisms such as multipathing can greatly improve the reliability of storage subsystems.

Figure 5: AFR for storage subsystems broken down by the number of paths. The error bars show confidence intervals for physical interconnect failures.

Figure 5 shows the AFR for storage subsystems in mid-range systems. As expected, a secondary path reduces physical interconnect failures by 50%–60%, with little impact on other failure types. Since physical interconnect failures are just a subset of all storage subsystem failures, the AFR for storage subsystems is reduced by 30%–40%. This indicates that multipathing is an exceptionally good redundancy mechanism that delivers the promised reduction in failure rates. When we applied a T-test to these results, we found that the observation is significant at the 99.9% confidence level, indicating that the data strongly support the hypothesis that physical interconnect failures are reduced by a multipathing configuration.

However, the observation also tells us that there is still further potential in network redundancy mechanism designs. For example, given that the probability for one network to fail is about 2%, the idealized probability for two networks to both fail should be a few orders of magnitude lower (about 0.04%). But the AFR we observe is far from this ideal number.
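The back-of-the-envelope estimate in the previous paragraph can be written out directly. Assuming, as above, that each network fails independently with a probability of about 2% per year, the idealized probability that both paths fail is the product of the two:

```python
p_single = 0.02               # assumed annual failure probability of one network
p_both_ideal = p_single ** 2  # under independence: 0.0004, i.e., about 0.04%
print(f"{p_both_ideal:.2%}")  # prints "0.04%"
```

That the observed dual-path AFR is only 30%–40% lower, rather than orders of magnitude lower, is what suggests the two paths do not fail independently.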
correlations between failures

In this subsection we study the statistical properties of storage subsystem failures, both from a shelf perspective and from a RAID-group perspective. Our analysis of the correlation between failures is composed of two steps:

1. Derive the theoretical failure probability model based on the assumption that failures are independent.
2. Evaluate the assumption by comparing the theoretical probability against empirical results.

Next, we describe the statistical method we use for deriving the theoretical failure probability model.

Statistical Method

We denote the probability for a shelf enclosure (including all mounted disks) to experience one failure during time T as P(1), and the probability for it to experience two failures during T as P(2). If failures are independent, the relationship between P(1) and P(2) is

P(2) = (1/2) P(1)^2    (1)

For a complete proof, refer to our conference paper [8]. Next, we compare this theoretically derived model against the empirical results collected from NetApp AutoSupport logs; a small sketch of the comparison follows.
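The comparison in the next subsection can be sketched in a few lines, assuming per-shelf failure counts over the window T have already been extracted from the logs. The shelf counts below are invented for illustration.

```python
from collections import Counter

def empirical_p1_p2(failures_per_shelf: list) -> tuple:
    """Fraction of shelves with exactly one and exactly two failures
    during the observation window T."""
    counts = Counter(failures_per_shelf)
    n = len(failures_per_shelf)
    return counts[1] / n, counts[2] / n

# Illustrative only: number of failures each shelf saw in one year.
shelves = [0] * 960 + [1] * 30 + [2] * 10
p1, p2_empirical = empirical_p1_p2(shelves)
p2_theoretical = 0.5 * p1 ** 2  # equation 1: independence assumption
print(p2_empirical, p2_theoretical)  # 0.01 vs. 0.00045: empirical is far higher
```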
Correlation Results

To evaluate the theoretical relationship between P(1) and P(2) shown in equation 1, we first calculate empirical P(1) and empirical P(2) from the NetApp AutoSupport logs. Empirical P(1) is the percentage of shelves (RAID groups) that have experienced exactly one failure during time T (where T is set to one year), and empirical P(2) is the percentage that have experienced exactly two failures during time T. Only storage systems that have been in the field for one year or more are considered.

Finding (5): For each failure type, storage subsystem failures are not independent. After one failure, the probability of additional failures (of the same type) is higher.

Implications: The probability of storage subsystem failures depends on factors shared by all disks in the same shelf enclosure (or RAID group).

Figure 6: Comparison of the theoretical model [with P(2) calculated from equation 1] against empirical results; the error bars show confidence intervals.

Figure 6a shows the comparison between empirical P(2) and theoretical P(2), which is calculated from empirical P(1). As we can see in the figure, empirical P(2) is higher than theoretical P(2). More specifically, for disk failures, the observed empirical P(2) is higher than theoretical P(2) by a factor of 6. For the other types of storage subsystem failures, the empirical probability is also substantially higher than its theoretical counterpart. Furthermore, T-tests confirm that the theoretical P(2) and the empirical P(2) are statistically different at the 99.5% confidence level.

These statistics provide a strong indication that when a shelf experiences a storage subsystem failure, the probability that it will have another storage subsystem failure increases. In other words, storage subsystem failures from the same shelf are not independent.

Figure 6b shows an even stronger trend for failures from the same RAID groups. The same conclusion can therefore be drawn for storage subsystem failures from the same RAID groups.

Conclusion

In this article we have presented a study of NetApp storage subsystem failures, examining the contribution of different failure types, the effect of several factors on failures, and the statistical properties of failures. Our study is based on AutoSupport logs collected from 39,000 NetApp storage systems, which contain about 1,800,000 disks mounted in about 155,000 shelf enclosures. The studied data cover a period of 44 months.

The findings of our study provide guidelines for designing more reliable storage systems and developing better resiliency mechanisms. Although disks are the primary components of storage subsystems, and disk failures contribute 19%–56% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages (27%–68% and 5%–10%, respectively) of storage subsystem failures. The results clearly show that the other storage subsystem components cannot be ignored when designing a reliable storage system. One way to improve storage system reliability is to select more reliable components. As the data suggest, storage system reliability is highly dependent on the shelf enclosure model. Another way to improve reliability is to employ redundancy mechanisms to tolerate component failures. One such mechanism studied in this work is multipathing, which can reduce the AFR for storage systems by 30%–40% when the number of paths is increased from one to two. We also found that storage subsystem failure as a whole, and the individual storage subsystem failure types, exhibit strong self-correlations. This finding motivates revisiting resiliency mechanisms, such as RAID, that assume independent failures.

A preliminary version of this article was published at FAST '08 [8]. Because of limited space, neither this article nor our FAST paper includes all our results. Readers who are interested in a complete set of results should refer to our NetApp technical report [13].

acknowledgments

We wish to thank Rajesh Sundaram and Sandeep Shah for providing us with insights on storage failures. We are grateful to Frederick Ng, George Kong, Larry Lancaster, and Aziz Htite for offering help in understanding and gathering NetApp AutoSupport logs. We appreciate useful comments from members of the Advanced Development Group, including David Ford, Jiri Schindler, Dan Ellard, Keith Smith, James Lentini, Steve Byan, Sai Susarla, and Shankar Pasupathy. We would also like to thank Lakshmi Bairavasundaram for his useful comments. Finally, we appreciate our shepherd, Andrea Arpaci-Dusseau, for her invaluable feedback and precious time, and the anonymous reviewers for their insightful comments on our conference paper [8]. This research was funded by Network Appliance under the Intelligent Log Mining project at CS UIUC. The work of the first two authors was conducted in part as summer interns at Network Appliance.
references

[1] Bruce Allen, "Monitoring Hard Disks with SMART," Linux Journal, 2004(117):9 (2004).

[2] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler, "An Analysis of Latent Sector Errors in Disk Drives," SIGMETRICS Perform. Eval. Rev. 35(1) (2007).

[3] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, "An Analysis of Data Corruption in the Storage Stack," in FAST '08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.

[4] Mario Blaum, Jim Brady, Jehoshua Bruck, and Jai Menon, "EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures," IEEE Transactions on Computers, 44 (1995).

[5] Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar, "Row-Diagonal Parity for Double Disk Failure Correction," in FAST '04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (2004).

[6] EMC Symmetrix DMX-4 Specification Sheet (July 2007).

[7] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, New York (2003).

[8] Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky, "Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics," in FAST '08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.

[9] FAS6000 Series Technical Specifications.

[10] A.C. Rosander, Elementary Principles of Statistics (Princeton, NJ: Van Nostrand, 1951).

[11] Bianca Schroeder and Garth A. Gibson, "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" in FAST '07: Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007).

[12] Storage Networking Industry Association Dictionary.

[13] Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky, "Don't Blame Disks for Every Storage Subsystem Failure: A Comprehensive Study of Storage Subsystem Failure Characteristics," NetApp technical report (April 2008).
