Mean time to meaningless: MTTDL, Markov models, and storage system reliability
Kevin M. Greenan (ParaScale, Inc.), James S. Plank (University of Tennessee), Jay J. Wylie (HP Labs)

Abstract

Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.

1 Introduction

"Essentially, all models are wrong, but some are useful." (George E. P. Box)

Since Gibson's original work on RAID [3], the standard metric of storage system reliability has been the Mean Time To Data Loss (MTTDL). MTTDL is an estimate of the expected time that it would take a given storage system to exhibit enough failures such that at least one block of data cannot be retrieved or reconstructed. One of the reasons that MTTDL is so appealing as a metric is that it is easy to construct a Markov model that yields an analytic closed-form equation for MTTDL. Such formulae have been ubiquitous in research and practice due to the ease of estimating reliability by plugging a few numbers into an expression. Given simplistic assumptions about the physical system, such as independent exponential probability distributions for failure and repair, a Markov model can be easily constructed, resulting in a nice, closed-form expression.

There are three major problems with using the MTTDL as a measure of storage system reliability. First, the models on which the calculation depends rely on an extremely simplistic view of the storage system. Second, the metric does not reflect the real world, but is often interpreted as a real-world estimate. For example, the Pergamum archival storage system estimates an MTTDL of 1400 years [13]. These estimates are based on the assumptions of the underlying Markov models and are typically well beyond the life of any storage system. Finally, MTTDL values tend to be incomparable, because each is a function of system scale and omits the (expected) magnitude of data loss.

In this position paper, we argue that MTTDL is a bad reliability metric and that Markov models, the traditional means of determining MTTDL, do a poor job of modeling modern storage system reliability. We then outline properties we believe a good storage system reliability metric should have, and propose a new metric with these properties: NOrmalized Magnitude of Data Loss (NOMDL). Finally, we provide example reliability results using various proposed metrics, including NOMDL, for a simple storage system.

2 The Canonical MTTDL Calculation

In the storage reliability community, the MTTDL is calculated using continuous-time Markov chains (a.k.a. Markov models). The canonical Markov model for storage systems is based on RAID4, which tolerates exactly one device failure. Figure 1 shows this model. There are a total of three states. State 0 is the state with all n devices operational. State 1 is the state with one failed device. State 2 is the state with two failed devices, i.e., the data loss state. The model in Figure 1 has two rate parameters: λ, a failure rate, and μ, a repair rate.
It is assumed that all devices fail at the same rate and repair at the same rate. At t = 0, the system starts pristinely in state 0 and remains in state 0 for an average of (nλ)^-1 hours (the n device failures are exponentially distributed with failure rate λ), when it transitions to state 1. The system then stays in state 1 for, on average, ((n-1)λ + μ)^-1 hours. The system transitions out of state 1 to state 2, the data loss state, with probability (n-1)λ / ((n-1)λ + μ). Otherwise, the system transitions back to state 0, where the system is fully operational and devoid of failures.

Figure 1: Canonical RAID4/RAID5 Markov model.

The canonical Markov model can be solved analytically for MTTDL, and simplified:

    MTTDL = (μ + (2n-1)λ) / (n(n-1)λ^2) ≈ μ / (n(n-1)λ^2).

Such an analytic result is appealing because reassuringly large MTTDL values follow from disk lifetimes measured in hundreds of thousands of hours. Also, the relationship between expected disk lifetime (1/λ), expected repair time (1/μ), and MTTDL for RAID4 systems is apparent.
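For concreteness, the closed-form expression can be evaluated and cross-checked against a direct numerical solution of the three-state chain. The sketch below is not from the paper; the values of n, λ, and μ are arbitrary placeholders chosen only to show the shape of the calculation.

```python
# Minimal sketch (assumed example values, not from the paper): evaluate the
# closed-form MTTDL for the canonical three-state RAID4 chain and cross-check
# it against a direct numerical solution of the absorbing Markov chain.
import numpy as np

n   = 8          # number of disks (assumed)
lam = 1 / 1.0e6  # per-disk failure rate in 1/hours (assumed MTTF of 10^6 h)
mu  = 1 / 24.0   # repair rate in 1/hours (assumed one-day rebuild)

# Closed form from the text: MTTDL = (mu + (2n - 1) lambda) / (n (n - 1) lambda^2)
mttdl_closed = (mu + (2 * n - 1) * lam) / (n * (n - 1) * lam ** 2)

# Numerical check: restrict the generator to the transient states {0, 1}.
# The expected times to absorption t satisfy Q t = -1.
Q = np.array([[-n * lam,               n * lam],
              [      mu, -((n - 1) * lam + mu)]])
t = np.linalg.solve(Q, [-1.0, -1.0])

print(f"closed form: {mttdl_closed:.3e} hours")
print(f"numerical  : {t[0]:.3e} hours")   # matches the closed form
```

With these placeholder numbers both approaches agree and yield a value on the order of 10^8 hours, i.e., tens of thousands of years, exactly the kind of reassuringly large figure described above.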
3 The MTTDL: Meaningless Values

Siewiorek and Swarz [12] define reliability as follows: "The reliability of a system as a function of time, R(t), is the conditional probability that the system has survived the interval [0, t], given that the system was operational at time t = 0." Implicit in the statement that the system has survived is that the system is performing its intended function under some operating conditions. For a storage system, this means that the system does not lose data during the expected system lifetime. An important aspect of the definition of reliability that is lost in many discussions about storage system reliability is the notion of mission lifetime.

MTTDL does not measure reliability directly; it is an expectation based on reliability: MTTDL = ∫_0^∞ R(t) dt. MTTDL literally measures the expected time to failure over an infinite interval. This may make the MTTDL useful for quick, relative comparisons, but the absolute measurements are essentially meaningless. For example, an MTTDL measurement of 1400 years tells us very little about the probability of failure during a realistic system mission time. A system designer is most likely interested in the probability and extent of data loss, every year, for the first 10 years of a system. The aggregate nature of the MTTDL provides little useful information, and no insight into these calculations.

4 The MTTDL: Unrealistic Models

While Markov models are appealingly simple, the assumptions that make them convenient also render them inadequate for multi-disk systems. In this section, we highlight three ways in which the assumptions fail to match modern systems. A more detailed discussion that includes additional concerns may be found in Chapter 4 of Greenan's PhD thesis [4]. For illustration, in Figure 2 we present a Markov model based on one from [5] that models an n-disk system composed of k disks of data and m disks of parity, in the presence of disk failures and latent sector errors.

Figure 2: Illustrative multi-disk fault tolerant Markov model with sector errors.

4.1 Exponential Distributions

Implicit in the use of Markov models for storage system reliability analysis is the assumption that failure and repair rates follow an exponential distribution and are constant. The exponential distribution is a poor match to observed disk failure rates, latent sector error rates, and disk repairs. Empirical observations have shown that disks do not fail according to independent exponential distributions [2, 6, 10], and that Weibull distributions are more successful in modeling observed disk failure behavior. Latent sector failures exhibit significant correlation, both temporally and spatially, within a device [1, 11]. Beyond this, sector failures are highly usage dependent and difficult to quantify [2, 9]. The most recent synthesis of empirical data by Schroeder et al. suggests that Pareto distributions best capture the burstiness of latent sector errors, as well as their spatial and temporal correlations [11].

Disk repair activities such as rebuild and scrubbing tend to require some fixed minimal amount of time to complete. Moreover, in well-engineered systems, there tends to be some fixed upper bound on the time the repair activity may take. Events with lower and upper time bounds are poorly modeled by exponential distributions. Again, Weibull distributions capture reality better [2].
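To make the mismatch concrete, a minimal sketch (not from the paper; the mean lifetime and Weibull shape are assumed for illustration only) compares an exponential lifetime model with a Weibull model of the same mean. Even with identical means, the Weibull hazard rate changes with age, whereas the exponential hazard is constant.

```python
# Illustrative comparison (assumed parameters, not fitted to field data) of an
# exponential disk-lifetime model and a Weibull model with the same mean.
import math

mean_life = 1.0e6                                   # mean lifetime in hours (assumed)
shape     = 1.2                                     # Weibull shape > 1: wear-out (assumed)
scale     = mean_life / math.gamma(1 + 1 / shape)   # choose scale to match the mean

def exp_cdf(t):        return 1 - math.exp(-t / mean_life)
def weibull_cdf(t):    return 1 - math.exp(-(t / scale) ** shape)
def weibull_hazard(t): return (shape / scale) * (t / scale) ** (shape - 1)

HOURS_PER_YEAR = 8766.0
for years in (1, 3, 5):
    t = years * HOURS_PER_YEAR
    print(f"{years}y: P(fail) exp={exp_cdf(t):.4%} weibull={weibull_cdf(t):.4%} | "
          f"hazard weibull={weibull_hazard(t):.2e}/h vs constant {1 / mean_life:.2e}/h")
```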
4.2 Memorylessness, Failure & Repair

Exponential distributions and Markov models are memoryless. When the system modeled in Figure 2 transitions into a new state, it is as if all the components in the system are reset: available components' ages are reset to 0 (i.e., brand new), and any repair of failed components is forgotten. Both cases are problematic.

When the system transitions to a new state, it is as if all available disks are refreshed to a good-as-new state. In particular, the transition from state 1 to state 0 models a system where the repair of one disk converts all disks into their pristine states. In reality, only the recently repaired component is brand new, while all the others have a nonzero age.

Now, consider the system under repair in state i such that 1 ≤ i < m. If a failure occurs, moving the system to state i+1, any previous rebuilding is assumed to be discarded, and the repair rate μ governs the transition back to state i. It is as if the most recent failure dictates the repair transition. In reality, it is the earliest failure, whose rebuild is closest to completion, that governs repair transitions.

These two issues highlight the difficulty with the memorylessness assumption: each transition forgets about progress that has been made in a previous state, so neither component wear-out nor rebuild progress is modeled. Correctly incorporating these time-dependent properties into such a model is quite difficult. The difficulty lies in the distinction between absolute time and relative time. Absolute time is the time since the system was generated, while relative time applies to the individual device lifetime and repair clocks. Analytic models operate in absolute time; therefore, there is no reasonable way to determine the values of each individual clock. Simulation methods can track relative time and thus can effectively model the reliability of a storage system with time-dependent properties.
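The memorylessness objection can be checked numerically: for an exponential lifetime, the probability of surviving one more year is the same at any age, so resetting ages at every transition costs nothing; for a Weibull lifetime it is not. The parameters below are illustrative placeholders.

```python
# Numerical check of memorylessness (assumed example parameters): the chance of
# surviving one more year given the current age is constant for an exponential
# lifetime, but depends on age for a Weibull lifetime.  A Markov model, which
# effectively resets every device's age at each transition, is exact only in
# the exponential case.
import math

lam, shape, scale = 1 / 1.0e6, 1.2, 1.0e6   # assumed placeholder parameters

def surv_exp(t):     return math.exp(-lam * t)
def surv_weibull(t): return math.exp(-(t / scale) ** shape)

extra = 8766.0                                # "survive one more year"
for age in (0.0, 3 * 8766.0, 5 * 8766.0):
    cond_exp = surv_exp(age + extra) / surv_exp(age)          # independent of age
    cond_wbl = surv_weibull(age + extra) / surv_weibull(age)  # varies with age
    print(f"age {age / 8766:.0f}y: exponential {cond_exp:.5f}, weibull {cond_wbl:.5f}")
```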
4.3 Memorylessness & Sector Errors

In an m-disk fault-tolerant system, the storage system enters critical mode upon the m-th disk failure. The transition in the Markov model in Figure 2 from the m-1 state to the m+1 state is intended to model data loss due to sector errors in critical mode. In this model, any sector errors or bit errors encountered during rebuild in critical mode lead to data loss. Unfortunately, such a model overestimates the system unreliability. A sector failure only leads to data loss if it occurs in the portion of the failed disk that is critically exposed. For example, in a two-disk fault-tolerant system, if the first disk to fail is 90% rebuilt when a second disk fails, only 10% of the disk is critically exposed. Figure 3 illustrates this point in general. This difficulty with Markov models again follows from the memorylessness assumption.

Figure 3: Critical region of the first failed disk susceptible to data loss due to latent sector errors.
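A back-of-the-envelope version of this argument, with assumed numbers (the 2.6% full-disk read figure quoted later in Section 5.3 is of this order), shows how large the overestimate can be:

```python
# Back-of-the-envelope sketch of the critical-region argument (all values
# assumed for illustration).  p_lse_full is the chance of hitting at least one
# latent sector error when reading an entire disk; rebuild_done is how far the
# first rebuild has progressed when the next failure arrives.
p_lse_full   = 0.026      # assumed; of the same order as the 2.6% figure in the text
rebuild_done = 0.90       # first failed disk is 90% rebuilt

exposed = 1.0 - rebuild_done

# Naive Markov-style view: any latent sector error during the critical rebuild
# causes data loss.
p_loss_naive = p_lse_full

# Critical-region view: only errors that fall in the still-exposed 10% matter
# (approximating latent errors as uniformly spread across the disk).
p_loss_critical = p_lse_full * exposed

print(f"naive model    : {p_loss_naive:.3%}")
print(f"critical region: {p_loss_critical:.3%}")   # roughly 10x smaller here
```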
5 Metrics

Many alternatives have been proposed to replace MTTDL as a reliability metric (e.g., [2, 7, 8]). While these metrics have some advantages, none have all the qualities that we believe such a metric must have. In our opinion, a storage system reliability metric must have the following properties:

Calculable. There must exist a reasonable method to calculate the metric. For example, a software package based on well-understood simulation or numerical calculation principles can be used.

Meaningful. The metric must relate directly to the reliability of a deployed storage system.

Understandable. The metric, and its units, must be understandable to a broad audience that includes developers, marketers, industry analysts, and researchers.

Comparable. The metric must allow us to compare systems with different scales, architectures, and underlying storage technologies (e.g., solid-state disks).

5.1 NOMDL: A Better Metric?

We start by creating a metric called the Magnitude of Data Loss (MDL). Let MDL_t be the expected amount of data lost (in bytes) in a target system within mission time t. There are two advantages to MDL_t. First, like system reliability, the metric deals with arbitrary time periods. This means that a system architect can use MDL_t to estimate the expected number of bytes lost in the first year of deployment, or in the first ten years of deployment. Second, the units are understandable: bytes and years.

Unfortunately, MDL_t is not a metric that compares well across systems. Consider an 8-disk RAID4 array with intra-disk parity [11] and a 24-disk RAID6 array, both composed of the exact same 1 TB drives. The RAID6 array will result in a higher MDL_t, but has more than 3 times the usable capacity of the RAID4 array. The MDL can be made comparable by normalizing to the system's usable capacity; doing so yields the NOrmalized Magnitude of Data Loss (NOMDL) metric. NOMDL_t measures the expected amount of data lost per usable terabyte within mission time t. For example, the NOMDL_t of a system may be expressed as bytes lost per usable terabyte in the first 5 years of deployment. Since NOMDL_t is a normalized version of MDL_t, both metrics can be output from the same base calculation.
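The normalization itself is simple arithmetic. The sketch below uses the two example arrays from this section; the usable capacities assume one parity disk for RAID4 and two for RAID6 (ignoring intra-disk parity overhead), and the expected-loss figures are invented placeholders that exist only to show the division.

```python
# Toy normalization example for the two arrays discussed above.  Usable
# capacities assume one parity disk for RAID4 and two for RAID6 (intra-disk
# parity overhead ignored); the expected-loss figures are invented placeholders.
systems = {
    "8-disk RAID4 (1 TB drives)":  {"usable_tb": 7.0,  "mdl_t_bytes": 4.0e3},
    "24-disk RAID6 (1 TB drives)": {"usable_tb": 22.0, "mdl_t_bytes": 9.0e3},
}

for name, s in systems.items():
    nomdl_t = s["mdl_t_bytes"] / s["usable_tb"]     # bytes lost per usable TB
    print(f"{name}: MDL_t = {s['mdl_t_bytes']:.1e} B, "
          f"NOMDL_t = {nomdl_t:.1e} B per usable TB")
```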
5.2 Calculating NOMDL_t

NOMDL_t is calculable. Markov models can measure the probability of being in a specific state within a mission time. With care, the probabilities can be calculated for all paths in a Markov model and used to derive the number of expected bytes lost. Unfortunately, as we have described above, we do not believe that Markov models accurately capture the behavior of contemporary storage systems. Other modeling paradigms, such as Petri nets and their variants, might be used to calculate NOMDL_t while addressing the deficiencies of Markov models.

Our recommendation is to use Monte Carlo simulation to calculate NOMDL_t. At a high level, such a simulation can be accomplished as follows. Initially, all devices are assigned failure and repair characteristics. Then, device failures and their repair times are drawn from an appropriate statistical distribution (e.g., Exponential, Weibull, Pareto) for each device. Devices are queued and processed by failure time, and the simulation stops at a predefined mission time. Once a device is repaired, another failure and repair time is drawn for that device. Each time a failure occurs, the simulator analyzes the system state to determine whether data loss has occurred. Details of the techniques and overheads associated with simulation are discussed in Greenan's PhD thesis [4].

If the time of data loss is F and the mission time is t, then the simulator implements the indicator function I(F < t) ∈ {0, 1}, where 0 means no data loss and 1 means data loss. That is, the system either had a data loss event within the mission time or it did not. Many iterations of the simulator are required to get statistically meaningful results. The standard method of computing system reliability via simulation is to run N iterations (N is typically chosen experimentally) of the simulator and make the following calculation:

    R(t) = 1 - (1/N) Σ_{i=1}^{N} I(F_i < t).

Since I(F_i < t) evaluates to 1 when there is data loss in iteration i and 0 otherwise, this directly calculates the probability of no data loss in [0, t]. Given the magnitude of data loss upon a data loss event, C_i, this standard calculation can produce MDL_t:

    MDL_t = (1/N) Σ_{i=1}^{N} I(F_i < t) C_i.

NOMDL_t is MDL_t normalized to the usable capacity of the system, D: NOMDL_t = MDL_t / D. Using simulation thus produces {R(t), MDL_t, NOMDL_t}, which in our opinion makes NOMDL_t a calculable, meaningful, understandable, and comparable metric.
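The estimators above map directly onto a small Monte Carlo loop. The following sketch is not the HFR Simulator and makes strong simplifying assumptions (exponential failure and repair draws, whole-disk loss granularity, an array declared lost once more than m disks are concurrently failed); it is only meant to show how R(t), MDL_t, and NOMDL_t fall out of the same set of runs.

```python
# Minimal Monte Carlo sketch of the estimators above.  This is NOT the HFR
# Simulator: failures and repairs are exponential, loss magnitude is counted at
# whole-disk granularity, and an m-fault-tolerant array is declared lost once
# more than m disks are concurrently failed.  All parameter values are assumed.
import heapq
import random

def simulate_once(n_disks, m_tolerated, mttf_h, mttr_h, mission_h, disk_bytes, rng):
    """Return bytes lost in one simulated mission (0.0 if no data-loss event)."""
    failed = set()
    events = []                                        # (time, kind, disk)
    for d in range(n_disks):                           # schedule first failures
        heapq.heappush(events, (rng.expovariate(1 / mttf_h), "fail", d))
    while events:
        t, kind, d = heapq.heappop(events)
        if t > mission_h:
            return 0.0                                 # survived the mission
        if kind == "fail":
            failed.add(d)
            if len(failed) > m_tolerated:              # data-loss event
                return len(failed) * disk_bytes        # crude loss magnitude
            heapq.heappush(events, (t + rng.expovariate(1 / mttr_h), "repair", d))
        else:                                          # repair completes
            failed.discard(d)
            heapq.heappush(events, (t + rng.expovariate(1 / mttf_h), "fail", d))
    return 0.0

def estimate(n_iters=20000, n_disks=8, m_tolerated=1, mttf_h=1.0e6, mttr_h=24.0,
             mission_h=10 * 8766.0, disk_bytes=1.0e12, usable_tb=7.0, seed=1):
    rng = random.Random(seed)
    losses = [simulate_once(n_disks, m_tolerated, mttf_h, mttr_h,
                            mission_h, disk_bytes, rng) for _ in range(n_iters)]
    r_t     = 1 - sum(1 for c in losses if c > 0) / n_iters   # R(t)
    mdl_t   = sum(losses) / n_iters                           # expected bytes lost
    nomdl_t = mdl_t / usable_tb                               # bytes per usable TB
    return r_t, mdl_t, nomdl_t

if __name__ == "__main__":
    r_t, mdl_t, nomdl_t = estimate()
    print(f"R(t) = {r_t:.5f}, MDL_t = {mdl_t:.3e} B, NOMDL_t = {nomdl_t:.3e} B/usable TB")
```

A simulator along the lines the text describes would instead draw lifetimes and repair times from Weibull or Pareto distributions (e.g., random.weibullvariate), track per-device rebuild progress, and model latent sector errors explicitly.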
5.3 Comparison of Metrics

Table 1 provides a high-level comparison of MTTDL and other recently proposed storage reliability metrics. We compare the metrics qualitatively in terms of the aforementioned properties of a good metric. We also perform a sample calculation for each metric for a simple storage system: an 8-disk RAID4 array of terabyte hard drives with periodic scrubbing, over a 10-year mission time. The failure, repair, and scrub characteristics are taken from Elerath and Pecht [2]. All calculations were performed using the HFRS reliability simulation suite (see Section 8). Here we compare MTTDL, Bit Half-Life (BHL) [8], Double-Disk Failures per 1000 Reliability Groups (DDF_pkrg) [2], Data Loss events per Petabyte-Year (DALOPY) [5], and NOMDL_t. MTTDL and DALOPY are calculated via Markov models. BHL is calculated by finding the time at which a bit has a 0.50 probability of failure; this is difficult to calculate via simulation and can be estimated using a Markov model. For BHL, this time is calculated for the entire system instead of a single bit. DDF_pkrg and NOMDL_t are computed using Monte Carlo simulation.

Both the MTTDL and BHL calculations result in reliability metrics that are essentially meaningless. Even in a RAID4 system with the threat of sector errors (a 2.6% chance when reading an entire disk), both metrics produce numbers that are well beyond the lifetime of most existing systems. In addition, both metrics produce results that are not comparable between systems that differ in terms of technology and scale.

DDF_pkrg and DALOPY are interesting alternatives to the original MTTDL calculation. DDF_pkrg is not sensitive to technological change, but is bound architecturally to a specific RAID level or erasure-coding scheme (double disk failure is specific to RAID4 or RAID5). DALOPY has most of the properties of a good metric, but is not comparable. In particular, it is not comparable across systems based on different technologies or architectures. While DALOPY normalizes the expected number of data loss events to the system size, it does not provide the magnitude of data loss. Without magnitude it is hard to compare DALOPY between systems; a data loss event gives no information on what or how much data was lost. In addition, the units of the metric are hard to reason about.

NOMDL_t is not sensitive to technological change, architecture, or scale. The metric is normalized to system scale, is comparable between architectures, and directly measures the expected magnitude of data loss. As shown in Table 1, the units of NOMDL_t (bytes lost per usable TB) are easy to understand. Beyond this, the subscript, t = 10y, clearly indicates the mission lifetime and so helps ensure that only numbers based on the same mission lifetime are actually compared.
Table 1: Qualitative comparison of different storage reliability metrics. Each metric is rated on the Meaningful, Understandable, Calculable, and Comparable properties, alongside its sample result for the 8-disk RAID4 array over the 10-year mission (MTTDL and BHL in years; DDF_pkrg: 183 DDFs; DALOPY: 3.32 DL per (PB*Yr); NOMDL_10y in bytes lost per usable TB).

6 Conclusions

We have argued that MTTDL is an essentially meaningless reliability metric for storage systems and that Markov models, the normal method of calculating MTTDL, are flawed. We are not the first to make this argument (see [2] and [8]), but we hope to be the last. We believe NOMDL_t has the desirable features of a good reliability metric, namely that it is calculable, meaningful, understandable, and comparable, and we exhort researchers to exploit it for their future reliability measurements. Currently, we believe that Monte Carlo simulation is the best way to calculate NOMDL_t.

7 Acknowledgments

This material is based upon work supported by the National Science Foundation under CNS grants.

8 HFRS Availability

The High-Fidelity Reliability (HFR) Simulator is a command-line tool written in Python and is available at kmgreen/.

References

[1] Bairavasundaram, L. N., Goodson, G. R., Schroeder, B., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. An analysis of data corruption in the storage stack. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08) (San Jose, California, February 2008).

[2] Elerath, J., and Pecht, M. Enhanced reliability modeling of RAID storage systems. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2007) (June 2007).

[3] Gibson, G. A. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, University of California, Berkeley. Technical Report CSD.

[4] Greenan, K. M. Reliability and Power-Efficiency in Erasure-Coded Storage Systems. PhD thesis, University of California, Santa Cruz. Technical Report UCSC-SSRC.

[5] Hafner, J. L., and Rao, K. Notes on reliability models for non-MDS erasure codes. Tech. Rep. RJ 10391, IBM, October.

[6] Pinheiro, E., Weber, W.-D., and Barroso, L. A. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07) (2007), USENIX Association.

[7] Rao, K. K., Hafner, J. L., and Golding, R. A. Reliability for networked storage nodes. In International Conference on Dependable Systems and Networks (DSN 2006) (Philadelphia, 2006), IEEE.

[8] Rosenthal, D. S. H. Bit preservation: A solved problem? In Proceedings of the Fifth International Conference on Preservation of Digital Objects (iPRES 2008) (London, UK, September 2008).

[9] Rozier, E., Belluomini, W., Deenadhayalan, V., Hafner, J., Rao, K., and Zhou, P. Evaluating the impact of undetected disk errors in RAID systems. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009) (July 2009).

[10] Schroeder, B., and Gibson, G. A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07) (2007), USENIX Association.

[11] Schroeder, B., Gill, P., and Damouras, S. Understanding latent sector errors and how to protect against them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST '10) (San Jose, California, February 2010).
[12] Siewiorek, D. P., and Swarz, R. S. Reliable Computer Systems: Design and Evaluation, 3rd ed. A. K. Peters, Ltd., Natick, MA, USA.

[13] Storer, M. W., Greenan, K., Miller, E. L., and Voruganti, K. Pergamum: Replacing tape with energy efficient, reliable, disk-based archival storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08) (February 2008).