Nutanix Tech Note. Failure Analysis. 2013 All Rights Reserved, Nutanix Corporation




A Failure Analysis of Storage System Architectures: Nutanix Scale-out vs. Legacy Designs

Types of data to be protected

Any examination of storage system failure scenarios must begin with a baseline understanding of the types of stored data. Every storage system maintains three distinct types of data:

1. Configuration data
2. Metadata
3. User data

Protecting and ensuring availability for all three types of data is critical to maintaining the integrity and availability of the entire storage system. Nutanix Virtual Computing Platforms incorporate a distributed, scale-out file system to provide reliable storage for all three data types.

Configuration data

Storage systems maintain essential configuration data about physical components, including IP addresses, capacities and replication rules pertaining to hosts, disks and storage constructs (e.g., RAID groups). Many legacy architectures keep only two copies of configuration data, based upon the common dual-controller model. If one of these copies is lost, the system is put into an error-prone state. If the other controller then fails, all user data is lost with no chance of recovery.

In contrast, Nutanix's distributed architecture stores configuration data on a minimum of three nodes, giving the system the ability to survive the failure of two nodes, twice what most other systems can tolerate. In other words, even if two nodes are lost, the configuration data remains available. Additionally, because configuration data is maintained on a minimum of three nodes in a cluster, increasing the size of the cluster (i.e., adding more nodes) does not increase the risk of data loss. In a large cluster of 32 nodes, for example, the probability of two of the three configuration nodes failing is much lower than the probability of any two nodes failing. Also, given the n-controller model of the Nutanix Virtual Computing Platform, additional copies of this data can be stored if desired.
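To make the probability argument concrete, the following is a minimal sketch (a hypothetical model, not Nutanix code) that counts what fraction of possible two-node failure combinations in a 32-node cluster would hit two of the three configuration replicas. It assumes failures are uniform and independent; the function name is illustrative.

```python
from math import comb

# Hypothetical model: given that f nodes fail simultaneously in an n-node
# cluster, what fraction of the possible failure combinations includes at
# least r of the k nodes that hold configuration data?
def p_config_copies_lost(n, k, f, r):
    total = comb(n, f)
    bad = sum(comb(k, i) * comb(n - k, f - i) for i in range(r, min(k, f) + 1))
    return bad / total

# 32-node cluster, 3 configuration replicas, 2 simultaneous node failures:
# only ~0.6% of all two-node failure combinations hit two of the three
# configuration copies, and even then one copy still survives.
print(p_config_copies_lost(n=32, k=3, f=2, r=2))  # 3/496 ~= 0.006
```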

Metadata

Storage systems must also maintain metadata, which describes characteristic attributes of the actual stored user data. Legacy storage systems typically store two copies of metadata, again based upon the dual-controller construct. In the case of a single failure, the system is in a weakened state. When both copies are lost, the entire user data set they describe is completely lost.

In comparison, Nutanix maintains a minimum of three copies of all metadata using a distributed key-value store that is kept strictly consistent with the Paxos algorithm. In the case of a node failure, the keys can be redistributed to maintain a minimum of three copies. The three metadata copies are mapped to three consecutive nodes in a ring. In the unlikely event that two nodes fail, as long as the failed nodes are separated by at least two other nodes, the availability of the metadata is unaffected. Even in the rare case of two consecutive nodes failing, the remaining copy can be made active.

Figure 1: Nutanix key-value database for managing metadata. The figure shows the ring in its initial state, during cluster expansion, and in its scaled state, with nodes labeled [Block Number/Node Number]. The metadata replication factor defaults to RF3 for small to medium size clusters and can increase based upon cluster size; new nodes are automatically inserted into the ring in a dynamic manner, and node placement is rack and block aware.
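The ring behavior described above can be illustrated with a small model. This is a simplified sketch of consecutive-node replica placement and quorum checking, not Nutanix's actual metadata store; the node names and functions are made up for illustration.

```python
# Simplified model of metadata replica placement on a ring: each key is
# stored on three consecutive nodes, and Paxos-style consistency needs a
# quorum of two of the three copies. Hypothetical sketch, not Nutanix code.

def replica_nodes(key, ring):
    """The three consecutive ring nodes that hold copies of this key."""
    i = key % len(ring)
    return [ring[(i + j) % len(ring)] for j in range(3)]

def key_has_quorum(key, ring, failed):
    """True if at least two of the key's three copies are still online."""
    surviving = [n for n in replica_nodes(key, ring) if n not in failed]
    return len(surviving) >= 2

ring = [f"node{i}" for i in range(8)]

# Two failed nodes separated by at least two others: every key keeps quorum,
# so metadata availability is unaffected.
print(all(key_has_quorum(k, ring, {"node0", "node4"}) for k in range(8)))  # True

# Two consecutive failed nodes: keys replicated across both lose quorum, and
# the single remaining copy has to be promoted and re-replicated.
print(all(key_has_quorum(k, ring, {"node0", "node1"}) for k in range(8)))  # False
```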

User data

Given the sensitive nature of user data, maximum steps must be taken to ensure it is always available and never lost. Availability of user data is handled at multiple levels, including controllers, devices and drives. Traditional storage arrays attempt to ensure availability at the write level by leveraging technologies such as NVRAM. However, NVRAM comes at a high cost and normally cannot provide adequate capacity. For persistent storage, RAID constructs are leveraged across multiple drives with parity bits, allowing data to be recomputed and rebuilt in the case of a drive failure. However, this comes at the price of computational overhead in the event of a failure, as well as long rebuild times that impact operational performance. The number of controllers also matters for user data, because controllers service every I/O request for it. In the case of a controller failure, I/O performance may degrade due to the increased load on the remaining controller.

Nutanix takes an entirely different approach, managing data protection without the need for expensive offloads or RAID constructs. Nutanix implements a fully distributed design that prevents data loss in the event of a node failure. Before any write is acknowledged to the host, it is synchronously replicated on an adjacent node. All nodes in the cluster participate in replication. Only after the data and its associated metadata are replicated will the host receive an acknowledgment of a successful write. This ensures that data exists in at least two independent locations within the cluster and is fault tolerant.

When Nutanix detects an accumulation of errors for a particular disk (e.g., I/O errors or bad sectors), that disk is marked offline and immediately removed from the storage pool. Nutanix software then identifies all extents stored on the failed disk and initiates re-replication of the associated replicas in order to restore the desired replication factor. Unlike traditional RAID arrays, which must undergo a full array rebuild, with the associated performance penalty, to restore data redundancy, Nutanix re-replication happens as a background process with no impact to cluster performance. In fact, as the cluster grows, the time needed to recover from a disk failure decreases, because every node participates in the replication. Since the data needed to rebuild a disk is distributed throughout the cluster, more disks are involved in the rebuild process, which increases the speed at which the affected extents are re-replicated.
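The write-acknowledgement and re-replication behavior described above can be sketched as a toy model. The class and function names (Node, write, rereplicate_after_failure) are hypothetical illustrations rather than Nutanix APIs; the sketch only shows the two invariants: no acknowledgment before a second copy exists, and the replication factor being restored from surviving copies after a failure.

```python
import random

# Toy model of the write path and disk-failure handling described above.
# Names are hypothetical illustrations, not Nutanix internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.extents = set()          # extents stored on this node's disks

    def write_local(self, extent):
        self.extents.add(extent)

def write(extent, local_node, cluster, rf=2):
    """Acknowledge a write only after rf independent copies exist."""
    local_node.write_local(extent)
    peers = [n for n in cluster if n is not local_node]
    replicas = random.sample(peers, rf - 1)
    for peer in replicas:             # synchronous replication before the ack
        peer.write_local(extent)
    return [local_node.name] + [p.name for p in replicas]   # ack with copy locations

def rereplicate_after_failure(failed_node, cluster, rf=2):
    """Restore the replication factor for every extent the failed node held."""
    survivors = [n for n in cluster if n is not failed_node]
    for extent in failed_node.extents:
        holders = [n for n in survivors if extent in n.extents]
        targets = [n for n in survivors if extent not in n.extents]
        needed = max(rf - len(holders), 0)
        for target in random.sample(targets, needed):
            target.write_local(extent)   # stands in for a background copy from a surviving replica

cluster = [Node(f"node{i}") for i in range(4)]
print(write("extent-42", cluster[0], cluster))             # e.g. ['node0', 'node2']
rereplicate_after_failure(cluster[0], cluster)
print(sum("extent-42" in n.extents for n in cluster[1:]))  # 2 copies remain elsewhere
```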

Architectural Differences Impacting Storage Failures

For discussion, we will examine two competing storage architectures:

1. Traditional centralized storage arrays connected to the server tier via a storage network fabric (e.g., Fibre Channel)
2. The Nutanix scale-out architecture, which converges data storage and storage control logic into clusterable nodes

Typical configuration diagrams are shown below.

Figure 2: Traditional storage architecture with dual HA controllers and multiple disk shelves

Figure 3: Nutanix scale-out architecture with controller logic and data storage local to each node. Each node runs VMs and a storage Controller VM (CVM) on the hypervisor, connected through a private vSwitch, with the CVMs communicating over a 10GbE network to form NDFS.

External Sources

The Nutanix architecture meaningfully reduces the number of failure points. For example:

- Use of direct-attached storage curtails the amount of physical cabling in the environment. Nutanix requires significantly fewer physical cables, thus reducing cabling errors and avoiding failures due to inadvertently disconnected cables.
- Distributing control logic among all nodes in a cluster yields a more fault-tolerant design than traditional architectures that rely upon a fixed number of storage controllers. For example, when a Nutanix Controller VM fails, continuous monitoring quickly identifies the unavailability of local control logic, and auto-pathing techniques automatically reroute the host's I/O requests to other controllers in the cluster. This is all done transparently to the hypervisor and VMs.
- Software-enabled controllers that fail can quickly be restarted by a failsafe monitoring process. In contrast, hardware-based controllers must be replaced upon failure, which typically endangers the availability of the system until the replacement controller is installed and initialized.

Failure Modes

This section details what happens when components in the storage architecture fail.

Switch failures or port failures

In legacy architectures, each storage controller consumes four switch ports to maintain connection redundancy. These ports, of course, are in addition to those required for the hosts themselves. In Nutanix's converged architecture, each node uses only two ports for communication between nodes. All intra-node communications (e.g., communications between a host and a controller) use strictly virtual switches. Software-enabled ports do not fail and do not need to be replaced or serviced. Fewer physical switch ports translate to less complexity, reduced deployment errors and a more reliable switching fabric compared with legacy architectures.
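A quick back-of-the-envelope count makes the port argument concrete. The four-ports-per-legacy-controller and two-ports-per-node figures come from the paragraph above; the host count and host uplink count are illustrative assumptions.

```python
# Rough physical switch-port count for a 12-host deployment. The "4 ports per
# legacy controller" and "2 ports per Nutanix node" figures come from the text;
# the host count and per-host uplinks are assumptions for illustration.

def legacy_ports(controllers=2, ports_per_controller=4, hosts=12, uplinks_per_host=2):
    return controllers * ports_per_controller + hosts * uplinks_per_host

def converged_ports(nodes=12, ports_per_node=2):
    # Host-to-controller traffic stays on virtual switches inside each node,
    # so only the inter-node uplinks consume physical switch ports.
    return nodes * ports_per_node

print(legacy_ports())     # 32 physical ports (controller ports plus host uplinks)
print(converged_ports())  # 24 physical ports
```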

Cable failures

As shown in Figure 2, the number of physical cables required by the Nutanix architecture can be dramatically lower than in legacy approaches. This is achieved by using more virtual cables and virtual switches, as well as by leveraging more reliable direct-attached technologies such as PCIe and SAS/SATA. Fewer physical cables result in less downtime due to cable misconfigurations or failures.

Controller failures

In legacy dual-controller architectures, a peer controller takes over in the event that the primary controller fails. Similar behavior occurs with Nutanix, except that there is no fixed peer controller. During a failure of a Controller VM, a fully operational peer is chosen, and auto-pathing technologies are employed to transparently divert traffic to that chosen peer. Because all control logic is implemented in software, there is no dependence upon preconfigured cabling. The storage system continues to function as long as there is a single connection from the affected Nutanix node to a top-of-rack switch.

Disk shelf failures

Legacy architectures are particularly prone to issues when a disk shelf fails. To understand failure scenarios completely, it is helpful to associate components in legacy systems with their logical Nutanix equivalents. For example, a shelf enclosure in a central array is equivalent to the SAS/SATA controller on a Nutanix node. A Nutanix node can have as many as five 1TB disks for storing user data. Usually, correlated disk failures happen within a shelf, or within a single node in the case of Nutanix. Uncorrelated failures, on the other hand, typically occur across shelves, with a much smaller probability that failures occur on the same shelf. Correlated failures usually happen at the same time. Uncorrelated failures, however, are spread out over time and almost never happen synchronously.

Correlated Failures

Correlated failures most often occur within the same storage shelf. In the worst case, the whole storage shelf can fail. With Nutanix's converged architecture, any number of individual disks in a node can fail (four or five), and overall system availability is unaffected. In fact, in a 12-node system where each 2U appliance houses four nodes, all 20 disks in an appliance (four nodes with five disks per node) can fail at the same time and there will be no data unavailability or loss. Further, system performance is not impacted by such failures, because reading data does not require reconstruction. When data is rebuilt, it is done in the background and does not require computationally expensive parity calculations.
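One way to tolerate the loss of an entire four-node appliance is to place the second copy of each extent on a node in a different block, in line with the block-aware placement noted in Figure 1. The sketch below is a hypothetical model of such a placement rule, not Nutanix code, and simply checks that no block failure can remove both copies of any extent.

```python
import random

# Hypothetical model of block-aware placement in a 12-node / 3-appliance
# cluster: the second copy of every extent is placed on a node in a different
# block (2U appliance), so losing a whole block never removes both copies.

nodes = [(f"block{b}", f"node{b}-{i}") for b in range(3) for i in range(4)]

def place_replicas(rf=2):
    first = random.choice(nodes)
    other_blocks = [n for n in nodes if n[0] != first[0]]
    return [first] + random.sample(other_blocks, rf - 1)

placement = {extent: place_replicas() for extent in range(1000)}

failed_block = "block1"  # the whole 2U appliance: four nodes, all 20 disks
survives = all(any(block != failed_block for block, _ in replicas)
               for replicas in placement.values())
print(survives)  # True: every extent keeps a live copy outside the failed block
```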

Contrast this with legacy storage systems, where the shelf enclosure constitutes a single point of failure (SPOF) whenever a RAID volume has more than two disks in a given shelf. As a workaround, array vendors recommend striping the RAID array across storage shelves so that no more than two disks belong to any given shelf. This approach is not always feasible. For example, in systems with a 14-disk or 24-disk shelf, a shelf failure can force up to 12 RAID groups to run in a degraded mode. Further, legacy vendors only allow reconstruction of two RAID groups at a time, while the other groups are left completely unprotected and exposed. RAID-DP reconstruction is a very expensive alternative, and it is very slow compared to the simple mirroring implemented by Nutanix.

Uncorrelated Failures

Uncorrelated failures are those that occur independently. The probability of uncorrelated failures occurring at the same time can be considered negligible in most circumstances; multiple uncorrelated failures are usually separated by a period of time. The Nutanix architecture can withstand an arbitrary number of uncorrelated failures, as long as the cluster has sufficient capacity to keep two copies of the live data. This is the normal state for nearly all Nutanix deployments. If one or more disks in a node fail, the data is re-replicated using the compute, I/O and storage resources of the entire cluster. Applying cluster-wide resources enables Nutanix to recover from the loss of one or more disks in a matter of hours. In fact, recovery time is inversely proportional to the size of the cluster: the larger the cluster, the faster the recovery. After redundancy is restored, the Nutanix cluster is ready to handle more uncorrelated failures with the same high reliability. Additionally, because there are no physical boundaries or replication silos in the Nutanix architecture, there is no need to set aside hot spares. Any available space in the cluster can serve as hot spare space.
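The claim that larger clusters recover faster can be illustrated with a rough model in which every surviving node contributes rebuild bandwidth. The per-node rate and the amount of data to re-replicate are illustrative assumptions, not measured values.

```python
# Rough model: the data on a failed disk is spread across the cluster, so the
# rebuild bandwidth grows with node count. The per-node rate and data volume
# are illustrative assumptions, not measurements.

def rebuild_hours(data_to_copy_tb, nodes, per_node_rebuild_mbps=200):
    aggregate_mbps = per_node_rebuild_mbps * (nodes - 1)  # all survivors participate
    seconds = (data_to_copy_tb * 1e6) / aggregate_mbps    # 1 TB ~ 1e6 MB
    return seconds / 3600

for n in (4, 8, 16, 32):
    print(n, "nodes:", round(rebuild_hours(1.0, n), 2), "hours")
# 4 nodes ~0.46 h, 8 ~0.2 h, 16 ~0.09 h, 32 ~0.04 h: recovery time shrinks
# roughly in inverse proportion to cluster size.
```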

Traditional dual-controller-based solutions cannot withstand two shelf failures, even if the failures are not temporally correlated. The Nutanix platform is much more reliable and more redundant than RAID-DP or equivalent incumbent technologies.

It is also useful to contrast the time needed to reconstruct data in the event of a failure. The loss of one storage shelf (or all disks therein) in a legacy system usually cripples performance while hot spares are consumed to restore redundancy. The reconstruction time of RAID-DP is much longer, typically 10 to 30 hours for a full reconstruction. In fact, reconstruction times increase with the configured size of the RAID group, and if two drives in the same RAID group fail, rebuild times increase by 2x or more. During this window of vulnerability, which is longer for larger disks, the system is exposed to the possibility of another disk or shelf loss, which would lead to a data loss event. Spares have to be strategically pre-allocated: even if space is available in the array, if there are no more spares, it is impossible to recover from a disk failure.

Another benefit of the Nutanix architecture is that recovery time is proportional to the actual amount of real user data. Thus, with thin provisioning, even if a large 1TB drive fails, the amount of user data that needs to be rebuilt could be small, and recovery times could be measured in minutes. In some legacy systems, however, the RAID array is not aware of which blocks on the disk need not be replicated because they have never been written to, or have been invalidated since being written. This leads to the RAID array replicating more data than is necessary, while exposing the user data to another failure for a longer time.
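Because only written extents need to be re-replicated, the amount of live data, rather than the raw drive size, determines recovery time. The short sketch below contrasts the two cases; the drive size, utilization and rebuild rate are illustrative assumptions.

```python
# Contrast rebuilding an entire drive with rebuilding only the written extents,
# as a thin-provisioning-aware system can. Drive size, utilization and rebuild
# rate are illustrative assumptions.

def rebuild_minutes(data_gb, rebuild_rate_mb_per_s=500):
    return (data_gb * 1000 / rebuild_rate_mb_per_s) / 60

drive_gb = 1000   # a 1TB drive
written_gb = 50   # only 5% of the drive holds live, non-invalidated user data

print(round(rebuild_minutes(drive_gb), 1))    # ~33.3 minutes: copy the whole drive
print(round(rebuild_minutes(written_gb), 1))  # ~1.7 minutes: copy only live extents
```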