EMC SCALEIO OPERATION OVERVIEW
Ensuring Non-disruptive Operation and Upgrade

ABSTRACT
This white paper reviews the challenges organizations face as they deal with the growing need for always-on levels of service availability. It illustrates how the EMC ScaleIO architecture provides the tools needed to address these challenges.

March 2015
EMC WHITE PAPER

To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store.

Copyright 2015 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part Number H14036

TABLE OF CONTENTS

EMC SCALEIO INTRODUCTION
ARCHITECTURE OVERVIEW
SCALEIO DATA CLIENT (SDC)
SCALEIO DATA SERVER (SDS)
METADATA MANAGER (MDM)
NON-DISRUPTIVE OPERATION
DATA STORAGE AND ACCESS
DISTRIBUTED VOLUME
TWO-COPY MESH MIRRORING
ADDING NEW SDS NODES AND AUTO-REBALANCE
REMOVING SDS NODES AND FAULT HANDLING
PROTECTION DOMAINS
FAULT SETS
FORWARD AND BACKWARD REBUILD
ADDING AND REMOVING STORAGE MEDIA
MDM FAILURE
ADDING AND REMOVING SDC
NON-DISRUPTIVE UPGRADE
AVAILABLE TOOLS
UPGRADE PREPARATION
UPGRADE PROCESS
CONCLUSION

EMC SCALEIO INTRODUCTION

Organizations have experienced similar pain points over the past few decades as they try to manage growth, operational efficiency, service levels, and cost. In a typical environment, the IT department runs dedicated SAN environments that are limited in how far they can scale. Acquiring new hardware and performing data migrations takes significant effort, and there is no way to pool the overall available resources so that users can share I/O operations or capacity. In some cases, users run out of capacity in one application while resources sit under-utilized in other areas. The environment becomes extremely hard to maintain as it grows more complex over time, and even more so when the organization is also required to reduce cost.

Many alternative solutions cannot deliver capabilities similar to EMC ScaleIO. Many claim to provide scalability and high performance, but they are often black-box solutions that tend to be expensive to maintain over time and difficult to scale up or down as needed. There are also open source alternatives; however, they require a great deal of manual labor and internal developer expertise to maintain and tune during normal operation, and most cannot take advantage of modern media such as SSD and PCIe/NVMe persistent storage because of poor performance.

ScaleIO is an industry-leading technology that offers what competitors cannot: hyper-convergence, scalability, elasticity, and performance. The software converges storage and compute resources into a single architectural layer that resides on the application server. The architecture scales out from as few as three servers to thousands simply by adding servers (nodes) to the environment. This is done elastically: capacity and compute resources can be increased or decreased on the fly without impact to users or applications. ScaleIO also has self-healing capabilities that allow it to recover easily from server or disk failures.

ScaleIO aggregates all the IOPS of the various servers into one high-performing virtual SAN, and all servers participate in servicing I/O requests using massively parallel processing. In addition, ScaleIO is hardware agnostic, so there is no limitation on what hardware customers are required to use. In a VMware environment, ScaleIO can even support standard VMware block storage features such as VMotion and DRS.

This paper discusses the key characteristics of ScaleIO with regard to resiliency and flexibility, which organizations need from a solution that supports an always-on infrastructure. This applies specifically in two areas:

- Non-disruptive Operation (NDO): Any node in a cluster may go down at any time. ScaleIO ensures NDO by tolerating failures caused by unplanned outages as well as by graceful maintenance. Data migrations are normally very time consuming and costly for most data centers; ScaleIO's auto-rebalancing and rebuild processes allow them to happen seamlessly.
- Non-disruptive Upgrade (NDU): The ScaleIO software can be updated without interruption to the storage system service and data access. ScaleIO version 1.3x and later supports rolling upgrades, so no downtime is required during the maintenance process.

ARCHITECTURE OVERVIEW

ScaleIO is composed of three software components. Understanding these components is critical to ensuring non-disruptive operation and to troubleshooting the performance of the cluster.
SCALEIO DATA CLIENT (SDC)

The SDC is a lightweight device driver that exposes ScaleIO volumes as block devices to the operating system of the server it is installed on. The SDC should be installed on any server where the user wants to consume ScaleIO storage. All I/O requests go through the SDC, which communicates with the SDSs over TCP/IP.

SCALEIO DATA SERVER (SDS)

The SDS contributes the local storage of the node it is installed on. It manages the capacity of a single server and acts as the back end for data access. Many SDS nodes can contribute storage capacity to the ScaleIO cluster, aggregating not only the storage capacity of all application servers but also their performance. The SDS can leverage flash, SSD, or HDD for storage, and RAID cache or RAM for caching.

METADATA MANAGER (MDM)

The MDM serves as the monitoring and configuration agent. It is important to understand that the MDM is not part of the data path: when an SDC needs to access data, it goes directly to the specific SDS that holds the required information. The MDM is used for management functions such as migration, rebuilds, and all other system-related operations. To support high availability, two instances of the MDM can run on different servers, and an MDM may run on servers that also run SDCs and/or SDSs.
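To make the division of labor concrete, the following Python sketch models the three components at a purely conceptual level: the MDM owns the cluster map, the SDC caches that map and issues I/O directly to the owning SDS, and no I/O ever passes through the MDM. All class, method, and field names are illustrative assumptions for this sketch, not ScaleIO APIs.

```python
# Conceptual model of the ScaleIO control/data path split.
# Names and data structures are illustrative only; they are not ScaleIO APIs.

class MDM:
    """Holds cluster metadata (chunk-to-SDS mapping). Never touches I/O data."""
    def __init__(self):
        self.volume_map = {}          # (volume, chunk_index) -> sds_id

    def get_map(self, volume):
        """Called by an SDC only when it needs (or must refresh) the mapping."""
        return {k: v for k, v in self.volume_map.items() if k[0] == volume}

class SDS:
    """Contributes local disks and serves reads/writes for the chunks it owns."""
    def __init__(self, sds_id):
        self.sds_id = sds_id
        self.chunks = {}              # (volume, chunk_index) -> bytes

    def write(self, key, data):
        self.chunks[key] = data

    def read(self, key):
        return self.chunks[key]

class SDC:
    """Block device driver on the application server; talks to SDSs directly."""
    def __init__(self, mdm, sds_nodes):
        self.mdm = mdm
        self.sds_nodes = sds_nodes    # sds_id -> SDS
        self.cached_map = {}          # refreshed only when the map is stale

    def read(self, volume, chunk_index):
        if (volume, chunk_index) not in self.cached_map:
            self.cached_map.update(self.mdm.get_map(volume))   # control path
        owner = self.sds_nodes[self.cached_map[(volume, chunk_index)]]
        return owner.read((volume, chunk_index))                # data path
```

In the real system the SDC refreshes its mapping when the cluster topology changes; the cache-miss refresh above is only a stand-in for that behavior.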

NON-DISRUPTIVE OPERATION

The ScaleIO architecture is very flexible across the situations of a typical operation lifecycle, such as proactive maintenance and unplanned outages. The following summarizes the major operations and how ScaleIO handles them to ensure uptime and availability:

- Write Data: Data chunks are written directly to multiple nodes; each chunk has a primary and a secondary copy.
- Read Data: Data access goes directly to the SDS that holds the primary chunks. No communication with the MDM is needed, which avoids a bottleneck and a single point of failure.
- Add SDS Node: ScaleIO triggers an auto-rebalancing process to redistribute the chunks across the cluster. This also enables linear scaling of both capacity and I/O performance.
- Remove SDS Node (gracefully or ungracefully): A graceful SDS removal via the CLI triggers rebalancing; an ungraceful removal or failure starts the rebuild process. Existing I/O operations are not disrupted.
- Add/Remove Storage Media: ScaleIO follows a process similar to adding or removing an SDS with respect to rebalancing and rebuild; it treats the event as an SDS reconfiguration.
- MDM Failure: A single MDM failure does not impact I/O operations because the MDM is not in the data path and is clustered for high availability. The SDC communicates with the SDS directly for I/O requests.
- SDC Reboot or Network Issues: A ScaleIO volume is shared by default, and the CLI is available to troubleshoot connectivity with the Primary MDM.

DATA STORAGE AND ACCESS

This section discusses the mechanics of ScaleIO's storage engine. Its resilient design provides fault tolerance while still optimizing performance. This is possible because data is sent in multiple chunks across many nodes using a distributed volume architecture and a two-copy mesh mirroring mechanism.

DISTRIBUTED VOLUME

When there is a write request, the data chunks (1MB) are spread throughout the cluster randomly and evenly. The local SDC communicates with a specific SDS to perform this operation. The MDM is not involved in data access unless there is a change in cluster topology; in that case, the MDM provides the new mapping to the local SDC. ScaleIO uses a sophisticated algorithm to evaluate cluster balance and ensure randomness.

Note that ScaleIO is very efficient in managing network bandwidth. If an application writes 4KB of data, only 4KB are written; the same applies to reads, where only the required data is read. This scheme is designed to maximize protection and optimize performance.
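The following sketch illustrates the general idea of splitting a volume into 1MB chunks and spreading primary and secondary copies across SDS nodes so that the two copies of a chunk never share a node. The placement scheme shown (round-robin over a shuffled node list) is a deliberately simple stand-in, not ScaleIO's actual placement algorithm.

```python
import random

CHUNK_SIZE = 1024 * 1024   # 1MB chunks, as described above

def chunk_volume(volume_size_bytes):
    """Split a volume into 1MB chunk indexes."""
    n_chunks = (volume_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE
    return list(range(n_chunks))

def place_chunks(chunk_indexes, sds_nodes, seed=None):
    """Assign each chunk a primary and a secondary SDS, never on the same node.

    Round-robin over a shuffled node list is only an illustration of
    'random and even' spreading; the real placement logic is not shown here.
    """
    rng = random.Random(seed)
    nodes = list(sds_nodes)
    assert len(nodes) >= 2, "mirroring needs at least two SDS nodes"
    rng.shuffle(nodes)
    placement = {}
    for i, chunk in enumerate(chunk_indexes):
        primary = nodes[i % len(nodes)]
        secondary = nodes[(i + 1) % len(nodes)]   # guaranteed different node
        placement[chunk] = (primary, secondary)
    return placement

# Example: an 8GB volume spread over three SDS nodes
layout = place_chunks(chunk_volume(8 * 1024**3), ["sds1", "sds2", "sds3"], seed=1)
per_node = {}
for primary, secondary in layout.values():
    per_node[primary] = per_node.get(primary, 0) + 1
    per_node[secondary] = per_node.get(secondary, 0) + 1
print(per_node)   # roughly even chunk counts per node
```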

TWO-COPY MESH MIRRORING

Mirroring applies only to data writes, ensuring there is a second copy of each primary chunk. In the example in Figure 1, a volume is broken down into multiple chunks across a cluster of three SDS nodes. ScaleIO must locate the mirror of each chunk (yellow) on a node other than the one holding its primary copy (blue). In a failure scenario, the volume data remains protected because ScaleIO can reach the redundant copies to reconstruct the volume. Across the cluster, 50% of the chunks are primaries and 50% are mirrors.

Note that only a write operation requires 2x data transmission over the network due to mirroring. For a read operation, the SDC reaches out to the primary SDS directly to retrieve the data chunks.

[Figure 1. Mesh mirroring example with one volume and three SDS nodes]

ADDING NEW SDS NODES AND AUTO-REBALANCE

ScaleIO is very flexible when it comes to scaling. There is no need to invest in a costly data migration effort, because capacity can be added with no downtime and minimal planning. This is a major factor in reducing operational costs and growth complexity.

To increase capacity in the cluster, new SDS nodes are added. The system dynamically reacts to the addition and recalculates its rebalancing plan. This process happens automatically with minimal data movement. Unlike traditional methods, where only new volumes benefit from new capacity, the ScaleIO cluster rearranges data between the SDS servers to optimize performance. As shown in Figure 2, when the storage administrator adds a new node to the cluster, data chunks in existing nodes automatically migrate so that they are distributed evenly across all nodes, including the new one.

[Figure 2. Auto-rebalancing process when adding a new SDS node (before/after)]
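The idea of auto-rebalancing with minimal data movement can be sketched as a simple capacity-evening exercise: when a node joins, only enough chunks to level out the load are moved. The greedy plan below illustrates that principle; it is an assumption made for this example, not ScaleIO's actual rebalancing algorithm.

```python
def rebalance_plan(chunk_counts, new_node):
    """Plan chunk moves after adding a node so all nodes hold a near-equal share.

    chunk_counts: dict of node -> number of chunks currently stored.
    Returns a list of (source_node, destination_node) moves.
    Greedy illustration only; the real placement logic is more involved.
    """
    counts = dict(chunk_counts)
    counts.setdefault(new_node, 0)
    target = sum(counts.values()) // len(counts)   # even share per node
    moves = []
    for donor in [n for n in counts if counts[n] > target]:
        while counts[donor] > target and counts[new_node] < target:
            counts[donor] -= 1
            counts[new_node] += 1
            moves.append((donor, new_node))
    return moves

# Example: three nodes each holding 600 chunks, then a fourth node is added.
plan = rebalance_plan({"sds1": 600, "sds2": 600, "sds3": 600}, "sds4")
print(len(plan))   # 450 moves: only a quarter of the chunks migrate
```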

REMOVING SDS NODES AND FAULT HANDLING

Protecting the SDS is important for ensuring high availability and I/O performance. SDS nodes can go down either through planned maintenance and removal or through unplanned outages. In a planned scenario there is flexibility to ensure the cluster has enough capacity for redistribution, and the rebalancing activity can be tuned to minimize its I/O impact. Removing an SDS node can be done via the CLI, the ScaleIO GUI, or the REST API, and the data always remains in a protected state. In a failure event, however, ScaleIO triggers a rebuild because protection is degraded.

It is important to understand some key ScaleIO concepts related to fault handling of an SDS node:

- Protection Domains
- Fault Sets
- Forward and Backward Rebuild

PROTECTION DOMAINS

Protection Domains are subsets of SDS nodes. The administrator can divide SDSs into multiple Protection Domains of various sizes and designate volume-to-domain assignments. Within a Protection Domain, both the primary and mirror data chunks of a particular volume are stored on SDS nodes that belong to that same Protection Domain. If two SDS nodes fail but they are in different Protection Domains, data remains available because each Protection Domain holds the mirror copies on its own SDS nodes. This isolation increases the resilience of the overall system.

In addition, Protection Domains can be used to separate volumes for performance planning, for example by assigning highly accessed volumes to less busy domains or by dedicating a particular domain to an application. In a Service Provider environment, they are an important feature for data location and partitioning in multi-tenant deployments, allowing tenants to be segregated efficiently and securely. Finally, administrators may also use Protection Domains to adjust to different network constraints within the system.

FAULT SETS

Within a Protection Domain, the administrator can set up multiple Fault Sets. A Fault Set is simply a logical grouping of SDSs that ensures mirroring occurs outside of that grouping. The grouping can be defined based on various risk factors; for example, the administrator can treat one rack as a Fault Set. If that rack goes down, the mirror chunks in a different rack (outside the Fault Set) can be used to rebuild the cluster. A Fault Set can therefore act as a rack-level high availability feature, allowing the system to survive multiple host failures without data loss.
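The placement rule that Fault Sets impose can be illustrated with a short sketch: a chunk's mirror must land outside the Fault Set that holds its primary copy. The node-to-Fault-Set mapping below is a hypothetical example (one rack per Fault Set, as suggested above), not ScaleIO's internal representation.

```python
# Illustrative check of the Fault Set placement rule described above:
# a chunk's mirror must live outside the Fault Set that holds its primary.
# The fault_set_of mapping is a hypothetical representation for this example.

fault_set_of = {
    "sds1": "rack-A", "sds2": "rack-A",    # Fault Set = one rack
    "sds3": "rack-B", "sds4": "rack-B",
    "sds5": "rack-C", "sds6": "rack-C",
}

def valid_mirror_placement(primary_sds, mirror_sds):
    """A mirror is valid only if it sits in a different Fault Set (rack)."""
    return fault_set_of[primary_sds] != fault_set_of[mirror_sds]

def pick_mirror(primary_sds, candidates):
    """Pick any SDS outside the primary's Fault Set to hold the mirror copy."""
    for sds in candidates:
        if valid_mirror_placement(primary_sds, sds):
            return sds
    raise RuntimeError("no node outside the primary's Fault Set is available")

print(pick_mirror("sds1", ["sds2", "sds3", "sds5"]))   # -> "sds3" (rack-B)
```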
FORWARD AND BACKWARD REBUILD

The rebuild process is triggered when there is a change in SDS topology due to the unexpected loss of a storage device or an entire node. There are two cases:

- Forward Rebuild: When a disk or an SDS node fails, the rebuild load is balanced across all the disks and nodes of the Protection Domain. This is a many-to-many process.
- Backward Rebuild: If that same failing disk or node returns to an operational state during the forward rebuild, ScaleIO triggers a smart, selective transition to a backward rebuild (re-slivering).

There is very little performance penalty during a rebuild because the ScaleIO algorithm optimizes the process and gives the administrator control:

- The administrator can set policies for rebuild I/O, such as concurrency, bandwidth, and priority relative to in-flight application I/O.
- Unlike traditional solutions, which tend to treat a node coming back alive as a blank node, ScaleIO's intelligent engine evaluates whether the data chunks on that node are out of date to determine whether it should continue the forward rebuild or leverage the chunks already there. A shorter outage therefore results in a smaller performance penalty.

The example in Figure 3 demonstrates how rebuild works when a node fails. Within a few seconds of failure detection, the mirrored chunks held outside the failed node are copied to different nodes in a many-to-many operation (for example, chunk A on SDS 2 is copied to SDS 3, and chunk C on SDS 3 is copied to SDS N). No two copies of the same chunk are allowed to reside on the same server. While this operation is in progress, all data remains accessible to applications: the local SDC can read from the mirror SDS, so there is no outage or delay.

[Figure 3. Rebuild process when SDS 1 has failed]
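The forward rebuild described above can be sketched as follows: every chunk that lost a copy on the failed node is re-protected by copying its surviving mirror to another node, and the copy targets are spread across the remaining nodes, which is what makes the operation many-to-many. The target-selection logic below is a simplified assumption, not ScaleIO's actual algorithm.

```python
from itertools import cycle

def forward_rebuild(placement, failed_node, surviving_nodes):
    """Recreate protection for every chunk that lost a copy on failed_node.

    placement: chunk -> (primary_node, mirror_node)
    Returns copy operations (source_node, target_node, chunk) plus the new map.
    Round-robin target selection illustrates the 'many-to-many' spread.
    """
    targets = cycle(surviving_nodes)
    copies, new_placement = [], {}
    for chunk, (primary, mirror) in placement.items():
        if failed_node not in (primary, mirror):
            new_placement[chunk] = (primary, mirror)
            continue
        survivor = mirror if primary == failed_node else primary
        # Never place the new copy on the node that already holds the survivor.
        target = next(t for t in targets if t != survivor)
        copies.append((survivor, target, chunk))
        new_placement[chunk] = (survivor, target)
    return copies, new_placement

# Example matching Figure 3: SDS 1 fails, chunks are re-protected from mirrors.
layout = {"A": ("sds1", "sds2"), "B": ("sds2", "sds3"), "C": ("sds1", "sds3")}
ops, layout = forward_rebuild(layout, "sds1", ["sds2", "sds3", "sdsN"])
print(ops)   # [('sds2', 'sds3', 'A'), ('sds3', 'sdsN', 'C')]
```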

ADDING AND REMOVING STORAGE MEDIA

In these operations the number of SDS nodes does not change. The media to be added to (or removed from) one of the SDS nodes could be HDDs, SSDs, or PCIe flash cards. ScaleIO treats the event as an SDS reconfiguration, which means it redistributes the data accordingly and seamlessly; the administrator does not have to redistribute data manually when a storage device is taken out. Note, however, that if a storage device is removed for planned maintenance, there must be enough spare capacity for its data to be evacuated; otherwise, ScaleIO will not allow the removal.

MDM FAILURE

It is important not to confuse the purpose of the MDM. It does not keep the metadata or index of the actual data, nor does it perform data operations. The MDM exists for system maintenance, management operations, monitoring the state of all components, and calculating system reconfigurations to optimize available resources. The MDM is not on the ScaleIO data path. This design prevents the bottleneck that would arise if multiple clients had to query a single point of failure to identify a data chunk's location. The resource consumption of the MDM cluster is minimal and should not impact overall cluster performance and bandwidth.

For redundant management, a ScaleIO system requires three nodes: a Primary MDM, a Secondary MDM, and a Tie-Breaker. In a failure event, ScaleIO provides automated failover as well as allowing manual intervention, depending on the scenario:

- If the Primary MDM goes down, the system fails over to the Secondary MDM. When the failed MDM becomes operational again, the administrator can add a new MDM IP address via the CLI.
- If the Secondary MDM goes down, there is no impact to management traffic because the Primary MDM is still handling management functions.
- If the Tie-Breaker goes down, the system continues to operate as normal, since the Tie-Breaker is used only for HA and conflict resolution.

ADDING AND REMOVING SDC

Since the SDC is just a client, a consumer of storage, ScaleIO only exposes volumes for I/O operations on that node. It is possible to designate which SDCs can access given volumes. Volume access can be restricted or shared among multiple SDCs, depending on whether the operating system is configured to support clustered mode; in that case multiple SDCs can access the same volume. To simplify troubleshooting, the ScaleIO CLI provides several options, for example to:

- Check whether the volume is mapped to any of the SDC servers
- Determine whether the SDC is installed
- Determine whether the SDC is connected to an MDM
- Scan for new volumes
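The three MDM failure cases above can be captured in a small state sketch. The role names come from the description above, but the class structure and method names are assumptions for illustration, not a ScaleIO interface.

```python
# Illustrative model of the MDM cluster failover behavior described above.
# Role names (Primary, Secondary, Tie-Breaker) come from the text;
# the class itself is a hypothetical sketch, not a ScaleIO interface.

class MDMCluster:
    def __init__(self):
        self.roles = {"mdm-a": "Primary", "mdm-b": "Secondary", "mdm-c": "Tie-Breaker"}

    def handle_failure(self, failed_member):
        role = self.roles[failed_member]
        self.roles[failed_member] = "Failed"
        if role == "Primary":
            # Automated failover: the Secondary takes over management duties.
            secondary = next(m for m, r in self.roles.items() if r == "Secondary")
            self.roles[secondary] = "Primary"
            return "failover to Secondary; management continues, I/O unaffected"
        if role == "Secondary":
            return "no impact; Primary still handles management"
        # Tie-Breaker is used only for HA quorum / conflict resolution.
        return "no impact; system operates as normal"

cluster = MDMCluster()
print(cluster.handle_failure("mdm-a"))   # Primary failure -> Secondary promoted
print(cluster.roles)                     # {'mdm-a': 'Failed', 'mdm-b': 'Primary', ...}
```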

NON-DISRUPTIVE UPGRADE

AVAILABLE TOOLS

It is recommended to use the existing tools for automated deployment and upgrade unless a manual method is absolutely required. These tools help the administrator ensure that the upgrade process is non-disruptive:

- ScaleIO Gateway Installation Manager (IM): validates environment compatibility before performing installation, upgrade, extend, and uninstall operations.
- ScaleIO Light Installation Agent (LIA): installed on all ScaleIO nodes; establishes trust with the Installation Manager to facilitate its operation.
- ScaleIO vSphere Plug-In: installs ScaleIO in an ESX environment and also installs the ScaleIO Gateway, which can be used for Get Info and upgrade operations.

UPGRADE PREPARATION

Starting with ScaleIO version 1.30, all components can be upgraded during normal operation with no downtime. There are some considerations to ensure a smooth upgrade process:

- ESX servers should be in a cluster with proper HA settings configured. This allows the administrator to reboot an ESX server after upgrading the SDC component without impacting virtual machine availability.
- Although the ScaleIO vSphere web plug-in is not required, it is highly recommended to ensure NDU.
- The IM and LIA must be upgraded to the latest version before they can trigger the component upgrade. If they were not installed previously (for example, after a manual installation), they should be installed.
- Before the upgrade, the IM checks the system to avoid degraded capacity; otherwise, it will not proceed.
- In some cases the organization may require a manual upgrade, due to internal policy or a previous failure during an automated upgrade. Before a manual upgrade, the administrator must verify in the ScaleIO GUI that (a sketch of these checks follows this list):
  - No rebuild or rebalance is running in the background
  - No degraded capacity exists
  - No SDS or SDC is disconnected
  - No SDS device is in an error state
  - The MDM cluster is not in degraded mode, including the Tie-Breaker
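As a compact illustration of those manual pre-upgrade checks, the Python sketch below evaluates a snapshot of cluster state and reports which conditions would block the upgrade. The snapshot keys are hypothetical names invented for this example; they do not correspond to actual ScaleIO GUI fields or API properties.

```python
def preupgrade_blockers(state):
    """Return the list of conditions that should block a manual upgrade.

    `state` is a hypothetical snapshot of what the administrator reads from
    the ScaleIO GUI; its keys are illustrative, not real ScaleIO fields.
    """
    checks = [
        (state["rebuild_or_rebalance_running"], "rebuild/rebalance still running"),
        (state["degraded_capacity"],            "degraded capacity exists"),
        (state["disconnected_sds_or_sdc"] > 0,  "an SDS or SDC is disconnected"),
        (state["sds_devices_in_error"] > 0,     "an SDS device is in error state"),
        (state["mdm_cluster_degraded"],         "MDM cluster (incl. Tie-Breaker) degraded"),
    ]
    return [reason for failed, reason in checks if failed]

snapshot = {
    "rebuild_or_rebalance_running": False,
    "degraded_capacity": False,
    "disconnected_sds_or_sdc": 0,
    "sds_devices_in_error": 1,
    "mdm_cluster_degraded": False,
}
blockers = preupgrade_blockers(snapshot)
print(blockers or "safe to proceed with the upgrade")
# -> ['an SDS device is in error state']
```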

UPGRADE PROCESS

At a high level, the upgrade proceeds step by step as described in Figure 4 below.

[Figure 4. Upgrade process, showing the components in upgrade order: ScaleIO vSphere Web Plug-in, Installation Manager, Light Installation Agent, Secondary MDM, Primary MDM, Tie-Breaker, SDS, SDC (if ESXi 5.5)]

The vSphere web plug-in is needed only for VMware ESXi environments; it also orchestrates the updating of the IM and LIA components. For Windows and Linux, only the IM is used to trigger the upgrade. There are two scenarios in an ESX environment:

1. If the plug-in was never installed previously, it can be triggered from outside the cluster. The administrator needs to make sure that the host has connectivity to vCenter and that its credentials are available. It is also possible to set up the ScaleIO Gateway first and point to its URL, because the vSphere Web Client needs to access a web server to retrieve the plug-in.
2. If the plug-in is an older version, it must be updated. The administrator needs to unregister it first. Removing the older plug-in does not impact the existing ScaleIO system in the cluster, because the plug-in is used only to trigger installation and upgrade from within the vSphere Web Client. After the old plug-in is removed, the administrator can register the new plug-in and upload the OVA template.

Upgrading the IM and LIA has no operational impact, because these components are used for neither management nor data transmission. The entire process can be executed from the IM GUI under the Maintenance Operation screen. The IM communicates with all ScaleIO nodes via the LIA components to retrieve the system topology and perform the upgrade in the order shown in Figure 4:

1. Secondary MDM: ScaleIO replaces the binary and restarts the MDM process. There is no host reboot, and the event completes within a second. After it is upgraded, this MDM switches roles to become the Primary MDM.
2. Primary MDM: This MDM is now operating as the Secondary. It follows the same upgrade process, with no impact to management traffic. The system automatically assigns control back to this MDM after the upgrade.
3. Tie-Breaker: Since the Tie-Breaker does not sit in the management path, it can be upgraded after the Primary and Secondary MDM upgrades are complete.
4. SDS: The SDS components are upgraded in series so that the rebuild process can finish before the next SDS is upgraded. ScaleIO leverages spare capacity for this rebuild, and it is good practice to set the spare capacity equal to the largest node. Once the upgrade is completed, only the services are restarted; no reboot is required.
5. SDC: ScaleIO installs the newest version on the host while the old version continues to operate. The SDC is backwards compatible, so the old SDC works with the new SDSs and MDMs until the system is rebooted and the SDC is replaced. The switch to the new version happens only at reboot, which is necessary whenever a device driver is changed. The administrator needs to make sure that the volume is not mapped and locked before the reboot, and it is good practice to wait until the next maintenance window to perform this step. In an ESX environment, where the SDC is installed inside the kernel, HA cluster capability assists in migrating VMs to a different machine during the reboot.

CONCLUSION

ScaleIO is a hyper-converged infrastructure with enterprise-grade resilience that offers greater choice. Customers have the option of deploying ScaleIO using existing storage, new storage, or a mix of both, on any physical environment or virtualization/cloud platform, including Linux, OpenStack, VMware, or Hyper-V. ScaleIO is a strong fit for Service Providers, or Service Provider-like enterprises, that deliver Infrastructure-as-a-Service (IaaS) to customers or internal users. Customers not only achieve a lower Total Cost of Ownership (TCO) but also gain complete control over performance, capacity, and data location.

The ScaleIO architecture is designed so that there is no bottleneck or single point of failure. In other storage systems, a virtualization layer that keeps track of data (for example, an index or journal) usually results in massive failure and disruption when that layer becomes unavailable. A ScaleIO cluster instead uses many-to-many communication in a mesh network, which enables large-scale parallelism and high I/O performance.

Performing maintenance and lifecycle tasks without downtime is critical, and the benefits customers get from non-disruptive operations are significant. ScaleIO provides this resiliency and flexibility in many ways:

- No downtime when changing, scaling, or upgrading the storage infrastructure
- An efficient, distributed self-healing process that overcomes media and node failures without requiring administrator involvement
- Fine-grained control of the auto-rebalancing and rebuild processes to prevent scenarios where rebuild activity hogs application I/O
- High tolerance of multiple simultaneous failures via Protection Domains and Fault Sets
- Easy physical separation of tenants in multi-tenant deployments
- An intelligent data protection engine that makes rebuild decisions on a chunk-by-chunk basis
- Flexible linear scaling by adding nodes on the fly
- True elasticity through support for any commodity hardware and any storage media
Maintenance and lifecycle tasks without downtime is critical. The benefits customers get from non-disruptive operations are significant. There are many ways that ScaleIO provides such resiliency and flexibility: No downtime when changing, scaling or upgrading the storage infrastructure Efficient distributed self-healing process that overcomes media and node failures without requiring administrator involvement Great control of auto-rebalancing and rebuild process to prevent application hogging scenarios High tolerance of multiple simultaneous failures via Protection Domains and Fault Sets Easy physical separation of tenants in multi-tenancy deployments Intelligent data protection engine which makes rebuild decision on chunk by chunk basis Flexible linear scale by adding nodes on the fly True elasticity by supporting any commodity hardware and any storage media 10