
OceanStor UDS Massive Storage System Technical White Paper
Reliability
Issue 1.1
Date 2014-06
HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2013. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://enterprise.huawei.com

Contents

1 Executive Summary
2 Overview
2.1 Intended Audience
3 UDS Reliability Framework and Technology Overview
3.1 Decentralized Architecture
3.2 End-to-End Data Protection
3.2.1 Smart Disk Level: Disk Lifecycle Management
3.2.2 Node Level: Erasure Code Redundancy Algorithm
3.2.3 Data Center Level: Cross-Regional Disaster Recovery and Switchover
3.3 Continuous Data Detection and Repair
3.3.1 Short-Term Fault Handling
3.3.2 Data Integrity
3.3.3 End-to-End Consistency
4 Conclusion
A Acronyms and Abbreviations

1 Executive Summary

Ever since storage came into being, data security has been a focus of the professional storage field. Different classes of storage products have been created to meet different requirements for performance, data security, and cost. No matter how advanced a storage system is, its system software or commodity hardware may fail. Some faults have an immediate impact on system availability, for example a faulty disk, CPU, or mainboard. Other faults may be discovered only when data is read or written, for example a faulty file system, data partition, or memory. An excellent storage system should therefore minimize the impact of software and hardware faults on availability through a well-designed architecture, and take measures to ensure data integrity.

In the past, disk storage was designed mainly for online enterprise services, so the requirement for quick response took priority over reliability and recoverability. As data grows explosively, it is becoming ever more closely connected to, and plays a greater role in, people's daily lives. Cloud service providers are particularly concerned about data reliability because it is the foundation of their services. In a cloud storage infrastructure, data reliability, not system latency, should be the first concern, because storage system latency is far too small to be a bottleneck compared with network latency. If data integrity is neglected, data is at risk, and the loss of user data has serious consequences for cloud service providers and may even expose them to legal risks.

Requirements for disk-based backup and archiving are also soaring in enterprise applications. Here too, data reliability, not system performance, should be the first concern. If data integrity is neglected, data is at risk: when duplicates of primary or archived data are lost, recovery may be impossible. Most current disk backup and archiving systems are built on inexpensive disks. Although those systems have inherited the design concepts of mainstream storage arrays, they actually improve performance by sacrificing data reliability.

The HUAWEI OceanStor UDS massive storage system (the UDS for short) is an innovative massive storage product designed for public cloud, disk backup, and archiving.

2 Overview

The UDS is a breakthrough in storage system design. Its design takes into account the requirements of public cloud, backup, and archiving. The UDS introduces a decentralized architecture based on algorithm addressing, together with refined reliability and scalability built on strict total cost of ownership (TCO) control. This white paper focuses on the reliability design of the UDS in the following three aspects:

Reliability revolution brought by the decentralized architecture
End-to-end fault prevention and control
Continuous data detection and repair

The UDS performs end-to-end data verification to ensure that users get the data availability protection corresponding to the Service Level Agreement (SLA) policy they have set.

2.1 Intended Audience

This white paper is intended for Huawei customers, technical consultants, partners, and anyone who is interested in UDS reliability.

3 UDS Reliability Framework and Technology Overview

3.1 Decentralized Architecture

The UDS uses a decentralized architecture: the sea of disks (SoD) framework. The storage layer and the access layer are located in two separate cluster rings, namely the access cluster and the storage cluster. The storage cluster is made up of smart disks distributed on the distributed hash table (DHT) ring. Figure 3-1 shows the UDS architecture.

Figure 3-1 UDS architecture (data flows from the access layer, through the SoD engine, to smart disks on the DHT ring of the storage cluster)

The distributed data storage ring consists of all the smart disks. When data needs to be stored in the UDS, the access node partitions the data and stores the partitions on different smart disks on the DHT ring, based on the storage addresses calculated by the hash algorithm. Before data is stored on smart disks, the access node automatically detects the status of the smart disks and then distributes data according to the status of each disk. In this way, load in the storage layer is automatically balanced, reducing the load on each smart disk and avoiding single points of failure.
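The hash-based placement described above can be illustrated with a minimal consistent-hashing sketch in Python. It is only a sketch under assumed details: the disk names, the virtual-node count, and the use of MD5 are illustrative choices, not UDS internals. What it demonstrates is that any access node can compute a partition's location by itself, without a central metadata service.

    import bisect
    import hashlib

    class DHTRing:
        """Minimal consistent-hashing ring mapping partition keys to smart disks."""

        def __init__(self, disks, vnodes=64):
            # Place several virtual nodes per disk on the ring to even out load.
            self.ring = sorted(
                (self._hash(f"{disk}#{v}"), disk)
                for disk in disks for v in range(vnodes)
            )
            self.points = [point for point, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def locate(self, partition_key):
            """Return the smart disk responsible for a data partition."""
            idx = bisect.bisect(self.points, self._hash(partition_key)) % len(self.ring)
            return self.ring[idx][1]

    # Every access node runs the same calculation, so no controller has to
    # record where a partition lives; adding disks only moves a small share
    # of the keys.
    ring = DHTRing([f"smartdisk-{i:03d}" for i in range(40)])
    print(ring.locate("object-42/partition-0"))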

If any smart disk becomes faulty (whether through human error or mechanical failure), data on this disk is recovered to other smart disks automatically. The damaged disk can be removed from the UDS without affecting data availability or data partitioning. Unlike in a traditional storage system, a faulty disk does not degrade a RAID group or damage other disks. With the decentralized architecture, immediate maintenance is not needed, reducing data loss risks caused by human error. When either of the following conditions is met, all the faulty smart disks can be maintained in batches:

1. When the failure rate or the capacity threshold reaches the upper limit, critical alarms are generated asking customers to replace disks in batches or expand capacity. In this way, the maintenance period is prolonged and maintenance costs are reduced. Example critical alarms: the disk damage rate reaches 6% or the used capacity exceeds 85%.
2. Customers are advised to create a routine maintenance schedule. During each maintenance window, all damaged or slow smart disks are replaced in batches, reducing the possibility of system faults. After batch replacement is completed, data partitions are redistributed to healthy nodes by the smart balancing algorithm, preserving disk lifespan, lowering the disk failure rate, reducing data loss, and improving system reliability.

Access cluster

1. All the access nodes are connected by the load balancer to form a distributed access node cluster. Access nodes do not exchange data with each other; they only synchronize smart disk status information through the Gossip mechanism. Access nodes do not control or store the layout information of data and metadata, because the DHT hash algorithm automatically calculates the data storage position.
2. This innovation is a great breakthrough in storage structure. In a traditional storage structure, data layout is determined by the controller and then recorded. In the UDS structure, the access node determines the data flow direction through rules and does not record or manage layout information. This change greatly simplifies the work of access nodes and removes the biggest bottleneck for cluster scalability and reliability: the complicated data synchronization and lock mechanism is no longer needed between cluster nodes, which eliminates restrictions on cluster size. That synchronization and lock mechanism limits the expansion of a clustered system and is a great risk to its consistency and reliability.
3. With the UDS's decentralized architecture, any access node fault (man-made or mechanical) has no impact on system availability. With node overload control, the impact of access nodes on the whole system is further reduced.

3.2 End-to-End Data Protection

3.2.1 Smart Disk Level: Disk Lifecycle Management

Because of the reliability limits of mechanical disks, the annual failure rates of enterprise-level disks and desktop disks are about 2% and 6% respectively. The UDS disk lifecycle management focuses on lowering disk failure rates and controlling the impact of faulty disks. It consists of disk detection, disk repair, disk failure control, and pre-reconstruction technologies.

1. Disk detection: The UDS monitors disk properties such as data throughput, motor startup time, and seek failure rate, and compares them with standard values. The UDS then adjusts the weight of each smart disk. Furthermore, the UDS predicts disk faults using the disks' SMART statistics and migrates data before a disk fails, reducing the reconstruction workload.
2. Disk repair: When a disk is faulty, the UDS repairs the bad logical sector. If the fault persists, the unrecoverable sector is isolated and the UDS reconstructs the data in that sector. The reconstruction usually takes a short time and is automatic, which prevents human errors.
3. Disk failure control and pre-reconstruction: If a disk cannot be repaired, the UDS declares the disk invalid and starts pre-reconstruction to eliminate risks.

Figure 3-2 shows the disk failure detection process.

Figure 3-2 Disk failure detection process

Disk lifecycle management greatly reduces the disk damage rate, prolongs disk lifespan, and has the following advantages:

1. Automatic hardware control: In the self-managing smart disk architecture, smart disks can monitor, control, and recover themselves and perform data synchronization selectively, enabling refined hardware management and recovery. System robustness is improved and hardware faults are reduced.
2. An annual disk failure rate as low as 0.7%: Disk lifecycle management effectively reduces the disk failure rate. It repairs and isolates bad physical and logical sectors in a timely manner through real-time monitoring of disk status. Furthermore, weight management balances data and service load, prolonging disk lifespan and reducing the disk failure rate to 0.7%.

3.2.2 Node Level: Erasure Code Redundancy Algorithm

Disk capacity has been growing rapidly, from several hundred GB to 2 TB and 3 TB, and disks of 6 TB and 10 TB will soon be deployed at large scale. Large-capacity disks, while cost-efficient, make disk data security, and disk reconstruction in particular, a more serious problem. A traditional RAID group is made up of member disks and hot spare disks. When a disk is faulty, a hot spare disk replaces it and the storage system reconstructs the data. In RAID 5, reconstruction of an empty 2 TB disk takes about 20 hours, and reconstruction of a disk holding data may take a week. While data is being reconstructed, system performance and reliability are greatly compromised, and if one more disk fails in the RAID 5 group, all data in the group is lost, because RAID 5 tolerates only one disk failure at a time.

To tackle those challenges, Huawei developed the Erasure Code (EC) redundancy algorithm, which is a superset of RAID. The UDS arranges smart disks into a unified storage pool called the DHT ring. Data is written to the storage system according to the redundancy policy: the system writes data blocks and parity blocks as objects to randomly selected smart disks. Smart disks automatically manage all their internal objects and the physical space corresponding to those objects. If any object or its space is faulty, the UDS moves the object or repairs the bad sector, avoiding reconstruction of the whole disk. Even in the special cases where a whole disk must be reconstructed, the smart disk clones healthy data units to a new smart disk in advance, and only the faulty data is verified and recalculated. In this way, reconstruction time is greatly decreased and data loss caused by multiple faulty disks is reduced.

Working principle of EC

The EC mechanism contains the following three parts:

1. Creating storage pools: All the disks are organized into a DHT ring according to a certain rule. The DHT ring technology introduced earlier is the basis of the EC technology.
2. Partitioning objects: The UDS uses the EC technology to divide objects into data partitions of fixed length. Based on the EC algorithm, every M data partitions have N parity blocks. The UDS then writes the partitions to different smart disks based on the DHT hash algorithm, and the smart disks determine where to save the data. Figure 3-3 shows the partitioning principle of the EC algorithm.

Figure 3-3 Partitioning principle of the EC algorithm

In the UDS, customers can define the values of M and N for the EC algorithm. Different tenants and sub-tenants can configure different EC policies; the EC policy is part of the SLA in the multi-tenant model. A bigger N means higher data reliability but a lower capacity utilization rate. The maximum N is 6, so up to 6 smart disks or storage nodes may fail simultaneously; the maximum allowed number of faulty disks (not exceeding 6) is also related to the total number of disks. Compared with the redundancy protection provided by RAID, the UDS offers customers better protection. In the EC policy, M (the number of data blocks) = {12, 15} and N (the number of parity blocks) = {3, 6}. The EC policy, disk type, and number of nodes have a direct impact on data durability.

3. Balanced object saving: The UDS writes data to different storage nodes based on the DHT hash algorithm. If any hardware fault occurs during the write, the short-term fault handling mechanism is started (see section 3.3.1 "Short-Term Fault Handling"). Figure 3-4 shows the principle of the DHT ring hash algorithm.

Figure 3-4 Principle of the DHT ring hash algorithm
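As a rough illustration of the partitioning step (Figure 3-3) and the M/N policy arithmetic above, the sketch below splits an object into fixed-length data partitions, groups every M of them, and reserves space for N parity partitions per group. The partition size and the placeholder parity routine are assumptions for illustration only; a real deployment derives the parity partitions with an erasure code rather than the stub shown here.

    from dataclasses import dataclass

    @dataclass
    class ECPolicy:
        m: int  # number of data partitions per EC group
        n: int  # number of parity partitions per EC group

        @property
        def utilization(self):
            # Usable capacity fraction, e.g. M=12, N=3 -> 12/15 = 80%.
            return self.m / (self.m + self.n)

        @property
        def max_failures(self):
            # Any N partitions (disks or nodes) may be lost without data loss.
            return self.n

    def partition_object(data: bytes, policy: ECPolicy, part_size: int = 1024):
        """Split an object into fixed-length data partitions per the EC policy."""
        parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
        # Pad the last partition and group every M partitions into one EC group.
        if parts and len(parts[-1]) < part_size:
            parts[-1] = parts[-1].ljust(part_size, b"\0")
        for i in range(0, len(parts), policy.m):
            group = parts[i:i + policy.m]
            # Placeholder only: a real system derives N parity partitions from
            # the M data partitions with an erasure code (not implemented here).
            parity = [b"\0" * part_size for _ in range(policy.n)]
            yield group, parity

    policy = ECPolicy(m=12, n=3)
    groups = list(partition_object(b"x" * 5000, policy))
    print(f"groups={len(groups)}, utilization={policy.utilization:.0%}, "
          f"tolerated failures={policy.max_failures}")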

Refined EC reconstruction algorithm

To further improve data durability, Huawei created the refined EC reconstruction algorithm to recover seemingly unrecoverable data. Smart disks manage the partitioned objects and their space, determine the range of lost data by scanning, and divide the EC group dynamically to reconstruct the lost data. In Figure 3-5, M is 6 and N is 2. Four disks have damaged data, but a plain EC group can recover only two of them because N is 2, so some data would be lost. Figure 3-5 shows the data partitioning before the reconstruction.

Figure 3-5 Data partitioning before the reconstruction
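The group-splitting idea behind the recovery procedure described below can be sketched in a few lines. The partition names, damage ranges, and sizes are assumptions modeled loosely on Figures 3-5 and 3-6, not UDS internals; the sketch only shows how cutting one EC group at the damage boundaries leaves each sub-range with no more than N damaged partitions, so each sub-range can be rebuilt.

    def split_ec_group(damage, length, n):
        """Split an EC group's byte range into sub-ranges at damage boundaries.

        damage: {partition_id: [(start, end), ...]} damaged byte ranges
        length: partition length in bytes
        n:      number of parity partitions (max damaged partitions recoverable)
        Returns a list of (start, end, damaged_partitions, recoverable) tuples.
        """
        # Every damage boundary becomes a potential split point.
        cuts = {0, length}
        for ranges in damage.values():
            for start, end in ranges:
                cuts.update((start, end))
        bounds = sorted(cuts)

        sub_groups = []
        for start, end in zip(bounds, bounds[1:]):
            # A partition counts as damaged in this sub-range if any of its
            # damaged ranges overlaps [start, end).
            hit = [pid for pid, ranges in damage.items()
                   if any(s < end and e > start for s, e in ranges)]
            sub_groups.append((start, end, hit, len(hit) <= n))
        return sub_groups

    # Example loosely following Figures 3-5 and 3-6: four damaged partitions,
    # N = 2, but no sub-range has more than two damaged partitions, so all
    # data becomes recoverable once the group is split.
    kb = 1024
    damage = {"P1": [(0, 300 * kb)], "P2": [(0, 300 * kb)],
              "P3": [(300 * kb, 700 * kb)], "P4": [(700 * kb, 1024 * kb)]}
    for start, end, hit, ok in split_ec_group(damage, 1024 * kb, n=2):
        print(f"{start // kb:4d}-{end // kb:4d} KB damaged={hit} recoverable={ok}")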

To improve system reliability and availability and prevent data damage and loss, the UDS starts the refined EC reconstruction algorithm. The procedure is as follows:

1. Smart disks scan the physical areas of the partitions and determine the range and size of the damaged data partitions.
2. Based on the range and size of the damaged partitions, the EC group is divided temporarily so that each EC group has at most N damaged partitions. Figure 3-6 shows the divided EC groups: Erasure Code 1-1 (0 KB to 300 KB), Erasure Code 1-2 (300 KB to 700 KB), and Erasure Code 1-3 (700 KB to 1024 KB). One EC group is temporarily divided into multiple EC groups (three in this example) so that no divided EC group has more than N damaged partitions.
3. The UDS reconstructs the damaged data in the temporary EC groups.

Figure 3-6 shows the data partitioning after the reconstruction.

Figure 3-6 Data partitioning after the reconstruction

The EC algorithm brings the following innovations:

1. Better data durability: With an EC configuration of M:N = 15:6, data reliability can reach 99.99999999999%, minimizing data loss risks.
2. Faster data reconstruction: The UDS distributes data objects across different smart disks for reconstruction. Because many disks take part in parallel, the reconstruction time for 1 TB of data drops to four hours, greatly reducing the reconstruction window and improving system reliability.
3. Reconstruction of damaged data only: Data reconstruction is object-based and only damaged objects are reconstructed; intact objects and empty regions are not processed. This greatly increases the effective reconstruction rate.
4. Global hot spare and batch disk replacement: The UDS uses global hot spare space rather than hot spare disks for data reconstruction. When the failure rate of smart disks reaches the threshold, the UDS raises alarms for disk replacement. Immediate maintenance is not needed, which reduces spare-part stock and upgrade cycles and saves maintenance effort.
5. Fewer restrictions on data recovery and enhanced EC recoverability: Compared with the RAID algorithm, the refined EC reconstruction algorithm effectively relaxes the limits on how much damaged data can be handled. Any damaged data block can be fully recovered, which enhances data durability and system reliability.
6. More flexible data redundancy and improved disk utilization: The EC policy configuration available to tenants is M (the number of data blocks) = {12, 15} and N (the number of parity blocks) = {3, 6}. The EC policy, defined according to the user environment, has a direct impact on data durability and gives users a flexible redundancy ratio and more choices of data durability. The EC algorithm can flexibly control the disk utilization rate and reduce costs. For example, when M is 12 and N is 3, the disk utilization rate is nearly 80%.
7. Enhanced storage management efficiency and lower costs: Users do not need to spend much time on storage planning, because all the disks automatically form a unified storage pool; users only need to insert new disks to expand the system capacity, and the UDS automatically distributes data to each disk.
8. Smart reconstruction of EC groups and lower system load: The UDS intelligently determines the range and size of the damaged data, temporarily re-forms EC groups, and segments the data to be recovered, which reduces system load and improves recovery efficiency.

3.2.3 Data Center Level: Cross-Regional Disaster Recovery and Switchover

Service continuity is an inherent attribute of massive storage. Operators' cloud storage systems and enterprises' backup and archiving systems all rely on it, and service continuity increasingly depends on the storage system's disaster recovery capability to handle all risks and challenges. A traditional disaster recovery deployment consists of three data centers in two regions (one active and two standby centers). The disaster recovery center only processes backup requests from the active data center, which wastes storage resources. If a customer has multiple data centers that act as active and standby for one another, or if all the data centers are interrelated, the investment can be huge.

The UDS uses an active-active data center mode and synchronizes data using asynchronous data replication policies. Users access the nearest UDS system, which maximizes resource utilization and reduces investment. (The active-active mode will be available by the end of 2014; the UDS currently uses the active-standby mode.) The UDS supports multiple data centers and provides cross-regional data redundancy policies. When disasters or unpredicted events hit the active data center, user data is not lost and service data can be quickly recovered in the standby data centers, shortening service interruptions and ensuring service continuity. In addition to protecting data and supporting service continuity, the multiple data centers supported by the UDS maximize resource utilization and increase users' return on investment through resource management and smart load balancing. Figure 3-7 shows the networking for multiple data centers.

Figure 3-7 Networking for multiple data centers

Not all data needs to be stored in multiple data centers to improve durability. Some data in the UDS needs high durability while other data needs only ordinary durability, so the SLA policy is used to differentiate the durability of different data. Based on the multi-tenant, multi-data-center (MDC) technology, multiple data centers have the following advantages:

1. Unified resource scheduling: Multiple data centers use global virtualization technology to establish unified storage pools and consolidate resources, increasing the resource utilization rate. Based on the scheduling policy, users can put and get objects at the nearest data center.
2. SLA policy control: Based on the SLA contracting mechanism, the UDS controls disaster recovery, the number of disaster backups, and quality of service (QoS) to support customer services and optimize service and resource utilization.
3. Cross-regional disaster recovery: Data centers perform remote backup and data verification through the HTTP/REST disaster recovery interface, optimizing data transmission over the Internet and improving disaster recovery efficiency. Data centers provide disaster recovery for one another to optimize resource utilization.

3.3 Continuous Data Detection and Repair

3.3.1 Short-Term Fault Handling

Despite the many reliability technologies available, computer hardware occasionally fails. In a storage system, disk drive faults are the most common problem, and partial or temporary faults also occur: for example, one disk block cannot be read, or one bit of the interconnect or internal system bus is faulty.

The UDS defines a special intermediate state, the short-term fault state. If the UDS detects a short-term fault, it starts the fault diagnosis mechanism and tries to rectify the fault. If recovery fails and the fault persists for more than X (X is configurable, for example 15 minutes), the fault becomes a permanent fault and the component exits the DHT ring. The UDS then starts the data recovery policy to recover the damaged data. Figure 3-8 shows the change of fault states.

Figure 3-8 Change of fault states

The short-term fault handling technology has the following advantages:

1. Improved data and performance stability: Only faults that cannot be detected and rectified by the short-term handling mechanism are treated as permanent faults requiring adjustment of the DHT ring, reducing the workload for data recovery and migration.
2. Transparency to upper-layer services and enhanced service continuity: The short-term fault handling mechanism is invisible to upper-layer services, ensuring service continuity and durability.
3. Efficient use of system performance and a higher data recovery rate: Once a fault is considered permanent, the UDS starts the data rebalancing process and distributes the damaged data to multiple nodes through effective data distribution control.

3.3.2 Data Integrity

The UDS provides multiple levels of data integrity protection and supports a four-level data repair function: track, partition, object, and data center. When damaged data is detected, the UDS handles the fault as follows:

1. Try to recover the data using disk track level repair.
2. If the data cannot be recovered, the UDS tries to repair it from redundant data blocks with the EC algorithm.
3. If the data still cannot be recovered, the refined EC reconstruction technology repairs the data in bad sectors by dividing the EC group and reading the undamaged data in multiple damaged objects.
4. If all three methods fail, the UDS tries to recover the damaged data from backup data centers.

Figure 3-9 shows the processing flowchart for damaged data.

Figure 3-9 Processing flowchart for damaged data
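The repair hierarchy above is essentially a fall-through: each level is attempted only if the previous one fails. The sketch below illustrates that control flow with hypothetical stand-in checks; the level names follow the flowchart, but the predicate functions and their inputs are assumptions, not UDS interfaces.

    def repair_damaged_data(damage, repair_levels):
        """Try each repair level in order and stop at the first success.

        repair_levels is an ordered list of (name, repair_fn) pairs; each
        repair_fn returns True if the damaged data was recovered.
        """
        for name, repair_fn in repair_levels:
            if repair_fn(damage):
                return name  # repaired at this level
        raise RuntimeError("data could not be recovered at any level")

    # Illustrative stand-ins for the four levels; a real system would call
    # the disk firmware, the EC decoder, the refined EC reconstruction, and
    # a remote (backup) data center respectively.
    levels = [
        ("track level repair", lambda d: d.get("track_repairable", False)),
        ("block verification repair", lambda d: d.get("parity_available", False)),
        ("EC refined repair", lambda d: d.get("group_splittable", False)),
        ("remote backup repair", lambda d: d.get("remote_copy", False)),
    ]
    print(repair_damaged_data({"parity_available": True}, levels))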

The data integrity protection measures effectively ensure end-to-end data durability and have the following advantages:

1. Four-level integrity protection, ensuring data durability: The UDS provides end-to-end repair measures for damaged data with track, partition, object, and data center level integrity protection.
2. Tiered recovery measures, reducing the resource consumption of data recovery: Based on the degree of data damage, the UDS tries to recover the data within as small a range as possible to reduce resource consumption.

3.3.3 End-to-End Consistency

The UDS supports end-to-end data consistency verification, which is divided into the following four levels:

1. Application level: The UDS relies on the verification built into the TCP/IP and HTTP protocols to ensure that data received from and sent to users is correct.
2. Object level: At the access node, the MD5 algorithm verifies data objects.
3. Partition level: Each partition carries verification fields calculated with the HMAC-SHA256 algorithm. After smart disks receive data partitions, they check whether the partitions are correct.
4. Physical level: Each partition and its verification code are verified by the track cyclic redundancy check (CRC) after being written to disk.

Figure 3-10 shows the different levels of UDS data consistency verification.

Figure 3-10 Different levels of UDS data consistency verification
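A minimal sketch of the object, partition, and physical level checks follows. The key, the use of CRC32 as a stand-in for the on-disk track CRC, and the function names are illustrative assumptions; only the algorithms themselves (MD5, HMAC-SHA256, a CRC) are those named above.

    import hashlib
    import hmac
    import zlib

    def object_digest(data: bytes) -> str:
        # Object level: MD5 digest computed at the access node.
        return hashlib.md5(data).hexdigest()

    def partition_mac(partition: bytes, key: bytes) -> bytes:
        # Partition level: HMAC-SHA256 carried with each partition; the smart
        # disk recomputes it on receipt to verify the partition.
        return hmac.new(key, partition, hashlib.sha256).digest()

    def physical_checksum(partition: bytes, mac: bytes) -> int:
        # Physical level: a checksum over the partition and its verification
        # code, checked again after the write (CRC32 as a stand-in for the
        # on-disk track CRC).
        return zlib.crc32(partition + mac)

    data = b"user object payload"
    key = b"illustrative-key"  # assumption, not a documented UDS detail
    digest = object_digest(data)
    mac = partition_mac(data, key)
    crc = physical_checksum(data, mac)

    # On read-back, recomputing and comparing each value detects silent
    # errors at the corresponding level.
    assert digest == object_digest(data)
    assert hmac.compare_digest(mac, partition_mac(data, key))
    assert crc == physical_checksum(data, mac)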

With these four levels of data verification, the UDS ensures that no silent errors occur during the data writing process, from the moment data is sent by end users to the moment it is written to disk. At each level, once a verification failure is detected, the UDS can repair or resend the data quickly. End-to-end data consistency verification greatly improves data security and has the following advantages:

1. Data is not corrupted in storage or transmission, ensuring data correctness.
2. Malicious data tampering by internal personnel of the cloud storage service provider is prevented, increasing data security.

4 Conclusion

No single technology can ensure data availability and integrity on its own; data reliability can be ensured only by combining multiple mechanisms to cover all risks. Unlike traditional storage systems, the UDS is designed for reliability and scalability and has the following advantages:

The decentralized architecture based on algorithm addressing provides an excellent answer to system reliability problems, enhancing data integrity and recoverability.

With smart disks, the UDS controls software and hardware faults and data integrity in a refined manner. The end-to-end data protection policies ensure that hardware and software faults at any level, including smart disks, storage nodes, cabinets, and data centers, are handled properly to protect data availability.

With uninterrupted real-time data detection and repair, the UDS automatically finds and repairs potential faults before they become problems.

With its innovative architecture and focus on reliability, the Huawei UDS provides customers with a highly reliable, scalable, and low-cost massive storage solution.

A Acronyms and Abbreviations

Table A-1 Acronyms and abbreviations

Acronym or Abbreviation    Full Spelling
UDS                        UDS massive storage system
SoD                        Sea of Disks
A-node                     Access Node
UDSN                       Universal Distributed Storage Node