Scality RING. Software Defined Storage for the 21st Century. Philippe Nicolas Director of Product Strategy



Table of Contents

1  Executive Summary
   The Need for Massively Scalable Storage
2  Requirements for a New Generation of Exabyte-Scale Storage Solutions
3  Limitations of Last-Generation Storage Technologies
4  Next Generation Technology
   GENESIS
   SCALE-OUT AND SHARED-NOTHING
   CONSISTENCY MODEL
   TOPOLOGIES AND ROUTING
   DATA TRANSPORT
   NEW DATABASE MODEL
   NEW IT SOLUTIONS CHARACTERISTICS
   OBJECT STORAGE FOR MASSIVE SCALE
5  Scality RING Object Storage
   DEFINITION
   ARCHITECTURE AND COMPONENTS
   CONNECTORS
   STORAGE NODES
   THE IMPORTANCE OF ROUTING
   BUILDING THE RING'S TOPOLOGY
   IO OPERATIONS
6  Scality RING Topologies and Deployment Models
   DATA PROTECTION
   REPLICATION
   ADVANCED RESILIENCE CONFIGURATION
   AUTO-TIERING
   MESA: A DISTRIBUTED DATABASE FOR METADATA
   SPARSE FILE TECHNOLOGY
7  Scality RING Access Methods
   OBJECT ACCESS APIS
   FILE SYSTEM ACCESS
   STANDARD FILE SHARING PROTOCOLS
   HADOOP INTEGRATION
   OPENSTACK CINDER
8  Scality RING Management
9  Scality RING Performance
   ESG LAB VALIDATION HIGHLIGHTS
10 Conclusion: A Storage Solution Operating at Exascale
References

Table of Figures

Figure 1: Scality delivers on-premise Google-like storage platform
Figure 2: Scality RING, a Software-Defined Storage platform
Figure 3: Scality RING scalability models: capacity only, performance only or both
Figure 4: High-level view of Scality RING versus traditional storage
Figure 5: From physical servers to storage nodes organized in a logical ring
Figure 6: 20-byte Scality key format
Figure 7: Fast and efficient node search with Chord
Figure 8: Parallelism from connectors to storage nodes to iods
Figure 9: Deployment models for the RING
Figure 10: Two examples of RING deployment
Figure 11: Topologies supported by Scality RING
Figure 12: Replication with 3 replicas
Figure 13: Scality ARC(14,4) model
Figure 14: Scality ARC example vs. Replication and Dispersed approaches
Figure 15: Scality MESA distributed metadata database
Figure 16: Scality SOFS parallel design for maximum throughput
Figure 17: Scality SOFS, from Volume to Namespace
Figure 18: Logical view of Scality file storage services
Figure 19: Scality Hadoop platform
Figure 20: Scality Hadoop stack with SOFS/CDMI
Figure 21: Scality Hadoop deployment model
Figure 22: Scality Supervisor GUI for provisioning
Figure 23: Scality Supervisor GUI for node management
Figure 24: Object performance with Replication
Figure 25: File performance with ARC
Figure 26: Scality's storage vision

Table of Tables

Table 1: NAS, SAN and Next Gen characteristics and limitations
Table 2: Illustration of Brewer's theorem
Table 3: From legacy to next generation IT solutions
Table 4: Impact of system size on routing algorithm
Table 5: Different durability levels and configuration impacts
Table 6: Object access APIs
Table 7: File access methods and protocols
Table 8: Summary of ESG benchmark tests

1 Executive Summary

THE NEED FOR MASSIVELY SCALABLE STORAGE

The ubiquity of the Internet has radically transformed the IT landscape. Every Internet user queries at least one search engine and checks email daily, often using multiple accounts. Users upload photos to online albums, connect with colleagues and friends over social networks, author product reviews, pore over travel and restaurant reviews, post videos, and share personal experiences using multimedia-rich content.

Who is driving the worldwide explosion of data? Everyone: businesses and consumers, as well as devices and machines. We are all contributing to the massive explosion of user-generated information. Leading social and e-commerce web sites like Facebook, eBay, Yahoo and Netflix were designed to perform at Cloud scale. In contrast, many leading IT vendors did not design their solutions to handle this volume of data. Most traditional IT vendors, including leading providers of storage, have found it difficult to match the pace of growth and innovation demanded by the Internet and Cloud. Few have had the luxury of starting over with a clean-sheet design.

As has happened many times in the history of technology, disruptive innovation has come from newer, more agile players: emerging companies not encumbered by a portfolio of legacy products and technologies. These companies have been successful in inventing new classes of products, providing solutions that are much larger in scope and that offer improved functionality based on a fundamental rethinking of core technology principles. Such breakthroughs invariably improve IT performance as well as economics, enabling a large, mainstream market to deploy a class of solutions that had previously been affordable only for a few high-end customers.
To solve the data and storage challenges associated with new online services, leading Internet and e-commerce companies had to invent new platforms rather than rely on solutions from traditional storage and infrastructure vendors. Internet innovators viewed the last generation of IT solutions as constrained in many ways: they offered limited storage capacity and scalability, could not provide multi-site data services, and could not achieve Cloud-scale capacity or performance in a cost-effective way. Today, for example, the most scalable commercial NAS solutions offer, at best, a maximum of 20 petabytes of raw storage, a capacity inadequate for large data services such as those required by an online photo sharing site needing to store at least four times that amount. Traditional solutions are simply too complex, cumbersome and costly to support applications at Cloud scale.

In the absence of readily available, cost-effective, massively scalable commercial storage products able to support hundreds of millions of users and hundreds of petabytes of data, innovative companies in need of such capacity were forced to design their own storage systems. Using internal engineering and R&D teams and the insights of leading university computer scientists, these companies developed their own solutions based on open source software. Companies like Google, Facebook and Amazon succeeded in building game-changing platforms, embodying radically new approaches to their internal IT systems and operations. These fundamentally innovative platforms enabled these companies to lead the emerging Internet commerce and Cloud revolutions.

In 2009, Scality, today a leading software-defined storage provider, saw an opportunity to develop a commercial, datacenter-grade product based on many of the insights, concepts and new computing models developed by the top Internet and Cloud companies. Figure 1 compares Scality's approach to that of two leading Internet companies with regard to data center, application and data ownership.

Figure 1: Scality Delivers an On-Premise Google-like Storage Platform

The Scality RING object storage solution brings to the enterprise the technological innovations developed to support leading Internet commerce and Cloud service providers. A few years ago, a wave of innovation and invention prompted the development of a number of similar solutions. These similarities are manifestations of tremendous market developments commonly referred to as the consumerization of IT. A decade ago, CIOs looked for inspiration to the technologies and infrastructure deployed by large telco and bank data centers; now it is the practices and technologies developed and used by the big Internet players that drive IT innovation.

Beginning in 2009, a team of engineers at Scality designed, from the ground up, a unique storage software approach: a technology that is completely hardware agnostic, able to store exabytes of data with very high durability, offering multiple flexible data access methods, and deployable and manageable at a very efficient cost. This is what the industry today calls Software-Defined Storage. This white paper describes the philosophy, architecture and design choices underlying Scality RING. Thanks to unique distributed algorithms covered by several patents, Scality RING is now recognized as one of the most powerful and advanced storage platforms on the market.

2 Requirements for a New Generation of Exabyte-Scale Storage Solutions

Today's enterprise computing environment suffers from many of the same challenges that confronted Internet and Cloud leaders. Applications that produce petabyte- and exabyte-scale data, once thought exotic, are increasingly common. In areas as diverse as media and entertainment, oil and gas, biotechnology, financial services and high performance computing, the amount of data that must be managed is outstripping the capability of existing technologies. When attempts are made to use existing technology to solve petabyte-scale problems, customers usually find the cost to be prohibitive.

In developing new storage solutions for the petabyte and exabyte era of computing, several requirements must be met to ensure that tomorrow's massive data stores will enjoy the same or better levels of access, protection and usability as today's enterprise repositories. These requirements include:

- Storage at Exascale. A next generation storage system must be capable of supporting users and data at Cloud scale. This requires unlimited storage, scaling to billions or trillions of objects, files and other entities without degrading performance.
- High Availability. Data must be available to users continuously, without interruption, even when the storage system is performing rebuild or data recovery operations, or when configurations are undergoing maintenance or upgrades.
- Data Durability. Data must be stored exactly as it was intended and must, upon access, prove to be identical to the data that was stored. The system must protect against the possibility of data corruption for any and all data stored in the system. Automated self-healing must guarantee protection against both software and hardware failures (servers or disks), data center disasters, loss of power and any other failure mode.
- High Performance Access.
  Data access operations must exhibit high throughput, high IOPS and low latency to ensure that the system can support mission-critical applications accessed by millions of users.
- Universal Data Access. Storage systems must be transparent to the applications that access them. Applications must be able to store and retrieve data using their existing file system protocols without requiring changes to the underlying applications. This means that storage systems must be able to interoperate with traditional file protocols such as NFS and CIFS as well as with newer environments, such as HTTP/REST and Hadoop HDFS.
- Geo-aware File Storage. The system must be able to synchronize and replicate data efficiently across dispersed data centers and offer multiple access points via data propagation.
- Simple Storage Management. Storage systems should simplify and automate storage management tasks such as provisioning, replication, and backup and recovery.
- Auto-Tiering. Policy-based methods should ensure that data is written or moved to the storage tier that provides the right balance of storage cost and performance, based on the lifecycle of each type of data stored on the system.
- Cost-Effective Scalability. Next generation storage should enable cost-effective scalability so that petabyte storage is affordable for mainstream IT buyers. TCO is also affected by automation capabilities and a hardware-agnostic approach driven by a software-defined philosophy. At scale, intelligent power management also plays a key role in the global financial equation.
- Storage Powered by Software. Decoupling storage from client applications is essential for true scalability, availability and cost efficiency. The goal of this new approach is to allow full programmatic control in and by the software, and the building of a large multi-tenant storage pool from a farm of heterogeneous physical servers.

3 Limitations of Last-Generation Storage Technologies

Last-generation storage solutions, such as NAS and SAN, fail to meet the requirements identified in the previous section when operating at anywhere near petabyte scale. Operational constraints at web scale make traditional approaches to data management, resiliency, durability and data protection fundamentally inadequate. As noted previously, leading Internet and Cloud companies such as Google and Amazon recognized early on that they needed to innovate in order to achieve the unprecedented scalability and high performance required to support a global user base. The table below outlines the limitations inherent in last-generation approaches to storage that led Internet innovators to seek new solutions. Among these limitations, three are the most critical: 1) scalability limits, 2) inadequacy of RAID data protection, and 3) WAN and geographic distribution limitations.

Nature
- NAS: File system
- SAN: Block device
- Next Gen: Unified

Fault tolerance
- NAS, SAN: Low; must have all components available (LUN, partitions...)
- Next Gen: High; only needs an ID to locate data (independent of physical topology)

Logical entity
- NAS: Network file system
- SAN: Volume, LUN
- Next Gen: Object, Bucket

Access methods
- NAS: Byte level via file path name
- SAN: Block level (512 bytes, 4 kB) via /dev
- Next Gen: Multiple, with Block, File and Object

Access protocols
- NAS: NFS, CIFS, pNFS, FTP
- SAN: SCSI, iSCSI, FC, FCoE, IB
- Next Gen: Object based, such as HTTP/REST, Amazon S3 and CDMI; compatibility with traditional Block and File is a plus

Data protection
- NAS, SAN: RAID, double parity and spares are not aligned with large-scale requirements; limited and costly geo-redundancy at Block or File level
- Next Gen: Replication, erasure coding, fault tolerance across nodes

Distance tolerance
- NAS: Medium (can be extended with WAN acceleration/optimization, but traditional file sharing protocols were not designed for the Internet)
- SAN: Limited (local: data center, building, a few kilometers with channel extenders)
- Next Gen: High (designed for the Internet)

Advantages
- NAS: Flexible (NAS clients embedded in the OS); well adopted; IOPS for scale-out NAS; bandwidth
- SAN: Well deployed and adopted; IOPS, bandwidth, low latency
- Next Gen: Very flexible, with programmatic API and legacy compatibility; IOPS and bandwidth; close to the application; geo-redundancy; hardware agnostic

Limitations
- NAS: Division between two file sharing protocols (NFS and CIFS); maximum number of files; maximum file size; file services can't be used over the Internet; geo-redundancy
- SAN: Rigid; disaster recovery; distance; number and size of volumes; RAID doesn't scale; geo-redundancy
- Next Gen: New data models

Use cases
- NAS: Vertical IT/industry; generic file share; office documents
- SAN: Database, VM and applications with low latency requirements
- Next Gen: STaaS; vertical IT/industry; unstructured content at scale

Cost
- NAS: $$$
- SAN: $$$$
- Next Gen: $

Table 1: NAS, SAN and Next Gen Characteristics and Limitations

4 Next Generation Technology

As the previous section has indicated, it is becoming increasingly difficult and cost prohibitive to support larger storage deployments using traditional NAS or SAN technology. Storage systems have already reached their physical limits using existing scale-up methods. To deal with these limitations, leading universities and major Internet firms have introduced a number of concepts essential to building a very scalable and agile IT infrastructure. Underlying this work in large-scale distributed computing has been the industry's practical experience of frequent failures of CPU, disk and network components as clusters increase in size. This section describes a number of theoretical principles informing the design of the next generation of scalable storage systems. These principles are of fundamental importance to Scality's core architecture, which is discussed in later sections of this paper.

GENESIS

To address these limitations and deliver a more scalable approach to enterprise storage, the IT industry has begun to consider new models employing scale-out and shared-nothing paradigms. In these models, data is distributed and managed together with its associated metadata as a single object constituting a new logical entity.

SCALE-OUT AND SHARED-NOTHING

The most demanding applications require both significant computing power and highly scalable storage capacity. Such applications must harness the resources of hundreds or thousands of servers, where computational power is scaled horizontally, across many separate compute nodes, rather than vertically, where the computational power of each node is increased with additional internal resources. A scale-out model is one in which many loosely coupled, independent components cooperate to deal with large amounts of data. Scale-out is a radically different IT concept, based on distributed computing.
Instead of using clusters of large, proprietary systems, organizations use commodity (Commodity Off The Shelf, or COTS) servers with intelligent software to manage their integration. This approach delivers functionality that is superior to older proprietary systems and, ultimately, improves performance, availability and scalability, as well as reducing hardware costs far beyond what proprietary systems can achieve. The scale-out approach is related to a recently introduced computing architecture called shared-nothing, in which each server brings its own resources to the cluster and the only shared resource is the network connecting the servers themselves. Clusters are built using self-contained peers connected over a relatively high-speed network. Each of these nodes uses standard components: x86 CPUs, internal disk or SSD drives, an Ethernet network, the IP protocol and a Linux OS.

CONSISTENCY MODEL

A consistency model is characterized by Brewer's CAP theorem, which states that a distributed system can satisfy, at most, two of three properties: Consistency, Availability and Partition Tolerance.

- Consistency (C): All nodes see the same data at the same time.
- Availability (A): A guarantee that every request receives a response indicating whether it succeeded or failed.
- Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure of part of the system.

The industry uses CA, CP or AP to define the behavior and characteristics of different distributed systems, each designed to satisfy different operational requirements. Two sub-categories also exist: "Strong Consistency" (SC) and "Eventual Consistency" (EC) modes. SC potentially reduces performance, while EC provides better response time but weaker consistency, depending on the configured environment and operating constraints.

Convention:
- R: Number of replicas contacted to satisfy a read operation
- W: Number of replicas that must send an acknowledgement (ACK) to satisfy a write operation
- N: Number of storage nodes storing the replicas of the requested data

Definition:
- Strong consistency when R + W > N
- Eventual consistency when R + W <= N
- Particular cases: W >= 1 (writes are always possible); R >= 2 (minimal redundancy)

Table 2: Illustration of Brewer's Theorem

TOPOLOGIES AND ROUTING

In very large computing environments with hundreds or even thousands of nodes, requests must be resolved in such a way as to avoid any bottleneck and satisfy SLAs. Different topologies exist to organize nodes. Increasingly, a ring is used, coupled with a consistent hashing mechanism to build and assign key ranges to nodes. In addition to the topology itself, an effective routing algorithm must be employed to achieve linear response times for write and read operations. Several research papers describe routing methods, including Chord from MIT (adopted and extended by Scality), CAN, Pastry, Tapestry and Kademlia.

DATA TRANSPORT

A data exchange protocol is key to architecting a highly scalable storage system, especially for exchanging data within a cluster. HTTP is universally supported by all systems and browsers, and is also supported by many web servers and connected services. For this reason, HTTP is often adopted as a key data transport component.

NEW DATABASE MODEL

Given the challenges of dealing with very large-scale unstructured, semi-structured and structured data (often originating on the Web), new database models have been developed, providing greater flexibility than traditional SQL models. NoSQL is the most popular of these new database designs. NoSQL owes its reputation to the fact that it avoids any rigid data structures.
This allows the database to store a very large number of records and to deliver a very high number of transactions, especially when the system has to deal with metadata. Some implementations succeed in maintaining the ACID properties (atomicity, consistency, isolation, durability), thus ensuring that transactions are processed reliably, while other implementations relax these constraints to satisfy different implementation goals. Scality's implementation, discussed later in this document, is fully ACID-compliant to guarantee reliable transaction processing.
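The quorum arithmetic in Table 2, above, reduces to a one-line rule that is easy to capture in code. The sketch below is purely illustrative; the function name and validation are assumptions, not part of any Scality or NoSQL API:

```python
def consistency_mode(r: int, w: int, n: int) -> str:
    """Classify a replica configuration using the quorum rule R + W > N.

    r -- replicas contacted to satisfy a read
    w -- replicas that must acknowledge (ACK) a write
    n -- storage nodes holding replicas of the data
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    # Read and write quorums overlap in at least one replica only if R + W > N,
    # so every read is guaranteed to see the latest acknowledged write.
    return "strong" if r + w > n else "eventual"

# Classic 3-replica configurations:
print(consistency_mode(r=2, w=2, n=3))  # strong: quorums of 2 always overlap
print(consistency_mode(r=1, w=1, n=3))  # eventual: a read may miss the write
```

Note how W >= 1 keeps writes possible even in the eventual mode, matching the particular cases listed in the table.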

NEW IT SOLUTIONS CHARACTERISTICS

The following table compares new and legacy IT models.

Infrastructure
- Virtualized → IaaS and software defined
- Dedicated (1 tenant) → Elastic, multi-tenant and shared
- Enterprise-grade → Carrier-grade
- Proprietary hardware → Commodity hardware

Application architecture
- Centralized → Distributed
- Stateful → Stateless
- Synchronous → Asynchronous
- Scale-up → Scale-out

Configuration management
- Manual → Automated
- Layer specific → Converged

Operational owner
- IT → DevOps

Preferred management tools
- On-premise → SaaS

Access methods
- Local only (block or file based) → Ubiquitous, shared and global with object (HTTP) and file

Data protection
- RAID and limited geo-copy → Data replication and erasure coding (geo-distributed)

Table 3: From Legacy to Next Generation IT Solutions

OBJECT STORAGE FOR MASSIVE SCALE

The need for massively scalable storage requires a new approach. Object storage was introduced to enable unlimited storage capacity, high performance, linear levels of service, and the capability to share content remotely and transparently across geographies. The leading Internet and Cloud providers developed their own object models based on their new requirements for global scalability. While the enterprise is beginning to adopt object storage, the first generation of object storage deployments was for public Cloud services. One of the best known is Amazon S3: by June 2012, six years after the service launched, Amazon demonstrated that an object storage system could readily store one trillion (10^12) objects.

Object storage has evolved from OSD (Object-based Storage Devices) and CAS (Content Addressable Storage) to wider use cases and interfaces. It is now defined by these key characteristics:

- An object is a self-describing opaque entity that contains data and associated metadata.
- An object belongs to a single flat namespace. This simple namespace guarantees transparent scalability.
- An object is location independent and does not utilize nested directories, file paths or other complex addressing schemes.
- Policies and user-defined metadata exist at the object level or bucket level.
- Object storage is, by nature, multi-tenant.
- Object storage provides vertical consistency, in that the model is simple and end-to-end, from the application to the object itself, without regard for volume size, number of objects, directory structure or file system layout.
- Object storage performance has a predictable, linear response, and is not degraded by central authority control mechanisms or lookups.
- The object requires a specific HTTP API, often REST-based, to connect to the application and to deliver content.
- Object storage provides a self-service mode and offers provisioning and metering capabilities.
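The first few characteristics, a self-describing entity (data plus metadata) addressed by a key in one flat namespace with no directories or paths, can be illustrated with a deliberately tiny in-memory model. This is a teaching sketch, not Scality code; in particular, deriving the key from a content hash is an assumption made here for simplicity.

```python
import hashlib

class FlatObjectStore:
    """Toy object store: one flat key space, no volumes, directories or paths."""

    def __init__(self):
        self._objects = {}  # key -> (data, metadata)

    def put(self, data: bytes, metadata: dict) -> str:
        # A location-independent key: derived from the content, not from
        # where the object lives or any directory structure.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = (data, dict(metadata))
        return key

    def get(self, key: str):
        """Return the object's data together with its self-describing metadata."""
        return self._objects[key]

store = FlatObjectStore()
key = store.put(b"hello world", {"content-type": "text/plain", "owner": "alice"})
data, meta = store.get(key)
```

The application keeps only the key; nothing about volume sizes, mount points or file system layout leaks into the access path, which is what gives the object model its transparent scalability.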

5 Scality RING Object Storage

As the example of top-tier Cloud and Internet innovators indicates, IT models and approaches need to change fundamentally to satisfy emerging requirements for orders of magnitude more computational power and storage capacity. Several years ago, Scality anticipated these requirements and understood that meeting them would require a paradigm shift, a fundamentally new design and a new class of IT storage solutions. Designed originally to deliver massively scalable consumer platforms, Scality developed a software-defined, large scale data storage platform able to store hundreds of petabytes of data and billions of files while providing high levels of data durability without compromising performance. The RING, Scality's software-defined object storage solution, is designed for the demanding requirements of Cloud-scale and large corporate data center environments. The RING was designed to meet four primary requirements:

- Store an unlimited amount of data;
- Protect data locally and globally to maximize its durability;
- Serve data to applications using multiple flexible access methods, each capable of delivering satisfying performance;
- Compute on data, if needed, using the hosts' processing power within the storage cluster itself.

DEFINITION

Scality RING is an award-winning, web-scale software storage solution. Scality RING is based on a patented object storage technology with full scale-out file system support. It is built on a distributed, shared-nothing architecture with no single point of failure. Built-in tiering provides maximum flexibility for storage configuration and data movement, and ensures low latency and high performance. The RING is designed to support very large volumes of unstructured data and to sustain heavy traffic and heavy data workloads. Scality RING is cost efficient to operate and delivers comprehensive data protection.
The RING operates seamlessly on any commodity server hardware, turning generic x86 servers into a reliable, high performance storage platform. These commodity servers provide the storage media, and Scality's software provides the storage provisioning and management, data protection, self-healing operations, high availability and automated tiering. Scality RING's object-oriented architecture enables it to overcome traditional scalability limits, easily storing and managing petabytes and exabytes of data. Scality's architecture supports virtually any application, including high performance computing, infrastructure for top-tier Cloud service providers, massive compliance archives, storage of digital media, and enterprise-class, high volume storage of business files.

Figure 2: Scality RING, a Software-Defined Storage Platform

ARCHITECTURE AND COMPONENTS

Scality RING's distributed architecture achieves limitless scalability and high levels of availability and durability. Three principles drive the RING's distributed design:

- Divide and Conquer: The RING balances workloads across all access points and storage elements. By design, there is no single point of failure or hotspot.
- Divide and Serve: The RING is inherently decentralized: its elements are loosely coupled and independent of one another, enabling end-to-end parallelism at all stages of the system.
- Divide and Store: Data, replicas and parities are distributed across distinct storage nodes, maximizing durability and increasing storage efficiency.

Parallelism is fundamental to the design of the RING, and is responsible for the RING's ability to deliver a very high quality of service and performance.

Figure 3: Scality RING Scalability Models: Capacity Only, Performance Only or Both

In order to be highly scalable, particularly across the key dimensions of performance and capacity, Scality defines two entities: an access layer and a storage layer. These two elements are represented by Connectors and Storage nodes:

- Connectors serve as translators that receive data requests from application servers and coordinate access to the RING.
- Storage nodes are virtual servers deployed on the RING. These servers are dedicated to write, read, storage and data-preservation operations. They interface with, and manage the system's interaction with, physical storage devices.

Figure 4 compares the architecture of the RING to the architecture of a traditional NAS/file server. While the technologies are quite different, the role played by a Scality connector is analogous to the role played by a traditional storage controller.

Figure 4: High-Level View of Scality RING versus Traditional Storage

CONNECTORS

Connectors, instantiated at the access layer, provide entry points to the RING. They translate data between an interface (either an open, widely deployed standard such as NFS, or a more proprietary interface for specialized business requirements) and the object model as implemented by the RING. In other words, their role is to expose an API or protocols to the outside world and map the corresponding data request to a fast, internal routing mechanism that locates the desired content. Connectors are stateless, and thus easy to scale and fault tolerant. They provide fine-grained caching of RING topology and metadata, resulting in very fast response times. Scality supports a very broad range of connectors to data sources, including the very fast Scality HTTP/REST API, CDMI (SNIA's Cloud Data Management Interface), RS2 (Scality's Amazon S3-compatible API), Scality FUSE, NFS and CIFS, among others. Section 7 details the data connectivity options provided by the Scality RING.
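The stateless nature of connectors can be made concrete with a sketch: translating an external identifier (here, an NFS-style path) into a point on the internal key space requires no per-connector state, so any connector instance resolves the same request identically. The SHA-1-based mapping and example path below are illustrative assumptions; the actual RING uses its own 20-byte key format (Figure 6).

```python
import hashlib

RING_KEY_BITS = 160  # width of the illustrative key space (SHA-1 sized)

def path_to_ring_key(path: str) -> int:
    """Map an external identifier (e.g. an NFS path) to a key-space point.

    Pure function of its input: any connector instance computes the same
    key, so requests can be spread across connectors with no coordination.
    """
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

k1 = path_to_ring_key("/exports/photos/2013/img_0042.jpg")
k2 = path_to_ring_key("/exports/photos/2013/img_0042.jpg")
assert k1 == k2                       # deterministic: no connector state needed
assert 0 <= k1 < 2 ** RING_KEY_BITS   # always lands inside the key space
```

Because the translation is stateless, adding or removing connectors changes only available bandwidth, never correctness, which is why the access layer scales independently of the storage layer.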
STORAGE NODES

Scality introduced the concept of a storage node as a logical entity distinct from a physical server. A storage node is a Linux/Unix process. Nodes are independent of each other, even when they operate on the same server. These storage node instances control their portion of the global key space of the Distributed Hash Table (DHT), and their primary role is to locate data and honor object requests. By default, Scality defines six storage nodes on each x86 server. The following schema illustrates the creation of the RING from storage nodes created on physical servers.

Figure 5: From Physical Servers to Storage Nodes Organized in a Logical Ring

These nodes are each responsible for a separate segment of the global RING, and each node manages an even portion of the global key space. Figure 5, above, illustrates this, describing a RING of thirty-six nodes created from six storage servers, with each server managing six storage nodes. Each node is thus responsible for 1/36 of the key space. This means that when a server becomes unavailable, its six non-sequential portions of the key space are reallocated to the remaining servers without saturating any single server. The independence of nodes, servers and keys is what enables Scality's RING to guarantee five-nines (carrier-grade) availability (99.999%). This level of reliability makes the RING suitable for the most demanding environments. Scality's oldest customer, Telenet (Belgium's largest hosting provider), has enjoyed complete continuity of service and data availability, with zero downtime and zero data loss or corruption, since the system was deployed four years ago, even though capacity has been increased ten-fold and the system has undergone significant upgrades since it was first installed.

THE IMPORTANCE OF ROUTING

With large distributed systems, routing requests to the node where the data resides must be fast and efficient. Methods for accomplishing this range from quite simple to very complex. Centralized routing is easier to implement and offers some advantages, particularly in the detection and handling of conflicts and locking issues. However, centralized routing does not scale and becomes a significant performance bottleneck. It also creates significant risk, because it can become a single point of failure; a large cluster should not rely on just one central authority. A broadcast model partially eliminates this problem, but is practically unusable for large environments and can generate too many changes in the system's topology.
A number of efficient routing methods have been proposed by the research community, including MIT's Chord protocol. What are now called Overlay Routing Networks, or second-generation P2P networks, overcome the problems noted earlier. The system's ability to quickly reach the node that stores the requested data is, of course, key to fast and efficient performance, and depends on intelligent routing. Intelligent routing ensures that modifications of the topology are broadcast not to all nodes across the entire network, but only to a few relevant nodes. Advanced routing algorithms enable these second-generation P2P networks to work economically, even with very large clusters. The Chord protocol [1, 2], invented at MIT, is a second-generation P2P system and is used by Scality to map stored objects into a virtual key space. The unstructured first generation of P2P systems, such as Gnutella, required that requests be broadcast to different producers of storage. In contrast, second-generation,

structured P2P systems rely on the effective routing of a request to the node owning the requested data. This is significantly more efficient and allows data requests to be handled faster and with less system overhead. BUILDING THE RING'S TOPOLOGY Scality has extended Chord beyond its original role in data distribution. Scality has added the components necessary to achieve enterprise-level performance and reliability, enabling access-time reduction, guaranteed object persistence and self-healing. These extensions enable Scality to manage infrastructure load increases easily and to automate the redistribution of object keys from a failing server to surviving servers and nodes. The RING can be installed on one rack, multiple racks on one site or across multiple sites, as described later in Section 6 of this paper. Scality has developed a very efficient and complete provisioning algorithm that integrates all requirements and constraints, such as the fault tolerance level, the class of service and the topology of the RING. If the replication mode is chosen, the class of service determines the number of data copies needed to ensure against multiple failures. Otherwise, it enables the choice of Scality ARC erasure coding technology. The result is a key space that is projected onto the physical nodes to build the ring and the logic associated with it. Each Scality storage node has an automatically assigned key and acts as a simple key/value store. Scality's load balancing algorithms form a key space that is uniformly distributed over the cluster of nodes present. These algorithms prevent collisions between data replicas during normal operations as well as following a disk or server failure. The RING has the capability to integrate servers with different physical configurations, i.e., numbers of disks and disk capacities. This enables data center managers to support dynamic, changing configurations.
Some servers exist in the system from its first deployment and continue to operate alongside new servers having different capacities or performance characteristics. Scality has implemented a dispersal technology that guarantees that all object replicas are stored on different node processes and on different physical storage servers, and, potentially, in different datacenters. This guarantee is maintained even after a disk or server failure. Connectors generate 20-byte keys and assign these keys to different nodes. This establishes a RING with a fair, balanced policy and, as noted earlier, each node operates as a very simple and fast key/value store. Figure 6: 20-byte Scality Key Format
The Chord algorithm always maps a key to a specific node at any given time, even though the key itself does not contain specific location information. The internal logic of each node determines the appropriate location of object data on disk. Keys always contain either a hashed or randomly generated prefix, leading to a balanced distribution of data among all the nodes based on consistent hashing principles. Another essential concept underlying Scality RING is decentralization and the independence of nodes, since nodes are not controlled by a central intelligence. As a peer-to-peer architecture, any node can receive a request for a key. The longest path, in terms of hops, to discovery of the right node follows the ½ log2(n) rule, with n being the number of nodes in the RING, even after the ring topology changes. A key is assigned to a node that has the responsibility to store the objects whose keys are less than or equal to its own key and greater than the key of the preceding node on the ring.
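The key-to-node assignment just described — each node owns the keys between its predecessor's key and its own — is the classic Chord successor rule. A minimal sketch (illustrative, not Scality's implementation):

```python
import bisect

def responsible_node(node_keys, key):
    """Return the node owning `key`: the first node key >= key,
    wrapping around the top of the ring (Chord successor rule)."""
    keys = sorted(node_keys)
    i = bisect.bisect_left(keys, key)
    return keys[i % len(keys)]  # past the largest key, wrap to the smallest
```

With nodes 10, 25 and 40, key 33 falls to node 40 (the first node at or after 33), and key 45 wraps around to node 10.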

In the original Chord protocol as documented by MIT, each node only needs knowledge of its successor and predecessor nodes on the ring. Thus, updates to data do not require the synchronization of a complete list of tables on every node, and, using this schema, the RING can still avoid the risk of stale information. Using the intelligence of the Chord protocol, the RING provides complete coverage of the allotted key spaces. Figure 7: Fast and Efficient Node Search with Chord
An initial request is internally routed within the Chord RING until the right node is located. Multiple hops can occur, but the two key pieces of information, predecessors and successors, reduce the latency needed by the protocol to locate the right node. When the desired node is found, the node receiving the original request returns the requested information to the connector. Figure 7 illustrates a simple lookup request from the connector. The connector requests key #33 and knows only keys 10 and 30. The connector selects the first entry and connects to that node, 10. Nodes 15, 25, 45 and 85 are then contacted. The protocol determines that node 25, connected to node 35, matches the request for key 33. Node 25 sends the information back to node 10, and then to the connector. Scality's implementation of the Chord protocol modified the original algorithm so that: Each node knows its immediate successor and follows the ½ log2(n) rule. As a consequence, most of the time, only one hop is needed to find the data. In the case of a topology change, the number of hops follows the ½ log2(n) rule, with n being the number of nodes in the ring. That leads to 4 hops maximum for a 100-node ring, and only 5 hops maximum for a 1,000-node RING. When changes occur, such as the insertion of a new node or the failure of a node, a proxy mechanism maintains data availability while a rate-limited load-rebalancing job is started in the background to keep the RING in an optimized topology.
Maximum # of lookups with Scality's Chord implementation: 100 nodes: 4; 1,000 nodes: 5; 10,000 nodes: 7. Topology is cached at the connector level. Table 4: Impact of System Size on Routing Algorithm
Globally, the routing table barely changes even after an insert. The infrastructure doesn't need to pause, stop or freeze the environment when storage servers are added. When a failure occurs, it is handled like a cache miss, and the lookup process feeds the cache line again after determining the new route to the data.
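The hop counts in Table 4 follow directly from the ½ log2(n) rule. A one-line check (illustrative):

```python
import math

def max_hops(n):
    """Worst-case lookup hops under the ½·log2(n) rule."""
    return math.ceil(0.5 * math.log2(n))

# 100 nodes -> 4 hops; 1,000 nodes -> 5; 10,000 nodes -> 7
```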

Scality allows seamless topology changes as nodes join and leave the infrastructure. The RING can continue to serve queries even while the system is changing. During normal operations, the mapping of connector-key-node is direct, and performance is optimal. The core of the Scality solution resides in its unique, distributed architecture and intelligent self-healing software. There is no centralization of requests, no unique catalog of data, no hierarchy of systems, and therefore no notion of master or slave. The approach is completely symmetric: all nodes have the same role and run the same code. IO OPERATIONS At the heart of the Scality system, IO daemons, known as iods, are responsible for the persistence of data on physical media. Their role is to write the data passed to the node on the same machine, monitor physical storage and ensure durability. Each iod is local to one machine, managing local storage space and communicating only with the storage node instances present on that same machine. There is no exchange between a node of one machine and the iod of another machine. Multiple iods run on the same physical machine, in a typical configuration of one iod per physical disk. The iods are the only link between the actual physical location of data and the logical layer of services represented by the nodes and connectors. Up to 255 iods can exist on a server, enough to support a very large local load. Physical storage local to a server consists of regular partitions formatted with the standard ext3 or ext4 file system. Each iod controls its own file system and its data containers built on storage nodes. These containers are, in fact, the elementary storage units of the RING that receive written objects directed to the iod from node requests initiated by any connector. These containers store three types of information: the index to locate the object on the media, object metadata, and the object's payload data itself.
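The three kinds of information a container holds — index, metadata and payload — can be pictured with a toy in-memory store (purely illustrative; real iod containers are large on-disk files):

```python
class Container:
    """Toy iod container: an append-only blob plus an index mapping
    each object key to its (offset, length), and per-object metadata."""

    def __init__(self):
        self.blob = bytearray()     # payload data, appended in arrival order
        self.index = {}             # key -> (offset, length) on the "media"
        self.meta = {}              # key -> object metadata

    def put(self, key, metadata, payload):
        self.index[key] = (len(self.blob), len(payload))
        self.meta[key] = dict(metadata)
        self.blob += payload

    def get(self, key):
        off, length = self.index[key]
        return self.meta[key], bytes(self.blob[off:off + length])
```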
The unique connector-ring-iod architecture provides a completely abstracted hardware and network layer, with connectors at the top acting as an entry gate to the RING. The nodes of the RING act as storage servers, and iod daemons act as storage managers responsible for the physical I/O operations. The use of a local file system on each local disk gives Scality and the administrator the capability to use standard Linux commands to copy, migrate, repair and scrub disks if required. Containers used by the RING are large files grouping thousands of objects. As such, this design does not incur any performance impact and adds no overhead in terms of disk utilization. Figure 8: Parallelism from Connectors to Storage Nodes to Iods

6 Scality RING Topologies and Deployment Models
Previous sections introduced and defined the key building blocks of Scality RING. This section describes the range of geodistributed RING configurations that provide different types of mission-critical data services. Figure 9: Deployment Models for the RING
A RING can be designed and deployed in multiple ways, depending on business and user requirements. For example, disaster recovery and business continuity or RPO/RTO (recovery point/recovery time objectives) requirements often determine the optimal configuration used in specific cases. A RING can be deployed across sites in multiple ways:
One site and one RING (a simple configuration)
Two sites with one RING configured to tolerate the loss of one site
Three sites with one RING configured to tolerate the loss of one site (or even two sites, as this is treated as a configuration parameter)
Figure 10: Two Examples of RING Deployment
In addition, multiple RINGs can be integrated via a file or object data copy mechanism operating between the RINGs. Any of these models must consider the following criteria:
Number of sites: one, two, three or more
Data protection: Replication or Erasure Coding (Scality ARC)
Data workflow: Copy or Tiering between RINGs
Copy mode between sites, if any: Asynchronous or Synchronous

Site states: Active/Passive or Active/Active
Type of Connectors: Object or File-oriented (e.g., Amazon S3 API or NFS)
Granularity: Object or File
Figure 11: Topologies Supported by Scality RING
DATA PROTECTION A key objective and a core design criterion for Scality RING is to ensure that data is never lost. To stay flexible, efficient and cost-effective, Scality RING provides two different mechanisms for protecting data within the infrastructure on which it operates: Data Replication and Erasure Coding. In addition to these two methods, it is very common to integrate two RINGs using a data tiering policy engine that can move older (least accessed) and/or large files from a fast primary storage pool to a secondary pool. This is enabled by Scality's Auto-Tiering facility, described below. REPLICATION Scality offers a built-in Replication mode within the RING to provide seamless data access even in the event of system or hardware failure. Via replication, the data is copied at the object level in native format without any transformation, providing a significant performance gain. Scality Replication creates multiple object copies, called replicas, across different storage nodes, with the guarantee that each replica resides on a different storage server by leveraging the dispersion factor expressed in its key (the first 24 bits of each key). Scality's copy engine uses a mathematical projection to select the right location for additional copies of data. The maximum number of replicas is six, although typical configurations maintain three or four copies. Additionally, a configuration option exists to enable replication across multiple RINGs, on the same site or on remote sites, with the flexible choice of unidirectional or multi-directional replication.
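A projection of this kind — deriving distinct replica locations from the key itself — can be sketched as follows. The rotation scheme below is an assumption for illustration only; Scality's actual projection leverages the key's 24-bit dispersion field:

```python
def place_replicas(key, servers, copies=3):
    """Derive `copies` distinct servers from an integer key so that
    no two replicas share a physical server. Toy projection only."""
    n = len(servers)
    if copies > n:
        raise ValueError("cannot place more replicas than servers")
    start = key % n
    # Rotating from a key-derived start guarantees `copies` distinct servers.
    return [servers[(start + i) % n] for i in range(copies)]
```

Because the placement is a pure function of the key, any connector can recompute the same replica locations without consulting a central catalog.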

Figure 12: Replication with 3 Replicas
Data replication is a very efficient mechanism, well optimized for small objects up to a few hundred KB. It is also the preferred method for protecting small clusters. ADVANCED RESILIENCE CONFIGURATION Erasure coding techniques are well known and have been widely used in telecommunications. Scality developed its own advanced erasure-coding technology, Advanced Resilience Configuration (ARC). Scality's ARC reconstruction mechanism is based on Reed-Solomon error correction theory. ARC is a standard Scality RING feature that protects data intelligently against disk, server, rack or site failures. This configuration mode reduces the number of copies required to enable full reconstruction and avoids unnecessary information duplication. ARC provides a high level of durability while requiring only a fraction of the additional storage that full data copies would need. For this reason, ARC significantly reduces hardware cost and capital expense, as well as the related operating expenses needed to safeguard key information assets. To illustrate how Scality's ARC works, consider n fragments (1 object split into n fragments) that need to be stored and protected. For this example, assume that the n fragments are all 1 MB in size, with a goal to protect against k failures. Scality refers to this model as ARC(n,k). Scality ARC will store each of the n fragments of content individually, and will, in addition, compute and store k new fragments, which are checksums or parities. These checksums are mathematical combinations, or equations, of the original n fragments, computed in such a way that all the n original fragments can be reconstructed despite the loss of any k elements, whether the lost information consists of data or checksums. With Scality ARC, each of the k checksums would be 1 MB in size, providing protection against k disk or server losses, with just an extra k MB of storage required.

To illustrate the benefits and the mechanism behind ARC, consider the following example: Figure 13: Scality ARC(14,4) Model
Using only 4 MB of additional storage (4 fragments of 1 MB), Scality's RING protects 14 MB against a loss of four disks or servers. The storage overhead (additional storage) required to enable this protection is 4/14 = 29%, much lower than the 200% overhead required using a standard three-copy replication method. Even though this is a fraction of the overhead required by replication, it still offers much better protection than RAID 6. Note that erasure coding protects against server loss as well as disk loss. RAID, in contrast, is limited to protection against disk failure, rather than disk and server failure. The traditional implementation of erasure coding by storage vendors introduces a penalty on read: the system must read several pieces of information and then extract data in order to recover the original information. This problem of multiple reads is a significant limitation of dispersed storage, as it introduces 200-300 ms of latency. To avoid this overhead and a large number of IOPS, Scality stores the original data fragments and the checksum fragments independently. By default, the ARC model implemented by Scality is referred to as ARC(14,4). Using this model, the required redundancy needed to safeguard data, as noted above, requires only an additional 29% of storage overhead. These two model parameters are, of course, completely modifiable based on an organization's individual throughput and protection requirements. Scality users have, for example, implemented configurations with an ARC RING encompassing two or three sites using the ratios (6,6), (10,14) or (4,2). Each one of these models provides a better hardware cost ratio compared to replication. Using an ARC(6,6) model means six data fragments and six parity fragments are evenly distributed on the RING across two sites.
This configuration requires twice the hardware of the original data set. The same level of replication, by contrast, would require two full copies on each site, or four times the hardware of the original data set, which is twice the hardware required by ARC. In terms of cost-effectiveness, at the same level of protection and redundancy, replication with three copies requires extra storage space of +200%, while ARC(14,4) needs only 29%, as described above. The storage efficiency ratio is also important; the larger the ratio, the better. Replication using three copies has a 33% storage efficiency ratio, while ARC(14,4) has a storage efficiency ratio of 77%, meaning that 77% of the total storage space is consumed by the original data.
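The overhead and efficiency figures above fall out of simple arithmetic on the ARC(n,k) parameters. The helpers below reproduce them (illustrative):

```python
def arc_overhead(n, k):
    """Extra storage for ARC(n,k): k parity fragments per n data fragments."""
    return k / n

def efficiency(data_fragments, total_fragments):
    """Share of raw capacity that holds original data."""
    return data_fragments / total_fragments

# ARC(14,4): 4/14 ~ 29% overhead, 14/18 ~ 77% efficiency.
# Three-copy replication: 200% overhead, 1/3 ~ 33% efficiency.
```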

22 Type Durability Overhead Efficiency Comments ARC(14,4), 1 site 8 x 9 s % Single site ARC(8,4), 1 site > 13 x 9 s % Single site ARC(10,14), 2 sites > 13 x 9 s % Supports loss of 1 site Replication 3 copies, 3 sites 8 x 9 s 3 33% Supports loss of 2 sites Replication 4 copies, 2 sites 12 x 9 s 4 25% Supports loss of 1 site Table 5: Different Durability Levels and Configuration Impacts Scality recommends ARC for large- scale configurations or for large objects (> 1MB), and replication for small objects. Given ARC s proven performance, it is possible to obtain 1,000,000 x improved reliability over RAID 6, all with no additional disk overhead and no performance bottlenecks. In comparison to dispersed storage, Scality s approach avoids the penalty on reads and continues to offer the best response times for direct read operations. Figure 14: Scality ARC Example vs. Replication and Dispersed Approaches AUTO- TIERING In addition to data protection mechanisms, Scality provides its own storage tiering technology embedded within the RING, Auto- Tiering. This core RING feature is described in Scality s patent WO/2011/ Auto- Tiering ensures the right alignment between the value of the data to the organization and the organization s cost of data storage. It operates at the object level and is independent of the specific data structure used by the application. It can be applied, therefore, to many different IT environments. A policy engine performs autonomously, and automatically manages the migration and movement of objects within a single RING or between RINGs. The object key continues to reference the same original location and is completely transparent to the user and the application. When a unit of data is accessed, migrated data are cached back on the primary location to let applications access it. It works like an HSM system. 
Different RING configurations can be designed, including configurations that enable storage consolidation using an N-1 model, where N is the number of primary RINGs. These N RINGs migrate data to a shared secondary RING and leverage its potentially massive capacity. Some configurations integrate replication and ARC: in such a topology, a fast primary replicated RING optimized for performance is linked to a secondary capacity-oriented RING using ARC.
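A policy pass of this kind can be sketched as a simple filter over object access times and sizes. The criteria, thresholds and names here are hypothetical, not Scality's policy engine:

```python
import time

def select_for_migration(objects, min_age_s=30 * 86400, min_size=1 << 20):
    """Return keys of objects old and large enough to move from the
    primary RING to the secondary, capacity-oriented RING."""
    now = time.time()
    return sorted(k for k, (last_access, size) in objects.items()
                  if now - last_access >= min_age_s and size >= min_size)
```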

MESA: A DISTRIBUTED DATABASE FOR METADATA In large-scale systems, metadata are essential, both in terms of their intrinsic content and in terms of their utility in enabling data access operations. For this reason, metadata must be rigorously protected, and metadata access must be very fast in order to support stringent SLA levels. Scality designed and developed its own internal distributed database, MESA, to store metadata in a scalable manner. MESA uses a NewSQL model where tables are distributed across storage nodes. In order to ensure high durability, Scality maintains five copies of metadata. Figure 15: Scality MESA Distributed Metadata Database
MESA provides multiple indexes per table. It is 100% ACID compliant and 100% elastic. The MESA engine provides excellent, almost linear performance, and operates in an automatic fault tolerance mode. MESA is a core internal system feature and is not customer accessible. Scality uses MESA for metadata management for its Scale Out File System (SOFS), its sparse file technology, and for Scality RS2 (REST Storage Service) buckets. RS2 is Scality's Amazon S3-compatible API. SPARSE FILE TECHNOLOGY To store very large files in demanding environments and satisfy strict SLAs, Scality developed and implemented its own sparse file engine. Files are striped with a fixed stripe unit size aligned with environment needs, for example, 128 KB or 16 MB. Each file uses an entry in a MESA table, and every stripe unit has a key associated with it. This technology boosts data distribution and delivers fast I/O operations for all services based on the Scality Scale Out File System.
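The per-stripe keying can be illustrated by mapping a byte offset to its stripe index and a derived key. The key-derivation scheme below is an assumption for illustration:

```python
def stripe_key(inode_key, offset, stripe_size=128 * 1024):
    """Locate the stripe covering `offset` and derive its RING key
    by suffixing the stripe index to the file's inode key."""
    index = offset // stripe_size
    return index, "%s-%08x" % (inode_key, index)
```

Because each stripe has its own key, stripes of one large file disperse across many nodes, which is what lets the RING parallelize I/O on a single file.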

7 Scality RING Access Methods
One of Scality's primary design goals is to enable application developers to adopt the RING without needing to make significant changes to their applications. The number of applications that can access a storage system is a major factor in the success of the platform. While object storage provides its own very fast and efficient access methods, Scality has long recognized that users also need to consume data using other well-established and proven methods. The more varied the access methods a solution provides, the stronger its chances of market adoption and customer success. For these reasons, Scality offers both object-level access and file-level access, and is committed to supporting a broad range of information access standards and APIs. This section outlines the most important access protocols supported by Scality RING. OBJECT ACCESS APIS Scality provides software APIs, or connectors, that handle the communication between the software application and the storage nodes. They are based primarily on the HTTP/REST protocol and are tailored for specific applications. Scality also supports the open source Droplet library, available on GitHub, which can be used to develop customized interfaces. Table 6 lists Scality's standard object-based access connectors.
RS2: REST Storage Service; Amazon S3-compatible API; S3 data model with service/bucket/object relation; authentication, metering and usage reporting.
RS2 Light: A reduced set of RS2 functions; no authentication, no metering and no usage reporting.
Sproxyd: Scality HTTP/REST-based connector; provides a fast and simple object exchange protocol with basic GET, PUT and DELETE commands, plus extensions introduced by Scality; supports asynchronous writes and can act as a secondary connector for pipeline operations.
CDMI: Cloud Data Management Interface, a standard from SNIA (ISO/IEC 17826:2012); a key component in the Scality Open Cloud Access strategy; CDMI by key or by path.
Table 6: Object Access APIs
FILE SYSTEM ACCESS In addition to multiple object access methods, Scality has developed a high performance file storage solution, Scality's Scale Out File System (SOFS). SOFS provides high performance parallel network file access to data stored on the RING. SOFS leverages the parallel and distributed design of the RING to accelerate data IO operations, and can be tuned to deliver exceptionally high IOPS or throughput.
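A GET/PUT exchange against an Sproxyd-style HTTP connector, as listed in Table 6, can be sketched with the standard library. The endpoint, port and URL layout below are assumptions for illustration, and the requests are built but not sent:

```python
from urllib import request

BASE = "http://ring-connector:8080/proxy"   # hypothetical connector endpoint

def put_object(key, payload):
    """Build an HTTP PUT storing `payload` under a 20-byte hex key."""
    return request.Request("%s/%s" % (BASE, key), data=payload, method="PUT")

def get_object(key):
    """Build the matching HTTP GET for the same key."""
    return request.Request("%s/%s" % (BASE, key), method="GET")
```

A real deployment would send these with `urllib.request.urlopen` against the connector's actual address.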

Figure 16: Scality SOFS Parallel Design for Maximum Throughput
Layered on SOFS, Scality provides support for file-sharing protocols such as native NFS, CIFS and FTP. Scality has also developed its own CDMI server that leverages the SOFS layer to enrich Scality file services. As described earlier, SOFS has two components: a back end running on storage servers and a front end operating on access nodes. On the back end, SOFS serves as a file virtualization abstraction layer where all file system entities (directories and files) have a direct representation as objects on the RING. On the front end, access nodes run sfused, the Scality implementation of FUSE (File System in Userspace). FUSE locally emulates a POSIX file system connected to back-end storage servers. In addition, SOFS implements two fundamental core services developed by Scality: the MESA metadata database described earlier, and a sparse file technology to support very large files. Figure 17: Scality SOFS, from Volume to Namespace
SOFS introduces the concepts of volume and namespace. A volume contains a file system, and a namespace is essentially a mount point and associated files. Each file system offers 64-bit inodes, or addressable files. Within one namespace, up to 2^32 volumes can be configured, and the maximum number of namespaces per RING is 2^24, which is millions of times more than what is offered by competing NAS and object storage solutions.

Figure 18: Logical View of Scality File Storage Services
STANDARD FILE SHARING PROTOCOLS With the release of RING 4.2, Scality provides native support for industry standard file sharing protocols. SOFS, Scality's Scale Out File System, is a common file abstraction layer from which connectors to different protocols are built. In addition to FUSE, RING 4.2 adds NFS, CIFS and FTP to the file storage services provided by the RING platform. The following table identifies the file sharing protocols operating on Scality RING.
FUSE: SOFS front end with parallel network file access; since RING 4.0.
NFS: Network File System with Kerberos integration; NFS version 3, since RING 4.2.
CIFS: Common Internet File System; based on Samba 3.5 minimum, since RING 4.2.
FTP: File Transfer Protocol; many commercial and open source servers, since RING 4.2.
CDMI: Cloud Data Management Interface server implementation; version 1.0.2, since RING 4.1; CDMI by key or by path.
Table 7: File Access Methods and Protocols
HADOOP INTEGRATION Scality's storage implementation for Hadoop delivers benefits long desired by the Hadoop community. Scality's support for Hadoop enables a more cost effective, easier-to-use, more resilient and higher performing Hadoop infrastructure. The Hadoop CDMI connector eliminates the single point of failure in Hadoop's architecture by replacing Hadoop's NameNode server with Scality's own metadata architecture. It enables computation on the storage node itself, significantly reducing the need for data movement by enabling in-place processing and data location sharing with Hadoop's JobTracker.

Figure 19: Scality Hadoop Platform
Scality provides Hadoop with full support for ARC, Scality's erasure coding data protection technology. As noted earlier, Scality ARC delivers high levels of data protection without excessive hardware overhead. The Hadoop CDMI connector eliminates the need to load files through HDFS by leveraging Scality's Open Cloud Access (OCA) solution. This renders ETL (extract, transform and load) solutions unnecessary, since data processing is executed where the data reside. Users can read and write files through a standard file system and at the same time process the content with Hadoop, using the processing power of the storage servers where the Hadoop data already resides. Because no changes are required at the application level to gain the full benefits of the Scality RING Hadoop integration, moving an application and a data set to the Scality platform causes no disturbance to the user experience. Figure 20: Scality Hadoop Stack with SOFS/CDMI

As noted above, Scality's design goal is to leverage existing server capabilities to run Hadoop processing tasks on the data where the data already resides. This solution represents the first phase of Scality's vision of delivering a converged platform for data storage and processing, and eliminates the need to move data between storage and processing systems. Figure 21: Scality Hadoop Deployment Model
OPENSTACK CINDER Scality provides a block storage driver for OpenStack Compute (open source provisioning and management of a large network of virtual machines), integrated with the Cinder API. The Scality RING OpenStack connector is built on Scality's unique distributed sparse file technology embedded in Scality SOFS. An OpenStack Cinder volume is essentially a sparse file on SOFS, having unlimited size and elasticity. This ensures easy management and seamless scalability. It enables advanced virtualization features such as live migration of virtual machines and instant failover in case of compute node hardware failure. The RING's distributed architecture enables extraordinary concurrency and ensures high performance, both in terms of high IOPS and throughput. All of the RING's features, including ARC's advanced data protection, are available to Cinder volumes. Scality supports OpenStack's Grizzly release.

8 Scality RING Management
To configure and manage a RING, Scality provides a command line interface (CLI) and a web GUI. The CLI, RingSH, allows full control of the cluster and can be integrated into a broader management framework with user-created scripts. RingSH includes a very comprehensive command set, enabling end-to-end management of Scality RING from initial deployment and start-up to a fully operational production system. Scality's web GUI administrative tool, the Supervisor, provides simple and intuitive system monitoring and management of a RING and a RING's individual components, including physical disks, storage servers and nodes. The Supervisor monitors and provides system health metrics, capacity statistics and system management alerts. It allows administrators to manage the details of storage nodes by key or by server, and enables the easy addition or removal of servers as required. The Supervisor supports a range of common IT operations, including hardware refresh and maintenance tasks such as the replacement of failing servers. During these operations, the RING continues to serve requests without suffering any adverse impacts and automatically redistributes data among existing online servers and resources as required. This behavior reinforces the elastic and high availability characteristics of Scality RING. Figure 22: Scality Supervisor GUI for Provisioning

Figure 23: Scality Supervisor GUI for Node Management
Scality provides 24x7 support and maintenance, and delivers a variety of professional services, such as expert on-site or dedicated care service.

9 Scality RING Performance
In 2012, Scality was the very first object storage company to publicly release independently validated performance metrics for such technology. These lab results were published by ESG (Enterprise Strategy Group). In 2013, ESG produced a second benchmark report focused on the performance of Scality RING running on the MIS (Modular Infinite Storage) servers of Scality's OEM partner SGI. ESG measured the performance of Scality RING and compared replication to ARC for both object and file service. Figure 24: Object Performance with Replication
Figure 25: File Performance with ARC

The following table summarizes some key results of the ESG testing of both ARC and replication methods of data protection.
Access Method | Protection | Results
Object 4 KB GET | Replication | 45,420 objects/second
Object 1 MB GET | Replication | 7.75 GB/s
Object 10 MB GET | ARC | 7 GB/s (960 clients)
File 1 GB Read | Replication | 2.6 GB/s (1 injector), 16.9 GB/s (6 injectors)
File 1 GB Read | ARC | 10.2 GB/s (24 clients)
Data Reconstruction: Disk Size | Time to Rebuild
750 GB | 13 minutes
1 TB | 16 minutes
Table 8: Summary of ESG Benchmark Tests
ESG LAB VALIDATION HIGHLIGHTS Using a six-node Scality RING configured in replication mode with three SGI MIS servers, ESG Lab measured object-based performance using the REST interface to GET and PUT small (4 KB) and large (1 MB) objects. Peak performance was achieved with 2,220 simulated clients while latency remained manageably low in all the test scenarios: 4 KB GETs reached 45,420 objects/second, 4 KB PUTs reached 41,891 objects/second, 1 MB GETs reached 7.75 GB/s, and 1 MB PUTs reached 6.68 GB/s. File-based read and write throughput performance was measured with Scality's SOFS architecture. ESG Lab witnessed linear performance scalability as the number of injector nodes driving the workload increased from one to six in a replicated Scality RING. Read throughput started at 2.6 GB/s with one injector and scaled up to 16.9 GB/s with six injectors. Write throughput started at 0.8 GB/s and scaled up to 3.7 GB/s. Also, as the workload increased, latency continued to improve, creating a highly efficient SGI and Scality joint solution that meets the performance and latency requirements of enterprise-class organizations. ESG Lab verified the ARC mode data protection method by comparing identical object- and file-based workloads to the measured replication mode results. In all cases, ARC mode performed well, and in some cases outperformed replication mode.
With a 10 MB object size, it took just 960 simulated clients to reach 7 GB/s, while the file-based throughput simulation performed 30% faster than the replication-mode result. ESG Lab also witnessed impressive data recoverability rates that validate the solution's high-availability features. A 750 GB and a 1 TB data set each remained continuously accessible while data was quickly reconstructed: the 750 GB data set was recovered in just 13 minutes, and the 1 TB data set took only 16 minutes.
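Both rebuild times work out to a sustained reconstruction rate of roughly 1 GB/s (a rough estimate, assuming decimal units and treating 1 TB as 1,000 GB):

```python
def rebuild_rate_gb_per_s(dataset_gb: float, minutes: float) -> float:
    """Effective reconstruction rate for a data set rebuilt in the given time."""
    return dataset_gb / (minutes * 60)

print(round(rebuild_rate_gb_per_s(750, 13), 2))   # 750 GB in 13 min → 0.96
print(round(rebuild_rate_gb_per_s(1000, 16), 2))  # 1 TB in 16 min  → 1.04
```

The near-constant rate across the two data-set sizes suggests reconstruction throughput is bounded by the cluster rather than by the amount of data lost.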

10 Conclusion: A Storage Solution Operating at Exascale

Scality provides an advanced, robust and massively scalable way of deploying and managing storage, based on distributed models, insights and operational guidelines similar to those developed to support the world's largest and most successful cloud and e-commerce companies. Scality RING delivers a proven storage solution without any inherent limits. It enables companies to address their requirements for petascale or exascale storage with an easy-to-manage, cost-effective, high-performing and fully scalable software storage solution.

Now in its fourth generation, Scality RING offers data-center-class functionality while overcoming the high cost, capacity and performance limitations of traditional storage solutions. Whatever the application or usage model, Scality RING can deliver an interface customized to an organization's specific storage requirements. Scality can store and exchange data in any combination of block, file or native object modes, and can provide users with seamless access to information, whether user data is stored in HTTP/web-based systems or in industry-standard file systems such as NFS.

Scality RING storage offers five unique benefits:

1. Exascale: Unlimited capacity and high performance.
2. Multi-Geo: Flexible topologies with complete disaster recovery, business continuity and multiple points of presence.
3. Data Protection: Data protection based on replication, erasure coding and intelligent tiering across RINGs.
4. Universal Data Access: Comprehensive support for the broadest range of object APIs and file system interfaces.
5. Ecosystem: A software-defined storage solution that is hardware agnostic while also providing tested and proven OEM hardware reference platforms.

Scality's ecosystem incorporates a growing range of partners who use Scality RING to deliver innovative storage solutions for a variety of markets and applications.
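The trade-off behind the data protection benefit can be quantified. With 3-way replication, every usable byte costs three raw bytes; with an erasure code such as the ARC(14,4) schema described earlier (14 data chunks plus 4 parity chunks), it costs about 1.29 raw bytes while still tolerating the loss of any four chunks. A minimal sketch of that comparison (illustrative only; the chunk counts follow the ARC(14,4) example from the data protection section):

```python
def raw_per_usable_replication(copies: int) -> float:
    """Raw bytes stored per usable byte with n-way replication."""
    return float(copies)

def raw_per_usable_erasure(data_chunks: int, parity_chunks: int) -> float:
    """Raw bytes stored per usable byte with a (data, parity) erasure code."""
    return (data_chunks + parity_chunks) / data_chunks

print(raw_per_usable_replication(3))            # 3-way replication → 3.0x raw
print(round(raw_per_usable_erasure(14, 4), 2))  # ARC(14,4)         → 1.29x raw
```

This is why the ESG results matter: ARC delivers comparable or better throughput than replication while consuming well under half the raw capacity.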
Scality's vision for the RING is guided by the following architectural goals:

- Comprehensive Unified Storage with block, file and object interfaces, and new application storage APIs such as HDFS.
- Ubiquitous Access, meaning access to local and remote data without any need for the user to know the location of the data.
- High Scalability for both performance and capacity, without inherent limits.
- Advanced Data Services with real-time policy and quality of service, content indexing and other service features.
- Convergent IT Platform with capabilities to run multiple applications within the cluster.
- Multitenancy, Metering and Directory Integration with a policy-based QoS engine capable of applying encryption, quotas, IOPS and bandwidth capping in real time, per tenant.

Figure 26: Scality's Storage Vision

11 References

1. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan. MIT Laboratory for Computer Science.
2. Multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability. US and WIPO patent WO/2010/. Vianney Rancurel, Oliver Lemarie, Giorgio Regni, Alain Tauch, Benoit Artuso, Jonathan Gramain.
3. Probabilistic offload engine for distributed hierarchical object storage. US and WIPO patent WO/2011/. Giorgio Regni, Jonathan Gramain, Vianney Rancurel, Benoit Artuso, Bertrand Demiddelaer, Alain Tauch.
4. On Routing in Distributed Hash Tables. Fabius Klemm, Sarunas Girdzijauskas, Jean-Yves Le Boudec, Karl Aberer. School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
5. Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing. Fabius Klemm, Jean-Yves Le Boudec, Dejan Kostić, Karl Aberer. School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.
6. An Architecture for Peer-to-Peer Information Retrieval. Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu. School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.
7. A High-Performance Distributed Hash Table for Peer-to-Peer Information Retrieval. Fabius Klemm. Thèse #4012 (2008), EPFL.
8. Dynamo: Amazon's Highly Available Key-value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels. Amazon.com.
9. Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Google.
10. The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Google.
11. Computing in the RAIN: A Reliable Array of Independent Nodes. Vasken Bohossian, Charles C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck. California Institute of Technology.
12. Time, Clocks, and the Ordering of Events in Distributed Systems. L. Lamport. Comm. ACM 21, 1978.
13. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch.
14. Providing Authentication and Integrity in Outsourced Databases using Merkle Hash Trees. Einar Mykletun, Maithili Narasimha, Gene Tsudik. University of California, Irvine.
15. Secrecy, authentication, and public key systems. R. Merkle. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford University.
16. Fractal Merkle Tree Representation and Traversal. M. Jakobsson, T. Leighton, S. Micali, M. Szydlo. RSA-CT.
17. Merkle Tree Traversal in Log Space and Time. M. Szydlo. Eurocrypt '04.
18. An Analysis of Latent Sector Errors in Disk Drives. Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, Jiri Schindler.
19. Failure Trends in a Large Disk Drive Population. Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso. Google.
20. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? Bianca Schroeder, Garth A. Gibson.
21. SGI Modular Infinite Storage Server with Scality RING Organic Storage. Enterprise Strategy Group, Lab Validation Report, October 2013. Available from the Scality web site at performance-validation-on-sgi/.

2014 Scality. All rights reserved. Specifications are subject to change without notice. Scality, the Scality logo, Scality RING and RING Organic Storage are trademarks or registered trademarks of Scality in the United States and/or other countries.


More information

Microsoft Private Cloud Fast Track

Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track is a reference architecture designed to help build private clouds by combining Microsoft software with Nutanix technology to decrease

More information

How To Make A Backup System More Efficient

How To Make A Backup System More Efficient Identifying the Hidden Risk of Data De-duplication: How the HYDRAstor Solution Proactively Solves the Problem October, 2006 Introduction Data de-duplication has recently gained significant industry attention,

More information

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research Introduction to Cloud : Cloud and Cloud Storage Lecture 2 Dr. Dalit Naor IBM Haifa Research Storage Systems 1 Advanced Topics in Storage Systems for Big Data - Spring 2014, Tel-Aviv University http://www.eng.tau.ac.il/semcom

More information

Core and Pod Data Center Design

Core and Pod Data Center Design Overview The Core and Pod data center design used by most hyperscale data centers is a dramatically more modern approach than traditional data center network design, and is starting to be understood by

More information

Product Spotlight. A Look at the Future of Storage. Featuring SUSE Enterprise Storage. Where IT perceptions are reality

Product Spotlight. A Look at the Future of Storage. Featuring SUSE Enterprise Storage. Where IT perceptions are reality Where IT perceptions are reality Product Spotlight A Look at the Future of Storage Featuring SUSE Enterprise Storage Document # SPOTLIGHT2013001 v5, January 2015 Copyright 2015 IT Brand Pulse. All rights

More information

WHITE PAPER. Software Defined Storage Hydrates the Cloud

WHITE PAPER. Software Defined Storage Hydrates the Cloud WHITE PAPER Software Defined Storage Hydrates the Cloud Table of Contents Overview... 2 NexentaStor (Block & File Storage)... 4 Software Defined Data Centers (SDDC)... 5 OpenStack... 5 CloudStack... 6

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Simplified Management With Hitachi Command Suite. By Hitachi Data Systems

Simplified Management With Hitachi Command Suite. By Hitachi Data Systems Simplified Management With Hitachi Command Suite By Hitachi Data Systems April 2015 Contents Executive Summary... 2 Introduction... 3 Hitachi Command Suite v8: Key Highlights... 4 Global Storage Virtualization

More information

UniFS A True Global File System

UniFS A True Global File System UniFS A True Global File System Introduction The traditional means to protect file data by making copies, combined with the need to provide access to shared data from multiple locations, has created an

More information

Federated Application Centric Infrastructure (ACI) Fabrics for Dual Data Center Deployments

Federated Application Centric Infrastructure (ACI) Fabrics for Dual Data Center Deployments Federated Application Centric Infrastructure (ACI) Fabrics for Dual Data Center Deployments March 13, 2015 Abstract To provide redundancy and disaster recovery, most organizations deploy multiple data

More information

HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010

HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010 White Paper HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010 Abstract This white paper demonstrates key functionality demonstrated in a lab environment

More information

EMC SCALEIO OPERATION OVERVIEW

EMC SCALEIO OPERATION OVERVIEW EMC SCALEIO OPERATION OVERVIEW Ensuring Non-disruptive Operation and Upgrade ABSTRACT This white paper reviews the challenges organizations face as they deal with the growing need for always-on levels

More information

Intro to AWS: Storage Services

Intro to AWS: Storage Services Intro to AWS: Storage Services Matt McClean, AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved AWS storage options Scalable object storage Inexpensive archive

More information

OmniCube. SimpliVity OmniCube and Multi Federation ROBO Reference Architecture. White Paper. Authors: Bob Gropman

OmniCube. SimpliVity OmniCube and Multi Federation ROBO Reference Architecture. White Paper. Authors: Bob Gropman OmniCube SimpliVity OmniCube and Multi Federation ROBO Reference Architecture White Paper Authors: Bob Gropman Date: April 13, 2015 SimpliVity and OmniCube are trademarks of SimpliVity Corporation. All

More information

Network Attached Storage. Jinfeng Yang Oct/19/2015

Network Attached Storage. Jinfeng Yang Oct/19/2015 Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability

More information

A Virtual Filer for VMware s Virtual SAN A Maginatics and VMware Joint Partner Brief

A Virtual Filer for VMware s Virtual SAN A Maginatics and VMware Joint Partner Brief A Virtual Filer for VMware s Virtual SAN A Maginatics and VMware Joint Partner Brief With the massive growth of unstructured data in today s enterprise environments, storage IT administrators are constantly

More information

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk WHITE PAPER Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk 951 SanDisk Drive, Milpitas, CA 95035 2015 SanDisk Corporation. All rights reserved. www.sandisk.com Table of Contents Introduction

More information

Best Practices for Managing Storage in the Most Challenging Environments

Best Practices for Managing Storage in the Most Challenging Environments Best Practices for Managing Storage in the Most Challenging Environments Sanjay Srivastava Senior Product Manager, Symantec The Typical Virtualization Adoption Path Today, 20-25% of server workloads are

More information

EMC s Enterprise Hadoop Solution. By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst

EMC s Enterprise Hadoop Solution. By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst White Paper EMC s Enterprise Hadoop Solution Isilon Scale-out NAS and Greenplum HD By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst February 2012 This ESG White Paper was commissioned

More information

bigdata Managing Scale in Ontological Systems

bigdata Managing Scale in Ontological Systems Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural

More information

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one

More information