Designing a Cloud Storage System

End-to-End Cloud Storage

When designing a cloud storage system, there is value in decoupling the system's archival capacity (its ability to persistently store large volumes of data) from the system's delivery capacity (its ability to deliver popular objects to a scalable number of users). The archival half need not support scalable performance, and likewise, the delivery half need not guarantee persistence. In practical terms, this translates into an end-to-end storage solution that includes a high-capacity, highly resilient Object Store in the data center, augmented with caches throughout the network to take advantage of aggregated delivery bandwidth at edge sites.

As depicted in the following diagram, the Object Store ingests data from some source (e.g., video prepared using a Content Management System) and delivers it to users via edge caches. The ingest interface is push-based and likely includes one or more popular storage APIs (e.g., WebDAV, S3), while the delivery interface is pull-based and corresponds to HTTP GET requests from the CDN.

While the CDN should be source-agnostic (for example, content might originate from an upstream CDN or be transparently intercepted), it is increasingly the case that content delivered over a CDN is sourced from a data center as part of a cloud-based storage solution. This raises the question: is there anything we can learn by looking at storage from such an end-to-end perspective? There are three key lessons.

First, it makes little sense to build an Object Store using traditional SAN or NAS technology, for two reasons. One has to do with providing the right level of abstraction. In this case, the CDN running at the network edge is perfectly capable of reading a large set of objects from the store, meaning there is no value in managing those objects using full file system semantics (i.e., NAS is a bad fit).
Similarly, the storage system needs to understand complete objects and not just blocks (i.e., SAN is not a good fit). The second reason is related to cost: it is simply more cost-effective to build a scalable Object Store from commodity hardware. This argument is well understood, and leverages the ability to achieve scalable performance and resiliency in software.

Verivue, Inc www.verivue.com
Second, a general-purpose CDN that is able to deliver a wide range of content (from software updates to video, from large files to small objects, from live/linear streams to on-demand video, from over-the-top to managed video) should not be handicapped by an Object Store that isn't equally flexible. In particular, it is important that the ingest function be low-latency, support redundant encoders, and accommodate HTTP adaptive streaming. This makes it possible to deliver on-demand and live video, the latter of which needs to be staged through an Object Store to support time-shifting and nDVR.

Third, it is not practical to achieve scalable delivery purely from a data center. Data centers typically provide massive internal bandwidth, making it possible to build scalable storage from commodity servers, but Internet-facing bandwidth is generally limited and expensive. This is just another way to state the argument in favor of delivering content via a CDN: scalable delivery is best achieved from the edge.

The OneVantage Object Store adopts exactly this end-to-end design philosophy. It supports file ingest from content management systems via multiple ingest protocols, and originates that content for live streaming and on-demand delivery via a CDN. In essence, Object Store provides the root of the CDN hierarchy and serves cache misses from multiple CDN tiers downstream. It also offers a scale-out architecture for redundancy and storage expansion by leveraging commodity hardware, and in doing so, supports a more cost-effective solution than purpose-built storage appliances.

Scale-Out Design

Object Store scales to billions of objects and petabytes of storage. It runs on clustered commodity servers incorporating the latest hardware technology. Both disks and nodes can be easily added to accommodate growing storage needs without service disruption.
Different-sized disks can be mixed in a node and different types of nodes can be mixed in a cluster, making it possible to always incorporate the latest Commercial Off-The-Shelf (COTS) hardware. I/O bandwidth and transaction processing capacity also grow linearly, meaning that as storage and nodes are added to the cluster, Object Store's ingest and delivery capacity increases proportionally. Transaction processing capacity can be adjusted independently of storage capacity by controlling the number of disks per node. For example, external disk shelves can be used to expand the direct-attached storage per node.

Object Store efficiently handles a high rate of small reads and writes because it has no centralized mechanisms (e.g., a replica-tracking database). Instead, ingest and delivery requests can be directed at any node, independent of which nodes currently store a replica of the object. Object Store distributes content evenly across all of the disks within the cluster using the same consistent hashing algorithms employed by the OneVantage HyperCache.

Availability and Durability

Object Store provides multiple levels of redundancy, employing both mirroring and automatic failure recovery to achieve high levels of fault tolerance.
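The consistent hashing technique mentioned above can be sketched in a few lines of Python. This is a minimal, generic illustration of the idea (hash each disk to many points on a ring; an object is placed on the first disk clockwise from its own hash), not Verivue's actual implementation; the disk names, virtual-node count, and hash function are all illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Maps object names to disks so that adding or removing a disk
    relocates only a small fraction of the objects."""

    def __init__(self, disks, vnodes=100):
        # Give each disk many virtual points on the ring to even out load.
        self.ring = sorted(
            (self._hash(f"{disk}#{i}"), disk)
            for disk in disks
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def locate(self, object_name):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect_right(self.keys, self._hash(object_name)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([f"disk-{n}" for n in range(8)])
print(ring.locate("videos/episode-42/seg-0001.ts"))
```

Because every node can evaluate the same hash function locally, any node can answer "which disks hold this object?" without consulting a central database, which is what makes the decentralized request routing described above possible.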
Disks in the Object Store are configured into redundancy groups: sets of nodes that are as isolated as possible from one another. A given deployment must have at least three redundancy groups. When content is ingested into the Object Store, it is replicated across multiple redundancy groups: two by default, but the replication factor is configurable. In small clusters it is sometimes necessary for a redundancy group to contain a single node. In large clusters it is typical to configure multiple nodes into the same group when those nodes share a network switch, with redundancy groups isolated from each other by separate switches, power, or geography. RAID is not required.

Disk and node failures are automatically detected and isolated. Content on a failed disk is automatically replicated to other disks to rapidly restore redundancy. The failure recovery time is a fraction of what it would be on a typical RAID-based system, which requires a lengthy RAID rebuild process. In addition, a background auditor continually validates the integrity of the objects. If a corrupted object is detected (due to the decay of physical storage media, for example), the file is quarantined and the bad copy is replaced with another replica.

Disaster recovery (surviving the failure of an entire site) is handled in one of two ways. The first is to explicitly synchronize content between two Object Store clusters located at distinct sites. In this case, the two Object Stores operate autonomously, which means each maintains an independent set of redundancy groups. The second is to distribute a single Object Store cluster across multiple sites, in which case the Object Store's internal redundancy mechanisms cause objects to be replicated at a remote site.

Optimized for Streaming Applications

Object Store is uniquely tailored for streaming applications, particularly live and HTTP adaptive streaming. There are three considerations.
First, live streaming applications require low-latency ingest and delivery of small video fragments, typically at multiple bit rates. Object Store is optimized to support ingest and delivery of a large number of live channels per node with predictable, low latency, even in the presence of failed disks or nodes. In contrast, RAID- or erasure-coding-based storage systems are typically not optimized for small-object writes. This is because, in addition to the overhead of parity calculations, every time a portion of a stripe is updated, the rest of the stripe must be read back in order to compute the new parity. For example, on a RAID-5 array made from five disks, a particular stripe across those disks may have data on drives #1, #2, #3, and #4, and its parity block on drive #5. If a small-object write changes just the stripe's block on disk #2, disks #1, #3, and #4 must also be read to calculate the parity, which is then written to disk #5. Also, the RAID controller must ensure that changes to data and the associated parity occur as a single transaction. This is often handled with a two-phase commit, which results in additional performance overhead. Finally, writes must be serialized, which can affect latency. Hence the ingest latency on RAID-based storage systems is less predictable, which can be problematic for live applications.
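The parity arithmetic behind this small-write penalty can be made concrete with a short sketch. The byte strings below stand in for disk blocks; this is purely an illustration of RAID-5's XOR parity, not real controller code.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Bytewise XOR of equal-sized blocks (RAID-5 parity calculation)."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple)
                 for byte_tuple in zip(*blocks))

# Stripe across disks #1-#4, parity on disk #5.
d1, d2, d3, d4 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
parity = xor_blocks(d1, d2, d3, d4)

# Small write: update only d2. Recomputing parity requires reading the
# other data blocks back (the full-stripe method described above)...
new_d2 = b"XXXX"
new_parity = xor_blocks(d1, new_d2, d3, d4)

# ...or, equivalently, reading the old data and old parity and applying
# the read-modify-write shortcut: new parity = old parity ^ old ^ new.
assert new_parity == xor_blocks(parity, d2, new_d2)
```

Either way, a one-block write costs several extra reads plus a parity write, and the data and parity updates must commit together, which is exactly why small-object ingest latency suffers on parity-based arrays.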
Second, live streaming applications require redundant encoders to avoid a single point of failure, which implies Object Store must be able to simultaneously ingest multiple copies of each object. Even in such cases, Object Store replicates the content as if it had received a single request. There is no storage penalty for using redundant encoders, and critically, no content is lost in the event of an encoder or Object Store disk/server failure.

Third, some HTTP adaptive streaming protocols, such as Microsoft Smooth Streaming and Adobe HTTP Dynamic Streaming, require a translation from client fragment requests to server file offsets. Object Store can be complemented with origin heads to provide this functionality. These origin heads support a scale-out model consistent with the Object Store architecture, where origin heads can run on dedicated servers or on the same servers as the base Object Store functionality. For example, the following figure shows an eight-node cluster with six nodes dedicated to Object Store and two nodes running origin heads. Origin heads are not required for Apple HLS or native HTTP delivery.

Integrated Management

Object Store allows independent accounts to be created, and within these accounts, users can be defined with specific rights and privileges. Accounts can be created for in-house users, as well as for third-party content providers. This allows operators to offer storage as a service, where typically there is an administrative account for the operator and a separate account for each content provider. Content providers then create users with the desired privileges within that account. All interaction with Object Store is cryptographically protected via HTTPS and conforms to a well-defined API; content providers are not granted direct access to individual Object Store nodes.
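The account/user/privilege structure just described might be modeled as follows. The account names, privilege names, and lookup function here are hypothetical illustrations of the multi-tenant concept, not the actual Object Store management API.

```python
# Hypothetical multi-tenant privilege model: operator admin account plus
# one account per content provider, each defining its own users.
ACCOUNTS = {
    "operator": {
        "admin": {"ingest", "publish", "delete", "manage-users"},
    },
    "acme-video": {            # a third-party content provider's account
        "uploader": {"ingest"},
        "editor": {"ingest", "publish"},
    },
}

def is_authorized(account, user, action):
    """Return True only if `user` within `account` holds the privilege
    needed for `action`; unknown accounts or users are denied."""
    return action in ACCOUNTS.get(account, {}).get(user, set())

print(is_authorized("acme-video", "uploader", "publish"))  # → False
print(is_authorized("acme-video", "editor", "publish"))    # → True
```

Because every request arrives over HTTPS at the API layer, a check like this can be enforced centrally, which is what allows operators to expose storage as a service without granting providers access to individual nodes.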
This API allows content providers to ingest content into Object Store using established content management protocols (e.g., FTP, WebDAV) and cloud storage interfaces (e.g., Rackspace, S3), thereby simplifying the transition to/from
popular cloud storage services. The API also provides integrated control over how content is published via a CDN, including when content is published, where it can be accessed, and how long content is available. Finally, audit logs record all access through the management API or the management user interface that sits on top of this API.

Summary

The OneVantage Object Store offers cloud-based, replicated HTTP storage, which can be used to persistently store media content that is subsequently delivered to users via a CDN. The solution leverages COTS hardware and state-of-the-art clustering software to scale to billions of objects and petabytes of data. It offers high availability, including geo-redundancy, supports multi-tenant usage scenarios, and supports APIs that ease integration with both cloud storage systems and widely distributed CDNs. And perhaps most uniquely, Object Store is optimized to support a full range of content, including live, nDVR, and on-demand streaming applications.

About the Author

Larry Peterson, Chief Scientist, Verivue

As Chief Scientist, Larry Peterson provides technical leadership and expertise for research and development projects. He is also the Robert E. Kahn Professor of Computer Science at Princeton University, where he served as Chairman of the Computer Science Department from 2003 to 2009. He also serves as Director of the PlanetLab Consortium, a collection of academic, industrial, and government institutions cooperating to design and evaluate next-generation network services and architectures. Larry has served as Editor-in-Chief of the ACM Transactions on Computer Systems, has been on the editorial boards of the IEEE/ACM Transactions on Networking and the IEEE Journal on Selected Areas in Communications, and is the co-author of the best-selling networking textbook Computer Networks: A Systems Approach.
He is a member of the National Academy of Engineering, a Fellow of the ACM and the IEEE, and the 2010 recipient of the IEEE Kobayashi Computers and Communications Award. He received his Ph.D. from Purdue University in 1985.

For more information on Verivue's Object Store solution, please visit: www.verivue.com/object-store.