Moving Virtual Storage to the Cloud
Guidelines for Hosters Who Want to Enhance Their Cloud Offerings with Cloud Storage
Table of Contents

- Overview
- Understanding the Storage Problem
- What Makes an Ideal Cloud Storage Solution for Hosters?
  - Scalability
  - Multiple Nodes
  - Block-Based (Not File-Based) Roots for Virtual Environments
  - Object Handling
  - Cloning, Snapshotting and Deduplication
  - Sparse Objects and Thin Provisioning
  - Object Resizing and Support for Legacy File System Roots
  - Redundancy
  - Cluster Simplicity
  - Storage Expansion
- Summary of Requirements
  - Must Have for Initial Deployment
  - Nice to Have for Future Deployments
- Conclusion
Overview

In traditional hosting models, storage is usually directly attached to the node serving up the virtual environments (VEs) [1]. The storage usually comes in the form of SATA devices with 1.5-3Gb/s interfaces and an approximate sustained bandwidth of around 100MB/s. The great advantage of local storage is that it's fast (100MB/s) and scalable (as you add nodes, they come with more local storage). But the local nature of the traditional storage model is also a disadvantage: if you want to migrate your virtual environments, you have to take a physical copy of their associated storage as well. This requirement makes the locally attached storage model inappropriate for dynamic, highly fluid environments, such as those found in the cloud.

The ideal virtual storage solution for hosters offering cloud services is one that provides the speed and scalability advantages of locally attached storage but adds the ability to migrate, scale, and snapshot the storage. In addition, its cost per terabyte must be similar to that of local storage, and it should provide object copy redundancy for higher data reliability.

The purpose of this white paper is to provide guidance to hosters who are thinking of moving from traditional storage to cloud storage to enhance their cloud offerings. By explaining how to evaluate the various features offered by the large range of storage providers in the market today, it will help you choose the system that's best for you.

Understanding the Storage Problem

Most hosting providers today are using either some kind of storage area network (SAN) or direct attached storage (DAS). The latter typically consists of a large machine with multiple disks in a RAID configuration, together with redundant power units; it exports its storage either as a block-level device via iSCSI or as a shared file system via NFS. Neither approach is ideal for hosters, however.
The problem with enterprise-class SAN solutions is that they are very expensive, so they will significantly decrease your per-customer margin. The common drawback with DAS is that it doesn't allow you to leverage unused disk space if other resources such as CPU or memory are already assigned to virtual environments. As a result, you're unable to make efficient use of your available disk space (see Figure 1).

[1] Virtual environments are individual Infrastructure as a Service (IaaS) units, provided either by hypervisor or container technology.
Figure 1. Large hosting providers in Europe and North America typically use only 36% of their disk space.

What Makes an Ideal Cloud Storage Solution for Hosters?

In this section, we look at various terms used to describe storage and relate them back to features that will (or won't) be useful in a hosting environment.

Scalability

Given the necessity of maintaining local scaling of storage bandwidth, it's clear that you need a distributed environment, because a centralized server system cannot scale. For instance, a centralized NFS server, even on 10Gb/s links, can serve just ten nodes before reaching its maximum bandwidth. Even if you add fabric-switching technology (such as a SAN or InfiniBand switch) to deliver the full available bandwidth from the servers to the clients, it still won't match the capacity you can get from a distributed environment.

Because of this requirement for a distributed system, any system that uses file servers like NFS, iSCSI, or Fibre Channel is inappropriate. And although it's possible to use split-fabric technology to overcome some of the objections to a centralized store, such solutions tend to raise the cost per terabyte beyond acceptable limits, making this approach impractical for hosters as well.

Multiple Nodes

In a hosting environment, pretty much every box has one or two 1Gb/s Ethernet interfaces, connected by a switch, and a local disk. This means that in a storage hosting environment, it's possible to serve storage evenly at 100MB/s, provided that the storage is spread evenly across the cluster, with each node serving storage to all the others at its maximum link speed. However, for this approach to work, the node count of the storage cluster must be the same as that of the compute cluster. This requirement is easiest to achieve if all the nodes in the cluster are both storage and compute nodes.
Therefore, in a hosting environment, scalable storage is best delivered by reusing the existing nodes to run as storage servers. But if you take this approach, it's important not to disrupt the existing services running on the nodes by overtaxing them with excessive resource requirements, so you need to choose a storage system that makes minimal use of resources.

It's interesting to note that a 64-node rack cluster with a fast Ethernet switch supporting fabric-switching technology can, using nothing more than 1Gb network cards and fairly run-of-the-mill SATA devices, deliver an aggregate storage bandwidth of around 50GB/s, provided the data placement is done correctly. That's a greater bandwidth capability than a super-fast SSD array on a modern InfiniBand network, which can only deliver around 40GB/s aggregate on the fabric.

Block-Based (Not File-Based) Roots for Virtual Environments

VEs may either (1) use a shared, file-based root (using NFS, a cluster file system like GFS2 or Ceph, or, in the case of containers, binding the mount directly into the host); or (2) use a block-based root (usually either iSCSI or a block projection of an image file).

The problem with the first approach is that all shared, file-based root systems suffer from a scaling problem as the number of VEs rises. That's because each VE root contains a large number of small files, and aggregating them in a file environment causes the file server to see a massively growing number of objects. As a result, metadata operations run into bottlenecks. To explain this problem further: if each root has N objects and there are M roots, tracking the combined objects requires an N times M scaling of effort. Additionally, the objects must be tracked by the metadata, and in a root file system, the size of the objects can vary by ten orders of magnitude (that is, from a few bytes up to many gigabytes). Tracking objects of such variable sizes creates considerable metadata overhead.
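To put rough numbers on the N times M scaling argument above, the following sketch compares the number of objects a central metadata service must track under each approach. The figures for files per root and roots per cluster are illustrative assumptions, not measurements:

```python
# Illustrative metadata-scaling comparison: shared file-based roots
# force central metadata to track every file in every root (N x M),
# while block-based roots expose only one image object per root (M).
# N and M below are assumed figures chosen for the sketch.

N = 100_000   # files per VE root (assumption)
M = 10_000    # VE roots in the cluster (assumption)

file_based_objects = N * M    # central metadata sees every file
block_based_objects = M       # central metadata sees one image per root

print(f"file-based roots:  {file_based_objects:,} tracked objects")
print(f"block-based roots: {block_based_objects:,} tracked objects")
print(f"reduction factor:  {file_based_objects // block_based_objects:,}x")
```

With these assumed figures, the file-based approach forces the metadata service to track a billion objects, while the block-based approach tracks ten thousand, a reduction by exactly the factor N.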
For all these reasons, we don't recommend using shared file-based roots for VEs in highly scalable cloud systems.

In contrast, block-based roots elegantly avoid the metadata scaling and sizing problems of shared file-based roots because the metadata is effectively partitioned. That is, each root runs a separate file system, which encapsulates its own metadata. Consequently, the metadata of the block export system needs to track only the metadata of the object representing the root, so scaling depends only on M (instead of N times M). In addition, because each image representing the block data ranges only from about a gigabyte to a terabyte in size (i.e., varying by only three orders of magnitude), you can use simpler techniques in the metadata to track these objects.

Object Handling

The ideal backend for handling block-based roots is one that's capable of doing incredibly rapid and random updates to the objects. This requirement tends to rule out abstracted image storage like Amazon S3, an approach that makes it very hard to do random updates. For the fastest possible updates, once the object layout has been identified from the metadata, you shouldn't need any further metadata communication to read from and write to the object. Further, the node performing the read or write should be able to communicate directly with the node(s) providing
the object. Any encapsulation should be minimal, so the update process can use as much of the available network bandwidth as it needs.

Cloning, Snapshotting and Deduplication

Cloning involves using copy-on-write [2] techniques to produce a duplicate of an image object. Snapshotting involves making a volatile copy of the object, either to permit rollback from a fallible operation, such as an update, or simply to facilitate a backup. Deduplication involves identifying and combining storage regions with identical content in different objects.

All three techniques have been important for some time in enterprises that manage virtual environments. However, hosters currently have less need for deduplication, as surveys have shown that they have considerable available space in their environments. Thus, although deduplication may become a requirement in the future, it isn't currently high on hosters' feature lists.

Sparse Objects and Thin Provisioning

Sparse objects are objects in which not every byte has been allocated, so these objects actually consume less storage than their size would imply. Sparse objects exist because root image files don't necessarily occupy all the space they have been allocated (to see this, just look at the free space on any computer). The technique of using sparse objects is also referred to as "thin provisioning" by array vendors.

As with deduplication, the use of sparse objects typically is not of interest to hosting providers because they generally have more storage than they need. The other factor that makes sparse objects a non-issue for hosters is the widespread use of cloning: in a root that's been cloned multiple times, unoccupied space will still point back to the master copy, so the only space saving a sparse object would create would be a single block in the master copy, despite the existence of hundreds of clones.
However, sparse objects are still a useful feature to keep in mind, particularly because they allow hosting providers to consider business plans based on overcommitting storage.

Object Resizing and Support for Legacy File System Roots

Obviously, it should be a requirement that objects representing roots be resizable, especially since cloud customers are usually charged per unit of storage and will therefore want to optimize their use of storage as much as possible. However, in a block-based root system, resizing depends not only on the capabilities of the object store, but also on the file system chosen for the root (a choice that is usually made by the consumer of the VE). The problem here is that, for practical reasons, many root file systems cannot actually be shrunk (a classic example in Linux is the ext3 file system). It is therefore useful for any cloud object store to have assistive technologies for shrinking legacy file systems that are otherwise unshrinkable.

[2] Copy-on-write is a technique that enables multiple images to share the same storage block as long as users only read from those images. Once a user writes to an image, a block is created that is unique to that image.
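The copy-on-write sharing described in the footnote above can be illustrated with a minimal sketch. The class, block granularity, and zero-fill behavior here are invented for illustration only, and do not represent any particular product's implementation:

```python
# Minimal copy-on-write image sketch: a clone shares its parent's
# blocks until it is written to; only written blocks become private.
# Unallocated blocks read as zeros, which is also the essence of a
# sparse object. All names here are illustrative, not a real API.

class CowImage:
    def __init__(self, parent=None):
        self.parent = parent      # master image this clone was made from
        self.blocks = {}          # block index -> privately owned data

    def read(self, idx):
        if idx in self.blocks:    # block was written locally
            return self.blocks[idx]
        if self.parent is not None:
            return self.parent.read(idx)   # shared: fall back to master
        return b"\x00"            # never allocated (sparse region)

    def write(self, idx, data):
        self.blocks[idx] = data   # copy-on-write: create a private block

master = CowImage()
master.write(0, b"base")

clone = CowImage(parent=master)
assert clone.read(0) == b"base"   # shared block, no extra storage used
clone.write(0, b"mine")           # first write makes a private copy
assert clone.read(0) == b"mine"
assert master.read(0) == b"base"  # master image is unaffected
```

A hundred clones of the same master therefore cost almost nothing until each clone starts diverging, which is why cloning is listed among the must-have features later in this paper.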
Redundancy

Hosters today tend to provide redundancy for their local storage solutions by using hardware RAID systems on their individual nodes. Therefore, any cloud storage system based on these nodes can take advantage of the RAID system to provide initial redundancy. However, hosting providers need to be able to survive node failure as well as disk failure, so a cloud storage solution should also be able to duplicate objects across multiple nodes in the cluster. And because object duplication takes additional space, hosters should be able to specify the desired number of object copies.

Cluster Simplicity

One of the cardinal principles of system design is that a system should be as simple as possible, containing only as much complexity as is needed to perform all of its functions. Additional complexity beyond this point simply increases overhead, impairing the system's efficiency and generally weakening it. In cloud storage, clustered file systems are a prime example of such unneeded complexity: shared access to objects adds complexity to the cluster algorithms, yet the only use case for object roots is exclusive access. In fact, mounting the same root on more than one machine will corrupt the underlying file system, making it important that the storage system itself be able to detect and prevent this condition. Additional complexity also increases the amount of testing required to thoroughly debug the code, but since most organizations have fixed budgets for testing, the net effect is that testing is less comprehensive. For all these reasons, it's generally a bad idea to base your cloud storage on clustered file systems.

Storage Expansion

Since it's a given that customer storage requirements will only increase over time, expanding the capacity of the cloud storage system should be extremely easy, whether that's done by adding new disks to individual nodes (preferably by hot-plugging them, so the storage system sees the new disks and simply absorbs them) or by adding additional nodes to the cluster.
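The redundancy requirement discussed above, a configurable number of object copies that must land on distinct nodes so a single node failure never loses all copies, can be sketched as a simple placement policy. The hash-based round-robin scheme and node names below are assumptions made for illustration; real systems use more sophisticated placement algorithms:

```python
# Sketch of a placement policy that puts a configurable number of
# copies of each object on distinct nodes. Deriving the starting
# node from a hash of the object ID spreads objects evenly across
# the cluster. This is an illustrative policy, not a real product's.

from hashlib import sha256

def place_copies(object_id, nodes, copies=2):
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    # Deterministic starting node derived from the object ID.
    start = int(sha256(object_id.encode()).hexdigest(), 16) % len(nodes)
    # Walk round-robin from the start node, so all copies are distinct.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_copies("customer-42-root", nodes, copies=2)
assert len(set(placement)) == 2   # two copies on two distinct nodes
```

Because placement is deterministic, any node can recompute where an object's copies live without consulting a central service, which keeps the resource footprint of the storage layer small.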
Summary of Requirements

This section summarizes our recommendations for cloud storage, based on the preceding observations. Assuming you'll be deploying the cloud storage solution on mostly existing hardware, we've divided the requirements into two categories: must have for initial deployment, and nice to have for the future.

Must Have for Initial Deployment

The absolutely critical requirements for your initial cloud storage deployment are:

- Cost-effectiveness. The storage solution should be able to reuse your existing hardware setup, require little extra hardware, and be as light as possible in terms of its resource footprint.
- Multi-node performance. The solution must be spread over enough nodes to be able to deliver the same level of performance as your current locally attached storage.
- Block-based objects. To assure optimal performance and handling, the technology must be based on objects representing roots.
- Cloning and snapshotting. The solution must support copy-on-write use of master images, as well as the ability to freeze the state of the storage at any point in time.
- Hot-pluggability. The solution should be easy to expand by simply inserting additional nodes and devices.
- Failure tolerance and redundancy. At a minimum, the solution should protect against single-disk failure. Ideally, it should protect against single-node failure as well.
- Exclusive object access. The solution should ensure that an object representing a root file system is mounted only once in the cluster at any given time.

Nice to Have for Future Deployments

Some additional features that you may find convenient to add in future deployments are:

- Deduplication, to free up additional storage space.
- Sparse objects (thin provisioning), so you can safely overcommit storage.
- Assistance for shrinking legacy file systems, so customers who are charged per unit of storage can optimize their use of storage.

Conclusion

As the cloud revolution progresses, the ability to separate storage from your physical systems will become increasingly important. By understanding what your storage requirements are and how well different cloud storage systems match them, you'll be able to take full advantage of the benefits that cloud storage has to offer.

To learn more about how cloud storage systems can increase the reliability and scalability of your hosted services, and how Parallels helps service providers deliver cloud storage, please visit www.parallels.com/products/pcs.