GINORMOUS SYSTEMS April 30–May 1, 2013 Washington, D.C. REINFORCEMENT PAPERS




Disruptive Change in Storage Technologies for Big Data
Dr. Garth Gibson

When considering the size requirements of a digital storage system, two metrics come to mind: the number of files it must hold and manage, and the relevant prefix for "byte" on the total storage system, perhaps giga, tera, peta, or exa. A greater number of files does not necessarily correspond to more data. For instance, high-performance computing on seismic data entails individual files that are measured in terabytes, while key value stores deal with a huge number of very small files of just a few bytes each. Each end of this size/quantity range presents challenges to the design of storage systems and their underlying technologies. RAID pioneer Garth Gibson discusses these and other drivers of change in the storage arena, along with the solutions that are rising to meet each need. Muddying the waters is the shifting economic model for disk drives, with the limits on areal density of conventional drives making NAND flash solid-state storage a potentially attractive alternative: as cheap, even disposable, slow memory when the number of small stored items is large, and as a replacement for disk drives once the density crossover point appears in the rearview mirror.

"I can always represent [files] that are getting bigger in a small amount of metadata, do a small amount of locking and a small amount of synchronization, as long as the thing we are operating on keeps getting bigger."

Gibson lays out the dual problem, using the example of storage to support high-performance computing. "The vast number of objects, even at Los Alamos National Lab's operating supercomputers, are tiny," says Gibson, although he goes on to note that the capacity of the 100M-file system remains entirely ample until the largest files, which currently measure 4 TB per file, are taken into account. "The performance issues at Los Alamos are driven by dealing with objects that are a gigabyte or larger, but the management issues are typically driven by objects that are very, very small."

The time-tested approach to large-file storage is object storage, which replaces fixed-size blocks with flexible-size objects. Whereas block-based storage addresses each block individually, entailing significant metadata overhead for large files and appreciable wasted storage capacity for small files, objects allow for better management. An object-based storage device collects logical blocks into an object that contains not only the data but also its attributes. This approach alleviates the need for the database management system to directly keep track of the block assignments for each file, conferring significant advantage and making it the solution chosen by the cloud and HPC communities. "We wrap up blocks into large files and pass around pointers to these large objects to access them," says Gibson. "And we leave metadata for accessing how they are stored inside the containers where they are stored to try and minimize the amount we have to synchronize on."

Pioneered by Gibson, Panasas's object-based storage for the HPC environment is specifically implemented with a chassis of ten blades in combination with a metadata-servicing unit and networking hardware. These elements make up a site's distributed file system, which presents transparently to the customer as a single file server. As with any distributed system, fault tolerance is essential to the practical use of object-based storage.
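
To make the contrast with block storage concrete, the sketch below models a toy object store in which each object carries its data together with its attributes, so a reader needs only an object ID rather than a per-file list of block addresses. It is a minimal illustration, not Panasas's implementation; the class and method names are invented for this example.

```python
# Minimal sketch of an object-based storage device (OSD) interface.
# Illustrative only: names and structures are hypothetical, not a real OSD API.
from dataclasses import dataclass, field
from typing import Dict

BLOCK_SIZE = 4096  # fixed-size blocks hidden inside the object

@dataclass
class StorageObject:
    data: bytes
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. owner, layout, checksum

class ObjectStorageDevice:
    """Collects logical blocks into objects; callers never see block numbers."""
    def __init__(self):
        self._objects: Dict[int, StorageObject] = {}
        self._next_id = 0

    def put(self, data: bytes, **attributes) -> int:
        oid = self._next_id
        self._next_id += 1
        # Internally the device may split data into BLOCK_SIZE chunks, but that
        # mapping stays inside the device rather than in file-system metadata.
        self._objects[oid] = StorageObject(data, dict(attributes))
        return oid

    def get(self, oid: int) -> StorageObject:
        return self._objects[oid]

# Usage: the file system keeps only a pointer (the object ID) per file component.
osd = ObjectStorageDevice()
oid = osd.put(b"seismic trace ..." * 1000, owner="hpc-user", kind="checkpoint")
print(len(osd.get(oid).data), osd.get(oid).attributes["owner"])
```
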
Quarter-century-old RAID (redundant array of independent disks) remains the underpinning technology in this arena, although its modern implementation in software is where the action is. "Although RAID is old and dead, it is hardware RAID that is dead, while variations in software RAID are actually very innovative right now," says Gibson. The challenge is to manage systems as they scale out. The largest Panasas deployments currently entail 8 PB of storage and 500K metadata operations each second, with bandwidths spanning the range of 1.5–150 GB/s. Managed as a distributed system, Panasas's solution reliably tracks components for proper operation and performs smooth failover when problems arise. "As long as the object model works, this is primarily an issue of the distributed system," says Gibson. "How many things in there can fail? Can I keep an image of what's working and not working? And can I keep a consistency and failover strategy in place?" While not an easy set of attributes to master, Gibson sees this as attainable, and, like Google's Spanner, Panasas places high value on maintaining consistency.

With his model, individual clients no longer feature hardware RAID; instead, RAID is computed over individual pieces of each file, and the resulting RAID stripes are distributed through the network to the storage components. That is, each client is held responsible for creating redundancy for its native data, ensuring scalability, while the distributed nature of the system and direct writing from client to the parallel storage system provides the speed necessary for consistency, provided the network uses large buffers to accommodate nonuniform traffic flows. "I have scaled out all of the RAID computation, all of the reliability computation, and pushed it out to the client nodes, which scale out with the total amount of the system, and I flow this out at the speed of the network in parallel," says Gibson. "This scales."

Reconstruction, however, poses its own set of considerations. As disk capacity grows 40% year on year, the size of the failure unit scales up commensurately, particularly if recovery is pegged at the node level rather than the disk level.
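
A minimal sketch of the client-side redundancy idea follows: the client stripes its own file, computes a parity unit per stripe, and scatters each stripe across devices chosen at random from a large pool, so the RAID work scales with the number of clients rather than being funneled through a central controller. The stripe width, pool size, and simple XOR parity used here are illustrative choices, not Panasas's actual layout or codes.

```python
# Sketch: client-side software RAID over a file, scattered across a large device pool.
# Illustrative only; real systems use richer layouts and stronger codes.
import random

def xor_parity(units):
    parity = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            parity[i] ^= b
    return bytes(parity)

def stripe_and_protect(data, unit_size=4, stripe_width=4, pool_size=1000):
    """Split data into stripes of `stripe_width` units, add XOR parity,
    and assign each stripe's units to randomly chosen devices."""
    units = [data[i:i + unit_size].ljust(unit_size, b"\0")
             for i in range(0, len(data), unit_size)]
    stripes = []
    for s in range(0, len(units), stripe_width):
        stripe = units[s:s + stripe_width]
        parity = xor_parity(stripe)
        # Each stripe lands on its own random set of devices out of the pool,
        # so RAID sets are "ten wide on 1000 devices" rather than aligned columns.
        devices = random.sample(range(pool_size), len(stripe) + 1)
        stripes.append({"units": stripe, "parity": parity, "devices": devices})
    return stripes

stripes = stripe_and_protect(b"client-resident data to be written in parallel")
# Recover a lost unit in stripe 0 from the survivors plus parity:
lost = stripes[0]["units"][1]
survivors = [u for i, u in enumerate(stripes[0]["units"]) if i != 1]
assert xor_parity(survivors + [stripes[0]["parity"]]) == lost
```
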

Unless the reliability of individual components improves dramatically, the inevitable result is more frequent failure and a heavier load on the storage system's recovery mechanism. Being a realist, Gibson knows he must prepare for the worst and therefore assumes the need to accommodate a media error during the reconstruction process itself. "You need to be able to protect against two failures from the beginning," he says.

The solution: parallelize recovery. Hardware RAID does not permit this, leading to the shift to software RAID. "The way this works for our system is that individual files pick random locations, they grow out, they get set on some stripe, they calculate a RAID code, while other files are allocated to completely different places," explains Gibson. "Over time what you get is RAID sets that are parts of the files distributed over failure domains, and they are not aligned. It is not like a RAID set is ten wide on ten devices; a RAID set is ten wide on 1000 devices, picked at random." Reconstruction occurs through parallel reading, while parallel writing across the free space persists the full corpus reliably, provided the metadata capacity remains sufficient. This latter requirement is tantamount to file sizes growing reliably larger; when data consists of a barrage of small files, the problem becomes one of metadata inundation, as described below.

"NAND flash becomes double buffering. It becomes cheap, slow memory to offset the fault-tolerance strategy that disks are currently holding. It is about dumping memory really quickly into a copy that then gets dribbled into the storage, because the capacity is still in the storage."

Before addressing the problem of small files, however, Gibson raises an issue that plagues the HPC community: the need for fast networks to move the scientific community's large files. Exascale systems are coming, raising the bar yet higher on network communications and data management. Another way of looking at this notion of system size is to consider the number of nodes, which is in the process of growing to fully 1M nodes within a decade. "There is no particular reason to assume that each node is going to get more reliable, so failure rates are going to go way up," says Gibson, making effective recovery systems all the more necessary. Roughly halving the time to achieve a memory dump, to 300 seconds by 2018, becomes the standard. "If the failures are happening more commonly, then I have to have a shorter time period between checkpoints; and if I have a shorter time period between checkpoints, in order to keep the cycles on my computation, I have to have a fault-tolerance strategy that executes faster, which means I have to dump memory in less time."

Carrying this logic forward, network speed must rise to carry memory-borne data away from points of failure and to a safe haven. "I need a moderate amount of capacity, but a phenomenal amount of bandwidth," says Gibson, who is addressing this need by using solid-state flash to rapidly accept this memory dump and serve as a form of inexpensive memory that can then trickle data out to disk on a more relaxed time scale. This hybrid NAND-flash-plus-disk solution is the most cost-effective answer to the combined bandwidth and storage requirement.

[Figure: Using NAND flash as checkpoint memory saves systems from frequent component failure. The compute cluster performs a fast write to checkpoint memory, which is then written slowly out to disk storage devices.]
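
The bandwidth argument can be made concrete with a rough calculation. The sketch below computes the aggregate write bandwidth a checkpoint tier must absorb to dump a machine's memory within a target window, and estimates a checkpoint interval using Young's classic approximation; the node count's memory per node and failure rate are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope checkpoint sizing (illustrative numbers, not from the talk).
import math

nodes = 1_000_000           # projected node count within a decade
mem_per_node_gb = 64        # assumed memory per node
dump_window_s = 300         # target time to dump all memory (the 2018 goal)

total_memory_gb = nodes * mem_per_node_gb
required_bw_gbs = total_memory_gb / dump_window_s
print(f"Aggregate checkpoint bandwidth: {required_bw_gbs:,.0f} GB/s")

# Young's approximation: optimal checkpoint interval ~ sqrt(2 * dump_time * MTTI),
# where MTTI is the system-wide mean time to interrupt.
node_mtbf_hours = 50 * 365 * 24          # assume each node fails about once in 50 years
system_mtti_s = node_mtbf_hours * 3600 / nodes
interval_s = math.sqrt(2 * dump_window_s * system_mtti_s)
print(f"System MTTI: {system_mtti_s:.0f} s; checkpoint roughly every {interval_s:.0f} s")
```

With these assumed figures the checkpoint tier must absorb on the order of 200 TB/s, while offering only modest capacity, which is exactly the bandwidth-heavy, capacity-light profile Gibson assigns to NAND flash.
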

This use of NAND flash as memory increases both its cost and its value. Currently, raw solid-state NAND exceeds the cost of disk storage by a factor of ten; wrap some processing around it to make it useful in the memory capacity just described, and the cost goes up by another order of magnitude. But the economics of memory follows a different trajectory than that of storage. "An SSD device is a little computer with a microcontroller and some DRAM in there to hide the characteristics of the flash, and then you sell that component based on how smart your controller is, and right now that gives a lot of value to the customer, and there are large margins in this space," details Gibson. This makes for an appealing solution for fast-responding consumer devices, but poses a problem in the large-systems space because of the electronic properties at the materials level. "This NAND flash stuff is moving dozens of electrons through an insulator into a floating gate, and that floating gate changes the voltage you need to apply to it in order to get current flow, and you can sense how much you have done," says Gibson. "That solid-state technology works operating on large, disk-like chunks, where block sizes for erase will be in megabytes, and writing will be in page-size units, and those units are going to go up with the density of the NAND flash." It is the job of the controller to make the block sizes transparent to the user, as well as to overlay the technology with a workaround for the limited number of writes that NAND flash can handle before the material degrades, as Gibson explains: "We try to hide that amount of wear on any one page by mapping and remapping things constantly." Given that the rewrite capacity for the lowest-cost, so-called triple-level cell NAND flash is roughly 500 cycles, its utilization model in at-scale systems is likely to be similar to a printer's toner cartridge; that is, a consumable storage component.

The discussion thus far pertains to systems in which large files determine capacity, such as HPC. Other applications, however, inherently entail a huge number of tiny files. In finance, for instance, 20% of files measure just a couple of kilobytes, while fewer than 10% are greater than 1 MB; key value stores are even more skewed toward small files. Just as finance requires a different storage system setup than HPC, key value stores require yet another alternative approach. Gibson considers each in turn.

To satisfy the characteristic file distribution found in the finance community, Panasas configures systems with a more storage-centric use of SSDs, where NAND flash not only serves a memory function but also steps in on the storage side to accept small files, while leaving disk storage available for the behemoths. "We are combining a storage unit (in our case it is a blade) with some disks and some SSD, with variable configuration sizes," says Gibson. "And you choose those configurations based on the workloads."

When file sizes become truly small, the metadata describing the data becomes appreciable, even relative to the absolute volume of raw data. "We are increasingly getting main memory full of data structures that can get to the actual data you need, when it is small and random, in a small number of fetches," says Gibson, reducing the potential to use SSDs as storage instead of memory. Consider some numbers: If the files in storage are small (1 KB) photos, as with Facebook thumbnails, 4 GB of index in memory will represent 1 TB of data. If, however, the data are tweets (168 bytes each), then it will take 24 GB of index to represent the same 1 TB.
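
The figures above are consistent with an in-memory index cost of roughly 4 bytes per stored object; the quick check below recomputes them under that assumption (the 4-bytes-per-entry constant is inferred from the examples, not stated in the talk).

```python
# Quick check of in-memory index size versus object size, assuming ~4 bytes
# of DRAM index per stored object (a constant inferred from the examples above).
INDEX_BYTES_PER_OBJECT = 4
DATA_TB = 1e12  # 1 TB of raw data

for label, object_bytes in [("1 KB photo", 1000), ("168-byte tweet", 168)]:
    objects = DATA_TB / object_bytes
    index_gb = objects * INDEX_BYTES_PER_OBJECT / 1e9
    print(f"{label:>15}: {objects:.1e} objects -> ~{index_gb:.0f} GB of index per TB")
# Prints roughly 4 GB for 1 KB objects and ~24 GB for 168-byte objects,
# matching the numbers quoted in the text.
```
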
Take this down further to the scale of data-deduplication hashes (a mere 32 bytes each), and the memory requirement becomes a massive 125 GB of DRAM for single-step SSD lookup. At this scale, the need for efficient index lookup is paramount. Of the various approaches to configuring a key value store index, Gibson sees SILT (small index, large table) as the most promising, given that it requires less than one byte of memory per entry stored on SSD, retrieves each entry in a single lookup, and functions well on systems built from low-power processors. This approach brings with it the challenge of replumbing the operating system, which is otherwise unable to handle the unusually fast access rate.
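
The sketch below illustrates the general shape of such a memory-efficient key-value design: a compact in-DRAM index maps a short key fingerprint to an SSD location, so each GET costs one SSD read while DRAM holds only a few bytes per entry (here far more than SILT's sub-byte target). It is a simplified illustration of the idea, not SILT's actual data structures.

```python
# Sketch of a compact DRAM index over an SSD-resident key-value log.
# Simplified illustration of the SILT-style idea, not SILT's actual design:
# DRAM keeps only (fingerprint -> log offset); values live on "SSD" (a log here).
import hashlib

class CompactKVStore:
    def __init__(self):
        self.log = bytearray()   # stands in for the SSD-resident table
        self.index = {}          # small DRAM index: fingerprint -> (offset, length)

    @staticmethod
    def _fingerprint(key: bytes) -> int:
        # A short fingerprint keeps DRAM cost low at the price of rare collisions,
        # which a real design resolves by checking the stored key.
        return int.from_bytes(hashlib.sha1(key).digest()[:4], "big")

    def put(self, key: bytes, value: bytes):
        offset = len(self.log)
        self.log += key + b"\x00" + value   # append-only write, SSD-friendly
        self.index[self._fingerprint(key)] = (offset, len(key) + 1 + len(value))

    def get(self, key: bytes) -> bytes:
        offset, length = self.index[self._fingerprint(key)]   # DRAM lookup
        record = bytes(self.log[offset:offset + length])       # single "SSD" read
        stored_key, _, value = record.partition(b"\x00")
        assert stored_key == key, "fingerprint collision: consult next candidate"
        return value

kv = CompactKVStore()
kv.put(b"dedup-hash-0001", b"block@disk-17:4096")
print(kv.get(b"dedup-hash-0001"))
```
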

With these various architectural changes afoot to accommodate the scope of data, and with NAND flash taking on an increasingly prominent role, disk storage remains the economic medium of choice, at least for the time being. However, it must continue to increase in capacity to keep pace with demand. Gibson describes three disk drive technologies that are competing for acclaim in the next couple of generations of areal densification: heat-assisted magnetic recording (HAMR), bit-patterned media (BPM), and, nearest term, shingled-track disks.

The heat assist of HAMR disks refers to the need to locally heat the magnetic medium to coax its extra-small bits into changing orientation. "We need to make grains of magnetic orientation smaller and smaller," says Gibson, "and use materials that resist change more strongly so that the superparamagnetic limit is pushed further out." A microlaser on the write head that introduces enough heat to raise the temperature of the medium by a couple of hundred degrees Celsius does the trick by preparing the high-coercivity disk material to flip orientation. "Most of the disk companies are working on this, but it is taking a long time," says Gibson, who recognizes the process-engineering challenge of manufacturing laser-augmented heads that travel over the disk surface with nanometer-scale tolerances. "If [magnetic-disk drive manufacturers] don't continue to improve areal density, their customers don't have any reason to buy from them at all," he says, highlighting the business challenge for traditional players on the hardware side.

Bit-patterned media will be later in coming than heat-assisted technology, but Gibson sees it as the expected follow-on. In this case, nanolithography lays down cells, each capable of storing a single bit. This technology is even further out than HAMR.

With engineering and fabrication challenges hampering the release dates of HAMR and BPM, respectively, Gibson anticipates the nearest-term update to disk drive technology will be shingling. The current mode of laying down tracks on magnetic storage media is for each track to consume roughly 40 nm of width, with at least 5 nm between adjacent tracks to minimize crosstalk. Instead of this unshingled configuration, the new mode entails a wider swath for writing each track, but with substantial overlap between adjacent tracks. With unshingled tracks, the margins on both sides of each track must be clean, with no data laid down between tracks; in contrast, only one side of each shingled track must have a clean margin, with the read process always occurring close to that edge, as illustrated. The same area of disk can therefore accommodate more tracks, as progress demands.

[Figure: Unshingled (left) vs. shingled (right) tracks; w and w indicate the write width, g is the gap between tracks, and r is the read width for the shingled tracks.]

Although it increases density, the overlap inherent in shingled tracks poses the conundrum of not being able to rewrite tracks at will. "It will change the system model," says Gibson. "The system model will be that the disk will not be able to rewrite individual sectors, or it will be expensive because of having to pick up an entire set of tracks and write them back down, and that will take tens or hundreds of seconds to do." To avoid this potentially fatal flaw, shingled disks, when they appear, may well be fitted with a microcontroller (as the NAND flash in SSDs is) to force writes to always proceed sequentially and to perform periodic defragmentation to eliminate holes. The technology to do this has been well established in the SSD space, yet Gibson worries that the cost of implementation might be too great to enable shingled disks to succeed in the marketplace. "We're not going to be able to pay for it," says Gibson, where the "it" is the controller, which, recall, boosted the cost of SSDs by an order of magnitude over raw NAND flash. The technology is well understood, but it comes at a cost. "We're only keeping up with the areal density, but we're not providing you any more benefit than you expected, so you're not going to pay more." If disks stop decreasing in dollars per gigabyte, then the cost of an SSD becomes less of a disadvantage, and maybe SSDs put disks away. This alone might spur a change in application interface, although the industry is mounting a passionate resistance.
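
A minimal model of the shingled-disk write constraint appears below: each band of overlapping tracks behaves like an append-only zone with a write pointer, so modifying data in place means reading the whole band back and rewriting it sequentially. The band size and interface are illustrative; actual shingled drives and their host interfaces differ in detail.

```python
# Sketch of the shingled (SMR) write constraint: a band of overlapping tracks
# can only be written sequentially at its write pointer; changing earlier data
# means rewriting the whole band. Sizes and interface are illustrative.

class ShingledBand:
    def __init__(self, capacity=256 * 1024 * 1024):  # e.g. a 256 MB band
        self.capacity = capacity
        self.data = bytearray()          # bytes laid down so far
        # The write pointer is simply len(self.data): writes may only append.

    def append(self, payload: bytes):
        if len(self.data) + len(payload) > self.capacity:
            raise IOError("band full: allocate a new band")
        self.data += payload             # sequential write at the write pointer

    def rewrite(self, offset: int, payload: bytes):
        """In-place update is not possible; emulate it by read-modify-rewrite
        of the entire band, which on a real drive costs tens of seconds."""
        whole_band = bytearray(self.data)             # read the band back
        whole_band[offset:offset + len(payload)] = payload
        self.data = bytearray()                       # reset the write pointer
        self.append(bytes(whole_band))                # lay the band down again

band = ShingledBand()
band.append(b"log record 1;")
band.append(b"log record 2;")
band.rewrite(0, b"LOG")    # expensive: rewrites both records, not just 3 bytes
print(bytes(band.data))
```
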
"The world is going to divide up into people who treat NAND flash as slow, cheap memory and those who treat it as very fast but expensive disks; and the disk world, if it survives, will do it by making itself even bigger, even slower, and even more awkward."
Dr. Garth Gibson, Co-Founder and Chief Scientist, Panasas

Dr. Garth Gibson's work at Panasas covers large-scale parallelism in computer systems and its implications for application performance, operating system design, fault tolerance, and data center manageability. Panasas is a scalable storage-cluster company using an object-storage architecture and providing hundreds of terabytes of high-performance storage in a single management domain. Dr. Gibson also concentrates on secondary memory system technologies; parallel and distributed file systems; and local-, storage-, and system-area networking. While working on his Ph.D., Garth co-wrote the seminal Berkeley RAID paper. He is also on the faculty at Carnegie Mellon University, where he founded the Parallel Data Laboratory. Garth also formed the Network-Attached Storage Device working group of the National Storage Industry Consortium, led storage systems research at the Data Storage Systems Center, and founded the Petascale Data Storage Institute for the Department of Energy's Scientific Discovery through Advanced Computing program. Garth's contributions to computer storage have been recognized with the 1999 IEEE Reynold B. Johnson Information Storage Award for outstanding contributions in the field of information storage, inclusion in the hall of fame of the ACM Special Interest Group on Operating Systems, and the 2012 Jean-Claude Laprie Award in Dependable Computing from the IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance.