IDC TECHNOLOGY SPOTLIGHT

Massively Scalable Enterprise Storage for Big Data

November 2014
Adapted from Worldwide Storage in Big Data 2013-2017 Forecast by Ashish Nadkarni and Laura DuBois, IDC #241319
Sponsored by DataDirect Networks

Runaway costs associated with a growing set of data-intensive workloads have big implications for organizations struggling to manage their infrastructure efficiently. Industries that have embarked on a Big Data journey, such as life sciences, oil and gas, financial services, and manufacturing, are especially exposed to the risks of not addressing these challenges appropriately. IDC believes that one way for organizations to implement an efficient cost management strategy is to rightsize their Big Data infrastructure around massively scalable storage systems. Unlike traditional solutions, massively scalable storage systems are designed to start small and then scale performance and capacity on demand and independently of each other.

An approach championed by several suppliers in recent years leverages a distributed (parallel) file system as the data organization layer. However, such file systems do not typically provide enterprise features such as data availability, resiliency, and protection, or the ease of management enterprises expect, making enterprise adoption challenging. With its GS7K, DataDirect Networks (DDN) is among the suppliers ushering in an era in which one solution provides the simplicity of an all-in-one appliance, the performance of a distributed/parallel file system, and the data management and protection capabilities of an enterprise storage system.

Introduction: Big Data Is the New Normal

Success in business today is data driven. All transactions include some form of digital content, meaning that enterprises must be operationally efficient as they process vast amounts of data in an attempt to gain meaningful information and insight from it.
Data stored in enterprise applications must be linked with indirect and external data sources such as social media streams, clickstream data, Web data, and so on. Competitiveness results from efficiently correlating and analyzing all of these data sources and using the resulting information to make business decisions rapidly.

These business-centric endeavors put pressure on enterprise IT because most business processes today are essentially part of a holistic Big Data loop. Entire compute workloads are dedicated to nothing but analyzing the increasing amounts of data entering or generated by the organization. This is evident in tasks such as energy exploration in the oil and gas industry, analysis of data from manufacturing equipment, social media analysis, and surveillance.

This increase in data-driven workloads has created more challenges for enterprises as they look for ways to better manage IT infrastructure and associated costs. Adding more compute workloads requires more resources that must be efficiently managed, driving up costs. In a mobile and disparate business environment, managing IT from both the opex and the capex perspective is more difficult. One key challenge is that many workloads often translate into siloed infrastructure, which can be inefficient and expensive to run, creating a "house of cards" for the organization: any breakdown or problem in one part of the IT infrastructure can have a cascading effect on the rest.
As enterprises strive to become more agile in today's fast-paced business environment, IT executives are looking for ways to maximize efficiency and better manage costs as the very nature of data and information evolves. Critical to this cost management strategy is rightsizing the infrastructure with components that can scale on demand.

Massively Scalable Storage to Meet the Demands of the Big Data Enterprise

One way in which IT executives are rightsizing their infrastructures is through the use of massively scalable storage systems. These systems grow on demand and scale performance and capacity independently of each other. Because they are network or cloud based, capacity can be increased by adding new drives, even if those drives reside in different storage arrays. Most massively scalable platforms utilize a distributed file system or object storage for data management and span multiple server hosts or controllers while presenting a single namespace. Data sharding and distribution provide massive parallelism not found in traditional storage systems based on a dual-controller RAID architecture. As a result, the initial investment in storage can be lower, and there is no hard limit to the number of arrays that can be added. The linear scaling of massively scalable storage means the infrastructure does not buckle under pressure as the number of workloads increases.

Before massively scalable storage systems were available, enterprises often purchased very large storage arrays to ensure that plenty of disk space would be available for future expansion. If that expansion never occurred, or turned out to be smaller than expected, much of the originally purchased disk space went to waste. In addition, simply throwing high-capacity storage at the massive compute infrastructures needed for Big Data can create a tollgate effect.
If the scaling of the compute infrastructure does not match the scaling of storage capacity, bottlenecks are created, resulting in congestion of the storage system. For example, if a compute system is sending or accessing data on the storage system, and the storage system is unable to perform the task because it is busy handling data from another source, then the compute system has to wait, and that waiting defeats the purpose of the large storage footprint. This bottleneck is exacerbated when I/O-intensive applications are virtualized. As businesses implement more Big Data workloads, they have to increase their storage footprint, which in turn reduces utilization efficiency. And as more businesses move to I/O-intensive analytics environments, they are forced to move the compute layer closer to the data layer, eliminating the benefits of a distributed computing environment. All aspects of an IT system (front end, middle tier, and back end) need to be in lockstep with each other for optimum efficiency.

IDC sees five key business benefits to using massively scalable solutions because of their ability to increase capacity and/or performance seamlessly and on demand:

- Storage costs are lowered because enterprises can reduce their initial capital outlay in depreciating assets, creating a pay-for-only-what-you-use business model.
- Storage can become an operational expense if an organization chooses a service delivery model for massively scalable storage. Whether storage is a hard asset or a service, the organization still pays for only what it needs.
- Because storage costs historically have decreased over time, an organization can purchase lower-cost resources or adjust storage-as-a-service rates annually; industry savings can be passed on to the enterprise.
- Deployment of massively scalable storage is rapid. Initial setup can happen in weeks, not months, and ongoing incremental improvements can occur overnight.
- From a flexibility standpoint, organizations can quickly scale storage resources up or down to increase business agility.
Massively scalable storage can also reduce dependence on tape drives and improve backup times dramatically, especially when used in conjunction with an object storage platform. Object storage eliminates the linear access pattern of tape by placing data on a lower-cost storage tier, so only the data required for a task is brought to the active tier for analysis. As a result, enterprises can minimize or even completely eliminate tape infrastructure as well as tape-handling and tape-collection services. While tape might be cheap, it introduces real risks: the data is not always online, it requires a library, and if a tape is lost or corrupted, the data on it is lost.

In addition, most enterprises need to retain data for long periods for regulatory, legal, or general business/data analytics reasons. As a result, a significant amount of today's primary storage houses data that is inactive or obsolete for day-to-day use. Traditional storage arrays lump this infrequently needed data together with information that is needed regularly and rapidly, creating a further drag on performance. Scale-out technology is one way to use cloud storage as a parking lot for older data that must be retained, either to adhere to regulatory/compliance policies or because a firm is not confident it can delete the data.

Overcoming Challenges by Implementing Massively Scalable Storage Systems in the Enterprise

While the benefits of massively scalable storage systems are many, typical solutions do have challenges. Traditionally, these solutions are designed to provide raw power but do not offer the automated management capabilities or user interfaces of other storage solutions. Tuning and management require expertise and dedicated resources because massively scalable storage systems can be very complex. Power also can come at the expense of data resiliency, protection, and availability.
For massively scalable systems to succeed in the enterprise, therefore, they need to pair the benefits of a parallel file system with enterprise capabilities. It is important, for example, that the system be linearly scalable, and IT must be able to add or remove capacity and performance independently of each other to rightsize the storage solution. In addition, a massively scalable storage solution must offer the features of traditional enterprise storage, namely ease of management, cloning, backup, data snapshots, and so on. Most important is appliance-like simplicity in scaling out capacity.

Even so, massively scalable storage solutions are necessary for a Big Data enterprise. Unlike traditional physical disk systems, massively scalable systems can use object storage as a tertiary persistent storage layer, enabling cloud connectivity, and their inherent data-tiering capabilities bring the compute and data layers more in line with each other, increasing performance and cost efficiency. These solutions are especially important because it is difficult to predict exactly how much storage an enterprise will need in an era of Big Data; massively scalable storage enables IT to remain nimble by allowing it to rightsize storage as the organization changes.

Because scale-out architectures are software based, they can be delivered in multiple ways. The solution can be offered as software only, with the enterprise providing the physical storage resources; as a custom solution, with both software and storage provided either as a service or on premises; or, as many enterprise IT suppliers now do, packaged on a storage appliance. All of these approaches are designed to let enterprise storage systems be adapted easily to individual organizational needs while making their raw power more manageable. In addition, these solutions move the computing and data storage fabrics closer to each other to provide a holistic service that improves IT efficiency in a Big Data world.
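The scale-out behavior described in this section (many nodes behind a single namespace, with capacity added by adding nodes) can be illustrated with a minimal sketch. The node names, chunk size, and hash-modulo placement below are purely illustrative assumptions; production parallel file systems use far more sophisticated striping, replication, and rebalancing logic.

```python
import hashlib

CHUNK_SIZE = 4  # bytes per chunk for the demo; real systems stripe at MB scale

def node_for(key: str, nodes: list) -> str:
    """Map a chunk key to a node with a stable hash."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

def place_chunks(path: str, data: bytes, nodes: list):
    """Split a file into chunks and spread them across nodes.

    Every chunk keeps the same logical path (one namespace), but the
    chunks land on different nodes, so reads and writes can proceed
    in parallel across the cluster instead of queuing behind a
    single dual-controller array.
    """
    return [
        (node_for("%s:%d" % (path, i // CHUNK_SIZE), nodes),
         data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]

# Hypothetical cluster: start small, then scale out by adding a node.
nodes = ["node-a", "node-b", "node-c"]
layout = place_chunks("/projects/run42.dat", b"ABCDEFGHIJKL", nodes)
for node, chunk in layout:
    print(node, chunk)

nodes.append("node-d")  # capacity grows by adding another building block
layout2 = place_chunks("/projects/run42.dat", b"ABCDEFGHIJKL", nodes)
```

Note that naive modulo placement remaps many chunks when a node is added; real scale-out systems use consistent hashing or explicit layout maps so that growing the cluster moves only a fraction of the data.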
Considering DDN

Santa Clara, California-based DDN is a provider of massively scalable storage solutions. DDN recently announced the GS7K, a scale-out parallel file system appliance based on the company's Storage Fusion Architecture (SFA). The appliance is designed with smaller enterprise workgroups in mind, providing Big Data storage capabilities in a smaller building-block format. Performance and capacity can be added with more GS7K building blocks, eliminating the time and complexity involved in building in-house solutions and helping organizations grow within financial or project constraints. The GS7K modular appliance offers the full set of capabilities of DDN's high-end SFA12K block storage platform together with the performance and scalability of IBM's GPFS file system.

The GS7K is designed for enterprise workgroups that need to manage Big Data, including lab and research environments, manufacturing, oil and gas, life sciences, and government. These workgroups can start with a single 4U base appliance and add preconfigured building blocks as necessary. For example, a 4U 60-drive base appliance combined with a 4U 84-drive expansion chassis can provide 11GBps of throughput in 8U of rack space. Additional performance requires adding a new GS7K with its own 4U 60-drive base appliance and 4U 84-drive expansion.

By adding its SFA technology to the performance and scalability of GPFS, DDN provides easier-to-implement enterprise-level features in its GS7K appliance, including usability features such as policy-driven snapshots and rollback as well as integrated backup. Snapshots can be used to protect the file system's contents against user error by preserving a point-in-time version of all or a portion of the file system. Integrated backup, which requires the addition of optional software such as DDN DirectProtect, ensures the reliability of the appliance.
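Taking the figures quoted above at face value (11GBps from a 60-drive base appliance plus an 84-drive expansion chassis in 8U), the building-block sizing model reduces to simple arithmetic. Near-linear scaling as blocks are added is an assumption of this sketch, consistent with the scale-out design described here, and the 40GBps target is a hypothetical workgroup requirement.

```python
import math

# Per-building-block figures quoted in the text: one GS7K base
# appliance (4U, 60 drives) plus one expansion chassis (4U, 84
# drives) delivers ~11GBps in 8U of rack space.
BLOCK_THROUGHPUT_GBPS = 11
BLOCK_RACK_UNITS = 8
BLOCK_DRIVES = 60 + 84

def blocks_needed(target_gbps: float) -> int:
    """Number of building blocks required to reach a target
    aggregate throughput, assuming near-linear scaling."""
    return math.ceil(target_gbps / BLOCK_THROUGHPUT_GBPS)

target = 40  # GBps required by a hypothetical workgroup
n = blocks_needed(target)
print("%d blocks -> %d GBps, %dU of rack space, %d drives"
      % (n, n * BLOCK_THROUGHPUT_GBPS, n * BLOCK_RACK_UNITS, n * BLOCK_DRIVES))
```

The point of the model is that capacity and performance grow in known increments, so a workgroup can buy only the blocks a project needs rather than overprovisioning a monolithic array up front.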
Other key usability features include storage quotas for users, groups, or individual file sets so that administrators can maintain proactive control over the shared system. With the added management capabilities provided by DDN, the GPFS-based appliance includes synchronous replication of data and metadata as well as powerful data-tiering capabilities. It provides policy-based migration of data between tiers, from SSD and SAS disks to tape or the Web Object Scaler (WOS) cloud, enabling the creation of multiple storage pools for extremely efficient automated data placement. DDN's hyperscale WOS technology enables Web-scale storage clouds, cost-efficient active archives, and real-time global collaboration infrastructures.

Challenges

DDN does face challenges. First, while the benefits of massively scalable storage are many, they are not well known outside the technical computing community. Expect additional storage suppliers to deliver solutions that directly compete with the GS7K in terms of scalability and ease of use. It is imperative that DDN continue to stress its leadership and successes in delivering massively scalable storage solutions. In addition, DDN has targeted key markets for its GS7K appliance. For the near term, the company should focus solely on these opportunities, establishing a reputation in these Big Data areas; it can then use its successes to grow into other markets.

Conclusion

Today, all companies are faced with the issue of data growth. More storage will be required as enterprises embrace Big Data, mobility, social computing, and virtualization. While the rate of growth may differ from company to company, there is no doubt that organizations will have to consider new approaches to dealing with their data, particularly as they reach the limits of the data interfaces of traditional scale-up storage architectures.
And as companies continue their adoption of analytics, enterprise-grade massively scalable storage is increasingly becoming a necessity.
It is important that enterprises completely understand their needs, resources, and expertise before selecting a solution. IDC believes that when evaluating massively scalable storage solutions, enterprises should keep the following fundamental elements in mind:

- Scalability must be considered not just from a hardware perspective but also from throughput, file size, and file volume perspectives. Appropriate solutions will allow each dimension to scale independently.
- Enterprise-grade capabilities are a big consideration. Data layout and organization may have performance, efficiency, and availability implications, and features such as data protection, resiliency, and availability as well as cloud connectivity are important.
- The larger the data set and the bigger the storage system, the greater the need for data management and reduction techniques (data deduplication, compression, thin provisioning, etc.), so efficiency is critical. Data optimization technologies such as automated data tiering will also be essential. A solution appropriate for a given environment will allow many of these features to be implemented and recalibrated without major disruption.
- Enterprises need to work with suppliers that provide targeted solutions for their Big Data workloads and workgroup sizes. Organizations should evaluate the data model, infrastructure, workload, and quality-of-service characteristics of a supplier's scale-out solutions.

Most important, enterprises should examine the strategic role massively scalable storage solutions will play in their IT infrastructure, particularly with respect to functionality and investment protection. With the widespread changes in how businesses collect, analyze, store, and manage data, massively scalable storage is a viable alternative to traditional scale-up approaches. To the extent that DDN can meet the challenges described previously, the GS7K has a significant opportunity for success with enterprises looking for a massively scalable storage solution.
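The efficiency point above is often evaluated as an effective-capacity calculation: data reduction multiplies how much logical data a given raw purchase can hold. The sketch below uses illustrative ratios, not measured figures, and ignores overheads such as RAID parity and file system metadata.

```python
def effective_capacity_tb(raw_tb: float, dedupe_ratio: float,
                          compression_ratio: float) -> float:
    """Logical capacity after data reduction.

    A 2:1 deduplication ratio combined with a 1.5:1 compression
    ratio means each raw terabyte stores 2 * 1.5 = 3 logical
    terabytes. Overheads (RAID parity, metadata, spares) are
    deliberately ignored in this sketch.
    """
    return raw_tb * dedupe_ratio * compression_ratio

# Illustrative comparison: the same 200TB raw purchase with and
# without data reduction enabled (ratios are assumptions).
raw = 200
print(effective_capacity_tb(raw, 1.0, 1.0))  # no reduction: 200.0
print(effective_capacity_tb(raw, 2.0, 1.5))  # with reduction: 600.0
```

Reduction ratios are highly workload dependent (already-compressed media dedupes poorly, for example), which is why IDC's advice to understand the data set before selecting a solution applies here as well.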
ABOUT THIS PUBLICATION

This publication was produced by IDC Custom Solutions. The opinion, analysis, and research results presented herein are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor sponsorship is noted. IDC Custom Solutions makes IDC content available in a wide range of formats for distribution by various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

COPYRIGHT AND RESTRICTIONS

Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires prior written approval from IDC. For permission requests, contact the IDC Custom Solutions information line at 508-988-7610 or gms@idc.com. Translation and/or localization of this document require an additional license from IDC. For more information on IDC, visit www.idc.com. For more information on IDC Custom Solutions, visit http://www.idc.com/prodserv/custom_solutions/index.jsp.

Global Headquarters: 5 Speen Street, Framingham, MA 01701 USA  P.508.872.8200  F.508.935.4015  www.idc.com

2014 IDC