Enterprise Data Lake Platforms: Deep Storage for Big Data and Analytics

Insight

Ashish Nadkarni, Laura DuBois

IDC OPINION

In the past 18 months or so, the term data lake has surfaced as yet another phrase that attempts to describe a large repository of unstructured data. Other terms used in the industry have included content depots, content repositories, and object stores and, of course, the well-used moniker Big Data. Given IDC's previous definition of Big Data, there is an obvious affinity or overlap between Big Data repositories and data lakes. With this document, IDC seeks to define data lakes and enterprise data lake platforms (EDLPs) and to distinguish them from other similarly descriptive terms.

In the Worldwide Storage in Big Data Forecast Update (IDC #244959, December 2013), IDC noted that revenue associated with storage for the Big Data market will surpass $6 billion in [...]. IDC expects a significant portion of this revenue to be associated with unstructured and semistructured data collected and collated from different sources into large repositories. These Big Data repositories offer several benefits over traditional NAS systems, which simply store unstructured data and limit how that data can be ingested or accessed. However, many Big Data repositories also restrict data ingest and access, forcing the data to be formatted and moved to the platform for analytics.

Data lakes, on the other hand, are a corpus of unstructured and semistructured data collected and collated from different sources into a large repository. More crucially, data lakes support the ingest, persistent storage, and access of data in a manner agnostic to how it is moved into the repository, and in a manner that makes it easier for adjacent Big Data workloads to perform in-place analytics on the data in this repository. Specifically, the data is stored using open standards rather than proprietary formats.
Storage platforms that support data lakes (known as enterprise data lake platforms) give firms the opportunity to use multiple analytics tools (such as Hadoop) to analyze the data concurrently.

IDC expects that macro trends such as the Internet of Things, associated with the move to the 3rd Platform era, will continue to push more and more businesses to adopt EDLPs for their Big Data repositories and:

- Consolidate their Big Data storage islands into a single repository for unstructured, structured, and semistructured data (including data stored in Hadoop- and NoSQL-friendly formats). They can leverage a single platform that supports upstream (flash) tiering for analysis of hot data sets and downstream (cloud, cold storage) tiering for inactive data sets.
- Use a "data in place" model that can concurrently service various Big Data workloads, such as Hadoop, via their native access mechanisms, without moving data to the compute layer and back. In essence, they can take a step closer to replacing their enterprise data warehouse (EDW) with an enterprise data lake (EDL) that stores data in an open, multiple-access format.

July 2014, IDC #250000

IN THIS INSIGHT

This IDC Insight provides IDC's perspective on enterprise data lake platforms. The concept is still new and, in IDC's view, does not yet warrant a comprehensive taxonomy. However, given the characteristics of systems that are designed to support EDLPs, IDC contends that the relevant definitions and taxonomies, when developed, will align closely with IDC's Worldwide File- and Object-Based Storage Taxonomy, 2014 (IDC #245940, January 2014).

SITUATION OVERVIEW

In a recent document that outlined the growth of unstructured data in the enterprise, IDC noted that suppliers shipped nearly 52EB of capacity in 2013, corresponding to nearly $36.1 billion in revenue. In IDC's estimates, unstructured data accounted for 64.6% of the total capacity shipped but only 46.4% of the total revenue. IDC estimates that by the end of 2015, as far as enterprise disk storage systems are concerned, unstructured data will surpass structured data in terms of both capacity shipped and customer revenue (unstructured data already occupies the lion's share of cloud-based storage):

- Much of the unstructured data growth in the traditional enterprise will come from end-user computing, an ecosystem that is slowly shifting toward BYOD devices like smartphones and tablets. Chiefly, however, in many industries, this data growth will come from devices, sensors, and "connected things" that are spewing analytics-rich (unstructured) data sets onto the enterprise disk storage infrastructure. Much of this infrastructure is capacity optimized, meaning it is tuned for storing large quantities of data at low dollar-per-gigabyte costs.
- A new hybrid data type known as "semistructured" data will slowly gain a foothold in many industries that are pursuing the "Internet of Things" (i.e., the proliferation of sensor and machine-generated data).
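The 2013 capacity and revenue shares quoted above imply a noticeably lower average price per gigabyte for unstructured data, which is consistent with the remark that much of this infrastructure is capacity optimized. The following is a rough back-of-the-envelope sketch in Python, not an IDC calculation. It assumes decimal units (1 EB = 10^9 GB), treats everything that is not unstructured as a single "other" category, and uses the "nearly" figures from the text as if they were exact.

```python
# Figures quoted in the text; both totals are "nearly" values, so results are approximate.
TOTAL_CAPACITY_EB = 52.0            # capacity shipped in 2013, in exabytes
TOTAL_REVENUE_BUSD = 36.1           # 2013 revenue, in billions of US dollars
UNSTRUCTURED_CAPACITY_SHARE = 0.646
UNSTRUCTURED_REVENUE_SHARE = 0.464

GB_PER_EB = 1e9  # assumption: decimal units (1 EB = 10^9 GB)


def dollars_per_gb(revenue_busd: float, capacity_eb: float) -> float:
    """Average revenue per gigabyte shipped."""
    return (revenue_busd * 1e9) / (capacity_eb * GB_PER_EB)


unstructured_capacity_eb = TOTAL_CAPACITY_EB * UNSTRUCTURED_CAPACITY_SHARE
unstructured_revenue_busd = TOTAL_REVENUE_BUSD * UNSTRUCTURED_REVENUE_SHARE

# Assumption: whatever is not unstructured is lumped together as "other".
other_capacity_eb = TOTAL_CAPACITY_EB - unstructured_capacity_eb
other_revenue_busd = TOTAL_REVENUE_BUSD - unstructured_revenue_busd

unstructured_cost = dollars_per_gb(unstructured_revenue_busd, unstructured_capacity_eb)
other_cost = dollars_per_gb(other_revenue_busd, other_capacity_eb)

print(f"Unstructured: {unstructured_capacity_eb:.1f} EB, ${unstructured_cost:.2f}/GB")
print(f"Other:        {other_capacity_eb:.1f} EB, ${other_cost:.2f}/GB")
```

Under these assumptions, unstructured capacity works out to roughly half the average revenue per gigabyte of the remaining capacity.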
IDC also expects that unstructured and semistructured data placement techniques will undergo a shift as well, moving from unitary file systems to distributed (scale-out), seamlessly extensible, and unified file- and object-based systems. By leveraging server-side technologies like PCIe flash, enterprises will also be able to realize the value of this data by analyzing it continuously and on demand.

Today, businesses are forced to select a suitable storage platform for each of their Big Data and unstructured data storage workloads. This selection has to be made during the design stage, and it largely locks them into that platform. While selecting the platform, they have to:

- Choose between a file-based and an object-based platform.
- Choose a single access mechanism up front for feeding data into the platform.
- Restrict their applications to the metadata limitations of the access protocol or mechanism.
- Make the painful choice between moving data to compute or vice versa, especially for large data sets that are incompatible with each other.
- Create separate storage infrastructure islands for different workloads such as Hadoop, search, and discovery. This results in a fragmented infrastructure, the antithesis of where the IT industry wants to go with converged and densely utilized infrastructure.

Other solutions create "islands" of storage that are difficult and costly to manage, resulting in hot spots, inefficiencies, and poor storage utilization. They also require heavy lifting to scale. For example, many businesses that deploy Hadoop have to implement a separate repository for Hadoop and move data that requires MapReduce operations to this repository. Not only does this become inefficient as the size of the data sets increases, but it also creates multiple copies of the same data set, something that IDC believes businesses are already grappling with (see The Copy Data Problem: An Order of Magnitude Analysis, IDC #239875, March 2013, and The Copy Data Management Challenge: 65% of External Storage Capacity Is Used for Copy Data, IDC #lcus , January 2014).

Data Lakes and Enterprise Data Lake Platforms

Like Big Data repositories, data lakes can be thought of as a corpus of unstructured and semistructured data collected and collated from different sources into a single unified data pool (hence the term data lake). A data lake offers multiple access points for data "on-ramping," meaning support for standard network access protocols (NFS, CIFS, pNFS) as well as RESTful object interfaces by which applications can write data into the repository. More crucially, a data lake stores the data in a manner agnostic to how it is moved into the repository and in a manner that makes it easier for adjacent Big Data workloads to analyze it. Specifically, the data is stored using open standards rather than proprietary formats.
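The idea of storing data once and reaching it through several access points can be illustrated with a toy model. The Python sketch below is purely illustrative: `ToyDataLake` and its method names are hypothetical, it keeps everything in memory, and it does not speak NFS, HDFS, or any real object protocol. It only shows a single stored copy being written through a file-style path and read back through an object-style key, with metadata that can be extended after ingest.

```python
from dataclasses import dataclass, field


@dataclass
class StoredItem:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyDataLake:
    """Hypothetical in-memory stand-in for a data lake; not a real product API."""

    def __init__(self):
        # One copy of each item, keyed by a protocol-neutral name.
        self._items = {}

    # File-style ingest and access: callers address data by path.
    def write_file(self, path, data):
        self._items[path.strip("/")] = StoredItem(data, {"ingested_via": "file"})

    def read_file(self, path):
        return self._items[path.strip("/")].data

    # Object-style ingest and access: callers address data by container and key.
    def put_object(self, container, key, data):
        self._items[f"{container}/{key}"] = StoredItem(data, {"ingested_via": "object"})

    def get_object(self, container, key):
        return self._items[f"{container}/{key}"].data

    # Metadata is extensible regardless of which interface wrote the data.
    def tag(self, name, **extra):
        self._items[name.strip("/")].metadata.update(extra)

    def metadata(self, name):
        return dict(self._items[name.strip("/")].metadata)


lake = ToyDataLake()
lake.write_file("/sensors/2014-07-01.csv", b"id,temp\n1,21.5\n")

# The same bytes are readable through the object-style interface, with no copy or export.
via_object = lake.get_object("sensors", "2014-07-01.csv")
lake.tag("sensors/2014-07-01.csv", source="thermostat")

print(via_object == lake.read_file("/sensors/2014-07-01.csv"))
print(lake.metadata("sensors/2014-07-01.csv"))
```

The design point is that the interfaces are thin front ends over one namespace, so adding another access path does not create another copy of the data.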
In enterprises, data lakes can be considered a central "deep storage" repository for consolidating different types of unstructured, semistructured and, to some extent, structured data. Enterprise data lake platforms, or EDLPs, are seen as a solution to the data deluge and access conundrum faced by enterprises, a problem that cannot be solved using Big Data repositories built on a single platform like Hadoop. Akin to enterprise data warehouses, EDLPs allow disparate and incoherent data types to be consolidated onto a single, scalable, extensible, and agile storage platform. IDC expects that most EDLPs will need to support:

- Multiformat, multiprotocol data ingest and access. EDLPs need to allow data to be ingested (i.e., placed on them) via a variety of file, object, and even block interfaces that include, but are not limited to, NFS, pNFS, SMB, NDMP, HDFS, or RESTful object interfaces (such as OpenStack Swift, Amazon S3, and CDMI) by which applications can write data into the repository. Access mechanisms can be open, standards based or, where required, application specific. The expectation then is that this same data should be consumable (read: accessible) via different mechanisms or interfaces without the need to copy, replicate, or export it (into a different format).
- Access-agnostic storage. EDLPs should not make any presumptions about the manner in which the data is ingested or accessed; the two mechanisms could be completely different. For example, data ingested via NFS could be accessed via HDFS or via an API. This also means that, unlike typical file-based storage platforms, EDLPs should make their metadata extensible and programmatically accessible, beyond the normal expectations of a specific access interface or API. EDLPs can therefore be suitable for both traditional workloads such as home directories, file shares, sync-and-share applications, and Hadoop, as well as next-generation business and social analytics and cloud and mobile applications.
- Deep storage with "infinite" scalability and efficiency. EDLPs should support unprecedented nondisruptive scalability and agility. EDLPs should also support efficiency, both for upstream and downstream tiering. The platform should make use of upstream (flash) tiering that supports efficient analytics on hot data sets and downstream (cloud, cold storage) tiering for storing inactive data sets but, crucially, should support data movement between tiers depending on I/O activity and decay. In other words, highly active data is placed on a tier optimized for performance (dollar per IOPS), while inactive data is placed on a tier optimized for capacity (dollar per gigabyte). EDLPs also need to have built-in data optimization, protection, and availability mechanisms that exceed the service levels established for the workloads operating on them. As a consolidated repository, it is essential that the platform also have a robust built-in data loss prevention (DLP) mechanism.
- Support for the three As of security. Because most EDLPs will need to store sensitive data sets, it is essential that they support robust authorization, audit, and authentication mechanisms for users and applications. They also need to support inline and at-rest data encryption.

With a platform that acts like an enterprise data lake, businesses can minimize fragmentation and gain better and more consistent insight into all of their data.

FUTURE OUTLOOK

IDC believes that EDLPs will become a core part of enterprise storage infrastructure in the coming years.
As businesses learn to collate data from various sources and convert it into consumable nuggets of information for their various organizational units, they will no doubt be compelled to establish enterprisewide data lakes upon which various workloads can concurrently operate. Such data lakes will enable existing workloads and be future-proof enough to support new applications and workloads seamlessly. It is evident from recent announcements that suppliers like EMC are beginning to position their scale-out file and object platforms, such as Isilon, to support data lakes as the next stop on a journey to make their platforms ready for the next wave of Big Data, social, mobile, and cloud applications. EDLPs set the stage for an extension and perhaps an eventual unification of the scale-out file and object market segments. This unification will come at the expense of the traditional scale-up unitary file server and file-based storage (NAS) market segments, both of which will shrink further as businesses consolidate their file-based and object-based data onto deep storage repositories, also known as enterprise data lakes.
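The tiering behavior described earlier for deep storage (hot data on a performance tier, inactive data on a capacity tier, with movement driven by I/O activity and decay) can be sketched as a simple policy. The Python example below is a minimal illustration under stated assumptions, not a description of any vendor's implementation: the tier names, the thresholds, the decay factor, and the functions `record_io` and `end_of_interval` are all hypothetical.

```python
from dataclasses import dataclass

# Assumed policy parameters; real platforms would tune or derive these.
HOT_THRESHOLD = 10.0    # at or above this activity score, place data on flash
COLD_THRESHOLD = 1.0    # below this activity score, move data to cold storage
DECAY = 0.5             # fraction of the activity score kept after each interval


@dataclass
class DataSet:
    name: str
    heat: float = 0.0   # decayed I/O activity score
    tier: str = "disk"  # one of "flash", "disk", "cold"


def choose_tier(heat: float) -> str:
    """Map an activity score to a tier: performance for hot data, capacity for cold."""
    if heat >= HOT_THRESHOLD:
        return "flash"
    if heat < COLD_THRESHOLD:
        return "cold"
    return "disk"


def record_io(ds: DataSet, operations: int) -> None:
    """Account for I/O operations observed against the data set."""
    ds.heat += operations


def end_of_interval(ds: DataSet) -> None:
    """Decay the activity score and re-evaluate placement."""
    ds.heat *= DECAY
    ds.tier = choose_tier(ds.heat)


logs = DataSet("clickstream-2014-07")
record_io(logs, 40)          # a burst of analytics activity
history = []
for _ in range(6):           # six quiet intervals follow
    end_of_interval(logs)
    history.append(logs.tier)

print(history)
```

With these assumed parameters, the data set is promoted to flash right after the burst and then drifts down through disk to cold storage as its activity score decays.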

About IDC

International Data Corporation (IDC) is the premier global provider of market intelligence, advisory services, and events for the information technology, telecommunications, and consumer technology markets. IDC helps IT professionals, business executives, and the investment community make fact-based decisions on technology purchases and business strategy. More than 1,100 IDC analysts provide global, regional, and local expertise on technology and industry opportunities and trends in over 110 countries worldwide. For 50 years, IDC has provided strategic insights to help our clients achieve their key business objectives. IDC is a subsidiary of IDG, the world's leading technology media, research, and events company.

Global Headquarters
5 Speen Street
Framingham, MA USA
idc-insights-community.com

Copyright Notice

This IDC research document was published as part of an IDC continuous intelligence service, providing written research, analyst interactions, telebriefings, and conferences. Please contact sales@idc.com for information on applying the price of this document toward the purchase of an IDC service or for information on additional copies or Web rights. Copyright 2014 IDC. Reproduction is forbidden unless authorized. All rights reserved.