
The Data Lake: Taking Big Data Beyond the Cloud

by Mark Herman, Executive Vice President, Booz Allen Hamilton, and Michael Delurey, Principal, Booz Allen Hamilton

The bigger that big data gets, the more it seems to elude our grasp. While it holds great potential for creating new opportunities in every field, big data is growing so fast that it is now outpacing the ability of our current tools to take full advantage of it.

Much of the problem lies in the need to extensively prepare the data before it can be analyzed. Data must be converted into recognizable formats, a laborious, time-consuming process that becomes increasingly impractical as data collections grow larger. Although organizations are amassing impressive amounts of data, they simply do not have the time or resources to prepare it all in the traditional manner. This is particularly an issue with unstructured data that does not easily lend itself to formatting, such as photographs, doctors' examination notes, police accident reports, and posts on social media sites. Unstructured data accounts for much of the explosion in big data today, and is widely seen as holding the most promise for creating new areas of business growth and government efficiency. But because unstructured data is so difficult to prepare, its enormous value remains largely untapped.

With such constraints, organizations are now reaching the limits of what they can do with big data. They are going as far as the current tools will take them, but no further. And as big data grows larger, organizations will only be increasingly inundated with information that they have only a narrow ability to use. It is like the line, "Water, water, everywhere..." What is needed is an entirely new approach to this overwhelming flood of data, one that can manage it and make it useful, no matter how big it grows.
That is the concept behind Booz Allen Hamilton's data lake, a groundbreaking invention that scales to an organization's growing data and makes it easily accessible. With the data lake, an organization's repository of information, including structured and unstructured data, is consolidated in a single, large table. Every inquiry can make use of the entire body of information stored in the data lake, and it is all available at once.

The data lake completely eliminates the current cumbersome data-preparation process. All types of data, including unstructured data, are smoothly and rapidly ingested into the data lake. There is no longer any need for the rigid, regimented data structures (essentially, data silos) that currently house most data. Such silos are difficult to connect, which has long hampered the ability of organizations to integrate and analyze their data. The data lake solves this problem by eliminating the silos altogether. With the data lake, it now becomes practical, in terms of time, cost, and analytic ability, to turn big data into opportunity. We can now ask more far-reaching and complex questions, and find the often-hidden patterns and relationships that can lead to game-changing knowledge and insight.

More than the Cloud

With the advent of cloud computing, business and government organizations are now storing and analyzing far larger amounts of data than ever before. But simply bringing a great deal of data together in the cloud is not the same as creating a data lake. Organizations may have embraced the cloud, but if they continue to use conventional tools, they still must laboriously prepare the data and place it in its designated location (i.e., the silo). Despite its promise to revolutionize data analysis, the cloud does not truly integrate data; it simply makes the data silos taller and fatter.

While the data lake relies on cloud computing, it represents a new and different mindset. Big data requires organizations to stop thinking in terms of data mining and data warehouses, the equivalent of industrial-era processes, and to begin considering how data can be more fluid and expansive, like in a data lake. Because it is difficult to integrate data with the conventional approach, even in the cloud, we tend to use the cloud mostly for storage, and remove portions of the data for analysis. But no matter how powerful our analytics are, because we are applying them only to discrete datasets at any one time, we never see the full picture. With the data lake, however, all of our data remains in the cloud, consolidated and connected. We can now apply our analytics to the whole of the data, and get far deeper insights.

Organizations may be concerned that by consolidating their data, they might be making it more vulnerable. Just the opposite is true. The data lake incorporates a granular level of data security and privacy not available in conventional cloud computing.[1]

The data lake was initially created to achieve a high-stakes goal. The US government needed a way to integrate many sources and types of intelligence data, in a secure manner, to search for terrorists and other threats. Booz Allen assisted the government in developing the data lake to achieve that goal, as part of a larger computing framework known as the Cloud Analytics Reference Architecture. The data lake and Cloud Analytics Reference Architecture are now being adapted to the larger business and government communities, bringing with them a range of features that have been successfully tested in the most demanding situations.

[1] See the Booz Allen Viewpoint "Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud," http://www.boozallen.com/media/file/enabling_cloud_analytics_with_data-level_security.pdf

Building the Data Lake

One of the biggest limitations of the conventional approach to data analysis is that analysts often need to spend the bulk of their time just readying the data for use. With each new line of inquiry, a specific data structure and analytic is custom-built. All information entered into the data structure must first be converted into a recognizable format, often a slow, painstaking task. For example, an analyst might be faced with merging several different data sources that each use different fields. The analyst must decide which fields to use and whether entirely new ones need to be created. The more complex the query, the more data sources that typically must be homogenized. At some organizations, analysts may spend as much as 80 percent of their time preparing the data, leaving just 20 percent for conducting actual analysis. Formatting also carries the risk of data-entry errors.

With the data lake, there are no individual data structures, and so there is no need for formal data formatting. Data from a wide range of sources is smoothly and easily ingested into the data lake.
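The schema-free ingest described above can be sketched in a few lines. The sketch below is purely illustrative (the paper does not describe the actual implementation, and all record names are invented): the data lake is modeled as one wide key-value table into which any record, structured or unstructured, is placed without an upfront schema or format conversion.

```python
# Illustrative sketch only: the "single, large table" modeled as a Python dict.
lake = {}  # (row_id, column_name) -> value

def ingest(row_id, record):
    """Place each field of a record into its own cell; no format conversion."""
    for column, value in record.items():
        lake[(row_id, column)] = value

# Structured and unstructured sources go into the same table.
ingest("acct-001", {"balance": 25000, "country": "US"})
ingest("note-17", {"text": "patient reports mild headache", "source": "exam notes"})
ingest("tweet-9", {"text": "loving the new phone", "source": "social media"})

# Every inquiry can see the entire body of information at once.
hits = [key for key, value in lake.items()
        if isinstance(value, str) and "headache" in value]
```

Because there is no fixed schema, adding a new source never requires reworking the table; new cells simply take their place alongside the old ones.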
One metaphor for the data lake might be a giant collection grid, like a spreadsheet, one with billions of rows and billions of columns available to hold data. Each cell of the grid contains a piece of data: a document, perhaps, or maybe a paragraph or even a single word from the document. Cells might contain names, photographs, incident reports, or Twitter feeds; anything and everything. It does not matter where in the grid each bit of information is located. It also makes no difference where the data comes from, whether it is formatted, or how it might relate to any other piece of information in the data lake. The data simply takes its place in the cell, and after minimal preparation is ready for use.

The image of the grid helps describe the difference between data mining and the data lake. If we want to mine precious metals, we have to find where they are, then dig deep to retrieve them. But imagine if, when the Earth was formed, nuggets of precious metals were laid out in a big grid on top of the ground. We could just walk along, picking up what we wanted. The data lake makes information just as readily available.

The process of placing the data in open cells as it comes in gives the ingest process remarkable speed. Large amounts of data that might take 3 weeks to prepare using conventional cloud computing can be placed into the data lake in as little as 3 hours. This enables organizations to achieve substantial savings in IT resources and manpower. Just as important, it frees analysts for the more important task of finding connections and value in the data. Many organizations today are trying to do more with less. That is difficult with the conventional approach, but becomes possible, for the first time, with the data lake.

Opening Up the Data

The ingest process of the data lake also removes another disadvantage of the conventional approach: the need to pre-define our questions. With conventional computing techniques, we have to know in advance what kinds of answers we are looking for and where in the existing data the computer needs to look to answer the inquiry. Analysts do not really ask questions of the data; they form hypotheses well in advance of the actual analysis, and then create data structures and analytics that will enable them to test those hypotheses. The only results that come back are the ones that the custom-made databases and analytics happen to provide.

What makes this exercise even more constraining is that the data supporting an analysis typically contains only a portion of the potentially available information. Because the process of formatting and structuring the data is so time-intensive, analysts have no choice but to cull the data by some method. One of the most prevalent techniques is to discount (and even ignore) unstructured data. This simplifies the data ingest, but it severely reduces the value of the data for analysis.

Hampered by these severe limitations, analysts can pose only narrow questions of the data. And there is a risk that the data structures will become closed-loop systems, echo chambers that merely validate the original hypotheses. When we ask the system what is important, it points to the data that we happened to put in. The fact that a particular piece of data is included in a database tends to make it de facto significant; it is important only because the hypothesis sees it that way.
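The closed-loop problem described above can be made concrete with a small sketch (all records and field names here are invented for illustration). A structure built in advance for one hypothesis keeps only the fields chosen to test it, so a later, unanticipated question cannot be answered; a lake-style store, which culls nothing at ingest, can.

```python
# Hypothetical source records (invented for illustration).
raw_records = [
    {"id": 1, "amount": 120, "region": "east", "note": "rush order, fragile"},
    {"id": 2, "amount": 80,  "region": "west", "note": "repeat customer"},
    {"id": 3, "amount": 150, "region": "east", "note": "fragile, insured"},
]

# Conventional approach: a structure built for one hypothesis
# ("do eastern orders run larger?") keeps only the chosen fields.
hypothesis_table = [(r["region"], r["amount"]) for r in raw_records]

# Lake-style approach: everything is kept; nothing is culled at ingest.
lake = list(raw_records)

# A new question arrives later: which orders mention "fragile"?
# The pre-built structure discarded the notes, so it finds nothing.
answer_from_table = [row for row in hypothesis_table if "fragile" in row]
answer_from_lake = [r["id"] for r in lake if "fragile" in r["note"]]
```

The pre-built table can only echo back the fields the hypothesis considered significant; the full store lets the new question be asked directly.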
With the data lake, data is ingested with a wide-open view as to the queries that may come later. Because there are no structures, we can get all of the data in, all 100 variables, or 500, or any other number, so that the data in its totality becomes available. Organizations may have a great deal of data stored in the cloud, but without the data lake they cannot easily connect it all, and discover the often-hidden relationships in the world around us. It is in those relationships that knowledge and insight and opportunity reside.

Tagging the Data

The data lake also radically differs from conventional cloud computing in the way the data itself is managed. When a piece of data is ingested, certain details, called metadata (or "data about the data"), are added so that the basic information can be quickly located and identified. For example, an investor's portfolio balance (the data) might be stored with the name of the investor, the account number, the location of the account, the types of investments, the country the investor lives in, and so on. These metadata tags serve the same purpose as old-style card catalogues, which allow readers to find a book by searching the author, title, or subject. As with the card catalogues, tags enable us to find particular information from a number of different starting points, but with today's tagging abilities, we can characterize data in nearly limitless ways. The more tags, the more complex and rich the analytics can become. With the tags, we can look for connections and patterns not only in the data, but in the tags as well.

To consider how this technology might be applied, imagine if a pharmaceutical company were able to fully integrate a wide range of public data to identify drug compounds with few adverse reactions and a high likelihood of clinical and commercial success. Those sources might include social media and market data to help determine the need, and clinical test data, chemical structure, disease analysis, even information about patents to find where gaps might exist. In a sense, the pharmaceutical company is looking for a needle in a haystack, a prohibitively expensive and time-consuming task with conventional cloud computing. However, if the structured and unstructured data is appropriately tagged and placed in the data lake, it becomes cost-effective to find the essential connections in all that data, and make the needle stand out brightly.

The data lake allows us to ask questions and search for patterns using either the data itself, the tags themselves, or a combination of both. We can begin our search with any piece of data or tag, for example, a market analysis or the existing patents on a type of drug, and pivot off of it in any direction to look for connections. While the process of tagging information is not new, the data lake uses it in a unique way, as the primary method of locating and managing the data. With the tags, the rigid data structures that so limit the conventional approach are no longer needed.

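The tag-based search described in this section can be sketched as follows. This is a simplified illustration (the tag names and records are invented, and the real system is not described at this level in the paper): each piece of data carries metadata tags, and a query can start from the data content, from the tags, or from a combination of both, then pivot to everything connected.

```python
# Each entry pairs a piece of data with its metadata tags (invented examples).
lake = [
    {"data": "portfolio balance: 25,000",
     "tags": {"type": "balance", "country": "US", "sensitivity": "private"}},
    {"data": "patent US-123 covers compound X",
     "tags": {"type": "patent", "topic": "compound X"}},
    {"data": "market demand rising for X",
     "tags": {"type": "market analysis", "topic": "compound X"}},
]

def search(lake, text=None, **tags):
    """Match on the data itself, the tags, or both at once."""
    out = []
    for entry in lake:
        if text is not None and text not in entry["data"]:
            continue  # data-content filter
        if any(entry["tags"].get(k) != v for k, v in tags.items()):
            continue  # tag filter
        out.append(entry["data"])
    return out

# Pivot from one tag value to every connected piece of data.
related = search(lake, topic="compound X")

# Combine data content and tag constraints in a single query.
patents = search(lake, text="compound", type="patent")
```

Updating a search then means updating tags or filters, not tearing down and rebuilding a data structure.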
Along with the streamlined ingest process, tags help give the data lake its speed. When organizations need to update or search the data in new ways, they do not have to tear down and rebuild data structures, as in the conventional method. They can simply update the tags already in place. Tagging all of the data, and at a much more granular level than is possible in the conventional cloud approach, greatly expands the value that big data can provide. Information in the data lake is not random and chaotic, but rather is purposeful. The tags help make the data lake like a viscous medium that holds the data in place, and at the same time fosters connections.

The tags also provide a strong new layer of security. We can tag each piece of data, down to the image or paragraph in a document, with the relevant restrictions, authorities, and security and privacy levels. Organizations can establish rules regarding which information can be shared, with whom, and under what circumstances.

A New Way of Storing Data

With the conventional approach, data storage is expensive, even in the cloud. The reason is that so much space is wasted. Imagine a spreadsheet combining two data sources, an original one with 100 fields and the other with 50. The process of combining means that we will be adding 50 new columns to the original spreadsheet. Rows from the original will hold no data for the new columns, and rows from the new source will hold no data for the original ones. The result will be a great deal of empty cells. This is wasted storage space, and creates the opportunity for a great many errors. In the data lake, however, every cell is filled; no space is wasted. This makes it possible to store vast amounts of data in far less space than would be required for even relatively small conventional cloud databases. As a result, the data lake can cost-effectively scale to an organization's growing data, including multiple outside sources.
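The storage arithmetic in the spreadsheet example above can be checked directly. In this sketch (the field counts come from the text; the row counts and records are invented for illustration), merging a 100-field source with a 50-field source into one dense table allocates 150 cells per row, most of them empty, while a sparse cell store keeps only the cells that actually hold data.

```python
# Field counts from the example in the text; records are invented.
fields_a = [f"a{i}" for i in range(100)]   # original source: 100 fields
fields_b = [f"b{i}" for i in range(50)]    # second source: 50 fields
all_fields = fields_a + fields_b           # merged spreadsheet: 150 columns

rows_a = [{f: 1 for f in fields_a} for _ in range(10)]  # 10 rows from source A
rows_b = [{f: 2 for f in fields_b} for _ in range(10)]  # 10 rows from source B

# Dense spreadsheet: every row allocates every column, filled or not.
dense_cells = len(all_fields) * (len(rows_a) + len(rows_b))  # 150 * 20

# Sparse cell store: only cells that actually hold data take up space.
sparse_cells = sum(len(row) for row in rows_a + rows_b)      # 100*10 + 50*10

empty_cells = dense_cells - sparse_cells  # wasted space in the dense layout
```

Half the dense layout's cells are empty in this small case, and the waste grows with every additional source that contributes its own set of fields.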
The data lake's almost limitless capacity enables organizations to store data in a variety of different forms, to aid in later analysis. A financial institution, for example, could store records of certain transactions converted into all of the world's major currencies. Or, a company could translate every document on a particular subject into Chinese, and store it until it might be needed.

One of the more transformative aspects of the data lake is that it stores every type of data equally: not just structured and unstructured, but also batch and streaming. Batch data is typically collected on an automated basis and then delivered for analysis en masse, for example, the utility meter readings from homes. Streaming data is information from a continuous feed, such as video surveillance. Formatting unstructured, batch, and streaming data inevitably strips it of much of its richness. And even if a portion of the information can be put into a conventional cloud database, we are still constrained by limited, pre-defined questions. The data lake holds no such constraints. When unstructured, batch, and streaming data are ingested, analytics can take advantage of the tagging approach to begin to look for patterns that naturally emerge. All types of data, and the value they hold, now become fully accessible.

The US military is taking advantage of this capability to help track insurgents and others who are planting improvised explosive devices (IEDs) and other bombs. Many of the military's data sources include unstructured data, and using the conventional approach, with its extensive preparation, had proved unwieldy and time-consuming. With the data lake, the military is now able to quickly integrate and analyze its vast array of disparate data sources, including its unstructured data, giving military commanders unprecedented situational awareness. This is another example of why simply amassing large amounts of data does not create a data lake.
The military was collecting an enormous quantity of data, but without the data lake could not make full use of it to try to stop IEDs. Commanders have reported that the current approach, which has the data lake as its centerpiece, is saving more lives, and at a lower operating cost than the traditional methods.

Accessing the Data for Analytics

One of the chief drawbacks of the conventional approach, which the cloud does not ameliorate, is that it essentially samples the data. When we have questions (or want to test hypotheses), we select a sample of the available data and apply analytics to it. The problem is that we are never quite sure we are pulling the right sample, that is, whether it is really representative of the whole. The data lake eliminates sampling. We no longer have to guess about which data to use, because we are using it all.

With the data lake, our information is available for analysis on demand, when the need arises. The conventional approach not only requires extensive data preparation, but it also makes databases difficult to change as queries change. Say the pharmaceutical company wants to add new data sources to identify promising drug compounds, or perhaps wants to change the type of financial analyses it uses. With the conventional approach, analysts would have to tear down the initial data and analytics structures, and re-engineer new ones. With the data lake, analysts would simply add the new data, and ask the new questions. Because it is not easy to change conventional data structures, the information they contain can become outdated and even obsolete fairly quickly. By contrast, we are able to add new information to the data lake the moment we need it.

This ease in accessibility sets the stage for the advanced, high-powered analytics that can point the way to top-line business growth, and help government achieve its goals in innovative ways. Analytics that search for connections and look for patterns have long been hamstrung by being confined to limited, rigid datasets and databases. The data lake frees them to search for knowledge and insight across all of the data. In essence, it allows the analytics, for the first time, to reach their true potential.

Because there is no need to continually engineer and re-engineer data structures, the data lake also becomes accessible to non-technical subject matter experts. They no longer need to rely on computer scientists and others to explore the data; they can ask the questions themselves. Subject matter experts best understand how big data can provide value to their businesses and agencies. The data lake helps put the answers directly in their hands.

A New Mindset

Virtually every aspect of the data lake creates cost savings and efficiencies, from freeing up analysts to its ability to easily and inexpensively scale to an organization's growing data. Because the data lake enables organizations to gather and analyze ever-greater amounts of data, it also gives them new opportunities for top-line revenue growth. The data lake enables both business and government to reach that tipping point at which data helps us to do things not just cheaper and better, but in ways we have not yet imagined.

Organizations may believe that because they are now in the cloud and can put all their data in one place, they already have a version of the data lake. But greater amounts of data, no matter how large, will not necessarily yield more knowledge and insight. The trick is to connect the data and make it useful; essentially, to create the kinds of conditions that can turn big data into opportunity. The data lake and the larger Cloud Analytics Reference Architecture represent a revolutionary approach, and a new mindset, that make those conditions possible. Opportunity is out there, if we have the tools to look for it.

FOR MORE INFORMATION

Mark Herman, herman_mark@bah.com, 703-902-5986
Michael Delurey, delurey_mike@bah.com, 703-902-6858
www.boozallen.com/cloud

This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this document, please contact: James Fisher, Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com, or Carrie Lake, Manager, Media Relations, 703-377-7785, lake_carrie@bah.com.

MARCH 2013

© 2013 Booz Allen Hamilton Inc. All rights reserved. No part of this document may be reproduced without prior written permission of Booz Allen Hamilton.