HOW THE DATA LAKE WORKS

by Mark Jacobsohn, Senior Vice President, Booz Allen Hamilton
and Michael Delurey, EngD, Principal, Booz Allen Hamilton

As organizations rush to take advantage of large and diverse data sets, many find they simply cannot keep up with the exponential growth in the volume, velocity and variety of information today. So much data is coming in at such an overwhelming rate that organizations with conventional approaches to data storage and management cannot hope to capture it all, much less process it all. Inevitably, some of the most valuable information, particularly unstructured data, gets left on the cutting-room floor. And organizations have no way of knowing how much critical knowledge and insight is being lost.

To meet this challenge, Booz Allen Hamilton pioneered the Data Lake, a completely new approach that not only manages the volume, velocity and variety of data, but actually becomes more powerful as all three aspects increase. What makes this possible is a transformative shift from schema-on-write to schema-on-read. With schema-on-write, which underlies the process known as extract, transform and load (ETL), it is necessary to design the data model and analytic frameworks before any data is loaded. This means we need to know in advance how we might use our data in the future, a kind of catch-22 that severely limits the scope and value of our inquiries. With schema-on-read, however, we can call upon the data for analysis as needed. The frameworks are created ad hoc and iteratively, for whatever purpose we have in mind, with only minimal preparation.

This fundamental change in approach has far-reaching implications. Business and government organizations are discovering that the larger and more diverse their data, the less effective ETL becomes. Analysts often must spend the bulk of their time simply creating the frameworks, preparing the data and maintaining the infrastructure.
As lines of inquiry inevitably change, the frameworks must be torn down and rebuilt, data must be re-ingested and re-indexed, and schemas must be updated, again at great effort. And the frameworks themselves are difficult to connect, hampering the ability of organizations to integrate and analyze their data.

© 2014 Booz Allen Hamilton Inc. All rights reserved. No part of this document may be reproduced without prior written permission of Booz Allen Hamilton.

[Figure 1: The data cell within the Data Lake. The key comprises the Row ID, the column (tag, group tag and visibility) and the time stamp; together they point to the value. Source: Booz Allen Hamilton]

The Data Lake's schema-on-read eliminates these and many other constraints of ETL, enabling organizations to draw full value from their data, no matter how large it grows. Data can be loaded first, then transformed and indexed iteratively as organizational understanding of the data improves. The Data Lake uses a key/value store, an innovative approach founded on schema-on-read. With the key/value store, all relevant information associated with a piece of data is stored with that item in the form of metadata tags. These tags make it possible to store and manage vast amounts of data of all types and have it immediately available for analysis. This ability, coupled with the Data Lake's inexpensive storage running on commodity hardware, enables organizations to add a virtually unlimited number of new data sources at minimal risk.

USING THE KEY/VALUE STORE

With the Data Lake, an organization's entire repository of data is entered into a giant table and organized through the metadata tags. Each piece of data, such as a name, a photograph, an incident report or a Twitter feed, is placed in an individual cell. It does not matter where in the Data Lake any piece of data is located, where it comes from, or how it might be formatted. Because all of the data can easily be connected through the tags, the time-intensive frameworks of ETL are no longer necessary. Tags can evolve and be added or changed as analytic needs change; this is a fundamental difference between a relational database, which requires a predefined schema, and the Data Lake. Four different types of tags essentially serve as pointers to the data within the cell: the primary tag, the tag group, the time stamp and the Row ID.
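Abstracting from Figure 1, a cell can be modeled as a small record whose key fields travel with the value. The Python sketch below is illustrative only; the field names follow the figure, and in the actual Data Lake these cells live in the key/value store itself rather than in application code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    """One entry in the key/value store, mirroring Figure 1.

    The key is (row_id, tag, tag_group, visibility, timestamp);
    everything needed to locate, group and protect the value
    travels with the value itself as metadata.
    """
    row_id: str                 # groups related cells (e.g., one investor)
    tag: str                    # primary tag: what kind of data this is
    tag_group: str              # organizing principle for the tags
    visibility: str             # logical expression governing access
    timestamp: Optional[str]    # optional, drawn from the source data
    value: str                  # the data itself

# A name and a stock sale for the same investor share a Row ID:
name = Cell("1", "Name", "Investor Information", "claims|legal", None, "John Doe")
sale = Cell("1", "Sales", "Transactions", "claims|legal",
            "9/17/2013 10:43 AM", "300 Shares ABBC Stock")

print(name.row_id == sale.row_id)  # both cells belong to the same entity
```

Because each cell is self-describing, nothing about the table as a whole needs to be declared before these two very different pieces of data are stored side by side.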
In addition, the cell contains information on visibility, presented in a logical expression that governs who has access to the data in the cell and under what circumstances. Figure 1 shows how this information is structured within an individual cell.

To help show how these tags work together in practice, the data and tags are represented in rows in the example shown in Figure 2. Here, a variety of data about an investor is entered into the Data Lake, such as personal information and stock transactions. The first column shows the actual data. The second column, with the primary tag, identifies the type of data in the cell, such as name, birthdate, account number, etc. The tags themselves can be organized into groups; in this case, the groups might be "Investor Information" and "Transactions." There can be any number of primary tags and tag groups, and they do not have to be defined before data is ingested. The time stamp tag uses information embedded with the data in the original data source, here the time and date of the various stock transactions. The time stamp helps to distinguish different versions of a similar activity. Not all data entered into the Data Lake is accompanied by time and date information, so in those cases a time stamp tag would not be applicable.
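The versioning role of the time stamp can be sketched in a few lines of Python; the transactions are the ones from the running example, and the comparison logic is illustrative only:

```python
from datetime import datetime

# Two sales of the same stock on the same day are otherwise similar cells;
# the time stamp is what distinguishes the two versions of the activity.
sales = [
    ("300 Shares ABBC Stock", datetime(2013, 9, 17, 10, 43)),
    ("200 Shares ABBC Stock", datetime(2013, 9, 17, 14, 34)),
]

# Ordering by time stamp recovers the sequence of events:
latest = max(sales, key=lambda cell: cell[1])
print(latest[0])  # 200 Shares ABBC Stock
```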
Figure 2: Four types of tags serving as pointers to the data. Source: Booz Allen Hamilton

DATA                     PRIMARY TAG     TAG GROUP              TIME STAMP            ROW ID
John Doe                 Name            Investor Information                         1
5/17/71                  Date of Birth   Investor Information                         1
1234-56                  Account #       Investor Information                         1
300 Shares ABBC Stock    Sales           Transactions           9/17/2013 10:43 AM    1
200 Shares ABBC Stock    Sales           Transactions           9/17/2013 2:34 PM     1
600 Shares XYYZ Stock    Purchases       Transactions           9/17/2013 3:03 PM     1

(Primary tag: the type of data in the cell. Tag group: the organizing principle for the tags used. Time stamp: the time and date of the stock transactions. Row ID: the designation that the entries are directly connected.)

The fourth type of tag is the Row ID. The rows themselves are not each given their own number. Instead, entire groups of rows, usually all relating to a single person or entity, are given the same Row ID number. This designates that they are all directly connected with each other. It also allows closely related data to be sharded, or horizontally partitioned, into close disk locations in the underlying storage. In the example, we know that the birthdate, account number and stock transactions are all associated with John Doe because they all have the same Row ID. In the Data Lake, there can be hundreds or even thousands of rows with the same Row ID.

It now becomes possible to ask questions of the data, or search for patterns, using any combination of these points: the data itself, the primary tag, the tag group, the time stamp, or the Row ID. We might want to know, for example, which investors made large purchases of a particular stock within a certain time frame. Or perhaps we want to know the frequency with which investors in certain foreign countries make transactions. Any combination of data and tags can be used in our queries.

Not every piece of structured and unstructured data has to be tagged upon ingest. Say, for example, a data source has a large number of data points about an individual, but only a few are needed for an initial inquiry.
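A toy in-memory version of Figure 2 shows how a query can combine any of these points. The `query` helper below is an illustrative stand-in for the key/value store's real scan machinery, not the Data Lake's actual API:

```python
# Rows from Figure 2 as (value, tag, tag_group, timestamp, row_id) tuples.
rows = [
    ("John Doe", "Name", "Investor Information", None, 1),
    ("5/17/71", "Date of Birth", "Investor Information", None, 1),
    ("1234-56", "Account #", "Investor Information", None, 1),
    ("300 Shares ABBC Stock", "Sales", "Transactions", "9/17/2013 10:43 AM", 1),
    ("200 Shares ABBC Stock", "Sales", "Transactions", "9/17/2013 2:34 PM", 1),
    ("600 Shares XYYZ Stock", "Purchases", "Transactions", "9/17/2013 3:03 PM", 1),
]

def query(rows, **criteria):
    """Return rows matching every supplied field.

    Criteria may name any combination of: value, tag, tag_group,
    timestamp, row_id; unspecified fields match anything.
    """
    fields = ("value", "tag", "tag_group", "timestamp", "row_id")
    def matches(row):
        return all(row[fields.index(k)] == v for k, v in criteria.items())
    return [r for r in rows if matches(r)]

# All transactions for Row ID 1, then just the sales:
print(len(query(rows, tag_group="Transactions", row_id=1)))  # 3
print(len(query(rows, tag="Sales")))                         # 2
```

The point of the sketch is that no schema constrains which fields a question may use: any mix of data and tags is a valid query.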
Every data point can be added to the Data Lake, though only selected ones are assigned tags. The others do not need to be tagged, other than being given the particular Row ID associated with the individual so the data can be located later. This saves time because the analysts do not need to assign tags to the bulk of the data. Not defining all tags at the beginning also avoids expensive data-modeling activities. And the additional data points are now part of the Data Lake, available to be tagged and analyzed whenever needed. Unlike with traditional data structures, we do not need to capture and define the data up front. A single piece of data can be given multiple primary tags and tag groups, which can be assigned ad hoc as we learn more about our information. Say we later decide to conduct queries on investors who are employees of the
Figure 3: Creating new rows for additional data on John Doe. Source: Booz Allen Hamilton

DATA                     PRIMARY TAG     TAG GROUP              TIME STAMP            ROW ID
John Doe                 Name            Investor Information                         1
5/17/71                  Date of Birth   Investor Information                         1
1234-56                  Account #       Investor Information                         1
300 Shares ABBC Stock    Sales           Transactions           9/17/2013 10:43 AM    1
200 Shares ABBC Stock    Sales           Transactions           9/17/2013 2:34 PM     1
600 Shares XYYZ Stock    Purchases       Transactions           9/17/2013 3:03 PM     1
John Doe                 Name            Employee                                     1
202-555-1212             Telephone #     Investor Information                         1

(The last two rows are new: a telephone number provides additional investor information on John Doe, and a new tag group, Employee, indicates that John Doe is an employee.)

Figure 4: Creating a new Row ID, with related data and tags. Source: Booz Allen Hamilton

DATA                     PRIMARY TAG     TAG GROUP              TIME STAMP            ROW ID
John Doe                 Name            Investor Information                         1
5/17/71                  Date of Birth   Investor Information                         1
1234-56                  Account #       Investor Information                         1
300 Shares ABBC Stock    Sales           Transactions           9/17/2013 10:43 AM    1
200 Shares ABBC Stock    Sales           Transactions           9/17/2013 2:34 PM     1
600 Shares XYYZ Stock    Purchases       Transactions           9/17/2013 3:03 PM     1
John Doe                 Name            Employee                                     1
202-555-1212             Telephone #     Investor Information                         1
Jane Smith               Name            Investor Information                         2
2/1/76                   Date of Birth   Investor Information                         2
3634-56                  Account #       Investor Information                         2
1200 Shares ABBC Stock   Sales           Transactions           6/24/2013 8:16 AM     2
280 Shares QQWD Stock    Purchases       Transactions           6/24/2013 11:11 AM    2
160 Shares XYYZ Stock    Purchases       Transactions           6/24/2013 2:36 PM     2
917-555-2121             Telephone #     Investor Information                         2
bank. As shown in Figure 3, we can create a new row in the key/value store that provides the data "John Doe" with an additional tag group: Employee. And at any time, we can add new data on the person, such as a phone number. Figure 3 shows how our updated example might look.

Data about another investor might be given the Row ID #2. The Row ID not only connects the data associated with a particular person or entity, it also distinguishes one person or entity from another. Updating the example in Figure 4, we see the content in the new rows.

In addition, the key/value store's flexibility in assigning tags means we do not have to know what the data refers to when we enter it into the Data Lake. We might have a nine-digit number associated with a certain person, but perhaps we do not know whether it is a phone number, a Social Security number, a bank account number, or whether it refers to something else. We can add it to the Data Lake and then run queries to see whether it is similar to other nine-digit numbers in the Data Lake. Unlike with relational databases, we do not need to know in advance how we will be using the information, or whether we will be using it at all. We can simply add potentially relevant information into the Data Lake and add tags iteratively, as we gain more insight into the data.

THE DATA LAKE IN ACTION

Because the data in the Data Lake is all connected and uses a schema-on-read approach, its entirety can be searched during any query. In addition, subsets of data and tags can be indexed and analyzed independently. There are three basic ways of searching the Data Lake: by the data itself; by the data and primary tags together; or by the data, primary tags and tag groups together. Say we want to learn what influence a prominent expert on gold trading has on gold prices. We might load into the Data Lake articles, blogs and other content in which the expert is either the author or is quoted by others.
Because the Data Lake easily accepts unstructured data, we can include posts on Twitter, Facebook and other social media sites, as well as podcasts and television programs.

Searching by the data alone. Say our first question is, "Is the price of gold tied to how often the expert's name appears in a tweet, article, blog or other content?" Using the time stamps, we can run analytics that track mentions of the expert against changes in the price of gold.

Searching by data and primary tags together. Say we next want to know, "Is the price of gold tied to how often the expert is the author of a tweet, article, blog, etc.?" In this case, we would search for the tag "Author" in all the content that mentions the expert.

Searching by the data, primary tags and tag groups together. Next, we might want to narrow our question to, "Is the price of gold tied to how often the expert is the author of a tweet?" Here, we are looking at content mentioning the expert in which the tag is "Author" and the tag group is "Tweet."

Unlike with schema-on-write, where the data structures and analytics must be torn down and rebuilt with each new line of inquiry, the key/value store makes it easy to switch variables in and out. We might, for example, want to drill down into the content of the tweets, which can also be tagged, and ask what happens when our gold expert discusses a particular subject, such as gold production in China or global supply and demand. Or we may want to see the effect of several experts combined. Or perhaps we want to gauge how influential particular television programs or blogs are on the price of gold.

BUILDING THE DATA LAKE

The Data Lake is a combination of publicly available powerhouse software programs, like Hadoop and Accumulo, and a wide range of Booz Allen proprietary tools and techniques, primarily associated with ingest and analytics. In particular, Hadoop and Accumulo, as adapted by the Data Lake, work together to deliver schema-on-read.
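Returning to the gold-trading example, the three search modes can be sketched against a handful of toy content cells. The expert's name, the cells and the counts are all invented for illustration; a real query would run against the key/value store itself:

```python
# Toy content cells: (value, primary_tag, tag_group, timestamp).
cells = [
    ("Goldman on gold", "Author",  "Tweet",   "2014-01-02"),
    ("Goldman quoted",  "Mention", "Article", "2014-01-03"),
    ("Goldman on gold", "Author",  "Blog",    "2014-01-05"),
]

expert = "Goldman"

# 1. By the data alone: every piece of content mentioning the expert.
all_hits = [c for c in cells if expert in c[0]]

# 2. By data + primary tag: only content the expert authored.
authored = [c for c in all_hits if c[1] == "Author"]

# 3. By data + primary tag + tag group: only tweets the expert authored.
tweets = [c for c in authored if c[2] == "Tweet"]

print(len(all_hits), len(authored), len(tweets))  # 3 2 1
```

Each narrower question simply adds one more filter; nothing is torn down or rebuilt between questions, which is the contrast with schema-on-write that the text draws.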
The Data Lake's key/value store, derived from Accumulo, is supported by a distributed file system (i.e., the Hadoop Distributed File System, or HDFS), rather than by a conventional storage area network (SAN). With a SAN, data is taken out of storage for processing and then returned, traveling back and forth through a narrow Fibre Channel connection that substantially limits speed and
capacity. With the Data Lake's distributed file system, however, the processing is conducted right at the point of storage, on thousands of nodes all networked together in a cloud environment. Through Hadoop, the calculations on all these nodes are conducted in parallel, making it possible for the entirety of the Data Lake to be processed at once. In essence, Accumulo uses Hadoop to get the data and then moves it to the appropriate locations for analysis.

The distributed file system also makes it considerably less expensive to add storage than with a SAN. Instead of continually purchasing and configuring new storage systems, as with a SAN, more nodes can simply be added to the distributed file system as needed. This enables the Data Lake to quickly and easily scale to an organization's growing data.

FILLING THE DATA LAKE

One of the chief drawbacks of schema-on-write is the sheer time and expense of preparing data. Major IT projects typically require huge data-modeling and standardization committees that often take a year or more to complete their work. The committees must define the problem space they want to tackle, decide what questions they need to ask and then figure out how to design the database schema to answer their questions. Because it is difficult to bring in new data sources once the structure is complete, there is often much disagreement over exactly what information should be included or left out. With these limitations, analysts cannot interactively ask questions of the data; they must form hypotheses well in advance of the actual analysis and then create the data structures and analytics to test those hypotheses. Consequently, the only results that come back are the ones that the data structures and analytics are designed to provide. There is a high risk of creating a closed-loop system, an echo chamber that merely validates the original hypothesis.
This is not an issue with the Data Lake, where both structured and unstructured data can be ingested quickly and easily, without data modeling or standardization. Structured data from conventional databases is placed into the rows of the Data Lake table in a largely automated process. Analysts choose which tags and tag groups to assign, typically drawn from the original tabular information. As noted earlier, the same piece of data can be given multiple tags, and tags can be changed or added at any time. Because the schema for storing data does not need to be defined up front, expensive and time-consuming modeling is not needed.

INGESTING UNSTRUCTURED AND SEMI-STRUCTURED DATA

Unstructured data, widely seen as holding the most promise for creating new areas of business growth and efficiency, accounts for much of the explosion in big data today (see Figure 5). However, because of the constraints of the conventional schema-on-write approach, only a small portion of this valuable resource is ever tapped. With schema-on-write, unstructured data must be substantially transformed. The process is so time-intensive that many organizations have found it nearly impossible to scale to their growing unstructured data. With the Data Lake's schema-on-read, however, there is no need for extensive data transformation; unstructured and semi-structured data can be quickly ingested and made ready for analysis.

Individual pieces of unstructured data, such as all or portions of tweets, are placed in rows and assigned the appropriate tags. Say a Fortune 500 company tweets its third-quarter financial results. Software configured for the Data Lake identifies various elements of the tweet; it recognizes, for example, that a # symbol followed by text is a hashtag. It also recognizes the patterns of URLs, email addresses and other types of information. In a largely automated process, the individual elements of the tweet are identified and then loaded into the Data Lake.
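A crude version of this element recognition can be sketched with regular expressions. The patterns and the sample tweet below are illustrative only, not the Data Lake's actual ingest software, and the patterns are deliberately simplistic rather than exhaustive:

```python
import re

tweet = ("Q3 results are in: revenue up 12% "
         "https://example.com/q3 #earnings contact ir@example.com")

# Simple recognizers for the kinds of elements described in the text:
# hashtags, URLs and email addresses.
elements = {
    "hashtag": re.findall(r"#\w+", tweet),
    "url": re.findall(r"https?://\S+", tweet),
    "email": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", tweet),
}

# Each recognized element can then become its own tagged cell,
# here as (value, primary_tag, tag_group) triples:
cells = [(value, tag, "Tweet")
         for tag, values in elements.items() for value in values]
print(cells)
```

Once the elements are split out, the rest of the ingest is just the ordinary tagging shown earlier, which is why the text can describe the process as largely automated.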
Even unstructured data without easily identifiable content, such as doctors' notes, can be quickly ingested into the Data Lake. Doctors' notes, for example, are typically filled with sentence fragments, medical shorthand and other quirks, such as the framing of patient conditions in the negative (as in, "the patient was not sweating"). The Data Lake brings together a variety of natural
language processing techniques and customizes them for specific types of unstructured data, in this case the phrasing of medical conditions. Again, the process is automated, making it possible to ingest and tag large amounts of unstructured data in a short period.

[Figure 5: The volume of structured vs. unstructured data in the world, 2000 to 2011, measured in trillions of gigabytes (zettabytes). Unstructured data (text, log files, blogs, tweets, audio, video, etc.) accounts for most of the growth. Source: IDC 2011 Digital Universe Study (http://www.emc.com/leadership/programs/digital-universe.htm)]

ADDING NEW DATA SOURCES

With the conventional approach, organizations may be reluctant to add new data sources, no matter how promising, because they fear the time and expense may outweigh the possible benefit. But with the Data Lake, organizations can add new data sources with little or no risk. This is possible because of two powerful features of the Data Lake's schema-on-read approach: all types of data can be ingested quickly, and because storage is inexpensive, data can be held in HDFS until it is ready to be analyzed. Say an organization has 20 new potential data sources, but does not know in advance which ones, if any, might be useful. An organization using the conventional schema-on-write may be reluctant to add any of the sources. But the Data Lake actually encourages organizations to add new data sources, because the time and resources needed are significantly reduced. Organizations need not fear adding what might be useless information; in a sense, there is no useless information in the Data Lake.

NO WASTED SPACE

With schema-on-write, data storage is inefficient, even in the cloud. The reason is that so much space is wasted due to the sparse table problem. Imagine a spreadsheet combining two data sources, an original one with 100 fields and the other with 50.
The process of combining means that we will be adding 50 new columns to the original spreadsheet. Rows from the original will hold no data in the new columns, and rows from the new source will hold no data in the original columns. The result is a great many empty cells. This not only wastes storage space, it also creates the opportunity for a great many errors. In the Data Lake, however, no space is wasted. Each piece of data is assigned a row, and since the data does not need to be combined at ingest, there are no empty rows or columns. This makes it possible to store vast amounts of data in far less space than would be required for even relatively small conventional cloud databases.

A GRANULAR LEVEL OF SECURITY AND PRIVACY

With most relational databases, security and privacy restrictions tend to be set at the level of the database or of a table within the database (some databases offer row-level security, but it is expensive to implement and maintain). If someone is not authorized to see a single
piece of data in a table, for example, then the entire table is off limits. Analysts running queries of databases may not have access to large swaths of data that should be available to them, severely degrading the results. This is not an issue in the Data Lake, which uses an Attribute-Based Access Control (ABAC) system that allows security and privacy restrictions to be built around each piece of data.

As data is ingested into the Data Lake, it is placed in individual cells. Each cell also contains that piece of data's visibility, which determines who has access to the data in the cell and under what circumstances. Visibility might be based on a user's role in the organization. For example, at a health insurance company, cells with patient names and birthdates might be accessible to employees in management and in the accounting, claims and legal departments. But other cells, say those with patient medical information, might be visible only to employees in claims and legal. When employees log onto the computer system to run queries of the Data Lake, or when analytics are run, their department is identified, and their queries are automatically limited to the appropriate cells.

The visibility of a particular piece of data can be configured in multiple ways. Instead of the user's role, it might be based on an individual's clearance to see certain types of information. Or visibility might require both role and clearance. Visibility might also be based on the data source; some users, for example, may have access to newspaper articles but not Twitter feeds. Any factor can be considered, and factors can be used in combination. With the conventional approach, changing the visibility of data can be cumbersome and time-consuming; it often means stripping information out of one database and putting it in another. But with the Data Lake, it is as simple as changing the logical expression, which is a quick, automated task.
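Accumulo, on which the Data Lake's key/value store is based, expresses cell visibility as labels combined with & (and) and | (or). The simplified evaluator below, which omits the parentheses and escaping the real engine supports, illustrates how such a logical expression gates access; the example fields and labels are invented for the health-insurance scenario:

```python
def visible(expression, user_labels):
    """True if the user's labels satisfy the cell's visibility expression.

    Supports flat expressions like "claims|legal" (any clause may match)
    and "management&secret" (every label in a clause must be held).
    """
    return any(all(label in user_labels for label in clause.split("&"))
               for clause in expression.split("|"))

# Illustrative per-cell visibility for the health-insurance example:
cell_visibility = {
    "patient name": "management|accounting|claims|legal",
    "patient medical record": "claims|legal",
}

accountant = {"accounting"}
print([field for field, expr in cell_visibility.items()
       if visible(expr, accountant)])
# an accountant sees the name cell but not the medical record
```

Because the check runs per cell, a query from the accounting department silently skips the medical-record cells rather than failing outright, which is the behavior the text describes.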
OVERCOMING THE HURDLES OF VOLUME, VELOCITY AND VARIETY

The shift from schema-on-write to schema-on-read is not an incremental advance; it represents a completely new mindset, one designed expressly for the challenges of large and diverse data sets. With the virtually unlimited capacity of the Data Lake's key/value store, and with an underlying infrastructure that expands easily and inexpensively, organizations can analyze an exponentially increasing volume of data. Free of the constraints of data modeling, normalization and other schema-on-write requirements, organizations can keep pace with the velocity of information. And because the Data Lake accepts unstructured data without painstaking formatting and structuring, organizations can draw full value from big data in all its variety. Schema-on-read and the Data Lake are new approaches for a new time.

FOR MORE INFORMATION

Mark Jacobsohn, Senior Vice President, Jacobsohn_Mark@bah.com, 703-902-5290
Michael Delurey, EngD, Principal, Delurey_Mike@bah.com, 703-902-6858
www.boozallen.com/cloud

This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this document, please contact:

James Fisher, Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com
Carrie Lake, Manager, Media Relations, 703-377-7785, lake_carrie@bah.com