HOW THE DATA LAKE WORKS
by Mark Jacobsohn, Senior Vice President, Booz Allen Hamilton
and Michael Delurey, EngD, Principal, Booz Allen Hamilton

As organizations rush to take advantage of large and diverse data sets, many find they simply cannot keep up with the exponential growth in the volume, velocity and variety of information today. So much data is coming in at such an overwhelming rate that organizations with conventional approaches to data storage and management cannot hope to capture it all, much less process it all. Inevitably, some of the most valuable information, particularly unstructured data, gets left on the cutting room floor. And organizations have no way of knowing how much critical knowledge and insight is being lost.

To meet this challenge, Booz Allen Hamilton pioneered the Data Lake, a completely new approach that not only manages the volume, velocity and variety of data, but actually becomes more powerful as all three aspects increase. What makes this possible is a transformative shift from schema-on-write to schema-on-read. With schema-on-write, which underlies the process known as extract, transform and load (ETL), it is necessary to design the data model and analytic frameworks before any data is loaded. This means we need to know in advance how we might use our data in the future, a kind of catch-22 that severely limits the scope and value of our inquiries. With schema-on-read, however, we can call upon the data for analysis as needed. The frameworks are created ad hoc and iteratively, for whatever purpose we have in mind, with only a minimum amount of preparation.

This fundamental change in approach has far-reaching implications. Business and government organizations are discovering that the larger and more diverse their data, the less effective ETL becomes. Analysts often must spend the bulk of their time simply creating the frameworks, preparing the data and maintaining the infrastructure.
As lines of inquiry inevitably change, the frameworks must be torn down and rebuilt, data must be re-ingested and re-indexed, and schemas must be updated, all at great effort. And the frameworks themselves are difficult to connect, hampering the ability of organizations to integrate and analyze their data.

2014 Booz Allen Hamilton Inc. All rights reserved. No part of this document may be reproduced without prior written permission of Booz Allen Hamilton.

Figure 1: The data cell within the Data Lake. Key: Row ID, Column Tag, Group Tag, Visibility, Time Stamp, Value. Source: Booz Allen Hamilton

The Data Lake's schema-on-read eliminates these and many other constraints of ETL, enabling organizations to draw full value from their data, no matter how large it grows. Data can be loaded first, then transformed and indexed iteratively as organizational understanding of the data improves.

The Data Lake uses a key/value store, an innovative approach founded on schema-on-read. With the key/value store, all relevant information associated with a piece of data is stored with that item in the form of metadata tags. These tags make it possible to store and manage vast amounts of data of all types and have it immediately available for analysis. This ability, coupled with the Data Lake's inexpensive storage running on commodity hardware, enables organizations to add a virtually unlimited number of new data sources at minimal risk.

USING THE KEY/VALUE STORE

With the Data Lake, an organization's entire repository of data is entered into a giant table and organized through the metadata tags. Each piece of data, such as a name, a photograph, an incident report or a Twitter feed, is placed in an individual cell. It does not matter where in the Data Lake any piece of data is located, where it comes from, or how it might be formatted. Because all of the data can easily be connected through the tags, the time-intensive frameworks of ETL are no longer necessary. Tags can evolve and be added or changed as analytic needs change; this is a fundamental difference between a relational database, which requires a predefined schema, and the Data Lake.

Four different types of tags essentially serve as pointers to the data within the cell. They are the primary tag, the tag group, the time stamp, and the Row ID.
In addition, the cell contains information on visibility, presented in a logical expression that governs who has access to the data in the cell and under what circumstances. Figure 1 shows how this information is structured within an individual cell.

To help show how these data work together in practice, the data and tags are represented in rows in the example shown in Figure 2. Here, a variety of data about an investor is entered into the Data Lake, such as personal information and stock transactions. The first column shows the actual data. The second column, with the primary tag, identifies the type of data in the cell, such as name, birthdate, account number, etc. The tags themselves can be organized into groups; in this case, the groups might be Investor Information and Transactions. There can be any number of primary tags and tag groups, and they do not have to be defined before data is ingested.

The time stamp tag uses information embedded with the data in the original data source, here, the time and date of the various stock transactions. The time stamp helps to distinguish different versions of a similar activity. Not all data entered into the Data Lake must be accompanied by time and date information, so in those cases a time stamp tag would not be applicable.

JUNE 2014
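The cell structure described above and in Figure 1 can be sketched as a simple key/value mapping. The following Python sketch is purely illustrative: the field names follow the figure, but the class and the in-memory list stand in for the Data Lake's actual store, and none of this is Accumulo's real API.

```python
from dataclasses import dataclass
from typing import Optional

# One cell in the Data Lake: a key (Row ID, tags, visibility, time stamp)
# pointing at a single value. Field names follow Figure 1; this is an
# in-memory sketch, not the actual Accumulo data model.
@dataclass
class Cell:
    row_id: int               # groups all cells for one person or entity
    primary_tag: str          # type of data in the cell (e.g., "Name")
    tag_group: str            # organizing principle (e.g., "Investor Information")
    visibility: str           # logical expression governing access
    value: str                # the data itself
    time_stamp: Optional[str] = None  # only when the source supplies one

# The "giant table" is simply a collection of cells.
data_lake: list[Cell] = [
    Cell(1, "Name", "Investor Information", "public", "John Doe"),
    Cell(1, "Date of Birth", "Investor Information", "claims|legal", "5/17/71"),
    Cell(1, "Sales", "Transactions", "public", "Shares ABBC Stock",
         time_stamp="9/17/2013 2:34 PM"),
]

# Any key field can serve as a query predicate; here, the Row ID
# connects everything known about one investor.
john = [c.value for c in data_lake if c.row_id == 1]
print(john)
```

Because every cell carries its own tags, nothing about this layout has to be declared before the first cell is loaded, which is the essence of schema-on-read.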
Figure 2: Four types of tags serving as pointers to the data. The primary tag gives the type of data in the cell; the tag group is the organizing principle for the tags used; the time stamp records the time and date of the stock transactions; the Row ID designates that the entries are directly connected. Source: Booz Allen Hamilton

DATA               PRIMARY TAG    TAG GROUP             TIME STAMP          ROW ID
John Doe           Name           Investor Information                      1
5/17/71            Date of Birth  Investor Information                      1
                   Account #      Investor Information                      1
Shares ABBC Stock  Sales          Transactions          9/17/ :43 AM        1
Shares ABBC Stock  Sales          Transactions          9/17/2013 2:34 PM   1
Shares XYYZ Stock  Purchases      Transactions          9/17/2013 3:03 PM   1

The fourth type of tag is the Row ID. The rows themselves are not each given their own number. Instead, entire groups of rows, usually all relating to a single person or entity, are given the same Row ID number. This designates that they are all directly connected with each other. It also allows closely related data to be sharded, or horizontally partitioned, into close disk locations in the underlying storage. In the example, we know that the birthdate, account number and stock transactions are all associated with John Doe because they all have the same Row ID. In the Data Lake, there can be hundreds and even thousands of rows with the same Row ID.

It now becomes possible to ask questions of the data, or search for patterns, using any combination of these points: the data itself, the primary tag, the tag group, the time stamp, or the Row ID. We might want to know, for example, which investors made large purchases of a particular stock within a certain time frame. Or perhaps we want to know the frequency with which investors in certain foreign countries make transactions. Any combination of data and tags can be used in our queries.

Not every piece of structured and unstructured data has to be tagged upon ingest. Say, for example, a data source has a large number of data points about an individual, but only a few are needed for an initial inquiry.
Every data point can be added to the Data Lake, though only selected ones are assigned tags. The others do not need to be tagged, other than to be given the particular Row ID associated with the individual so the data can be located later. This saves time because the analysts do not need to assign tags to the bulk of the data. Not defining all tags at the beginning also avoids expensive data modeling activities. And the additional data points are now part of the Data Lake, available to be tagged and analyzed whenever needed. Unlike with traditional data structures, we do not need to capture and define the data up front.

A single piece of data can be given multiple primary tags and tag groups, which can be assigned ad hoc as we learn more about our information. Say we later decide to conduct queries on investors who are employees of the bank. As shown in Figure 3, we can create a new row in the key/value store that provides the data John Doe with an additional tag group: Employee. And at any time, we can add new data on the person, such as a phone number. Figure 3 shows how our updated example might look.

Figure 3: Creating new rows for additional data on John Doe. New data (a telephone number) provides additional investor information on John Doe, and a new tag group indicates that John Doe is an employee. Source: Booz Allen Hamilton

DATA               PRIMARY TAG    TAG GROUP             TIME STAMP          ROW ID
John Doe           Name           Investor Information                      1
5/17/71            Date of Birth  Investor Information                      1
                   Account #      Investor Information                      1
Shares ABBC Stock  Sales          Transactions          9/17/ :43 AM        1
Shares ABBC Stock  Sales          Transactions          9/17/2013 2:34 PM   1
Shares XYYZ Stock  Purchases      Transactions          9/17/2013 3:03 PM   1
John Doe           Name           Employee                                  1
                   Telephone #    Investor Information                      1

Data about another investor might be given the Row ID 2. The Row ID not only connects the data associated with a particular person or entity, it also distinguishes one person or entity from another. Updating the example in Figure 4, we see the content in the new rows.

Figure 4: Creating a new Row ID, with related data and tags. Source: Booz Allen Hamilton

DATA               PRIMARY TAG    TAG GROUP             TIME STAMP          ROW ID
John Doe           Name           Investor Information                      1
5/17/71            Date of Birth  Investor Information                      1
                   Account #      Investor Information                      1
Shares ABBC Stock  Sales          Transactions          9/17/ :43 AM        1
Shares ABBC Stock  Sales          Transactions          9/17/2013 2:34 PM   1
Shares XYYZ Stock  Purchases      Transactions          9/17/2013 3:03 PM   1
John Doe           Name           Employee                                  1
                   Telephone #    Investor Information                      1
Jane Smith         Name           Investor Information                      2
2/1/76             Date of Birth  Investor Information                      2
                   Account #      Investor Information                      2
Shares ABBC Stock  Sales          Transactions          6/24/2013 8:16 AM   2
Shares QQWD Stock  Purchases      Transactions          6/24/ :11 AM        2
Shares XYYZ Stock  Purchases      Transactions          6/24/2013 2:36 PM   2
                   Telephone #    Investor Information                      2

In addition, the key/value store's flexibility in assigning tags means we do not have to know what the data refers to when we enter it into the Data Lake. We might have a nine-digit number associated with a certain person, but perhaps we do not know whether it is a phone number, a Social Security number, a bank account number, or whether it refers to something else. We can add it to the Data Lake and then run queries to see if it is similar to other nine-digit numbers in the Data Lake. Unlike with relational databases, we do not need to know in advance how we will be using the information, or whether we will be using it at all. We can simply add potentially relevant information into the Data Lake and add tags iteratively, as we gain more insight into the data.

THE DATA LAKE IN ACTION

Because the data in the Data Lake is all connected and uses a schema-on-read approach, its entirety can be searched during any query. In addition, subsets of data and tags can be indexed and analyzed independently. There are three basic ways of searching the Data Lake: by the data itself; by the data and primary tags together; or by the data, primary tags and tag groups together.

Say we want to learn what influence a prominent expert on gold trading has on gold prices. We might load into the Data Lake articles, blogs and other content in which the expert is either the author or is quoted by others.
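Before turning to searches over that kind of unstructured content, the operations in Figures 3 and 4, adding a new row that retags existing data, adding a second investor under her own Row ID, and then querying by any combination of fields, can be sketched in code. The tuple layout mirrors the figures' columns; the `query` helper and all names are illustrative, not any real Data Lake API.

```python
# Each entry mirrors one row of Figures 3 and 4:
# (value, primary_tag, tag_group, time_stamp, row_id).
data_lake = [
    ("John Doe", "Name", "Investor Information", None, 1),
    ("5/17/71", "Date of Birth", "Investor Information", None, 1),
    ("Shares XYYZ Stock", "Purchases", "Transactions", "9/17/2013 3:03 PM", 1),
]

# Figure 3: a new row gives the existing value "John Doe" an additional
# tag group ("Employee"): no schema change, no re-ingest.
data_lake.append(("John Doe", "Name", "Employee", None, 1))

# Figure 4: a second investor gets her own Row ID, which both connects
# her data and distinguishes her from John Doe.
data_lake.append(("Jane Smith", "Name", "Investor Information", None, 2))
data_lake.append(("Shares QQWD Stock", "Purchases", "Transactions",
                  "6/24/2013 2:36 PM", 2))

def query(**criteria):
    """Match rows on any combination of fields, per the key/value model."""
    fields = ("value", "primary_tag", "tag_group", "time_stamp", "row_id")
    return [row for row in data_lake
            if all(row[fields.index(k)] == v for k, v in criteria.items())]

# Which investors are also employees?
print(query(tag_group="Employee"))
# Everything connected to Jane Smith:
print(query(row_id=2))
```

Because a new line of inquiry is just a new combination of predicates, nothing is torn down or rebuilt between queries.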
Because the Data Lake easily accepts unstructured data, we can include posts on Twitter, Facebook and other social media sites, as well as podcasts and television programs.

Searching by the data alone. Say our first question is, "Is the price of gold tied to how often the expert's name appears in a tweet, article, blog or other content?" Using the time stamps, we can run analytics that track mentions of the expert against changes in the price of gold.

Searching by data and primary tags together. Say we next want to know, "Is the price of gold tied to how often the expert is the author of a tweet, article, blog, etc.?" In this case, we would search for the tag Author in all the content that mentions the expert.

Searching by the data, primary tags and tag groups together. Next, we might want to narrow our question to, "Is the price of gold tied to how often the expert is the author of a tweet?" Here, we are looking at content mentioning the expert in which the tag is Author and the tag group is Tweet.

Unlike with schema-on-write, where the data structures and analytics must be torn down and rebuilt with each new line of inquiry, the key/value store makes it easy to switch variables in and out. We might, for example, want to drill down into the content of the tweets, which can also be tagged, and ask what happens when our gold expert discusses a particular subject, such as gold production in China or global supply and demand. Or we may want to see the effect of several experts combined. Or perhaps we want to gauge how influential particular television programs or blogs are on the price of gold.

BUILDING THE DATA LAKE

The Data Lake is a combination of publicly available powerhouse software programs like Hadoop and Accumulo, and a wide range of Booz Allen proprietary tools and techniques primarily associated with ingest and analytics. In particular, Hadoop and Accumulo, as adapted by the Data Lake, work together to deliver schema-on-read.
The Data Lake's key/value store, derived from Accumulo, is supported by a distributed file system (i.e., the Hadoop Distributed File System, or HDFS) rather than by a conventional storage area network (SAN). With a SAN, data is taken out of storage for processing and then returned, traveling back and forth through a narrow Fibre Channel link that substantially limits speed and capacity. With the Data Lake's distributed file system, however, the processing is conducted right at the point of storage on thousands of nodes, all networked together in a cloud environment. Through Hadoop, the calculations on all these nodes are conducted in parallel, making it possible for the entirety of the Data Lake to be processed at once. In essence, Accumulo uses Hadoop to get the data and then moves it to the appropriate locations for analysis.

The distributed file system also makes it considerably less expensive to add storage than with a SAN. Instead of continually purchasing and configuring new storage systems, as with a SAN, more nodes can simply be added to the distributed file system as needed. This enables the Data Lake to quickly and easily scale to an organization's growing data.

FILLING THE DATA LAKE

One of the chief drawbacks of schema-on-write is the sheer time and expense of preparing data. Major IT projects typically require huge data-modeling and standardization committees that often take a year or more to complete their work. The committees must define the problem space they want to tackle, decide what questions they need to ask and then figure out how to design the database schema to answer those questions. Because it is difficult to bring in new data sources once the structure is complete, there is often much disagreement over exactly what information should be included or left out.

With these limitations, analysts cannot interactively ask questions of the data; they must form hypotheses well in advance of the actual analysis and then create the data structures and analytics to test those hypotheses. Consequently, the only results that come back are the ones the data structures and analytics are designed to provide. There is a high risk of creating a closed-loop system, an echo chamber that merely validates the original hypothesis.
This is not an issue with the Data Lake, where both structured and unstructured data can be ingested quickly and easily, without data modeling or standardization. Structured data from conventional databases is placed into the rows of the Data Lake table in a largely automated process. Analysts choose which tags and tag groups to assign, typically drawn from the original tabular information. As noted earlier, the same piece of data can be given multiple tags, and tags can be changed or added at any time. Because the schema for storing data does not need to be defined up front, expensive and time-consuming modeling is not needed.

INGESTING UNSTRUCTURED AND SEMI-STRUCTURED DATA

Unstructured data, widely seen as holding the most promise for creating new areas of business growth and efficiency, accounts for much of the explosion in big data today (see Figure 5). However, because of the constraints of the conventional schema-on-write approach, only a small portion of this valuable resource is ever tapped. With schema-on-write, unstructured data must be substantially transformed. The process is so time intensive that many organizations have found it nearly impossible to scale to their growing unstructured data.

With the Data Lake's schema-on-read, however, there is no need for extensive data transformation; unstructured and semi-structured data can be quickly ingested and made ready for analysis. Individual pieces of unstructured data, such as all or portions of tweets, are placed in rows and assigned the appropriate tags. Say a Fortune 500 company tweets its third-quarter financial results. Software configured for the Data Lake identifies various elements of the tweet; it recognizes, for example, that a # symbol followed by text is a hashtag. It also recognizes the patterns of URLs, addresses and other types of information. In a largely automated process, the individual elements of the tweet are identified and then loaded into the Data Lake.
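The kind of element recognition just described can be approximated with ordinary pattern matching. The sketch below uses simple regular expressions; the patterns, tag names and function are illustrative only, not the actual ingest software.

```python
import re

def extract_elements(tweet: str):
    """Identify recognizable elements of a tweet and emit (value, primary_tag)
    pairs ready to be loaded as tagged cells. Patterns are deliberately simple."""
    elements = []
    # A '#' symbol followed by text is a hashtag.
    for tag in re.findall(r"#\w+", tweet):
        elements.append((tag, "Hashtag"))
    # URLs follow a recognizable pattern too.
    for url in re.findall(r"https?://\S+", tweet):
        elements.append((url, "URL"))
    # '@' handles are a third common element.
    for handle in re.findall(r"@\w+", tweet):
        elements.append((handle, "Mention"))
    return elements

tweet = "Q3 results are out: http://example.com/q3 #earnings @AcmeCorp"
for value, primary_tag in extract_elements(tweet):
    print(value, "->", primary_tag)
```

Each extracted pair would become one row in the giant table, with the rest of the tweet stored untagged under the same Row ID until it is needed.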
Even unstructured data without easily identifiable content, such as doctors' notes, can be quickly ingested into the Data Lake. Doctors' notes, for example, are typically filled with sentence fragments, medical shorthand and other quirks, such as the framing of patient conditions in the negative (as in, "the patient was not sweating"). The Data Lake brings together a variety of natural language processing techniques and customizes them for specific types of unstructured data, in this case, the phrasing of medical conditions. Again, the process is automated, making it possible to ingest and tag large amounts of unstructured data in a short period.

Figure 5: The volume of structured vs. unstructured data in the world, measured in trillions of gigabytes (zettabytes). Unstructured data (text, log files, blogs, tweets, audio, video, etc.) accounts for most of the total. Source: IDC 2011 Digital Universe Study

ADDING NEW DATA SOURCES

With the conventional approach, organizations may be reluctant to add new data sources, no matter how promising, because they fear the time and expense may outweigh the possible benefit. But with the Data Lake, organizations can add new data sources with little or no risk. This is possible because of two powerful features of the Data Lake's schema-on-read approach: all types of data can be ingested quickly and storage is inexpensive, and data can be stored in HDFS until it is ready to be analyzed.

Say an organization has 20 new potential data sources, but does not know in advance which ones, if any, might be useful. An organization using the conventional schema-on-write may be reluctant to add any of the sources. But the Data Lake actually encourages organizations to add new data sources because the time and resources needed are significantly reduced. Organizations need not fear adding what might be useless information; in a sense, there is no useless information in the Data Lake.

NO WASTED SPACE

With schema-on-write, data storage is inefficient, even in the cloud. The reason is that so much space is wasted due to the sparse table problem. Imagine a spreadsheet combining two data sources, an original one with 100 fields and the other with 50. The process of combining means that we will be adding 50 new columns to the original spreadsheet.
Rows from the original will hold no data for the new columns, and rows from the new source will hold no data for the original. The result will be a great many empty cells. This not only wastes storage space, it also creates the opportunity for a great many errors.

In the Data Lake, however, no space is wasted. Each piece of data is assigned a row, and since the data does not need to be combined at ingest, there are no empty rows or columns. This makes it possible to store vast amounts of data in far less space than would be required for even relatively small conventional cloud databases.

A GRANULAR LEVEL OF SECURITY AND PRIVACY

With most relational databases, security and privacy restrictions tend to be at the level of the database or of a table within the database (some databases have row-level security, but that is expensive to implement and maintain). If someone is not authorized to see a single piece of data in a table, for example, then the entire table is off limits. Analysts running queries of databases may not have access to large swaths of data that should be available to them, severely degrading the results.

This is not an issue in the Data Lake, which uses an Attribute-Based Access Control (ABAC) system that allows security and privacy restrictions to be built around each piece of data. As data is ingested into the Data Lake, it is placed in individual cells. Each cell also contains that piece of data's visibility, which determines who has access to the data in the cell and under what circumstances.

Visibility might be based on a user's role in the organization. For example, at a health insurance company, cells with patient names and birthdates might be accessible to employees in management and in the accounting, claims, and legal departments. But other cells, say with patient medical information, might have visibility only to employees in claims and legal. When employees log onto the computer system to run queries of the Data Lake, or when analytics are run, their department is identified. Their queries will automatically be limited to the appropriate cells.

The visibility of a particular piece of data can be configured in multiple ways. Instead of the user's role, it might be based on an individual's clearance to see certain types of information. Or visibility might require both role and clearance. Visibility might also be based on the data source; some users, for example, may have access to newspaper articles but not Twitter feeds. Any factor can be considered and can be used in combination with any others.

With the conventional approach, changing the visibility of data can be cumbersome and time-consuming; it often means stripping information out of one database and putting it in another. But with the Data Lake, it is as simple as making a change to the logical expression, which is a quick, automated task.
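A cell's visibility of the kind described above can be written as a small boolean expression over a user's attributes; Accumulo's visibility labels use a broadly similar `&`/`|` syntax. The evaluator below is a simplified illustration under that assumption, not Accumulo's implementation, and it supports only flat OR-of-AND expressions without parentheses.

```python
# Evaluate a simplified visibility expression against a user's attributes.
# '|' separates alternatives (OR); '&' joins required attributes (AND).
# This is an illustrative sketch, not Accumulo's ColumnVisibility class.
def visible(expression: str, attributes: set[str]) -> bool:
    for alternative in expression.split("|"):       # any OR branch may grant access
        required = [a.strip() for a in alternative.split("&")]
        if all(r in attributes for r in required):  # every AND term must hold
            return True
    return False

# Health-insurance example from the text: patient names are visible to
# management, accounting, claims and legal; medical details only to
# claims and legal.
name_cell_visibility = "management|accounting|claims|legal"
medical_cell_visibility = "claims|legal"

print(visible(name_cell_visibility, {"accounting"}))     # True
print(visible(medical_cell_visibility, {"accounting"}))  # False
# Changing who sees the data is just a change to the expression:
print(visible(medical_cell_visibility + "|audit", {"audit"}))  # True
```

Because the expression travels with the cell, widening or narrowing access is a one-line change to the label rather than a migration between databases.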
OVERCOMING THE HURDLES OF VOLUME, VELOCITY AND VARIETY

The shift from schema-on-write to schema-on-read is not an incremental advance; it represents a completely new mindset, one designed expressly for the challenges of large and diverse data sets. With the virtually unlimited capacity of the Data Lake's key/value store, and with its underlying infrastructure that expands easily and inexpensively, organizations can analyze an exponentially increasing volume of data. Free of the constraints of data modeling, normalization and other schema-on-write requirements, organizations can keep pace with the velocity of information. And because the Data Lake accepts unstructured data without painstaking formatting and structuring, organizations can draw full value from big data in all its variety. Schema-on-read and the Data Lake are new approaches for a new time.

FOR MORE INFORMATION

Mark Jacobsohn, Senior Vice President, Jacobsohn_Mark@bah.com
Michael Delurey, EngD, Principal, Delurey_Mike@bah.com

This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this document, please contact:

James Fisher, Senior Manager, Media Relations, fisher_james_w@bah.com
Carrie Lake, Manager, Media Relations, lake_carrie@bah.com
More informationInternational Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop
ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com
More informationTwo Recent LE Use Cases
Two Recent LE Use Cases Case Study I Have A Bomb On This Plane (Miami Airport) In January 2012, an airline passenger tweeted she had a bomb on a Jet Blue commercial aircraft at the Miami International
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationTen Mistakes to Avoid
EXCLUSIVELY FOR TDWI PREMIUM MEMBERS TDWI RESEARCH SECOND QUARTER 2014 Ten Mistakes to Avoid In Big Data Analytics Projects By Fern Halper tdwi.org Ten Mistakes to Avoid In Big Data Analytics Projects
More informationEliminating Complexity to Ensure Fastest Time to Big Data Value
Eliminating Complexity to Ensure Fastest Time to Big Data Value Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest
More informationA TECHNICAL WHITE PAPER ATTUNITY VISIBILITY
A TECHNICAL WHITE PAPER ATTUNITY VISIBILITY Analytics for Enterprise Data Warehouse Management and Optimization Executive Summary Successful enterprise data management is an important initiative for growing
More informationCapitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate
More informationManaging Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
More informationGenerating the Business Value of Big Data:
Leveraging People, Processes, and Technology Generating the Business Value of Big Data: Analyzing Data to Make Better Decisions Authors: Rajesh Ramasubramanian, MBA, PMP, Program Manager, Catapult Technology
More informationAnalytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
More informationSources: Summary Data is exploding in volume, variety and velocity timely
1 Sources: The Guardian, May 2010 IDC Digital Universe, 2010 IBM Institute for Business Value, 2009 IBM CIO Study 2010 TDWI: Next Generation Data Warehouse Platforms Q4 2009 Summary Data is exploding
More informationFormal Methods for Preserving Privacy for Big Data Extraction Software
Formal Methods for Preserving Privacy for Big Data Extraction Software M. Brian Blake and Iman Saleh Abstract University of Miami, Coral Gables, FL Given the inexpensive nature and increasing availability
More informationBest Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
More informationForward Thinking for Tomorrow s Projects Requirements for Business Analytics
Seilevel Whitepaper Forward Thinking for Tomorrow s Projects Requirements for Business Analytics By: Joy Beatty, VP of Research & Development & Karl Wiegers, Founder Process Impact We are seeing a change
More informationManaging Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
More informationBefore You Buy: A Checklist for Evaluating Your Analytics Vendor
Executive Report Before You Buy: A Checklist for Evaluating Your Analytics Vendor By Dale Sanders Sr. Vice President Health Catalyst Embarking on an assessment with the knowledge of key, general criteria
More informationUbuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
More informationBig Data at Cloud Scale
Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For
More informationTraditional BI vs. Business Data Lake A comparison
Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses
More informationEliminating Complexity to Ensure Fastest Time to Big Data Value
Eliminating Complexity to Ensure Fastest Time to Big Data Value Copyright 2013 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest
More informationBanking On A Customer-Centric Approach To Data
Banking On A Customer-Centric Approach To Data Putting Content into Context to Enhance Customer Lifetime Value No matter which company they interact with, consumers today have far greater expectations
More informationAgile Business Intelligence Data Lake Architecture
Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationUsing Tableau Software with Hortonworks Data Platform
Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data
More informationBIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics
BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are
More informationIdentifying Fraud, Managing Risk and Improving Compliance in Financial Services
SOLUTION BRIEF Identifying Fraud, Managing Risk and Improving Compliance in Financial Services DATAMEER CORPORATION WEBSITE www.datameer.com COMPANY OVERVIEW Datameer offers the first end-to-end big data
More informationBusiness Intelligence Data Detectives. The Truth is in There
Business Intelligence Data Detectives The Truth is in There Welcome Jason Hernandez Director, Information Management Y&L Consulting, Inc. @jasonuhernandez Clint Campbell Solutions Architect Y&L Consulting,
More informationNOSQL, BIG DATA AND GRAPHS. Technology Choices for Today s Mission- Critical Applications
NOSQL, BIG DATA AND GRAPHS Technology Choices for Today s Mission- Critical Applications 2 NOSQL, BIG DATA AND GRAPHS NOSQL, BIG DATA AND GRAPHS TECHNOLOGY CHOICES FOR TODAY S MISSION- CRITICAL APPLICATIONS
More informationBeyond the Data Lake
WHITE PAPER Beyond the Data Lake Managing Big Data for Value Creation In this white paper 1 The Data Lake Fallacy 2 Moving Beyond Data Lakes 3 A Big Data Warehouse Supports Strategy, Value Creation Beyond
More informationWrangling Actionable Insights from Organizational Data
Wrangling Actionable Insights from Organizational Data Koverse Eases Big Data Analytics for Those with Strong Security Requirements The amount of data created and stored by organizations around the world
More informationDISCOVERING AND SECURING SENSITIVE DATA IN HADOOP DATA STORES
DATAGUISE WHITE PAPER SECURING HADOOP: DISCOVERING AND SECURING SENSITIVE DATA IN HADOOP DATA STORES OVERVIEW: The rapid expansion of corporate data being transferred or collected and stored in Hadoop
More informationHortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved
Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment
More informationStreamStorage: High-throughput and Scalable Storage Technology for Streaming Data
: High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationThe Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics
The Data Engineer Mike Tamir Chief Science Officer Galvanize Steven Miller Global Leader Academic Programs IBM Analytics Alessandro Gagliardi Lead Faculty Galvanize Businesses are quickly realizing that
More informationWhite. Paper. EMC Isilon: A Scalable Storage Platform for Big Data. April 2014
White Paper EMC Isilon: A Scalable Storage Platform for Big Data By Nik Rouda, Senior Analyst and Terri McClure, Senior Analyst April 2014 This ESG White Paper was commissioned by EMC Isilon and is distributed
More informationWINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies
More informationModern Data Integration
Modern Data Integration Whitepaper Table of contents Preface(by Jonathan Wu)... 3 The Pardigm Shift... 4 The Shift in Data... 5 The Shift in Complexity... 6 New Challenges Require New Approaches... 6 Big
More informationThere s no way around it: learning about Big Data means
In This Chapter Chapter 1 Introducing Big Data Beginning with Big Data Meeting MapReduce Saying hello to Hadoop Making connections between Big Data, MapReduce, and Hadoop There s no way around it: learning
More informationThe evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
More informationKPMG Unlocks Hidden Value in Client Information with Smartlogic Semaphore
CASE STUDY KPMG Unlocks Hidden Value in Client Information with Smartlogic Semaphore Sponsored by: IDC David Schubmehl July 2014 IDC OPINION Dan Vesset Big data in all its forms and associated technologies,
More informationBeyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
More informationWhat happens when Big Data and Master Data come together?
What happens when Big Data and Master Data come together? Jeremy Pritchard Master Data Management fgdd 1 What is Master Data? Master data is data that is shared by multiple computer systems. The Information
More informationWhere have you been all my life? How the financial services industry can unlock the value in Big Data
Where have you been all my life? How the financial services industry can unlock the value in Big Data Agenda Why should I care? What is Big Data? Is Big Data for me? What will it take? PwC Slide 1 The
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationData Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC
Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep Neil Raden Hired Brains Research, LLC Traditionally, the job of gathering and integrating data for analytics fell on data warehouses.
More informationANALYTICS BUILT FOR INTERNET OF THINGS
ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that
More informationKNOWLEDGENT REPORT. 2015 Big Data Survey: Current Implementation Challenges
KNOWLEDGENT REPORT 2015 Big Data Survey: Current Implementation Challenges INTRODUCTION The amount of data in both the private and public domain is experiencing exponential growth. Mobile devices, sensors,
More informationEMC s Enterprise Hadoop Solution. By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst
White Paper EMC s Enterprise Hadoop Solution Isilon Scale-out NAS and Greenplum HD By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst February 2012 This ESG White Paper was commissioned
More informationBig Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
More informationBIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
More informationMaster big data to optimize the oil and gas lifecycle
Viewpoint paper Master big data to optimize the oil and gas lifecycle Information management and analytics (IM&A) helps move decisions from reactive to predictive Table of contents 4 Getting a handle on
More informationContent Marketing Integration Workbook
Content Marketing Integration Workbook 730 Yale Avenue Swarthmore, PA 19081 www.raabassociatesinc.com info@raabassociatesinc.com Introduction Like the Molière character who is delighted to learn he has
More informationBIG DATA: FIVE TACTICS TO MODERNIZE YOUR DATA WAREHOUSE
BIG DATA: FIVE TACTICS TO MODERNIZE YOUR DATA WAREHOUSE Current technology for Big Data allows organizations to dramatically improve return on investment (ROI) from their existing data warehouse environment.
More informationBig Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
More informationBig Data Efficiencies That Will Transform Media Company Businesses
Big Data Efficiencies That Will Transform Media Company Businesses TV, digital and print media companies are getting ever-smarter about how to serve the diverse needs of viewers who consume content across
More informationUsing Predictive Maintenance to Approach Zero Downtime
SAP Thought Leadership Paper Predictive Maintenance Using Predictive Maintenance to Approach Zero Downtime How Predictive Analytics Makes This Possible Table of Contents 4 Optimizing Machine Maintenance
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationUsing Big Data Analytics for Financial Services Regulatory Compliance
Using Big Data Analytics for Financial Services Regulatory Compliance Industry Overview In today s financial services industry, the pendulum continues to swing further in the direction of lower risk and
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationKeywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
More informationThe Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
More informationTop Data Management Terms to Know Fifteen essential definitions you need to know
Top Data Management Terms to Know Fifteen essential definitions you need to know We know it s not always easy to keep up-to-date with the latest data management terms. That s why we have put together the
More informationBig Data - Infrastructure Considerations
April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright
More information