Data Catalogs for Hadoop: Achieving Shared Knowledge and Re-usable Data Prep

Neil Raden, Hired Brains Research, LLC

Traditionally, the job of gathering and integrating data for analytics fell to data warehouses. Data warehouses cleaned and aggregated the operational exhaust of business transactions. They created precision, accuracy and consistency out of messy transactional structures, so that business people could look back and make sense of the past. As data warehousing evolved, so did our operational applications, and with the advent of the Internet a whole new challenge to precision, accuracy and consistency emerged.

Why All the Work Before Analysis Is Getting the Attention

For business analysts and data scientists alike, work now involves much more searching and discovery of data than navigating through stable structures. It is no longer a singular effort: sharing results and persisting new views of data and models is core to success, and collaboratively governing data catalogs is the approach that works.

It is important to make the distinction between tools that provide connectors operating on a physical level (either generating SQL from structural metadata or using APIs, Web Services, etc.) and those that have a rich understanding of the data based on its content, structure and use. Connectors are not sufficient. Someone, or some thing, has to understand the meaning of the data to provide a catalog for analysts to do their work.
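To see the difference concretely, consider what a purely physical-level connector can actually do. The minimal Python sketch below, using hypothetical table and column names, generates a perfectly valid SELECT from structural metadata alone; nothing in it can know that price_cents is recorded in cents or that ts is an event timestamp, because that knowledge is not in the schema:

    def generate_select(table, columns, limit=100):
        # A connector working only from structural metadata can enumerate
        # tables and columns and emit syntactically correct SQL, but it has
        # no understanding of what any column means or how it is used.
        return "SELECT {} FROM {} LIMIT {}".format(", ".join(columns), table, limit)

    # Structural metadata as a connector might read it from information_schema:
    print(generate_select("sales.orders", ["order_id", "user_id", "ts", "price_cents"]))
    # -> SELECT order_id, user_id, ts, price_cents FROM sales.orders LIMIT 100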

Data Scripts Re-Defined for More Agile Analysis

Keep in mind that all of the data used by data scientists, analysts and other applications is essentially used data. In other words, it was originally created for purposes other than analysis: supporting operations, capturing events, recording transactions. The data at the source (even when it is clear, consistent and error-free, which is rare in an integrated context) will still contain semantic errors, missing entries or inconsistent formatting for the secondary context of analytics. It must be handled before it can be used, and this is especially true if the goal is integration with data from other sources.

Data warehousing addressed this problem of dealing with used data long ago, with processes that executed before data was stored. This is where data warehousing and Hadoop differ. ETL tools, methodologies and best practices ensured that any analyst working with data in a data warehouse accessed a single source of truth that was already cleaned and aggregated to produce a predefined business metric, often labeled a key performance indicator. (Though in fairness, applications built on data warehouses are often quite creative.) These solutions are also partially useful for data scientists, but because the data was pre-processed to fit a specific model or schema, the richness of the data is lost when it comes to hypothesis testing in a more exploratory manner. In addition, a data warehouse is typically too slow to change when experiments and discoveries are happening in an unplanned fashion ("Our competitor just put out a press release about a new pricing model. How should we respond?").

The innovation of Hadoop introduced a better storage mechanism and best practices for hypothesis testing on rich, raw data at low cost. Hadoop evolved out of a system designed to capture digital exhaust. Initially the byproduct of online activities, that exhaust today also includes a wealth of machine-generated event data gathered from sensors in real time. While data warehouses typically stored aggregated information based on application transactions, digital exhaust comes in a multitude of forms, such as XML and JSON, that are typically referred to as unstructured, though more accurately described as not highly structured.

There is a challenge to working with data stored in its most raw form: any manipulation of that data done for hypothesis testing results in a new, unique data structure. While the data may be in one place physically (typically a Hadoop cluster), a collection of data silos is logically created, whereby each new analysis creates a new data-structure silo of its own. This gives tremendous agility to analysts, who can now ask any question they'd like of the data, but it has the downside of making it harder to find a single source of truth.

In the big data world, most of this agility is accomplished by hand coding, writing scripts or manually editing data in a spreadsheet: a tedious and time-consuming effort, made worse by slow network connections and under-powered platforms. The only upside to data preparation, when it doesn't consume so much valuable time, is that the process itself often yields new insights about the meaning of the data, what assumptions are safe to make about it, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply.
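To make the hand-coding point concrete, here is a minimal Python sketch of a typical one-off preparation script; the file names, field names and fix-ups are all hypothetical. It patches missing entries and inconsistent formats in raw JSON events and writes out yet another analysis-specific structure, that is, a new logical silo:

    import csv
    import json
    from datetime import datetime, timezone

    def clean_event(raw):
        # Patch the usual problems of "used" data: values were encoded for
        # operations, not analysis (hypothetical fields throughout).
        event = json.loads(raw)
        user = event.get("user_id") or "unknown"   # missing entries
        ts = event.get("timestamp")
        if isinstance(ts, (int, float)):           # inconsistent formats: epoch vs. ISO
            ts = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
        price = event.get("price_cents", 0) / 100.0  # semantic quirk: cents, not dollars
        return {"user_id": user, "timestamp": ts, "price_usd": price}

    # One raw extract in, one single-purpose file out: a new silo is born.
    with open("raw_events.json") as src, \
         open("events_for_pricing_study.csv", "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["user_id", "timestamp", "price_usd"])
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(clean_event(line))

None of this code is reusable, none of it records what was learned about the data, and the next analyst will write it again from scratch, which is precisely the argument for capturing that knowledge in a catalog instead.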

But that is the work of analysts and data scientists. As an enterprise platform supporting many needs and uses, there are data preparation tools. The Hadoop argument was, "Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop?" The reason is that writing code is not quite the same as using an intelligent tool that provides data preparation assistance with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It's also a little contradictory: if Hadoop is for completely flexible and novel analysis, who is going to write transformation code for every project? This approach involves mechanical, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientist's time and provides no reference to the content or meaning of the data.

Re-thinking the Model for Organizational Scale

Is there a solution? Let's use a metaphor to describe it: let the gorillas write the script. During the making of the 1988 film Gorillas in the Mist, a biopic about the life and work of the late Dian Fossey among the highland gorillas of Rwanda, a crisis in the filming required an emergency meeting back at the studio. The full cast and crew were shooting on a 12,000-foot mountain in Rwanda and had burned through half of the budget with little progress because, as the director said, "The gorillas can't learn their lines." The suggestion was made to send a skeleton crew back to film just the gorillas, bring the footage home and write a new script around it. The film was a huge success, and from this was born the expression, "Let the gorillas write the script."

We hear a lot today about democratizing analytics or pervasive analytics, but as the producers of Gorillas in the Mist learned almost thirty years ago, there is great utility in letting things roll and learning as you go, especially if you have tools to capture those interactions and to provide all of the features that a big data catalog platform needs to be effective with both operational and digital exhaust alike.

The relevant questions today are not, "Is this a single version of the truth?" or "Who has access to what parts of the schema?" Instead, it's the (excuse the big word) phenomenology of how analysts actually work that matters:

How do people find and use information in their work? How do they collaborate with others, and how do they share insights?

How do we make use of today's ample resources in hardware and algorithms to create information agents and advisers tailored to people's (changing) needs?

How do we stitch together the discovery of data (of what's there, not the market segment called "data discovery"), modeling and presentation without boiling the ocean for each new analysis?

How do we make the computer understand me and anticipate, in a non-trivial way, what I do and what I need? ("Help me." And a hundred other things.)

How do we move the task of data integration and data extraction to more advanced knowledge integration and knowledge extraction, drawing not only on machine learning but also on human collaboration?

Alation's applications for collaboration help capture knowledge from subject matter experts (knowledge captured by Alation automatically during the act of composing and executing the queries themselves) and encourage documentation by making it easy for analysts to tag, annotate and share within their existing SQL-based workflows for analytics. What Alation does is take the guesswork out of what the data means, what it's related to and how it can be dynamically linked together, without endless data modeling and remodeling.
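The raw material for that last point is already on hand in most organizations. As a toy illustration (only a sketch with made-up queries, not Alation's implementation), even a crude count of table references in a SQL query log separates the tables analysts actually rely on from abandoned scratch copies:

    import re
    from collections import Counter

    # Hypothetical sketch: rank tables by how often analysts' queries touch them.
    # Real SQL needs a real parser; a regex is enough to show the idea.
    TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

    def rank_tables(query_log):
        usage = Counter()
        for query in query_log:
            usage.update(t.lower() for t in TABLE_REF.findall(query))
        return usage.most_common()

    # Made-up log entries standing in for a warehouse's query history:
    queries = [
        "SELECT o.user_id, o.price FROM sales.orders o JOIN sales.customers c ON o.cid = c.id",
        "SELECT count(*) FROM sales.orders WHERE ts >= '2015-01-01'",
        "SELECT * FROM staging.orders_tmp",
    ]
    for table, count in rank_tables(queries):
        print(table, count)   # sales.orders outranks the one-off staging copy

A usage signal like this, the weighting implied by previous and ongoing use, is exactly what the next section argues pure schema matching cannot supply.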

Both Human Collaboration and Machine Learning Are the Solution

The behavior of human analysts, captured in logs of queries and in the reports and analyses of BI tools, provides crucial guidance to the work of machine-learning algorithms. Machine-learning algorithms, on the other hand, are invaluable for discovering patterns and relationships that the business analyst may never perceive. And letting the gorillas (that is, the analysts) write the script, instead of waiting for incremental mapping of datasets, ensures that critical context based on use and implied semantic meaning is captured next to the data. Data unification approaches that tie various schemas together by column names or column content are useful to a point, but they lack the critical weighting of the data implied by its previous and ongoing use.

A great deal of our trouble, in both past and current approaches to analysis, lies in understanding data. This was already the case when we had orders of magnitude less of it, and it is an issue only emphasized by today's greater volumes. The semantics of data, how it is captured, how it is modeled, and the gap between the real phenomena we think we're seeing and what an application actually captures and encodes are all core to understanding data. Data mining, predictive models and machine learning are just that: models of imperfect, used data. The process of understanding data and preparing it for analysis demands input from both machine models and people, assisted by software like Alation that brings the two together.

About the Author

Neil Raden is an author, consultant and industry analyst, a featured speaker internationally, and the founder and Principal Analyst at Hired Brains Research, a firm specializing in the application of data management and analytics. His focus is enabling better-informed decision-making. nraden@hiredbrains.com | hiredbrains.wordpress.com

About Alation

Alation is the first data catalog built for collaboration. With Alation, analysts are empowered to search, query and collaborate to achieve faster, more accurate insights. Contact us: info@alation.com | (650) 799-4440 | alation.com