Data Governance in the Hadoop Data Lake Michael Lang May 2015

Introduction
- Product Manager for Teradata Loom
- Joined Teradata as part of its acquisition of Revelytix, the original developer of Loom
- VP of Sales Engineering at Revelytix
- Originally joined Revelytix in 2007

Data Governance in a Data Lake
- A data lake is a centralized repository of data into which many data-producing streams flow and from which downstream facilities may draw for a variety of use cases (Information Sources -> Data Lake -> Downstream Facilities)
- Data governance is a combination of fundamental capabilities for managing and understanding data and specialized capabilities to meet regulatory requirements imposed on the data

Regulatory Compliance
- Ensuring that all legal requirements to store and protect data are satisfied (Sarbanes-Oxley, HIPAA, Basel II, ...)
- Security, Auditing, Retention, Backup
- Hadoop has built-in support for these capabilities
- Hadoop distribution vendors have all made improvements in each of these areas
- A variety of vendors provide specialized capabilities in each area that go beyond what a Hadoop distribution provides

Governance and Productivity
- Governance that supports day-to-day use of data
- Data workers need a strong understanding of what data is available and how datasets are related: data engineers, data scientists, business analysts, data stewards, data owners
- Hadoop presents unique challenges:
  - No central catalog
  - Schema-on-read (illustrated in the sketch below)
  - Multiple formats of data
  - Multiple storage layers (HDFS, Hive, HBase)
  - Many processing engines (MapReduce, Hive, Pig, Impala, Drill, ...)
  - Many workflow engines/schedulers (cron, Oozie, Falcon, ...)
- A holistic view of the data with the required level of context is difficult to come by
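
To make the schema-on-read challenge concrete, here is a minimal sketch, assuming a PySpark session and a hypothetical raw file /data/raw/events.csv (neither is from the original deck): two consumers read the same bytes with different schemas, and nothing in Hadoop itself records which interpretation to trust.

```python
# Minimal schema-on-read sketch. The file path and column names are
# hypothetical; this is not Loom code, just an illustration of the problem.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Consumer A treats every column as a string.
schema_a = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", StringType()),
    StructField("amount", StringType()),
])

# Consumer B applies typed columns to the very same raw file.
schema_b = StructType([
    StructField("user_id", LongType()),
    StructField("event_time", TimestampType()),
    StructField("amount", StringType()),
])

df_a = spark.read.schema(schema_a).csv("/data/raw/events.csv")
df_b = spark.read.schema(schema_b).csv("/data/raw/events.csv")

# Same bytes on HDFS, two different "tables" -- without a shared catalog,
# downstream users cannot tell which interpretation is the governed one.
```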

Data Governance Fundamentals
- Ensuring people working with data can easily find and understand what data is available and assess data quality and fitness for purpose
- Data Catalog: technical metadata, business metadata (see the sketch below)
- Search
- Data Lineage
- All about productivity
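
As an illustration only (field names are assumptions, not Loom's actual data model), a minimal catalog entry that pairs technical and business metadata, plus a naive search over it, might look like this:

```python
# Illustrative sketch of a catalog entry combining technical and business
# metadata; the fields are assumptions, not Loom's actual data model.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                      # dataset name
    location: str                  # e.g. HDFS path or Hive table
    fmt: str                       # storage format (csv, parquet, ...)
    schema: dict                   # column name -> type (technical metadata)
    description: str = ""          # business metadata added by a steward
    owner: str = ""
    tags: list = field(default_factory=list)

def search(catalog, term):
    """Naive search over names, descriptions and tags."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)]
```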

Teradata Solutions for Data Governance in Hadoop
- Think Big: Hadoop professional services; Hadoop Data Lake, a packaged service/product offering to build and deploy high-quality, governed data lakes
- Loom: data management for Hadoop; data cataloging, lineage, data wrangling
- RainStor: data archiving; structured data archiving in Hadoop with robust security
- All recent acquisitions; all standalone offerings, with some light integration options; Teradata UDA integration on the roadmap

Think Big Data Lake Starter
- Enables a rapid build of an initial data lake
- Data Lake Build: provide recommendations and assistance in standing up an 8-16 node data lake, on premises or in the cloud
- Implement and document 2-3 ingest pipelines: robust infrastructure to support fast onboarding of new pipelines and use cases
- Implement an end-to-end security plan: perimeter, authentication, authorization and protection
- Integrated data cataloging and lineage through Loom
- Implement archiving, if required, through RainStor

Loom
- Find and Understand Your Data
  - ActiveScan: data cataloging, event triggers, job detection and lineage creation, data profiling (statistics)
  - Workbench and Metadata Registry: data exploration and discovery, technical and business metadata, data sampling and previews, lineage relationships, search over metadata
  - REST API: easily integrate third-party apps (see the sketch below)
- Prepare Your Data
  - Data Wrangling: self-service, interactive data wrangling for Hadoop, with metadata tracked
  - HiveQL: joins, unions, aggregations, UDFs, with metadata tracked in Loom
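
A hedged sketch of what the "easily integrate third-party apps" bullet could look like in practice; the base URL, endpoint and response shape below are assumptions for illustration, not the documented Loom REST API:

```python
# Hypothetical integration sketch: the host, endpoint and JSON shape are
# assumptions for illustration, not the documented Loom REST API.
import requests

LOOM_URL = "http://loom-server.example.com:8080/api/v1"  # assumed base URL

def find_datasets(keyword):
    """Search the metadata registry for datasets matching a keyword."""
    resp = requests.get(f"{LOOM_URL}/search",
                        params={"q": keyword},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for hit in find_datasets("customer"):
        print(hit.get("name"), hit.get("location"))
```

The point is only that the metadata registry is reachable over plain HTTP, so third-party tools can pull catalog and lineage information without going through the Loom Workbench.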

RainStor Overview
- Online archiving solution for Hadoop
- Compression
- MPP SQL query engine
- Encryption
- Auditing
- Security (authentication/authorization)
- Data import/export:
  - FastForward: access to Teradata tape-format files
  - FastConnect: connector to Teradata EDWs

Summary
- Data governance is critical to building a successful data lake
- Fundamental governance capabilities make data workers more productive
- Solutions for meeting regulatory requirements are also needed
- Teradata Loom provides the required data cataloging and lineage capabilities
- RainStor provides an advanced archiving solution
- Think Big Data Lake provides the complete package
- Stop by our booth for a demo

Backup Slides

Loom Data Wrangling
- Data preparation consumes a large amount of an analyst's time
- Data Wrangling:
  - Modify and combine column values to create new columns
  - Modify schemas: add/delete/rename columns, convert datatypes
- Hive: joins, unions, aggregations
- Self-service, interactive UI for working with large data sets
  - Work with a sample of the data set for quick iteration
  - Once the sample is in the desired form, Loom applies all of the steps against the full data set via MapReduce (see the sketch below)
- Leverages the Loom Metadata Registry
  - All data cleaning steps are tracked to provide a complete data lineage picture from the raw source data to the data sets used for analytics
  - Users benefit from the context provided by metadata in the Loom Registry
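
A minimal sketch of the sample-then-replay pattern described above, using plain pandas and a hypothetical customers.csv for illustration; Loom's Weaver records the steps and replays them via MapReduce, which this sketch does not attempt to reproduce:

```python
# Sketch of the sample-then-replay wrangling pattern described above.
# Uses pandas for illustration; Loom itself replays the steps via MapReduce.
import pandas as pd

# Each wrangling step is recorded as a named, replayable function.
steps = []

def step(fn):
    steps.append(fn)
    return fn

@step
def drop_bad_rows(df):
    return df.dropna(subset=["user_id"])

@step
def add_full_name(df):
    return df.assign(full_name=df["first_name"].str.strip() + " "
                               + df["last_name"].str.strip())

def apply_steps(df):
    for fn in steps:
        df = fn(df)
    return df

# Iterate quickly on a small sample...
sample = pd.read_csv("customers.csv", nrows=1_000)   # hypothetical file
preview = apply_steps(sample)

# ...then replay exactly the same recorded steps against the full data set.
full = apply_steps(pd.read_csv("customers.csv"))
```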

Loom Data Lineage
Loom uses multiple methods to collect lineage metadata:
- Loom-initiated transforms: Data Wrangling, Hive
- ActiveScan job detection: TDCH, Sqoop
- API: Hive, RainStor (Q3 2015), Think Big Data Lake (Q2 2015); services engagements can extend this to virtually any execution engine (a toy lineage-graph sketch follows)
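
However the lineage metadata is collected, it forms a graph from raw sources to derived datasets. A toy sketch (illustrative dataset names, not Loom's registry format) of walking such a graph back to its raw sources:

```python
# Toy lineage graph: each derived dataset maps to the inputs it was built
# from. Illustrative only; Loom stores these relationships in its registry.
lineage = {
    "analytics.customer_scores": ["staging.customers_clean", "staging.orders_clean"],
    "staging.customers_clean":   ["raw.customers_csv"],
    "staging.orders_clean":      ["raw.orders_csv"],
}

def upstream_sources(dataset, graph):
    """Return every raw source that feeds a dataset, however indirectly."""
    parents = graph.get(dataset, [])
    if not parents:                      # no inputs recorded -> raw source
        return {dataset}
    sources = set()
    for p in parents:
        sources |= upstream_sources(p, graph)
    return sources

print(upstream_sources("analytics.customer_scores", lineage))
# -> {'raw.customers_csv', 'raw.orders_csv'} (set order may vary)
```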

Loom Data Cataloging
- ActiveScan: automatically builds and maintains the catalog, generating technical metadata
- Technical Metadata: data location, format, structure, schema; data profiling statistics (see the sketch below); data previews; lineage
- Business Metadata: descriptive attributes, custom properties, business glossaries
- Search and Discovery: search over metadata, navigate relationships between entities
- Open API: a RESTful API developers can use to integrate their own applications and use cases and extend metadata management beyond Hadoop to other big data systems; multiple integration efforts underway within the Teradata portfolio
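
The profiling statistics in that technical metadata are conventional column summaries. A minimal sketch of the kind of profile an automated scan might compute, using pandas and a hypothetical input file (not ActiveScan's actual output):

```python
# Minimal column-profiling sketch; the input path is hypothetical and the
# chosen statistics are an illustration, not ActiveScan's actual output.
import pandas as pd

def profile(df):
    stats = {}
    for col in df.columns:
        s = df[col]
        stats[col] = {
            "dtype": str(s.dtype),
            "rows": len(s),
            "nulls": int(s.isna().sum()),
            "distinct": int(s.nunique()),
            "min": s.min() if pd.api.types.is_numeric_dtype(s) else None,
            "max": s.max() if pd.api.types.is_numeric_dtype(s) else None,
        }
    return stats

df = pd.read_csv("/data/raw/orders.csv")   # hypothetical dataset
for col, st in profile(df).items():
    print(col, st)
```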

Summary
- Find and Understand Your Data: data cataloging and profiling with ActiveScan; data exploration and discovery through the Workbench
- Prepare Your Data for Analysis: data wrangling with Weaver; SQL transforms with Hive
- Simplifies Hadoop use and management
- Increases analyst productivity

User Benefits
Analysts
- Find data fast: search and browse over metadata
- Understand data immediately: metadata gives context to the data
- Reuse work: lineage makes it easy to see what others have done
- Prepare your own data: self-service tools for running ad hoc transformations
Data Engineers
- Integrated metadata: deploy multiple processing technologies
- Quickly troubleshoot operational data pipelines: lineage provides the visibility you need

Governance and Productivity
- Data Catalog: a central list of all available data across the cluster, with a basic level of technical metadata and the ability to add business metadata
- Data Lineage: shows the relationship between raw data and derived data
- Data Quality?

Teradata Loom Editions
- Teradata Loom Community Edition: freely downloadable as an add-on for all Hadoop distributions: http://downloads.teradata.com/download/uda/teradata-loom
- Teradata Loom Enterprise Edition: premium version of Loom, subscription licensed on a per-node basis; fully featured and fully supported; supports all major Hadoop distributions; globally available, but English-only North American locale

Regulatory Compliance
- Security and auditing are platform-level capabilities
- These are built into Hadoop, though the distribution vendors have begun to evolve and implement their own custom solutions
- Securing data requires knowing what is in each file and what permissions it needs to have (see the scanning sketch below)
- Doing this manually is possible for small projects, but does not scale to the size of a data lake
- Vendor solutions (Dataguise, etc.) exist to help solve this problem
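
A toy sketch of the kind of automated scan such vendors provide, here just regex matching over local CSV files with assumed patterns and paths; commercial tools like Dataguise go far beyond this:

```python
# Toy sensitive-data scan: flag files that appear to contain SSN- or
# email-like values. Patterns and paths are illustrative assumptions only;
# commercial tools such as Dataguise are far more sophisticated.
import re
from pathlib import Path

PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_file(path):
    hits = set()
    text = Path(path).read_text(errors="ignore")
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.add(label)
    return hits

def scan_tree(root):
    """Walk a directory tree and report which files need restricted access."""
    for path in Path(root).rglob("*.csv"):
        hits = scan_file(path)
        if hits:
            print(f"{path}: contains {sorted(hits)} -> restrict permissions")

if __name__ == "__main__":
    scan_tree("/data/landing")   # hypothetical landing zone
```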

Search (screenshot)

Data Viewer (screenshot)

Data Lineage (screenshot)

Data Wrangling (screenshot)

Agile ELT for Hadoop (Financial Data Provider)
- Situation: Enterprise ETL solution in place for operational, mission-critical data pipelines.
- Problem: Analysts do not have access to raw and intermediate datasets. Exploratory analysis cannot be done without changes to long-running data governance processes.
- Solution: Migrate raw data to Hadoop. Organize and describe data in Loom. Provide analysts a self-service Workbench for data discovery and preparation.
- Impact: Improve speed of the analytics development process; provide broader access to raw and intermediate data; develop new insights to drive business value.

Data Governance for Hadoop (Bank Holding Company)
- Situation: Large-scale data lake planned, with many heterogeneous sources and many individual analyst users.
- Problem: Lack of a centralized metadata repository makes data governance impossible. The enterprise must have transparency into data in the cluster and the capability to define extensible metadata.
- Solution: Hadoop provides the data lake infrastructure. Loom provides centralized metadata management, with an automation framework.
- Impact: Co-location of data provides a more efficient workflow for analysts; Hadoop provides scalability at a lower cost than traditional systems; develop new insights to drive business value.

Telematics Data Analysis (Geospatial analytics for better risk management)
- Situation: An insurance company needs to accurately calculate scores and adjust risk premiums for enterprise fleets based on vehicle data, driver behavior, GPS data, and other data. Current custom-developed applications limit the effectiveness of these scores.
- Problem: Hadoop is used as the infrastructure for data storage and processing, but does not provide intuitive user interfaces for the business analysts who need access to data.
- Solution: The Loom Workbench provides a simple way for analysts to find and understand data in Hadoop. IT can easily enrich descriptions to add context for analysts. Weaver provides a simple interface for self-service data transformation.
- Impact: Quickly analyze data for informed decisions and ad hoc reporting; streamlined process to calculate vehicle and fleet scores; cost-effectively quantify, adjust and manage risk premiums.

Loom Architecture and Deployment (architecture diagram)
Components shown: Loom Workbench (Loom Interface), Registry Persistence, Loom API, Loom Services, Loom ActiveScan, Loom Server, and the Hadoop environment (HDFS, Hive/HCat, LDAP/Kerberos)

Community vs. Enterprise Features

Feature                                                   Community    Enterprise
Open metadata repository & API                            Yes          Yes
Automatic discovery & profiling of new data               Yes          Yes
Lineage tracking via Loom UI and Loom API                 Yes          Yes
Search                                                    Yes          Yes
Ambari monitoring (future)                                Yes          Yes
Data wrangling steps/operations                           Up to 20     Unlimited
Security authentication using Kerberos/LDAP               --           Yes
Execution of custom scripts during data discovery         --           Yes
Auto-lineage tracking for data movement outside Hadoop    --           Yes
Automated lineage tracking of Hive queries outside Loom   --           Yes
Support                                                   Community    Teradata

Regulatory Compliance
- Sensitive Data: determine security requirements for data; for large volumes of individual files/tables, automation is key
- Security:
  - Authentication: verify the identity of users
  - Authorization: lock down access to data based on user permissions
- Auditing: record every attempt to access data and ensure that authentication/authorization policies are being enforced (see the sketch below)
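
A toy sketch of the authorization-plus-audit pattern this slide describes, using an in-memory permission table and a log file; in Hadoop these roles are actually played by Kerberos, HDFS permissions/ACLs, and the audit logs of the services involved:

```python
# Toy authorization + audit sketch. The permission table and log format are
# illustrative assumptions; in Hadoop these roles are played by Kerberos,
# HDFS permissions/ACLs, and the audit logs of the services involved.
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="access_audit.log", level=logging.INFO)

PERMISSIONS = {                       # user -> datasets they may read
    "analyst_a": {"staging.customers_clean"},
    "steward_b": {"staging.customers_clean", "raw.customers_csv"},
}

def read_dataset(user, dataset):
    allowed = dataset in PERMISSIONS.get(user, set())
    # Every attempt is recorded, whether it succeeds or not.
    logging.info("%s user=%s dataset=%s allowed=%s",
                 datetime.now(timezone.utc).isoformat(), user, dataset, allowed)
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    return f"<contents of {dataset}>"

print(read_dataset("analyst_a", "staging.customers_clean"))
```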

Data Lake: Swamp or Reservoir? (images of a swamp and a reservoir)
