Data Governance in the Hadoop Data Lake
Michael Lang
May 2015
Introduction
- Product Manager for Teradata Loom
- Joined Teradata as part of the acquisition of Revelytix, the original developer of Loom
- VP of Sales Engineering at Revelytix
- Originally joined Revelytix in 2007
Data Governance in a Data Lake
- A data lake is a centralized repository of data into which many data-producing streams flow and from which downstream facilities draw for a variety of use cases
- [Diagram: Information Sources -> Data Lake -> Downstream Facilities]
- Data governance combines fundamental capabilities for managing and understanding data with specialized capabilities to meet regulatory requirements imposed on the data
Regulatory Compliance
- Ensuring that all legal requirements to store and protect data are satisfied (Sarbanes-Oxley, HIPAA, Basel II)
- Security, auditing, retention, backup
- Hadoop has built-in support for these capabilities
- Hadoop distribution vendors have all made improvements in each of these areas
- A variety of vendors provide specialized capabilities in each area that go beyond what a Hadoop distribution provides
Governance and Productivity
- Governance that supports day-to-day use of data
- Data workers need a strong understanding of what data is available and how datasets are related
  - Data engineers, data scientists, business analysts, data stewards, data owners
- Hadoop presents unique challenges
  - No central catalog
  - Schema-on-read
  - Multiple data formats
  - Multiple storage layers (HDFS, Hive, HBase)
  - Many processing engines (MapReduce, Hive, Pig, Impala, Drill)
  - Many workflow engines/schedulers (cron, Oozie, Falcon)
- A holistic view of the data with the required level of context is difficult to come by
Data Governance Fundamentals
- Ensuring that people working with data can easily find and understand what data is available and assess data quality and fitness for purpose
- Data catalog (a minimal catalog-entry sketch follows this slide)
  - Technical metadata
  - Business metadata
  - Search
- Data lineage
- All about productivity
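A minimal sketch of what a catalog entry ties together: technical metadata, business metadata, and lineage links. This is an illustrative data structure, not Loom's actual metadata model; the dataset names and fields are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    name: str                                # e.g. "clicks_raw"
    location: str                            # HDFS path or Hive table
    fmt: str                                 # storage format, e.g. "csv", "parquet"
    schema: Dict[str, str]                   # column name -> type (technical metadata)
    description: str = ""                    # business metadata: free-text description
    custom_properties: Dict[str, str] = field(default_factory=dict)
    upstream: List[str] = field(default_factory=list)   # lineage: names of source datasets

catalog: Dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    catalog[entry.name] = entry

def search(term: str) -> List[CatalogEntry]:
    """Naive search over names and descriptions."""
    term = term.lower()
    return [e for e in catalog.values()
            if term in e.name.lower() or term in e.description.lower()]

# Usage: register a raw dataset and a derived dataset, then search for them.
register(CatalogEntry("clicks_raw", "/data/raw/clicks", "csv",
                      {"ts": "timestamp", "user_id": "string", "url": "string"}))
register(CatalogEntry("clicks_daily", "/data/derived/clicks_daily", "parquet",
                      {"day": "date", "clicks": "bigint"},
                      description="Daily click counts", upstream=["clicks_raw"]))
print([e.name for e in search("click")])
```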
Teradata Solutions for Data Governance in Hadoop
- Think Big: Hadoop professional services
  - Hadoop Data Lake: packaged service/product offering to build and deploy high-quality, governed data lakes
- Loom: data management for Hadoop
  - Data cataloging, lineage, data wrangling
- RainStor: data archiving
  - Structured data archiving in Hadoop with robust security
- All recent acquisitions
- All standalone offerings, with some light integration options; Teradata UDA integration on roadmap
Think Big Data Lake Starter
- Enables a rapid build of an initial data lake
- Data lake build: provide recommendations and assistance in standing up an 8-16 node data lake on premises or in the cloud
- Implement and document 2-3 ingest pipelines
  - Robust infrastructure to support fast onboarding of new pipelines and use cases
- Implement an end-to-end security plan
  - Perimeter, authentication, authorization, and protection
- Integrated data cataloging and lineage through Loom
- Implement archiving, if required, through RainStor
Loom
- Find and understand your data
  - ActiveScan: data cataloging, event triggers, job detection and lineage creation, data profiling (statistics)
  - Workbench and Metadata Registry: data exploration and discovery, technical and business metadata, data sampling and previews, lineage relationships, search over metadata
  - REST API: easily integrate third-party apps
- Prepare your data
  - Data wrangling: self-service, interactive data wrangling for Hadoop; metadata tracked
  - HiveQL: joins, unions, aggregations, UDFs; metadata tracked in Loom
RainStor Overview
- Online archiving solution for Hadoop
- Compression
- MPP SQL query engine
- Encryption
- Auditing
- Security (authentication/authorization)
- Data import/export
  - FastForward: access to Teradata tape-format files
  - FastConnect: connector to Teradata EDWs
Summary
- Data governance is critical to building a successful data lake
- Fundamental governance capabilities make data workers more productive
- Solutions for meeting regulatory requirements are also needed
- Teradata Loom provides the required data cataloging and lineage capabilities
- RainStor provides an advanced archiving solution
- Think Big Data Lake provides the complete package
- Stop by our booth for a demo
Backup
Loom Data Wrangling
- Data preparation consumes a large amount of an analyst's time
- Data wrangling
  - Modify and combine column values to create new columns
  - Modify schemas: add/delete/rename columns, convert datatypes
- Hive
  - Joins, unions, aggregations
- Self-service, interactive UI for working with large data sets
  - Work with a sample of the data set for quick iteration
  - Once the sample is in the desired form, Loom applies all of the steps against the full data set via MapReduce (see the record-and-replay sketch after this slide)
- Leverages the Loom Metadata Registry
  - All data cleaning steps are tracked to provide a complete data lineage picture from the raw source data to the data sets used for analytics
  - Users benefit from the context provided by metadata in the Loom Registry
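A minimal sketch of the record-and-replay pattern described above: wrangling steps are defined interactively against a small sample, recorded as a recipe, and then replayed unchanged against the full data set. This is a conceptual illustration only, not Loom's implementation (which replays the steps via MapReduce); the column names and helper functions are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[Row], Row]

class Recipe:
    def __init__(self) -> None:
        self.steps: List[Step] = []          # ordered wrangling steps (also a lineage record)

    def add(self, step: Step) -> "Recipe":
        self.steps.append(step)
        return self

    def apply(self, rows: List[Row]) -> List[Row]:
        for step in self.steps:
            rows = [step(dict(r)) for r in rows]
        return rows

def rename_column(old: str, new: str) -> Step:
    def step(row: Row) -> Row:
        row[new] = row.pop(old)
        return row
    return step

def derive_column(name: str, fn: Callable[[Row], str]) -> Step:
    def step(row: Row) -> Row:
        row[name] = fn(row)
        return row
    return step

# Build the recipe against a small sample for quick iteration ...
recipe = (Recipe()
          .add(rename_column("cust", "customer_id"))
          .add(derive_column("full_name", lambda r: f"{r['first']} {r['last']}")))
sample = [{"cust": "42", "first": "Ada", "last": "Lovelace"}]
print(recipe.apply(sample))

# ... then the same recorded steps are applied to the full data set.
# full_result = recipe.apply(load_full_dataset())   # hypothetical loader
```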
Loom Data Lineage
- Loom uses multiple methods to collect lineage metadata (a minimal lineage-graph sketch follows this slide):
  - Loom-initiated transforms: Data Wrangling, Hive
  - ActiveScan job detection: TDCH, Sqoop
  - API: Hive, RainStor (Q3 2015), Think Big Data Lake (Q2 2015)
    - Services engagements can extend this to virtually any execution engine
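A minimal sketch of the lineage graph that such collection methods populate: jobs and datasets are nodes, and each job records which datasets it read and wrote. This is a hypothetical illustration, not Loom's actual metadata model; the job and dataset names are made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

class LineageGraph:
    def __init__(self) -> None:
        self.inputs: Dict[str, List[str]] = defaultdict(list)   # job -> datasets read
        self.outputs: Dict[str, List[str]] = defaultdict(list)  # job -> datasets written
        self.produced_by: Dict[str, str] = {}                   # dataset -> producing job

    def record_job(self, job: str, reads: List[str], writes: List[str]) -> None:
        self.inputs[job].extend(reads)
        self.outputs[job].extend(writes)
        for ds in writes:
            self.produced_by[ds] = job

    def upstream(self, dataset: str, seen: Optional[Set[str]] = None) -> Set[str]:
        """All datasets a given dataset was derived from, transitively."""
        seen = set() if seen is None else seen
        job = self.produced_by.get(dataset)
        if job is None:
            return seen
        for src in self.inputs[job]:
            if src not in seen:
                seen.add(src)
                self.upstream(src, seen)
        return seen

# Usage: a Sqoop import followed by a Hive aggregation.
g = LineageGraph()
g.record_job("sqoop_import_orders", reads=["edw.orders"], writes=["/raw/orders"])
g.record_job("hive_daily_rollup", reads=["/raw/orders"], writes=["orders_daily"])
print(g.upstream("orders_daily"))   # {'/raw/orders', 'edw.orders'}
```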
Loom Data Cataloging
- ActiveScan
  - Automatically builds and maintains the catalog
  - Generates technical metadata
- Technical metadata
  - Data location, format, structure, schema
  - Data profiling statistics
  - Data previews
  - Lineage
- Business metadata
  - Descriptive attributes
  - Custom properties
  - Business glossaries
- Search and discovery
  - Search over metadata
  - Navigate relationships between entities
- Open API
  - RESTful API that developers can use to integrate their own applications and use cases, and to extend metadata management beyond Hadoop to other big data systems (see the client sketch after this slide)
  - Multiple integration efforts underway within the Teradata portfolio
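A minimal sketch of how a third-party application might call a metadata REST API like the one described above. The host, port, endpoint paths, and JSON fields here are hypothetical placeholders for illustration only; they are not the documented Loom API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://loom-server.example.com:8080/api"   # hypothetical server and path

def search_datasets(term: str) -> list:
    """Search the catalog's metadata for datasets matching a term (hypothetical endpoint)."""
    url = f"{BASE_URL}/datasets?q={urllib.parse.quote(term)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def tag_dataset(dataset_id: str, key: str, value: str) -> None:
    """Attach a custom business-metadata property to a dataset (hypothetical endpoint)."""
    body = json.dumps({key: value}).encode("utf-8")
    req = urllib.request.Request(f"{BASE_URL}/datasets/{dataset_id}/properties",
                                 data=body,
                                 headers={"Content-Type": "application/json"},
                                 method="POST")
    urllib.request.urlopen(req).close()

if __name__ == "__main__":
    for ds in search_datasets("clickstream"):
        print(ds.get("name"), ds.get("location"))
    tag_dataset("1234", "data_steward", "jane.doe@example.com")
```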
Summary
- Find and understand your data
  - Data cataloging and profiling with ActiveScan
  - Data exploration and discovery through the Workbench
- Prepare your data for analysis
  - Data wrangling with Weaver
  - SQL transforms with Hive
- Simplifies Hadoop use and management
- Increases analyst productivity
User Benefits
- Analysts
  - Find data fast: search and browse over metadata
  - Understand data immediately: metadata gives context to the data
  - Reuse work: lineage makes it easy to see what others have done
  - Prepare your own data: self-service tools for running ad hoc transformations
- Data engineers
  - Integrated metadata: deploy multiple processing technologies
  - Quickly troubleshoot operational data pipelines: lineage provides the visibility you need
Governance and Productivity
- Data catalog: a central list of all available data across the cluster, with a basic level of technical metadata and the ability to add business metadata
- Data lineage: shows the relationship between raw data and derived data
- Data quality?
Teradata Loom Editions
- Teradata Loom Community Edition
  - Freely downloadable as an add-on for all Hadoop distributions: http://downloads.teradata.com/download/uda/teradata-loom
- Teradata Loom Enterprise Edition
  - Premium version of Loom, subscription licensed on a per-node basis
  - Fully featured and fully supported
  - Supports all major Hadoop distributions
  - Globally available, but English-only (North American locale)
Regulatory Compliance
- Security and auditing are platform-level capabilities
  - These are built into Hadoop, though the distribution vendors have begun to evolve and implement their own custom solutions
- Securing data requires that you know what is in each file and what permissions it needs to have
  - Doing this manually is possible for small projects, but does not scale to the levels of a data lake (see the automation sketch after this slide)
  - Vendor solutions exist to help solve this problem (Dataguise, etc.)
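A minimal sketch of the kind of automation such tools provide: scan a sample of each file for sensitive patterns and tighten HDFS ACLs accordingly, using the standard `hdfs dfs` CLI. The regexes, paths, and group names are illustrative assumptions; real products use far more sophisticated detection and policy management.

```python
import re
import subprocess

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def list_files(hdfs_dir: str) -> list:
    """Return the paths under a directory (hdfs dfs -ls -C prints paths only)."""
    out = subprocess.run(["hdfs", "dfs", "-ls", "-C", hdfs_dir],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

def sample_text(hdfs_path: str, max_bytes: int = 65536) -> str:
    # Fetch the file and keep only its head; enough for a sketch (real tools stream/sample).
    out = subprocess.run(["hdfs", "dfs", "-cat", hdfs_path],
                         capture_output=True, check=True).stdout[:max_bytes]
    return out.decode("utf-8", errors="replace")

def restrict(hdfs_path: str) -> None:
    # Remove access for the general analyst group; grant read to a restricted group.
    subprocess.run(["hdfs", "dfs", "-setfacl", "-m",
                    "group:analysts:---,group:pii_readers:r--", hdfs_path], check=True)

def scan_and_secure(hdfs_dir: str) -> None:
    for path in list_files(hdfs_dir):
        text = sample_text(path)
        hits = [name for name, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]
        if hits:
            print(f"{path}: sensitive data detected ({', '.join(hits)}); restricting")
            restrict(path)

if __name__ == "__main__":
    scan_and_secure("/data/landing")   # hypothetical landing directory
```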
Search [screenshot]
Data Viewer [screenshot]
Data Lineage [screenshot]
Data Wrangling [screenshot]
Agile ELT for Hadoop: Financial Data Provider
- Situation: Enterprise ETL solution in place for operational, mission-critical data pipelines.
- Problem: Analysts do not have access to raw and intermediate datasets. Exploratory analysis cannot be done without changes to long-running data governance processes.
- Solution: Migrate raw data to Hadoop. Organize and describe data in Loom. Provide analysts a self-service Workbench for data discovery and preparation.
- Impact:
  - Improve speed of the analytics development process
  - Provide broader access to raw and intermediate data
  - Develop new insights to drive business value
Data Governance for Hadoop: Bank Holding Company
- Situation: Large-scale data lake planned, with many heterogeneous sources and many individual analyst users.
- Problem: Lack of a centralized metadata repository makes data governance impossible. The enterprise must have transparency into data in the cluster and the capability to define extensible metadata.
- Solution: Hadoop provides the data lake infrastructure. Loom provides centralized metadata management with an automation framework.
- Impact:
  - Co-location of data provides a more efficient workflow for analysts
  - Hadoop provides scalability at a lower cost than traditional systems
  - Develop new insights to drive business value
Telematics Data Analysis: Geospatial Analytics for Better Risk Management
- Situation: An insurance company needs to accurately calculate scores and adjust risk premiums for enterprise fleets based on vehicle data, driver behavior, GPS data, and other data. Current custom-developed applications limit the effectiveness of these scores.
- Problem: Hadoop is used as the infrastructure for data storage and processing, but does not provide intuitive user interfaces for the business analysts who need access to the data.
- Solution: Loom Workbench provides a simple way for analysts to find and understand data in Hadoop. IT can easily enrich descriptions to add context for analysts. Weaver provides a simple interface for self-service data transformation.
- Impact:
  - Quickly analyze data for informed decisions and ad hoc reporting
  - Streamlined process to calculate vehicle and fleet scores
  - Cost-effectively quantify, adjust, and manage risk premiums
Loom Architecture and Deployment
- [Architecture diagram: Loom Server components (Loom Workbench, Loom Interface, Loom API, Loom Services, Loom ActiveScan, Registry Persistence) deployed against the Hadoop environment (HDFS, Hive/HCat, LDAP/Kerberos)]
Community vs. Enterprise
Feature | Community | Enterprise
Open metadata repository & API | Yes | Yes
Automatic discovery & profiling of new data | Yes | Yes
Lineage tracking via Loom UI and Loom API | Yes | Yes
Search | Yes | Yes
Ambari monitoring (future) | Yes | Yes
Data wrangling steps/operations | Up to 20 | Unlimited
Security: authentication using Kerberos/LDAP | No | Yes
Execution of custom scripts during data discovery | No | Yes
Auto-lineage tracking for data movement outside Hadoop | No | Yes
Automated lineage tracking of Hive queries outside Loom | No | Yes
Support | Community | Teradata
Regulatory Compliance
- Sensitive data
  - Determine security requirements for data across large volumes of individual files/tables: automation is key
- Security
  - Authentication: verify the identity of users
  - Authorization: lock down access to data based on user permissions
- Auditing
  - Record every attempt to access data and ensure that authentication/authorization policies are being enforced
Data Lake: Swamp or Reservoir?
- [Image: swamp vs. reservoir]