Data Governance in the Hadoop Data Lake. Kiran Kamreddy May 2015




One Data Lake: Many Definitions
A centralized repository of raw data into which many data-producing streams flow and from which downstream facilities may draw.
Diagram: Information Sources -> Data Lake -> Downstream Facilities
Data variety is the driving factor in building a data lake.
NDA CONFIDENTIAL

Data Lake Maturity & Risks
- Data lake initiatives usually start small
- With more data sources, more applications, more business units and more users, they grow into more complex environments
- Without proper governance mechanisms, data lakes risk turning into data swamps
2014 Teradata

What Is Data Governance? And What Can It Do for My Data Lake?
Fundamental capabilities for organizing, managing and understanding data
  - Where did my data come from? How is it being transformed?
  - Track usage, resolve anomalies, visualize, optimize and clarify data lineage
  - Search and access data (not only browse)
  - Assess data quality and fitness for purpose
Specialized capabilities to meet regulatory/compliance requirements
  - Govern who can and cannot access the data
  - Data life cycle management, archiving and retention policies
  - Auditing, compliance
Take a governance-first approach to keep the data lake from turning into a data swamp
Retrofitting data governance is not feasible

Governance and Productivity
- Governance should support day-to-day use of data
- Data workers need a strong understanding of the data
- Roles for data stewards, data owners and data analysts/scientists need to be assigned
- Operational metadata is critical to understanding
  - Where did it come from?
  - What is the environment? Landing zone, OS, line of business
  - What processes touched my data? Did you lose any data? Checksums etc.
  - When did the data get ingested, transformed?
  - Did it get exported? When, where, and how will it be used (organizational)?
- Provision consistent ingest methods that track operational metadata
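As an illustration of an ingest method that tracks operational metadata, here is a minimal Python sketch. It is not taken from the slides or from any Teradata product. The `IngestRecord` fields and the `record_ingest` and `matches` helpers are hypothetical names chosen for this example, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import BinaryIO


@dataclass(frozen=True)
class IngestRecord:
    """Operational metadata for one ingested object (hypothetical schema)."""
    source: str            # where the data came from
    landing_zone: str      # where it landed
    line_of_business: str  # organizational context
    byte_count: int
    sha256: str            # checksum used to detect data loss
    ingested_at: str       # ISO 8601 timestamp in UTC


def record_ingest(stream: BinaryIO, source: str, landing_zone: str,
                  line_of_business: str, chunk_size: int = 1 << 20) -> IngestRecord:
    """Read the stream in chunks and return the metadata a catalog could store."""
    digest = hashlib.sha256()
    byte_count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        byte_count += len(chunk)
    return IngestRecord(
        source=source,
        landing_zone=landing_zone,
        line_of_business=line_of_business,
        byte_count=byte_count,
        sha256=digest.hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )


def matches(record: IngestRecord, landed: BinaryIO) -> bool:
    """Recompute size and checksum on the landed copy and compare to the record."""
    check = record_ingest(landed, record.source, record.landing_zone,
                          record.line_of_business)
    return (check.sha256, check.byte_count) == (record.sha256, record.byte_count)


if __name__ == "__main__":
    payload = b"id,amount\n1,10\n2,20\n"
    record = record_ingest(io.BytesIO(payload), "crm_export.csv",
                           "/landing/crm", "retail")
    print(record.byte_count, matches(record, io.BytesIO(payload)))
```

Routing every pipeline through one function like this is what makes the questions on the slide answerable later: each record says where the data came from, when it arrived, and whether the landed copy still matches.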

What Is Regulatory Compliance?
Compliance and Regulatory
  - Capture, store and move data
  - Sarbanes-Oxley, HIPAA, Basel II
Security
  - Authorization, authentication
  - Handling sensitive data
Auditing
  - Recording every attempt to access
Archive & Retention
  - Data life cycle policies
3 Market Approaches
  - Apache Hadoop has built-in support for these capabilities
  - Hadoop distribution vendors have all made improvements in each of these areas
  - A variety of vendors provide specialized capabilities in each area that go beyond what a Hadoop distribution provides
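The pairing of authorization with auditing can be sketched in a few lines of Python. This is a toy in-memory model under assumed names (`AuditedStore`, `permissions`, `audit_log`), not the mechanism of Hadoop or of any vendor product. It shows one point only: denied attempts are logged as well as allowed ones.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedStore:
    """Toy dataset store; the permission model here is hypothetical."""
    permissions: dict[str, set[str]]  # dataset name -> users allowed to read
    audit_log: list[dict] = field(default_factory=list)

    def read(self, user: str, dataset: str) -> bool:
        """Check authorization and record the attempt, allowed or not."""
        allowed = user in self.permissions.get(dataset, set())
        self.audit_log.append({
            "user": user,
            "dataset": dataset,
            "action": "read",
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return allowed


if __name__ == "__main__":
    store = AuditedStore(permissions={"claims": {"alice"}})
    store.read("alice", "claims")
    store.read("bob", "claims")
    for entry in store.audit_log:
        print(entry["user"], entry["dataset"], entry["allowed"])
```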

Data Governance Challenges on Hadoop
Hadoop is different from a data warehouse (DW)
  - Scale: high volumes of data, multiple user access
  - Variety: schema-on-read, multiple formats of data
  - Multiple storage layers (HDFS, Hive, HBase)
  - Many processing engines (MR, Hive, Pig, Impala, Drill)
  - Many workflow engines/schedulers (Cron, Oozie, Falcon)
  - Holistic view of data with required context is difficult
Hadoop needs less stringent, more flexible mechanisms
Balance agility and self-service with processes, rules and regulations
Maintain governance without losing Hadoop's power

Teradata's Approach for Data Governance in Hadoop
Teradata Loom: Integrated Data Management for Hadoop
  - Metadata management, lineage, data wrangling
  - Automatic data cataloging, data profiling and statistics generation
Teradata RainStor: Data Archiving
  - Structured data archiving in Hadoop with robust security
  - Compliance and auditing
Think Big: Hadoop professional services
  - Hadoop Data Lake packaged service/product offering to build and deploy high-quality, governed data lakes

Teradata Loom
Find and Understand Your Data
ActiveScan
  - Data cataloging
  - Event triggers
  - Job detection and lineage creation
  - Data profiling (statistics)
Workbench and Metadata Registry
  - Data exploration and discovery
  - Technical and business metadata
  - Data sampling and previews
  - Lineage relationships
  - Search over metadata
  - REST API: easily integrate third-party apps
Prepare Your Data
Data Wrangling
  - Self-service, interactive data wrangling for Hadoop
  - HiveQL: joins, unions, aggregations, UDFs
  - Metadata tracked in Loom

Teradata RainStor
- Retain data online that is queryable for an indefinite period
- Retire data that is no longer required with auto-expiration policies
- Comply with strict government rules and regulations
- Retain the metadata as it was originally captured
- Store tamper-proof, immutable (unchangeable) data
- Maintain availability of data as RDBMS versions change or expire
- Compression, MPP SQL query engine, encryption, auditing
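To make the idea of an auto-expiration policy concrete, here is a minimal Python sketch. It does not describe how RainStor implements retention. The `ArchivedDataset` fields, the `legal_hold` flag and the `expired` helper are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ArchivedDataset:
    """One archived dataset with its retention rule (hypothetical schema)."""
    name: str
    archived_on: date
    retention_days: int
    legal_hold: bool = False  # assumed flag: held data is never auto-expired

    def expires_on(self) -> date:
        return self.archived_on + timedelta(days=self.retention_days)


def expired(datasets: list[ArchivedDataset], today: date) -> list[str]:
    """Return names of datasets whose retention period has passed."""
    return [
        d.name
        for d in datasets
        if not d.legal_hold and d.expires_on() <= today
    ]


if __name__ == "__main__":
    catalog = [
        ArchivedDataset("web_logs", date(2015, 1, 1), retention_days=30),
        ArchivedDataset("trade_records", date(2015, 1, 1), retention_days=3650),
        ArchivedDataset("disputed_claims", date(2015, 1, 1), retention_days=30,
                        legal_hold=True),
    ]
    print(expired(catalog, today=date(2015, 5, 1)))
```

Keeping the policy as data on each dataset, rather than in a separate cleanup script, means the retention rule travels with the metadata and can be audited alongside it.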

Think Big Data Lake Starter
Enables a rapid build for an initial data lake
- Data Lake Build: provide recommendations and assistance in standing up an 8-16 node data lake on premises or in the cloud
- Implement and document 2-3 ingest pipelines
  - Robust infrastructure to support fast onboarding of new pipelines and use cases
- Implement an end-to-end security plan
  - Perimeter, authentication, authorization and protection
- Integrated data cataloging and lineage through Loom
- Implement archiving, if required, through RainStor

Big Data Services from Think Big
Services:
- Big Data Strategy & Roadmap
- Data Lake Implementation
- Analytics & Data Science
- Training & Support
Challenges addressed:
- Lack of clear big data strategy
- Data scattered and not well understood
- Difficulty turning data into action
- Missing big data skills
Focused exclusively on tying Hadoop and big data solutions to measurable business value
2015 Think Big, a Teradata Company

Data Governance for Hadoop: Bank Holding Company
Situation: Large-scale data lake planned with many heterogeneous sources and many individual analyst users.
Problem: Lack of a centralized metadata repository makes data governance impossible. The enterprise must have transparency into data in the cluster and the capability to define extensible metadata.
Solution: Hadoop provides data lake infrastructure. Loom provides centralized metadata management, with an automation framework.
Impact:
- Co-location of data provides a more efficient workflow for analysts
- Hadoop provides scalability at a lower cost than traditional systems
- Develop new insights to drive business value

Summary
- Data governance is critical to building a successful data lake
- Fundamental governance capabilities make data workers more productive
- Solutions for meeting regulatory requirements are also needed
- Teradata Loom provides the data cataloging and lineage capabilities required to make Hadoop users more productive
- RainStor provides an advanced archiving solution
- Think Big Data Lake provides the complete package
Stop by Our Booth for a Demo
