How to avoid building a data swamp

Similar documents
Ganzheitliches Datenmanagement

More Data in Less Time

Business Data Authority: A data organization for strategic advantage

The Future of Data Management

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

The Enterprise Data Hub and The Modern Information Architecture

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Data Governance in the Hadoop Data Lake. Kiran Kamreddy May 2015

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Deploying an Operational Data Store Designed for Big Data

Cloudera Enterprise Data Hub. GCloud Service Definition Lot 3: Software as a Service

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Cloudera Enterprise Data Hub in Telecom:

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

How to Run a Successful Big Data POC in 6 Weeks

Building Your Big Data Team

The Data Reservoir as an enabler of differentiating Analytics initiatives

Q1 Labs Corporate Overview

What is Security Intelligence?

Klarna Tech Talk: Mind the Data! Jeff Pollock InfoSphere Information Integration & Governance

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Integrating a Big Data Platform into Government:

Enable Business Agility and Speed Empower your business with proven multidomain master data management (MDM)

Bringing Strategy to Life Using an Intelligent Data Platform to Become Data Ready. Informatica Government Summit April 23, 2015

Big Data at Cloud Scale

Datameer Big Data Governance

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

The Future of Data Management with Hadoop and the Enterprise Data Hub

What's New in SAS Data Management

Fighting Cyber Fraud with Hadoop. Niel Dunnage Senior Solutions Architect

Hadoop Trends and Practical Use Cases. April 2014

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

In ediscovery and Litigation Support Repositories MPeterson, June 2009

Qlik Sense Enabling the New Enterprise

DATA GOVERNANCE AT UPMC. A Summary of UPMC s Data Governance Program Foundation, Roles, and Services

Welkom! Copyright 2014 Oracle and/or its affiliates. All rights reserved.

Optimized for the Industrial Internet: GE s Industrial Data Lake Platform

Building Confidence in Big Data Innovations in Information Integration & Governance for Big Data

Big Data must become a first class citizen in the enterprise

Data Security: Strategy and Tactics for Success

Interactive data analytics drive insights

8 Tips for Winning the IT Asset Management Challenge START

Paxata Security Overview

Databricks. A Primer

HITACHI DATA SYSTEMS HADOOP SOLUTION JUNE 12, 2012

Introduction to Glossary Business

Salesforce Certified Data Architecture and Management Designer. Study Guide. Summer 16 TRAINING & CERTIFICATION

Data Integration Checklist

Investor Presentation. Second Quarter 2015

ILM et Archivage Les solutions IBM

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Informatica Data Quality Product Family

Securing Your Enterprise Hadoop Ecosystem Comprehensive Security for the Enterprise with Cloudera

IBM Software Integrating and governing big data

Cisco Data Preparation

A TECHNICAL WHITE PAPER ATTUNITY VISIBILITY

CrossPoint for Managed Collaboration and Data Quality Analytics

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

Databricks. A Primer

Knowledgent White Paper Series. Developing an MDM Strategy WHITE PAPER. Key Components for Success

From Lab to Factory: The Big Data Management Workbook

Are You Big Data Ready?

Big Data Integration: A Buyer's Guide

Mastering Big Data. Steve Hoskin, VP and Chief Architect INFORMATICA MDM. October 2015

Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Seven Things To Consider When Evaluating Privileged Account Security Solutions

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

OWB Users, Enter The New ODI World

Get More Value from Your Reference Data Make it Meaningful with TopBraid RDM

A Holistic Framework for Enterprise Data Management DAMA NCR

Log management & SIEM: QRadar Security Intelligence Platform

Prevent cyber attacks. SEE. what you are missing. Netw rk Infrastructure Security Management

Informatica PowerCenter The Foundation of Enterprise Data Integration

Informatica Data Quality Product Family

IRMAC SAS INFORMATION MANAGEMENT, TRANSFORMING AN ANALYTICS CULTURE. Copyright 2012, SAS Institute Inc. All rights reserved.

Analytics With Hadoop. SAS and Cloudera Starter Services: Visual Analytics and Visual Statistics

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Analance Data Integration Technical Whitepaper

Measure Your Data and Achieve Information Governance Excellence

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Reporting component for templates, reports and documents. Formerly XML Publisher.

Key Management Best Practices

locuz.com Big Data Services

Agile Business Intelligence Data Lake Architecture

Apache Sentry. Prasad Mujumdar

The IBM Cognos Platform

White Paper. Unified Data Integration Across Big Data Platforms

Unified Data Integration Across Big Data Platforms

Detect & Investigate Threats. OVERVIEW

Information Management & Data Governance

SAP Agile Data Preparation

XpoLog Center Suite Data Sheet

Unleash your intuition

Transcription:

How to avoid building a data swamp Case studies in Hadoop data management and governance Mark Donsky, Product Management, Cloudera Naren Korenu, Engineering, Cloudera 1

Abstract DELETE How can you make sure your data doesn t get lost in Apache Hadoop? How can users across your company find the data they care about, and know that it s trustworthy? How can you protect yourself against a data breach? How can you ingest more and more data while meeting these needs? To take advantage of the rising volume, variety, and speed of data, companies must rethink their data management architectures so they can meet current and future business needs. Find out how some of the world s most sophisticated Hadoop deployments are addressing these data challenges head-on, while preserving Hadoop s flexibility, through an integrated data management and governance approach for Hadoop. We ll discuss how users can discover, trust, protect, and govern the data that matters most, for its entire lifecycle. This presentation will feature a live demo and cover audit, lineage, unified metadata, policy management, as well as leading partner integrations for end-to-end visibility. 2

The benefits of Hadoop... One place for unlimited data All types More sources Faster, larger ingestion Unified, multi-framework data access More users More tools Faster changes 3

can create a data swamp: leading to trust, visibility, and governance challenges Compliance Officer Data Stewards & Curators Data Scientists & BI Users Hadoop Admins & DBAs Track, understand and protect access to data Manage and organize data assets at Hadoop scale Effortlessly find and trust the data that matters most Boost user productivity and cluster performance Am I prepared for an audit? How do efficiently manage data lifecycle, from ingest to purge? How can I explore data on my own? How is data being used today? Who s accessing what data? What are they doing with the data? How do I classify data efficiently? Can I trust what I find? How do I use what I find? How can I optimize for future workloads? Is sensitive data governed and protected? How do I make data available to my end users efficiently? How do I find and use related data sets? How can I quickly take advantage of Hadoop risk-free? 4

Hadoop needs a governance foundation Compliance Officer Data Stewards & Curators Data Scientists & BI Users Hadoop Admins & DBAs Track, understand and protect access to data Manage and organize data assets at Hadoop scale Effortlessly find and trust the data that matters most Boost user productivity and cluster performance Am I prepared for an audit? Who s accessing what data? What are they doing with the data? Is sensitive data governed and protected? How do efficiently manage data lifecycle, from ingest to purge? How do I classify data efficiently? How do I make data available to my end users efficiently? How can I explore data on my own? Can I trust what I find? How do I use what I find? How do I find and use related data sets? How is data being used today? How can I optimize for future workloads? How can I quickly take advantage of Hadoop risk-free? Hadoop Governance Foundation Unified metadata Unified lineage Unified auditing Universal policies 5

Enterprise metadata is the foundation of governance Metadata enables you to put context and meaning to data Business Business Glossary Enterprise Taxonomy Ontology Technical Database Schema File Definition ETL Job Design BI Report Definition Data Model Operational Job Run-Time Stats Report Run Information Hardware Usage Scheduler Stats Audit Logs Lineage Policies Technical Metadata Business Metadata 6

Lineage provides context Lineage describes the relationships between data sets: Provenance: what data sets were used to create this data set? Impact: what data sets are derived from this data set? ETL-derived lineage doesn t cover end-user transformations, yet you typically need to capture 100% of lineage for compliance Lineage is embedded in Hadoop technical metadata, but it s hard to access 7

Data policies enable scalability and predictability Information is of limited use unless it is actionable There is a treasure trove of actionable information in the metadata that the various Hadoop services emit Archival of unused data (date created > 7 years ago) Encryption of sensitive data (tags:sensitive) Remediation of incorrect permissions (permissions:rwxrwxrwx and tags:sensitive) Triggers should be configurable based on user-defined criteria Hadoop does not offer a sufficient policy engine or action framework 8

Hadoop Governance Case Studies 9

Compliance Use Cases Compliance Officer Track, understand and protect access to data HADOOP DATA GOVERNANCE & MANAGEMENT Unified metadata Unified lineage Unified auditing Am I prepared for an audit? Who s accessing what data? ENTERPRISE METADATA REPOSITORY ENTERPRISE AUDITING & SECURITY What are they doing with the data? Is sensitive data governed and protected? Common use cases: Security breach detection Data access/usage for PCI compliance 10

Stewardship & Curation Use Cases Data Stewards & Curators Manage and organize data assets at Hadoop scale Define Business Metrics & Glossary Deliver Visualizations, Analytics, Reporting Across Systems How do efficiently manage data lifecycle, from ingest to purge? Ingest & Prepare: Landing Area Analyze, Discover, Search Data Clean, Transform, Refine Data How do I classify data efficiently? How do I make data available to my end users efficiently? HADOOP DATA GOVERNANCE & MANAGEMENT 11

Stewardship & Curation Use Cases Data Stewards & Curators Manage and organize data assets at Hadoop scale Define Business Metrics & Glossary Deliver Visualizations, Analytics, Reporting Across Systems How do efficiently manage data lifecycle, from ingest to purge? Ingest & Prepare: Landing Area Analyze, Discover, Search Data Clean, Transform, Refine Data How do I classify data efficiently? How do I make data available to my end users efficiently? HADOOP DATA GOVERNANCE & MANAGEMENT 12

A few more data management use cases Self-service discovery portal built on top of curated artifacts Curators set up business metadata IT creates a portal using the business metadata This use case is common in healthcare and pharmaceuticals Recreating deleted artifacts Someone runs rm rf on their directory Use lineage and audit queries to reconstruct data sets 13

Metadata interoperability is most important Governing only Hadoop is never enough All metadata must flow freely between enterprise systems, throughout the entire organization From Hadoop to enterprise lineage system From Hadoop to SIEM tools for threat detection This requires open interoperability with all the leading tools Cloudera has introduced the Navigator SDK, an Apache-licensed set of APIs for interoperability, available at http://github.com/cloudera/navigator-sdk 14

Cloudera Navigator Overview & Demo 15

Learn more Please stop by the Cloudera booth! See a demo of Cloudera Enterprise, including our governance solution that s used by over 200 production customers for over two years! Find out what makes Cloudera Enterprise the only PCI-certified Hadoop distribution Learn about our 2000+ partner ecosystem 16

Questions 17

Thank You! md@cloudera.com naren@cloudera.com 18