BIG DATA GOVERNANCE: BALANCING BIG DATA VELOCITY & INFORMATION GOVERNANCE




Size Matters. The success of big data projects requires access to huge sets of high-quality information. Compliant data represents the largest set of high-quality business information within an enterprise. Attend this session to learn how to drive compliance and governance into your data lake and deliver the information that your strategic initiatives require.


BIG DATA GOVERNANCE
PETER SMERALD
SENIOR DIRECTOR, PRODUCT MARKETING & ENABLEMENT

ROADMAP INFORMATION DISCLAIMER
EMC makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, "Roadmap Information"). Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby. Roadmap Information is EMC Restricted Confidential and is provided under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.

THE GREAT DISMAL SWAMP

BACKGROUND ON THIS SESSION
Data lakes have begun to resemble a new generation of repositories
Documentum is a world leader in repositories
  Extensive experience with large-scale compliant repositories
  Extensive experience with combined storage and application-level systems (Centera and Documentum)
Compliant storage is not an archive

SESSION OBJECTIVES
HYDROSERE SUCCESSION
Build the case that compliance is the key component to accessing enterprise data
Propose a pragmatic information architecture to establish data integrity controls
Begin the dialog

SUCCESS = ƒ(QUANTITY)
ACCESS = ƒ(COMPLIANCE)
Size Matters: The success of big data projects requires access to huge sets of high-quality information.

Success = ƒ(quantity)
Organizationally, the most comprehensive, most valuable information resides in applications.

GROUPING / PACKAGED APP / VALUE
TRANSACTION APPLICATIONS: Rich, highly validated transaction data.
PRINT STREAMS (packaged app: CMOD): Valuable customer communications and financial reporting history.
CONTENT AND IMAGES: Images are a comprehensive archive of legal agreements. Content is vast quantities of work products.
INTERACTION APPLICATIONS, COLLABORATIVE APPLICATIONS: Enriched and semi-structured content, a particularly important source of communications history.

PROVIDING DATA FOR THE LAKE
Format Considerations
Applications
  Typical format: Structured data, highly normalized table structures
  Why problematic: Very difficult to construct a business object
Images
  Typical format: Multi-page TIFF, with little metadata and minimal text
  Why problematic: Little to no textual information
Print streams
  Typical format: Large files with proprietary formats (e.g., AFP, PCL, PostScript, multiple PDFs)
  Why problematic: Massive sizes, not easily parsed
Unstructured documents
  Typical format: Too many to count
  Why problematic: No structure, little to no way to understand what is documented

PROVIDING DATA FOR THE LAKE
Acquisition Method
Duplicate: ETL and migration tools into the data lake
  Pros: Represents very distilled and highly valuable information
  Cons: Non-compliant; poor quality
Integrate: APIs to applications
  Pros: Real-time, elegant
  Cons: Expensive; rigid
Aggregate/archive: Archive data
  Pros: Generally pays for itself with infrastructure savings
  Cons: Requires a new mindset

A blended approach is needed
ETL loads for non-compliant information
APIs for hugely important systems that require real-time access because the velocity is so high
An HDFS-capable compliant archive for everything else

Access = ƒ(compliance)
Without proper controls, the compliance, risk, and/or legal teams will block efforts to move data into the lake.

DATA INTEGRITY: THE GREAT DIVIDE
Data Scientists / Records Managers
Layers of defensibility:
  1. Being able to do the right thing,
  2. Doing the right thing, and
  3. Proving the right things are being done.

THE TRUTH, THE WHOLE TRUTH, AND NOTHING BUT THE TRUTH
Data integrity
Retention
Legal holds
Chain of custody
Security
Privacy
Auditability

REQUIREMENTS: RETENTION AND LEGAL HOLDS
Retention:
  Based on record type, not format
  Date based and event based
  Occasionally one record is controlled by multiple retention policies
Legal holds:
  Cross content types
  Jurisdictions dictate disposition
  Occasionally one record is subject to multiple holds
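The retention and legal hold requirements above can be sketched in code. This is a minimal illustration, not a description of any EMC product: the class names, the `POLICIES` table, and the rule that the longest applicable retention period wins are all assumptions made for the example. Policies are keyed by record type (not format), may be date based or event based, and any active legal hold blocks disposition.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class RetentionPolicy:
    name: str
    retain_days: int
    # None means date based (clock starts at creation);
    # otherwise the clock starts when the named event occurs.
    trigger_event: Optional[str] = None


@dataclass
class Record:
    record_type: str
    created: date
    events: Dict[str, date] = field(default_factory=dict)
    legal_holds: Set[str] = field(default_factory=set)


def earliest_disposal(record: Record,
                      policies_by_type: Dict[str, List[RetentionPolicy]]) -> Optional[date]:
    """Return the earliest date the record may be disposed of, or None if it may not be."""
    if record.legal_holds:
        return None  # one or more holds suspend disposition entirely
    expiries = []
    for policy in policies_by_type.get(record.record_type, []):
        if policy.trigger_event is None:
            start = record.created
        else:
            start = record.events.get(policy.trigger_event)
            if start is None:
                return None  # triggering event has not happened yet
        expiries.append(start + timedelta(days=policy.retain_days))
    # Assumption: when several policies apply, the latest expiry governs.
    # A record with no policy is conservatively treated as not disposable.
    return max(expiries) if expiries else None


# Hypothetical policy table: one record type governed by two policies.
POLICIES = {
    "loan_agreement": [
        RetentionPolicy("regulatory", retain_days=365),
        RetentionPolicy("post_closure", retain_days=730, trigger_event="account_closed"),
    ]
}
```

A record of type `loan_agreement` stays undisposable until its `account_closed` event is recorded and every hold in `legal_holds` has been released.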

STRATEGY: RETENTION AND LEGAL HOLDS
Storage-based controls:
  Best place for enforcement of policies
  Allows most efficient control of content
  Controls administered by storage team
Software-based controls:
  Best place for management of policies
  Provides proof of proper management
  Controls administered by records team

REQUIREMENTS: CHAIN OF CUSTODY
From the moment the item is collected, every transfer must be documented, and it must be provable that the item has not been changed.
If there are discrepancies, then the chain of custody is broken and:
  The information has limited (if any) value
  Trust in the results of the analysis will not exist

STRATEGY: CHAIN OF CUSTODY
Treat chain of custody as a process, not a technology.
Enforce at the ingestion point.
Mark object metadata with identifier
Store chain of custody information as records
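One way to picture this strategy is a small ingestion routine that stamps each object with an identifier and keeps custody entries as records. The sketch below is illustrative only: the function names, the field names, and the choice to link entries with SHA-256 hashes are assumptions, not something the slides prescribe.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _entry(custody_id, action, actor, content_sha256, prev_hash):
    """Build one custody record; entry_hash covers every other field."""
    body = {
        "custody_id": custody_id,
        "action": action,
        "actor": actor,
        "content_sha256": content_sha256,
        "prev_hash": prev_hash,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    body["entry_hash"] = _sha256(json.dumps(body, sort_keys=True).encode())
    return body


def ingest(content: bytes, source: str):
    """Enforce at the ingestion point: assign an identifier and open the custody log."""
    custody_id = str(uuid.uuid4())
    digest = _sha256(content)
    metadata = {"custody_id": custody_id, "sha256": digest}  # marked on the object
    log = [_entry(custody_id, "ingest", source, digest, prev_hash="")]
    return metadata, log


def record_transfer(log, content: bytes, actor: str) -> None:
    """Document a transfer by appending a record linked to the previous one."""
    last = log[-1]
    log.append(_entry(last["custody_id"], "transfer", actor,
                      _sha256(content), last["entry_hash"]))


def chain_intact(log, content: bytes) -> bool:
    """True only if no custody record was altered and the content is unchanged."""
    if not log:
        return False
    original = log[0]["content_sha256"]
    prev = ""
    for e in log:
        body = {k: v for k, v in e.items() if k != "entry_hash"}
        if (e["entry_hash"] != _sha256(json.dumps(body, sort_keys=True).encode())
                or e["prev_hash"] != prev
                or e["content_sha256"] != original):
            return False
        prev = e["entry_hash"]
    return _sha256(content) == original
```

Because each entry carries the hash of the one before it, editing or removing an earlier custody record makes `chain_intact` return `False`, which corresponds to a broken chain of custody.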

REQUIREMENTS: SECURITY AND PRIVACY
Ensure that content is not accessed by unauthorized parties
A hierarchy of information exists
Cross-border controls exist
Cross-content controls exist

STRATEGY: SECURITY AND PRIVACY
The scale of a data lake changes everything.
Abandon hope of managing privacy separately from security.
Manage by sets, not by object.
Create homogeneous pools of anonymous information
  Mask
  Metadata
Build sets of homogeneous information
  Abstract
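As a rough illustration of managing by sets and masking, the sketch below applies one masking rule to a whole set of records rather than to individual objects. The field names, the `DIRECT_IDENTIFIERS` list, and the use of a keyed HMAC to replace the identifier are assumptions for the example. Strictly speaking this produces pseudonymous rather than anonymous data, since whoever holds the key can re-link tokens.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List

# Hypothetical list of fields that identify a person directly and are dropped.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def mask_record(record: Dict[str, object], key: bytes,
                id_field: str = "patient_id") -> Dict[str, object]:
    """Drop direct identifiers and replace the ID with a keyed token."""
    masked = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    token = hmac.new(key, str(record[id_field]).encode(), hashlib.sha256).hexdigest()
    masked[id_field] = token[:16]  # same input and key always give the same token
    return masked


def mask_set(records: Iterable[Dict[str, object]], key: bytes) -> List[Dict[str, object]]:
    """Apply one masking policy to an entire set so the pool stays homogeneous."""
    return [mask_record(r, key) for r in records]
```

Records that share a `patient_id` still share a token after masking, so analysis across the pool remains possible without exposing the original identifier.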

REQUIREMENTS: AUDITABILITY
Audit controls at two levels:
  Management and policy functions: protect against improper changes to the system
  Access and use of the information

STRATEGY: AUDITABILITY
Management controls enforced at application layer
Access enforced at storage layer

Bringing it all together

EMC Business Data Lake: SUPPORTING CUSTOMER CHOICE
[Architecture diagram. Labels: Data & Analytics Catalog; Pivotal Cloud Foundry; Data Lake Platform Manager; Big Data Suite (Pivotal HD, GemFire, Greenplum DB, HAWQ); Supported Third Party Platforms; Choice of Hadoop Distribution; Pivotal Big Data Suite; VMware vCloud Suite; Data Governor; EMC II Storage]

EMC Business Data Lake: SUPPORTING CUSTOMER CHOICE
[Architecture diagram. Labels: Data Lake Platform Manager; Compliant data store (cold data lake); EMC II Storage; Data Governor]

EMC's next-generation platform for compliant application data preservation

Compliant data store (cold data lake)
Sources: Structured data; Unstructured data; Applications
Record types: File records; Data records; Compound records
Enterprise-grade integrity controls:
  Retention controls
  Legal holds
  Audit controls
  Chain of custody
Data-lake-ready information architecture:
  Metadata together with content
  Augmented metadata
  Business object granularity and structure

An example
[Diagram. Source systems: Oncology Patient Record System; Laboratory Information System; Hospital A Patient Record System; Hospital B Patient Record System]
[Record types: X-Rays; Treatment records; Progress notes; Immunization records; Prescriptions; Telemetry]
[Compliant data store (cold data lake) on EMC II Storage. Records are: Attributed with unified patient ID; Segmented to discrete record; Structured for reuse]
[Consumers: Predictive health studies; Patient Centric Application]

HEALTHY GROWTH OF THE DATA LAKE
Data Lake Success = ƒ(quantity of rich data)
Access to that data = ƒ(compliant data lake)
Size Matters: The success of big data projects requires access to huge sets of high-quality information. Compliant data represents the largest set of high-quality business information within an enterprise.

LEARN MORE ABOUT INFOARCHIVE
DATE / TIME / TITLE / LOCATION
Everyday: Self-Paced Hands On Lab: IT Transformation By Application Decommissioning InfoArchive (EMC vLabs in the Village)
Wednesday, 1:30 PM to 2:30 PM: Hands On Lab EMC InfoArchive: An Applied Technology Review (Galileo 906)
Wednesday, 3:00 PM to 4:00 PM: Big Data Governance: Balancing Big Data Velocity & Information Governance (Venetian Ballroom A)
Wednesday, 3:00 PM to 4:00 PM: Real Stories EMC InfoArchive - Set Your Data Free! (Galileo 1004)
Thursday, 9:00 AM to 1:00 PM: Hackathon: From the Ground Up - Developing an EMC InfoArchive Archiving Solution (Galileo 1006)
InfoArchive Product Community: //community.emc.com/community/products/infoarchive
EMC Store InfoArchive: //store.emc.com/us/product-family/emc-infoarchive-products/emc-InfoArchive/p/EMC-InfoArchive