Holistic Data Management for Hadoop
Michael Kohs, Senior Sales Consultant (@mikchaos)
The Problem with Big Data Projects in 2016
Data flows from many sources (relational, mainframe; documents and emails; social media, web logs; machine, device, and cloud data) through many roles (data modeler, data scientist, data analyst, data steward, data engineer, business users) into data lakes, DW, DM, NoSQL stores, and analytic and enterprise apps, serving both a laboratory (insights) and a factory (actions).
Pain points:
- Takes too long to get the data
- Can I trust the data?
- We have a lot of sensitive data
- Too many one-off projects
- Hard to build and maintain
- Many compliance requirements
Business goals: increase customer loyalty, improve fraud detection, reduce security risk, improve predictive maintenance, increase operational efficiency.
Criteria for Successful Big Data Projects
Across the same landscape of sources, roles, and targets, success hinges on self-service autonomy for the laboratory (insights) and operational agility for the factory (actions).
Introducing Informatica Big Data Management
Informatica Big Data Management sits between the data sources and the data lakes, DW, DM, and NoSQL targets, adding three layers across the landscape: Big Data Integration, Big Data Governance, and Big Data Security.
The 3 Pillars of Informatica Big Data Management
- Big Data Integration: simple visual environment; optimized execution and flexible deployment; dynamic schemas and templates; 100s of pre-built transforms, connectors, and parsers.
- Big Data Governance: data quality and profiling; 360 relationship views; universal metadata catalog with end-to-end data lineage; business glossary; self-service collaboration tools.
- Big Data Security: sensitive data discovery and classification; proliferation analysis; risk assessment; non-intrusive data masking.
Big Data Integration
Data Warehouse Optimization (Phases 1 and 2)
1. Offload data and ELT processing to Hadoop.
2. Batch-load raw data (e.g. transactions, multi-structured data).
3. Replicate changes and schemas for relational data.
4. Collect and stream real-time machine data.
5. Parse and prepare the data for analysis (e.g. ETL, data quality).
6. Move high-value curated data into the data warehouse, feeding BI reports and apps.
Sources: relational and mainframe systems, documents and emails, social media and web logs, machine, device, and cloud data.
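The parse-and-prepare step (5) can be sketched in plain Python; the field names and cleaning rules below are illustrative assumptions, not part of the Informatica product:

```python
import csv
import io

# Hypothetical raw transaction records landed in Hadoop (step 2).
RAW = """id,amount,country
1, 19.99 ,de
2,,us
3,7.50,DE
"""

def prepare(raw_text):
    """Parse raw records and apply simple data-quality rules:
    trim whitespace, drop rows missing an amount, standardize country codes."""
    rows = csv.DictReader(io.StringIO(raw_text))
    curated = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # data-quality rule: amount is mandatory
            continue
        curated.append({
            "id": int(row["id"]),
            "amount": float(amount),
            "country": row["country"].strip().upper(),
        })
    return curated

curated = prepare(RAW)
# Only the curated, high-value rows would move on to the warehouse (step 6).
```

The incomplete record is dropped during preparation, so only two curated rows survive to be loaded downstream.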
Abstracting Complexity and Protecting Investments
A developer designs one universal mapping (Source → Filter → Sorter → Aggregator → Target). Workload and resource management (YARN) then dispatches it to purpose-built execution engines: the native Informatica engine, map/reduce, or Blaze.
Big Data Quality & Governance
Data Profiling on Hadoop
1. Profiling stats: min/max values, NULLs, data types, etc.
2. Frequency distribution
3. Value drill-down
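As a rough illustration of what such a profile computes (outside of any Informatica tooling; the sample column data is invented):

```python
from collections import Counter

def profile(values):
    """Compute basic column-profiling stats: min/max, NULL count,
    inferred types, and a value frequency distribution."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "nulls": len(values) - len(non_null),
        "types": sorted({type(v).__name__ for v in non_null}),
        "frequency": Counter(non_null),  # drill-down starts from these buckets
    }

stats = profile([3, 1, 4, 1, None, 5])
# stats["min"] == 1, stats["max"] == 5, stats["nulls"] == 1
# stats["frequency"][1] == 2
```

The frequency buckets are what a value drill-down (step 3) would expand into the underlying rows.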
Hadoop Data Domain Discovery: Finding the Functional Meaning of Hadoop Data
1. Leverage Informatica rules/mapplets to identify the functional meaning of Hadoop data: sensitive data (e.g. SSN, credit card number), PHI (Protected Health Information), PII (Personally Identifiable Information), and other liability and compliance risks. The approach scales to discover any domain type.
2. View and share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to suspect data values.
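Conceptually, domain-discovery rules behave like pattern matchers run over column samples; the rules below are simplified stand-ins for real Informatica mapplets (which also use checksums, reference dictionaries, and more):

```python
import re

# Simplified domain rules keyed by domain name (illustrative only).
DOMAIN_RULES = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
}

def discover_domains(column_values, threshold=0.8):
    """Tag a column with every domain whose pattern matches at least
    `threshold` of its non-empty values."""
    values = [v for v in column_values if v]
    found = []
    for domain, pattern in DOMAIN_RULES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            found.append(domain)
    return found

# A column that is mostly SSN-shaped is flagged as sensitive.
print(discover_domains(["123-45-6789", "987-65-4321", ""]))  # → ['SSN']
```

The threshold keeps a few stray matches from mislabeling a column, while tolerating some dirty values in a genuinely sensitive one.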
Governance & Metadata Management
Metadata Manager Architecture
A consolidated metadata catalog, backed by the metadata repository, provides data lineage, a business glossary (with Business Glossary Desktop), third-party BI metadata reports, and metadata bookmarks. Metadata is harvested from mainframe, ERP, database, flat-file, data-modeling, BI-tool, and custom sources.
End-to-End Hadoop Lineage
- Displays lineage information for data loaded into or extracted from Hadoop
- Displays lineage for map/reduce and Blaze jobs generated by Informatica Big Data Management
- Shows transformations
- Connects source to target systems
- Works with all supported distributions
End-to-End Lineage with Informatica and Cloudera
Data source → data prep on Hadoop via Informatica → Hive HQL → target → BI/analytic app.
Demo: https://www.youtube.com/watch?v=rf63wfn8kik
Data Intelligence: Live Data Map
(2015 Informatica. Proprietary and Confidential.)
Live Data Map: Foundation for Data Intelligence
Use cases: data discovery, sensitive-data tracking, stewardship and governance, smart suggestions, exploration, semantic search, relationship discovery.
Live Data Map maintains a knowledge graph of all enterprise data assets (the EIC catalog): relationships, rules, glossary terms, statistics, user ratings, recommendations, and 360-degree views.
It draws on all Informatica repositories; third-party BI, modeling, big data, RDBMS, and application metadata; business glossary and context; and user ratings, feedback, and operational stats.
Big Data Security
Persistent Data Masking
1. Users load data into Hadoop (HDFS, Hive, HBase), masked or unmasked.
2. A security analyst uses Sensitive Data Discovery to scan Hadoop and discover where sensitive data exists.
3. The sensitive data is masked with Persistent Data Masking and moved to analytics or test environments, either within the same Hadoop instance or in a separate one.
4. BI analysts query the persistently masked data (query, reporting, data mining, predictive analytics).
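Persistent masking replaces sensitive values before the data lands in the analytics or test zone. A toy sketch of one common technique, format-preserving substitution via deterministic hashing (my illustration, not the product's actual algorithm):

```python
import hashlib

def mask_ssn(ssn, salt="demo-salt"):
    """Deterministically replace an SSN's digits while preserving the
    NNN-NN-NNNN format, so joins on the masked column still line up."""
    digest = hashlib.sha256((salt + ssn).encode()).hexdigest()
    # Map each hex character to a decimal digit.
    digits = "".join(str(int(c, 16) % 10) for c in digest)
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:9]}"

masked = mask_ssn("123-45-6789")
# Same input always masks to the same value; the original digits are gone.
```

Determinism matters here: rows masked in the analytics zone and in the test zone still join on the masked key, without exposing the real value.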
Dynamic Data Masking
An in-line proxy server delivers a seamless security layer for Hive and Hadoop, providing role-based anonymization and real-time prevention. Private information is stored unchanged in Hadoop (BLAKE, JONES, KING); the Dynamic Data Masking layer applies real-time HQL rewrites (e.g. rewriting a query to select substring(name,1,2) with a '****' suffix) to mask the returned result set. Business-user application screens see the real values, while screens and tools used by production support, DBAs, and outsourced or unauthorized workforce see masked values (BL****, JO****, KI****).
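The effect of such a rewrite is equivalent to the following masking function (my Python rendering of the slide's substring-plus-asterisks example; the role check is a placeholder for the product's rule engine):

```python
def dynamic_mask(value, keep=2, pad="****"):
    """Mask a result-set value for unauthorized roles: keep the first
    `keep` characters and replace the rest with a fixed pad, mirroring
    a rewrite like substring(name, 1, 2) with '****' appended."""
    return value[:keep] + pad

def present(rows, authorized):
    """Return rows unchanged for authorized roles, masked otherwise."""
    return rows if authorized else [dynamic_mask(r) for r in rows]

names = ["BLAKE", "JONES", "KING"]
print(present(names, authorized=False))  # → ['BL****', 'JO****', 'KI****']
print(present(names, authorized=True))   # → ['BLAKE', 'JONES', 'KING']
```

Because the masking happens in the query rewrite, the stored data is never modified; only what each role is shown changes.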
Project Sonoma: The Intelligent Data Lake
A Sample Data Lake Architecture
Data from relational, mainframe, document and email, social media, web log, and machine/device/cloud sources lands in three zones: the landing zone ("swamp"), the discovery zone ("pool"), and the consumption zone ("reservoir").
- IT: integrates systems, operationalizes discovery results, monitors and manages.
- Data scientists: discover and profile data, combine and aggregate it, develop patterns.
- Analysts and business users: consume standardized reports and statistics.
Project Sonoma: The Intelligent Data Lake
- Self-service for analysts: search and discover, prepare and publish.
- Governance for IT: usage tracking and monitoring, lineage and security, operating at scale.
Raw data flows through a self-service data discovery portal and Prepare (Rev) into published data sets for BI and analytics, with Live Data Map providing the metadata layer for IT monitoring and tracking.
Demo: Project Sonoma demo at Informatica World 2015.
Informatica + Hadoop: The Best of Two Worlds
Combines the best of Informatica (20 years of data-integration innovation) with the best of open source (scalable distributed computing).
Pipeline: ingest → prepare → refine → govern → deliver.
Sources: databases, files, servers and mainframe, social and sensor data, ingested via batch, replication, streaming, and archiving. Targets: analytics teams, backend DBs, MDM, batch services, events and topics, analytics and operational dashboards, EDW, mobile apps.
Q&A
Thank you! Visit us at the BARC BI & Big Data Forum, Hall 5, Stand B36.
Big Data Edition Trial Sandbox
- 60-day free trial
- Available for Cloudera 5.0 and Hortonworks HDP 2.1.3
- One-node cluster
- Sample data/mappings, documentation, videos
- Mappings can be transferred and reused
- Download from the Big Data Mall
Resources
- Project Sonoma (Data Lake) demo at Informatica World 2015
- Project Atlantic (Machine Data Parsing) demo at Informatica World 2015
- Informatica solutions for Big Data
- Informatica Big Data Management Editions
- Informatica Big Data Management Editions datasheet
- Big Data Management Deep Dive webinar & demo
- Informatica Blaze executive brief
- Big Data Relationship Manager
- Metadata Management with Cloudera Navigator