MDM for the Modern Data Architecture. September 2014



Purpose of MDM
Create correct and consistent data across the enterprise, fostering trust in information and accelerating growth.

Why it matters
"Without data you're just another person with an opinion." (W. Edwards Deming)

Vicious Cycle of Unmanaged Data
1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations

A Data Architecture Under Pressure
Applications: Business Analytics, Custom Applications, Packaged Applications
Data System: RDBMS, EDW, MPP, Repositories
Existing Sources: CRM, ERP, Clickstream, Logs
Data sources and types shown: OLTP, ERP, CRM; unstructured documents, emails; server logs; sentiment, web data; sensor, machine data; geolocation; clickstream; hierarchical data; transactional data; master data
Data growth (Source: IDC): 2.8 ZB in 2013; 40 ZB by 2020; 85% from new data types; 15x machine data by 2020
Hortonworks Inc. 2014

Broad Spectrum of Benefits Across Industries
Financial Services: new account risk screens; fraud prevention; trading risk; maximize deposit spread; insurance underwriting; accelerate loan processing
Retail: 360 view of the customer; analyze brand sentiment; localized, personalized promotions; website optimization; optimal store layout
Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development
Manufacturing: supplier consolidation; supply chain and logistics; assembly line quality assurance; proactive maintenance; crowdsourced quality assurance
Healthcare: genomic data for medical trials; monitor patient vitals; reduce re-admittance rates; store medical research data; recruit cohorts for pharmaceutical trials
Utilities, Oil & Gas: smart meter stream analysis; slow oil well decline curves; optimize lease bidding; compliance reporting; proactive equipment repair; seismic image processing
Public Sector: analyze public sentiment; protect critical networks; prevent fraud and waste; crowdsource reporting for repairs to infrastructure; fulfill open records requests

Gartner's Nexus of Forces Making Things Worse

Business Benefits of MDM
Today IT data mgmt. pros focus on:
- Eliminating duplicate/orphaned data
- Standardizing and centralizing data/metadata
- Meeting operational SLAs
- Data enrichment
- Data integration and synchronization
Business leaders really care about:
- Increasing revenue
- Decreasing costs
- Increasing operational efficiencies
- Reducing risk
- Improving customer experiences
Use business-value driven KPIs to evangelize MDM benefits:
- Reduction in direct marketing postage costs
- Reduction in average handle time in call center
- Increase in customer self-service for order management, technical support and customer service
- Increase in campaign response rates
- Reduction in customer privacy compliance risk exposure
- Delivering a consistent cross-channel customer experience

How About MDM on a Data Lake?
Benefits of a Hadoop Data Lake:
- Data is ingested in its raw state regardless of format, structure or lack of structure
- Raw data can be used and reused for differing purposes across the enterprise
- Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
- Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled
Challenges to the Data Lake Approach:
- Severe shortage of MapReduce-skilled resources
- Inconsistent skills lead to inconsistent results from code-based solutions
- Nascent technologies require multiple point solutions
- Technologies are not enterprise grade
- Some functionality may not be possible within these frameworks

Key Functions for Master Data Management
ETL & ELT: profiling, reads/writes, transformations; single project for all jobs
Master Key Management: create keys; track changes; maintain matches over time
Data Quality: cleanse data; parsing, correction; geo-spatial analysis
Integration & Matching: grouping; fuzzy match
Web Services Integration: consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats
Process Automation & Operations: job scheduling, monitoring, notifications; central point of control; metadata management
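To make the "cleanse data", "fuzzy match" and "grouping" functions on the slide above concrete, here is a minimal Python sketch. It is an illustration only, not RedPoint's implementation: the function names, the 0.7 threshold, the greedy grouping strategy and the sample records are all assumptions, and the similarity score comes from the standard library's `difflib.SequenceMatcher` rather than any matching engine named in the deck.

```python
from difflib import SequenceMatcher


def cleanse(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the cleansed forms of two names."""
    return SequenceMatcher(None, cleanse(a), cleanse(b)).ratio()


def group_records(names, threshold=0.7):
    """Greedy grouping (hypothetical strategy): a name joins the first group
    whose first member is at least `threshold` similar, else starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical sample records, not taken from the deck.
records = ["Acme Corp.", "ACME Corp", "Acme Corporation", "Globex Inc"]
groups = group_records(records)
for group in groups:
    print(group)
```

Comparing each record only against a group's first member keeps the sketch short; a production matcher would typically block records first and compare on several attributes.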

Data Lake is the Center of Your MDM Strategy
- Ingestion of all data available from any source, format, cadence, structure or non-structure
- ELT and data transformation, refinement, cleansing, completion, validation and standardization
- Geospatial processing and geocoding
- Data profiling, lineage and metadata management
- Identity resolution and persistent keying and entity profile management
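The "identity resolution and persistent keying" item above can be sketched as a small registry that hands out a stable master key and keeps returning it on later loads. This is a simplified illustration under stated assumptions: the class name `MasterKeyRegistry`, the `MK000001` key format and the name-plus-postcode match rule are hypothetical, and matching here is exact after normalization rather than fuzzy.

```python
import itertools


class MasterKeyRegistry:
    """Hypothetical in-memory registry: match key -> persistent master key."""

    def __init__(self):
        self._keys = {}        # normalized match key -> master key
        self._history = {}     # master key -> raw (name, postcode) pairs seen
        self._counter = itertools.count(1)

    @staticmethod
    def match_key(name: str, postcode: str) -> str:
        """Normalize name and postcode so trivial variants collide."""
        norm_name = "".join(ch for ch in name.lower() if ch.isalnum())
        norm_post = postcode.replace(" ", "").upper()
        return f"{norm_name}|{norm_post}"

    def resolve(self, name: str, postcode: str) -> str:
        """Return the existing master key for this identity, or create one."""
        mk = self.match_key(name, postcode)
        if mk not in self._keys:
            self._keys[mk] = f"MK{next(self._counter):06d}"
        master = self._keys[mk]
        # Track every raw variant seen, so changes can be audited over time.
        self._history.setdefault(master, []).append((name, postcode))
        return master

    def history(self, master: str):
        return list(self._history.get(master, []))


# Hypothetical sample identities, not taken from the deck.
registry = MasterKeyRegistry()
k1 = registry.resolve("Jane O'Neil", "sw1a 1aa")
k2 = registry.resolve("JANE ONEIL", "SW1A1AA")
k3 = registry.resolve("John Smith", "SW1A 1AA")
print(k1, k2, k3)
```

In a real deployment the registry would live in durable storage so that keys survive between batch runs; the dictionary here only stands in for that store.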

Data Lake Architecture for MDM
Data Sources: Clickstream, CRM, Online Chat, ERP, Sensor Data, Billing, Subscriber, Product, Social Media, Call Detail Records, Network, Fabrication Logs, Weather, Sales Feedback, Compete, Field Feedback, Manuf. Field Feedback

How Can That Possibly Work?
More MapReduce! YARN!

Overview: What is Hadoop/Hadoop 2.0
Hadoop 1.0:
- All operations based on MapReduce
- Intrinsic inconsistency of code-based solutions
- Highly skilled and expensive resources needed
- 3rd party applications constrained by the need to generate code
Hadoop 2.0:
- Introduction of YARN: a general-purpose, distributed application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters
- Mature applications can now operate directly on Hadoop
- Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop
[Diagram. Labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; Parallel Section; Partition; Data server; YARN; MapReduce]

Reference Hadoop Architecture
[Diagram. Recoverable labels:]
Monitoring and Management Tools: AMBARI
Source data: DBs, Files, JMS Queues, REST, HTTP (sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory)
Load: SQOOP, WebHDFS, NFS
Stream: Flume
Platform: HDFS (1..n), YARN, MAPREDUCE, HCATALOG (metadata services), HIVE, PIG, Hive Server2
Stages: data refinement, structure, interactive
Consumers: Query/Visualization/Reporting/Analytical Tools and Apps
Downstream load: RDBMS and EDW via SQOOP/Hive and WebHDFS
RedPoint Functional Footprint

Benchmarks: Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    public static class MapClass
        extends Mapper<WordOffset, Text, Text, IntWritable> {

      // Characters treated as token separators.
      private final static String delimiters =
          "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\ ± ";
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (token, 1) for every token in the input line.
      public void map(WordOffset key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

Comparison:
MapReduce: >150 lines of MR code; 6 hours of development; 6 minutes runtime; extensive optimization needed
Pig: ~50 lines of script code; 3 hours of development; 15 minutes runtime; User Defined Functions required prior to running script
RedPoint: 0 lines of code; 15 min. of development; 3 minutes runtime; no tuning or optimization required
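For readers who want to see the benchmark task without a cluster, the following Python sketch runs the same word-count logic in a single process: split on delimiters, uppercase, count, order by descending count. It is not part of the benchmark and says nothing about the timings above. The delimiter set is an assumption modelled on the Java fragment, and the sample lines are hypothetical.

```python
from collections import Counter

# Assumed delimiter set, modelled on the Java mapper's delimiter string.
DELIMITERS = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\"


def word_count(lines):
    """Return (word, occurrences) pairs ordered by descending occurrences."""
    # Map every delimiter to a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in DELIMITERS})
    counts = Counter()
    for line in lines:
        counts.update(line.translate(table).upper().split())
    return counts.most_common()


# Hypothetical input, standing in for the Project Gutenberg files.
sample = ["The quick brown fox.", "the lazy dog; the end!"]
result = word_count(sample)
for word, occurrences in result:
    print(occurrences, word)
```

The tokenize, uppercase, group-and-count, and order steps correspond to relations B, C, D/E and F in the Pig script.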

Data Lake Architecture for MDM
Data Sources: CRM, Clickstream, ERP, Online Chat, Billing, Sensor Data, Subscriber, Social Media, Product, Call Detail Records, Network, Fabrication Logs, Weather, Sales Feedback, Compete, Field Feedback, Manuf. Field Feedback