MDM for the Modern Data Architecture September 2014
Purpose of MDM
  Create correct and consistent data across the enterprise that fosters trust in information and acceleration of growth.

Why it matters
  "Without data you're just another person with an opinion." (W. Edwards Deming)

Vicious Cycle of Unmanaged Data
  1. Master Data issues remain unaddressed or unresolved
  2. Garbage in/garbage out creates process confusion
  3. Lack of process trust slows business momentum
  4. Data conflicts reinforce siloed operations

A Data Architecture Under Pressure
  [Diagram; recoverable labels follow]
  Applications: Business Analytics; Custom Applications; Packaged Applications
  Data System: RDBMS; EDW; MPP Repositories
  Existing Sources (CRM, ERP, Clickstream, Logs)
  Data types shown: OLTP, ERP, CRM; transactional data; master data; hierarchical data; unstructured documents, emails; server logs; sentiment, web data; sensor, machine data; geolocation; clickstream
  Data growth (Source: IDC): 2.8 ZB in 2013; 40 ZB by 2020; 85% from new data types; 15x machine data by 2020
  Hortonworks Inc. 2014

Broad Spectrum of Benefits Across Industries
  Financial Services: New account risk screens; Fraud prevention; Trading risk; Maximize deposit spread; Insurance underwriting; Accelerate loan processing
  Retail: 360 view of the customer; Analyze brand sentiment; Localized, personalized promotions; Website optimization; Optimal store layout
  Telecom: Call detail records (CDRs); Infrastructure investment; Next product to buy (NPTB); Real-time bandwidth allocation; New product development
  Manufacturing: Supplier consolidation; Supply chain and logistics; Assembly line quality assurance; Proactive maintenance; Crowdsourced quality assurance
  Healthcare: Genomic data for medical trials; Monitor patient vitals; Reduce re-admittance rates; Store medical research data; Recruit cohorts for pharmaceutical trials
  Utilities, Oil & Gas: Smart meter stream analysis; Slow oil well decline curves; Optimize lease bidding; Compliance reporting; Proactive equipment repair; Seismic image processing
  Public Sector: Analyze public sentiment; Protect critical networks; Prevent fraud and waste; Crowdsource reporting for repairs to infrastructure; Fulfill open records requests

Gartner's Nexus of Forces Making Things Worse

Business Benefits of MDM
  Today IT data mgmt. pros focus on:
    Eliminating duplicate/orphaned data
    Standardizing and centralizing data/metadata
    Meeting operational SLAs
    Data enrichment
    Data integration and synchronization
  Business leaders really care about:
    Increasing revenue
    Decreasing costs
    Increasing operational efficiencies
    Reducing risk
    Improving customer experiences
  Use business-value driven KPIs to evangelize MDM benefits:
    Reduction in direct marketing postage costs
    Reduction in average handle time in call center
    Increase in customer self-service for order management, technical support and customer service
    Increase in campaign response rates
    Reduction in customer privacy compliance risk exposure
    Delivering a consistent cross-channel customer experience

How About MDM on a Data Lake?
  Benefits of a Hadoop Data Lake
    Data is ingested in its raw state regardless of format, structure or lack of structure
    Raw data can be used and reused for differing purposes across the enterprise
    Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
    Master Data can be fed across the enterprise, and deep analytics on clean data is immediately enabled
  Challenges to Data Lake Approach
    Severe shortage of MapReduce-skilled resources
    Inconsistent skills lead to inconsistent results from code-based solutions
    Nascent technologies require multiple point solutions
    Technologies are not enterprise grade
    Some functionality may not be possible within these frameworks

Key Functions for Master Data Management
  ETL & ELT: Profiling, reads/writes, transformations; Single project for all jobs
  Master Key Management: Create keys; Track changes; Maintain matches over time
  Data Quality: Cleanse data; Parsing, correction; Geo-spatial analysis
  Integration & Matching: Grouping; Fuzzy match
  Web Services Integration: Consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats
  Process Automation & Operations: Job scheduling, monitoring, notifications; Central point of control; Meta Data Management

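The "Fuzzy match" function listed on this slide can be illustrated with a small sketch. It assumes a normalised Levenshtein edit distance as the similarity measure, which is a common textbook choice and not necessarily what any particular MDM product uses. The class name FuzzyMatch, the method names and the 0.85 threshold are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration of fuzzy matching; not taken from any MDM product.
final class FuzzyMatch {

    /** Levenshtein edit distance computed with two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Trim, lowercase and collapse runs of whitespace before comparing. */
    public static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Similarity in [0, 1]; 1.0 means identical after normalisation. */
    public static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(x, y) / maxLen;
    }

    /** Two values match when their similarity reaches the (assumed) threshold. */
    public static boolean isMatch(String a, String b, double threshold) {
        return similarity(a, b) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(isMatch("Jon Smith", "John  Smith", 0.85));
        System.out.println(isMatch("Jon Smith", "Jane Doe", 0.85));
    }
}
```

In practice the threshold would be tuned per attribute (names, addresses, emails), and matching is usually restricted to candidate groups rather than run across all record pairs.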
Data Lake is the Center of Your MDM Strategy
  Ingestion of all data available from any source, format, cadence, structure or non-structure
  ELT and data transformation, refinement, cleansing, completion, validation and standardization
  Geospatial processing and geocoding
  Data profiling, lineage and metadata management
  Identity resolution and persistent keying and entity profile management

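As a rough illustration of "identity resolution and persistent keying", the sketch below assigns a stable master key to every record that resolves to the same entity. It assumes an exact match on a normalised name plus postal code, which is a deliberate simplification. The class MasterKeyRegistry, its methods and the match-key rule are all hypothetical, and a real system would persist the registry rather than hold it in memory.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory registry; a real deployment would persist these keys.
final class MasterKeyRegistry {
    private final Map<String, Long> keysByMatchKey = new HashMap<>();
    private long nextKey = 1;

    /** Assumed match rule: lowercased, whitespace-collapsed name plus digits-only postal code. */
    public static String matchKey(String name, String postalCode) {
        String n = name.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        String p = postalCode.replaceAll("\\D", "");
        return n + "|" + p;
    }

    /** Returns the master key already issued for this entity, or issues a new one. */
    public long resolve(String name, String postalCode) {
        return keysByMatchKey.computeIfAbsent(matchKey(name, postalCode), k -> nextKey++);
    }

    public static void main(String[] args) {
        MasterKeyRegistry registry = new MasterKeyRegistry();
        long first = registry.resolve("Ada Lovelace", "94105");
        long again = registry.resolve("  ada   LOVELACE ", " 94105");
        long other = registry.resolve("Alan Turing", "94105");
        System.out.println(first == again); // same entity, same key
        System.out.println(first == other); // different entity, different key
    }
}
```

The point of the design is that the key stays the same however many source records later resolve to that entity, so downstream systems can join on it over time.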
Data Lake Architecture for MDM
  Data Sources: Clickstream; CRM; Online Chat; ERP; Sensor Data; Billing; Subscriber; Product; Social Media; Call Detail Records; Network; Fabrication Logs; Weather; Sales Feedback; Compete; Field Feedback; Manuf. Field Feedback

How Can That Possibly Work?
  More MapReduce!
  YARN!

Overview: What is Hadoop/Hadoop 2.0
  Hadoop 1.0
    All operations based on MapReduce
    Intrinsic inconsistency of code-based solutions
    Highly skilled and expensive resources needed
    3rd party applications constrained by the need to generate code
  Hadoop 2.0
    Introduction of YARN: a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters
    Mature applications can now operate directly on Hadoop
    Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop
  [Diagram; recoverable labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; Parallel Section; Partition; Data server; YARN; MapReduce]

Reference Hadoop Architecture
  [Diagram; recoverable labels follow]
  Monitoring and Management Tools: AMBARI
  Source data: DBs; Files; JMS Queues; REST; HTTP; STREAM
  Data Sources: Sensor Logs; Clickstream; Flat Files; Unstructured; Sentiment; Customer; Inventory
  Load: SQOOP; WebHDFS; NFS; Flume
  Interactive / Data Refinement / Structure: HIVE Server2; HIVE; PIG; MAPREDUCE
  HCATALOG (metadata services); YARN; HDFS (nodes 1 to n)
  Load to RDBMS and EDW: SQOOP/Hive; WebHDFS
  Query/Visualization/Reporting/Analytical Tools and Apps
  RedPoint Functional Footprint

Benchmarks: Project Gutenberg

  Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    public static class MapClass
        extends Mapper<WordOffset, Text, Text, IntWritable> {
      // Characters treated as token separators
      private final static String delimiters =
          "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\ ± ";
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input line
      public void map(WordOffset key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

  Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

                 MapReduce                      Pig                                                      RedPoint
  Code           >150 lines of MR code          ~50 lines of script code                                 0 lines of code
  Development    6 hours                        3 hours                                                  15 min.
  Runtime        6 minutes                      15 minutes                                               3 minutes
  Notes          Extensive optimization needed  User Defined Functions required prior to running script  No tuning or optimization required

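For readers who want to see what the benchmark computes without a cluster, here is a minimal single-machine sketch of the same steps the Pig script on this slide performs: tokenize, uppercase, group, count, order by count descending. It is not the benchmarked code. The class name LocalWordCount and the shortened delimiter set are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-JVM word count; mirrors the logic, not the scale, of the benchmark.
final class LocalWordCount {
    // Simplified delimiter set (an assumption; the slide's mapper uses a longer list).
    private static final String DELIMITERS = " \t\n\r.,;:!?\"'()[]{}";

    /** Returns (word, occurrences) pairs ordered by occurrences descending, ties alphabetical. */
    public static List<Map.Entry<String, Long>> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys give alphabetical tie order
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> ordered = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the TreeMap's alphabetical order.
        ordered.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("The quick brown fox");
        lines.add("the lazy dog. The end");
        for (Map.Entry<String, Long> e : count(lines)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

The MapReduce and Pig versions distribute exactly this grouping and counting across the cluster, which is where the development and tuning effort compared on the slide comes from.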
Data Lake Architecture for MDM
  Data Sources: CRM; Clickstream; ERP; Online Chat; Billing; Sensor Data; Subscriber; Social Media; Product; Call Detail Records; Network; Fabrication Logs; Weather; Sales Feedback; Compete; Field Feedback; Manuf. Field Feedback