Hadoop Data Hubs and BI. Supporting the migration from siloed reporting and BI to centralized services with Hadoop

John Allen, October 2014

Introduction
John Allen; computer scientist
Background in data analytics, distributed systems, enterprise architecture and hacking things at home.
Consultancy, start-ups and big businesses; government, gaming, telco and finance
John.howard.allen@gmail.com

Objectives
Outline the benefits of Hadoop for data staging (Data Lake)
Explain why Hadoop is an effective part of an analytics service (Data Lake)
Talk a little about how Hadoop can play a role in transforming an organization's data architecture (Enterprise Data Hub)

Disclaimer
Conceptually we're not trying to do anything different; the goals of data management and business intelligence remain the same.
What's different is the trade-offs and compromises we can now make, and the cost of managing change.
Topics in play:
  Data Standards
  ELT or ETL
  Data masking
  Data governance
  Cloud or Private
  Data Profiling
  Scale up or Scale out
  Early or Late Optimization
  Commodity or Specialist
  Metadata Management
  Master Data Management
  Model Driven
  Data Quality
  Schema on Write
  Schema on Need

What is new
Unrivalled volume of data
Increased variety of sources and formats
Need for improved approach to change
  Agility
  Cost
  Flexibility
  Improved ROI
Demand for increased data-led decision making
Demand for predictive, not just descriptive
User expectations around
  Timeliness
  Accuracy
  Ease of use
Concerns: Technology Centric
Needs: Business Centric

Changes in Data Management
2010: Hadoop was a build-it-yourself batch processing system
2014: Hadoop is an off-the-shelf, extensible data processing platform with a range of streaming, in-memory and batch processing
80% of all Informatica products to run natively on Hadoop (Ori Lev Ran, Senior Director)
Big Data and Hadoop: more than just volume and cost
  Analysts recognise that agility and flexibility are key components of the BigData story (schema on read, data model flexibility) (Gartner, 2014)
Hadoop 2 and YARN: game-changing data-processing platform
  Vendors seamlessly deploy their own applications into our cluster and data
  Run machine learning, streaming, ETL applications and batch jobs all on the same platform

Changes in Data Integration
80% of Informatica suite will run natively on Hadoop by 2014 (Ori Lev Ran, Sr. Director, Strategic Business Development)
Vendors promoting BigData as compute platforms
Data analysis now possible without traditional ETL (NoETL, "schema on read")
Industry analysts recognising value of BigData for structured and unstructured data
Industry analysts talk about the rise of Data Hubs serving the Logical Data Warehouse
NoETL, Data Hubs and semantic data services are a key component of DI in 2014 (2013, Ted Friedman, VP Distinguished Analyst, Data Integration)
Analysts recommend building a Logical Data Warehouse using a blend of traditional and BigData technologies
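Speaker note: "schema on read" can be illustrated with a minimal Python sketch. This is not how any particular Hadoop tool implements it; the sample records, the `SALES_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names chosen for illustration. The idea shown is that raw records are stored untouched and each consumer applies only the schema it needs when it reads.

```python
import json

# Raw events land in the hub exactly as the source emitted them (no upfront ETL).
# These sample records are made up for illustration.
RAW_LINES = [
    '{"user": "a1", "amount": "12.50", "channel": "web"}',
    '{"user": "b2", "amount": "7", "extra_field": true}',
    '{"user": "c3"}',
]

# Hypothetical schema chosen by one consumer at read time: field -> (type, default).
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema the reader needs."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default; unknown fields are ignored.
            row[field] = cast(value) if value is not None else default
        yield row


rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))
total = sum(row["amount"] for row in rows)
```

A second consumer could read the same raw lines with a different schema (for example one that keeps `channel`) without any change to what was stored.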

Data Data Everywhere
Data Management
  Heterogeneous IT
  Complex integration (P2P)
  Multiple LOB systems, ODS, warehouses
  Range of models and standards
Usage and Analytics
  Slow BI responsiveness
  Range of tooling and approaches
  Departmental approaches
Driver: Rapid Growth / Acquisition
Driver: Traditional IT Models

Data Hub - Analytics Enablement
Data Management
  Centralized data platform
  Strategic data sourcing
  Common data interface
  Enhanced retention & fidelity
  Increased compute (MPP)
  Reduce cost of ETL
Usage and Analytics
  Centralized analytics
  Self-service BI
  Advanced data analysis
Benefit: Lower Complexity
Benefit: Improved TTM
Benefit: Improved Insight

Data Hub - Reporting Consolidation
Data Management
  Data standards enabled
  De-duplication and consolidation
  Reporting focused
Usage and Analytics
  Migration of ad-hoc reporting (Excel)
  Report consolidation
  Increased automation of reports
Benefit: Reduced Cost
Benefit: Improved Accuracy
Benefit: Improved Timeliness

Data Hub - Data Bus Integration
Data Management
  Removal of P2P
  Pub/Sub communication
  Distribution of data products
Usage and Analytics
  Service oriented
  Event based
Benefit: Reduced Cost
Benefit: Reduced Duplication
Benefit: Improved Timeliness
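Speaker note: the case for replacing P2P integration with Pub/Sub can be sketched in a few lines of Python. The interface counts below assume the worst case where every system talks to every other system, and the `DataBus` class is a toy in-memory stand-in invented for this example, not a real messaging product.

```python
from collections import defaultdict


def p2p_links(systems):
    """Worst-case number of point-to-point interfaces between n systems."""
    return systems * (systems - 1) // 2


def hub_links(systems):
    """With a shared bus, each system keeps a single connection to it."""
    return systems


class DataBus:
    """Toy in-memory publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every handler registered for the topic
        # and report how many handlers received it.
        for handler in self._subscribers[topic]:
            handler(event)
        return len(self._subscribers[topic])


bus = DataBus()
reporting, billing = [], []
bus.subscribe("customer.updated", reporting.append)
bus.subscribe("customer.updated", billing.append)
delivered = bus.publish("customer.updated", {"id": 42, "status": "active"})
```

Under these assumptions, ten systems need up to 45 point-to-point interfaces but only 10 connections to a bus, and the publisher does not need to know who consumes its events.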

Functional View
Discover
  Find data by taxonomy, metadata, keys
  Self-service, data entitlement gateway
On-boarding
  Dedicated managed service
  All sources and data types
Process / Transform
  Batch and on-demand transform
  Enrichment and heavy lifting
Store
  Polystructure, elastic, fault-tolerant, MPP
  Secure, governed, many interfaces
Access and Entitlement
  Controlled audited access
  SQL, file, API, web interfaces
Manage
  ITIL services
Analysis
  Descriptive, reactive and predictive
  Self-service and business led
Visualisation
  Dashboard development
  Interactive data exploration
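Speaker note: the Discover and Access and Entitlement functions can be pictured with a small Python sketch. Everything here is hypothetical (the `Dataset` and `Catalog` classes, the dataset names, the roles); a real hub would rely on its platform's own catalog and security services. The sketch only shows the shape of the idea: find data by taxonomy and metadata, then gate access through an entitlement check that records every attempt.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    taxonomy: str                                  # e.g. "finance/billing"
    tags: set = field(default_factory=set)         # free-form metadata tags
    allowed_roles: set = field(default_factory=set)


class Catalog:
    """Hypothetical catalog: discovery plus an audited entitlement gateway."""

    def __init__(self, datasets):
        self._datasets = {d.name: d for d in datasets}
        self.audit_log = []

    def discover(self, taxonomy_prefix="", tag=None):
        """Find dataset names by taxonomy prefix and optional metadata tag."""
        return sorted(
            d.name
            for d in self._datasets.values()
            if d.taxonomy.startswith(taxonomy_prefix)
            and (tag is None or tag in d.tags)
        )

    def request_access(self, user, role, name):
        """Check entitlement and record every attempt, granted or not."""
        granted = role in self._datasets[name].allowed_roles
        self.audit_log.append((user, name, granted))
        return granted


catalog = Catalog([
    Dataset("billing_invoices", "finance/billing", {"pii"}, {"finance_analyst"}),
    Dataset("gl_balances", "finance/ledger", set(), {"finance_analyst", "auditor"}),
    Dataset("web_clicks", "marketing/web", {"pii"}, {"marketer"}),
])
found = catalog.discover("finance", tag="pii")
granted = catalog.request_access("alice", "auditor", "billing_invoices")
```

Keeping discovery separate from entitlement lets users see that a dataset exists before they are cleared to read it, which is one possible reading of the "self-service, data entitlement gateway" item.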

Putting it all Together

Next Steps
Improved Data Discovery and Data Cleansing
  Trifacta, Paxata, Established Players
Improved On-Cluster Complex Analytics
  Actian (KNIME), RapidMiner (Radoop), SAS
Improved Unstructured Support and Search
  Squirro
Improved SQL and (H)OLAP support
  Splice Machine, Impala 2.0
Improved Metadata Management
  Informatica, Navigator, Others

Challenges
Feed Management and Reconciliation
Metadata Management and Lineage
Globalized Data Management
  Sovereign data, Regulatory constraints
Security (e.g. Row/Cell-Level)
Avoiding Silos and Vendor Lock-in
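Speaker note: feed reconciliation is the most mechanical of these challenges, so a minimal Python sketch may help. It assumes a feed can be treated as a list of text rows and that row order does not matter; the `feed_fingerprint` and `reconcile` helpers are hypothetical names, not part of any Hadoop tooling. The idea is to compare a row count and a content digest between what the source sent and what landed in the hub.

```python
import hashlib


def feed_fingerprint(rows):
    """Return (row_count, digest) for a feed, independent of row order."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot merge
    return len(rows), digest.hexdigest()


def reconcile(source_rows, landed_rows):
    """Compare what the source sent with what landed in the hub."""
    src_count, src_digest = feed_fingerprint(source_rows)
    dst_count, dst_digest = feed_fingerprint(landed_rows)
    if src_count != dst_count:
        return "count_mismatch"
    if src_digest != dst_digest:
        return "content_mismatch"
    return "ok"


status = reconcile(["a,1", "b,2"], ["b,2", "a,1"])
```

In practice the source side would usually ship its count and digest in a manifest alongside the feed, so the hub never has to re-read the source system.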

Thank You