More Data in Less Time
Leveraging Cloudera CDH as an Operational Data Store
Daniel Tydecks, Systems Engineering DACH & CE

Goals of an Operational Data Store
  Ingest Data
  Prepare Data
  Store Data
[Diagram: Traditional Architecture. Data Sources (Structured, Unstructured); Operational Data Store (Ingest, Storage #1, Storage #2, ... Storage N, ELT, Archive); Enterprise Data Warehouse (ETL, Load, Process); Applications (BI System, Modeling, Reporting), reached via Serve.]

Challenges with a Traditional Architecture
[Diagram: the Traditional Architecture (Data Sources, Operational Data Store, Enterprise Data Warehouse, Applications), with callouts 1 to 3 marking the challenges below.]
1) Limited Data Ingest
     Unstructured Data Challenge
     Data Silos Limit Data Collection
2) Inefficient Data Processing
     Resource Intensive ELT
     Transforming Unstructured Data
     Meeting SLAs
3) Data Archived
     Decreased Data Returns
     Archive is Offline
     Data Deleted

A New Way Forward
[Diagram: Modern Architecture. Data Sources (Structured, Unstructured); Operational Data Store on the EDH (Ingest, ETL, ELT, Archive, Active Structured Data); Load into the Enterprise Data Warehouse; Serve to Applications (BI System, Modeling, Reporting). Callouts 1 to 3 mark the points below.]
1) Ingest More Data
     Collect Any Data Volume
     Collect Data in Full Fidelity
     Diverse Data
2) Optimize Data Processing
     ELT Offload
     Parallel Processing
     Scalable Storage
3) Automated Secure Archive
     Historic Data Access
     Cost Effective Data Storage
     Compliance-Ready

Customer Spotlight
Challenge: The traditional system could not process omni-channel data fast enough. This limited customers to monthly reports, forced decisions to be made with stale data, and led to a poor consumer experience due to latency.
Solution: Cloudera provided a landing zone where Experian could process and store large amounts of disparate data at scale.
Benefit:
  Process 28K records per second
  Process data 50X faster
  Increase consumer report frequency from monthly to weekly
"We needed to leap forward in our processing ability. We wanted to process data orders of magnitude faster so we could react to tomorrow's consumer."
  - Jeff Hassemer, VP of Product Strategy

How Cloudera Helps
1. Scalable Storage & Ingest
2. ETL Tool Integration
3. Data Modeling
4. Parallel Processing
5. Data Security & Governance
6. High Availability Administration
[Diagram: Cloudera's Enterprise Data Hub. Workloads: Batch, Stream, Analytic SQL, Search Engine, Machine Learning, 3rd Party Apps. Storage for any type of data (Filesystem, Online NoSQL): unified, elastic, resilient, secure.]
Companies that are more data driven are 5 percent more productive and 6 percent more profitable than other companies.

Store and Ingest More Data
Data Storage
  Store any volume or type of data in full fidelity
  Storage for Replay
Data Ingestion
  Easily integrate data from existing systems (relational, EDW, NoSQL, etc.)
  Quickly ingest multiple data types (schema on read vs schema on write)
"The NetApp Open Solution for Hadoop system offers us the scalability and flexibility we need to effectively support our growing client base and rapidly expanding data stores."
  - Marty Mayer, Director of Customer Tools

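The "schema on read vs schema on write" point on this slide can be illustrated with a minimal sketch. It is not Cloudera code: the event fields and the read_with_schema helper are hypothetical. It only shows the idea that raw records are stored unchanged and a schema is applied when they are read.

```python
import json

# Raw events are stored exactly as they arrived (full fidelity).
# The field names below are hypothetical.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "7", "note": "no channel field"}',
]

def read_with_schema(raw_lines, schema):
    """Schema on read: apply field names and types only when the data is queried."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record become None instead of failing the load.
        yield {field: cast(record[field]) if field in record else None
               for field, cast in schema.items()}

# Two consumers read the same raw data with different schemas.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
marketing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "channel": str}))
```

With schema on write, the second record would have to be rejected or reshaped at ingest time. Here it is kept as it arrived, and each reader decides which fields it needs.
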
Integrate with Existing Tools
ETL Partners
  Integrate with ETL tools to complement existing investments and skills

Model Structured & Unstructured Data Faster
Data Management
  Use lineage to discover, track, and validate new and old data to ensure proper use
Analytic SQL
  Quickly discover patterns in new data to facilitate large scale processing

Parallel Process Data Volumes
Batch Processing
  Fault-tolerant processing of large volumes of diverse data
Stream Processing
  Process data as it's made available
"The Orbitz Worldwide sites process millions of searches and transactions every day... Hadoop was selected to provide a solution to the problem of long-term storage and processing"
  - Jonathan Seifman, Lead Engineer for the Intelligent Marketplace Team

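The difference between the two processing styles named on this slide can be sketched in a few lines. This is a single-process illustration only, so it shows neither the parallelism nor the fault tolerance of a Hadoop cluster. The event data and function names are hypothetical.

```python
from collections import Counter

def batch_count(events):
    """Batch: the full data set is available before processing starts."""
    return Counter(event["type"] for event in events)

def stream_count(event_source):
    """Stream: update the result as each event arrives and emit running totals."""
    totals = Counter()
    for event in event_source:
        totals[event["type"]] += 1
        yield dict(totals)

# Hypothetical events, loosely modeled on searches and transactions.
EVENTS = [{"type": "search"}, {"type": "booking"}, {"type": "search"}]

batch_result = batch_count(EVENTS)
stream_results = list(stream_count(iter(EVENTS)))
```

Both styles arrive at the same final totals. The streaming version also exposes an intermediate result after every event, which is what allows data to be processed as it is made available.
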
Protect and Govern Your Data
Enterprise Security & Governance
  End-to-end protection with integrated authentication, role-based authorization, encryption, key management, audit, and lineage
  Native platform solution ensures unified data management for easy reporting and discovery of data
  Compliance-ready to meet stringent regulatory requirements, out-of-the-box
"We selected Cloudera because of its short deployment time and breadth of mission-critical features, which satisfy the strict security and reliability requirements of our business."
  - Stefan Apitz, VP of Operations

Manage Overall System Performance
High Availability Administration
  Simple, centralized system view from ingest to analysis
  Supports mission critical workloads with necessary enterprise features (BDR, Proactive Support, Security)
  Zero downtime rolling upgrades
  Natively deploy and manage ETL tools
"Cloudera Enterprise gives our operations team the confidence that we are ahead of the curve in terms of keeping our cluster running with peak performance."
  - Nick Halstead, Founder

Keep Services Running
Focus on the solution, not the cluster, with the only complete, zero-downtime administration tool for Apache Hadoop.
"Cloudera Enterprise gives our operations team the confidence that we are ahead of the curve in terms of keeping our cluster running with peak performance."
  - Nick Halstead, Founder
Unique Capabilities:
  Unified configuration, management and monitoring across all services
  Online installation and upgrades
  Direct connection to Cloudera Support
  3rd Party Extensibility

Traditional vs Modern Architectures
[Diagram: side-by-side comparison. Traditional Architecture: Data Sources (Structured, Unstructured); Operational Data Store (Ingest, Storage #1, Storage #2, ... Storage N, ELT, Archive); Enterprise Data Warehouse (ETL, Load, Process); Applications (BI System, Modeling, Reporting). Modern Architecture: Data Sources (Structured, Unstructured); Operational Data Store on the EDH (Ingest, ETL, ELT, Archive, Active Structured Data); Enterprise Data Warehouse (Load); Applications (BI System, Modeling, Reporting).]
  Ingest More Data
  Optimize Data Processing
  Automated Secure Archive

The Road to Success
Administrator Training
  Configure, install, and monitor clusters for optimal performance
  Implement security measures and multi-user functionality
Security Integration
  Audit architecture in light of security policies and best practices
  Implement custom security to authenticate users, admins, and apps
Data Analyst Training
  Apply SQL to much larger data sets with Impala, Hive, and Pig
  Master advanced techniques that boost Hadoop accessibility
ETL Ingestion Pilot
  Reference implementation to 3 sources, 5 transforms, 1 target
  Create, execute, test, and review a custom ingestion/ETL plan

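To make the shape of the ETL Ingestion Pilot concrete (3 sources, 5 transforms, 1 target), here is a minimal sketch. The sources, transforms, and sample rows are all hypothetical and held in memory. An actual pilot would read from real systems and write to the cluster.

```python
from functools import reduce

# Three hypothetical sources, each yielding raw rows.
def crm_source():
    yield {"id": " 1", "name": "Ada ", "country": "de"}

def web_source():
    yield {"id": "2", "name": "grace", "country": "AT"}

def billing_source():
    yield {"id": "3", "name": "", "country": "ch"}

SOURCES = [crm_source, web_source, billing_source]

# Five hypothetical transforms, applied in order to every row.
def strip_whitespace(row):
    return {key: value.strip() for key, value in row.items()}

def cast_id(row):
    return {**row, "id": int(row["id"])}

def upper_country(row):
    return {**row, "country": row["country"].upper()}

def title_name(row):
    return {**row, "name": row["name"].title()}

def flag_missing_name(row):
    return {**row, "valid": bool(row["name"])}

TRANSFORMS = [strip_whitespace, cast_id, upper_country, title_name, flag_missing_name]

def run_pipeline(sources, transforms, target):
    """Pull every row from every source, apply all transforms, append to the target."""
    for source in sources:
        for row in source():
            target.append(reduce(lambda current, step: step(current), transforms, row))
    return target

# One target: a plain list standing in for the destination table.
target_table = run_pipeline(SOURCES, TRANSFORMS, [])
```

Keeping sources and transforms as plain lists makes the plan easy to review and test step by step, which matches the "create, execute, test, and review" wording of the pilot.
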
Disrupt the Industry, Not Your Business
  Implement Full Governance, Privacy, and Compliance
  Enable Big Data Processing and Applications Development
  Activate All Your Data in One Place
  Align Systems, Operations, & Strategy to Best-in-Class
[Chart: Proposed Evolution of Cloudera Enterprise Deployment, plotted against Estimated Data in Production.]
Proposed Services Timeline
  Administrator Training          4 Days
  Cluster Setup & Certification   1 Week
  Security Integration            1-2 Weeks
  Data Analyst Training           3 Days
  ETL Ingestion Pilot             2 Weeks

Thank you.

Why Cloudera?
Enterprise-Grade Hadoop
  Differentiated performance, security, management, and governance.
Expertise
  No one knows Hadoop better than Cloudera.
Enablement
  Support, Training, and Professional Services enable and deliver success.
Ecosystem
  Cloudera ensures that Hadoop works with the platforms, tools, and integrators you rely on.
Sustainable Innovation
  Our hybrid open source model delivers the benefits of open source and what the enterprise requires, while enabling us to invest in the future for our customers.

The Most Complete Ecosystem
More than 1,200 partners ensure compatibility with existing investments, lower skill barriers, and help maximize value from your data.
[Diagram: Enterprise Data Hub (Process, Discover, Model, Serve; Security and Administration; Unlimited Storage) with partner categories: Applications, Data Systems, System Integration, Operational Tools, Infrastructure.]

The Journey to a Data Strategy
Operational Efficiency
  Optimize your architecture. (IT)
New Business Value
  Discover the value in your data. (analysts and data scientists)
  Empower users directly. (everyone)
[Diagram: Process, Discover, Model, Serve; Security and Administration; Unlimited Storage.]