Hadoop Trends and Practical Use Cases
John Howey, Cloudera, jhowey@cloudera.com
Kevin Lewis, Cloudera, klewis@cloudera.com
April 2014

Agenda
- Hadoop Overview
- Latest Trends in Hadoop
- Enterprise Ready
- Beyond Batch
- Use Cases
- Skill Sets Needed for Hadoop

Explosive Data Growth
- 1.8 trillion gigabytes of data were created in 2011
- More than 90% is unstructured data
- Approx. 500 quadrillion files
- Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data. Source: IDC 2011]

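
The "doubles every 2 years" figure can be turned into a quick projection. The following is a minimal sketch, assuming smooth exponential growth from the 2011 volume quoted on the slide; the function name and its default parameters are illustrative and do not come from IDC.

```python
def projected_gigabytes(year, base_year=2011, base_gb=1.8e12, doubling_years=2.0):
    """Project data volume, assuming it doubles every `doubling_years` years.

    base_gb is the slide's 2011 figure (1.8 trillion GB); the smooth
    exponential model is an assumption made for illustration only.
    """
    periods = (year - base_year) / doubling_years
    return base_gb * 2 ** periods


if __name__ == "__main__":
    for y in (2011, 2013, 2015):
        print(y, f"{projected_gigabytes(y):.2e} GB")
```

Under this assumption, two doubling periods separate 2011 from 2015, so the projected 2015 volume is four times the 2011 volume.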
The Big Data Challenge
Big data contains limitless insights, BUT its volume, variety, velocity, and value demanded a new approach.
Data sources: operational data, web logs, digital content, files, social media, smart grids, transactional data, ad impressions, R&D data

Common Legacy Data Architecture
- Offline data can't be analyzed easily
- Can't explore original high-fidelity data
- Moving data to compute doesn't scale
[Diagram components: data sources, data collection, storage-only grid (original raw data), ETL compute grid, RDBMS (aggregated data), BI reports & interactive apps, tape archive]

Expanding Data Requires a New Approach
- 1980s: bring data to compute. Process-centric businesses use structured data mainly, internal data only, important data only.
- Now: bring compute to data. Information-centric businesses use all data: multi-structured, internal & external data of all types.
[Diagram: relative size & complexity of compute and data]

Why Use Hadoop
Move beyond rigid legacy frameworks.
Hadoop handles any data type, in any quantity.
- Structured, unstructured
- Schema, no schema
- High volume, low volume
- All kinds of analytic applications
Hadoop grows with your business.
- Proven at petabyte scale
- Capacity and performance grow simultaneously
- Leverages commodity hardware to mitigate costs
Hadoop is 100% Apache licensed and open source.
- No vendor lock-in
- Community development
- Rich ecosystem of related projects
Hadoop helps you derive the complete value of all your data.
- Drives revenue by extracting value from data that was previously out of reach
- Controls costs by storing data more affordably than any other platform

Why Hadoop Was Created
Exploding data volumes & types are driving the need for a flexible, scalable solution.
- New opportunities: derive value from all your data. Extract more value from more data, more cost effectively, with greater flexibility.
- Hard problems: it's difficult to handle data this diverse, at this scale. Traditional platforms can't keep pace.
- Big data: any kind, from any source, structured & unstructured, at scale
- Deep analysis: exhaustive & detailed, sophisticated algorithms, generate results quickly
Data sources: digital content, files, social media, web logs, smart grids, operational data, ad impressions, transactional data, R&D data

What is Apache Hadoop?
Apache Hadoop is an open source distributed computing platform for data storage and processing that is:
- Scalable: no limits
- Fault tolerant: failures are expected
- Distributed: utilizes many computers/cores in parallel
Think of a large computer built out of many smaller computers.
Core Hadoop system components:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
- MapReduce: distributed computing framework
- Resource management (YARN): a framework for job scheduling and cluster resource management
Has the flexibility to store and mine any type of data:
- Ask questions across structured and unstructured data that were previously impossible to ask or solve
- Not bound by a single schema or storage format
Excels at processing complex data:
- Scale-out architecture divides workloads across multiple nodes
- Flexible file system eliminates ETL bottlenecks
Scales economically:
- Can be deployed on commodity hardware
- Open source platform guards against vendor lock-in

Core Hadoop: HDFS (Hadoop Distributed File System)
- Based on GFS
- Distributed, fault-tolerant filesystem
- No RAID needed; JBOD (just a bunch of disks) is used
- Primarily designed for cost and scale
- Works on commodity hardware
- 20 PB / 4000-node cluster at Facebook
- Stores any format of data (text, structured, binary)
- Can copy to and from; can even use NFS mounts

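
HDFS achieves fault tolerance without RAID by splitting files into blocks and keeping several copies of each block on different nodes. The following is a rough sketch of the resulting storage arithmetic, assuming a 128 MB block size and a replication factor of 3. These are common defaults that are not stated on the slide, and both are configurable per cluster. The function name is hypothetical.

```python
MB = 1024 * 1024


def hdfs_footprint(file_bytes, block_bytes=128 * MB, replication=3):
    """Return (block_count, raw_bytes_stored) for a single file.

    block_bytes and replication are assumed defaults, not values
    taken from the slide.
    """
    # Ceiling division: a partial final block still counts as a block.
    block_count = -(-file_bytes // block_bytes)
    # The final block is not padded, so raw usage is size times replicas.
    raw_bytes = file_bytes * replication
    return block_count, raw_bytes


if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1024 * MB)
    print(f"1 GB file: {blocks} blocks, {raw // MB} MB of raw disk")
```

The trade-off is that raw capacity is consumed several times over in exchange for surviving disk and node failures on commodity hardware.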
Core Hadoop: MapReduce
- Distributed, fault-tolerant data processing mechanism
- Primarily designed for batch mode
- Designed around functional programming
- Developer doesn't have to worry about typical issues with distributed programming
- Distributed parallel execution close to the data means exceptional performance

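
The functional style behind MapReduce can be illustrated with the classic word count. This is a single-process sketch of the map, shuffle, and reduce phases in plain Python, not Hadoop's actual API; all function names are illustrative. In a real cluster the framework runs many mappers and reducers in parallel and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insights"]))
```

Because the mapper and reducer are pure functions of their inputs, the framework can rerun them on another node after a failure, which is why the developer does not have to handle those cases explicitly.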
Core Hadoop: YARN
Enterprise Workload Management
Capabilities:
- Multiple engines
- Better scalability
- Workload management
- Shared resources
- Fine-grained scheduling
- Workload isolation
Benefits:
- Mixed usage platform
- Enables workload SLAs
- Group-based policies

Deploying Hadoop on Your Own
- Select components based on use case
- Manage component versions & interoperability
- Deployment & configuration of services
- Ongoing configuration & management
- Support & meeting SLAs
- Ensuring repeatable success
[Diagram axis: time-to-value and risk]

Cloudera is Leading the Way in Data Management Powered by Hadoop
Timeline, upper row (years shown: 2008, 2009, 2011, 2012, 2013, 2014):
- Cloudera founded by Mike Olson, Amr Awadallah & Jeff Hammerbacher
- Cloudera releases CDH, the first commercial Apache Hadoop distribution
- Cloudera reaches 100 production customers
- Cloudera Enterprise 4: the standard for Hadoop in the enterprise
- Cloudera Impala, Cloudera Navigator, Cloudera Search
- The Enterprise Data Hub launched
Timeline, lower row (years shown: 2009, 2010, 2011, 2012, 2013):
- Hadoop creator Doug Cutting joins Cloudera
- Cloudera Manager: first management application for Hadoop
- Cloudera University expands to 140 countries
- Cloudera Connect reaches 300 partners
- Tom Reilly joins as CEO
- Over 800 partners in Cloudera Connect
Product labels: CDH, Cloudera Manager, Cloudera Enterprise 4, Ask Bigger Questions, Enterprise Data Hub

Cloudera: the Leader in Data Management Powered by Apache Hadoop
- Founded: 2008, by former employees of [company logos not captured]
- Employees: over 500
- Global 24x7 support: follow-the-sun capability; proactive & predictive support programs; dedicated support engineers; support centers in NA, Europe & Asia
- Professional services: world-class services delivery teams worldwide
- Mission critical: thousands of enterprise customers rely on Cloudera; 50% of the Fortune 50; 65% of the Fortune 500; top defense & intelligence agencies
- The largest ecosystem: over 800 members of our partner program, Cloudera Connect
- Cloudera University: over 40,000 people trained around the world
- Open source leaders: Cloudera employees are founders of most of the Apache Hadoop ecosystem projects, and leading contributors to all of them, providing 60% of the solutions to JIRAs
- The leading open source distribution of Apache Hadoop
- Powerful suite of system & data management software built for the enterprise

Hadoop is a Full, Thriving Ecosystem
Cloudera's Enterprise Data Hub: unified, elastic, resilient, secure
- Enterprise workload management: Hadoop 2 YARN
- Diverse analytic platform:
  - Batch processing: MapReduce
  - Analytic SQL: Cloudera Impala
  - Search engine: Cloudera Search (Solr)
  - Machine learning & stream processing: Apache Spark, Spark Streaming
  - Online NoSQL: HBase
  - 3rd-party applications
- Storage for any type of data: HDFS filesystem
- Managed and secure:
  - System management: Cloudera Manager
  - Data management, audit, governance: Cloudera Navigator
  - Security: Sentry

Widespread in the Enterprise: Proven Track Record
- 20+ billion online events per day are ingested by Cloudera
- 70% of all the smartphones in the U.S. are powered by Cloudera
- 250 million Tweets per day are filtered for actionable business insights by Cloudera
- 4 of the top financial institutions have standardized on Cloudera
- Leading technology company standardizes globally with Cloudera as a single Big Data platform
- 3 of the top 5 organizations in telecoms, defense, media, banking and retail run Cloudera

Enterprise-grade Security for Hadoop
- Perimeter: guarding access to the cluster itself. Technical concepts: authentication, network isolation. Tools: Kerberos, Oozie, Knox.
- Data: protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, data masking. Tools: certified partners.
- Access: defining what users and applications can do with data. Technical concepts: permissions, authorization. Tools: Sentry.
- Visibility: reporting on where data came from and how it's being used. Technical concepts: auditing, lineage. Tools: Cloudera Navigator.

Cloudera Navigator
Data Management Layer for Cloudera Enterprise
- Audit & access control: ensuring appropriate permissions & reporting on data access for compliance
- Discovery & exploration: finding out what data is available and what it looks like
- Lineage: tracing data back to its original source
- Lifecycle management: migration of data based on policies
- Enterprise metadata repository: business metadata, lineage metadata, operational metadata
[Diagram: Cloudera Navigator on top of CDH (HDFS, HBase, Hive)]

Cloudera BDR
Backup and Disaster Recovery for Cloudera Enterprise
Reduce complexity:
- Centrally manage backup & DR workflows
- Simple setup via an intuitive user interface
Maximize efficiency:
- Simplify processes to meet or exceed SLAs & Recovery Time Objectives (RTOs)
- Optimize system performance & network impact through scheduling
Reduce risk & exposure:
- Eliminate error-prone manual processes
- Get notified when issues occur
- The only solution for metadata replication (Hive)

The Trend is Here: Hadoop as Enterprise Data Hub
1. Active archive
   - Full-fidelity original data
   - Indefinite time, any source
   - Lowest-cost storage
2. Data management, transformations
   - One source of data for all analytics
   - Persisted state of transformed data
   - Significantly faster & cheaper
3. Self-service exploratory BI
   - Simple search + BI tools
   - Schema-on-read agility
   - Reduce BI user backlog requests
4. Multi-workload analytic platform
   - Bring applications to data
   - Combine different workloads on common data (e.g. SQL + Search)
   - True BI agility
[Diagram labels: servers, marts, EDWs, documents, storage, search, archives. Sources: ERP, CRM, RDBMS, machines; files, images, video, logs, clickstreams; external data sources]

Restructure Your Thinking
An example: EDW optimization to free up costly resources. Optimize your specialized EDW systems for high-performance operational analytics.
- Keep in EDW: operational analytics, reporting, business analytics
- Move to Cloudera (data hub): historical data, data processing, ad hoc exploratory, transformation/batch

Interactive Analytic SQL on Hadoop
Cloudera Impala and Hive/Stinger
- Unlocks self-service, exploratory BI on any Hadoop data
- Modern MPP SQL query engine
- >10x faster than the latest Hive
- Runs IN Hadoop
- ANSI SQL compliant
- Use existing BI tools
- Secure and governed
- Easy to manage
- Apache-licensed open source
Use cases:
- Data warehouse offload
- Interactive BI/analytics on more data
- Full-fidelity, active compliance archiving
New for Impala 1.2:
- UDFs and prebuilt analytic functions
- Automatic metadata refresh
- Cost-based join order optimizer
- Initial integration with YARN
New for Hive 0.12:
- YARN integration
- Performance and query optimizations
- Updated SQL

Benefits of Impala/SQL on Hadoop
Fastest SQL for Hadoop:
- Modern MPP architecture: no MapReduce
- Comparable performance to RDBMS
- 10-100x faster than Hive/Stinger
Flexible:
- Quickly explore any Hadoop data
- Schema on read or write
- Shares data with other engines, e.g. search, ML
Native and open:
- No remote query, no data movement
- Uses Hadoop metadata, security, resources
- Apache-licensed open source
Managed:
- Integrated with YARN
- Easy installation, management, monitoring, upgrades via Cloudera Manager
Easy to use:
- ANSI SQL-compliant
- Certified for popular BI tools
- Pre-built analytic functions with MADlib
Secure and governed:
- Comprehensive data security
- Granular role-based access controls (Sentry)
- Auditable permissions

Interactive Analytic SQL: Think Differently
Offload the data warehouse. Optimize for the right workload.
- Today: relentless EDW growth
- Tomorrow: the right workloads in the right system
- EDW: operational analytics, reporting, business analytics
- EDW + Cloudera (data hub): historical data, data processing, ad hoc exploratory, transformation/batch
[Chart labels: 100 TB, 200 TB, 100 TB, 100 TB]

Transform the Economics of Data
- Traditional data warehouse: add 100 TB = [amounts not captured] in incremental spend
- With Cloudera: add 100 TB = 1/10th the cost of legacy systems
CONFIDENTIAL - RESTRICTED

Search: Cloudera Search (Apache Solr)
Explore, navigate, correlate
Accessible:
- Interactive full-text and faceted navigation
- Real-time exploration of all your data
- Multi-audience friendly
Flexible:
- Batch, real-time, and on-demand (re)indexing
- Multi-datatype, multi-format support
- Natively integrates with other Hadoop engines
- Rich API and ecosystem
100% open source:
- Industry-standard search engine
- Mature code base, vibrant community
Cloudera was the first commercial Hadoop vendor shipping and supporting Search.

Machine Learning and Stream Processing: Apache Spark
Open source parallel data processing framework:
- Fast. In-memory processing unlocks > 100x faster data processing than MapReduce and enables iterative machine learning and analytics.
- Developer-friendly. Write in Java, Scala, or Python with rich APIs.
- Integrated. Shipped with CDH, managed through Cloudera Manager, supported and developed in collaboration with Databricks.
Easy, real-time stream processing:
- Easy. API enables fast development of streaming apps.
- Fault-tolerant. Exactly-once semantics out of the box.
- Integrated. Shares data and models with Spark.
Cloudera is the only commercial Hadoop vendor that both ships and supports Spark.

Extensive Partner Network for Cloudera and Hadoop
Partner categories: BI and analytics, SI, database, reseller, data integration, hardware

"Hadoopable" Big Data Use Case Indicators
Best practice: first deploy an operational use case, then follow with analytics use cases.
1. The business wants to analyze new data sources
2. Storage needs (and costs) are increasing dramatically
3. Insufficient batch processing power/capacity to meet internal SLAs
4. Need to extend the life of existing analytics or ETL systems
5. Financial pressures to reduce IT costs

Two Categories of Hadoop Use Cases
- Innovation and advantage. Ask bigger questions: gain value from all your data. Examples: business intelligence, advanced analytics, applications.
- Operational efficiency. Perform existing workloads faster, cheaper, better. Examples: data processing (ETL offload), data storage (enterprise data hub). Most companies start here.

Ask Bigger Questions: How can we increase sales?
eBay increased top-line revenues by 2% through search optimization across 300 million listings, 97 million buyers & sellers, and 50,000 product categories.

Cloudera delivers ROI
eBay drove a 2% increase in top-line revenues and achieved ROI on its Cloudera investment in 6 months through search optimization.
The Challenge:
- Need to understand massive volumes of clickstream data
- Merchants post near-duplicate entries, which reduce the number of unique, relevant results per search
The Solution:
- Cloudera Enterprise Data Hub Edition
- Multi-tenant environment links every search with structured profile data to de-clutter the website and deliver the greatest variety of relevant search results

Ask Bigger Questions: How can we conserve energy?
Opower provides 360-degree views into energy usage patterns and similar household comparisons to help consumers save energy.

Cloudera converts smart grid data into value
Opower helps 4+ million homes save hundreds of millions of dollars on energy bills through big data analysis.
The Challenge:
- Ever-growing utility data streams that should be captured and analyzed (AMI, smart appliances, interactive user apps, sensors, social media)
- Utility companies strive to help customers understand energy usage
The Solution:
- Cloudera Enterprise Data Hub Edition deployed to store, transform and query time series and social data

Ask Bigger Questions: How can we better understand risk?
Allstate's universal data archive allows co-mingling of 80+ years of data spanning all business units and all 50 states.

Allstate builds a universal data archive
Allstate optimizes offers and pricing with a comprehensive view of individual risk.
The Challenge:
- Data silos spread across the company with 80+ years of historical data; only some digitized
- Analysis on one state's data takes 24 hours; can't analyze all 50 states at once
The Solution:
- Universal data archive on Cloudera Enterprise spans enterprise-wide systems
- 3 use cases: storage, ETL, applied math
- Analyze all 50 states in 16 hours using Hive; 500X speed-up; previously each state took about a day
Resource: Cloudera Sessions Chicago 2013 video

Hadoop Administrator: Professional Profile
Required skills:
- Linux administration
- Java knowledge
- Networking knowledge
- Understanding of hardware
Responsibilities:
- Install, configure and upgrade Hadoop
- Manage hardware components
- Monitor and configure the cluster
- Integrate Hadoop with other systems

Hadoop Developer / Analyst: Professional Profile
Required skills:
- Basic Linux use
- Programming knowledge (Java, SQL, scripting, etc.)
- Understanding of data, ETL
Responsibilities:
- Develop Hadoop programs (MapReduce, Spark)
- Manage data files (command line, HUE)
- Monitor jobs (web UI)
- Manage data lifecycle
