Hadoop in the Enterprise: Modern Architecture with Hadoop 2. Jeff Markham, Technical Director, APAC, Hortonworks
Hadoop Wave ONE: Web-Scale Batch Apps (2006 to 2012). [Figure: technology adoption curve, relative % of customers over time. Segments: innovators (technology enthusiasts); early adopters (visionaries); the CHASM; early majority (pragmatists); late majority (conservatives); laggards (skeptics). Before the chasm, customers want technology & performance; after it, customers want solutions & convenience. Source: Geoffrey Moore, Crossing the Chasm]
Hadoop Wave TWO: Broad Enterprise Apps (2013 & beyond): batch, interactive, online, streaming, etc. [Figure: technology adoption curve, relative % of customers over time. Segments: innovators (technology enthusiasts); early adopters (visionaries); the CHASM; early majority (pragmatists); late majority (conservatives); laggards (skeptics). Before the chasm, customers want technology & performance; after it, customers want solutions & convenience. Source: Geoffrey Moore, Crossing the Chasm]
Hadoop 2.0 Key Highlights. 2.0 is architected for the broad enterprise: single cluster, many workloads (BATCH, INTERACTIVE, ONLINE, STREAMING). Enterprise requirements and the corresponding HDP 2.0 features: mixed workloads (YARN); interactive query (Hive on Tez); reliability (full-stack HA); point-in-time recovery (snapshots); multi-data-center (disaster recovery); ZERO downtime (rolling upgrades).
The 1st Generation of Hadoop: Batch. HADOOP 1.0 was built for web-scale batch apps. All other usage patterns must leverage that same infrastructure, which forces the creation of silos for managing mixed workloads. [Diagram: single-app silos (BATCH, INTERACTIVE, ONLINE), each on its own HDFS]
A Transition From Hadoop 1 to 2. HADOOP 1.0: MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage).
A Transition From Hadoop 1 to 2. HADOOP 1.0: MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage). HADOOP 2.0: MapReduce (data processing) and others (data processing) on top of YARN (cluster resource management), on top of HDFS (redundant, reliable storage).
The Enterprise Requirement: Beyond Batch. For Hadoop to become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service. Workloads: BATCH, INTERACTIVE, ONLINE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, OTHER, all on HDFS (redundant, reliable storage).
YARN: Taking Hadoop Beyond Batch. Created to manage resource needs across all uses. Ensures predictable performance & QoS for all apps. Enables apps to run IN Hadoop rather than ON. Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc. Applications run natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, ...), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, ...), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).
Old School Hadoop: MapReduce
New School Hadoop with YARN. [Diagram: clients submit jobs to the Resource Manager; Node Managers host containers and per-application App Masters. Arrows show job submission, resource requests, node status, and MapReduce status.]
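The roles named on the YARN slide (Resource Manager, Node Manager, App Master, container) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the YARN API: the class names mirror the slide, but every method, the memory-only resource model, and the "most free memory" placement policy are hypothetical choices made for illustration. In real YARN the App Master itself negotiates task containers with the Resource Manager; the sketch folds that step into `submit` for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NodeManager:
    """Toy stand-in for a Node Manager: tracks free memory and hosted containers."""
    name: str
    free_mb: int
    containers: List[str] = field(default_factory=list)

    def launch(self, owner: str, mb: int) -> str:
        # Reserve memory and record a container id of the form owner@node#index.
        self.free_mb -= mb
        container_id = f"{owner}@{self.name}#{len(self.containers)}"
        self.containers.append(container_id)
        return container_id


class ResourceManager:
    """Toy stand-in for the Resource Manager: the single arbiter of cluster memory."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = nodes

    def allocate(self, owner: str, mb: int) -> Optional[str]:
        # Hypothetical policy: place the container on the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            return None  # no node can satisfy the request right now
        return node.launch(owner, mb)

    def submit(self, app: str, am_mb: int, task_mb: int,
               tasks: int) -> Tuple[Optional[str], List[str]]:
        # Job submission: first start the per-application App Master container.
        app_master = self.allocate(app + "-AM", am_mb)
        if app_master is None:
            return None, []
        # Resource requests: ask for task containers until the cluster is full.
        granted: List[str] = []
        for _ in range(tasks):
            container = self.allocate(app, task_mb)
            if container is None:
                break
            granted.append(container)
        return app_master, granted


# Two different workloads share one two-node cluster.
cluster = ResourceManager([NodeManager("n1", 4096), NodeManager("n2", 4096)])
batch_am, batch_tasks = cluster.submit("mr", am_mb=1024, task_mb=2048, tasks=3)
query_am, query_tasks = cluster.submit("tez", am_mb=1024, task_mb=1024, tasks=2)
```

In this run the batch job fills most of the cluster, so the second application gets its App Master but has to wait for task containers. That is the kind of contention a real scheduler resolves with queues and capacity guarantees, which the toy deliberately leaves out.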
5 Key Benefits of YARN: 1. Scale! 2. Compatibility with MapReduce. 3. Improved cluster utilization. 4. New programming models. 5. Agility.
Apache Tez: an alternative data processing framework to MapReduce. Improves performance of low-latency applications.
SQL-IN-Hadoop with Apache Hive. Apache Hive is the standard for SQL interaction in Hadoop (most applications claim Hive compatibility today) and the first application to use YARN. Hive on Tez optimizes resource use for Hive queries to improve performance. Apache Tez is a general-purpose processing framework, optimized for YARN, for existing Hadoop applications. Stinger Initiative, a simple focus: 1. 100x performance improvement. 2. Increased SQL compatibility. Enable Hive to support interactive workloads. Improve existing tools & preserve investments. Stinger Phase 1: base optimizations, SQL analytics, ORCFile format. Stinger Phase 2: YARN resource management, Hive on Apache Tez, query service (always on). Stinger Phase 3: vector query, buffer cache, query planner. [Diagram: business analytics tools and custom apps send SQL to Hive, which runs on MapReduce and Tez over YARN and HDFS2]
Hive: More SQL & 100X Faster. Stinger Phase 1 (done in Hive 0.11): base optimizations, SQL analytics, ORCFile format. Stinger Phase 2 (we are here): YARN resource management, Hive on Apache Tez, query service. Stinger Phase 3 (work started): vector query, buffer cache, query planner. SQL compliance highlights: ROLLUP and CUBE; windowing functions (OVER, RANK, etc.); DECIMAL, CHAR, VARCHAR, DATE; UNION DISTINCT and UNION outside of subquery; sub-queries for IN/NOT IN, HAVING; EXISTS / NOT EXISTS; INTERSECT, EXCEPT.
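To make the windowing functions in the SQL compliance highlights (OVER, RANK) concrete, the sketch below emulates in plain Python what a query of the shape `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)` computes. It is an illustration of the semantics only, not Hive code: the `emp` rows, the column names `dept`, `name`, and `salary`, and the helper `rank_over` are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical query whose result the helper below emulates:
#   SELECT dept, name, salary,
#          RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank
#   FROM emp;


def rank_over(rows, partition_key, order_key):
    """Emulate RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC)."""
    result = []
    # groupby needs its input sorted by the partition key.
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        members = sorted(partition, key=itemgetter(order_key), reverse=True)
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(members, start=1):
            # Ties share a rank; the next distinct value skips ahead (1, 1, 3).
            if row[order_key] != previous:
                rank = position
                previous = row[order_key]
            result.append({**row, "rank": rank})
    return result


emp = [
    {"dept": "a", "name": "ann", "salary": 10},
    {"dept": "a", "name": "bob", "salary": 10},
    {"dept": "a", "name": "cat", "salary": 5},
    {"dept": "b", "name": "dan", "salary": 7},
]
ranked = rank_over(emp, "dept", "salary")
```

Each department is ranked independently, which is what separates a window function from a plain GROUP BY: every input row survives, and the rank is attached to it rather than collapsing the group.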
Hive's Performance Trajectory: http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
Making Hadoop Enterprise Ready
Thank You! http://hortonworks.com/sandbox