Hadoop in the Enterprise

1 Hadoop in the Enterprise: Modern Architecture with Hadoop 2
Jeff Markham, Technical Director, APAC, Hortonworks

2 Hadoop Wave ONE: Web-scale Batch Apps
2006 to 2012: Web-Scale Batch Applications
Chart: relative % customers over time
Adopter segments: Innovators, technology enthusiasts; Early adopters, visionaries; The CHASM; Early majority, pragmatists; Late majority, conservatives; Laggards, skeptics
Chart labels: Customers want technology & performance; Customers want solutions & convenience
Source: Geoffrey Moore - Crossing the Chasm

3 Hadoop Wave TWO: Broad Enterprise Apps
2013 & Beyond: Batch, Interactive, Online, Streaming, etc.
Chart: relative % customers over time
Adopter segments: Innovators, technology enthusiasts; Early adopters, visionaries; The CHASM; Early majority, pragmatists; Late majority, conservatives; Laggards, skeptics
Chart labels: Customers want technology & performance; Customers want solutions & convenience
Source: Geoffrey Moore - Crossing the Chasm

4 Hadoop 2.0 Key Highlights
2.0: Architected for the Broad Enterprise
Single Cluster, Many Workloads: BATCH, INTERACTIVE, ONLINE, STREAMING
Enterprise Requirements / HDP 2.0 Features
Mixed workloads / YARN
Interactive Query / Hive on Tez
Reliability / Full Stack HA
Point in time Recovery / Snapshots
Multi Data Center / Disaster Recovery
ZERO downtime / Rolling Upgrades

5 The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
All other usage patterns must leverage that same infrastructure
Forces the creation of silos for managing mixed workloads
Diagram: Single App silos (INTERACTIVE, ONLINE, BATCH, BATCH, BATCH) over separate HDFS instances (HDFS, HDFS, HDFS)

6 A Transition From Hadoop 1 to 2
HADOOP 1.0: MapReduce (cluster resource management & data processing); HDFS (redundant, reliable storage)

7 A Transition From Hadoop 1 to 2
HADOOP 1.0: MapReduce (cluster resource management & data processing); HDFS (redundant, reliable storage)
HADOOP 2.0: MapReduce (data processing) and Others (data processing); YARN (cluster resource management); HDFS (redundant, reliable storage)

8 The Enterprise Requirement: Beyond Batch
To become an enterprise viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously & with predictable levels of service.
Workloads: BATCH, INTERACTIVE, ONLINE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, OTHER
HDFS (Redundant, Reliable Storage)

9 YARN: Taking Hadoop Beyond Batch
Created to manage resource needs across all uses
Ensures predictable performance & QoS for all apps
Enables apps to run IN Hadoop rather than ON
Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.
Applications Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, ...), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search) (Weave ...)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)

10 Old School Hadoop: MapReduce

11 New School Hadoop with YARN
Diagram components: Client (x2), Resource Manager, Node Manager (x3), App Mstr (x2), Container (x4)
Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request
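
The roles named in the slide 11 diagram can be illustrated with a toy model. This is a minimal sketch, not YARN's actual API or scheduler: the class names mirror the diagram labels, while the memory figures, the "most free memory first" placement rule, and the `submit` and `allocate` methods are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeManager:
    """One worker node that hosts containers (hypothetical toy model)."""
    name: str
    free_mb: int
    containers: List[str] = field(default_factory=list)

    def launch(self, owner: str, mb: int) -> None:
        # Reserve memory on this node for one container.
        self.free_mb -= mb
        self.containers.append(owner)


class ResourceManager:
    """Cluster-wide arbiter of resources (hypothetical toy model)."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = list(nodes)

    def allocate(self, owner: str, mb: int) -> Optional[str]:
        # Assumed placement rule: pick the node with the most free memory.
        candidates = [n for n in self.nodes if n.free_mb >= mb]
        if not candidates:
            return None
        node = max(candidates, key=lambda n: n.free_mb)
        node.launch(owner, mb)
        return node.name

    def submit(self, app: str, task_count: int, task_mb: int,
               master_mb: int = 512) -> Dict[str, object]:
        # Job submission: first start an application master in a container.
        am_node = self.allocate(f"{app}:AppMaster", master_mb)
        if am_node is None:
            return {"app_master": None, "tasks": []}
        # Resource requests: the application master asks for task containers.
        placements: List[str] = []
        for i in range(task_count):
            node = self.allocate(f"{app}:task{i}", task_mb)
            if node is None:
                break  # cluster is full; remaining tasks would have to wait
            placements.append(node)
        return {"app_master": am_node, "tasks": placements}


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
    print(rm.submit("mapreduce-job", task_count=3, task_mb=1024))
    print(rm.submit("tez-query", task_count=2, task_mb=2048))
```

Two different applications share the same pool of nodes here, which is the point of slide 7's split between resource management and data processing. Real YARN adds queues, heartbeats, and failure handling that this sketch leaves out.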

12 5 Key Benefits of YARN
1. Scale!
2. Compatibility with MapReduce
3. Improved cluster utilization
4. New programming models
5. Agility

13 Apache Tez
An alternate data processing framework to MapReduce
Improves performance of low-latency applications

14 SQL-IN-Hadoop with Apache Hive
Stack diagram: Business Analytics, Custom Apps; SQL; HIVE; MAP REDUCE, TEZ; YARN; HDFS2
Apache Hive: First Application to use YARN
Hive on Tez optimizes resource use for Hive queries to improve performance
Apache Hive is the standard for SQL interaction in Hadoop (most applications claim Hive compatibility today)
Apache Tez: optimized for YARN, a general-purpose processing framework for existing Hadoop applications
Stinger Initiative: Simple Focus
100x Performance Improvement
Increased SQL Compatibility
Enable Hive to support interactive workloads
Improve existing tools & preserve investments
Stinger Phase 1: Base Optimizations; SQL Analytics; ORCFile Format
Stinger Phase 2: YARN Resource Mgmnt; Hive on Apache Tez; Query Service (always on)
Stinger Phase 3: Vector Query; Buffer Cache; Query Planner

15 Hive: More SQL & 100X Faster
Stinger Phase 1: Base Optimizations; SQL Analytics; ORCFile Format
Stinger Phase 2: YARN Resource Mgmnt; Hive on Apache Tez; Query Service
Stinger Phase 3: Vector Query; Buffer Cache; Query Planner
Status markers: Done in Hive 0.11; We Are Here; Work Started
SQL Compliance Highlights:
ROLLUP and CUBE
Windowing functions (OVER, RANK, etc.)
DECIMAL, CHAR, VARCHAR, DATE
UNION DISTINCT and UNION outside of subquery
Sub-queries for IN/NOT IN, HAVING
EXISTS / NOT EXISTS
INTERSECT, EXCEPT
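
Two of the SQL compliance highlights listed on slide 15, windowing functions and EXISTS sub-queries, can be shown with a small query sketch. Hive itself is not used here: Python's built-in `sqlite3` module (assuming a bundled SQLite of version 3.25 or later, which is needed for window functions) is swapped in as a stand-in engine. The `sales` and `targets` tables and their contents are invented for illustration, and the sketch makes no claim about which Hive release supports which construct.

```python
import sqlite3


def run_sql_examples():
    """Run a windowing query and an EXISTS sub-query on toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("APAC", "ann", 300), ("APAC", "bob", 500),
         ("EMEA", "cat", 400), ("EMEA", "dan", 400)],
    )
    conn.execute("CREATE TABLE targets (region TEXT)")
    conn.execute("INSERT INTO targets VALUES ('APAC')")

    # Windowing function: rank reps by amount within each region.
    ranked = conn.execute(
        """
        SELECT region, rep,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
        ORDER BY region, rnk, rep
        """
    ).fetchall()

    # EXISTS sub-query: keep only reps whose region has a target row.
    with_target = conn.execute(
        """
        SELECT rep FROM sales s
        WHERE EXISTS (SELECT 1 FROM targets t WHERE t.region = s.region)
        ORDER BY rep
        """
    ).fetchall()

    conn.close()
    return ranked, with_target


if __name__ == "__main__":
    ranked, with_target = run_sql_examples()
    print(ranked)
    print(with_target)
```

Tied amounts share a rank under RANK, so both EMEA reps come out ranked first in their region.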

16 Hive's Performance Trajectory

17 Making Hadoop Enterprise Ready

18 Thank You!
