Extending Hadoop beyond MapReduce


Mahadev Konar, Co-Founder
@mahadevkonar (@hortonworks)

Bio
- Apache Hadoop since 2006: committer and PMC member
- Developed and supported MapReduce @ Yahoo!
  - Core member of design and development team on MR Next Gen
- Apache ZooKeeper since 2008: committer and current PMC chair
- Lead Apache ZooKeeper development and support @ Yahoo!
- Co-founder @hortonworks

Agenda
- Overview
- Current Limitations and Requirements
- Architectures
- Improvements and Updates
- Q&A

Hadoop MapReduce Classic
- JobTracker
  - Manages cluster resources and job scheduling
- TaskTracker
  - Per-node agent
  - Manages tasks

Current Limitations
- Hard partition of resources into map and reduce slots
  - Low resource utilization
- Lacks support for alternate paradigms
  - Iterative applications implemented using MapReduce are 10x slower
  - Hacks for the likes of MPI/Graph Processing
- Lack of wire-compatible protocols
  - Client and cluster must be of same version
  - Applications and workflows cannot migrate to different clusters

Current Limitations
- Utilization
- Scalability
  - Maximum cluster size: 4,000 nodes
  - Maximum concurrent tasks: 40,000
  - Coarse synchronization in JobTracker
- Single point of failure
  - Failure kills all queued and running jobs
  - Jobs need to be re-submitted by users
- Restart is very tricky due to complex state

Requirements
- Reliability
- Availability
- Utilization
- Wire Compatibility
- Agility & Evolution
  - Ability for customers to control upgrades to the grid software stack
- Scalability
  - Clusters of 6,000-10,000 machines
  - Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
  - 100,000+ concurrent tasks
  - 10,000 concurrent jobs

Design Centre
- Split up the two major functions of JobTracker
  - Cluster resource management
  - Application life-cycle management
- MapReduce becomes user-land library

Architecture
- Application
  - Application is a job submitted to the framework
  - Example: MapReduce job
- Container
  - Basic unit of allocation
  - Example: container A = 2GB, 1 CPU
  - Replaces the fixed map/reduce slots
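To make the slot-versus-container point concrete, here is a minimal Python sketch. It is not YARN code: the node size, slot counts, and request sizes are hypothetical, and only memory is modelled. It compares how much of a node is busy during a map-heavy phase when map and reduce slots are fixed, and when the same node hands out generic 2GB containers.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A worker node with a single pool of memory (hypothetical model)."""
    total_mb: int
    used_mb: int = 0

    def allocate(self, request_mb: int) -> bool:
        """Grant a container if enough free memory remains."""
        if self.used_mb + request_mb > self.total_mb:
            return False
        self.used_mb += request_mb
        return True


def slots_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots busy when map and reduce slots are not interchangeable."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(node, requests_mb):
    """Fraction of node memory in use after granting requests in order."""
    for request in requests_mb:
        node.allocate(request)
    return node.used_mb / node.total_mb


# Map-heavy phase: 12 runnable map tasks, no reduce tasks yet (made-up numbers).
slot_util = slots_utilization(map_slots=8, reduce_slots=4, map_tasks=12, reduce_tasks=0)
cont_util = container_utilization(Node(total_mb=24 * 1024), [2048] * 12)
print(f"fixed slots: {slot_util:.2f}, containers: {cont_util:.2f}")
```

With fixed slots the four reduce slots sit idle while map tasks wait; with containers the same capacity can go to whichever tasks are runnable.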

Architecture
- Resource Manager
  - Global resource scheduler
  - Hierarchical queues
- Node Manager
  - Per-machine agent
  - Manages the life-cycle of container
  - Container resource monitoring
- Application Master
  - Per-application
  - Manages application scheduling and task execution
  - E.g. MapReduce Application Master
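The division of labour between the three roles can be sketched as a toy Python model. This is an illustration only, not the YARN API: the class and method names, the first-fit placement policy, and the node sizes are all assumptions, and the real protocol between the components is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks the containers running on its node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id: str, size_mb: int) -> None:
        self.free_mb -= size_mb
        self.containers.append((app_id, size_mb))


class ResourceManager:
    """Global scheduler: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id: str, size_mb: int, count: int):
        """Grant up to `count` containers, first-fit across nodes (assumed policy)."""
        granted = []
        for _ in range(count):
            node = next((n for n in self.nodes if n.free_mb >= size_mb), None)
            if node is None:
                break  # cluster is full; the request is only partially satisfied
            node.launch(app_id, size_mb)
            granted.append(node.name)
        return granted


class ApplicationMaster:
    """Per-application: asks for resources and assigns its own tasks to them."""

    def __init__(self, app_id: str, tasks):
        self.app_id = app_id
        self.tasks = list(tasks)

    def run(self, rm: ResourceManager, size_mb: int):
        hosts = rm.allocate(self.app_id, size_mb, len(self.tasks))
        return dict(zip(self.tasks, hosts))


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
am = ApplicationMaster("job_0001", ["map_0", "map_1", "map_2"])
placement = am.run(rm, size_mb=2048)
print(placement)
```

The Resource Manager never sees individual tasks here: it only answers resource requests, and the mapping of tasks to containers stays inside the Application Master.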

Architecture
[Diagram. Components: Client, Resource Manager, Node Manager, App Mstr, Container. Arrow labels: MapReduce Status, Job Submission, Node Status, Resource Request.]

Architecture: Resource Manager
[Diagram. The Resource Manager contains the Applications Manager, the Scheduler, and the Resource Tracker.]
- Applications Manager
  - Responsible for launching and monitoring Application Masters (per Application process)
  - Restarts an Application Master on failure
- Scheduler
  - Responsible for allocating resources to the Application
- Resource Tracker
  - Responsible for managing the nodes in the cluster
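The "restarts an Application Master on failure" responsibility might look roughly like the following Python sketch. The names are hypothetical and the retry limit (`max_attempts`) is an assumed policy knob; the slide does not say how many restarts are attempted.

```python
class ApplicationsManager:
    """Launches Application Masters and relaunches them when they fail."""

    def __init__(self, max_attempts: int = 2):
        # max_attempts is a hypothetical policy knob for this sketch.
        self.max_attempts = max_attempts
        self.attempts = {}

    def submit(self, app_id: str, am_factory):
        """Run the AM; on failure, start a fresh attempt up to the limit."""
        for attempt in range(1, self.max_attempts + 1):
            self.attempts[app_id] = attempt
            try:
                return am_factory(attempt)
            except RuntimeError:
                continue  # this attempt died; try again if attempts remain
        return "FAILED"


def flaky_am(attempt: int) -> str:
    """Stand-in Application Master that crashes on its first attempt."""
    if attempt == 1:
        raise RuntimeError("AM container lost")
    return "SUCCEEDED"


manager = ApplicationsManager(max_attempts=2)
status = manager.submit("app_0001", flaky_am)
print(status, manager.attempts["app_0001"])
```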

Improvements vis-à-vis classic MapReduce
- Utilization
  - Generic resource model: Memory, CPU, Disk b/w, Network b/w
  - Remove fixed partition of map and reduce slots
- Scalability
  - Application life-cycle management is very expensive
  - Partition resource management and application life-cycle management
  - Application management is distributed
  - Hardware trends: currently run clusters of 4,000 machines
    - 6,000 2012 machines > 12,000 2009 machines
    - <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>
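The hardware comparison on this slide can be checked with a few lines of Python. The sketch assumes the lower-bound 2012 figures (16 cores, 48G, 24TB per machine); the slide also lists "16+" cores and 96G, which would widen the gap further.

```python
# Per-machine figures taken from the slide; 2012 uses the lower bounds.
fleet_2012 = {"machines": 6000, "cores": 16, "ram_gb": 48, "disk_tb": 24}
fleet_2009 = {"machines": 12000, "cores": 8, "ram_gb": 16, "disk_tb": 4}


def totals(fleet):
    """Aggregate each per-machine resource over the whole fleet."""
    return {k: fleet["machines"] * v for k, v in fleet.items() if k != "machines"}


t2012 = totals(fleet_2012)
t2009 = totals(fleet_2009)
for resource in t2012:
    ratio = t2012[resource] / t2009[resource]
    print(f"{resource}: {t2012[resource]:,} vs {t2009[resource]:,} ({ratio:.1f}x)")
```

At the lower bound the core counts are equal (96,000 each), while the 2012 fleet has 1.5x the RAM and 3x the disk with half as many machines.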

Improvements vis-à-vis classic MapReduce
- Fault Tolerance and Availability
  - Resource Manager
    - No single point of failure: state saved in ZooKeeper (coming soon)
    - Application Masters are restarted automatically on RM restart
  - Application Master
    - Optional failover via application-specific checkpoint
    - MapReduce applications pick up where they left off via state saved in HDFS
- Wire Compatibility
  - Protocols are wire-compatible
  - Old clients can talk to new servers
  - Rolling upgrades
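The "pick up where they left off" behaviour can be illustrated with a small Python sketch of checkpoint-and-resume. It is an assumption-laden toy: a local JSON file stands in for the state the slide says is saved in HDFS, and the function name, task names, and `fail_after` crash switch are all hypothetical.

```python
import json
import os
import tempfile


def run_job(tasks, checkpoint_path, fail_after=None):
    """Run tasks in order, recording each completed task in the checkpoint."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for task in tasks:
        if task in done:
            continue  # finished by an earlier attempt; skip it
        if fail_after is not None and len(executed) == fail_after:
            raise RuntimeError("application master crashed")
        executed.append(task)
        done.append(task)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress after every task
    return executed


tasks = ["map_0", "map_1", "map_2", "reduce_0"]
checkpoint = os.path.join(tempfile.mkdtemp(), "job_state.json")

try:
    run_job(tasks, checkpoint, fail_after=2)  # first attempt dies after two tasks
except RuntimeError:
    pass
second_attempt = run_job(tasks, checkpoint)  # restart resumes from the checkpoint
print(second_attempt)
```

The restarted attempt runs only the tasks that are missing from the checkpoint, so completed work is not repeated.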

Improvements vis-à-vis classic MapReduce
- Innovation and Agility
  - MapReduce now becomes a user-land library
  - Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
  - Faster deployment cycles for improvements
  - Customers upgrade MapReduce versions on their schedule
  - Users can customize MapReduce

Improvements vis-à-vis classic MapReduce
- Support for programming paradigms other than MapReduce
  - MPI
  - Master-Worker
  - Machine Learning
  - Iterative processing
- Enabled by allowing the use of paradigm-specific application master
- Run all on the same Hadoop cluster

Is it released?
- Available in 0.23.1 release
- Coming soon: 0.23.2 release

Any Performance Gains?
- 2x+ across the board
- MapReduce
  - Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
  - Shuffle 30%+
  - Small Jobs: Uber AM
  - Re-use task slots (container reuse)
- More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/

Testing?
- Testing, *lots* of it
- Benchmarks (every release should be at least as good as the last one)
- Integration testing
  - HBase
  - Pig
  - Hive
  - Oozie
- Functional tests
  - Nightly
  - Over 1,000 functional tests for MapReduce alone
  - Several hundred for Pig/Hive etc.
- Deployment discipline

Benchmarks
- Benchmark every part of the HDFS & MR pipeline
  - HDFS read/write throughput
  - NN operations
  - Scan, Shuffle, Sort
- GridMixv3
  - Run production traces in test clusters
  - Thousands of jobs
  - Stress mode v/s Replay mode

Deployment
- Alpha/Test (early UAT) in November 2011
  - Small scale (500-800 nodes)
- Alpha in February 2012
  - Majority of users
  - ~1,000 nodes per cluster, > 2,000 nodes in all
- Beta
  - Misnomer: 10s of PB of storage
  - Significantly wide variety of applications and load
  - 4,000+ nodes per cluster, > 15,000 nodes in all
  - Q2 2012
- Production
  - Well, it's production
  - Mid-to-late Q2 2012

Questions?
- hadoop-0.23.1 (alpha release): http://hadoop.apache.org/common/releases.html
- Release Documentation: http://hadoop.apache.org/common/docs/r0.23.1/
- Hortonworks website: http://hortonworks.com/

Other Resources
- Hadoop Summit
  - June 13-14, San Jose, California
  - www.hadoopsummit.org
- Hadoop Training and Certification
  - Developing Solutions Using Apache Hadoop
  - Administering Apache Hadoop
  - http://hortonworks.com/training/
- On-demand Webinars
  - Available now on Hortonworks website
  - http://hortonworks.com/webinars/
Hortonworks Inc. 2012

Thank You @mahadevkonar