Job Scheduling for MapReduce
Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica
UC Berkeley RAD Lab, *Facebook Inc.
Motivation
- Hadoop was designed for large batch jobs: FIFO queue + locality optimization
- At Facebook, we saw a different workload:
  - Many users want to share a cluster
  - Many jobs are small (10-100 tasks): sampling, ad-hoc queries, periodic reports, etc.
- How should we schedule tasks in a shared MapReduce cluster?
Benefits of Sharing
- Higher utilization due to statistical multiplexing
- Data consolidation (ability to query disjoint data sets together)
Why is it Interesting?
- Data locality is crucial for performance
- Conflict between locality and fairness
- 70% gain from a simple algorithm
Outline
- Task scheduling in Hadoop
- Two problems:
  - Head-of-line scheduling
  - Slot stickiness
- A solution (global scheduling)
- Lots more problems (future work)
Task Scheduling in Hadoop
- Slaves send heartbeats to the master periodically
- Master responds with a task if a slot is free, picking the task with data closest to the node
[Diagram: master with a job queue assigning tasks to heartbeating slaves]
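The heartbeat protocol above can be sketched as follows (an illustrative Python sketch, not Hadoop's actual Java classes; `Task`, `Job`, and `assign_task` are hypothetical names):

```python
from dataclasses import dataclass

@dataclass
class Task:
    input_nodes: frozenset   # nodes holding a replica of this task's input block
    input_racks: frozenset   # racks holding a replica

@dataclass
class Job:
    pending: list            # unscheduled Tasks

def locality(task, node, rack):
    """0 = node-local, 1 = rack-local, 2 = off-rack."""
    if node in task.input_nodes:
        return 0
    if rack in task.input_racks:
        return 1
    return 2

def assign_task(job_queue, node, rack):
    """Answer a heartbeat from (node, rack): FIFO over jobs, and within
    the first job with pending work, pick the task with the closest data."""
    for job in job_queue:
        if job.pending:
            best = min(job.pending, key=lambda t: locality(t, node, rack))
            job.pending.remove(best)
            return best
    return None
```

Note that only the first job with pending work is ever considered, which is the head-of-line behavior the next slides blame for poor small-job locality.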
Problem 1: Poor Locality for Small Jobs

Job sizes at Facebook:

  # of Maps     Percent of Jobs
  < 25          58%
  25-100        18%
  100-400       14%
  400-1600      7%
  1600-6400     3%
  > 6400        0.26%
Problem 1: Poor Locality for Small Jobs
[Figure: job locality at Facebook — percent of node-local and rack-local maps vs. job size (number of maps), log scale from 10 to 100,000]
Cause
- Only the head-of-queue job is schedulable on each heartbeat
- Chance of the heartbeat node having local data is low
- Jobs with blocks on X% of nodes get X% locality
[Figure: expected node locality (%) vs. job size (number of maps) in a 100-node cluster]
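A rough Monte Carlo sketch of the claim above (a hypothetical helper, assuming uniform block placement with 3x replication — a simplification of real HDFS placement):

```python
import random

def expected_locality(num_maps, num_nodes=100, replication=3, trials=300):
    """Estimate node locality under head-of-line scheduling: each free
    slot heartbeats from a uniformly random node, and only the head job
    may schedule, preferring a pending map that is local to that node."""
    local = total = 0
    for _ in range(trials):
        # nodes holding a replica of each pending map's input block
        holders = [set(random.sample(range(num_nodes), replication))
                   for _ in range(num_maps)]
        while holders:
            node = random.randrange(num_nodes)
            for i, h in enumerate(holders):
                if node in h:      # a pending map is local to this node
                    local += 1
                    holders.pop(i)
                    break
            else:                  # no local map: run one non-locally
                holders.pop()
            total += 1
    return local / total
```

Small jobs have blocks on few nodes, so a random heartbeat rarely lands on one of them; the estimate rises toward 100% as job size grows, matching the curve on the slide.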
Problem 2: Sticky Slots
- Suppose we do fair sharing as follows:
  - Divide task slots equally between jobs
  - When a slot becomes free, give it to the job that is farthest below its fair share
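The slot-assignment rule above, as a minimal sketch (`pick_job` is a hypothetical helper; real fair schedulers weight shares per pool):

```python
def pick_job(running, total_slots):
    """Naive fair sharing: `running` maps job name -> running task count.
    With equal fair shares, the job farthest below its share is simply
    the one with the fewest running tasks."""
    fair_share = total_slots / len(running)
    return min(running, key=lambda j: running[j] - fair_share)
```

The stickiness problem is that this rule alone says nothing about *where* the freed slot is, so a job tends to win back the very slot it just released.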
Problem 2: Sticky Slots

  Job     Fair Share    Running Tasks
  Job 1   2             12
  Job 2   2             2

Problem: jobs never leave their original slots
Calculations
[Figure: expected node locality (%) vs. number of concurrent jobs (0-20) in a 100-node cluster]
Solution: Locality Wait
- Scan through the job queue in order of priority
- Jobs must wait before they are allowed to run non-local tasks:
  - If wait < T1, only allow node-local tasks
  - If T1 < wait < T2, also allow rack-local tasks
  - If wait > T2, also allow off-rack tasks
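The wait rule above can be sketched as follows (illustrative Python, not Hadoop's implementation; the threshold values and the `Task`/`Job` shapes are assumptions):

```python
from dataclasses import dataclass

T1 = 5.0   # seconds of waiting before rack-local tasks are allowed
T2 = 10.0  # seconds of waiting before off-rack tasks are allowed

@dataclass
class Task:
    input_nodes: frozenset
    input_racks: frozenset

@dataclass
class Job:
    pending: list
    waiting_since: float = 0.0   # when this job last launched a task

def locality(task, node, rack):
    """0 = node-local, 1 = rack-local, 2 = off-rack."""
    if node in task.input_nodes:
        return 0
    if rack in task.input_racks:
        return 1
    return 2

def allowed_level(wait):
    """Worst locality level a job may launch after waiting `wait` seconds."""
    if wait < T1:
        return 0
    if wait < T2:
        return 1
    return 2

def schedule(jobs, node, rack, now):
    """Scan jobs in priority order; launch the closest pending task a
    job is currently allowed to run, else let it keep waiting."""
    for job in jobs:
        if not job.pending:
            continue
        best = min(job.pending, key=lambda t: locality(t, node, rack))
        if locality(best, node, rack) <= allowed_level(now - job.waiting_since):
            job.pending.remove(best)
            job.waiting_since = now   # reset the wait clock after launching
            return job, best
    return None
```

Unlike strict head-of-line scheduling, a skipped job is not starved: its wait clock keeps running, and it is eventually allowed to launch rack-local and then off-rack tasks.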
Locality Wait Example

  Job     Fair Share    Running Tasks
  Job 1   2             12
  Job 2   2             2 -> 3

Jobs can now shift between slots
Evaluation: Locality Gains

                 Default Scheduler       With Locality Wait
  Job Type       Node Loc.  Rack Loc.    Node Loc.  Rack Loc.
  Small Sort     2%         50%          81%        96%
  Small Scan     2%         50%          75%        94%
  Medium Scan    37%        98%          99%        99%
  Large Scan     84%        99%          94%        99%
Throughput Gains
[Chart: normalized running time, default scheduler vs. with locality wait — gains of 18% (small sort), 20% (small scan), 70% (medium scan), 31% (large scan)]
Network Traffic Reduction
[Chart: network traffic in a sort workload, with vs. without locality wait]
Further Analysis
- When is it worthwhile to wait, and how long?
- For throughput: always worth it, unless there's a hotspot
- If there is a hotspot, prefer to run IO-bound tasks on the hotspot node and CPU-bound tasks remotely (rationale: maximize rate of local IO)
Further Analysis
- When is it worthwhile to wait, and how long?
- For response time: E(gain) = (1 - e^(-w/t)) * (D - t)
  - w: wait amount
  - t: expected time to get a local heartbeat
  - D: delay from running non-locally
- Worth it if E(wait) < cost of running non-locally, i.e. t < D
- In that case, optimal wait time is infinity
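Plugging the formula in numerically (a sketch, with w, t, D as defined above): when D > t the gain grows with w and approaches D - t, which is why the optimal wait is unbounded.

```python
import math

def expected_gain(w, t, D):
    """E(gain) = (1 - e^(-w/t)) * (D - t): expected response-time gain
    from waiting up to w for a local heartbeat, where t is the expected
    time until one arrives and D is the penalty of running non-locally."""
    return (1 - math.exp(-w / t)) * (D - t)
```

Conversely, when D < t the gain is negative for any w, so waiting is never worthwhile for that task.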
Problem 3: Memory-Aware Scheduling
a) How much memory does each job need?
   - Asking users for per-job memory limits leads to overestimation
   - Use historic data about working set size?
b) High-memory jobs may starve
   - Reservation scheme + finish-time estimation?
Problem 4: Reduce Scheduling
[Animation: map and reduce slot timelines for Jobs 1-3]
- Job 1 and Job 2 run maps; Job 2's maps finish, then its reduces finish
- Job 3 is submitted and runs its maps to completion
- Problem: Job 3 can't launch reduces until Job 1 finishes
Conclusion
- A simple idea improves throughput by 70%
- Lots of future work:
  - Memory-aware scheduling
  - Reduce scheduling
  - Intermediate-data-aware scheduling
  - Using past history / learning job properties
  - Evaluation using richer benchmarks