Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling


Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling
Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica
UC Berkeley, *Facebook Inc, +Yahoo! Research

Motivation
- MapReduce / Hadoop was originally designed for high-throughput batch processing
- Today's workloads are far more diverse:
  - Many users want to share a cluster: engineering, marketing, business intelligence, etc.
  - The vast majority of jobs are short: ad-hoc queries, sampling, periodic reports
  - Response time is critical: interactive queries, deadline-driven reports
- How can we efficiently share MapReduce clusters between users?

Example: Hadoop at Facebook
- 600-node, 2 PB data warehouse, growing at 15 TB/day
- Applications: data mining, spam detection, ads
- 200 users (half non-engineers)
- 7500 MapReduce jobs / day
[Figure: CDF of job running times; the median job runs for 84 s]

Approaches to Sharing
- Hadoop default scheduler (FIFO)
  - Problem: short jobs get stuck behind long ones
- Separate clusters
  - Problem 1: poor utilization
  - Problem 2: costly data replication
    - Full replication across clusters is nearly infeasible at Facebook/Yahoo! scale
    - Partial replication prevents cross-dataset queries

Our Work
- Hadoop Fair Scheduler
  - Fine-grained sharing at the level of map and reduce tasks
  - Predictable response times and user isolation
- Main challenge: data locality
  - For efficiency, tasks must run near their input data
  - Strictly following any job queuing policy hurts locality: the job picked by the policy may not have data on the free nodes
- Solution: delay scheduling
  - Relax the queuing policy for a limited time to achieve locality

The Problem
[Diagram: a master schedules tasks from Jobs 1 and 2 in fair-sharing order; the blocks of Files 1 and 2 are spread across six slave nodes]

The Problem
[Diagram: the scheduling order forces jobs to launch tasks on nodes that do not hold their input blocks, so tasks run non-locally]
- Problem: the fair decision hurts locality
- Especially bad for jobs with small input files

Locality vs. Job Size at Facebook
[Figure: percent of local map tasks vs. job size (number of input blocks) in production at Facebook, showing node locality and rack locality; 58% of jobs are small, and small jobs achieve poor locality]

Special Instance: Sticky Slots
- Under fair sharing, locality can be poor even when all jobs have large input files
- Problem: jobs get stuck in the same set of task slots
  - When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
  - Bad because data files are spread out across all nodes
[Diagram: four slaves each running one task; example state: Job 1 has fair share 2 with 12 running tasks, Job 2 has fair share 2 with 2 running tasks]

Special Instance: Sticky Slots
[Figure: data locality vs. number of concurrent jobs (5, 10, 20, 50); the percentage of local map tasks drops as more jobs share the cluster]

Solution: Delay Scheduling
- Relax the queuing policy to make jobs wait for a limited time if they cannot launch local tasks
- Result: a very short wait time (1-5 s) is enough to get nearly 100% locality
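To make the idea concrete, here is a toy Python simulation (my own illustration, not from the talk; all parameters are made up). With a skip budget of 0 it models plain fair sharing, where the job furthest below its share always takes the freed slot; with a larger budget, that job may pass on non-local slots a few times, as delay scheduling allows:

import random

def simulate(max_skips, num_nodes=100, num_jobs=10, steps=20000):
    """Toy model: max_skips=0 is plain fair sharing; larger values let
    the head-of-line job pass on non-local slots before giving up."""
    random.seed(1)
    # Each job's input blocks live on a random 30% of the nodes.
    data = [set(random.sample(range(num_nodes), 30)) for _ in range(num_jobs)]
    running = [0] * num_jobs      # tasks currently running per job
    owner = [None] * num_nodes    # which job occupies each single-slot node
    skips = [0] * num_jobs
    local = 0

    for _ in range(steps):
        n = random.randrange(num_nodes)        # a task finishes on node n
        if owner[n] is not None:
            running[owner[n]] -= 1
        order = sorted(range(num_jobs), key=lambda j: running[j])
        head = order[0]                        # furthest below fair share
        local_jobs = [j for j in order if n in data[j]]
        if n in data[head] or skips[head] >= max_skips or not local_jobs:
            chosen = head                      # head launches, local or not
            skips[head] = 0
        else:
            chosen = local_jobs[0]             # head waits; a local job runs
            skips[head] += 1
        owner[n] = chosen
        running[chosen] += 1
        local += n in data[chosen]
    return 100 * local / steps

for ms in (0, 2, 8):
    print(f"skip budget {ms}: {simulate(ms):.0f}% local tasks")

Even a small skip budget lifts locality from roughly the base rate (the fraction of nodes holding a job's data) toward 100%, mirroring the slide's claim.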

Delay Scheduling Example
[Diagram: the master tells a job whose head-of-line task would be non-local to "Wait!", launching local tasks from other jobs instead; blocks of Files 1 and 2 are spread across six slave nodes]
- Idea: wait a short time to get data-local scheduling opportunities

Delay Scheduling Details
- Scan jobs in the order given by the queuing policy, picking the first that is permitted to launch a task
- Jobs must wait before being permitted to launch non-local tasks:
  - If wait < T1, only allow node-local tasks
  - If T1 < wait < T2, also allow rack-local tasks
  - If wait > T2, also allow off-rack tasks
- Increase a job's time waited when it is skipped
(Full pseudocode appears in the backup slide "Delay Scheduling Details" below.)

Analysis
- When is it worth it to wait, and for how long?
- Waiting is worth it if E(wait time) < cost of running non-locally
- Simple model: data-local scheduling opportunities arrive following a Poisson process
- Expected response-time gain from delay scheduling:
  E(gain) = (1 - e^(-w/t))(d - t)
  where w is the wait amount, t is the expected time to get a local scheduling opportunity, and d is the delay from running non-locally
- Since the gain only grows with w, the formally optimal wait time is infinity; in practice a short wait captures most of the gain
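A quick numeric illustration of the formula (the task length and non-local penalty below are assumed values, not numbers from the talk):

import math

def expected_gain(w, t, d):
    """E(gain) = (1 - e^(-w/t)) * (d - t) from the slide's Poisson model."""
    return (1 - math.exp(-w / t)) * (d - t)

# Assumed numbers: a 60 s average task; 3 replicas x 6 slots/node gives
# t = 60/18 ~ 3.3 s between local opportunities; take d ~ 60 s as the
# cost of running non-locally.
t, d = 60 / 18, 60.0
for w in (1, 5, 10, 30):
    print(f"wait {w:>2}s -> expected gain {expected_gain(w, t, d):5.1f}s")
# The gain rises monotonically with w (hence "optimal wait is infinity"),
# but a few seconds already capture most of it.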

Analysis Continued
E(gain) = (1 - e^(-w/t))(d - t)
- w: wait amount; t: expected time to get a local scheduling opportunity; d: delay from running non-locally
- Typical values of t and d:
  - t = (avg task length) / (file replication x tasks per node), e.g. 3 replicas and 6 tasks per node
  - d is on the order of the average task length

More Analysis
- What if some nodes are occupied by long tasks?
- Suppose we have:
  - A fraction φ of active tasks that are long
  - R replicas of each block
  - L task slots per node
- For a given data block, P(all replicas blocked by long tasks) = φ^(RL)
- With R = 3 and L = 6, this is less than 2% as long as φ < 0.8
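A one-line check of that claim:

phi, R, L = 0.8, 3, 6
print(f"P(all replicas blocked) = {phi ** (R * L):.3f}")  # ~0.018, under 2%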

Evaluation
- Macrobenchmark
  - IO-heavy workload
  - CPU-heavy workload
  - Mixed workload
- Microbenchmarks
  - Sticky slots
  - Small jobs
  - Hierarchical fair scheduling
- Sensitivity analysis
- Scheduler overhead

Macrobenchmark
- 100-node EC2 cluster, 4 cores/node
- Job submission schedule based on job sizes and inter-arrival times at Facebook
  - 100 jobs grouped into 9 bins by size
- Three workloads: IO-heavy, CPU-heavy, mixed
- Three schedulers:
  - FIFO
  - Fair sharing
  - Fair + delay scheduling (wait time = 5 s)

Results for IO-Heavy Workload
[Figure: CDFs of job response times under FIFO, Fair, and Fair + Delay Scheduling for small jobs (1-10 input blocks), medium jobs (50-800 input blocks), and large jobs (4800 input blocks); annotations show gains of up to 5x for small jobs and 40% for medium jobs]

Results for IO-Heavy Workload
[Figures: percent of local map tasks per bin (1-9) under FIFO, Fair, and Fair + Delay Scheduling, and the speedup from delay scheduling per bin]

Sticky Slots Microbenchmark
- 5-50 jobs on EC2
- 100-node cluster, 4 cores/node
- 5 s delay scheduling
[Figures: percent of local map tasks and benchmark running time (minutes) for 5, 10, 20, and 50 jobs, with and without delay scheduling; delay scheduling is up to 2x faster]

Conclusions
- Delay scheduling works well under two conditions:
  - A sufficient fraction of tasks are short relative to jobs
  - There are many locations where a task can run efficiently: blocks replicated across nodes, multiple task slots per node
- Generalizes beyond MapReduce and fair sharing:
  - Applies to any queuing policy that gives an ordered job list; used in the Hadoop Fair Scheduler to implement a hierarchical policy
  - Applies to placement needs other than data locality

Thanks!
- This work is open source, packaged with Hadoop as the Hadoop Fair Scheduler
  - An old version without delay scheduling has been in use at Facebook since fall 2008
  - Delay scheduling will appear in Hadoop 0.21
- My email: matei@eecs.berkeley.edu

Related Work
- Quincy (SOSP '09)
  - Locality-aware fair scheduler for Dryad
  - Expresses the scheduling problem as min-cost flow
  - Kills tasks to reassign nodes when jobs change
  - One task slot per machine
- HPC scheduling
  - Inelastic jobs (cannot scale up/down over time)
  - Data locality generally not considered
- Grid computing
  - Focused on common interfaces for accessing compute resources in multiple administrative domains

Delay Scheduling Details

when there is a free task slot on node n:
    sort jobs according to the queuing policy
    for j in jobs:
        if j has a node-local task t on n:
            j.level := 0; j.wait := 0; return t
        else if j has a rack-local task t on n and (j.level >= 1 or j.wait >= T1):
            j.level := 1; j.wait := 0; return t
        else if j.level = 2 or (j.level = 1 and j.wait >= T2)
                or (j.level = 0 and j.wait >= T1 + T2):
            j.level := 2; j.wait := 0; return t
        else:
            j.wait += time since last scheduling decision
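For readers who prefer running code, here is a minimal Python rendering of the pseudocode above. The Job fields and the way pending tasks are stored are assumptions for illustration, not the Hadoop Fair Scheduler's actual data structures:

NODE, RACK, ANY = 0, 1, 2   # locality levels

class Job:
    """Hypothetical job record carrying delay-scheduling state."""
    def __init__(self, node_local, rack_local, remote):
        self.node_local = node_local  # node name -> list of node-local tasks
        self.rack_local = rack_local  # node name -> list of rack-local tasks
        self.remote = remote          # tasks with no data near any free node
        self.level = NODE             # highest locality level used so far
        self.wait = 0.0               # seconds waited since last launch

def assign_task(jobs, n, t1, t2, elapsed):
    """Pick a task for the free slot on node n, relaxing locality over time.
    `jobs` must already be sorted by the queuing policy; `elapsed` is the
    time since the last scheduling decision."""
    for j in jobs:
        if j.node_local.get(n):
            j.level, j.wait = NODE, 0.0
            return j.node_local[n].pop()
        if j.rack_local.get(n) and (j.level >= RACK or j.wait >= t1):
            j.level, j.wait = RACK, 0.0
            return j.rack_local[n].pop()
        if j.remote and (j.level == ANY or
                         (j.level == RACK and j.wait >= t2) or
                         (j.level == NODE and j.wait >= t1 + t2)):
            j.level, j.wait = ANY, 0.0
            return j.remote.pop()
        j.wait += elapsed             # job skipped: accumulate waiting time
    return None                       # leave the slot idle this round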

Sensitivity Analysis
[Figure: percent of local map tasks for 4-map and 12-map jobs with no delay and with 1 s, 5 s, and 10 s delays; locality climbs from 5-11% with no delay to 68-80% at 1 s and 98-100% at 5-10 s]

Hadoop Terminology
- Job = an entire MapReduce computation; consists of map and reduce tasks
- Task = a unit of work that is part of a job
- Slot = a location on a slave where a task can run
[Diagram: input file blocks feed map tasks, whose output is shuffled to reduce tasks that write the output file]

Analysis
- The required waiting time is a small fraction of the average task length as long as there are many locations where a task can run
  - Ex: with 3 replicas per block and 8 tasks per node, wait 1/24th of the average task length
- Non-locality decreases exponentially with waiting time

Analysis
- Suppose we have:
  - M machines
  - L slots per machine
  - T-second average task length
  - A job with data on a fraction f of the nodes
- Let D = max number of scheduling opportunities skipped
- P(non-local) = (1 - f)^D, which decreases exponentially with D
- Waiting time = D * T / (M * L): a fraction of the mean task length that decreases with the number of slots per node
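Plugging in some assumed numbers (my own illustration, not from the talk) shows how quickly a modest skip budget drives non-locality down while keeping the wait short:

M, L, T = 100, 8, 60.0        # machines, slots/machine, avg task length (s)
f = 0.03                      # job's data lives on 3% of the nodes
for D in (10, 50, 100):
    p_nonlocal = (1 - f) ** D
    wait = D * T / (M * L)    # time spent skipping D opportunities
    print(f"D={D:>3}: P(non-local)={p_nonlocal:.2f}, wait={wait:.1f}s"
          f" = {wait / T:.3f} of a task length")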

Small Jobs Experiment
- 10-20 minute benchmark consisting of several hundred IO-heavy jobs
- Job sizes: 3, 10, and 100 map tasks
- Environment: 100-node cluster on 4 racks; 5 map slots per node

Locality by job size:
Job Size  | Default Scheduler     | With Delay Scheduling
          | Node Loc.  Rack Loc.  | Node Loc.  Rack Loc.
3 maps    | 2%         50%        | 75%        94%
10 maps   | 37%        98%        | 99%        99%
100 maps  | 84%        99%        | 94%        99%

[Figure: normalized running time by job size (3, 10, 100 maps), with and without delay scheduling]