Job Scheduling with the Fair and Capacity Schedulers



Job Scheduling with the Fair and Capacity Schedulers Matei Zaharia UC Berkeley Wednesday, June 10, 2009 Santa Clara Marriott

Motivation» Provide fast response times to small jobs in a shared Hadoop cluster» Improve utilization and data locality over separate clusters and Hadoop on Demand

Hadoop at Facebook» 600-node cluster running Hive» 3200 jobs/day» 50+ users» Apps: statistical reports, spam detection, ad optimization, …

Facebook Job Types» Production jobs: data import, hourly reports, etc.» Small ad-hoc jobs: Hive queries, sampling» Long experimental jobs: machine learning, etc.» GOAL: fast response times for small jobs, guaranteed service levels for production jobs

Outline» Fair scheduler basics» Configuring the fair scheduler» Capacity scheduler» Useful links

FIFO Scheduling: Job Queue (diagram, animated over several slides)

Fair Scheduling: Job Queue (diagram, animated over several slides)

Fair Scheduler Basics» Group jobs into pools» Assign each pool a guaranteed minimum share» Divide excess capacity evenly between pools

Pools» Determined from a configurable job property; default in 0.20: user.name (one pool per user)» Pools have properties: minimum map slots, minimum reduce slots, limit on # of running jobs

Example Pool Allocations (diagram): the entire cluster has 100 slots, divided among pools matei, jeff, tom (min share = 30), and ads (min share = 40); job 1 gets 30 slots, job 2 gets 15 slots, job 3 gets 15 slots, and job 4 gets 40 slots. The two min shares account for 70 slots, and the remaining 30 slots are split evenly (15 each) between the pools without a minimum.
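A minimal sketch of the arithmetic behind this example (illustrative only; the pool names mirror the slide, and the "min shares first, excess split evenly" rule is a simplification that matches this particular example rather than the scheduler's general share computation):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PoolShareExample {
      public static void main(String[] args) {
        int totalSlots = 100;
        // Pool name -> guaranteed minimum share (0 = no minimum).
        Map<String, Integer> minShare = new LinkedHashMap<>();
        minShare.put("matei", 0);
        minShare.put("jeff", 0);
        minShare.put("tom", 30);
        minShare.put("ads", 40);

        int reserved = 0, poolsWithoutMin = 0;
        for (int m : minShare.values()) {
          reserved += m;
          if (m == 0) poolsWithoutMin++;
        }
        // Excess capacity left after min shares, split evenly.
        int excessPerPool = (totalSlots - reserved) / poolsWithoutMin;

        for (Map.Entry<String, Integer> e : minShare.entrySet()) {
          int share = e.getValue() > 0 ? e.getValue() : excessPerPool;
          System.out.println(e.getKey() + ": " + share + " slots");
        }
        // Prints matei: 15, jeff: 15, tom: 30, ads: 40 (total 100).
      }
    }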

Scheduling Algorithm» Split each pool's min share among its jobs» Split each pool's total share among its jobs» When a slot needs to be assigned: If there is any job below its min share, schedule it Else schedule the job that we've been most unfair to (based on "deficit")
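A hedged sketch of that decision rule (illustrative code, not the FairScheduler source; the class and field names are assumptions): when a slot frees up, first serve any job below its min share, otherwise serve the job with the largest accumulated deficit.

    import java.util.List;

    // Illustrative model of one slot-assignment decision.
    class JobInfo {
      final String name;
      int runningTasks;   // tasks currently running for this job
      int minShare;       // slots guaranteed to the job via its pool
      double deficit;     // time-integrated gap between fair share and usage

      JobInfo(String name, int runningTasks, int minShare, double deficit) {
        this.name = name;
        this.runningTasks = runningTasks;
        this.minShare = minShare;
        this.deficit = deficit;
      }
    }

    public class SlotAssignment {
      // Decide which job should receive the next free slot.
      static JobInfo pickJob(List<JobInfo> jobs) {
        // 1. Any job still below its min share is served first.
        for (JobInfo j : jobs) {
          if (j.runningTasks < j.minShare) {
            return j;
          }
        }
        // 2. Otherwise serve the job we have been most unfair to,
        //    i.e. the one with the largest accumulated deficit.
        JobInfo mostStarved = null;
        for (JobInfo j : jobs) {
          if (mostStarved == null || j.deficit > mostStarved.deficit) {
            mostStarved = j;
          }
        }
        return mostStarved;
      }
    }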

Scheduler Dashboard (screenshot)

Scheduler Dashboard (annotated screenshot): change job priority, FIFO mode (for testing), change pool

Additional Features» Weights for unequal sharing: job weights based on priority (each level = 2x), job weights based on size, pool weights» Limits for # of running jobs: per user, per pool
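To make the "each level = 2x" rule concrete, here is a small illustrative computation (assumed class and method names, not scheduler code): weights double per priority level, and slots are split in proportion to weight.

    public class WeightedShares {
      // Each priority level doubles the weight (level 0 = base priority).
      static double weightForPriority(int priorityLevel) {
        return Math.pow(2, priorityLevel);
      }

      // Split `slots` among jobs in proportion to their weights.
      static double[] weightedShares(int slots, double[] weights) {
        double total = 0;
        for (double w : weights) total += w;
        double[] shares = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
          shares[i] = slots * weights[i] / total;
        }
        return shares;
      }

      public static void main(String[] args) {
        // Two jobs, one at base priority and one a level higher:
        double[] weights = { weightForPriority(0), weightForPriority(1) };
        double[] shares = weightedShares(90, weights);
        System.out.printf("job A: %.0f slots, job B: %.0f slots%n", shares[0], shares[1]);
        // Prints "job A: 30 slots, job B: 60 slots".
      }
    }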

Installing the Fair Scheduler» Build it: ant package» Place it on the classpath: cp build/contrib/fairscheduler/*.jar lib

Configuration Files» Hadoop config (conf/mapred-site.xml): contains scheduler options, pointer to pools file» Pools file (pools.xml): contains min share allocations and limits on pools; reloaded every 15 seconds at runtime

Minimal hadoop-site.xml:
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/path/to/pools.xml</value>
    </property>

Minimal pools.xml:
    <?xml version="1.0"?>
    <allocations>
    </allocations>

Configuring a Pool:
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
      </pool>
    </allocations>

Setting Running Job Limits:
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <maxRunningJobs>3</maxRunningJobs>
      </pool>
      <user name="matei">
        <maxRunningJobs>1</maxRunningJobs>
      </user>
    </allocations>

Default Per-User Running Job Limit:
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <maxRunningJobs>3</maxRunningJobs>
      </pool>
      <user name="matei">
        <maxRunningJobs>1</maxRunningJobs>
      </user>
      <userMaxJobsDefault>10</userMaxJobsDefault>
    </allocations>

Other Parameters mapred.fairscheduler.assignmultiple:» Assign a map and a reduce on each heartbeat; improves ramp-up speed and throughput; recommendation: set to true

Other Parameters mapred.fairscheduler.poolnameproperty:» Which JobConf property sets what pool a job is in - Default: user.name (one pool per user) - Can make up your own, e.g. pool.name, and pass in JobConf with conf.set("pool.name", "mypool")

Useful Setting (make pool.name default to user.name):
    <property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>pool.name</value>
    </property>
    <property>
      <name>pool.name</name>
      <value>${user.name}</value>
    </property>
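As a usage illustration (a hedged sketch, not part of the talk): with poolnameproperty set to pool.name as above, a job opts into a specific pool by setting that property on its JobConf before submission. The class name SubmitToPool and the pool name "ads" are assumptions for the example; the job itself is a trivial identity job so the sketch is self-contained.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitToPool {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitToPool.class);
        conf.setJobName("pool-demo");

        // Route this job to the "ads" pool (hypothetical pool name).
        // Requires mapred.fairscheduler.poolnameproperty = pool.name;
        // otherwise the job lands in the submitter's per-user pool.
        conf.set("pool.name", "ads");

        // Identity map and reduce, reading and writing text files.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }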

Future Plans» Preemption (killing tasks) if a job is starved of its min or fair share for some time (HADOOP-4665)» Global scheduling optimization (HADOOP-4667)» FIFO pools (HADOOP-4803, HADOOP-5186)

Capacity Scheduler» Organizes jobs into queues» Queue shares as %'s of cluster» FIFO scheduling within each queue» Supports preemption» http://hadoop.apache.org/core/docs/current/capacity_scheduler.html

Thanks!» Fair scheduler included in Hadoop 0.19+ and in Cloudera's Distribution for Hadoop» Fair scheduler for Hadoop 0.17 and 0.18: http://issues.apache.org/jira/browse/HADOOP-3746» Capacity scheduler included in Hadoop 0.19+» Docs: http://hadoop.apache.org/core/docs/current» My email: matei@cloudera.com