Capacity Scheduler Guide



Table of contents

1. Purpose
2. Features
3. Picking a task to run
4. Installation
5. Configuration
   5.1. Using the Capacity Scheduler
   5.2. Setting up queues
   5.3. Configuring properties for queues
   5.4. Memory management
   5.5. Job Initialization Parameters
   5.6. Reviewing the configuration of the Capacity Scheduler

1. Purpose

This document describes the Capacity Scheduler, a pluggable Map/Reduce scheduler for Hadoop that provides a way to share large clusters.

2. Features

The Capacity Scheduler supports the following features:

- Support for multiple queues, where a job is submitted to a queue. Queues are allocated a fraction of the capacity of the grid, in the sense that a certain capacity of resources is at their disposal. All jobs submitted to a queue have access to the capacity allocated to that queue.
- Free resources can be allocated to any queue beyond its capacity. When queues running below capacity demand these resources at a later point in time, the resources are assigned to jobs on those queues as the tasks currently scheduled on them complete.
- Queues optionally support job priorities (disabled by default). Within a queue, jobs with higher priority have access to the queue's resources before jobs with lower priority. However, once a job is running it is not preempted by a higher-priority job, though new tasks from the higher-priority job are preferentially scheduled.
- To prevent one or more users from monopolizing a queue's resources, each queue enforces a limit on the percentage of resources allocated to any single user at a given time, if there is competition for them.
- Support for memory-intensive jobs: a job can optionally specify higher memory requirements than the default, and its tasks are run only on TaskTrackers that have enough memory to spare.

3. Picking a task to run

Note that many of these steps can be, and will be, enhanced over time to provide better algorithms. Whenever a TaskTracker is free, the Capacity Scheduler picks the queue with the most free space, i.e., the queue whose ratio of running slots to capacity is lowest. Once a queue is selected, the Scheduler picks a job in the queue.
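The queue-selection step can be sketched as follows. This is an illustrative Python sketch, not the actual scheduler code; the function name and queue data structure are hypothetical:

```python
# Illustrative sketch: pick the queue whose ratio of running slots to
# configured capacity is lowest, i.e. the queue with the most free space.
def pick_queue(queues):
    """queues: dict mapping queue name -> (running_slots, capacity_slots)."""
    return min(queues, key=lambda q: queues[q][0] / queues[q][1])

# Example: 'research' is at 10/40 = 25% of its capacity, 'default' is at
# 30/60 = 50%, so 'research' is selected.
queues = {"default": (30, 60), "research": (10, 40)}
print(pick_queue(queues))
```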
Jobs are sorted by submission time and by their priorities (if the queue supports priorities). Jobs are considered in order, and a job is selected if its user is within the user quota for the queue, i.e., the user is not already using queue resources above his/her limit. The Scheduler also makes sure that there is enough free memory in the TaskTracker to run the job's task, in case the job has special memory requirements.

Once a job is selected, the Scheduler picks a task to run. This logic to pick a task remains unchanged from earlier versions.

4. Installation

The Capacity Scheduler is available as a JAR file in the Hadoop tarball under the contrib/capacity-scheduler directory. The name of the JAR file is of the form hadoop-*-capacity-scheduler.jar. You can also build the Scheduler from source by executing ant package, in which case it is available under build/contrib/capacity-scheduler.

To run the Capacity Scheduler in your Hadoop installation, you need to put it on the CLASSPATH. The easiest way is to copy hadoop-*-capacity-scheduler.jar to HADOOP_HOME/lib. Alternatively, you can modify HADOOP_CLASSPATH in conf/hadoop-env.sh to include this jar.

5. Configuration

5.1. Using the Capacity Scheduler

To make the Hadoop framework use the Capacity Scheduler, set the following property in the site configuration:

Property: mapred.jobtracker.taskScheduler
Value: org.apache.hadoop.mapred.CapacityTaskScheduler

5.2. Setting up queues

With the Capacity Scheduler you can define multiple queues to which users can submit jobs. To define multiple queues, edit the site configuration for Hadoop and modify the mapred.queue.names property. You can also configure ACLs controlling which users or groups have access to the queues. For more details, refer to the Cluster Setup documentation.

5.3. Configuring properties for queues

The Capacity Scheduler can be configured with several properties for each queue that control the behavior of the Scheduler. This configuration is in conf/capacity-scheduler.xml. By default, the configuration is set up for one queue, named default.

To specify a property for a queue defined in the site configuration, use a property name of the form mapred.capacity-scheduler.queue.<queue-name>.<property-name>. For example, to define the property capacity for the queue named research, specify the property name mapred.capacity-scheduler.queue.research.capacity.

The properties defined for queues and their descriptions are listed below:

mapred.capacity-scheduler.queue.<queue-name>.capacity
  Percentage of the number of slots in the cluster that are made available for jobs in this queue. The sum of capacities for all queues should be less than or equal to 100.

mapred.capacity-scheduler.queue.<queue-name>.supports-priority
  If true, priorities of jobs are taken into account in scheduling decisions.

mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent
  Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them. This user limit can vary between a minimum and maximum value. The former depends on the number of users who have submitted jobs, and the latter is set to this property value. For example, suppose the value of this property is 25. If two users have submitted jobs to a queue, no single user can use more than 50% of the queue's resources. If a third user submits a job, no single user can use more than 33% of the queue's resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed.

5.4. Memory management

The Capacity Scheduler supports scheduling of tasks on a TaskTracker (TT) based on a job's memory requirements and the availability of RAM and Virtual Memory (VMEM) on the TT node. See the Hadoop Map/Reduce tutorial for details on how the TT monitors memory usage.
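The minimum-user-limit-percent rule described in section 5.3 can be expressed as a short sketch. This is illustrative only; the function name is hypothetical and integer division approximates the equal share:

```python
# Illustrative sketch of the minimum-user-limit-percent rule from section 5.3:
# the effective per-user limit is the larger of an equal share among the
# users currently competing and the configured minimum.
def effective_user_limit(num_users, minimum_user_limit_percent):
    return max(100 // num_users, minimum_user_limit_percent)

# With the property set to 25: 2 users -> 50%, 3 users -> 33%, 4+ users -> 25%.
print([effective_user_limit(n, 25) for n in (2, 3, 4, 5)])
```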

Currently, memory-based scheduling is only supported on the Linux platform. It works as follows:

1. If any of the three config parameters mapred.tasktracker.vmem.reserved, mapred.task.default.maxvmem, or mapred.task.limit.maxvmem is absent or set to -1, memory-based scheduling is disabled, just as memory monitoring is disabled for a TT. These config parameters are described in the Hadoop Map/Reduce tutorial. The value of mapred.tasktracker.vmem.reserved is obtained from the TT via its heartbeat.

2. If all three mandatory parameters are set, the Scheduler enables VMEM-based scheduling. First, the Scheduler computes the free VMEM on the TT. This is the difference between the available VMEM on the TT (the node's total VMEM minus the offset, both of which are sent by the TT on each heartbeat) and the sum of the VMEM already allocated to running tasks (i.e., the sum of the VMEM task-limits). Next, the Scheduler looks at the VMEM requirements of the job that is first in line to run. If the job's VMEM requirements fit within the available VMEM on the node, the job's task can be scheduled. If not, the Scheduler ensures that the TT does not get a task to run, provided the job has tasks to run. This way, the Scheduler ensures that jobs with high memory requirements are not starved, as eventually the TT will have enough VMEM available. If the high-memory job does not have any task to run, the Scheduler moves on to the next job.

3. In addition to VMEM, the Capacity Scheduler can also consider RAM on the TT node. RAM is considered the same way as VMEM. TTs report the total RAM available on their node, and an offset. If both are set, the Scheduler computes the available RAM on the node. Next, the Scheduler figures out the RAM requirements of the job, if any. As with VMEM, users can optionally specify a RAM limit for their job (mapred.task.maxpmem, described in the Map/Reduce tutorial).
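The free-VMEM computation in step 2 above can be sketched as follows. The names are hypothetical and this is not the scheduler's actual code:

```python
# Illustrative sketch of the VMEM check in step 2 (hypothetical names).
# All quantities are in the same units, e.g. gigabytes.
def can_schedule(total_vmem, vmem_offset, running_task_limits, job_vmem_needed):
    # Available VMEM on the node = total VMEM minus the reserved offset.
    available = total_vmem - vmem_offset
    # Free VMEM = available minus what running tasks may already use
    # (the sum of their VMEM task-limits).
    free = available - sum(running_task_limits)
    return job_vmem_needed <= free

# A node with 8 GB VMEM, 1 GB reserved, and two tasks limited to 2 GB each
# has 8 - 1 - 4 = 3 GB free: a job needing 2 GB fits, one needing 4 GB does not.
print(can_schedule(8, 1, [2, 2], 2), can_schedule(8, 1, [2, 2], 4))
```

The RAM check in step 3 follows the same pattern, using the node's reported RAM and offset.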
The Scheduler also maintains a limit for this value (mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem, described below). All three of these values must be set for the Scheduler to schedule tasks based on RAM constraints.

4. The Scheduler ensures that jobs cannot ask for more RAM or VMEM than the configured limits. If this happens, the job fails when it is submitted.

As described above, the additional scheduler-based config parameters are as follows:

mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem
  A percentage of the default VMEM limit for jobs (mapred.task.default.maxvmem). This is the default RAM task-limit associated with a task. Unless overridden by a job's setting, this number defines the RAM task-limit.

mapred.capacity-scheduler.task.limit.maxpmem
  Configuration which provides an upper limit to the maximum physical memory which can be specified by a job. If a job requires more physical memory than this limit, the job is rejected.

5.5. Job Initialization Parameters

The Capacity Scheduler lazily initializes jobs before they are scheduled, to reduce the memory footprint on the JobTracker. The following parameters, configurable in capacity-scheduler.xml, control the laziness of job initialization:

mapred.capacity-scheduler.queue.<queue-name>.maximum-initialized-jobs-per-user
  Maximum number of jobs which are allowed to be pre-initialized for a particular user in the queue. Once a job is scheduled, i.e. it starts running, that job is no longer counted when the Scheduler computes the maximum number of jobs a user is allowed to initialize.

mapred.capacity-scheduler.init-poll-interval
  Amount of time in milliseconds between polls of the scheduler job queue to look for jobs to be initialized.

mapred.capacity-scheduler.init-worker-threads
  Number of worker threads used by the initialization poller to initialize jobs in a set of queues. If the number mentioned in the property is equal to the number of job queues, then each thread is assigned jobs from one queue. If the number configured is less than the number of queues, then a thread can get jobs from more than one queue, which it initializes in a round-robin fashion. If the number configured is greater than the number of queues, then the number of threads spawned equals the number of job queues.

5.6. Reviewing the configuration of the Capacity Scheduler

Once installation and configuration are complete, you can review the setup from the admin UI after starting the Map/Reduce cluster. Start the Map/Reduce cluster as usual. Open the JobTracker web UI.

The queues you have configured should be listed under the Scheduling Information section of the page. The properties for the queues should be visible in the Scheduling Information column against each queue.
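Putting the pieces of sections 5.1 through 5.3 together, a minimal configuration might look like the following sketch. The queue names and values are illustrative only:

```xml
<!-- In conf/mapred-site.xml: select the scheduler and name the queues.
     The "research" queue is an illustrative example. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,research</value>
</property>

<!-- In conf/capacity-scheduler.xml: per-queue properties.
     Illustrative values; the capacities sum to 100. -->
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.research.capacity</name>
  <value>30</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.research.supports-priority</name>
  <value>true</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.research.minimum-user-limit-percent</name>
  <value>25</value>
</property>
```

After restarting the Map/Reduce cluster, these queues and their properties should appear in the Scheduling Information section described above.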