Introduction to Apache YARN Schedulers & Queues

In a nutshell, YARN was designed to address the many limitations (performance/scalability) embedded in Hadoop version 1 (MapReduce & HDFS). Some of the benefits of using Hadoop version 2 (and the YARN component) are:

Scalability: YARN operates well on larger cluster setups. The scalability potential of Hadoop 1 was around 5,000 nodes, mainly because the JobTracker had to manage both the jobs and the tasks. YARN circumvents this limitation by splitting the load between a resource manager and per-application application masters, and is hence capable of scaling to tens of thousands of nodes and hundreds of thousands of tasks. In contrast to the JobTracker, each instance of a MapReduce application now operates via a dedicated application master that is active for the duration of the application's life cycle.

Availability: With the JobTracker's responsibilities decomposed between the resource manager and the application master, HA in YARN becomes a divide-and-conquer problem: provide HA for the resource manager, and then for the YARN applications (on a per-application basis). Hadoop version 2 therefore supports HA for the resource manager and for the application master, respectively.

Utilization: With Hadoop version 1, each TaskTracker was configured with a static allocation of fixed-size slots that were decomposed into map and reduce slots at configuration time. A map slot could only be used to execute a map task, while a reduce slot could only be used to process a reduce task. In YARN, a node manager manages a pool of resources (containers) rather than a fixed number of designated slots. A MapReduce application running on YARN does not encounter a scenario where a reduce task has to stall because only map slots are available (this used to happen with Hadoop 1): if the resources (containers) to execute a task are available, the application will allocate them and run. Resources in YARN are fine grained, so an application can request what it needs rather than an indivisible slot that may either be too big (wasting resources) or too small (possibly causing a failure).

Multi-Tenancy: The major benefit of YARN, though, is that it opens up Hadoop to other types of distributed applications beyond MapReduce. MapReduce is now just one YARN application among many others that can be executed on the same cluster. Actual Big Data data lakes can now efficiently and effectively be deployed and utilized via YARN (YARN is also labeled the data operating system).

YARN Scheduler Options

By default, there are currently 3 schedulers provided with YARN (FIFO, Capacity, and Fair). It has to be pointed out, though, that many companies utilize other in-house schedulers to further optimize workload processing. As an example, DHT has been involved in developing a Delay scheduler for certain Big Data workload scenarios. The simplest scheduler is FIFO. The FIFO scheduler places applications in a queue and executes them in the order of submission. While the FIFO scheduler is easy to understand and does not require any configuration, it is not suitable for a shared cluster setup. Hence, in a shared cluster environment, it is suggested to configure either the Capacity or the Fair scheduler. Both schedulers allow jobs to complete in a timely manner, balancing the requirements of long-running jobs and ad-hoc tasks.
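The active scheduler is selected via the ResourceManager configuration. As a minimal, hedged sketch (property and class names follow the standard Apache Hadoop distribution), the following yarn-site.xml entry selects the Capacity scheduler:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- use ...scheduler.fair.FairScheduler or ...scheduler.fifo.FifoScheduler to select the other schedulers -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>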
The differences among the 3 schedulers are highlighted in Figure 1. With FIFO, job 2 is blocked until job 1 completes (or is terminated). With the Capacity scheduler, a separate, dedicated queue B allows job 2 to start executing as soon as its containers are allocated. It has to be pointed out, though, that with the Capacity scheduler, job 1 may (depending on the setup/configuration) utilize all the cluster resources until job 2 becomes active. With the Fair scheduler, there is no need to reserve capacity, as the scheduler dynamically balances resources among all the
running jobs (by default). As job 1 is initially the only job running, all the resources are allocated to it. As job 2 becomes active, (by default) half of the cluster resources are diverted to job 2 (labeled the fair share of resources). Note that there is a lag between the time the second job starts and the time it receives its fair share, as it has to wait for resources to free up as containers used by the first job complete and become available. After job 2 completes and surrenders its resources, job 1 can again utilize the full cluster capacity. This provides optimized cluster utilization as well as good response times for smaller jobs.

Figure 1: Simple YARN Scheduler Comparison (picture courtesy of Safari Books Online)

The YARN Capacity Scheduler

The fundamental unit of scheduling in YARN is the queue. Each queue assigned to the Capacity scheduler has the following properties:
- A short queue name.
- A full queue path name.
- A list of associated child queues and applications.
- The guaranteed capacity of the queue.
- The maximum capacity of the queue.
- A list of active users and their corresponding resource allocation limits.
- The state of the queue.
- Access control lists (ACLs) governing access to the queue.

The Capacity scheduler allows sharing cluster resources along organizational lines, where each organization can be assigned a certain capacity of the overall cluster potential. Each organization can be set up with a dedicated queue that can utilize a certain fraction of the overall cluster capacity (the 100%). Queues may be further decomposed in a hierarchical fashion, allowing each organization to share its cluster allowance among different groups of users within the organization. Within a queue, applications can be scheduled either in a FIFO manner or via the Dominant Resource Fairness (DRF) model. Further, elasticity and preemption are supported with the Capacity scheduler (see below).

Designing the Queues

The fundamental unit of scheduling in YARN is a queue. The capacity of each queue specifies the percentage of the cluster resources that are available for applications submitted to the queue. Queues can be set up in a hierarchy that reflects the organizational structure, resource requirements, and access restrictions required by the various entities that utilize cluster resources. For example, suppose that a company has 3 organizations (Engineering, Support, and Marketing) that utilize a shared cluster. The Engineering organization has 2 sub-teams (Development and QA), Support has 2 sub-teams (Training and Services), and Marketing is decomposed into Sales and Advertising. Figure 2 shows the breakdown.

Figure 2: Org Setup (picture courtesy of Pivotal)

Hierarchical Queue Characteristics

In general, there are 2 types of queues: parent queues and leaf queues. Parent queues enable the management of resources across organizations and sub-organizations. They can contain either additional parent queues or leaf queues, and they do not themselves accept any application submissions. Leaf queues are the queues placed below a parent queue, and they do accept application requests. Leaf queues do not have any child queues and therefore do not have any configuration properties that end with ".queues". There is a top-level parent root queue that does not belong to any organization but instead represents the cluster itself. Using parent and leaf queues, one can specify the capacity allocations for the various organizations and sub-organizations, as sketched below.
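The hierarchy from Figure 2 could, for example, be declared in capacity-scheduler.xml roughly as follows (a sketch; the lower-cased queue names are illustrative renderings of the organizations and sub-teams used in this example):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,support,marketing</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.queues</name>
  <value>development,qa</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.support.queues</name>
  <value>training,services</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.queues</name>
  <value>sales,advertising</value>
</property>

Note that the leaf queues (development, qa, training, services, sales, and advertising) do not carry a ".queues" property of their own.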
It is also rather common to decompose the leaf queues by job type, such as batch, ETL, and/or ad-hoc processing.

Scheduling Among Queues

Hierarchical queues ensure that the guaranteed resources are first shared among the sub-queues of an organization before any remaining free resources are shared with queues belonging to other organizations. This setup enables each organization to have control over the utilization of its guaranteed resources. At each level in the hierarchy, every parent queue maintains the list of its child queues in a sorted manner (based on demand). The sorting of the queues is done via the currently used fraction of each queue's capacity (or the full-path queue names if the reserved capacity of any 2 queues is equal). The root queue governs how the cluster capacity is distributed among the first level of parent queues and invokes scheduling on each of its child queues. Every parent queue applies its capacity constraints to all of its child queues. Leaf queues hold the list of active applications (potentially from multiple users) and schedule resources in a FIFO manner (by default), while at the same time adhering to the capacity limits specified for the individual users. Access control lists (ACLs) can be used to restrict user access to queues. Applications can only be submitted at the leaf-queue level, but an ACL restriction set on a parent queue is applied to all of its descendant queues.

Managing Cluster Capacity with Queues

The Capacity scheduler is designed to allow organizations to share compute clusters using the very familiar notion of FIFO (by default) queues. Further, by default, YARN does not assign entire nodes to queues. Queues own a fraction of the capacity of the cluster, and this specified queue capacity can be fulfilled from any number of nodes (in a dynamic fashion). Scheduling is the process of matching the resource requirements of multiple applications from various users (submitted to different queues at multiple levels in the queue hierarchy) with the free capacity available on the nodes in the cluster.

Figure 3: Cluster Capacity Configuration (picture courtesy of Pivotal)

Because the total cluster capacity can vary, capacity configuration values are expressed as percentages. The capacity property can be used by administrators to allocate a percentage of the cluster capacity to a queue. The following properties would divide the cluster resources between the Engineering, Support, and Marketing organizations (Figure 3) in a 6:1:3 ratio (60%, 10%, and 30%).
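Expressed in capacity-scheduler.xml, these assignments might look as follows (a sketch based on the queue names introduced above; the Development/QA split shown reflects the 1:4 ratio used in the example further below):

<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.support.capacity</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>30</value>
</property>
<!-- sub-queue capacities are percentages of the parent queue's capacity -->
<property>
  <name>yarn.scheduler.capacity.root.engineering.development.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.qa.capacity</name>
  <value>80</value>
</property>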
The sum of the capacities at any level in the hierarchy must equal 100%. Also, the capacity of an individual queue at any level in the hierarchy must be >= 1% (one cannot set a capacity to a value of 0).

Resource Distribution Workflow

During scheduling, queues at any level in the hierarchy are sorted in the order of their currently used capacity, and available resources are distributed among them starting with the queues that are currently the most under-served. With respect to capacities alone, resource scheduling has the following workflow:
- The more under-served a queue is, the higher the priority it receives during resource allocation. The most under-served queue is the queue with the smallest ratio of used capacity to total cluster capacity.
- The used capacity of any parent queue is defined as the aggregate sum of the used capacity of all of its descendant queues. The used capacity of a leaf queue equals the amount of resources used by the allocated containers of all of the applications running in that queue.
- Once it is decided to assign the currently available free resources to a parent queue, further scheduling is done recursively to determine which child queue will receive the resources (based on the previously described concept of used capacities).
- Further scheduling happens inside each leaf queue to allocate resources to applications in FIFO order (by default). This process also depends on locality, user-level limits, and application limits.
- Once an application within a leaf queue is chosen, scheduling also happens within the application, as applications may have different priorities for their resource requests.
- To ensure elasticity, capacity that is configured but not utilized by a queue (due to a lack of demand) is automatically assigned to the queues that are in need of resources.

Simple Resource Distribution Workflow Example

Assume a cluster with 100 nodes, each with 10GB of memory allocated for YARN containers, for a total cluster capacity of 1,000GB (1TB of memory). According to the configuration described in Figure 3, Engineering is assigned 60% of the cluster capacity (an absolute capacity of 600GB). Similarly, Support is assigned 100GB while Marketing gets 300GB. Under the Engineering organization, capacity is distributed between Development and QA in a 1:4 ratio, hence Development gets 120GB while 480GB is assigned to QA. Now consider the following timeline of events (use Figure 3 as a reference): Initially, the entire Engineering queue is empty with no applications running, while the Support and Marketing queues are utilizing their full capacities. Users Bob and Tom now submit applications to the Development leaf queue. Their applications are elastic and can hence run with either all of the resources available in the cluster or with a subset of the cluster resources (depending on the state of the resource usage). Even though the Development queue is assigned only 120GB, Bob and Tom are each allowed to allocate 120GB (for a total of 240GB). Again, this can happen despite the fact that the Development queue is configured with a capacity of 120GB: the Capacity scheduler allows elastic sharing of cluster resources for better utilization of the available cluster resources. Since there are no other users in the Engineering queue at this time, Bob and Tom are allowed to use the available free resources.
Next on the timeline, assume that users Amber, Amy, and John submit more applications to the Development leaf queue. Even though the queue is restricted to 120GB, the overall used capacity in the queue now balloons to 600GB (essentially taking up all of the resources allocated to the QA leaf queue). User Terry now submits an application to the QA queue. With no free resources available, this application has to wait.
Given that the Development queue is utilizing all of the available cluster resources, Terry may or may not be able to immediately get back the guaranteed capacity of his QA queue (that depends on whether preemption is enabled; by default, preemption is disabled). As the applications of Bob, Tom, Amber, Amy, and John finish executing and the allocated resources become available, the newly available containers can be allocated by Terry's application. This process continues until the cluster stabilizes at the intended 1:4 resource usage ratio for the Development and QA leaf queues.

This simple example shows that it is possible for aggressive users to continuously submit applications and thereby block others from allocating containers for their applications (unless preemption is enabled). To address this problem, the Capacity scheduler allows setting limits on the elastic growth of any queue. For example, to restrict the Development leaf queue from monopolizing the Engineering parent queue's capacity (see Figure 3), one can set the maximum-capacity property: Property: yarn.scheduler.capacity.root.engineering.development.maximum-capacity, Value: 40. After applying the cap, users of the Development queue can still go beyond the 120GB capacity, but they will not be able to allocate more than 40% of the Engineering parent queue's capacity (40% of 600GB = 240GB). The capacity and maximum-capacity properties can be used to control sharing and elasticity behavior across the organizations and sub-organizations. One should always balance these properties to avoid strict limits that result in subpar utilization, as well as to avoid excessive cross-organizational sharing scenarios. The capacity and maximum-capacity settings can be changed dynamically at run time via yarn rmadmin -refreshQueues.

Setting User Limits

The minimum-user-limit-percent property can be used to set the minimum percentage of resources allocated to each leaf-queue user. For example, to enable equal sharing of the Services leaf queue capacity among 5 users, one can set the minimum-user-limit-percent property to 20. This setting governs the minimum limit that any user's share of the queue capacity can shrink to. Irrespective of this limit, any user joining the queue can allocate more than his/her allocated share if there are idle resources available. The following scenario (see Table 1) shows how the queue resources are adjusted as users submit jobs to a queue with a minimum-user-limit-percent value of 20:

Table 1: User Limits on a Queue

As a user's applications finish executing, other users with outstanding jobs start to reclaim that share. It has to be pointed out that despite this sharing among users, the FIFO application scheduling order of the Capacity scheduler (the default) does not change. This basically guarantees that no user can monopolize the queues by continuously submitting new applications; applications (and hence the corresponding users) that are submitted first always get higher priority than applications that are submitted later.
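Expressed in capacity-scheduler.xml, the Services leaf-queue setting from this example might look as follows (a sketch, assuming the queue path root.support.services implied by Figure 2):

<property>
  <name>yarn.scheduler.capacity.root.support.services.minimum-user-limit-percent</name>
  <!-- no active user's share of the queue can shrink below 20% -->
  <value>20</value>
</property>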
The Capacity scheduler's leaf queues can further apply the user-limit-factor property to control user resource allocation. This property denotes the multiple of the queue's configured capacity that any single user can consume at most, regardless of whether or not there are idle resources available in the cluster: Property: yarn.scheduler.capacity.root.support.user-limit-factor, Value: 1. The default value of 1 implies that any single user in the queue can at most occupy the queue's configured capacity. This prevents users in a single queue from monopolizing resources across all the queues in a cluster. Setting the value to 2 would allow the queue's users to consume up to twice the queue's configured capacity, while setting it to 0.5 would restrict any user from using resources beyond half of the queue capacity. These settings can also be changed dynamically at run time via yarn rmadmin -refreshQueues.

Application Reservations

The Capacity scheduler is responsible for matching the free resources in the cluster with the resource requirements of an application. Many times, a scheduling cycle occurs where, despite the fact that there are free resources on a node, they are not sized large enough to satisfy the demand of the application (at the head of the queue) waiting for a resource. This typically happens with memory-intensive applications whose container resource demand exceeds that of the typical application running in the cluster. Such a mismatch may lead to starving these (memory) resource-intensive applications. The Capacity scheduler's reservations feature addresses this issue: When a node reports back with a free container, the Capacity scheduler selects an appropriate queue to utilize the newly available resources (based on the capacity and maximum-capacity settings). Within that selected queue, the Capacity scheduler scrutinizes the applications (in FIFO order, along with the user limits). Once a needy application is identified, the Capacity scheduler determines whether the requirements of that application can be met by the node's free capacity. If there is a size mismatch, the Capacity scheduler immediately creates a reservation on the node for the application's required container. Once a reservation is made for an application on a node, those resources are not used by the Capacity scheduler for any other queue, application, or container until the reservation is fulfilled. The node on which a reservation was made reports back when enough containers have completed such that the total free capacity on the node matches the reservation size. When that occurs, the Capacity scheduler marks the reservation as fulfilled, removes it, and allocates a container on the node. In a scenario where another node fulfills the resource requirements of the starving application (and hence the application no longer needs the reserved capacity on the first node), the reservation is simply cancelled. To prevent the number of reservations from growing in an unbounded manner, and to avoid potential scheduling deadlocks, the Capacity scheduler maintains only 1 active reservation at a time on each node.
Setting Application Limits

In order to avoid system thrashing due to an unmanageable load (caused either by malicious users or by accident), the Capacity scheduler provides a static (configurable) limit on the total number of concurrently active (both running and pending) applications. The maximum-applications configuration property is used to set this limit (the default value is 10,000). The limit for running applications in any specific queue equals a fraction of this total limit, proportional to the queue's capacity. This is a hard limit, which implies that once the limit is reached by a queue, any new applications submitted to that queue will be rejected. It has to be pointed out, though, that this limit can be explicitly overridden on a per-queue basis.
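A hedged capacity-scheduler.xml sketch of the cluster-wide default and of an explicit per-queue override (the per-queue value of 500 is purely illustrative):

<property>
  <name>yarn.scheduler.capacity.maximum-applications</name>
  <!-- maximum number of concurrently active (running and pending) applications -->
  <value>10000</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.support.maximum-applications</name>
  <!-- overrides the proportional default for the support queue -->
  <value>500</value>
</property>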
Preemption

Figure 4: Preemption Workflow (picture courtesy of Hortonworks)

As mentioned previously, a scenario can occur where a queue has a guaranteed level of cluster resources configured for it but has to wait to run applications because other queues are utilizing all of the available resources. If preemption is enabled, higher-priority applications do not have to wait just because lower-priority applications have hogged the available capacity. With preemption enabled, under-served queues start to claim their allocated cluster resources almost immediately, without having to wait for other queues' applications to finish executing (there is still a short lag, though). Preemption is governed by a set of capacity monitor policies that have to be enabled by setting the yarn.resourcemanager.scheduler.monitor.enable property to true. These capacity monitor policies apply preemption at configurable intervals based on the defined capacity allocations, in as graceful a manner as possible, where containers only get killed as a last resort. Figure 4 outlines the preemption workflow.
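A minimal sketch of the corresponding yarn-site.xml settings (the policy class shown is the standard ProportionalCapacityPreemptionPolicy that ships with the Capacity scheduler; the various preemption tuning knobs are left at their defaults):

<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>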
CGroups & CPU Scheduling

One can use CGroups to isolate CPU-heavy tasks in a Hadoop cluster. If one is using CPU scheduling, one should also use CGroups to constrain and manage CPU threads. If a cluster is not using CPU scheduling, CGroups should not be enabled. When the default resource calculator (DefaultResourceCalculator) is used, resources are allocated based solely on memory. When CPU scheduling is enabled by using the Dominant Resource Calculator (DominantResourceCalculator), the queues are still used to allocate cluster resources, but both CPU and memory requirements are taken into consideration via the Dominant Resource Fairness (DRF) model. In the DRF model, resource allocation accounts for the dominant resource required by a process: CPU-heavy processes receive more CPU cycles but less memory, while memory-heavy processes (such as MapReduce) receive more memory but less CPU. The DRF scheduler can place both CPU-heavy and memory-heavy processes on the same node and is designed to fairly distribute the memory and CPU resources among the different types of processes in a mixed-workload cluster. CGroups (a Linux kernel feature) complement CPU scheduling by providing CPU resource isolation: they allow setting limits on the amount of CPU resources granted to individual YARN containers and also allow governing the total amount of CPU resources used by YARN processes.
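A hedged sketch of the settings involved (the resource calculator is a capacity-scheduler.xml property, while the container-executor and CGroups settings live in yarn-site.xml; class names are the standard Apache Hadoop ones, and the 80% CPU limit is purely illustrative):

<!-- capacity-scheduler.xml: schedule on CPU as well as memory (DRF) -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
<!-- yarn-site.xml: enforce CPU isolation via CGroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <!-- cap on the total share of each node's physical CPU available to YARN containers -->
  <value>80</value>
</property>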
Node Labels

Node labels can be assigned to cluster nodes, and node labels can in turn be associated with Capacity scheduler queues to specify which node labels each queue is allowed to access. When a queue is associated with 1 or more node labels, all applications submitted to the queue run only on nodes with those labels. If no node label is assigned to a queue, the applications submitted to the queue can run on any node without a node label. To illustrate, assume a cluster with a total of 8 nodes. The first 3 nodes (n1-n3) have node label x, the next 3 nodes (n4-n6) have node label y, and the final 2 nodes (n7, n8) do not have any node label assigned. Further assume that each node can run 10 containers and that the queue hierarchy is as depicted in Figure 5.

Figure 5: Example Queue Hierarchy (picture courtesy of Pivotal)

Assume that queue a can access node labels x and y, while queue b can only access node label y. By definition, nodes without labels can be accessed by all queues. Now consider the following example label configuration for the queues: capacity(a) = 40, capacity(a, label=x) = 100, capacity(a, label=y) = 50; capacity(b) = 60, capacity(b, label=y) = 50. This implies that queue a can access 40% of the resources on nodes without any labels, 100% of the resources on nodes with label=x, and 50% of the resources on nodes with label=y. Queue b can access 60% of the resources on nodes without any labels and 50% of the resources on nodes with label=y. Doing the math:
capacity(a) + capacity(b) = 100
capacity(a, label=x) + capacity(b, label=x) = 100 (queue b cannot access label=x, so its share is 0)
capacity(a, label=y) + capacity(b, label=y) = 100

For child queues under the same parent queue, the sum of the capacities for each label should likewise equal 100%. Similarly, one can set the capacities of the child queues a1, a2, and b1:
capacity(a.a1) = 40, capacity(a.a1, label=x) = 30, capacity(a.a1, label=y) = 50
capacity(a.a2) = 60, capacity(a.a2, label=x) = 70, capacity(a.a2, label=y) = 50
capacity(b.b1) = 100, capacity(b.b1, label=y) = 100

Hence, for the a1 and a2 configurations:
capacity(a.a1) + capacity(a.a2) = 100
capacity(a.a1, label=x) + capacity(a.a2, label=x) = 100
capacity(a.a1, label=y) + capacity(a.a2, label=y) = 100

So how many resources can queue a1 access? On nodes without any labels: 20 (the total number of containers that can be allocated on nodes without a label, in this case nodes n7 and n8) * 40% (capacity of a) * 40% (capacity of a.a1) = 3.2 containers.

The YARN Fair Scheduler

The goal of the Fair scheduler is to allocate resources so that all running applications (over time) receive the same share of resources. Figure 1 shows how fair sharing operates for applications that are part of the same queue. It has to be pointed out, though, that fair sharing is normally configured to operate among queues (or pools). To illustrate this concept, assume that there are 2 users, A and B (see Figure 6), where each user is assigned to a separate queue. On the timeline, A starts a job (1) and receives all available resources, as there are no other active jobs. After a certain time, B initiates a job (2) while A's job (1) is still running; after a time lag x, each job will utilize half the resources. As depicted in Figure 6, B initiates another job (3) while jobs 1 and 2 are still running. As B runs in its own queue, jobs 2 and 3 have to share half of the cluster resources until job 2 terminates, while user A's job (1) continues executing with the other half of the system's resources. The result of all this is that the available resources are shared fairly among the users.

Figure 6: Fair Scheduler - Queues & Users (picture courtesy of Safari Books Online)
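To make the queue setup concrete, a per-user queue arrangement like the one above could be sketched in a fair-scheduler.xml allocation file along these lines (the queue names userA and userB are illustrative; the placement rules shown route each submission to a queue named after the submitting user):

<?xml version="1.0"?>
<allocations>
  <queue name="userA">
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="userB">
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <!-- fall back to a queue named after the user if no queue was specified at submission time -->
  <queuePlacementPolicy>
    <rule name="specified"/>
    <rule name="user"/>
  </queuePlacementPolicy>
</allocations>

Weights, minimum/maximum resources, and application limits of the kind discussed next can be added per queue in the same file.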
With the Fair scheduler, it is possible to assign weights to the queues; the weights are then used to calculate the fair share. Assume 2 queues, Production and Development. It is possible to configure the weights in a 60:40 ratio so that the Production queue has the potential to allocate more resources. If no weights are specified, in this example both queues would get 50% of the available resources. Further, the queues can have different scheduling policies. If not specified, the fair scheduling policy is used, but the Fair scheduler also supports the FIFO and Dominant Resource Fairness (DRF) models at the queue level. As with the Capacity scheduler, the Fair scheduler supports queues that can be configured with minimum and maximum resources, as well as a maximum number of running applications.

References
1. Hortonworks (www.hortonworks.com)
2. Pivotal (www.pivotal.io)
3. Apache (www.apache.org)
4. Safari Books Online (www.safaribooksonline.com)
5. Cloudera (www.cloudera.com)