Optimizing Shared Resource Contention in HPC Clusters

Sergey Blagodurov, Simon Fraser University
Alexandra Fedorova, Simon Fraser University

Abstract

Contention for shared resources in HPC clusters occurs when jobs execute concurrently on the same multicore node (contending for allocated CPU time, shared caches, the memory bus, memory controllers, etc.) and when jobs concurrently access the cluster interconnect as their processes exchange data with each other. In a virtualized environment, the cluster network must also carry the job virtual machines that the cluster scheduler migrates across nodes. We argue that contention for shared cluster resources severely degrades workload performance and stability and hence must be addressed. We also found that state-of-the-art HPC cluster schedulers are not contention-aware. The goal of this work is the design, implementation and evaluation of a scheduling algorithm that optimizes shared resource contention in a virtualized HPC cluster environment. Depending on the particular cluster and workload needs, several optimization goals can be pursued.

1 Introduction

Assume the target environment of a High-Performance Computing (HPC) cluster comprised of many (hundreds or even thousands of) computational nodes. The nodes in the HPC cluster are connected through a cluster network and are managed as a whole by a resource allocation and scheduling algorithm. The algorithm decides which applications to run on which nodes in the cluster and how many resources should be allocated to every process within every running job. An HPC cluster is a batch processing system: it executes a job at a time chosen by the cluster scheduler according to the requirements set upon job submission, the defined scheduling policy and the availability of resources.
That differs from, say, an interactive system, where commands are executed when entered at the terminal, or a transactional system, where jobs are executed as soon as they are initiated by a transaction request from outside the cluster. The exact methods of managing the workload by the resource allocation and scheduling algorithm depend on whether virtualization is supported within the cluster. If there is a virtual framework on the cluster nodes, then the algorithm schedules virtual appliances (VAs) of the applications rather than the applications themselves. In a non-virtualized environment, the job scheduler cannot migrate workload processes between the cluster nodes. If it deems internode rescheduling necessary, it may only do so by killing a process and spawning it on the new desired node, or by waiting for the natural termination of the process and then respawning it. In a virtualized environment, dynamic migration of VAs between the nodes of a cluster is possible. A job submitted to the HPC cluster is typically a shell script which contains a program invocation and a set of attributes allowing the cluster user to manage the job after submission and to request the resources necessary for the job's execution. The attributes specify the duration of the job, offer control over when the job is eligible to run, what happens to the output when it is completed, and how the user is notified when it completes. One important attribute is the resource list. The list specifies the amount and type of resources needed by the job in order to execute. A cluster job can request a number of cluster nodes, processors, the amount of physical memory, swap or disk space. The HPC cluster scheduler puts the job in a queue upon submission. The queue contains the jobs waiting for execution on the cluster.
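A submission script of the kind described above might look as follows, in the PBS/Torque dialect understood by schedulers such as Maui (the binary name and resource amounts are hypothetical; this is an illustrative sketch, not a script from our experiments):

```shell
#!/bin/sh
#PBS -N heat_solver            # job name
#PBS -l nodes=2:ppn=8          # resource list: 2 nodes with 8 cores each
#PBS -l walltime=02:00:00      # requested duration of the job
#PBS -l mem=4gb                # requested physical memory
#PBS -m ae                     # notify the user by e-mail on abort and on end
#PBS -o heat_solver.out        # where the captured output goes on completion

cd "$PBS_O_WORKDIR"            # run from the directory the job was submitted in
mpirun -np 16 ./heat_solver    # the program invocation itself
```

Note that the resource list (`-l`) describes only coarse quantities; nothing in it says how the job behaves once those resources are granted.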
Once the resources specified in the job submission script are available, and if the job is eligible to run according to the cluster policy, the scheduler starts the job and executes it for the duration specified in the submission script. If the job terminates before that time, the scheduler will try to use the resources freed by the job's termination to run other jobs. However, it might be that no jobs are eligible to run at that time, so, in general, the cluster user will be charged for the
time specified in the submission script. If the job needs more time to execute than is specified in the script, the scheduler might try to allocate additional resources to the job. It might not be able to do so, as different jobs might already be scheduled for execution immediately after. If that happens, the scheduler can terminate the job before its natural completion. In both cases, it is essential for the HPC cluster user to correctly predict the job's execution time, so that the user will not be charged for unnecessary resources if the job terminates early and so that her job will not be killed by the cluster scheduler due to its extended execution time.

Although most cluster management algorithms address shared resources like CPU, disk and network interface, there are other shared resources that become increasingly important on modern multicore machines, and that were not addressed by existing cluster management proposals. In particular, there is:

Shared resource contention between the applications in the memory hierarchy of each cluster node. We assume all nodes to be multicore systems. In a multicore system (Figure 1), cores share parts of the memory hierarchy, which we term memory domains, and compete for resources such as last-level caches (LLC), system request queues and memory controllers [3, 6].

Figure 1: A schematic view of a cluster node with four memory domains and four cores per domain. There are 16 cores in total, and a shared L3 cache per domain.

Contention and overhead of accessing cluster interconnects (the cluster network). It can occur when (a) the cluster uses a file server to store the data for cluster jobs; (b) several processes of the same job, spread among cluster nodes, need to communicate their data with each other (cluster jobs are usually created using MPI, the Message Passing Interface, or other APIs that allow their processes to exchange data even when running on different machines); (c) the cluster network also has to be used by the job scheduler in a virtualized environment to migrate virtual machines across the nodes, if necessary.

Figure 3: Average time increase for the 8-process MPI jobs scheduled on 2 nodes (4 processes per node) relative to a schedule on one node.

2 Why is taking care of shared resource contention important?

Shared resource contention can substantially affect the performance of a cluster job. Figure 2 shows the results of experiments where two different sets of four MPI jobs (4 processes each) were run simultaneously on a cluster comprised of 2 nodes with 8 cores each. The applications shown in this section are benchmarks from the High Energy Physics (HEP) SPEC, NAS Parallel Benchmarks (NPB), High Performance Computing Challenge (HPCC), Intel MPI and SPEC MPI2007 suites. We evaluated scientific applications for two reasons. First, they are CPU-intensive and often suffer from contention. Second, they are representative of the workloads typically run on HPC clusters. Among those four MPI jobs, two used the memory hierarchy of the node extensively and so, when put together on a node, could experience degradation due to contention for the memory resources of the machine. There are three unique ways to distribute the four MPI jobs (4 processes each) across the two 8-core nodes with respect to the pairs of co-run MPI jobs sharing a node. We ran the workloads in each of these schedules, recorded the average completion time for all applications in each workload, and labeled the schedule with the lowest average completion time as the best (this is the schedule where memory-intensive jobs are separated onto different nodes) and the one with the highest average completion time as
the worst (two memory-intensive jobs are put together on the same node).

Figure 2: The performance degradation of a contention-unaware cluster schedule relative to a contention-aware schedule for 2 workloads comprised of scientific MPI jobs.

Figure 2 shows the performance degradation for the worst schedule relative to the best one. The best schedule delivers an 11% better average completion time than the worst one. The performance of individual applications improves by as much as 33%. This data highlights the fact that scheduling decisions within the cluster must be contention-aware in order to prevent performance degradation due to shared resource contention. Figure 3 shows the degradation that an MPI job suffers when its processes are forced to communicate with each other over the cluster interconnect. The slowdown varies greatly from job to job, but it can be as high as 778% for some MPI applications. This stresses the importance of scheduling so as to reduce communication through cluster interconnects as much as possible.

3 Cluster schedulers are NOT contention-aware

The types of resources a job needs to execute, specified upon job submission, vary with the system architecture, but none of them allows a fine-grained description of the job's resource requirements (i.e., how sensitive the application is to memory resource contention or to internode data exchange). Because of that, the application may encounter a shortage of the actual computational resources allocated to it (e.g., cache space, memory controller bandwidth or internode interconnect bandwidth), even though the resource requirements specified during job submission (the number of nodes, cores per node, memory and so on) are perfectly met. This will in turn result in increased execution time for the contention-sensitive job and may lead to early termination of the job by the cluster scheduler if its execution time was incorrectly predicted in the submission script.
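One concrete form such fine-grained sensitivity information could take is a counter-derived rate, e.g. last-level cache misses per thousand instructions (a miss-rate metric of this kind is discussed in Section 6). A minimal sketch, where the counter readings, benchmark names and classification threshold are all illustrative, not measured values:

```python
def llc_misses_per_kilo_instr(llc_misses: int, instructions: int) -> float:
    """LLC misses per 1000 retired instructions: a compact contention-sensitivity metric."""
    return 1000.0 * llc_misses / instructions

# Illustrative (made-up) hardware counter readings for two jobs:
jobs = {
    "milc": {"llc_misses": 9_200_000, "instructions": 1_500_000_000},
    "namd": {"llc_misses": 310_000,   "instructions": 2_100_000_000},
}

for name, c in jobs.items():
    mpki = llc_misses_per_kilo_instr(c["llc_misses"], c["instructions"])
    # A scheduler could flag jobs above some threshold as memory-intensive:
    label = "memory-intensive" if mpki > 1.0 else "cpu-bound"
    print(f"{name}: {mpki:.2f} LLC MPKI -> {label}")
```

On Linux, such counts could be gathered with hardware performance counters; the point here is only the shape of the metric a resource list currently has no field for.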
The probability of an incorrect prediction increases in large HPC clusters, as they are often used by many users, and each user in general does not know which jobs will be executing concurrently on the cluster at a given time. Scheduling decisions that take cluster resource contention into account can significantly improve the effectiveness of an HPC cluster, resulting in more jobs being run and quicker job turnaround. It is the job of the scheduler to use whatever freedom is available to schedule jobs so as to maximize cluster performance while minimizing the resources spent on it.

4 Our proposal: make the cluster schedulers contention-aware

The goal of this work is the design, implementation and evaluation of a scheduling algorithm that optimizes shared resource contention in an HPC cluster. Depending on the particular cluster and workload needs, the following optimization goals can be pursued by the cluster scheduler:

- Stable performance of the overall system: fairness in the degradation that shared resource contention imposes on all applications.

- A performance boost for chosen (prioritized) jobs, achieved by reducing resource contention for them or isolating them from contention completely.

- Reduction of the system's power consumption by packing applications onto as few nodes as possible, thus providing a better solution in terms of the power-performance trade-off. We intend to measure the improvement in terms of the Energy Delay Product (EDP) for the cluster with contention-aware schedulers in comparison with the default scheduler and with the default scheduler with power savings on. Energy Delay Product is a common metric for energy/performance improvement [4].

- Scalability. It is expected that the number of cluster nodes as well as the number of processor cores
within a single cluster node will continue to increase [2]. Any scheduling and resource allocation algorithm in such an environment should be highly scalable, because a centralized solution would result in delayed scheduling decisions and an inability to respond to dynamic workloads. The efficiency of the scheduler is measured as the time it takes to make a complete scheduling decision for 10, 100, 1000, etc. jobs/processes. In a centralized algorithm this time will increase exponentially with the number of nodes and cores (the number of potential scheduling entities), while a decentralized approach will reduce it by breaking the scheduling task into several subtasks executed in parallel.

We assume that each goal should be achieved under the following requirements:

- Maximizing overall workload performance (as long as it does not contradict the goal's objective).

- Satisfying user resource constraints. User requirements are currently expressed via the number of desired dedicated nodes, cores or allocated memory. As we optimize shared resource contention in the cluster, we must make sure that we do not give a job fewer nodes, CPUs or less memory than the job requested, unless the job effectively uses fewer resources than it had requested.

- Ensuring that the workload of each user is hurt by contention only within certain predefined limits.

5 Design challenges

In order to fulfill the optimization goals outlined above, we need to come up with solutions to the following problems:

1) In a cluster environment, the scheduler generally runs a job if it is next in the queue and all the resources it requested are available to assign to the job's processes.¹ This approach, however, assumes that the user knows what resources are necessary for the job to complete in the required amount of time (which must also be specified by the user).
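The head-of-queue rule just described can be sketched as follows. This is a deliberately simplified model: jobs and nodes are reduced to core counts only, and priorities and backfill are ignored; names and numbers are illustrative.

```python
from collections import deque
from typing import Optional

def try_start_next(queue: deque, free_cores: dict) -> Optional[str]:
    """Start the job at the head of the queue iff its core request fits on some node.

    Returns the started job's name, or None if the head job must keep waiting
    (in which case nothing later in the queue runs either, absent backfill).
    """
    if not queue:
        return None
    name, cores_needed = queue[0]
    for node, free in free_cores.items():
        if free >= cores_needed:
            free_cores[node] -= cores_needed   # allocate the cores to the job
            queue.popleft()
            return name
    return None                                 # head job waits

queue = deque([("jobA", 8), ("jobB", 2)])
free = {"node0": 4, "node1": 8}
print(try_start_next(queue, free))   # jobA starts on node1
print(try_start_next(queue, free))   # jobB starts on node0
```

Nothing in this rule consults how the co-located jobs will interact once placed, which is exactly the gap the rest of this section describes.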
The existing schedulers allow users to specify coarse-grained resource demands upon submission, such as the number of execution cores and the maximum amount of main memory, disk or swap space that the job will use. None of this information, however, reflects how sensitive the job is to resource contention from the other jobs that will be executing in the same cluster simultaneously with the submitted one. As a result, neither users nor the cluster scheduler are able to predict the actual execution time that the job will have under a particular cluster setup and workload. This can lead either to overestimated execution times, which result in increased charges for users, or to underestimated times, which result in early termination of cluster jobs by the scheduler. To address this problem, a new set of contention-descriptive metrics, representing fine-grained information about each job's resource utilization and communication patterns, needs to be provided both to the scheduler, to help it make scheduling decisions, and to the cluster users, to properly describe the jobs they submit and to estimate the slowdown due to cluster sharing. Some of these metrics can be found in previous work (Section 6), while others need to be discovered.

¹ There could of course be exceptions to this general rule if, for instance, certain jobs are deemed high priority, in which case they can prevent non-prioritized jobs from starting before them. Another example would be a backfill scheduling policy: if the scheduler sees that the next job in the queue cannot start due to a lack of necessary resources, it can instead start jobs located later in the queue to prevent resource wasting.

2) The optimization goals outlined above can potentially be fulfilled together.
For example, the scheduling task specified by the system administrator for the whole cluster (or, possibly, by a user for her submitted tasks only) could be: boost the execution of a given subset of jobs while saving as much power as possible for the rest. How should we devise the algorithm so it can fulfill several optimization goals at the same time? Another interesting investigation would be the ability of the scheduler to dynamically detect which optimization goal is the most beneficial for the current cluster workload and then switch between optimization goals as necessary.

3) To better mitigate cluster contention, the scheduler needs to co-schedule jobs that do not compete for shared resources. Hence, there is a temptation to look ahead into the queue of submitted jobs: there could be, for instance, something at the tail of the queue that would result in better contention properties, but at the expense of skipping the queue order. How should we trade off the goals of fairness and contention management in this case?

4) When we have a queue of jobs as well as many jobs already running on the cluster, what is the algorithm for creating assignments that satisfy the particular optimization goal(s) the scheduler is trying to accomplish: CPU and memory requirements, contention, power consumption, etc.? The combinations of jobs that we can create are many; how do we find a good one quickly?

5) In the model we are proposing, there is an incentive to give users fewer resources than they asked for if they do not effectively use them (for instance, if a user submitted a CPU-intensive job while requesting a whole dedicated node for it, the scheduler can still assign more jobs to the same node if the submitted job effectively uses only one core of the multicore machine). This
could increase resource utilization, but it can cause conflicts between colocated jobs and, as a result, slowdowns due to shared resource contention. What incentives should we give users to accept this kind of latitude on the part of the cluster scheduler?

6 What has been done so far? How can it help?

In our previous work, we investigated ways of reducing resource contention within a multicore machine (a cluster node) [3, 6]. Our methodology allowed us to identify the last-level cache miss rate as one of the most accurate predictors of the degree to which applications will suffer when co-scheduled. We used it to design and implement an OS scheduling algorithm called Distributed Intensity (DI). We showed experimentally that DI performs better than the default Linux scheduler, delivers much more stable execution times, and performs within a few percentage points of the theoretical optimum. DI separates memory-intensive applications as far apart in the memory hierarchy of the machine as possible [3]. On many multicore systems, power consumption can be reduced if the workload is concentrated on a handful of chips, so that the remaining chips can be brought into a low-power state. In order to determine whether threads should be clustered (to save power) or spread across chips (to avoid excessive contention), the scheduler must be able to predict to what extent threads will hurt each other's performance if clustered. We found that DI, with a slight modification, is able to make this decision very effectively, which led to an EDP improvement of as much as 80% relative to plain DI [3]. Koukis and Koziris [5] present the design and implementation of a gang-like scheduling algorithm aimed at improving the throughput of multiprogrammed workloads on multicore systems. The algorithm selects the processes to be co-scheduled so as neither to saturate nor to underutilize the memory bus or network link bandwidth.
Its input data are acquired dynamically using hardware monitoring counters and modified NIC firmware. The experimental setup in [5] assumed that all processes were spawned directly under the control of the Linux scheduler, using the mpirun command. The authors then compared their algorithm's performance with the default Linux scheduler (O(1) at the time [5] was written). While using an OS scheduler in an HPC cluster setup can be justified for a very small number of nodes, industry-size clusters require state-of-the-art cluster schedulers to make scheduling decisions (the cluster scheduler we experiment with is Maui [1]), since these schedulers support features like scalability, fulfillment of user-specified constraints, dynamic priorities, reservations, and fairshare capabilities, which are necessary for operating a big cluster and absent from OS schedulers. Because of that, within our work we mainly aim to compare the performance of our techniques with the state-of-the-art cluster schedulers on industry-scale clusters.

7 Summary

In this paper, we experimentally showed that contention for shared cluster resources, both between jobs within the multicore nodes of an HPC cluster and between jobs accessing the cluster interconnects, can severely degrade their execution time. This in turn can lead to the premature termination of a job by the cluster scheduler if the job's execution time was incorrectly specified in the job submission script. We have described how this motivates our project on the design and implementation of a contention-aware cluster scheduler that can optimize HPC cluster contention in several ways: (1) fairness in the degradation caused by shared resource contention for all cluster jobs, (2) a performance boost for chosen (prioritized) jobs, (3) a reduction in the system's power consumption by packing cluster jobs onto as few nodes as possible, (4) scalability of the contention-aware cluster algorithm for HPC clusters with a large number of nodes and cores per node.
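The Distributed Intensity policy from Section 6, on which several of these objectives build, can be sketched roughly as follows: rank jobs by LLC miss rate and deal them out across memory domains so that the most intensive ones land apart. This is a simplified sketch; the miss rates, job names and two-domain layout are illustrative, not from our experiments.

```python
def distributed_intensity(miss_rates: dict, n_domains: int) -> list:
    """Assign jobs to memory domains, spreading memory-intensive jobs apart.

    Jobs are sorted by LLC miss rate (descending) and dealt round-robin,
    so the most intensive jobs end up in different memory domains.
    """
    domains = [[] for _ in range(n_domains)]
    ranked = sorted(miss_rates, key=miss_rates.get, reverse=True)
    for i, job in enumerate(ranked):
        domains[i % n_domains].append(job)
    return domains

# Illustrative miss rates (misses per 1000 instructions):
rates = {"milc": 6.1, "soplex": 4.8, "namd": 0.2, "povray": 0.1}
print(distributed_intensity(rates, 2))
# The two intensive jobs ("milc", "soplex") land in different domains,
# each paired with a cache-friendly job.
```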
To fulfill these scheduling objectives, a new set of metrics needs to be found that models shared resource contention and represents fine-grained information about each job's resource utilization and communication patterns. The last-level cache miss rate and the amount of traffic through the network interface of a cluster node, proposed in earlier work, are examples of such metrics. The necessary information can be obtained from the performance counters within cluster nodes and extensive cluster interconnect monitoring between them.

References

[1] Maui scheduler administrator's guide. [Online] Available: http://www.clusterresources.com/products/maui/docs/mauiadmin.shtml.

[2] Teraflops research chip. [Online] Available: http://en.wikipedia.org/wiki/Teraflops_Research_Chip.

[3] BLAGODUROV, S., ZHURAVLEV, S., AND FEDOROVA, A. Contention-aware scheduling on multicore systems. ACM Trans. Comput. Syst. 28 (December 2010), 8:1-8:45.

[4] GONZALEZ, R., AND HOROWITZ, M. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits (1996).

[5] KOUKIS, E., AND KOZIRIS, N. Memory and network bandwidth aware scheduling of multiprogrammed workloads on clusters of SMPs. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (2006), ICPADS '06, pp. 345-354.

[6] ZHURAVLEV, S., BLAGODUROV, S., AND FEDOROVA, A. Addressing Shared Resource Contention in Multicore Processors via Scheduling. In ASPLOS (2010).