National Facility Job Management System
1. Summary

This document describes the job management system used by the NCI National Facility (NF) on their current systems. The system is based on a modified version of OpenPBS; in the following, ANUPBS will refer to this version. Depending on the platform, ANUPBS may have to interact with, and use features of, a native job management system. This interaction with other resource management systems, and the necessary features of such a system, are also described.

The critical feature of the job management system is the use of job suspension/resumption to schedule the bulk of the work presented to the system (not only for handling high priority work). In particular, a large fraction of the parallel jobs on the system have suspended smaller, longer-running jobs in order to run. The use of suspend/resume allows very high utilization (>95%) to be maintained even with an extremely diverse workload mix, while still respecting any political share allocations. This high utilization is achieved without bias towards any class of jobs, such as those that can fill the scheduling holes created in systems based on backfill and reservations. Equally importantly, care is taken to ensure suspend/resume scheduling does not compromise the performance of any jobs.

The use of suspend/resume has a number of implications that will be discussed in the following sections:
- since jobs will share nodes, there must be careful job process management, resource monitoring and limiting to ensure jobs don't impact each other;
- there are additional requirements in the areas of NUMA- and network topology-awareness, lightweight scalable operation etc. to ensure that jobs are always given the opportunity to perform optimally;
- job paging/swapping must be carefully managed; and
- to meet policy and fairness scheduling goals as well as high utilization, the usual divide between scheduler and resource manager must be eroded or removed.

2. Background

The National Facility provides high performance computing services to all Australian academics and government research agencies requiring large high performance compute resources. This broad charter has a number of practical implications in terms of the workload mix and how it can be serviced:
- the term "large compute resource requirements" is not limited to the number of cpus per job: it includes the number of small jobs, the amount of single node memory or disk, the cost of licensed software etc. Jobs currently range from 600hr single processor Gaussian jobs requiring 800GB of node-local scratch disk to tightly coupled but highly scalable 8000-cpu combustion simulations. A large fraction of the resources are consumed by climate simulations utilizing between 128 and 512 cpus.
- in terms of total cpu-hours, less than 10% is consumed by single node jobs (although they do constitute a large number of jobs), and usually more than 50% of the system is running jobs of greater than 64 cpus. The trend to larger parallel jobs at the National Facility has been relentless over the last 10 years and is expected to continue.
- the most difficult jobs to schedule are typified by VASP jobs using a substantial number of cpus but requiring a runtime of 100 hours or more because of checkpointing difficulties. These jobs fragment cpu space for long periods of time, making capability job starts difficult.
- experience has shown that partitioning the system along the lines of job types or resource requirements invariably leads to idle partitions at the same time as there are jobs queued for other partitions;
- the number of users (and projects) is in the several hundreds, with user skill levels varying considerably;
- the frequency of jobs running amok (by trying to use greater resources than requested) can be quite high;
- to optimize support for the varied workload, NCI NF systems are heterogeneous: the amount of memory, swap and local disk, and possibly even the number of cpus, varies across nodes, which has implications for system scheduling;
- the various NCI Partner organizations and access schemes have pre-determined shares of the system; the scheduler must deliver those shares, and deliver to priority projects, regardless of the characteristics of their jobs;
- requests for allocations within each share heavily over-subscribe those shares;
- there is an expectation that the system deliver close to 100% of available cpu-hours.

Over more than 15 years, NCI-NF staff have developed a management system that comes very close to overcoming these difficulties and meeting all these goals. Motivated by frustrations in managing a closed and inflexible vector parallel system, development of ANUPBS began in 1997, prior to the existence of PBSPro and Torque. The system has been ported to a large variety of HPC systems:
- initial development occurred on a cluster of large (24 and 64 cpu) Solaris SMP nodes;
- maturation on, and integration with the Quadrics components of, the Compaq/HP AlphaServer SC;
- easy deployment on a number of Linux clusters at the National Facility and around Australia;
- sophisticated NUMA-awareness development and integration with SGI Array Services on a cluster of 64-way SGI Altix systems;
- further scalability enhancements and network topology awareness on the core Sun/Oracle Constellation cluster.

3. Resource Allocation

The traditional, simple model of job management has involved a scheduler to decide which job to run next and an independent resource manager to allocate cpus to that job. As discussed in section 6, for sophisticated suspend/resume based scheduling, these two roles cannot be disentangled. Here we note that, independent of suspend/resume, complete job management also necessitates combining these roles. With a sufficiently diverse job mix and user base, all resources need to be managed carefully at the scheduling level. On NCI-NF systems, a resource currently means one of:
- cpus or nodes: users request a number of cpus although, for distributed jobs, this number must match whole nodes. There is at most one running job on each cpu at any time; scheduling is never based on load.
- memory: users request virtual memory but are allocated physical memory, i.e. the total address space of all processes must fit in the free physical memory. The expectation is that HPC jobs are using most of their address space. ANUPBS has an option to (more expensively) evaluate job physical page use (including swap) and limit on that measure.
- swap space: not a resource users request, but one monitored and managed by ANUPBS.
- local disks: see the discussion of jobfs in section 8.
- software licenses: a system-wide resource not monitored at the node level but still strictly monitored and managed by ANUPBS.
- local IO bandwidth: crudely allocated; users nominate if their job is IO bound and the scheduler only allocates one such job per node. Not currently monitored.
- walltime: not a physically limited resource and hence not a hard a priori constraint on scheduling. However, walltime is very significant in deciding if and when suspension occurs; see section 6.
- gpus: basic allocation functionality only, hindered by the lack of access control on GPUs.

User job requests must specify all the resources required; default resource requests are deliberately limiting. The requests are also required to be reasonably accurate so that appropriate physical resources can be reserved for the job's real needs without undue waste. By necessity, the ANUPBS scheduler has the responsibility of allocating all these resources. (Of course it does not physically allocate any resources; the allocation is theoretical in the sense that it assumes jobs never exceed their resource requests. See the next section for the physical allocation.) The scheduler is aware of all the resources available on all nodes, both what is unallocated to jobs and what is actually unused by jobs, and constrains scheduling decisions in light of this availability and the resource requests of candidate queued jobs. This strict allocation process is essential because of:
1. the reasonably large number of subnode-size jobs sharing nodes, and hence node resources, while running,
2. the heterogeneity of the nodes of the system (the number of cpus and amounts of available memory, disk and swap space vary amongst the nodes) and
3. job suspension/resumption (see section 6) causing additional node resource sharing.

At a minimum, sufficient swap space and node-local disk must be available to support all suspended and running jobs co-resident on each node. This scheduling constraint is imposed based on an appropriate mixture of requested and actually used job resources. For example, if all jobs on a node are to be suspended to run a single job requiring all the cpus of the node, then the sum of the current jobs' actual usage and the new job's requested usage must be within the node's capacity. Whenever there is a possibility of jobs running simultaneously on a node (each using a subset of cpus) in the future, resource requests (as opposed to current usage) must be used to determine total usage.

4. ANUPBS on the Compute Nodes

A PBS job execution/management and node monitoring daemon called a MOM is run on every compute node. This daemon:
- initiates and cleans up all jobs on that node,
- monitors jobs' resource usage to enforce scheduling decisions, and
- monitors node resources and activity in detail.

Job initiation/completion: Under PBS, executing jobs exist as a shell, either running the batch script or with a tty connection to a user terminal session. PBS has no knowledge of the commands in the batch script, and jobs are never simply a command (unlike under LSF).
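To make the allocation constraint of section 3 concrete, the node-capacity check for co-resident jobs can be sketched as below. This is an illustrative sketch only, not the ANUPBS implementation; the names, and the restriction to memory and jobfs disk, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    mem_gb: float     # memory footprint
    jobfs_gb: float   # node-local scratch disk

def fits_on_node(capacity: Usage, incoming: Usage, resident: list[Usage]) -> bool:
    """Can `incoming` start on this node alongside the `resident` jobs?

    The incoming job contributes its *requested* usage.  Each resident job
    contributes its *actual* usage if it will be suspended under the incoming
    job, but its *requested* usage if it may run concurrently later -- the
    caller chooses which figure to pass for each resident job.
    """
    mem = incoming.mem_gb + sum(j.mem_gb for j in resident)
    disk = incoming.jobfs_gb + sum(j.jobfs_gb for j in resident)
    return mem <= capacity.mem_gb and disk <= capacity.jobfs_gb
```

For example, on a 128GB node already holding a job actually using 20GB, a whole-node job requesting 100GB fits, while one requesting 120GB does not.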
All jobs on the system are initiated by a MOM on a compute node allocated to the job, with appropriate environment and limits set. When requested by the job, the MOM also initiates a directory on a node-local filesystem; see section 8. By default, job stdout and stderr are managed by the MOM on the node and returned to a global filesystem on job completion. At job completion, the MOM also stages files out of, and then cleans up, jobfs directories and removes leftover shared memory segments and /tmp files. Starting jobs on their allocated nodes means a node failure affects only those jobs actually running on that node. It also means that users not running MPI jobs do not need to use remote execution utilities in their scripts, and their scripts have direct access to jobfs (see below).

Job monitoring: While the scheduler does virtual allocation of resources, the MOM is responsible for ensuring those allocations are actually available by limiting all jobs to their requested resources. To keep the overhead of monitoring low, users' processes are sampled approximately once per minute, but more often during the initial phase of the job when resource usage is likely to be changing most rapidly. Total per-node job resource usage is determined (taking into account threads, shared memory segments etc.) and, if it exceeds the request, the job is terminated. Swap space on every node allows the node to absorb overuse of memory until the next job monitoring cycle of the MOM. Hence, given the scheduling constraints, jobs sharing nodes are guaranteed not to catastrophically impact one another.

Node monitoring: The MOM provides detailed node status information to PBS, such as physically available and unused resources (like memory and disk) as well as actual cpu usage and paging rates. In addition to providing basic scheduling information for the scheduler, monitors can send alerts of exceptional states like unexpected memory or disk usage or excessive load or paging.

5. Suspension/resumption

The primary mechanism used to run large parallel jobs is the suspension of smaller jobs. Since suspending a job is effectively just sending a SIGSTOP to the job's processes, suspended jobs remain resident on their execution nodes and these nodes are temporarily reallocated to the larger parallel job. On completion of the suspending job, the processes of the suspended jobs are simply sent a SIGCONT.
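At its core, the mechanism is just signal delivery. A minimal sketch, with process enumeration, job containers and resource-manager coordination omitted:

```python
import os
import signal

def suspend_job(pids):
    """Stop every process of a job in place.  The processes stay resident on
    the node; the kernel pages them out only if a new job needs the memory."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)

def resume_job(pids):
    """Continue a previously suspended job on its original nodes."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)
```

A real MOM must first enumerate every process belonging to the job (children, threads, shared memory holders); signalling a process group or job container is the robust way to do that.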
The obvious concern about suspension/resumption is the possibility of excessive paging when a suspended job is replaced in memory by a job just starting. In reality, this is rarely a serious issue. Many suspensions never lead to any paging because the combined memory use of all jobs concerned is sufficiently small. Even when there is memory overcommitment, it can be constrained to an acceptable level by scheduling decisions (and MOM job limit enforcement). The other mitigating factor is the relatively high performance of paging under Linux (which is likely to improve further with large page swapping). In the worst cases, only a couple of minutes of paging result.

This lazy approach to claiming memory for an incoming job is much more efficient than the alternative of preemptively swapping suspended jobs before starting suspending jobs. Given the frequency of failures of job startups (due to user error) or over-requests of memory, preemptive swapping would induce an unnecessarily large amount of swapping.

The astute reader will be aware of the importance of NUMA page placement on application performance and may detect a possible issue with suspend/resume in this context. Ideal MPI performance is achieved when MPI tasks are each confined to a single core and their page allocations are all to the local NUMA memory of that core. In some circumstances, suspend/resume may lead to more off-node page allocations because particular NUMA nodes are full of suspended jobs. On the last two NF systems, the primary MPI library has been customized to provide memory binding by default for MPI jobs so that page allocations never go off node. In reality, suspend/resume is a secondary reason for introducing memory binding. Even without suspend/resume, memory binding has been shown to greatly improve the consistency of performance of large-scale MPI applications.

6. Scheduling

Suspend/resume is integral to supporting capability and capacity usage as well as maintaining high overall system efficiency and utilization. Unlike in a number of other systems, it is not simply a brute force approach to running high priority parallel jobs. It is an essential component of virtually every job start decision, including selecting between nominally equal priority jobs. In essence, the mechanism is a form of time-slicing or gang-scheduling at the job-length timescale. In terms of improving overall system utilization, it is conceptually the equivalent of cheating at Tetris by chopping the blocks (the cpu-walltime size of jobs) up to make them easier to pack.

As discussed in section 3, the decision process first involves satisfying all physical resource constraints to ensure no overcommitment. Hard restrictions (like ensuring co-located jobs will never exhaust node swap space or local disk space) are supplemented with heuristics to avoid excessive paging when a new job starts on the same node as suspended jobs. Since ANUPBS is starting multiple jobs in a scheduling cycle, it is important that ANUPBS knows exactly which nodes each job runs on. Relying on some indirect control, like setting job priorities and leaving the node selection decision up to an external resource management system, can lead to undesirable job placement because the system state seen by the second scheduler may be different to that seen by ANUPBS (jobs are constantly completing). Hence the requirement of allowing ANUPBS to fully specify the nodes allocated to a job.

The real complexity of the scheduling process comes in trying to achieve some form of fairness (or adhering to share policy goals) in choosing which jobs, and when, to suspend. The NCI scheduler tries to ensure no jobs are starved (suspended or queued indefinitely) and to give roughly equitable access and turnaround to all users/projects.
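As an illustration of what such a choice might weigh, here is a hypothetical pairwise suspender/suspendee score. The factors mirror the dynamic factors described in this section, but the field names and weights are invented for the sketch and bear no relation to the actual ANUPBS scoring.

```python
from dataclasses import dataclass

@dataclass
class Job:
    ncpus: int
    walltime_h: float       # requested walltime
    hours_run: float        # walltime already consumed
    hours_suspended: float  # time already spent suspended
    share_debt: float       # positive if the owning project is under its share

def suspendability(suspender: Job, suspendee: Job) -> float:
    """Higher score = better candidate to suspend under `suspender`."""
    score = suspender.ncpus / max(suspendee.ncpus, 1)         # prefer smaller jobs
    score -= suspendee.hours_suspended                        # don't starve long-suspended jobs
    score -= 10 * suspendee.hours_run / suspendee.walltime_h  # spare nearly-finished jobs
    score += suspender.share_debt - suspendee.share_debt      # respect fairshare
    return score
```

A search over all suspendable jobs for the best-scoring set (subject to topology and nesting constraints) then yields the nodes to allocate.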
It is important to note that scheduling decisions (particularly suspension) are not based on static job priorities: virtually all jobs are in the one queue with the same static priorities. The decisions are based on a number of dynamic factors, including:
- relative job walltime and ncpus requests,
- how long jobs have already been suspended,
- the relative number of cpus already in use by the respective users and projects,
- how close to completion the prospective suspendee jobs are, and
- the fairshare status of the respective users and projects.

A pairwise (suspender/suspendee) job comparison is made and given a numerical value based on these factors. Then a search is made over all suspendable jobs to select the best set of jobs (and hence nodes) based on this job suspendability score. Issues like network topology locality are reflected in this search procedure. To maintain efficient system use, an extra constraint of proper nesting is imposed: a suspended job must be entirely within the footprint of the suspending job.

Of course, these are site-specific scheduling goals and implementation details; other sites may explicitly favour particular job types over others. Indeed, the NCI scheduler does have the flexibility of preferring jobs of specific users or projects over others, but even this is not implemented as some absolute suspension priority. The critical point is that site policy impacts how jobs are selected for suspension and hence how nodes are allocated to jobs.

There are a number of obvious scheduling advantages to using suspension/resumption:
- there is no need to hold nodes or cpus idle to run parallel jobs; they can be run at basically any time, given no other resource conflicts;
- the scheduling algorithm does not introduce a bias toward or against particular job classes. Compare this with, for example, backfilling, which favours short jobs of few cpus, or the PSC model of draining the whole system once per day to schedule large jobs. The latter approach is designed to overcome the system fragmentation that always results from production use, but it a) wastes a lot of cputime and b) cannot support jobs that are unable to checkpoint in that interval;
- short debugging and testing jobs can be supported without reserving nodes.

There are, of course, disadvantages:
- possible excessive paging; see the previous section;
- having too many suspended jobs and too few queued jobs can lead to situations where there are idle nodes despite there being plenty of jobs on the system. Limiting job suspension when queues are short and providing a user-assisted job migration mechanism can often avoid this scenario;
- some users seem to prefer that jobs reside in the queued state rather than the suspended state. A little education is sometimes required to convince them that it is only the sum of the time spent in either of these two states that needs to be minimized, and that preemptive scheduling is, on average, reducing that time.

7. Interaction with external resource management systems

A number of proprietary high performance interconnects and message passing systems include some form of resource manager that is intimately tied to the MPI system. The resource manager typically spawns and manages MPI tasks across all nodes allocated to the job, as well as providing any necessary privileged access to devices or mappings. Often the resource manager includes some sort of basic scheduling and node allocation functionality and will respond to requests from any user. To work with ANUPBS, the resource manager should a) support a mode of operation where only privileged processes can cause resource manager actions and b) within that mode, provide an API that allows a privileged process to:
1. provide an unprivileged user's job with access to the interconnect and message passing system,
2. specify the cpus and/or nodes allocated to a job,
3. suspend all processes in a job and make the cpus allocated to that job available to another,
4. reattach a suspended job to its allocated cpus and resume the job, and
5. send a specified signal to all processes of a job.

The MPI library also needs to support suspend/resume actions by the resource manager; e.g. timeouts should be appropriately guarded. A quality resource manager will create a job container of the cpus/nodes allocated that a) all job processes are confined to and b) persists between multiple invocations of mpirun within a job. The resource manager or the associated MPI library should also support both process-to-cpu binding and, on NUMA nodes, process-to-NUMA-memory binding for the tasks of MPI jobs. Two resource management systems providing this functionality that have been successfully integrated with ANUPBS are the Quadrics Resource Management System (RMS) and SGI Array Services and Message Passing Toolkit.
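The five privileged operations map naturally onto an abstract interface. The sketch below names them for illustration only; the method names are invented, and real integrations (Quadrics RMS, SGI Array Services) each expose their own calls.

```python
from abc import ABC, abstractmethod

class ResourceManager(ABC):
    """Privileged operations a batch system like ANUPBS requires of a
    native resource manager.  Method names are illustrative."""

    @abstractmethod
    def grant_access(self, jobid: str, user: str) -> None:
        """Give an unprivileged user's job access to the interconnect
        and message passing system."""

    @abstractmethod
    def bind(self, jobid: str, nodes: list, cpus: list) -> None:
        """Specify exactly which cpus/nodes the job is allocated."""

    @abstractmethod
    def suspend(self, jobid: str) -> None:
        """Stop all job processes and free their cpus for another job."""

    @abstractmethod
    def resume(self, jobid: str) -> None:
        """Reattach a suspended job to its allocated cpus and continue it."""

    @abstractmethod
    def signal(self, jobid: str, signum: int) -> None:
        """Deliver a signal to every process of the job."""
```

A per-platform integration then subclasses this interface, leaving the scheduler's suspend/resume logic platform-independent.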
Many open source MPI systems, such as MPICH, LAM and Open MPI, either are, or can be made, PBS-aware in the sense that they use PBS directly as their native resource manager and job launcher.
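For a sense of what PBS-aware means in practice: inside a PBS job, the environment variable PBS_NODEFILE points at a file listing one hostname per allocated cpu, from which a launcher can build its host list (launchers may also use PBS's TM interface directly). A minimal sketch of reading it:

```python
import os
from collections import Counter

def pbs_hosts(path=None):
    """Return {hostname: slot_count} from a PBS node file.

    PBS writes one line per allocated cpu; $PBS_NODEFILE names the file
    inside a running job.
    """
    path = path or os.environ["PBS_NODEFILE"]
    with open(path) as f:
        hosts = [line.strip() for line in f if line.strip()]
    return dict(Counter(hosts))
```

So a job allocated two cpus on n01 and one on n02 yields {"n01": 2, "n02": 1}, which maps directly onto an MPI machinefile or slot list.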
8. /jobfs scratch disk

This section is included as an illustration of a feature not typically found in a job management system but which infiltrates all levels of the job management process. One of the largest impediments to efficient utilization of the NCI system is poor IO practices by users accessing global filesystems: in many cases, frequent metadata-dominated IO requests lead to greatly diminished IO performance for all jobs. Users are requested to utilize node-local disks as much as possible, and a large majority of the jobs running on the system now do so. Clearly, node-local disk space must be carefully (and strictly) managed if it is to be a reliable job resource. On the NCI NF system:
- virtually all nodes are configured with a large /jobfs partition dedicated to job use only during job lifetime;
- users must request the amount of this disk space required in their job submission;
- the ANUPBS scheduler carefully allocates jobfs resources at the per-node level;
- the ANUPBS node daemon:
  - creates a writable /jobfs subdirectory for the job at job startup and adds environment variables to the job environment for access,
  - monitors the size of the /jobfs subdirectory while jobs are running, terminating jobs that exceed requested usage,
  - cleans up the directory on job termination;
- utilities are provided to transfer files to and from /jobfs and to monitor and access the filesystem interactively while a job is running.

A persistent node allocation for the lifetime of the job is essential for moving data to and from /jobfs subdirectories outside MPI program execution in distributed jobs.

9. System management

A job management system must, of course, interact closely with all other aspects of system management.
ANUPBS offers a number of features that specifically enhance this interaction, including:
- draining specific nodes for some specified future time for system maintenance such as hardware or software updates;
- suspending all (or a subset of) jobs to perform tasks such as updating and rebooting Lustre servers, correcting a component of the system interconnect, or running diagnostic or verification tests;
- controlling jobs when external resources such as mass storage or database servers are scheduled for downtime or having problems.

In addition, ANUPBS provides generic interfaces to external usage accounting systems and, in particular, supports the functionality of ANU's very sophisticated project- and shares-based accounting system, RASH. Indeed, the political context of NCI requires a sophisticated hierarchical shares-based model. ANUPBS has evolved to provide shareholders the ability to select their own scheduling policies and to compose those policies within a hierarchical share model. In this context, suspend/resume is more an instrument for achieving these share and policy goals than a special-case mechanism.

10. Summary

The following list of main features summarizes the model used:
- job suspension/resumption is critical to running the workload mix we are presented with;
- to fully utilize the system, jobs share nodes constantly, either running side-by-side, each on a subset of node cpus and memory, or with one job running on all cpus of the node while other jobs are suspended but still resident on the node;
- to ensure jobs get reliable performance, all resource usage is carefully monitored and limited to the amounts requested by the job;
- the scheduler is responsible for placing jobs on nodes such that resources should not be overcommitted in a scheduling/allocation sense, and node execution daemons (MOMs) are responsible for ensuring resources are not overcommitted in an actual usage sense (i.e. the scheduler ensures resources should not be over-allocated while the MOMs ensure resources are not over-used);
- the selection of nodes to be allocated to a job involves site policy and, hence, is the responsibility of the site scheduler.
More informationRunning applications on the Cray XC30 4/12/2015
Running applications on the Cray XC30 4/12/2015 1 Running on compute nodes By default, users do not log in and run applications on the compute nodes directly. Instead they launch jobs on compute nodes
More informationThe Importance of Software License Server Monitoring
The Importance of Software License Server Monitoring NetworkComputer Meeting The Job Scheduling Challenges of Organizations of All Sizes White Paper Introduction Every semiconductor design group uses a
More informationManaging Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER
Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Table of Contents Capacity Management Overview.... 3 CapacityIQ Information Collection.... 3 CapacityIQ Performance Metrics.... 4
More informationLSKA 2010 Survey Report Job Scheduler
LSKA 2010 Survey Report Job Scheduler Graduate Institute of Communication Engineering {r98942067, r98942112}@ntu.edu.tw March 31, 2010 1. Motivation Recently, the computing becomes much more complex. However,
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More informationEMC XTREMIO EXECUTIVE OVERVIEW
EMC XTREMIO EXECUTIVE OVERVIEW COMPANY BACKGROUND XtremIO develops enterprise data storage systems based completely on random access media such as flash solid-state drives (SSDs). By leveraging the underlying
More informationCapacity Estimation for Linux Workloads
Capacity Estimation for Linux Workloads Session L985 David Boyes Sine Nomine Associates 1 Agenda General Capacity Planning Issues Virtual Machine History and Value Unique Capacity Issues in Virtual Machines
More informationMultilevel Load Balancing in NUMA Computers
FACULDADE DE INFORMÁTICA PUCRS - Brazil http://www.pucrs.br/inf/pos/ Multilevel Load Balancing in NUMA Computers M. Corrêa, R. Chanin, A. Sales, R. Scheer, A. Zorzo Technical Report Series Number 049 July,
More informationComparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
More informationThe Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland
The Lattice Project: A Multi-Model Grid Computing System Center for Bioinformatics and Computational Biology University of Maryland Parallel Computing PARALLEL COMPUTING a form of computation in which
More informationBatch Scheduling and Resource Management
Batch Scheduling and Resource Management Luke Tierney Department of Statistics & Actuarial Science University of Iowa October 18, 2007 Luke Tierney (U. of Iowa) Batch Scheduling and Resource Management
More informationIsolating Cluster Jobs for Performance and Predictability
Isolating Cluster Jobs for Performance and Predictability Brooks Davis Enterprise Information Systems The Aerospace Corporation BSDCan 2009 Ottawa, Canada May 8-9, 2009 The Aerospace
More informationBigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
More informationSolution Guide Parallels Virtualization for Linux
Solution Guide Parallels Virtualization for Linux Overview Created in 1991, Linux was designed to be UNIX-compatible software that was composed entirely of open source or free software components. Linux
More informationThe Application Level Placement Scheduler
The Application Level Placement Scheduler Michael Karo 1, Richard Lagerstrom 1, Marlys Kohnke 1, Carl Albing 1 Cray User Group May 8, 2006 Abstract Cray platforms present unique resource and workload management
More informationTechnical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment
Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,
More informationMitglied der Helmholtz-Gemeinschaft. System monitoring with LLview and the Parallel Tools Platform
Mitglied der Helmholtz-Gemeinschaft System monitoring with LLview and the Parallel Tools Platform November 25, 2014 Carsten Karbach Content 1 LLview 2 Parallel Tools Platform (PTP) 3 Latest features 4
More informationBatch Scheduling on the Cray XT3
Batch Scheduling on the Cray XT3 Chad Vizino, Nathan Stone, John Kochmar, J. Ray Scott {vizino,nstone,kochmar,scott}@psc.edu Pittsburgh Supercomputing Center ABSTRACT: The Pittsburgh Supercomputing Center
More informationVirtualization 101: Technologies, Benefits, and Challenges. A White Paper by Andi Mann, EMA Senior Analyst August 2006
Virtualization 101: Technologies, Benefits, and Challenges A White Paper by Andi Mann, EMA Senior Analyst August 2006 Table of Contents Introduction...1 What is Virtualization?...1 The Different Types
More informationBest Practices for VMware ESX Server 2
Best Practices for VMware ESX Server 2 2 Summary VMware ESX Server can be deployed in many ways. In this document, we recommend specific deployment guidelines. Following these guidelines will maximize
More informationVirtual Private Systems for FreeBSD
Virtual Private Systems for FreeBSD Klaus P. Ohrhallinger 06. June 2010 Abstract Virtual Private Systems for FreeBSD (VPS) is a novel virtualization implementation which is based on the operating system
More informationSurvey on Job Schedulers in Hadoop Cluster
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 1 (Sep. - Oct. 2013), PP 46-50 Bincy P Andrews 1, Binu A 2 1 (Rajagiri School of Engineering and Technology,
More information3 Red Hat Enterprise Linux 6 Consolidation
Whitepaper Consolidation EXECUTIVE SUMMARY At this time of massive and disruptive technological changes where applications must be nimbly deployed on physical, virtual, and cloud infrastructure, Red Hat
More informationFeature Comparison. Windows Server 2008 R2 Hyper-V and Windows Server 2012 Hyper-V
Comparison and Contents Introduction... 4 More Secure Multitenancy... 5 Flexible Infrastructure... 9 Scale, Performance, and Density... 13 High Availability... 18 Processor and Memory Support... 24 Network...
More informationDeploying and Optimizing SQL Server for Virtual Machines
Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Much has been written over the years regarding best practices for deploying Microsoft SQL
More informationCloud Computing through Virtualization and HPC technologies
Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC
More informationGeneral Overview. Slurm Training15. Alfred Gil & Jordi Blasco (HPCNow!)
Slurm Training15 Agenda 1 2 3 About Slurm Key Features of Slurm Extending Slurm Resource Management Daemons Job/step allocation 4 5 SMP MPI Parametric Job monitoring Accounting Scheduling Administration
More informationPerformance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
More informationMPI / ClusterTools Update and Plans
HPC Technical Training Seminar July 7, 2008 October 26, 2007 2 nd HLRS Parallel Tools Workshop Sun HPC ClusterTools 7+: A Binary Distribution of Open MPI MPI / ClusterTools Update and Plans Len Wisniewski
More informationWhite Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux
White Paper Real-time Capabilities for Linux SGI REACT Real-Time for Linux Abstract This white paper describes the real-time capabilities provided by SGI REACT Real-Time for Linux. software. REACT enables
More informationAn Oracle White Paper August 2011. Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability
An Oracle White Paper August 2011 Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability Note This whitepaper discusses a number of considerations to be made when
More informationAutomatic Software Updates on Heterogeneous Clusters with STACI
Automatic Software Updates on Heterogeneous Clusters with STACI Michael Shuey Linux Developer and Administrator LCI: The HPC Revolution May 19, 2004 Outline Introduction Common maintenance problems STACI
More informationProvisioning and Resource Management at Large Scale (Kadeploy and OAR)
Provisioning and Resource Management at Large Scale (Kadeploy and OAR) Olivier Richard Laboratoire d Informatique de Grenoble (LIG) Projet INRIA Mescal 31 octobre 2007 Olivier Richard ( Laboratoire d Informatique
More informationParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008
ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element
More informationPage 1 of 5. IS 335: Information Technology in Business Lecture Outline Operating Systems
Lecture Outline Operating Systems Objectives Describe the functions and layers of an operating system List the resources allocated by the operating system and describe the allocation process Explain how
More informationPEPPERDATA IN MULTI-TENANT ENVIRONMENTS
..................................... PEPPERDATA IN MULTI-TENANT ENVIRONMENTS technical whitepaper June 2015 SUMMARY OF WHAT S WRITTEN IN THIS DOCUMENT If you are short on time and don t want to read the
More informationAn Oracle White Paper August 2010. Beginner's Guide to Oracle Grid Engine 6.2
An Oracle White Paper August 2010 Beginner's Guide to Oracle Grid Engine 6.2 Executive Overview...1 Introduction...1 Chapter 1: Introduction to Oracle Grid Engine...3 Oracle Grid Engine Jobs...3 Oracle
More information:Introducing Star-P. The Open Platform for Parallel Application Development. Yoel Jacobsen E&M Computing LTD yoel@emet.co.il
:Introducing Star-P The Open Platform for Parallel Application Development Yoel Jacobsen E&M Computing LTD yoel@emet.co.il The case for VHLLs Functional / applicative / very high-level languages allow
More informationTen Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief
TM Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief Many Maui users make the switch to Moab each year for key scalability, capability and support advantages that help
More informationSAS Grid: Grid Scheduling Policy and Resource Allocation Adam H. Diaz, IBM Platform Computing, Research Triangle Park, NC
Paper BI222012 SAS Grid: Grid Scheduling Policy and Resource Allocation Adam H. Diaz, IBM Platform Computing, Research Triangle Park, NC ABSTRACT This paper will discuss at a high level some of the options
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationThe Moab Scheduler. Dan Mazur, McGill HPC daniel.mazur@mcgill.ca Aug 23, 2013
The Moab Scheduler Dan Mazur, McGill HPC daniel.mazur@mcgill.ca Aug 23, 2013 1 Outline Fair Resource Sharing Fairness Priority Maximizing resource usage MAXPS fairness policy Minimizing queue times Should
More informationAn Oracle White Paper November 2010. Deploying SAP NetWeaver Master Data Management on Oracle Solaris Containers
An Oracle White Paper November 2010 Deploying SAP NetWeaver Master Data Management on Oracle Solaris Containers Executive Overview...1 Application overview: Oracle Solaris Containers Overview...2 Oracle
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationMOSIX: High performance Linux farm
MOSIX: High performance Linux farm Paolo Mastroserio [mastroserio@na.infn.it] Francesco Maria Taurino [taurino@na.infn.it] Gennaro Tortone [tortone@na.infn.it] Napoli Index overview on Linux farm farm
More informationA Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing
A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing N.F. Huysamen and A.E. Krzesinski Department of Mathematical Sciences University of Stellenbosch 7600 Stellenbosch, South
More informationParallel Computing using MATLAB Distributed Compute Server ZORRO HPC
Parallel Computing using MATLAB Distributed Compute Server ZORRO HPC Goals of the session Overview of parallel MATLAB Why parallel MATLAB? Multiprocessing in MATLAB Parallel MATLAB using the Parallel Computing
More informationMonitoring Microsoft Exchange to Improve Performance and Availability
Focus on Value Monitoring Microsoft Exchange to Improve Performance and Availability With increasing growth in email traffic, the number and size of attachments, spam, and other factors, organizations
More informationSAS deployment on IBM Power servers with IBM PowerVM dedicated-donating LPARs
SAS deployment on IBM Power servers with IBM PowerVM dedicated-donating LPARs Narayana Pattipati IBM Systems and Technology Group ISV Enablement January 2013 Table of contents Abstract... 1 IBM PowerVM
More information2. is the number of processes that are completed per time unit. A) CPU utilization B) Response time C) Turnaround time D) Throughput
Import Settings: Base Settings: Brownstone Default Highest Answer Letter: D Multiple Keywords in Same Paragraph: No Chapter: Chapter 5 Multiple Choice 1. Which of the following is true of cooperative scheduling?
More informationJoramMQ, a distributed MQTT broker for the Internet of Things
JoramMQ, a distributed broker for the Internet of Things White paper and performance evaluation v1.2 September 214 mqtt.jorammq.com www.scalagent.com 1 1 Overview Message Queue Telemetry Transport () is
More informationOpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
More informationCPU Scheduling Outline
CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different
More informationPBS Tutorial. Fangrui Ma Universit of Nebraska-Lincoln. October 26th, 2007
PBS Tutorial Fangrui Ma Universit of Nebraska-Lincoln October 26th, 2007 Abstract In this tutorial we gave a brief introduction to using PBS Pro. We gave examples on how to write control script, and submit
More informationCSE 120 Principles of Operating Systems. Modules, Interfaces, Structure
CSE 120 Principles of Operating Systems Fall 2000 Lecture 3: Operating System Modules, Interfaces, and Structure Geoffrey M. Voelker Modules, Interfaces, Structure We roughly defined an OS as the layer
More informationComputing in High- Energy-Physics: How Virtualization meets the Grid
Computing in High- Energy-Physics: How Virtualization meets the Grid Yves Kemp Institut für Experimentelle Kernphysik Universität Karlsruhe Yves Kemp Barcelona, 10/23/2006 Outline: Problems encountered
More information1 Organization of Operating Systems
COMP 730 (242) Class Notes Section 10: Organization of Operating Systems 1 Organization of Operating Systems We have studied in detail the organization of Xinu. Naturally, this organization is far from
More informationScaling LS-DYNA on Rescale HPC Cloud Simulation Platform
Scaling LS-DYNA on Rescale HPC Cloud Simulation Platform Joris Poort, President & CEO, Rescale, Inc. Ilea Graedel, Manager, Rescale, Inc. 1 Cloud HPC on the Rise 1.1 Background Engineering and science
More informationWindows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration Table of Contents Overview of Windows Server 2008 R2 Hyper-V Features... 3 Dynamic VM storage... 3 Enhanced Processor Support... 3 Enhanced Networking Support...
More informationAn Oracle White Paper July 2011. Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide
Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide An Oracle White Paper July 2011 1 Disclaimer The following is intended to outline our general product direction.
More informationManaging a Fibre Channel Storage Area Network
Managing a Fibre Channel Storage Area Network Storage Network Management Working Group for Fibre Channel (SNMWG-FC) November 20, 1998 Editor: Steven Wilson Abstract This white paper describes the typical
More informationMultiprocessor Scheduling and Scheduling in Linux Kernel 2.6
Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and
More informationCloud Computing Capacity Planning. Maximizing Cloud Value. Authors: Jose Vargas, Clint Sherwood. Organization: IBM Cloud Labs
Cloud Computing Capacity Planning Authors: Jose Vargas, Clint Sherwood Organization: IBM Cloud Labs Web address: ibm.com/websphere/developer/zones/hipods Date: 3 November 2010 Status: Version 1.0 Abstract:
More informationChapter 1 - Web Server Management and Cluster Topology
Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management
More informationSymmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
More informationIaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures
IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction
More informationHigh Availability of the Polarion Server
Polarion Software CONCEPT High Availability of the Polarion Server Installing Polarion in a high availability environment Europe, Middle-East, Africa: Polarion Software GmbH Hedelfinger Straße 60 70327
More information