Operational Numerical Weather Prediction Job Scheduling at the Petascale

Jason Coverston 1, Stephen Gombosi 2, Peter Johnsen 1, Per Nyberg 3, Thomas Lorenzen 4, Piush Patel 2, Scott Suchyta 2

1 Cray Inc., 380 Jackson Street, Suite 210, St. Paul, MN 55101, USA
2 Altair Engineering Inc., 1820 Big Beaver Road, Troy, MI 48083, USA
3 Cray Inc., 273 Ch. du Bord-du-Lac, Suite C, Pointe-Claire, QC, H9S 4L1, Canada
4 Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen E, Denmark

Abstract: Several operational numerical weather prediction (NWP) centers will approach a petaflop of peak performance by early 2012, presenting several system operation challenges. An evolution in system utilization strategies, along with advanced scheduling technologies, is needed to exploit these breakthroughs in computational speed while improving Quality of Service (QoS) and system utilization rates. The Cray XE6 supercomputer in conjunction with Altair PBS Professional provides a rich scheduling environment designed to support and maximize the specific features of the Cray architecture. Advantages of this model include avoidance of system thrashing, increased predictability in the scheduling model, and reliable and repeatable runtimes benefiting both operational and research users.

Keywords: Job scheduling, petascale computing, numerical weather prediction, Cray XE6, Altair PBS Professional

1. Introduction

Several operational numerical weather prediction (NWP) centers will approach a petaflop of peak performance by early 2012, presenting several system operation challenges. An evolution in system utilization strategies, along with advanced scheduling technologies, is needed to exploit these breakthroughs in computational speed while improving Quality of Service (QoS) and system utilization rates. An operational NWP workload is composed of a large number of programs which typically have dependencies on one another and must be completed within a defined period of time.
Performance requirements are characterized along several dimensions, including application performance for both single large deterministic forecasts and multi-member ensemble prediction systems, as well as QoS for operational and research workloads. In particular, the QoS for operations is focused on guaranteed resources and execution times to meet fixed production schedules. As processor core counts and main memory sizes continue to grow, system administrators face a number of challenges, such as achieving reliable runtimes to meet forecast schedules, maximizing system utilization with a mixed operational and research workload, and maintaining QoS in environments with increasing fault rates. Users also face challenges as their jobs and resource requirements grow. Ultimately, the system administrator's ability to efficiently schedule resources is highly dependent on accurate information from users. Traditional scheduling strategies examine job priorities at single points in time. The system will have no advance knowledge of the arrival of a high

priority job and will allocate the necessary resources through the suspension, checkpointing, swapping or killing of lower priority jobs. This is an invasive approach and can result in system thrashing and the appearance of artificially high system utilization. In addition, with main memory on petascale systems exceeding hundreds of terabytes, the time to checkpoint or swap large portions of memory can be significant and will reduce the predictability of the scheduling model. Memory-resident solutions such as suspend/resume carry cost implications for additional memory, and process-level scheduling can result in jitter for large-scale applications.

2. Characterization of the Operational NWP Environment

Operational NWP suites are composed of a range of forecasting systems, including regional and global modeling, regional environmental modeling, seasonal coupled ocean-atmosphere climate modeling and wave modeling. Both deterministic and ensemble prediction systems (EPS) can be implemented in each case, as illustrated in Figure 1. While the forecast models are the largest single components, a substantial number of supporting pre- and post-processing jobs are essential for the successful creation of a forecast product. Furthermore, in the case of an EPS, all members must complete before the overall job can proceed. Daily operational runs begin at pre-determined times based on the arrival and processing of observational data. Unscheduled delays or emergency response models can occur and will impact the operational schedule. HPC resources are typically shared with an unpredictable research workload. The HPC system's ability to schedule and complete high priority tasks in a timely manner is therefore essential in the daily production of forecast products and environmental emergency response models.
The job scheduler should provide priority-based scheduling for operational jobs so that time-critical tasks are completed in predictable execution times and are not subjected to unexpected and undesirable delays. The HPC resource management system should ensure that the necessary resources are available when needed and are guaranteed for the duration of the operational job.

Figure 1: Range of Application Requirements from Deterministic to Ensemble Prediction Systems

3. Cray XE6 Supercomputer & PBS Professional Resource and Scheduling Environment

The Cray XE6 system provides a rich scheduling environment that is designed to support and maximize the specific features of the architecture. The Application Level Placement Scheduler (ALPS) [2] is a Cray-developed tool that provides application placement, launch and management functionality for all applications, whether interactive or batch. For batch jobs, it works cooperatively with PBS Professional, which makes scheduling decisions and enforces policy. PBS Professional will guarantee the availability

of resources reserved in advance for the operational workload, while using backfilling to optimize the usage of non-reserved resources.

Figure 2: Cray ALPS and PBS Professional Resource Management System

At the application level, the primary strategy on the Cray XE6 system is to provide the necessary resources for an application to execute with minimal, or preferably no, system intervention. This strategy is a key component of the reasoning behind a lightweight kernel on compute nodes. As with the effects of OS jitter, any negative influence on scaling through additional overhead will reduce application performance and overall system efficiency. As such, scheduling at the process level is avoided. In contrast with systems composed of large SMP nodes, an MPP does not share the resources within a node, and there is no contention for resources. Once an application is started on a set of nodes, all the resources of those nodes are fully dedicated to that application. In the optimal case, these resources are allocated until the application completes, and there is no need for process-level management. Benefits of this model include avoidance of system thrashing, increased predictability in the scheduling model, and reliable and repeatable runtimes benefiting both operational and research users.

Forward-looking scheduling strategies minimize system thrashing and maximize efficiency. Two key technologies are advance reservation and backfilling. Advance reservation is a proactive approach to system scheduling that provides the scheduler with information about future high priority resource requirements; the scheduler can thus guarantee QoS for the high priority operational workload. Backfilling is a scheduling feature that optimizes space sharing by examining the available resources against outstanding job requests.
Operational and research workloads can then be managed together, providing both long-range and urgent scheduling abilities, while minimizing the need for invasive or harmful scheduling. Preemptive scheduling, another available technology, is only used to launch unplanned critical jobs by allowing the scheduler to suspend, checkpoint or kill lower priority running jobs in order to free up the necessary resources. For preemptive scheduling to be effective, it is recommended that the application provide a method for periodic internal checkpointing and that the scheduler be configured to restart the job when the unplanned critical job exits the system. This method is employed by a number of leading NWP centers.
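The victim-selection step of preemption can be sketched in a few lines. The function below is an illustrative model only, not PBS Professional code: it frees nodes for an unplanned critical job by preempting the lowest-priority running jobs first, stopping as soon as enough nodes are released.

```python
def pick_preemption_victims(running, needed_nodes):
    """Choose lower-priority running jobs to preempt so that at least
    `needed_nodes` compute nodes are freed for an unplanned critical job.
    Simplified model: preempt lowest-priority jobs first, as few as possible.
    Each running job is a (name, priority, nodes) tuple."""
    victims, freed = [], 0
    for name, priority, nodes in sorted(running, key=lambda job: job[1]):
        if freed >= needed_nodes:
            break
        victims.append(name)
        freed += nodes
    return victims, freed

# Hypothetical workload: three research jobs of differing priority.
running = [("res_A", 10, 24), ("res_B", 5, 16), ("res_C", 20, 8)]
print(pick_preemption_victims(running, 20))  # prints (['res_B', 'res_A'], 40)
```

A real scheduler would also weigh preemption cost (checkpoint size, restart time), which is exactly why the paper recommends periodic internal checkpointing in the application.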

4. Advance Reservation and Backfill Features

Advance reservation [3] is a feature that uses execution time (i.e., walltime) predictions supplied by the user to provide a temporal QoS. Resources can be reserved to provide both planned and urgent scheduling abilities. As a proactive approach to managing an operational workload, it obviates the need for invasive or harmful scheduling under normal conditions. The availability and duration of resources are guaranteed for the planned operational workload. In addition, express queues can ensure that a high priority or emergency job is ranked first over all jobs for which resources have not been reserved in advance.

Advance reservations may be created by authorized users via the pbs_rsub [4] command. The creator of the reservation requests the number and type of resources to be reserved, using the same syntax used on job submission, as well as the time period for which those resources are to be reserved. Once the reservation is confirmed by the scheduler, jobs may be submitted to it as if it were a normal queue. Existing queued jobs may also be moved into the reservation.

Standing reservations extend the advance reservation capability by allowing the authorized user to create recurring reservations. They are also created with the pbs_rsub command, with the addition of a recurrence rule specified with a -r argument. Recurrence rules are expressed in the standard iCalendar (RFC 2445) syntax used by most calendaring software. Such reservations are extremely useful for scheduling time-critical work that must run at regular intervals, such as periodic runs of forecast models. By default, submission to the reservation is restricted to the user who created it. The creating user may explicitly permit others to submit to the reservation by specifying an authorized user and/or group list to pbs_rsub.
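For illustration, the daily recurrence rule used later in this paper, FREQ=DAILY;COUNT=30, can be expanded into concrete reservation start times. The sketch below is not part of PBS Professional and handles only this minimal subset of RFC 2445:

```python
from datetime import datetime, timedelta

def expand_daily_rule(first_start, rrule):
    """Expand a minimal iCalendar-style rule such as 'FREQ=DAILY;COUNT=30'
    into the list of reservation start times. Illustrative only: real
    RFC 2445 rules support many more keywords than are handled here."""
    parts = dict(kv.split("=") for kv in rrule.split(";"))
    if parts.get("FREQ") != "DAILY":
        raise ValueError("this sketch only handles FREQ=DAILY")
    count = int(parts.get("COUNT", 1))
    return [first_start + timedelta(days=i) for i in range(count)]

# A 21:00 standing reservation recurring daily for 30 days (dates are hypothetical).
starts = expand_daily_rule(datetime(2011, 7, 1, 21, 0), "FREQ=DAILY;COUNT=30")
print(len(starts), starts[0].strftime("%H:%M"))  # prints 30 21:00
```

In PBS Professional itself this expansion is done server-side; the user only supplies the rule string to pbs_rsub -r together with the PBS_TZID timezone variable.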
The ability to create reservations may be restricted by the system administrators to specific users, groups, and/or originating hosts, or it may be disabled altogether.

Backfill [3] significantly improves resource utilization, turnaround for smaller jobs, and overall system throughput by packing lower priority jobs into scheduling gaps. If there are insufficient resources (number of nodes, processors, and wallclock time) to run the next highest priority job, the scheduler will attempt to backfill jobs from further down the priority queue. The aggressiveness of the backfill algorithm may be controlled through the strict_ordering configuration file parameter in conjunction with the backfill_depth server attribute. If the strict_ordering parameter is set in the scheduler configuration file, only jobs that will not delay the running of the top priority queued job(s) will be backfilled. The number of top jobs that are guaranteed not to be delayed by backfill is specified by the backfill_depth server attribute, which may be set by the administrator with the qmgr command. If not explicitly set, backfill_depth defaults to 1. Jobs that require a small number of nodes and/or a short amount of wallclock time can be backfilled more readily than jobs requiring more resources. For this reason, accurate specification of resource requirements on a job will typically improve its turnaround time significantly, in addition to allowing the scheduler to optimize overall system throughput. Backfill is enabled by setting the scheduler configuration file parameter backfill to true, and may be used in conjunction with any of the scheduling algorithms in the PBS Professional scheduler (e.g., fairshare, tunable formula, resource/priority sorting, preemptive scheduling). Many sites running grand challenge problems set the scheduler to favor large, long-running jobs and rely on backfill to schedule shorter jobs.
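The core backfill decision can be sketched as follows. This is a simplified model of conservative backfill, not PBS Professional code: a lower-priority job may start immediately only if it fits in the free nodes and does not delay the planned ("shadow") start of the protected top-priority work.

```python
def can_backfill(req_nodes, req_walltime, now, free_nodes, shadow_start, shadow_nodes):
    """Decide whether a lower-priority job may start immediately.

    Simplified conservative-backfill model: the job must fit in the
    currently free nodes, and it must not delay the top-priority job,
    whose planned (shadow) start time and node requirement are given.
    Times are in minutes for simplicity.
    """
    if req_nodes > free_nodes:
        return False                       # does not fit at all
    if now + req_walltime <= shadow_start:
        return True                        # finishes before the shadow time
    # Runs past the shadow time: allowed only if enough nodes remain
    # for the top-priority job even while this one is running.
    return free_nodes - req_nodes >= shadow_nodes

# Numbers from the demonstration in section 7: 64 nodes total, 48 busy,
# and a 14-node advance reservation starting in 10 minutes.
print(can_backfill(10, 12, now=0, free_nodes=16, shadow_start=10, shadow_nodes=14))  # prints False
```

This is exactly why accurate user-supplied walltimes matter: a 9-minute request with the same node count would have been backfilled, since it finishes before the reservation begins.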

Figure 3: Advance Reservation and Backfill Functionalities

5. Job Array Feature

A job array [4] represents a collection of jobs, or sub-jobs, which differ only by a single index parameter, analogous to an EPS. The job array gives the user a mechanism for grouping related work, making it possible to submit, query, modify and display the set as a single unit. It also offers a way to improve performance, because the batch system can use certain known aspects of the collection for speedup. Sub-jobs are subject to the same scheduling policies (e.g., fairshare, tunable formula and resource/priority sorting) as individual jobs. Any user can submit a job array by using the qsub command. The user supplies a range, at submission, which is used to describe the sub-job indices. The range can be continuous (e.g., 1 to 100) or have a step function (e.g., every fifth up to 100). The index number is available in the sub-job's execution environment. By default, the submission of a job array is limited to 10,000 sub-jobs. The configurable attribute max_array_size [3] allows an administrator to limit the maximum number of sub-jobs that may be submitted within a single job array.

6. Scheduling with PBS Professional

The PBS Professional scheduler is highly configurable, permitting sites to easily implement custom scheduling behavior. The scheduler may be configured to implement entirely different policies for prime time, non-prime time or dedicated time. This configurability allows PBS Professional to maintain extremely high system utilization levels and overall throughput while providing quick turnaround of high-priority jobs.
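As an illustration of this per-time-slot configurability, a site might apply stricter job ordering during prime time and a looser policy off-hours. The excerpt below is a hypothetical scheduler configuration file (sched_config) fragment; parameter availability and defaults vary by PBS Professional version:

```text
# Hypothetical sched_config excerpt.
# Format: option: value [prime|non_prime|all]
backfill: True all              # pack lower-priority jobs into scheduling gaps
strict_ordering: True prime     # protect the top queued jobs during prime time
strict_ordering: False non_prime
by_queue: True all              # process queues individually, by priority
```

The companion backfill_depth server attribute discussed in section 4 would be set separately by the administrator, e.g. qmgr -c "set server backfill_depth = 5".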

Unlike older workload management systems, which primarily schedule jobs into fixed execution slots, PBS Professional is a resource-based scheduler. A resource in PBS Professional is an entity that can be described to the scheduler by an integer, a floating point number, a time, a size (i.e., something that can be described in bytes or words), a string, or an array of strings. While certain resources such as CPU or memory are built into PBS, sites may also define their own resources at will. Once defined, such site resources are treated identically to built-in resources for scheduling purposes. Site resources can be used to implement support for specialized hardware such as GPUs, to permit the scheduling of third-party software licenses, or to customize scheduling behavior for individual jobs.

Like many other schedulers, PBS Professional organizes incoming work into one or more queues. The scheduler may be configured either to aggregate work in all the queues into a common scheduling pool, or to process each queue individually in order of priority or on a round-robin basis. Sites can configure the PBS Professional scheduler to order jobs for execution in a variety of ways: jobs may be selected on a first-in, first-out basis, by a fair-share scheduling algorithm, by a hierarchical sort based on up to 20 resources, or by a tunable formula (a Python expression incorporating any resource known to PBS Professional). Backfill, preemption and reservations may be used in conjunction with any of these scheduling algorithms.

7. Demonstration (Advance Reservation and Backfill)

A simple demonstration of a simulated operational schedule on a Cray XT5 system with 64 compute nodes (768 AMD Opteron cores) illustrates the usage of the PBS Professional recurring advance reservation (i.e., standing reservation) and backfill functionalities.
Although this example is executed on a Cray XT5 system, the Cray XT5 and Cray XE6 supercomputers share the same system software environment.

Two standing reservations are defined to reserve computing resources for two operational forecast cycles, 00Z and 12Z. These reservations recur at the same time every day, at 21:00 and 21:30, for the next 30 days. Operational batch jobs that initiate Weather Research and Forecasting (WRF) model forecasts on 14 Cray XT5 nodes are then submitted to these queues, which hold the jobs until the specified run times. The resources will be available starting at 21:00 and 21:30 for these jobs and will be released for other purposes at the end of each reservation period. The PBS Professional pbs_rsub [4] command is used to create the reservations, and qsub [4] is used to submit the jobs to them. The pbs_rstat and qstat commands show the status of jobs in the reservation queues:

    > export PBS_TZID=America/Detroit
    > pbs_rsub -lmppwidth=8192 -lmpparch=xt -R 2100 -lwalltime=00:16:00 -r "FREQ=DAILY;COUNT=30"
    S.sdb UNCONFIRMED
    > pbs_rsub -lmppwidth=8192 -lmpparch=xt -R 2130 -lwalltime=00:16:00 -r "FREQ=DAILY;COUNT=30"
    S.sdb UNCONFIRMED
    > pbs_rstat
    Name    Queue  User      State  Start / Duration / End
    S.sdb   S      pjj@nid0  CO     Today 21:00 / 960 / Today 21:16

    S.sdb   S      pjj@nid0  CO     Today 21:30 / 960 / Today 21:46

    > qsub -q S -lmppwidth=8192 -N WRF_op_00Z qsub.script
    sdb
    > qsub -q S -lmppwidth=8192 -N WRF_op_12Z qsub.script
    sdb

    sdb  WRF_op_00Z  pjj  0  Q  S
    sdb  WRF_op_12Z  pjj  0  Q  S

Other research jobs can be submitted at any time and will run as long as resources are available for the entire length of the job. For this demonstration, various WRF research jobs are submitted to the regular work queue, called workq. At 20:50 a WRF research job is submitted that requests 48 compute nodes for a period of 30 minutes. This job launches immediately. While the 00Z operational job will start at 21:00, there are enough resources to ensure that there is no contention: with a total of 64 compute nodes and an advance reservation for 14 nodes, 50 nodes are available for other jobs while the 00Z job is executing.

Another WRF research job is immediately submitted, requesting 10 nodes for a period of 12 minutes. This job will not be launched, however, since the necessary resources are not available for the time requested. While 16 compute nodes are currently free, the advance reservation will commence in roughly 10 minutes and requires 14 compute nodes. Given the current state of queued jobs, this WRF research job will start after the 00Z operational job has completed, since there is a 14-minute window between the advance reservation slots. The queue status now shows the two upcoming advance reservations, the executing Research1_576 job and the queued Research2_120 job:

    sdb  WRF_op_00Z     pjj  0         Q  S
    sdb  WRF_op_12Z     pjj  0         Q  S
    sdb  Research1_576  pjj  00:00:01  R  workq
    sdb  Research2_120  pjj  0         Q  workq

At 21:00 the 00Z operational job is started.

    sdb  WRF_op_00Z     pjj  00:00:00  R  S
    sdb  WRF_op_12Z     pjj  0         Q  S
    sdb  Research1_576  pjj  00:00:01  R  workq
    sdb  Research2_120  pjj  0         Q  workq

Once the 00Z operational job completes, the resources are freed; the Research2_120 job is launched and the 00Z reservation queue is reset for the next day's cycle:

    sdb  WRF_op_12Z     pjj  0         Q  S
    sdb  Research1_576  pjj  00:00:02  R  workq
    sdb  Research2_120  pjj  00:00:00  R  workq

    > pbs_rstat
    Name    Queue  User      State  Start / Duration / End
    S.sdb   S      pjj@nid0  CO     Fri 21:00 / 960 / Fri 21:16
    S.sdb   S      pjj@nid0  CO     Today 21:30 / 960 / Today 21:46

At 21:30, research job 2 has completed and the 12Z operational job is started:

    sdb  WRF_op_12Z     pjj  00:00:00  R  S
    sdb  Research1_576  pjj  00:00:02  R  workq

The operational jobs in this demonstration actually submit further jobs to the advance reservation queue to run pre- and post-processing programs as well as the actual WRF forecast. The queue status during the operational run shows the forecast job as well as a post-processing program, all under control of the advance reservation queue:

    sdb  WRF_op_12Z   pjj  00:00:00  R  S
    sdb  WRFmidwest   pjj  00:00:01  R  S
    sdb  Wrf_PPS_d01  pjj  00:00:00  R  S

An overview of the workflow is shown in the following figure. Note that the job placement on compute nodes is for illustration purposes only; the actual placement is determined by ALPS.

Figure 4: Demonstration of Advance Reservation and Backfill Functionalities

8. Demonstration (Job Array)

This demonstration illustrates the submission, monitoring and management of a job array (climate ensembles) on a Cray XT5 system with 64 compute nodes (768 AMD Opteron cores). Although this example is executed on a Cray XT5 system, the Cray XT5 and Cray XE6 supercomputers share the same system software environment.

The PBS Professional qsub command is used to submit a job array. The syntax is similar to that for an individual job, but with an additional argument (-J) defining the job array range, which can be continuous or a step function.

    > qsub -lmppwidth=8 -J1-25 qsub.script
    [].sdb
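The -J range above expands into one sub-job per index. A minimal sketch of that expansion, assuming a start-end[:step] range spec (PBS itself interprets the real -J argument, so this parser is illustrative only):

```python
def expand_array_range(spec):
    """Expand a qsub -J style range spec into sub-job indices.
    Accepts 'start-end' or 'start-end:step'. Illustrative parser only."""
    if ":" in spec:
        bounds, step = spec.split(":")
        step = int(step)
    else:
        bounds, step = spec, 1
    start, end = (int(x) for x in bounds.split("-"))
    return list(range(start, end + 1, step))

print(len(expand_array_range("1-25")))        # prints 25
print(expand_array_range("1-100:5")[:3])      # prints [1, 6, 11]
```

Each resulting index is what the sub-job sees in its execution environment, which is how the 25 climate ensemble members in this demonstration differentiate themselves.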

Once the job array is successfully submitted, the qstat command with additional arguments (i.e., -p, -J, -t) can be used to monitor the status of the job array and its sub-jobs:

    > qstat
    [].sdb   qsub.script  jcovers  0        B  workq

    > qstat -p
    Job id   Name         User     % done      Queue
    [].sdb   qsub.script  jcovers  20       B  workq

    > qstat -t
    [].sdb   qsub.script  jcovers  0        B  workq
    [1].sdb  qsub.script  jcovers  0:10:01  X  workq
    [2].sdb  qsub.script  jcovers  0:09:56  X  workq
    [3].sdb  qsub.script  jcovers  0:10:02  X  workq
    [4].sdb  qsub.script  jcovers  0:09:58  X  workq
    [5].sdb  qsub.script  jcovers  0:10:13  X  workq
    [6].sdb  qsub.script  jcovers  0:09:20  R  workq
    [7].sdb  qsub.script  jcovers  0:09:20  R  workq
    [8].sdb  qsub.script  jcovers  0:09:20  R  workq
    [9].sdb  qsub.script  jcovers  0:09:20  R  workq
    [10].sdb qsub.script  jcovers  0:09:20  R  workq
    [11].sdb qsub.script  jcovers  0:09:20  R  workq
    [12].sdb qsub.script  jcovers  0:06:00  R  workq
    [13].sdb qsub.script  jcovers  0:06:00  R  workq
    [14].sdb qsub.script  jcovers  0:06:00  R  workq
    [15].sdb qsub.script  jcovers  0:03:00  R  workq
    [16].sdb qsub.script  jcovers  0:03:00  R  workq
    [17].sdb qsub.script  jcovers  0:03:00  R  workq
    [18].sdb qsub.script  jcovers  0:01:30  R  workq
    [19].sdb qsub.script  jcovers  0:01:30  R  workq
    [20].sdb qsub.script  jcovers  0:01:00  R  workq
    [21].sdb qsub.script  jcovers  0:00:50  R  workq

    [22].sdb qsub.script  jcovers  0:00:04  R  workq
    [23].sdb qsub.script  jcovers  0:00:04  R  workq
    [24].sdb qsub.script  jcovers  0:00:04  R  workq
    [25].sdb qsub.script  jcovers  0:00:04  R  workq

The PBS Professional qdel [4] command is used to terminate a job array, a sub-job or a job array range. When a job array is terminated, all sub-jobs receive the termination signal.

    > qdel [].sdb

9. Operational Job Scheduling at the Danish Meteorological Institute

The Danish Meteorological Institute (DMI) is responsible for serving the meteorological needs of society within the Kingdom of Denmark (Denmark, the Faroe Islands and Greenland), including territorial waters and airspace. This entails monitoring weather, climate and environmental conditions in the atmosphere, on land and at sea. The primary aim of these activities is to safeguard human life and property, as well as to provide a foundation for economic and environmental planning, especially within aviation, shipping and road traffic.

DMI's current operational system is composed of two independent Cray XT5 systems with a total performance of 40 teraflops, tightly integrated through an external Lustre global shared file system. The dual XT5 configuration offers complete redundancy and resiliency for operational and backup capabilities to support DMI's operational NWP mission. DMI's production workload is representative of most operational NWP centers, with forecast models along with many dependent types of products at its core. The range of operational products is updated several times during day and night to generate the best quality products possible. The full production scheme of numerical weather forecasts and associated products is run to a tight schedule, with forecasters and customers expecting delivery of updated products at certain deadlines. The system is shared between operations and research and development, with the former having maximum priority via PBS Professional advance reservations.
DMI has developed a local toolbox, the Cray Advance Reservation Scheduling (cars) setup, to extend the functionality of advance reservations to deal with unscheduled high priority jobs without the need for invasive resource preemption. This design provides guaranteed resources for production at both predefined and unscheduled times, ensuring timely delivery of forecast products. This work is described in the paper "Producing Weather Forecasts on Time in Denmark Using PBS Professional" by Thomas Lorenzen et al. [5]. Key elements of cars are described in the following paragraphs.

The basic approach of the cars framework is to over-allocate a small number of resource blocks in time to account for runtime jitter and minor production disturbances. When the production chain completes, cars finishes by releasing the remaining reserved resources so that research and development jobs can fill in the gaps using the backfill feature. Production time slots nominally occupy three hours of numerical compute production time, leaving three hours of non-production time before the next time slot. In cases of production disturbances, that time is used to catch up before the next scheduled runs, minimizing delays to future production. Reservations are made in a back-to-back fashion, where each reservation spans the full six-hour time slot until the next scheduled forecast. In the case of production delays, part or all of the

extended time will be used. The reservation will be released by cars as soon as the production chain completes.

The cars framework has been in mostly unattended operation at DMI for nearly three years and has successfully fulfilled DMI's operational scheduling requirements. Areas for further improvement have been identified, and investigation is ongoing to increase the flexibility of the cars setup and the underlying PBS Professional reservations facility. Readers are encouraged to read the full paper.

10. Conclusions

An evolution in system utilization strategies, along with advanced scheduling technologies, is needed to exploit the breakthroughs in computational speed available to operational NWP centers, improve the QoS for operational and research users, and improve overall system utilization rates. The Cray XE6 supercomputer in conjunction with PBS Professional provides a rich scheduling environment designed to support and maximize the specific features of the Cray architecture. High priority operational tasks are completed in a timely manner to meet the requirements of daily forecast products and unscheduled environmental emergency response models. In addition, overall HPC resources are efficiently scheduled to maintain a high level of system utilization.

References

[1] Cray Online Customer Documentation
[2] Workload Management and Application Placement for the Cray Linux Environment
[3] PBS Professional 10.4 Administrator's Guide
[4] PBS Professional 10.4 User's Guide
[5] Producing Weather Forecasts on Time in Denmark Using PBS Professional, Thomas Lorenzen (Danish Meteorological Institute), Thor Olason (Danish Meteorological Institute), Frithjov Iversen (Cray Inc.), Paolo Palazzi (Cray Inc.)

Cray Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the copyright owners. Cray is a registered trademark, and the Cray logo, Cray XE6, Cray XT6, Cray XT5, Cray XT, Cray XT4 and Cray XT3 are trademarks of Cray Inc. Other product and service names mentioned herein are the trademarks of their respective owners.


Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,

More information

Moab and TORQUE Highlights CUG 2015

Moab and TORQUE Highlights CUG 2015 Moab and TORQUE Highlights CUG 2015 David Beer TORQUE Architect 28 Apr 2015 Gary D. Brown HPC Product Manager 1 Agenda NUMA-aware Heterogeneous Jobs Ascent Project Power Management and Energy Accounting

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operatin g Systems: Internals and Design Principle s Chapter 11 I/O Management and Disk Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles An artifact can

More information

Batch Scheduling: A Fresh Approach

Batch Scheduling: A Fresh Approach Batch Scheduling: A Fresh Approach Nicholas P. Cardo, Sterling Software, Inc., Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, Moffett Field, CA ABSTRACT: The Network Queueing System

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Operating Systems. III. Scheduling. http://soc.eurecom.fr/os/

Operating Systems. III. Scheduling. http://soc.eurecom.fr/os/ Operating Systems Institut Mines-Telecom III. Scheduling Ludovic Apvrille ludovic.apvrille@telecom-paristech.fr Eurecom, office 470 http://soc.eurecom.fr/os/ Outline Basics of Scheduling Definitions Switching

More information

Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief

Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief TM Ten Reasons to Switch from Maui Cluster Scheduler to Moab HPC Suite Comparison Brief Many Maui users make the switch to Moab each year for key scalability, capability and support advantages that help

More information

An Oracle White Paper August 2010. Beginner's Guide to Oracle Grid Engine 6.2

An Oracle White Paper August 2010. Beginner's Guide to Oracle Grid Engine 6.2 An Oracle White Paper August 2010 Beginner's Guide to Oracle Grid Engine 6.2 Executive Overview...1 Introduction...1 Chapter 1: Introduction to Oracle Grid Engine...3 Oracle Grid Engine Jobs...3 Oracle

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

OPERATING SYSTEMS SCHEDULING

OPERATING SYSTEMS SCHEDULING OPERATING SYSTEMS SCHEDULING Jerry Breecher 5: CPU- 1 CPU What Is In This Chapter? This chapter is about how to get a process attached to a processor. It centers around efficient algorithms that perform

More information

Using WestGrid. Patrick Mann, Manager, Technical Operations Jan.15, 2014

Using WestGrid. Patrick Mann, Manager, Technical Operations Jan.15, 2014 Using WestGrid Patrick Mann, Manager, Technical Operations Jan.15, 2014 Winter 2014 Seminar Series Date Speaker Topic 5 February Gino DiLabio Molecular Modelling Using HPC and Gaussian 26 February Jonathan

More information

Adaptive Resource Optimizer For Optimal High Performance Compute Resource Utilization

Adaptive Resource Optimizer For Optimal High Performance Compute Resource Utilization Technical Backgrounder Adaptive Resource Optimizer For Optimal High Performance Compute Resource Utilization July 2015 Introduction In a typical chip design environment, designers use thousands of CPU

More information

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons

More information

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect a.jackson@epcc.ed.ac.uk EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training

More information

CPU Scheduling Outline

CPU Scheduling Outline CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different

More information

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010.

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010. Road Map Scheduling Dickinson College Computer Science 354 Spring 2010 Past: What an OS is, why we have them, what they do. Base hardware and support for operating systems Process Management Threads Present:

More information

BEGINNER'S GUIDE TO SUN GRID ENGINE 6.2

BEGINNER'S GUIDE TO SUN GRID ENGINE 6.2 BEGINNER'S GUIDE TO SUN GRID ENGINE 6.2 Installation and Configuration White Paper September 2008 Abstract This white paper will walk through basic installation and configuration of Sun Grid Engine 6.2,

More information

The Evolution of Cray Management Services

The Evolution of Cray Management Services The Evolution of Cray Management Services Tara Fly, Alan Mutschelknaus, Andrew Barry and John Navitsky OS/IO Cray, Inc. Seattle, WA USA e-mail: {tara, alanm, abarry, johnn}@cray.com Abstract Cray Management

More information

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run SFWR ENG 3BB4 Software Design 3 Concurrent System Design 2 SFWR ENG 3BB4 Software Design 3 Concurrent System Design 11.8 10 CPU Scheduling Chapter 11 CPU Scheduling Policies Deciding which process to run

More information

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta USENIX Association Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta Atlanta, Georgia, USA October 10 14, 2000 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION 2000 by The USENIX Association

More information

Grid Scheduling Architectures with Globus GridWay and Sun Grid Engine

Grid Scheduling Architectures with Globus GridWay and Sun Grid Engine Grid Scheduling Architectures with and Sun Grid Engine Sun Grid Engine Workshop 2007 Regensburg, Germany September 11, 2007 Ignacio Martin Llorente Javier Fontán Muiños Distributed Systems Architecture

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

NetFlow Collection and Processing Cartridge Pack User Guide Release 6.0

NetFlow Collection and Processing Cartridge Pack User Guide Release 6.0 [1]Oracle Communications Offline Mediation Controller NetFlow Collection and Processing Cartridge Pack User Guide Release 6.0 E39478-01 June 2015 Oracle Communications Offline Mediation Controller NetFlow

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

Introduction to Apache YARN Schedulers & Queues

Introduction to Apache YARN Schedulers & Queues Introduction to Apache YARN Schedulers & Queues In a nutshell, YARN was designed to address the many limitations (performance/scalability) embedded into Hadoop version 1 (MapReduce & HDFS). Some of the

More information

Resource Scheduling Best Practice in Hybrid Clusters

Resource Scheduling Best Practice in Hybrid Clusters Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti

More information

SAS Grid: Grid Scheduling Policy and Resource Allocation Adam H. Diaz, IBM Platform Computing, Research Triangle Park, NC

SAS Grid: Grid Scheduling Policy and Resource Allocation Adam H. Diaz, IBM Platform Computing, Research Triangle Park, NC Paper BI222012 SAS Grid: Grid Scheduling Policy and Resource Allocation Adam H. Diaz, IBM Platform Computing, Research Triangle Park, NC ABSTRACT This paper will discuss at a high level some of the options

More information

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS ..................................... PEPPERDATA IN MULTI-TENANT ENVIRONMENTS technical whitepaper June 2015 SUMMARY OF WHAT S WRITTEN IN THIS DOCUMENT If you are short on time and don t want to read the

More information

Navisphere Quality of Service Manager (NQM) Applied Technology

Navisphere Quality of Service Manager (NQM) Applied Technology Applied Technology Abstract Navisphere Quality of Service Manager provides quality-of-service capabilities for CLARiiON storage systems. This white paper discusses the architecture of NQM and methods for

More information

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems

CPU Scheduling. Basic Concepts. Basic Concepts (2) Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Basic Concepts Scheduling Criteria Scheduling Algorithms Batch systems Interactive systems Based on original slides by Silberschatz, Galvin and Gagne 1 Basic Concepts CPU I/O Burst Cycle Process execution

More information

An Oracle White Paper May 2012. Oracle Database Cloud Service

An Oracle White Paper May 2012. Oracle Database Cloud Service An Oracle White Paper May 2012 Oracle Database Cloud Service Executive Overview The Oracle Database Cloud Service provides a unique combination of the simplicity and ease of use promised by Cloud computing

More information

supercomputing. simplified.

supercomputing. simplified. supercomputing. simplified. INTRODUCING WINDOWS HPC SERVER 2008 R2 SUITE Windows HPC Server 2008 R2, Microsoft s third-generation HPC solution, provides a comprehensive and costeffective solution for harnessing

More information

Job scheduler details

Job scheduler details Job scheduler details Advanced Computing Center for Research & Education (ACCRE) Job scheduler details 1 / 25 Outline 1 Batch queue system overview 2 Torque and Moab 3 Submitting jobs (ACCRE) Job scheduler

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

Chapter 5 Process Scheduling

Chapter 5 Process Scheduling Chapter 5 Process Scheduling CPU Scheduling Objective: Basic Scheduling Concepts CPU Scheduling Algorithms Why Multiprogramming? Maximize CPU/Resources Utilization (Based on Some Criteria) CPU Scheduling

More information

v7.1 Technical Specification

v7.1 Technical Specification v7.1 Technical Specification Copyright 2011 Sage Technologies Limited, publisher of this work. All rights reserved. No part of this documentation may be copied, photocopied, reproduced, translated, microfilmed,

More information

HP reference configuration for entry-level SAS Grid Manager solutions

HP reference configuration for entry-level SAS Grid Manager solutions HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2

More information

SQL Server Business Intelligence on HP ProLiant DL785 Server

SQL Server Business Intelligence on HP ProLiant DL785 Server SQL Server Business Intelligence on HP ProLiant DL785 Server By Ajay Goyal www.scalabilityexperts.com Mike Fitzner Hewlett Packard www.hp.com Recommendations presented in this document should be thoroughly

More information

National Facility Job Management System

National Facility Job Management System National Facility Job Management System 1. Summary This document describes the job management system used by the NCI National Facility (NF) on their current systems. The system is based on a modified version

More information

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) ( TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Microsoft SQL Server 2008 R2 Enterprise Edition and Microsoft SharePoint Server 2010

Microsoft SQL Server 2008 R2 Enterprise Edition and Microsoft SharePoint Server 2010 Microsoft SQL Server 2008 R2 Enterprise Edition and Microsoft SharePoint Server 2010 Better Together Writer: Bill Baer, Technical Product Manager, SharePoint Product Group Technical Reviewers: Steve Peschka,

More information

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland The Lattice Project: A Multi-Model Grid Computing System Center for Bioinformatics and Computational Biology University of Maryland Parallel Computing PARALLEL COMPUTING a form of computation in which

More information

System Software for High Performance Computing. Joe Izraelevitz

System Software for High Performance Computing. Joe Izraelevitz System Software for High Performance Computing Joe Izraelevitz Agenda Overview of Supercomputers Blue Gene/Q System LoadLeveler Job Scheduler General Parallel File System HPC at UR What is a Supercomputer?

More information

ORACLE VM MANAGEMENT PACK

ORACLE VM MANAGEMENT PACK ORACLE VM MANAGEMENT PACK Effective use of virtualization promises to deliver significant cost savings and operational efficiencies. However, it does pose some management challenges that need to be addressed

More information

Announcements. Basic Concepts. Histogram of Typical CPU- Burst Times. Dispatcher. CPU Scheduler. Burst Cycle. Reading

Announcements. Basic Concepts. Histogram of Typical CPU- Burst Times. Dispatcher. CPU Scheduler. Burst Cycle. Reading Announcements Reading Chapter 5 Chapter 7 (Monday or Wednesday) Basic Concepts CPU I/O burst cycle Process execution consists of a cycle of CPU execution and I/O wait. CPU burst distribution What are the

More information

Operating Systems, 6 th ed. Test Bank Chapter 7

Operating Systems, 6 th ed. Test Bank Chapter 7 True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one

More information

Microsoft HPC. V 1.0 José M. Cámara (checam@ubu.es)

Microsoft HPC. V 1.0 José M. Cámara (checam@ubu.es) Microsoft HPC V 1.0 José M. Cámara (checam@ubu.es) Introduction Microsoft High Performance Computing Package addresses computing power from a rather different approach. It is mainly focused on commodity

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Improving Compute Farm Efficiency for EDA

Improving Compute Farm Efficiency for EDA Improving Compute Farm Efficiency for EDA Many IT managers report that the average utilization of their compute farms is just 50-60%. Neel Desai, product marketing manager, Lynx Design System, explains

More information

Analyzing IBM i Performance Metrics

Analyzing IBM i Performance Metrics WHITE PAPER Analyzing IBM i Performance Metrics The IBM i operating system is very good at supplying system administrators with built-in tools for security, database management, auditing, and journaling.

More information

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage sponsored by Dan Sullivan Chapter 1: Advantages of Hybrid Storage... 1 Overview of Flash Deployment in Hybrid Storage Systems...

More information

A CP Scheduler for High-Performance Computers

A CP Scheduler for High-Performance Computers A CP Scheduler for High-Performance Computers Thomas Bridi, Michele Lombardi, Andrea Bartolini, Luca Benini, and Michela Milano {thomas.bridi,michele.lombardi2,a.bartolini,luca.benini,michela.milano}@

More information

Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers

Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers Íñigo Goiri, J. Oriol Fitó, Ferran Julià, Ramón Nou, Josep Ll. Berral, Jordi Guitart and Jordi Torres

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

High-Performance Reservoir Risk Assessment (Jacta Cluster)

High-Performance Reservoir Risk Assessment (Jacta Cluster) High-Performance Reservoir Risk Assessment (Jacta Cluster) SKUA-GOCAD 2013.1 Paradigm 2011.3 With Epos 4.1 Data Management Configuration Guide 2008 2013 Paradigm Ltd. or its affiliates and subsidiaries.

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

An Oracle White Paper August 2011. Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability

An Oracle White Paper August 2011. Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability An Oracle White Paper August 2011 Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability Note This whitepaper discusses a number of considerations to be made when

More information

How To Make A Backup System More Efficient

How To Make A Backup System More Efficient Identifying the Hidden Risk of Data De-duplication: How the HYDRAstor Solution Proactively Solves the Problem October, 2006 Introduction Data de-duplication has recently gained significant industry attention,

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters

A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters A Comparative Performance Analysis of Load Balancing Algorithms in Distributed System using Qualitative Parameters Abhijit A. Rajguru, S.S. Apte Abstract - A distributed system can be viewed as a collection

More information

www.novell.com/documentation Jobs Guide Identity Manager 4.0.1 February 10, 2012

www.novell.com/documentation Jobs Guide Identity Manager 4.0.1 February 10, 2012 www.novell.com/documentation Jobs Guide Identity Manager 4.0.1 February 10, 2012 Legal Notices Novell, Inc. makes no representations or warranties with respect to the contents or use of this documentation,

More information

Cloud Management: Knowing is Half The Battle

Cloud Management: Knowing is Half The Battle Cloud Management: Knowing is Half The Battle Raouf BOUTABA David R. Cheriton School of Computer Science University of Waterloo Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph

More information

Fair Scheduler. Table of contents

Fair Scheduler. Table of contents Table of contents 1 Purpose... 2 2 Introduction... 2 3 Installation... 3 4 Configuration...3 4.1 Scheduler Parameters in mapred-site.xml...4 4.2 Allocation File (fair-scheduler.xml)... 6 4.3 Access Control

More information

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource

More information

HYPERION SYSTEM 9 N-TIER INSTALLATION GUIDE MASTER DATA MANAGEMENT RELEASE 9.2

HYPERION SYSTEM 9 N-TIER INSTALLATION GUIDE MASTER DATA MANAGEMENT RELEASE 9.2 HYPERION SYSTEM 9 MASTER DATA MANAGEMENT RELEASE 9.2 N-TIER INSTALLATION GUIDE P/N: DM90192000 Copyright 2005-2006 Hyperion Solutions Corporation. All rights reserved. Hyperion, the Hyperion logo, and

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

CA NSM System Monitoring. Option for OpenVMS r3.2. Benefits. The CA Advantage. Overview

CA NSM System Monitoring. Option for OpenVMS r3.2. Benefits. The CA Advantage. Overview PRODUCT BRIEF: CA NSM SYSTEM MONITORING OPTION FOR OPENVMS Option for OpenVMS r3.2 CA NSM SYSTEM MONITORING OPTION FOR OPENVMS HELPS YOU TO PROACTIVELY DISCOVER, MONITOR AND DISPLAY THE HEALTH AND AVAILABILITY

More information

Fair Scheduling Algorithm with Dynamic Load Balancing Using In Grid Computing

Fair Scheduling Algorithm with Dynamic Load Balancing Using In Grid Computing Research Inventy: International Journal Of Engineering And Science Vol.2, Issue 10 (April 2013), Pp 53-57 Issn(e): 2278-4721, Issn(p):2319-6483, Www.Researchinventy.Com Fair Scheduling Algorithm with Dynamic

More information

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015 CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015 1. Goals and Overview 1. In this MP you will design a Dynamic Load Balancer architecture for a Distributed System 2. You will

More information

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads 89 Fifth Avenue, 7th Floor New York, NY 10003 www.theedison.com @EdisonGroupInc 212.367.7400 IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads A Competitive Test and Evaluation Report

More information

How to control Resource allocation on pseries multi MCM system

How to control Resource allocation on pseries multi MCM system How to control Resource allocation on pseries multi system Pascal Vezolle Deep Computing EMEA ATS-P.S.S.C/ Montpellier FRANCE Agenda AIX Resource Management Tools WorkLoad Manager (WLM) Affinity Services

More information

MEETING THE CHALLENGES OF COMPLEXITY AND SCALE FOR MANUFACTURING WORKFLOWS

MEETING THE CHALLENGES OF COMPLEXITY AND SCALE FOR MANUFACTURING WORKFLOWS MEETING THE CHALLENGES OF COMPLEXITY AND SCALE FOR MANUFACTURING WORKFLOWS Michael Feldman White paper November 2014 MARKET DYNAMICS Modern manufacturing increasingly relies on advanced computing technologies

More information

LoadLeveler Overview. January 30-31, 2012. IBM Storage & Technology Group. IBM HPC Developer Education @ TIFR, Mumbai

LoadLeveler Overview. January 30-31, 2012. IBM Storage & Technology Group. IBM HPC Developer Education @ TIFR, Mumbai IBM HPC Developer Education @ TIFR, Mumbai IBM Storage & Technology Group LoadLeveler Overview January 30-31, 2012 Pidad D'Souza (pidsouza@in.ibm.com) IBM, System & Technology Group 2009 IBM Corporation

More information

CPU Scheduling. CPU Scheduling

CPU Scheduling. CPU Scheduling CPU Scheduling Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling

More information

Siebel Correspondence, Proposals, and Presentations Guide. Siebel Innovation Pack 2013 Version 8.1/8.2 September 2013

Siebel Correspondence, Proposals, and Presentations Guide. Siebel Innovation Pack 2013 Version 8.1/8.2 September 2013 Siebel Correspondence, Proposals, and Presentations Guide Siebel Innovation Pack 2013 Version 8.1/8.2 September 2013 Copyright 2005, 2013 Oracle and/or its affiliates. All rights reserved. This software

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Windows Server Virtualization An Overview

Windows Server Virtualization An Overview Microsoft Corporation Published: May 2006 Abstract Today s business climate is more challenging than ever and businesses are under constant pressure to lower costs while improving overall operational efficiency.

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

A Multi-criteria Job Scheduling Framework for Large Computing Farms

A Multi-criteria Job Scheduling Framework for Large Computing Farms A Multi-criteria Job Scheduling Framework for Large Computing Farms Ranieri Baraglia a,, Gabriele Capannini a, Patrizio Dazzi a, Giancarlo Pagano b a Information Science and Technology Institute - CNR

More information

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances:

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances: Scheduling Scheduling Scheduling levels Long-term scheduling. Selects which jobs shall be allowed to enter the system. Only used in batch systems. Medium-term scheduling. Performs swapin-swapout operations

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

Per-Flow Queuing Allot's Approach to Bandwidth Management

Per-Flow Queuing Allot's Approach to Bandwidth Management White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth

More information

Aqua Connect Load Balancer User Manual (Mac)

Aqua Connect Load Balancer User Manual (Mac) Aqua Connect Load Balancer User Manual (Mac) Table of Contents About Aqua Connect Load Balancer... 3 System Requirements... 4 Hardware... 4 Software... 4 Installing the Load Balancer... 5 Configuration...

More information

Advanced Techniques with Newton. Gerald Ragghianti Advanced Newton workshop Sept. 22, 2011

Advanced Techniques with Newton. Gerald Ragghianti Advanced Newton workshop Sept. 22, 2011 Advanced Techniques with Newton Gerald Ragghianti Advanced Newton workshop Sept. 22, 2011 Workshop Goals Gain independence Executing your work Finding Information Fixing Problems Optimizing Effectiveness

More information