Using Reservations to Implement Fixed Duration Node Allotment with PBS Professional
Brajesh Pande, Senior Computer Engineer, Computer Centre, IIT Kanpur, Kanpur, UP, India
Manoj Soni, Technical Consultant, Altair India, Chintels Techno Park, Kailash Colony, New Delhi, India
Dario Dorella, PBS Works Technical Specialist, Altair Engineering, c/o Environment Park, Via Livorno, Torino, Italy

Abbreviations: HPC (High Performance Computing), CPU (Central Processing Unit), ILO (Integrated Lights Out), GPU (Graphical Processing Unit), VLAN (Virtual Local Area Network), Gbps (gigabits per second), GBps (gigabytes per second), CMU (Cluster Management Utility), HP (Hewlett Packard), SAN (Storage Area Network)

Keywords: High Performance Computing, schedulers, PBS Professional, reservations, resource allocation, node allotment

Abstract

The Indian Institute of Technology Kanpur has a High Performance Computing facility that was in the Top 500 list [1] of June. PBS Professional, a part of the PBS Works [2] solution from Altair, finds interesting use in implementing a sequential queue on this facility. A user who submits a job to this queue is given non-interactive use of an 8-core compute node for a fixed period of time. The envisioned use of this allotment is to allow the user to run any number of sequential jobs on the given node, with the condition that the latest ending time of all such jobs be the same. It is expected that this queue be used only for sequential jobs, yet clever parallel uses of the queue are not ruled out. The desired functionality is achieved by creating a fixed-duration reservation for the user after the first job is submitted to the queue. The first and subsequent jobs of the same user on this queue run on the same reservation. We detail this implementation and touch upon some generalizations of this concept.
We also briefly describe the facility, its queuing structure, and other provisions that leverage PBS Professional functionality to ensure the smooth functioning of the facility.

Introduction

A High Performance Computing (HPC) machine typically consists of several compute nodes connected together through some interconnect. The compute nodes could be multi-CPU / multi-core machines or even GPU-based machines, while the interconnects could be memory paths, networks of varying speeds, or a combination of both. Multi-CPU / multi-core machines use the memory paths for intra-node communication and the network for inter-node communication. Additional hardware and various complex tools become ever more necessary as the number of nodes in the facility increases. Thus these machines also have service processors for ILO management, system deployment tools, tools for monitoring and management of nodes, and some head nodes for access and utility services, apart from the necessary CPUs, software, and operating systems. These machines also have a scheduler, the focus of this paper, that manages job distribution on the nodes. Schedulers view the machine as a collection of resources. A user uses the services of a scheduler to request resources satisfying some desired criteria, such as the number of compute nodes required, the amount of memory on the nodes, the duration for which the nodes are to be given, and the licenses for software that are needed. Resources can not only be mapped to physical objects but can also be virtualized, defined during the configuration of the scheduler, or dynamically discovered. If the requested resources are available, and the policy configured on the scheduler permits allocation, the available resources are released to the user. The scheduler restricts the usage of resources according to the configured policy. Schedulers maintain job queues with attributes that can be linked to the resources.
The job queues and the linkage between the resources and the queues are defined during the configuration of the queues. Jobs submitted by the user (resource requests) to a queue are either eventually scheduled on some resources by the scheduler, based on tunable internal algorithms that take the policies into account, or else wait in the queue.
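The matching of a resource request against available resources can be illustrated with a small sketch. This is a simplified, hypothetical model: the node attributes, request fields, and first-fit policy are assumptions made for illustration, not PBS Professional internals.

```python
# Simplified sketch of how a scheduler might match a resource request
# against the nodes it manages. Field names and the first-fit policy
# are illustrative assumptions, not PBS Professional internals.

def find_nodes(request, nodes):
    """Return the first set of free nodes satisfying the request, or None."""
    chosen = []
    for node in nodes:
        if (node["free"]
                and node["ncpus"] >= request["ncpus_per_node"]
                and node["mem_gb"] >= request["mem_gb_per_node"]):
            chosen.append(node["name"])
            if len(chosen) == request["nodes"]:
                return chosen
    return None  # not enough resources: the request waits in the queue

nodes = [
    {"name": "cn001", "ncpus": 8, "mem_gb": 48, "free": True},
    {"name": "cn002", "ncpus": 8, "mem_gb": 48, "free": False},
    {"name": "cn003", "ncpus": 8, "mem_gb": 48, "free": True},
]
request = {"nodes": 2, "ncpus_per_node": 8, "mem_gb_per_node": 24}
print(find_nodes(request, nodes))  # ['cn001', 'cn003']
```

A real scheduler layers policy checks, priorities, and backfilling on top of this basic feasibility test; the sketch shows only the resource-matching step.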
Since configuring the policies, the queues, and the resources on a scheduler can be a complex task, not all desired functionality can be accomplished by simple specifications. Good schedulers recognize this and provide hooks and interfaces that allow the scheduler to meet, with varying degrees of satisfaction, the user requirements after some programming and effort. Such functionality is driven by user needs. If such needs are general enough, they can, over a period of time, be incorporated within the framework of specifications of the scheduler. This paper focuses on a functionality desired for the HPC facility at IIT Kanpur, which uses PBS Professional as its scheduler. It describes the HPC facility in brief and talks about its queue structures and the regular use of PBS Professional in the facility. It then goes on to enunciate the need for the more complex functionality. This functionality implements what the authors have named a sequential queue. A user who submits a job to this queue is given non-interactive use of an 8-core compute node for a fixed period of time. The envisioned use of this allotment is to allow the user to run any number of sequential jobs on the given node, with the condition that the latest ending time of all such jobs be the same. This paper also provides a solution for the implementation of this functionality through the use of interfaces provided by PBS Professional.

Main HPC Facility at IIT Kanpur

Figure 1 shows the architecture of the HPC facility at IIT Kanpur [3]. The facility has 368 compute nodes, 4 management nodes, and 4 utility nodes. Each blade compute node is a BL280c G6 server with two quad-core Intel Nehalem CPUs and 48 GB of RAM. Each node has a 146 GB small-form-factor Serial ATA (SATA) disk for holding the boot images, operating system, and some necessary software. These nodes are connected together by several networks.
The first network connecting these nodes is a boot network, primarily used for booting the systems and for managerial functionality. The second is a management network that connects the ILO systems and the consoles of the various machines. These networks are represented by a single VLAN, shown in the figure as Ethernet & ILO. All the nodes are also connected by a QDR InfiniBand network from QLogic with a speed of 40 Gbps. This is a very fast interconnect, and the separation of management concerns from user I/O concerns makes for fast throughput for the users of the HPC system. Three file systems built on Modular Smart Array (MSA) P2000 G3 FC disk arrays from HP are also present in the facility. These arrays use the Fibre Channel network to connect to SAN switches that ultimately serve data for the file systems. The first file system, of 7.4 TB, is an NFS file system used for administrative purposes such as holding boot images and for miscellaneous storage. The second is a Lustre-based file system with high read and write throughput (around 3.4 GBps read and 2.4 GBps write). This file system is called the scratch file system and has 40 TB of capacity. There is also a slower home file system of 60 TB, again on Lustre (1.7 GBps read and 1.3 GBps write). The two Lustre file systems separate out user concerns: unlimited scratch space during compute operations versus a quota-based, slower home file system.
Figure 1: The architecture of the HPC facility at IIT Kanpur

The operating system on the machines is RHEL 5.4. The cluster uses the CMU tool from HP for management and monitoring of the systems and for deployment of the operating system and software on the nodes. It uses PBS Professional as the scheduler for jobs submitted by users. PBS Analytics and Compute Manager have also been deployed on the cluster. The machine is mostly used by students, faculty, and researchers of IIT Kanpur. Some users from other academic institutions also use the facility. Users come from almost all departments of the institute and use the facility for computational purposes through home-grown software, libraries, and compilers as well as available third-party software that includes, but is not limited to: Gaussian, TurboMole, Ansys, Matlab, Accelrys, CPMD, WRF, Metis, MedeaVasp, Laamps, Tinker, NAG parallel libraries and compilers, GAMS, gcc, the Intel suite of compilers, the PGI compiler from Portland, and many MPI libraries: variations of mvapich2 available from QLogic, openmp, and flavors of mpich2. This machine has a peak theoretical performance of TFLOP and a maximum performance of TFLOP, and was ranked 369 in the Top 500 list of June. A specific usage scenario of the machine is shown in Figure 2, which was prepared by pulling out data with the help of PBS Analytics. It shows that some reservations, apart from the main queues, are also in use, and that the average wait time for jobs is increasing, indicating a need for more compute nodes. The use of reservations in this facility is discussed further below.
Figure 2: Typical usage of the queues of the HPC facility

Use of PBS Professional in the HPC Facility

Currently the system uses PBS Professional version 11.2, which was upgraded from an earlier 10.4 release. A PBS failover server is also deployed for high availability. The system is divided into several queues, as shown in Table 1. These are discussed in brief before the focus shifts to the main queue under discussion, the sequential queue, labeled seq in the table.
TABLE I
QUEUES OF THE HPC FACILITY AT IIT KANPUR

Queue    Total Nodes   Min/Max Nodes   Min/Max Cores   CPU / Wall Time
         Available     Allowed         Allowed         Allowed
large                                                  Max / 72 hrs
medium                                                 Max / 96 hrs
small                                                  Max / 120 hrs
workq                                                  hrs / 48 hrs
test     3             any             any             Max / Max
seq      16            1               any             Max / 120 hrs

Table 1 shows the queues, the total nodes available in each queue, the minimum and maximum nodes allowed for a job, the minimum and maximum cores allowed for a job, and the maximum CPU and wall time allowed on the nodes. The idea behind these restrictions stems from the fact that, given a reasonably sized problem, most HPC problems do not scale beyond a certain number of nodes. Restrictions on wall time are necessary to give users fair access to the queues. There are currently no special privileges for any user of the cluster. Provision for allowing jobs with more cores or wall time is possible, but is reserved for the policy-making body for HPC usage. The large queue allows more cores but less wall time; the medium queue allows fewer cores but more wall time than the large queue; and the small queue allows even fewer cores but more wall time than the medium queue. This arrangement is well suited to scaling studies. The work queue (workq) is a special queue that allows both interactive and batch jobs. The idea behind this queue is that users land on it for day-to-day activities and are placed in a round-robin fashion on the nodes of the work queue. This load-balances the interactive usage of the queue. It is achieved by defining a static consumable resource on each node that is consumed when the node is in use. Allocation of a work-queue node is done after sorting on the available amount of this resource on each node. The round-robin allocation of batch processing in the work queue is a bit trickier and is implemented by a runtime hook, which also makes the work queue an interactive queue.
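The sorting-based placement described above can be sketched as follows. This is a simplified, self-contained illustration: the resource name, slot counts, and node names are our own assumptions, not the site's actual configuration. Always picking the node with the most units of the consumable resource left yields the round-robin behaviour.

```python
# Sketch of load-balanced placement on the work queue: each node exposes
# a static consumable resource; a job lands on the node with the most
# units still available. The "workslots" counts and node names are
# illustrative, not the actual site configuration.

def place_job(avail):
    """Pick the node with the most free slots; consume one slot."""
    node = max(avail, key=lambda n: avail[n])
    if avail[node] == 0:
        return None  # work queue fully loaded; the job waits
    avail[node] -= 1
    return node

avail = {"wq01": 8, "wq02": 8, "wq03": 8}  # workslots per node
placements = [place_job(avail) for _ in range(4)]
print(placements)  # ['wq01', 'wq02', 'wq03', 'wq01']
```

Because every placement decrements the chosen node's count, successive jobs cycle across the nodes, which is exactly the round-robin effect the consumable resource is meant to produce.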
The implementation of the work queue more or less follows standard implementation examples in the PBS Professional manual, with the necessary programming for the round-robin allocation. The total number of round-robin resources associated with the work queue decides the degree to which this queue can be loaded. The test queue is reserved for testing of new software and implementations. The sequential queue (seq) is the main focus of this paper and is discussed in more detail in the next section.

The Sequential Queue requirements

The need for the sequential queue arises from the fact that, though HPC machines are mostly used for parallel applications, the sequential needs of users cannot be ruled out, especially if the users belong to diverse computing environments. There may be legacy code as well as some code that is inherently sequential. Thus provision has to be made for such a queue. An obvious solution is to assign a core to any user who wants to use the sequential queue. The problem with this simplistic allocation policy is that the assigned core belongs to a machine whose memory is shared by the other cores. If the job that has been allotted the single core uses large amounts of memory, it would unfairly consume the memory allowed to a job on another core of the same node. This is clearly an undesirable situation and would ultimately slow down all jobs that contend for memory on the node. On severely affected nodes, almost all the memory and even the swap space would be used up, and the node would hang. A user using memory within limits would also be penalized with poor performance in this scenario. One way out of this dilemma would be to restrict the memory use of a single job on the sequential queue.
If we do a calculation with 48 GB of memory among 8 cores, and also take into account the memory reserved for the operating system, the amount of memory that could be allowed in such a scenario would be less than 6 GB per core. This seems a reasonable amount of memory for a sequential job. However, situations where sequential jobs need more memory do exist in high performance computing environments. This is especially true for some of the more complex research problems investigated in the HPC facility. Another solution that suggests itself is to allocate the entire node to a sequential job. However, this would waste computing cores if the job uses only a limited amount of memory. A reasonable optimization between these conflicting goals was needed. This conflict of core versus memory utilization would still arise in parallel submissions, but the severity would be less, as it is likely that at least one of the resources would be fully utilized.

Using Reservations for Sequential Queue Implementation

This section details the implementation of the sequential queue at the HPC facility of IIT Kanpur. As brought out in the previous section, conflicting issues of core usage versus memory usage were observed in trying to come up with any scheme for allotting nodes or cores to the sequential queue. The trick in engineering an optimal solution, suited to the needs of the HPC environment, lay in assigning the entire node to the sequential job for a fixed period of time while also allowing the same user to submit more jobs to the sequential queue, with the condition that all new jobs would run on the same node. This ensures that the latest ending time of all jobs submitted to the sequential queue is the same. This was the first solution that was acceptable, and it was implemented with the wall time restriction on the allocation of the node. Better solutions to this problem could be possible with better placement of jobs and optimization of memory usage. Having outlined the usage policy, the task was now to implement it. There were no ready-made mechanisms in PBS Professional that could provide the exact functionality requested. Thus the implementation of this kind of queue lay in the use of the reservation mechanics provided by PBS. Reservations allow the user to book resources that will be used subsequently. Once a reservation is granted, the user can move some of the jobs into the reservation.
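One pass of this create-then-move cycle can be sketched in Python. The production implementation is shell scripts over PBS Professional commands (pbs_rsub to create reservations, qmove to move jobs); in this sketch those calls are passed in as stubs so the control flow is self-contained, and the job and reservation field names are illustrative assumptions.

```python
# Python sketch of one pass over the sequential queue. The real PBS
# commands (pbs_rsub, qmove) are passed in as callables so the logic
# can run stand-alone; field and state names are illustrative.

CONFIRMED, RUNNING = "RESV_CONFIRMED", "RESV_RUNNING"

def run_cycle(jobs, reservations, create_resv, move_job):
    """Move jobs into existing reservations, then create reservations
    for users that have none; their jobs move on the next pass."""
    users_without_resv = []
    for job in jobs:                      # scan the sequential queue
        user = job["owner"]
        resv = reservations.get(user)
        if resv is not None:
            if resv["state"] in (CONFIRMED, RUNNING):
                move_job(job, resv)       # e.g. qmove <resv-queue> <jobid>
        elif user not in users_without_resv:
            users_without_resv.append(user)
    for user in users_without_resv:       # e.g. pbs_rsub per user
        create_resv(user)

# Tiny in-memory stand-ins for the PBS commands:
reservations = {"alice": {"state": CONFIRMED}}
moved, created = [], []
run_cycle(
    jobs=[{"id": "101.svr", "owner": "alice"},
          {"id": "102.svr", "owner": "bob"},
          {"id": "103.svr", "owner": "bob"}],
    reservations=reservations,
    create_resv=lambda user: created.append(user),
    move_job=lambda job, resv: moved.append(job["id"]),
)
print(moved, created)  # ['101.svr'] ['bob']
```

Note that bob's two jobs trigger only one reservation request, and they stay in the queue until a later pass finds his reservation confirmed; this matches the two-part separation of concerns described below.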
Reservations themselves are queues and can be in several states, such as Confirmed, Running, and Unconfirmed. In a sense, reservations once granted are a set of resources that can be thought of as a queue into which a job can be submitted or even moved from another queue. Thus, to implement the desired functionality of the sequential queue, a reservation of a fixed duration is created, and jobs are moved into the reservation from the queue if the reservation is in a confirmed state and the job belongs to the user who created the reservation. The procedure is sketched in Algorithm 1.

Algorithm 1: In every scheduling cycle, or through a periodic cron job, DO
  1. REPEAT for each job in the sequential queue:
       Find the user who submitted the job.
       IF the user has an existing reservation THEN
         IF the reservation is in the Confirmed or Running state THEN
           Move the job from the sequential queue into the reservation.
         ENDIF
       ELSE
         Append the user to a list of users with no reservations, if the
         user is not already present in the list.
       ENDIF
     UNTIL all jobs in the sequential queue have been processed.
  2. Create a reservation of a fixed duration for each user in the list
     of users with no reservations.

It is worth noting that for users without reservations, only the necessary reservations are created, without attempting to move the jobs into the respective reservations. This task is left for the next execution of Algorithm 1, which can run from a cron job or in a scheduling cycle. Also, for a clean separation of concerns, the task is divided into two parts: i) moving jobs into reservations and ii) creating reservations. This also minimizes the risk of out-of-order job movements into reservations. The actual code implementing the above algorithm takes into account efficiency of implementation, error checking, and logging needs. It can be a bit of a challenge to code this algorithm efficiently. An algorithm closely tied to Algorithm 1 is that of deleting idle reservations. It simply deletes any reservation that was idle for some k previous executions of the algorithm and is also currently idle. This algorithm can likewise be tied to the scheduling cycle or to a periodic cron job. It has been implemented with k currently set to 1. In the current implementation, all the algorithms are implemented as shell scripts using sed, awk, and the standard commands provided by PBS Professional for manipulating reservations and queues.

Post Implementation Observations and Generalizations for Future Studies

Though several prototypes were made before arriving at the algorithms above, the algorithms as outlined appear stable in the current HPC environment and provide the desired functionality. There are enough hooks and interfaces in PBS to implement them smoothly. In the implementation it is also possible to limit the maximum number of jobs moved into a reservation, though no upper limit needs to be kept on the sequential queue itself. Though the original algorithms were designed with sequential submissions in mind, it is observed that users can also submit a parallel job through a single submission of a sequential job that internally uses openmp, or even mpich working on up to 8 cores. Is this a violation of the desired functionality?
The answer is both yes and no, as there seems to be no method to force the submissions to be strictly sequential while also allowing the entire memory to be used, as well as all the cores if memory permits. However, this implementation meets the conflicting objectives and works satisfactorily. Cleverer solutions to the problem posed are not ruled out. It would also be interesting to generalize the problem a bit more. Supposing the nodes allotted through reservations were more than one per user, it would be interesting to speculate how the movement of jobs from the sequential queue to the reservations should happen. Would jobs be moved in a round-robin fashion among reservations, or would one reservation be packed to a job limit before the next reservation is used? Could an additional memory-based heuristic help decide this scheduling problem? Could a best fit of memory requirements be used in conjunction with the reservation packing algorithms? These questions require careful investigation and further study.

Conclusions

This paper describes the HPC facility at IIT Kanpur. It states the problem of core utilization versus memory utilization in node allocation for users of sequential services. This problem could still arise in parallel submissions but would be less severe. A solution to this problem is presented through the use of fixed-duration reservations. The mechanism of reservations serves the current goals of the HPC environment adequately; however, better solutions to the problem could be possible. The problem becomes interesting when an attempt is made to generalize the specification. Several clever generalizations and specifications of the problem may be possible.

Acknowledgements

The authors would like to thank Scott, Sam G, and Rajiv Jaisankar of Altair Engineering for valuable inputs to the PBS Professional based solutions implemented in the HPC facility of IIT Kanpur.
The Department of Science and Technology and IIT Kanpur have funded the HPC cluster. The authors would also like to thank Prof. Amalendu Chandra, Head of the Computer Centre of IIT Kanpur. The help of Vairavan Mani for a short duration at IIT Kanpur is duly acknowledged. The user community of the HPC facility at IIT Kanpur also deserves due thanks for their patience while the various versions of the solutions were being implemented. The authors also wish to thank engineers Ram Mohan Shrivastava and Abhishek Kesarwani from HP for their support of this activity.

REFERENCES
[1]
[2]
[3] Poster presentation by hpcsupport at the Symposium on HPC Applications, March, IIT Kanpur.
Poster Companion Reference: Hyper-V Virtual Machine Mobility Live Migration Without Shared Storage Storage Migration Live Migration with SMB Shared Storage Live Migration with Failover Clusters Copyright
More informationOverview of HPC systems and software available within
Overview of HPC systems and software available within Overview Available HPC Systems Ba Cy-Tera Available Visualization Facilities Software Environments HPC System at Bibliotheca Alexandrina SUN cluster
More informationPARALLELS CLOUD STORAGE
PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...
More informationStreamline Computing Linux Cluster User Training. ( Nottingham University)
1 Streamline Computing Linux Cluster User Training ( Nottingham University) 3 User Training Agenda System Overview System Access Description of Cluster Environment Code Development Job Schedulers Running
More informationContact for all enquiries Phone: +61 2 8006 9730. Email: info@recordpoint.com.au. Page 2. RecordPoint Release Notes V3.8 for SharePoint 2013
Release Notes V3.8 Notice This document contains confidential and trade secret information of RecordPoint Software ( RPS ). RecordPoint Software has prepared this document for use solely with RecordPoint.
More informationBuilding a Linux Cluster
Building a Linux Cluster CUG Conference May 21-25, 2001 by Cary Whitney Clwhitney@lbl.gov Outline What is PDSF and a little about its history. Growth problems and solutions. Storage Network Hardware Administration
More informationCloud Computing through Virtualization and HPC technologies
Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC
More informationWork Environment. David Tur HPC Expert. HPC Users Training September, 18th 2015
Work Environment David Tur HPC Expert HPC Users Training September, 18th 2015 1. Atlas Cluster: Accessing and using resources 2. Software Overview 3. Job Scheduler 1. Accessing Resources DIPC technicians
More informationDeploying and Optimizing SQL Server for Virtual Machines
Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Much has been written over the years regarding best practices for deploying Microsoft SQL
More informationAssetTrack. Overview: AMI AssetTrack Integration for HP Asset Manager. Mobile Forms/Scanners. Web Forms CMYK: 62 42 73 25
Overview: AMI AssetTrack Integration for HP Asset Manager AssetTrack is an enterprise barcode asset tracking solution for HP Asset Manager that enables Asset Manager customers to use handheld barcode scanners
More informationQuantcast Petabyte Storage at Half Price with QFS!
9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed
More informationLCMON Network Traffic Analysis
LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne
More informationAvoiding Performance Bottlenecks in Hyper-V
Avoiding Performance Bottlenecks in Hyper-V Identify and eliminate capacity related performance bottlenecks in Hyper-V while placing new VMs for optimal density and performance Whitepaper by Chris Chesley
More informationGC3: Grid Computing Competence Center Cluster computing, I Batch-queueing systems
GC3: Grid Computing Competence Center Cluster computing, I Batch-queueing systems Riccardo Murri, Sergio Maffioletti Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich
More informationInternational Engineering Journal For Research & Development
Evolution Of Operating System And Open Source Android Application Nilesh T.Gole 1, Amit Manikrao 2, Niraj Kanot 3,Mohan Pande 4 1,M.tech(CSE)JNTU, 2 M.tech(CSE)SGBAU, 3 M.tech(CSE),JNTU, Hyderabad 1 sheyanilu@gmail.com,
More informationParallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
More informationUsing Parallel Computing to Run Multiple Jobs
Beowulf Training Using Parallel Computing to Run Multiple Jobs Jeff Linderoth August 5, 2003 August 5, 2003 Beowulf Training Running Multiple Jobs Slide 1 Outline Introduction to Scheduling Software The
More informationCertification: HP ATA Servers & Storage
HP ExpertONE Competency Model Certification: HP ATA Servers & Storage Overview Achieving an HP certification provides relevant skills that can lead to a fulfilling career in Information Technology. HP
More informationOptimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
More informationLecture 1: the anatomy of a supercomputer
Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers of the future may have only 1,000 vacuum tubes and perhaps weigh 1½ tons. Popular Mechanics, March 1949
More informationImproving Compute Farm Efficiency for EDA
Improving Compute Farm Efficiency for EDA Many IT managers report that the average utilization of their compute farms is just 50-60%. Neel Desai, product marketing manager, Lynx Design System, explains
More informationConfiguration Maximums VMware vsphere 4.0
Topic Configuration s VMware vsphere 4.0 When you select and configure your virtual and physical equipment, you must stay at or below the maximums supported by vsphere 4.0. The limits presented in the
More informationAn Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing
An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates
More informationFLOW-3D Performance Benchmark and Profiling. September 2012
FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute
More informationGetting Started with HPC
Getting Started with HPC An Introduction to the Minerva High Performance Computing Resource 17 Sep 2013 Outline of Topics Introduction HPC Accounts Logging onto the HPC Clusters Common Linux Commands Storage
More informationUsing WestGrid. Patrick Mann, Manager, Technical Operations Jan.15, 2014
Using WestGrid Patrick Mann, Manager, Technical Operations Jan.15, 2014 Winter 2014 Seminar Series Date Speaker Topic 5 February Gino DiLabio Molecular Modelling Using HPC and Gaussian 26 February Jonathan
More informationConfiguration Maximums VMware vsphere 4.1
Topic Configuration s VMware vsphere 4.1 When you select and configure your virtual and physical equipment, you must stay at or below the maximums supported by vsphere 4.1. The limits presented in the
More informationUsing the Windows Cluster
Using the Windows Cluster Christian Terboven terboven@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University Windows HPC 2008 (II) September 17, RWTH Aachen Agenda o Windows Cluster
More informationAppro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales
Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes Anthony Kenisky, VP of North America Sales About Appro Over 20 Years of Experience 1991 2000 OEM Server Manufacturer 2001-2007
More informationDeploying Exchange Server 2007 SP1 on Windows Server 2008
Deploying Exchange Server 2007 SP1 on Windows Server 2008 Product Group - Enterprise Dell White Paper By Ananda Sankaran Andrew Bachler April 2008 Contents Introduction... 3 Deployment Considerations...
More informationAgenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.
Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance
More informationSystem Software for High Performance Computing. Joe Izraelevitz
System Software for High Performance Computing Joe Izraelevitz Agenda Overview of Supercomputers Blue Gene/Q System LoadLeveler Job Scheduler General Parallel File System HPC at UR What is a Supercomputer?
More informationAchieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
More informationImprove Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database
WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationSRNWP Workshop. HP Solutions and Activities in Climate & Weather Research. Michael Riedmann European Performance Center
SRNWP Workshop HP Solutions and Activities in Climate & Weather Research Michael Riedmann European Performance Center Agenda A bit of marketing: HP Solutions for HPC A few words about recent Met deals
More informationSolution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware
Introduction By leveraging the inherent benefits of a virtualization based platform, a Microsoft Exchange Server 2007 deployment on VMware Infrastructure 3 offers a variety of availability and recovery
More informationWindows Server 2008 R2 Hyper-V Live Migration
Windows Server 2008 R2 Hyper-V Live Migration White Paper Published: August 09 This is a preliminary document and may be changed substantially prior to final commercial release of the software described
More informationEMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi
EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either
More informationEnergy Constrained Resource Scheduling for Cloud Environment
Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering
More informationRecommendations for Performance Benchmarking
Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best
More informationCloud Management: Knowing is Half The Battle
Cloud Management: Knowing is Half The Battle Raouf BOUTABA David R. Cheriton School of Computer Science University of Waterloo Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph
More informationParallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises
Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises Pierre-Yves Taunay Research Computing and Cyberinfrastructure 224A Computer Building The Pennsylvania State University University
More informationPBS Tutorial. Fangrui Ma Universit of Nebraska-Lincoln. October 26th, 2007
PBS Tutorial Fangrui Ma Universit of Nebraska-Lincoln October 26th, 2007 Abstract In this tutorial we gave a brief introduction to using PBS Pro. We gave examples on how to write control script, and submit
More informationBig Data and Cloud Computing for GHRSST
Big Data and Cloud Computing for GHRSST Jean-Francois Piollé (jfpiolle@ifremer.fr) Frédéric Paul, Olivier Archer CERSAT / Institut Français de Recherche pour l Exploitation de la Mer Facing data deluge
More informationWITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE
WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE 1 W W W. F U S I ON I O.COM Table of Contents Table of Contents... 2 Executive Summary... 3 Introduction: In-Memory Meets iomemory... 4 What
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationMPI / ClusterTools Update and Plans
HPC Technical Training Seminar July 7, 2008 October 26, 2007 2 nd HLRS Parallel Tools Workshop Sun HPC ClusterTools 7+: A Binary Distribution of Open MPI MPI / ClusterTools Update and Plans Len Wisniewski
More informationChapter 2: OS Overview
Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:
More informationx64 Servers: Do you want 64 or 32 bit apps with that server?
TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called
More informationHigh Availability of the Polarion Server
Polarion Software CONCEPT High Availability of the Polarion Server Installing Polarion in a high availability environment Europe, Middle-East, Africa: Polarion Software GmbH Hedelfinger Straße 60 70327
More informationPerformance Optimization Guide
Performance Optimization Guide Publication Date: July 06, 2016 Copyright Metalogix International GmbH, 2001-2016. All Rights Reserved. This software is protected by copyright law and international treaties.
More informationCisco Unified Computing System and EMC VNX5300 Unified Storage Platform
Cisco Unified Computing System and EMC VNX5300 Unified Storage Platform Implementing an Oracle Data Warehouse Test Workload White Paper January 2011, Revision 1.0 Contents Executive Summary... 3 Cisco
More informationEstimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010
Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010 This document is provided as-is. Information and views expressed in this document, including URL and other Internet
More informationStep by Step Guide To vstorage Backup Server (Proxy) Sizing
Tivoli Storage Manager for Virtual Environments V6.3 Step by Step Guide To vstorage Backup Server (Proxy) Sizing 12 September 2012 1.1 Author: Dan Wolfe, Tivoli Software Advanced Technology Page 1 of 18
More informationSetup for Failover Clustering and Microsoft Cluster Service
Setup for Failover Clustering and Microsoft Cluster Service Update 1 ESX 4.0 ESXi 4.0 vcenter Server 4.0 This document supports the version of each product listed and supports all subsequent versions until
More informationHP high availability solutions for Microsoft SQL Server Fast Track Data Warehouse using SQL Server 2012 failover clustering
Technical white paper HP high availability solutions for Microsoft SQL Server Fast Track Data Warehouse using SQL Server 2012 failover clustering Table of contents Executive summary 2 Fast Track reference
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More information