PBS Professional Job Scheduler at TCS: Six Sigma-Level Delivery Process and Its Features




Bhadraiah Karnam, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066
Hari Krishna Thotakura, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066
Thean Mani Rajan K, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066

Keywords: PBS, six sigma, delivery process, batch jobs

Abbreviations: TCS - Tata Consultancy Services; PBS Pro - PBS Professional

Abstract

The PBS Pro job scheduler is used at TCS for scheduling High Performance Computing batch jobs. In a production environment, submitting a batch solver job, completing it without non-application errors and delivering the results as early as possible is critical to quality. This paper details the strategies used to increase the quality of the delivery process and to avoid defects. We also discuss managing multiple applications, software licenses, computational resources and user groups, each with its own requirements, using PBS Pro, along with queue management, resource utilization and six sigma-level process capability.

Introduction

In an industry environment, submission of batch-mode tasks is not only regular but also frequent and repeated. Such tasks are performed by different groups deploying a variety of compute clusters and applications. These tasks, referred to as jobs, are often part of a process wherein analysis is performed on the cluster and the output is delivered; this process is referred to as the delivery process. The computational hardware resources are deployed round the clock, and both current and non-current jobs are submitted. Non-current jobs are queued and scheduled for execution once their required conditions are satisfied.
This process of scheduling queued jobs, maintaining the current list of compute resources, checking for necessary resources such as application licenses, and all other related tasks are performed by the workload management system, PBS Pro. In this paper, we discuss different aspects of our production environment, the delivery process, six sigma methodologies and how we deploy PBS Pro to achieve a continuously improving process capability.

Process Approach

A desired result is achieved more efficiently when activities and related resources are managed as a process [1]. The customers of this delivery process are the end-users who submit their jobs to the PBS Pro queue. These users are analysts in their respective CAE application software. Their primary requirement is to submit a job in a simple manner and obtain the requisite output from the application, without any non-application errors and with the least wait time. The job process begins with the submission of the job. From there, PBS Pro takes over: it holds the job in the queue while checking for the availability of the requested resources and, once they are available, schedules it for execution along with an email notification. Input files are copied to the destination compute node and prepared for execution. Batch-mode execution of the application begins, and the output files are copied back to the selected location. The job completes with an email notification to the user.

Six Sigma Methodology for Process Improvement

Six Sigma is a disciplined, data-driven approach for eliminating defects in any process and is widely used in several industries, including manufacturing and service industries. The name derives from the fact that the methodology drives towards six standard deviations between the mean and the nearest specification limit. We use Six Sigma as a continuous process improvement methodology.
Six sigma methodology includes two basic sub-methodologies: DMAIC and DMADV. The Define, Measure, Analyze, Improve and Control (DMAIC) approach is applied to existing processes that fall below specification limits, looking for incremental improvement. The Define, Measure, Analyze, Design and Verify (DMADV) approach is used to develop new processes or products at six sigma quality levels. Literature on each of these Six Sigma methodologies, including their sub-systems, is available [2, 3, 4]. This paper highlights the use of different Six Sigma tools to design and improve the job process.

Delivery Process

The basic objective is to avoid any non-application errors while a job is submitted and run. Additionally, the wait time for jobs in any queue should be reduced while the overall cluster utilization is increased. The job submission process needs to be very simple and aligned to CAE application users. PBS Pro provides the qsub command for job submission. We use a custom job submission script, which not only simplifies job submission but also encapsulates deployment of all the application features, serving all levels of CAE users. To keep submission simple even for a complicated CAE application, the user is asked to provide only the bare minimum of details, such as the input file name. All other details, such as queue name, number of cores and email address, can be left at default values or computed automatically. Hence, job submission begins by executing the application script command, which performs all the preprocessing for the job and submits it to the PBS Pro queue using qsub. Only minimal training is required to educate users about the submission script commands for the various applications. PBS Pro being very generic batch system software, it is flexible enough both to simplify the job submission process and to build multiple processes around it. A graphic representation is shown in Fig. 1.

Simulation Driven Innovation 1
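A submission wrapper of the kind described above can be sketched as follows. The application name (app1_solver), the queue/core defaults and the derivation rules are illustrative assumptions, not TCS's actual script; the wrapper echoes the qsub command rather than executing it, so the sketch can be inspected without a PBS server.

```shell
#!/bin/sh
# Sketch: the analyst supplies only the input file name; queue, core
# count and mail address are defaulted or derived, then passed to qsub.
submit_app1() {
    input=${1:?usage: submit_app1 <input-file>}
    jobname=$(basename "$input" .dat)      # derive the job name from the input file
    queue=${QUEUE:-production}             # queue/group policy would refine this
    ncpus=${NCPUS:-4}                      # e.g. chosen from the model size
    mailto=${MAILTO:-$(id -un)}            # default notification address
    # Dry run: print the command a real wrapper would execute.
    echo qsub -N "$jobname" -q "$queue" \
        -l select=1:ncpus="$ncpus" -m abe -M "$mailto" -- app1_solver "$input"
}

# Example: the user only names the model file; everything else is defaulted.
submit_app1 crash_model.dat
```

In a real deployment the echo would be replaced by the qsub call itself, after the wrapper has run its application-specific preprocessing.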
The job submission process is closely coupled with the application process, the queuing process, hardware/cluster-dependent processes and the resources requested by the job. Besides these, there are other mutually related processes that revolve around them. PBS Pro provides functionality for separation of roles, allowing different processes to be decoupled yet work in conjunction and influence each other in shaping the overall delivery process. These processes include priority management, software and hardware administration, generation of utilization reports, quality management, user group management and dynamic policies for resource planning, management and control.

Figure 1: Delivery Process - Job and Associated Processes

Having defined the job process and the other processes created around it, the next step is to apply six sigma tools to all of them.

Six Sigma Tools Applied to Delivery Process

A Fish-Bone diagram (or Ishikawa diagram) is a tool used extensively to arrive at all the root causes of a given problem. Typically, the causes are identified through discussions and brainstorming sessions that include the end-users, PBS administrators and system administrators. A Fish-Bone diagram for a job error is shown in Figure 2. The next step is to perform a Failure Modes and Effects Analysis (FMEA) to determine the various failure modes and their impact. The FMEA helps in prioritizing failure modes based on their impact (severity), occurrence and detection. Table I shows an FMEA with three different failure modes and their classification.

Figure 2: Fish Bone Diagram for Job Errors

Table I: Failure Modes and Effects Analysis for Job Errors

Potential Failure Mode | Potential Effects of Failure | Severity | Occurrence | Detection | Recommended Actions
Incorrect number of cores chosen | The job will never run | High | High | Low | Configure the PBS Pro submission script to correct the number of cores based on queue and group policy.
Too many nodes attached to a queue without any jobs | Wastage of resources in one queue and a high wait period in another | High | Medium | Medium | Typically this happens over the weekend; create/modify queues for this period.
Hard disk crashed in a node | A running job will fail; if the node is not removed from the queue, many further jobs will enter it and fail | High | High | High | Configure the PBS MOM and submission script to automatically take a node out of the queue if a disk partition is not available.

The Fish-Bone analysis and the FMEA lead to a set of action items to eliminate all the identified potential causes of error. However, implementing all the recommended action items may be neither feasible nor cost-effective. Typically, an Effort-Benefit matrix is used to arrive at the most cost-effective action items. The PBS Pro scheduler's built-in options enable the implementation of solutions for job errors in a simple, time-bound and cost-effective manner. Figure 3 shows the Effort-Benefit matrix without and with PBS Pro.

Figure 3: Effort-Benefit Matrix Without and With PBS Pro

Applications are associated with queues, and the job submission script uses all this information to calculate the resource request list. A flow chart of the submission script is shown in Fig. 4.

Figure 4: Flow Chart for Submission Process

Similarly, custom script commands, which use PBS Pro commands (qstat, qalter, pbsnodes, qmgr, etc.) as a backend, are deployed for checking job status, managing individual jobs, managing priority, checking cluster health, monitoring node availability and implementing policy.
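The countermeasure for the "hard disk crashed" failure mode in Table I can be sketched as a health check that takes a failed node out of scheduling. The scratch partition path and node names are assumptions, and the pbsnodes -o command (which marks a node offline in PBS Pro) is echoed as a dry run so the sketch can be exercised without a live PBS server; a real deployment would execute it from a periodic health-check hook on the MOM host.

```shell
#!/bin/sh
# Sketch: probe the node's scratch partition; if it is missing or not
# writable, emit the pbsnodes command that would offline the node.
check_scratch() {
    node=${1:?node name required}
    part=${2:-/scratch}
    if [ -d "$part" ] && touch "$part/.pbs_probe" 2>/dev/null; then
        rm -f "$part/.pbs_probe"
        echo "node $node scratch OK"
    else
        echo "pbsnodes -o $node"
    fi
}

# Example: probe a writable partition and a missing one.
check_scratch node07 /tmp
check_scratch node08 /no/such/partition
```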

Queue Management

In a production cluster such as the one considered here, the workload varies across queues, applications, user groups and users. In this scenario, the wait time for a given workload and set of resources has to be minimized. The workload on the clusters over a period of two months is shown in Figure 5. The queues must therefore be configured such that computational resources are not wasted idling in one queue while a job waits in another. Further, jobs have different run times: Figure 6A shows the run times of jobs over the same two-month period. Here it is possible that a job with only a one-hour run time might have to wait for the completion of another job with a 24-hour run time. This is highly undesirable, and therefore different queues must be configured for such cases based on run time.

Figure 5: Load on the cluster for a period of two months for different applications

Figure 6A shows that there are times when there are no jobs with a one-hour run time, so providing dedicated resources for such a queue is inappropriate. PBS Pro accommodates such conflicting requirements by giving administrators multiple options to resolve the scenario. One option is to create two queues (production and express) and assign both to the same set of computational resources. Any job submitted to the express queue can be given a higher default priority so that its wait time is minimized. Figure 6B shows the minimized wait times. PBS Pro provides generic functionality for users to request a variety of resources, such as the total number of cores, queue name, run time, memory, scratch space and dynamic custom resources such as application licenses. In our implementation, the resource request list varies for different applications, compute clusters, user groups and queues.
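The production/express pairing described above can be sketched with qmgr commands along the following lines. The queue names and priority values are illustrative, and the exact attribute set should be checked against the PBS Professional Administrator's Guide [5]:

```shell
# Two execution queues over the same set of nodes; express jobs get a
# higher scheduling priority so short jobs do not wait behind long runs.
qmgr -c "create queue production queue_type=execution"
qmgr -c "set queue production priority = 100"
qmgr -c "set queue production enabled = true, started = true"

qmgr -c "create queue express queue_type=execution"
qmgr -c "set queue express priority = 150"
qmgr -c "set queue express enabled = true, started = true"
```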
Hardware is tuned for the application, and the performance of an application depends on the hardware and the size of the problem. Different problem sizes are addressed by different queues. For the same application, different hardware requires different command-line switches even when the installation location and operating system are the same. Hence, different queues are deployed for different hardware.
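The hardware-specific switch selection described above can be sketched as a small dispatch in the submission script. The queue names and solver flags below are hypothetical, chosen only to illustrate mapping a hardware-specific queue to its command line:

```shell
#!/bin/sh
# Sketch: choose the solver command line from the hardware-specific queue.
solver_cmd() {
    case ${1:?queue name required} in
        app3_hardware1) echo "app3_solver -parallel -cpus 16" ;;
        app3_hardware2) echo "app3_solver -mpi -np 32" ;;
        *)              echo "app3_solver" ;;
    esac
}

# Example: the same application, different switches per hardware queue.
solver_cmd app3_hardware1
solver_cmd app3_hardware2
```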

Figure 6: Number of jobs over a period of two months classified by (A) run time and (B) wait time

The result of the above approach is a combination of queues such as app1_large & app1_small, group1_production & group2_production, app2_express & app2_test, app3_hardware1 & app3_hardware2, app4_licenseserver1 & app4_licenseserver2, and hardware3_queue1 & hardware3_queue2, as shown in Figure 7. All these issues are addressed by PBS Pro configuration of the server, the scheduler and the submission script. Using PBS Pro, different hardware is attached to distinct or multiple queues and applications.

Figure 7: A cluster configured with different queues and applications (Cluster1: compute nodes 1-100 on Queue1 for Applications 1 & 2; nodes 75-125 shared across Queues 1, 2, 3 and a Test queue for Applications 1-4; nodes 101-200 on Queue2 for Applications 3 & 4)

Continuous Process Improvement

Six Sigma is a continuous improvement methodology, and all the requirements are continuously monitored. This involves generating daily, weekly and monthly reports and validating that there are no errors or unacceptable wait times. PBS Pro provides various options to generate the different metrics in an automated way, so that the effort spent on this continuous monitoring is minimized. The methodology discussed in this paper is in continuous use to achieve an error-free delivery process. For a quality metric that defines as defective any job with a wait time of more than 24 hours in the express/test queue, the sigma level is calculated as over 6 for a period of one month. For a quality metric that defines as defective any job that exits with a non-application error, the sigma level is calculated as 5.6 over a span of two years.

Conclusion

Following a process-based approach, we began from the simplicity of the job process and then built up the delivery process. The relationship between all the processes and the role of PBS Pro in their seamless execution has been explained.
The Six Sigma methodology and its different tools (the Fish-Bone diagram, FMEA and the Effort-Benefit matrix) are applied to the delivery process in conjunction with various PBS Pro [6] features to reduce job errors and manage queues effectively. The focus on continuous process improvement and the measurement of delivery process capability have been explained.

Acknowledgements

We acknowledge the kind support of TCS management and TCS HPC administrators for this paper.

References

1. Charles A. Cianfrani and John E. (Jack) West, A Simple Guide to Implementing Quality Management in Service Organizations: Cracking the Case of ISO 9001:2008 for Service.
2. Paul Keller, Six Sigma Demystified: A Self-Teaching Guide, McGraw-Hill.

3. Rupa Mahanti and Jiju Antony, "Six Sigma in Software Industries: Some Case Studies and Observations", International Journal of Six Sigma and Competitive Advantage, Inderscience Publishers, Volume 2, Number 3, 2006.
4. Brian M. Cook, "In Search of Six Sigma: 99.9997% Defect Free", Industry Week, October 1, 1990.
5. PBS Professional Administrator's Guide.
6. www.pbsworks.com