PBS Professional Job Scheduler at TCS: Six Sigma-Level Delivery Process and Its Features




Bhadraiah Karnam, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066
Hari Krishna Thotakura, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066
Thean Mani Rajan K, Analyst, Tata Consultancy Services, Whitefield Road, Bangalore 560066

Keywords: PBS, six sigma, delivery process, batch jobs

Abbreviations: TCS - Tata Consultancy Services; PBS Pro - PBS Professional

Abstract

The PBS Pro job scheduler is used at TCS for scheduling High Performance Computing batch jobs. In a production environment, submitting a batch solver job, completing it without non-application errors and delivering the results as early as possible is critical to quality. This paper details the strategies used to increase the quality of the delivery process and to avoid defects. We also discuss managing multiple applications, software licenses, computational resources and user groups, each with its own requirements, using PBS Pro, along with queue management, resource utilization and six sigma-level process capability.

Introduction

In an industry environment, submission of batch-mode tasks is not only regular but also frequent and repeated. Such tasks are performed by different groups deploying a variety of compute clusters and applications. These tasks, referred to as jobs, are often part of a process wherein analysis is performed on the cluster and the output is delivered; this process is referred to as the delivery process. The computational hardware resources are deployed round the clock, and both current and non-current jobs are submitted. Non-current jobs are queued and scheduled for execution once their required conditions are satisfied.
This process of scheduling queued jobs, maintaining the current list of compute resources, checking for necessary resources such as application licenses, and all other related tasks are performed by the workload management system, PBS Pro. In this paper, we discuss different aspects of our production environment, the delivery process, six sigma methodologies and how we deploy PBS Pro to achieve a continuously improving process capability.

Process Approach

A desired result is achieved more efficiently when activities and related resources are managed as a process [1]. The customers of this delivery process are the end-users who submit their jobs to the PBS Pro queue. These users are analysts in their respective CAE application software. Their primary requirement is to submit a job in a simple manner and obtain the requisite output from the application, without any non-application errors and with the least wait time. The job process begins with the submission of the job. From there, PBS Pro takes over: it holds the job in the queue while checking for the availability of the requested resources and, once they are available, schedules it for execution along with an email notification. Input files are copied to the destination compute node and prepared for execution. Batch-mode execution of the application begins, and the output files are copied back to the selected location. The job completes with an email notification to the user.

Six Sigma Methodology for Process Improvement

Six Sigma is a disciplined, data-driven approach for eliminating defects in any process and is widely used in several industries, including manufacturing and service industries. The name derives from the fact that the methodology drives towards six standard deviations between the mean and the nearest specification limit. We use Six Sigma as a continuous process improvement methodology.
Six sigma methodology includes two basic sub-methodologies: DMAIC and DMADV. The Define, Measure, Analyze, Improve and Control (DMAIC) approach is applied to existing processes that fall below specification limits, looking for incremental improvement. The Define, Measure, Analyze, Design and Verify (DMADV) approach is used to develop new processes or products at six sigma quality levels. Literature on each of these Six Sigma methodologies, including their sub-systems, is available [2, 3, 4]. This paper highlights the use of different Six Sigma tools to design and improve the job process.

Delivery Process

The basic objective is to avoid any non-application errors while a job is submitted and run. Additionally, the wait time for jobs in any queue should be reduced while the overall cluster utilization is increased. The job submission process needs to be very simple and aligned to CAE application users. PBS Pro provides the qsub command for job submission. We use a custom job submission script, which not only simplifies job submission but also encapsulates deployment of all the application features, serving all levels of CAE users. To keep submission simple even for a complicated CAE application, the user is asked to provide only the bare minimum of details, such as the input file name. All other details, such as queue name, number of cores and email address, can be left at default values or computed automatically. Hence, job submission begins by executing the application script command, which performs all the preprocessing for the job and submits it to the PBS Pro queue using qsub. Only minimal training is required to educate users about the submission script commands for the various applications. PBS Pro being very generic batch system software, it is flexible enough both to simplify the job submission process and to build multiple processes around it. A graphic representation is shown in Fig. 1.

Simulation Driven Innovation 1
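A submission wrapper of the kind described above can be sketched as follows. The application name (app1_solver), the queue/core defaults and the derivation rules are illustrative assumptions, not TCS's actual script; the wrapper echoes the qsub command rather than executing it, so the sketch can be inspected without a PBS server.

```shell
#!/bin/sh
# Sketch: the analyst supplies only the input file name; queue, core
# count and mail address are defaulted or derived, then passed to qsub.
submit_app1() {
    input=${1:?usage: submit_app1 <input-file>}
    jobname=$(basename "$input" .dat)      # derive the job name from the input file
    queue=${QUEUE:-production}             # queue/group policy would refine this
    ncpus=${NCPUS:-4}                      # e.g. chosen from the model size
    mailto=${MAILTO:-$(id -un)}            # default notification address
    # Dry run: print the command a real wrapper would execute.
    echo qsub -N "$jobname" -q "$queue" \
        -l select=1:ncpus="$ncpus" -m abe -M "$mailto" -- app1_solver "$input"
}

# Example: the user only names the model file; everything else is defaulted.
submit_app1 crash_model.dat
```

In a real deployment the echo would be replaced by the qsub call itself, after the wrapper has run its application-specific preprocessing.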
The job submission process is closely coupled with the application process, the queuing process, hardware/cluster-dependent processes and the resources requested by the job. Besides these, there are other mutually related processes that revolve around them. PBS Pro provides functionality for separation of roles, allowing different processes to be decoupled yet work in conjunction and influence each other in shaping the overall delivery process. These processes include priority management, software and hardware administration, generation of utilization reports, quality management, user group management and dynamic policies for resource planning, management and control.

Figure 1: Delivery Process - Job and Associated Processes

Having defined the job process and the other processes created around it, the next step is to apply six sigma tools to all of them.

Six Sigma Tools Applied to Delivery Process

A Fish-Bone diagram (or Ishikawa diagram) is a tool used extensively to arrive at all the root causes of a given problem. Typically, the causes are identified through discussions and brainstorming sessions that include the end-users, PBS administrators and system administrators. A Fish-Bone diagram for a job error is shown in Figure 2. The next step is to perform a Failure Modes and Effects Analysis (FMEA) to determine the various failure modes and their impact. The FMEA helps in prioritizing failure modes based on their impact (severity), occurrence and detection. Table I shows an FMEA with three different failure modes and their classification.

Figure 2: Fish Bone Diagram for Job Errors

Table I: Failure Modes and Effects Analysis for Job Errors

Potential Failure Mode | Potential Effects of Failure | Severity | Occurrence | Detection | Recommended Actions
Incorrect number of cores chosen | The job will never run | High | High | Low | Configure the PBS Pro submission script to correct the number of cores based on queue and group policy.
Too many nodes attached to a queue without any jobs | Wastage of resources in one queue and a high wait period in another | High | Medium | Medium | Typically this happens over the weekend; create/modify queues for this period.
Hard disk crashed in a node | A running job will fail; if the node is not removed from the queue, many further jobs will enter it and fail | High | High | High | Configure the PBS MOM and submission script to automatically take a node out of the queue if a disk partition is not available.

The Fish-Bone analysis and the FMEA lead to a set of action items to eliminate all the identified potential causes of error. However, implementing all the recommended action items may be neither feasible nor cost-effective. Typically, an Effort-Benefit matrix is used to arrive at the most cost-effective action items. The PBS Pro scheduler's built-in options enable the implementation of solutions for job errors in a simple, time-bound and cost-effective manner. Figure 3 shows the Effort-Benefit matrix without and with PBS Pro.

Figure 3: Effort-Benefit Matrix Without and With PBS Pro

Applications are associated with queues, and the job submission script uses all this information to calculate the resource request list. A flow chart of the submission script is shown in Fig. 4.

Figure 4: Flow Chart for Submission Process

Similarly, custom script commands, which use PBS Pro commands (qstat, qalter, pbsnodes, qmgr, etc.) as a backend, are deployed for checking job status, managing individual jobs, managing priority, checking cluster health, monitoring node availability and implementing policy.
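The countermeasure for the "hard disk crashed" failure mode in Table I can be sketched as a health check that takes a failed node out of scheduling. The scratch partition path and node names are assumptions, and the pbsnodes -o command (which marks a node offline in PBS Pro) is echoed as a dry run so the sketch can be exercised without a live PBS server; a real deployment would execute it from a periodic health-check hook on the MOM host.

```shell
#!/bin/sh
# Sketch: probe the node's scratch partition; if it is missing or not
# writable, emit the pbsnodes command that would offline the node.
check_scratch() {
    node=${1:?node name required}
    part=${2:-/scratch}
    if [ -d "$part" ] && touch "$part/.pbs_probe" 2>/dev/null; then
        rm -f "$part/.pbs_probe"
        echo "node $node scratch OK"
    else
        echo "pbsnodes -o $node"
    fi
}

# Example: probe a writable partition and a missing one.
check_scratch node07 /tmp
check_scratch node08 /no/such/partition
```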

Queue Management

In a production cluster such as the one considered here, the workload varies across queues, applications, user groups and users. In this scenario, the wait time for a given workload and set of resources has to be minimized. The workload on the clusters over a period of two months is shown in Figure 5. The queues must therefore be configured such that computational resources are not wasted idling in one queue while a job waits in another. Further, jobs have different run times: Figure 6A shows the run times of jobs over the same two-month period. Here it is possible that a job with only a one-hour run time might have to wait for the completion of another job with a 24-hour run time. This is highly undesirable, and therefore different queues must be configured for such cases based on run time.

Figure 5: Load on the cluster for a period of two months for different applications

Figure 6A shows that there are times when there are no jobs with a one-hour run time, so providing dedicated resources for such a queue is inappropriate. PBS Pro accommodates such conflicting requirements by giving administrators multiple options to resolve the scenario. One option is to create two queues (production and express) and assign both to the same set of computational resources. Any job submitted to the express queue can be given a higher default priority so that its wait time is minimized. Figure 6B shows the minimized wait times. PBS Pro provides generic functionality for users to request a variety of resources, such as the total number of cores, queue name, run time, memory, scratch space and dynamic custom resources such as application licenses. In our implementation, the resource request list varies for different applications, compute clusters, user groups and queues.
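The production/express pairing described above can be sketched with qmgr commands along the following lines. The queue names and priority values are illustrative, and the exact attribute set should be checked against the PBS Professional Administrator's Guide [5]:

```shell
# Two execution queues over the same set of nodes; express jobs get a
# higher scheduling priority so short jobs do not wait behind long runs.
qmgr -c "create queue production queue_type=execution"
qmgr -c "set queue production priority = 100"
qmgr -c "set queue production enabled = true, started = true"

qmgr -c "create queue express queue_type=execution"
qmgr -c "set queue express priority = 150"
qmgr -c "set queue express enabled = true, started = true"
```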
Hardware is tuned for the application, and the performance of an application depends on the hardware and the size of the problem. Different problem sizes are addressed by different queues. For the same application, different hardware requires different command-line switches even when the installation location and operating system are the same. Hence, different queues are deployed for different hardware.
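The hardware-specific switch selection described above can be sketched as a small dispatch in the submission script. The queue names and solver flags below are hypothetical, chosen only to illustrate mapping a hardware-specific queue to its command line:

```shell
#!/bin/sh
# Sketch: choose the solver command line from the hardware-specific queue.
solver_cmd() {
    case ${1:?queue name required} in
        app3_hardware1) echo "app3_solver -parallel -cpus 16" ;;
        app3_hardware2) echo "app3_solver -mpi -np 32" ;;
        *)              echo "app3_solver" ;;
    esac
}

# Example: the same application, different switches per hardware queue.
solver_cmd app3_hardware1
solver_cmd app3_hardware2
```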

Figure 6: Number of jobs over a period of two months classified by (A) run time and (B) wait time

The result of the above approach is a combination of queues such as app1_large & app1_small, group1_production & group2_production, app2_express & app2_test, app3_hardware1 & app3_hardware2, app4_licenseserver1 & app4_licenseserver2, and hardware3_queue1 & hardware3_queue2, as shown in Figure 7. All these issues are addressed by PBS Pro configuration of the server, the scheduler and the submission script. Using PBS Pro, different hardware is attached to distinct or multiple queues and applications.

Figure 7: A cluster configured with different queues and applications (Cluster1: compute nodes 1-100 on Queue1 for Applications 1 & 2; nodes 75-125 shared across Queues 1, 2, 3 and a Test queue for Applications 1-4; nodes 101-200 on Queue2 for Applications 3 & 4)

Continuous Process Improvement

Six Sigma is a continuous improvement methodology, and all the requirements are continuously monitored. This involves generating daily, weekly and monthly reports and validating that there are no errors or unacceptable wait times. PBS Pro provides various options to generate the different metrics in an automated way, so that the effort spent on this continuous monitoring is minimized. The methodology discussed in this paper is in continuous use to achieve an error-free delivery process. For a quality metric that defines as defective any job with a wait time of more than 24 hours in the express/test queue, the sigma level is calculated as over 6 for a period of one month. For a quality metric that defines as defective any job that exits with a non-application error, the sigma level is calculated as 5.6 over a span of two years.

Conclusion

Following a process-based approach, we began from the simplicity of the job process and then built up the delivery process. The relationship between all the processes and the role of PBS Pro in their seamless execution has been explained.
The Six Sigma methodology and its different tools (the Fish-Bone diagram, FMEA and the Effort-Benefit matrix) are applied to the delivery process in conjunction with various PBS Pro [6] features to reduce job errors and manage queues effectively. The focus on continuous process improvement and the measurement of delivery process capability have been explained.

Acknowledgements

We acknowledge the kind support of TCS management and TCS HPC administrators for this paper.

References

1. Charles A. Cianfrani and John E. (Jack) West, A Simple Guide to Implementing Quality Management in Service Organizations: Cracking the Case of ISO 9001:2008 for Service.
2. Paul Keller, Six Sigma Demystified: A Self-Teaching Guide, McGraw-Hill.

3. Rupa Mahanti and Jiju Antony, "Six Sigma in Software Industries: Some Case Studies and Observations", International Journal of Six Sigma and Competitive Advantage, Inderscience Publishers, Volume 2, Number 3, 2006.
4. Brian M. Cook, "In Search of Six Sigma: 99.9997% Defect Free", Industry Week, October 1, 1990.
5. PBS Professional Administrator's Guide.
6. www.pbsworks.com