PBS INTERNALS
PBS & TORQUE
PBS (Portable Batch System): a software system for managing system resources on workstations, SMP systems, MPPs, and vector computers.
- Based on the Network Queuing System (NQS, NASA, 1986); first ran on Cray systems, then was ported to other architectures.
TORQUE (Terascale Open-source Resource and QUEue Manager): an open-source version of PBS, providing control over:
- batch jobs
- distributed compute nodes
Batch Systems
- provide a mechanism for submitting, launching, and tracking jobs on a shared resource
- provide centralized access to distributed resources
- give users a single system image for managing their jobs and the aggregate compute resources available
A Batch System
[Diagram] User commands (qsub, qdel, qstat) submit, delete, and query jobs at the batch server (pbs_server), which holds the batch server and cluster configuration, the job queue, and the state table. The scheduler exchanges job and node start/stop/status information (plus additional cluster configuration) with the server, and the server drives several execution hosts, each running pbs_mom.
TORQUE Features
TORQUE provides enhancements over standard OpenPBS in the following areas:
- scalability
- fault tolerance
- scheduling interface
- usability
TORQUE Benefits
- Initiate and manage serial and parallel batch jobs remotely (create, route, execute, modify, and/or delete jobs).
- Define and implement resource policies that determine how much of each resource a job can use.
- Spread jobs across resources on multiple servers to accelerate job completion time.
- Collect information about the nodes within the cluster to determine which are in use and which are available.
PBS Structure
General components (daemons):
- a resource manager: pbs_server
- a scheduler: pbs_sched
- many executors: pbs_mom (MOMs: Machine-Oriented Mini-servers)
PBS provides one API to communicate with the server and another to interface with the MOMs.
PBS Components
Job server (pbs_server) provides the basic batch services:
- receiving/creating a batch job
- modifying the job
- protecting the job against system crashes
- running the job
- cancelling a job
- logging information about jobs for accounting
- keeping track of all nodes and jobs
PBS Components
Job executor (pbs_mom):
- receives a copy of the job from the job server
- sets the job into execution
- creates a new session as the identical user
- returns the job's output to the user
Job scheduler (pbs_sched):
- enforces the site's policy, controlling which job is run, and where and when it is run
- PBS allows each site to use its own scheduler; currently the Maui scheduler is used in NAR
PBS Job Session
From a user's point of view:
- Determine the resource requirements (CPU time, memory, number of CPUs/node) for a job and write a batch script.
- Submit the script to the queuing system.
- Wait for the job to be scheduled and run.
- Get the results.
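The steps above can be sketched as a minimal batch script. This is an illustrative sketch: the job name and resource values are assumptions, not values from the slides.

```shell
# Write a hypothetical PBS batch script (job name and resources are assumed).
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -N demo_job            # job name
#PBS -l nodes=2:ppn=4       # 2 nodes, 4 processors per node
#PBS -l walltime=01:30:00   # wallclock limit (also used by backfill)
#PBS -l mem=2gb             # memory request
#PBS -j oe                  # merge stdout and stderr into one file
cd "$PBS_O_WORKDIR"         # start in the directory qsub was run from
echo "Running on $(hostname)"
EOF
# On a system with TORQUE installed, the script would be submitted with:
#   qsub myjob.pbs          # prints a job id such as 1234.server
grep -c '^#PBS' myjob.pbs   # -> 5 directives written
```

The `#PBS` lines are ordinary comments to the shell, so pbs_mom simply runs the script as a shell script on the mother superior node; qstat then shows the job's state (Q = queued, R = running) and qdel cancels it.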
PBS Job Session
From the queuing system's point of view:
- The user submits the job with the qsub command.
- PBS places the job into a queue based on its resource requests and runs the job when those resources become available.
- The job runs until it either completes or exceeds one of its resource request limits.
- PBS copies the job's output into the directory from which the job was submitted and optionally notifies the user via email that the job has ended.
PBS Job Workflow
When a scheduling interval starts:
- The scheduler asks pbs_server for the state of the nodes and of any jobs.
- The scheduler queries the MOMs to determine the available resources (memory, CPU load, etc.).
- The scheduler examines the job queues and attempts to schedule any eligible jobs.
- If there are enough resources free, the scheduler allocates resources for a job, returning the job id and resource list to pbs_server for execution.
- pbs_server contacts the pbs_mom on the first node assigned to the job (that pbs_mom is called the mother superior).
- The mother superior executes the job script submitted by the user; it also monitors the resource usage of child processes and reports back to the server.
When a PBS heartbeat happens:
- pbs_server contacts each pbs_mom and asks the status of its node.
Scheduler Problem
The default TORQUE scheduler (pbs_sched) is very basic:
- it provides poor utilization
- it lacks some important features, such as:
  - backfill scheduling
  - advance reservations
Solution: the MAUI scheduler.
MAUI
MAUI is a scheduler (not a resource manager!) developed by the Maui High Performance Computing Center (MHPCC) as an alternative to the default LoadLeveler scheduler in their IBM SP (Scalable POWERparallel) environment.
It was ported to PBS using the appropriate API provided by PBS.
MAUI Features
- Scheduling behavior is constrained by way of throttling policies:
  - both soft and hard limits are used
  - applied at each iteration
- Three main algorithms are used: backfill, priority, fairshare.
- Reservations for high-priority jobs.
- More control parameters for users.
- Commands for querying the scheduler.
MAUI Philosophy
Maui is particularly concerned with scheduling multiprocessor jobs: how do you arrange for a matching set of processors to be simultaneously available for a single job?
- Maui tries to plan the execution of such jobs at a particular time when it expects sufficient processors to be available, on the basis of the jobs' maximum walltime parameters.
- It establishes reservations on a set of processors for a job, ensuring all the processors are free at the planned time.
MAUI Philosophy
- As the reservations take effect, more and more processors become idle as the planned job time approaches.
- A scheme called backfill tries to exploit these idle processors by running short single/few-processor jobs out of priority order in the gaps.
- Maximum efficiency is achieved by scheduling big jobs first and running small jobs in the gaps.
- Maui really cares about walltimes.
Scheduling a Job in MAUI
- Jobs are submitted into a pool of jobs: forget about queues, MAUI considers all jobs.
- Each job has a priority number calculated for it; the job with the highest priority is executed first.
- Maui scans through all the jobs and nodes:
  - when a job is submitted
  - when a job completes
  - and at periodic intervals
Scheduling Objects
MAUI functions by manipulating five elementary objects:
- jobs
- nodes
- reservations
- QOS (quality of service) structures
- policies
Scheduling Objects
- A job consists of one or more requirements, each of which requests a number of resources of a given type.
- A node is a collection of resources with a particular set of associated attributes.
- Policies are generally specified via a configuration file and serve to control how and when jobs start.
- The MAUI scheduler gives administrators fine-grained control over QOS levels on a per-user, per-group, and per-account basis.
Backfill
- Backfill is a scheduling optimization.
- It allows some jobs to be run 'out of order', so long as they do not delay the highest-priority jobs in the queue.
- It offers a significant scheduler performance improvement.
Backfill Algorithm
- Uses the wallclock limit (an estimate of the wall, or elapsed, time from job start to job finish) to find holes: amounts of time dedicated to a job that has already finished or is still waiting. It fills these holes with the execution of other jobs in the queue.
- Better estimates of the wallclock limit will increase the amount of improvement backfill scheduling can provide for your jobs.
Priority Algorithm
- The default is a trivial FIFO, but priorities are weighted and combined based on a range of job-related components.
- These components have subcomponents such as user, group, priority, QoS, etc.
- Weight values are configurable parameters, and the value of each component is calculated from its subcomponents.
Components of a Job's Priority
- CRED = credentials, e.g. user or group name, submission queue, ...
- FS = fairshare, e.g. considers the historical usage of a user, group, ...
- RES = resources, e.g. number of nodes requested, length of job, ...
- SERV = service, e.g. time the job has been queued
- TARGET = target, e.g. jobs must run within two days
- USAGE = usage, e.g. time consumed by jobs running now
Each component is weighted and summed to form the priority:
PRIORITY = CREDWEIGHT * CREDComp + FSWEIGHT * FSComp + ...
A common mistake is to leave, say, FSWEIGHT at 0 after having configured FS.
Example Subcomponents
CRED components are static contributions to the overall priority number, e.g. username, group name, submission queue.

Config Attribute    Value            Summary
CREDWEIGHT          10               component weight
USERWEIGHT          20               subcomponent's weight
USERCFG[user]       PRIORITY=1000    static priority for user
CLASSWEIGHT         5                subcomponent's weight
CLASSCFG[queue]     PRIORITY=10000   static priority for queue

PRIORITY = CREDWEIGHT * CREDComp + FSWEIGHT * FSComp + ...
CREDComp = USERWEIGHT * (USERCFG[user] priority) + CLASSWEIGHT * (CLASSCFG[queue] priority) + ...
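Plugging the table's numbers into the two formulas gives the CRED contribution to the priority; the following just replays the slide's arithmetic with the weights as shell variables:

```shell
# Worked CRED example using the weights and static priorities from the table.
CREDWEIGHT=10
USERWEIGHT=20;  USER_PRIORITY=1000      # USERCFG[user]   PRIORITY=1000
CLASSWEIGHT=5;  CLASS_PRIORITY=10000    # CLASSCFG[queue] PRIORITY=10000

# CREDComp = 20*1000 + 5*10000
CREDCOMP=$((USERWEIGHT * USER_PRIORITY + CLASSWEIGHT * CLASS_PRIORITY))
echo "CREDComp          = $CREDCOMP"                   # -> 70000
echo "CRED contribution = $((CREDWEIGHT * CREDCOMP))"  # -> 700000
```

Note how the class's large static priority (10000) dominates even with a smaller subcomponent weight; the per-queue setting is the main lever here.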
Fairshare
- A mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions.
- It is composed of several parts which handle historical information: fairshare windows, impact, usage, and target.
- All parts are configurable parameters.
- The purpose of fairshare is to steer the existing workload.
Fairshare Components
Fairshare windows
- These are actually timeframes.
- The length and number of windows to be considered while evaluating historical information can be controlled by the FSINTERVAL and FSDEPTH parameters.
Impact
- Defines the importance of each window for the fairshare evaluation.
- Can be configured using the FSDECAY parameter.
Fairshare Components
Usage
- Defines the metric to be used for the fairshare evaluation.
- Two types of usage metric (controlled by the FSPOLICY parameter):
  1. Dedicated usage tracks what the scheduler assigns to the job.
  2. Consumed usage tracks what the job actually uses.
Target
- The expected usage value for a user, group, or class (queue), used for the fairshare evaluation.
- Determined and configured by the administrator.
Example Fairshare Calculation
FSPOLICY = consumed usage, FSDEPTH = 4, FSDECAY = 0.5, FSINTERVAL = 24h
[Bar chart] Assigned vs. used walltime for each of the four 24-hour windows (0-24, 24-48, 48-72, 72-96 hours).
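With FSDECAY = 0.5 and FSDEPTH = 4, each older window contributes half as much as the one before it. A sketch of the decay-weighted usage sum, with assumed per-window consumed-walltime values (the exact numbers in the chart are illustrative, not reproduced here):

```shell
# Decay-weighted fairshare usage (FSPOLICY = consumed usage).
# used[1] is the most recent 24h window; the values are assumed.
awk 'BEGIN {
  FSDECAY = 0.5; FSDEPTH = 4
  split("75 43 25 15", used, " ")        # consumed walltime per window
  total = 0
  for (i = 1; i <= FSDEPTH; i++)
    total += used[i] * FSDECAY^(i - 1)   # older windows matter less
  # 75 + 43*0.5 + 25*0.25 + 15*0.125 = 75 + 21.5 + 6.25 + 1.875
  printf "effective fsusage = %g\n", total
}'
```

The resulting effective usage (here 104.625) is what gets compared against the configured fairshare target on the next slide.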
Example Fairshare Calculation
- A comparison between the target and the usage for the user, group, or class then gives the contribution to the job's overall priority.
- Either the difference or the ratio can be used in this comparison (here the ratio is used):
PRIORITY = CREDWEIGHT * CREDComp + FSWEIGHT * FSComp + ...
FSComp = FSUSERWEIGHT * (1 - user's fsusage / user's fstarget) + FSGROUPWEIGHT * (1 - group's fsusage / group's fstarget) + ...
Soft and Hard Limits
- Hard and soft limit specification allows a site to balance both fairness and utilization on a given system; they provide scheduling flexibility.
- Soft limits: the more constraining limits, in effect in heavily loaded situations; they reflect tight fairness constraints.
- Hard limits: the more flexible limits; they specify how flexible the scheduler can be in selecting jobs when there are idle resources available after all jobs meeting the tighter soft limits have been started.
Example of Soft and Hard Limits
cenga's jobs: MAXJOB=2,4 (soft limit 2, hard limit 4)
cengb's jobs: MAXJOB=4,5 (soft limit 4, hard limit 5)
[Diagram] Job slots over time, with each user's jobs shown as running, idle, or blocked depending on whether the soft and hard limits have been reached.
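The limits in this example would be expressed in the Maui configuration roughly as follows; this is a sketch of maui.cfg syntax, with the user names taken from the slide:

```
# maui.cfg (sketch): MAXJOB=<soft>,<hard> running-job limits per user
USERCFG[cenga]  MAXJOB=2,4   # normally at most 2 running jobs, up to 4 when resources are idle
USERCFG[cengb]  MAXJOB=4,5   # normally at most 4 running jobs, up to 5 when resources are idle
```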
Advance Reservations
An advance reservation is the mechanism by which MAUI guarantees the availability of a set of resources at a particular time.
Every reservation consists of three major components:
- a list of resources
- a timeframe
- an access control list
Additionally, a reservation may also have a number of optional attributes controlling its behavior and interaction with other aspects of scheduling.
Advance Reservations
- The access control list (ACL) determines who or what can use the reserved resources.
- While reservation ACLs allow particular jobs to utilize reserved resources, they do not force the jobs to utilize them: Maui will attempt to locate the best possible combination of available resources, whether these are reserved or unreserved.
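As an illustration, a standing reservation carrying the three components (resources, timeframe, ACL) might look like this in maui.cfg; the reservation name, host list, and user list here are assumptions:

```
# maui.cfg (sketch): a standing reservation named "night"
SRCFG[night]  HOSTLIST=node0[1-4]                             # resources: four nodes
SRCFG[night]  PERIOD=DAY STARTTIME=20:00:00 ENDTIME=6:00:00   # timeframe
SRCFG[night]  USERLIST=cenga,cengb                            # access control list
```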
Preemption
- Preemption is, basically, pausing a low-priority (or time-consuming) job when a higher-priority (or quickly finishing) job requests some resources.
- It can be controlled manually or by using a preemptive backfill policy.
- The type of preemption determines how the PREEMPTEE (paused/low-priority) job continues when the PREEMPTOR (high-priority) job has finished.
Types of Preemption
Job requeue
- Preemptee jobs are terminated and returned to the job queue.
Job suspend
- Preemptee jobs stop executing but remain in memory. While a suspended job frees up processor resources, it may continue to consume swap and/or other resources.
Job checkpoint
- Preemptee jobs save off their current state (checkpointing) and terminate. When resources become available again, a checkpointed job is restarted and resumes execution from its checkpoint.
- The resource manager must support checkpointing for this to work; TORQUE supports checkpointing using a kernel-level package called BLCR (Berkeley Lab Checkpoint/Restart).
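In Maui these behaviors are selected with the PREEMPTPOLICY parameter, and jobs are marked as preemptors or preemptees via QOS flags; a configuration sketch, where the QOS names are assumptions:

```
# maui.cfg (sketch): choose one of REQUEUE, SUSPEND, CHECKPOINT
PREEMPTPOLICY   REQUEUE

# Jobs in the "high" QOS may preempt jobs in the "low" QOS
QOSCFG[high]    QFLAGS=PREEMPTOR
QOSCFG[low]     QFLAGS=PREEMPTEE
```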
Thank you
Any questions?