Job scheduler details




Advanced Computing Center for Research & Education (ACCRE)

Outline
1. Batch queue system overview
2. Torque and Moab
3. Submitting jobs

Goals of a batch management system

Management goals:
- Define cluster mission objectives and performance criteria
- Evaluate current and historical cluster performance

Administrative goals:
- Maximize utilization and cluster responsiveness
- Tune fairness policies and workload distribution
- Automate time-consuming tasks
- Troubleshoot job and resource failures
- Integrate new hardware and cluster services into the batch system

End user goals:
- Manage current workload
- Identify available resources
- Minimize workload response time
- Track historical usage
- Identify effectiveness of prior submissions

Components of a batch system

Master node: the node on which pbs_server runs. Depending on the needs of the system, a master node may be dedicated to this task or may also fulfill the roles of other components.

Submit/interactive nodes: provide an entry point to the system for users to manage their workload. On these nodes, users can submit and track their jobs, run tests, and troubleshoot environment problems. These nodes have the client commands (e.g., qsub, qhold, etc.) available.

Components of a batch system (continued)

Compute nodes: the workhorses of the system; they execute the submitted jobs. On each compute node, pbs_mom runs to start, kill, and manage submitted jobs, and it communicates with pbs_server on the master node. Depending on the needs of the system, a compute node may double as the master node (or more).

Resources: can include high-speed networks, storage systems, license managers, etc. Availability of these resources is limited, and they need to be managed intelligently to promote fairness and increase utilization.
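Since pbs_mom on each compute node reports its state back to pbs_server, the node inventory can be inspected from a submit node. A minimal sketch using pbsnodes (the command also appears in the command table near the end of this document):

  pbsnodes -a    # full attribute listing for every compute node
  pbsnodes -l    # list only nodes that are down, offline, or otherwise unavailable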

Basic job flow

Life cycle of a job:
1. Creation: a submit script is written to hold all of the parameters of the job, e.g., how long the job should run (walltime), what resources are necessary, and what to execute.
2. Submission (with qsub): once submitted, the policies set by the administrative and technical staff of the site dictate the priority of the job and therefore when it will start executing.
3. Execution (query with qstat)
4. Finalization

Job states

A job's state indicates its current status and eligibility for execution.

Pre-execution states:
- idle: job is currently queued and eligible to run but is not executing (also: notqueued)
- hold: the job is idle and is not eligible to run due to a user, admin, or batch system hold (also: batchhold, systemhold, userhold)
- staged: job has been migrated to another scheduler but has not yet started executing

Job states (continued)

Execution states:
- starting: the batch system has attempted to start the job, and the job is currently performing pre-start tasks, which may include provisioning resources, staging data, executing system pre-launch scripts, etc.
- running: job is currently executing the user application
- suspended: job was running but has been suspended by the scheduler or an admin; the user application is still in place on the allocated compute resources, but it is not executing
- canceling: job has been canceled and is in the process of cleaning up

Post-execution states:
- completed: the job has completed running without failure
- removed: the job has run to its requested walltime successfully but has been canceled by the scheduler or resource manager due to exceeding its walltime or violating another policy; this includes jobs which have been canceled by users or admins either before or after a job has started
- vacated: the job was canceled after partial execution due to a system failure
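A job's current state can be queried from either side of the system; a minimal sketch with a hypothetical job ID:

  qstat 1234567        # Torque view: the S column shows Q (queued), R (running), H (held), C (completed)
  checkjob -v 1234567  # Moab view: reports the scheduler state (Idle, Running, etc.)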

Batch Queuing System

A policy-based intelligence engine that integrates scheduling, managing, monitoring, and reporting of cluster workloads.

Goal: guarantee that service levels are met while maximizing job throughput.

Software suite:
- TORQUE/PBS resource manager
- Moab/Maui workload manager

Resource manager vs. workload manager

A resource manager manages a queue of batch jobs for a single cluster. The resource manager contains the job launch facility as well as a simple FIFO job queue.

A workload manager is a scheduler that ties a number of resource managers together into one domain. This allows a job to be submitted from one machine and run on a different cluster. The workload manager also implements the policies that govern job priority (e.g., fair-share) and job limits, and it consolidates resource collection and accounting.

Torque and Moab

What is Torque's job as the resource manager?
- Accepting and starting jobs across a batch farm.
- Cancelling jobs.
- Monitoring the state of jobs.
- Collecting returned outputs.

What is Moab's job? Moab makes all the decisions. When deciding whether a job should be started, it asks questions like:
- Are there enough resources to start the job?
- Given all the jobs I could start, which one should I start?

Moab runs a scheduling iteration:
- When a job is submitted.
- When a job ends.
- At regular configurable intervals.


Detailed job flow

1. A script is submitted to Torque (using qsub), specifying the required resources.
2. Moab periodically retrieves from Torque the list of potential jobs, available node resources, etc.
3. Moab prioritizes the jobs in the idle queue.
4. When resources become available, Moab tells Torque to execute certain jobs on particular nodes.
5. Torque dispatches jobs to the PBS MOMs (machine oriented miniservers) running on the compute nodes; pbs_mom is the process that starts the job script.
6. Job status changes are reported back to Moab, and its information is updated.
7. Moab sends further instructions to Torque.

Moab updates occur roughly every 2 minutes (configurable).

The queue

The queue is divided into 3 subqueues:
- active: running
- eligible: idle, but waiting to run
- blocked: idle, held, deferred

A job can be blocked for several reasons, e.g.:
- requested resources not available
- reserved nodes offline
- cluster policy (e.g., a user has a maximum of 10 jobs in the eligible queue)
- user places an intentional hold

Moab supports four distinct types of holds: user holds, system holds, batch holds, and defer holds.
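Each subqueue can be listed separately with showq, using standard Moab/Maui options:

  showq       # all three subqueues
  showq -r    # active (running) jobs only
  showq -i    # eligible (idle) jobs only
  showq -b    # blocked jobs only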

Hold queue

- User hold (set with qhold).
- System hold: placed by a system admin.
- Batch hold, e.g., due to:
  - no resources
  - system limits
  - resource manager failure
  - policy violation
  - ...
- Job deferred: the job is deferred for some amount of time; when that time expires, it is allowed back into the idle queue and again considered for scheduling. A job gets several such opportunities before it is finally put into a hold.
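A minimal sketch of placing and releasing a user hold, with a hypothetical job ID:

  qhold 1234567        # place a user hold; the job moves to the blocked subqueue
  qrls 1234567         # release the hold; the job becomes eligible again
  checkjob -v 1234567  # shows why a job is currently blocked or held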

Job priority

Each submitted job has a priority, and the job with the highest priority gets executed first. The priority of a job is calculated based on a number of factors:

- Credential: relative importance of certain groups or accounts
- Fairshare: favor jobs based on short-term historical usage
- Resource: weight jobs by the amount of resources requested
- Service: specifies which service metrics are of greatest value to the site
- Target service: takes into account job scheduling performance targets
- Usage: applies to active jobs only
- Job attribute: allows the incorporation of job attributes into a job's priority

Priority = CREDWEIGHT * CREDComp + FSWEIGHT * FSComp + ...

Check job priority: mdiag -p
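For intuition, a hypothetical worked example (all weights and component values invented for illustration): with CREDWEIGHT=100, CREDComp=2, FSWEIGHT=10, FSComp=5, and all other components zero,

  Priority = 100 * 2 + 10 * 5 = 250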

Fairshare

Fairshare scheduling helps steer a system toward usage targets by adjusting job priorities based on short-term historical usage. Moab's fairshare can target usage percentages, ceilings, floors, or caps for users, groups, accounts, etc.

Check fairshare: mdiag -f

Fairshare example

  FSINTERVAL 12:00:00
  FSDEPTH 4
  FSDECAY 0.5
  FSPOLICY DEDICATEDPS

  # all users should have a fs target of 10%
  USERCFG[DEFAULT] FSTARGET=10.0

  # user john should get 20% CPU usage
  USERCFG[john] FSTARGET=20.0

  # reduce staff priority if group usage exceeds 15%
  GROUPCFG[staff] FSTARGET=15.0-

  # give group orion additional priority if usage drops below 25.7%
  GROUPCFG[orion] FSTARGET=25.7+

  FSUSERWEIGHT 10
  FSGROUPWEIGHT 100

Fairshare example (continued)

  FSINTERVAL 12:00:00
  FSDEPTH 4
  FSDECAY 0.5

  Interval   Total user john usage   Total cluster usage
  0          60                      110
  1          0                       125
  2          10                      100
  3          50                      150

  Usage = (60 + 0.5^1 * 0 + 0.5^2 * 10 + 0.5^3 * 50)
        / (110 + 0.5^1 * 125 + 0.5^2 * 100 + 0.5^3 * 150)
        = 68.75 / 216.25
        = about 31.8%

Under the configuration above, john's decayed effective usage (about 31.8% of the cluster) exceeds his 20% fairshare target, so his job priority would be reduced accordingly.

Submitting jobs

You create a job submission script: a shell script containing the set of commands you want run on some set of cluster compute nodes.

Submit it to the job scheduler:

  qsub [options] <job submission script>

qsub returns a job ID upon successful submission. You will need this ID to monitor or delete the job.

Jobs are queued until the system is ready to run them, i.e., until the requested resources become available. The scheduler selects which jobs to run, when, and where, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.
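A minimal submission session (the job ID format is site-dependent; this one is hypothetical):

  $ qsub myjob.pbs
  1234567.master.cluster
  $ qstat 1234567    # monitor the job
  $ qdel 1234567     # delete the job if necessary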

Writing a job script

Note that the comments sit on their own lines: trailing text after a #PBS directive may be parsed as extra qsub arguments.

  #!/bin/bash
  # (the first line defines the shell)

  # send status/progress emails
  #PBS -M my.address@vanderbilt.edu
  # email at beginning, abort, & end
  #PBS -m bae
  # resources (-l) required for job
  #PBS -l nodes=1:ppn=1
  # REQUIRED! estimated wall clock time, format hh:mm:ss
  #PBS -l walltime=00:05:00
  #PBS -l mem=1000mb
  # send stdout to myjob.output
  #PBS -o myjob.output
  # join stdout/err to myjob.output
  #PBS -j oe

  myexecutable input1 input2

PBS environment variables

- PBS_O_HOME: the HOME environment variable of the submitter
- PBS_O_HOST: the name of the host upon which the qsub command is running
- PBS_O_QUEUE: the name of the original queue to which the job was submitted
- PBS_O_WORKDIR: the absolute path of the current working directory of the qsub command
- PBS_JOBID: the job identifier assigned to the job by the batch system
- PBS_NODEFILE: the name of the file containing the list of nodes assigned to the job (for parallel jobs)
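A common pattern at the top of a job script, using some of these variables (the echo text is illustrative):

  cd $PBS_O_WORKDIR    # run from the directory where qsub was invoked
  echo "Job $PBS_JOBID running on the following processors:"
  cat $PBS_NODEFILE    # one line per allocated processor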

Useful commands

  qsub [options] <pbs script>    submit job for execution
  qdel, qhold, qrls <jobid>      remove, hold, release hold
  qstat <jobid>                  view job(s) status
  qstat -q                       view queue status
  showq                          view queue status with nice format
  pbsnodes -l, -a                view nodes & attributes
  checkjob -v <jobid(s)>         view job(s) status
  mdiag -f                       check fairshare
  mdiag -v -p                    check job priority
  mdiag -v -j <jobid>            resource summary
  tracejob -n <#days> <jobid>    trace job history
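For example, to investigate a finished job after the fact, one might combine two of these (hypothetical job ID):

  tracejob -n 3 1234567   # search the last 3 days of logs for this job's history
  mdiag -v -j 1234567     # detailed scheduler information about the job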

Checking cluster utilization

Overview charts:
- Number and percentage of active compute CPUs
- Number and percentage of active compute nodes
- Number of active, eligible, and blocked jobs

http://www.accre.vanderbilt.edu/?page_id=767

Scheduler Etiquette

Scheduler policies: http://www.accre.vanderbilt.edu/?page_id=89

Limits on:
- Number of jobs in queue
- Maximum and minimum job lengths
- Memory usage

We place special restrictions when necessary.