The Moab Scheduler
Dan Mazur, McGill HPC, daniel.mazur@mcgill.ca, Aug 23, 2013




Outline
- Fair resource sharing: fairness, priority
- Maximizing resource usage: the MAXPS fairness policy
- Minimizing queue times: Should I split up my long-duration job? Should I use procs=36 or nodes=3:ppn=12?
- Out of memory

Job Scheduling Tetris
[Series of "Tetris" diagrams: time on one axis, cores on the other; each colour is one job, and gaps are unused cores.]
- Some jobs can be split along the cores axis.
- Higher-priority jobs are placed before lower-priority jobs.
- Backfill: a small, low-priority job can run when higher-priority jobs can't.
- A job requesting nodes=m:ppn=n (instead of procs=p) cannot be split horizontally along the cores axis.

Scheduling Considerations
- Maximize use of resources: keep cores busy, maximize job throughput.
- Fairness: ensure users have access to their allocations (Fairshare); avoid monopolization by one user/group (MAXPS).

Priority
- Moab sorts jobs by priority (showq -i) and runs jobs from the list until it reaches a job that cannot be run immediately.
- Moab computes the earliest time that job can run, then runs jobs that can finish before the highest-priority job will start (backfill).
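
To see where a job sits in the priority-sorted queue (showq -i comes from the slide; checkjob is a standard Moab client command; the job id is hypothetical):

    showq -i               # idle (eligible) jobs, sorted by priority
    checkjob 12345678      # details for one job, including why it is not yet running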

Priority Factors
On Guillimin:
- Time in queue (weight = 1)
- FairShare, i.e. the group's historical usage (weight = 5)
In total, 41 factors affecting priority are documented in Moab.
[Figure: priority as a function of time in queue, showing the queue-time and FairShare components.]
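
As a rough sketch of how these two weighted components combine (Moab's actual formula has its own scaling of each component; the numbers here are only illustrative):

    priority ≈ 1 × (queue-time component) + 5 × (FairShare component)

So a group that is well below its FairShare target gains priority much faster than it would from queue time alone.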

Fair Share
- Fair Share: priority based on the account's (i.e. group's) recent historical usage.
- Most heavily weighted component of priority on Guillimin.
- Looks at the past 30 days, weighted: yesterday's usage counts more than usage 3 weeks ago.
- Fair Share target usage = your allocation.

Fair Share parameters on Guillimin:
- Fairshare decay = 0.9
- Fairshare interval = 1 day
- Fairshare depth = 30 days
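
With these settings, a group's effective usage is roughly a decayed sum over the last 30 daily intervals; this is a sketch of the weighting, not Moab's exact implementation:

    effective usage ≈ sum over d = 0..29 of 0.9^d × (usage d days ago)

Yesterday's usage therefore enters with weight 0.9, while usage from three weeks ago enters with weight 0.9^21 ≈ 0.11.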

Showstart
The showstart command attempts to predict a job's start time. It:
- does not know about higher-priority jobs that haven't been submitted yet but will run before your job
- does not know about jobs that will be cancelled or finish before their walltime
- does not know about increases to job walltime
- is usually very optimistic and inaccurate
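
Typical usage (the job id is hypothetical; Moab also documents a procs@duration form for a request that has not been submitted yet):

    showstart 12345678     # estimated start time for an existing job
    showstart 12@3600      # estimate for a hypothetical request of 12 cores for 3600 seconds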

MAXPS
- We limit the number of outstanding processor-seconds a group can schedule.
- In Tetris terms: a limit on the total area your group can occupy.
- Fairness: prevents jobs beyond a group's quota from accumulating queue-time priority.
- Default MAXPS = 900 core-days (soft), 1800 core-days (hard); 900 core-days = 30 cores x 30 days.
- Soft limit: the job is "blocked due to MAXPS limit exceeded" until the group's outstanding scheduled processor time drops below MAXPS, or the cluster has no other jobs to run.
- Hard limit: the job will not run.
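
A worked example of how quickly the soft limit fills up (the job sizes are hypothetical):

    nodes=8:ppn=12 for 10 days  ->  96 cores x 10 days = 960 core-days  (over the 900 core-day soft limit on its own)
    nodes=2:ppn=12 for 7 days   ->  24 cores x 7 days  = 168 core-days  (about five such jobs fit under the soft limit)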

MAXPS blocked - What can I do?
- Use the command 'showq -w acct=abc-123-aa' to see which running and idle jobs from your group are using up the 900 core-day (default) MAXPS window (see the sketch after this list).
- Cancel large jobs that have low priority for your research.
- Contact greedy group members.
- Sometimes a single job violates the MAXPS limitation on its own: Do you need that much walltime or that many cores? Can the job be split into several smaller jobs?
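
A minimal sketch of that workflow (the account string comes from the slide; the job id is hypothetical; canceljob is a standard Moab client command):

    showq -w acct=abc-123-aa     # list your group's running and idle jobs
    canceljob 13579246           # cancel a low-priority job to free up outstanding core-days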

Splitting jobs in time and cores
Users want short queue times to achieve a fast time-to-solution.
Caveats about the following information:
- Based on our aggregate data, not controlled experimentation
- No control for dependencies between jobs, group priority, etc.
- Seeking qualitative insight, not quantitative conclusions
- All axes and colours are logarithmic

Should I split up my long job? (Splitting in time)
- Long-duration job = long queue time; short-duration job = short queue time.
- Should I split up my long job into shorter jobs to get a faster time-to-solution?

Should I split up my long job?
- Long-duration job = long queue time; short-duration job = short queue time.
- To split a long job into shorter pieces, submit the pieces as a chain of dependencies (see the sketch after this list).
- Jobs don't accumulate queue-time priority until all of their dependencies are resolved.
- For splitting to pay off, the sum of the queue times of the partial jobs would need to be less than the queue time of the full job.
- Note: without dependencies, users can burst well above their allocation for short durations by submitting lots of short-duration jobs.
- If the work is embarrassingly parallel, splitting up your jobs is usually a great idea!
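
A minimal sketch of such a chain using Torque-style qsub dependencies (msub offers a comparable dependency option; the script names are hypothetical):

    JOB1=$(qsub part1.pbs)                            # first piece
    JOB2=$(qsub -W depend=afterok:$JOB1 part2.pbs)    # starts only after part 1 finishes successfully
    JOB3=$(qsub -W depend=afterok:$JOB2 part3.pbs)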

[Figure: queue time vs. requested walltime, with separate panels for single-core and multi-core jobs. Compare the slope of the (solid) trendline to the slope of the (dotted) "queue time = requested walltime" line.]

Should I split up my long job?
- Almost always, the sum of the queue times for the partial jobs will be longer than the queue time for the full job.
- Do not split up your long job.
- Do enable checkpointing on your long job.
- Tip for one last checkpoint: msub -l signal=sighup@2:00 (a job-script sketch that handles the signal follows).
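
A minimal job-script sketch for catching that signal: signal=sighup@2:00 asks Moab to deliver SIGHUP about two minutes before the walltime limit, and the script traps it to trigger one last checkpoint. The application and its checkpoint-on-SIGUSR1 behaviour are hypothetical.

    #!/bin/bash
    #PBS -l walltime=72:00:00
    #PBS -l nodes=1:ppn=12

    # Assumption: the application writes a checkpoint when it receives SIGUSR1.
    trap 'kill -USR1 "$APP_PID"; wait "$APP_PID"' HUP

    ./my_long_simulation &     # run in the background so the shell can react to the trap
    APP_PID=$!
    wait "$APP_PID"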

Procs or nodes:ppn? (Splitting in cores)
- nodes:ppn = better hardware performance: minimizes network traffic and the chance of failure.
- procs = less time in the queue: the job can be split up to fit into awkward spaces.
- How can you get the fastest time-to-solution? (The two request forms are shown below.)
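
The two request forms from the outline, side by side (assuming 12-core nodes, as on Guillimin):

    #PBS -l procs=36           # 36 cores, placed wherever they fit on the cluster
    #PBS -l nodes=3:ppn=12     # exactly 3 whole nodes with 12 cores each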

[Figure: queue time vs. job size, comparing jobs submitted with procs (white trendline) to jobs submitted with nodes:ppn (yellow trendline).]

Procs or nodes:ppn?
The answer depends strongly on the application and the current cluster load.
- Example: a 10,000 core-hour job (big job: 100 cores for 4 days) spends ~6 extra hours in the queue using nodes:ppn instead of procs. Embarrassingly parallel -> use procs; lots of network communication -> use nodes:ppn.
- Example: a 70 core-hour job (small job: 6 cores for 12 hours) sees about the same queue time with nodes:ppn or procs. Use nodes:ppn to get better hardware performance.
- Example: a 10 core-hour job (very small job: 1 core for 10 hours). Very small jobs are more likely to run immediately with procs than with nodes:ppn; if resources aren't available, the wait times are similar.

Procs or nodes:ppn?
- Use nodes:ppn for most jobs.
- For big jobs (~10,000 core hours) with embarrassing parallelism (little or no network communication), you get results several hours sooner with procs; also consider splitting the tasks into separate jobs.
- For very small jobs (~10 core hours), there is a greater likelihood of running immediately (backfilling) with procs.

Out of Memory
- Moab seems to have improved its algorithm for detecting memory overuse.
- Some previously working jobs will now *correctly* be killed.
- Use the Moab/msub option -M email@address to get notified. A typical notification looks like:

    PBS Job Id: ########.gm-1r14-n05.guillimin.clumeq.ca
    Job Name: JobName
    Exec host: QQ-#r##-n##/#
    job deleted
    Job deleted at request of root@gm-1r14-n05.guillimin.clumeq.ca
    job ######## exceeded MEM usage hard limit ([MB Used per reserved core] > [MB Limit per core])
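
A hedged sketch of the matching job-script header (-M, -m, and pmem are standard PBS/Torque options; the address and memory value are placeholders):

    #PBS -M your.name@example.com     # where notifications are sent
    #PBS -m ae                        # mail on abort and on end
    #PBS -l pmem=2700m                # memory requested per process/core (placeholder value)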

Out of Memory
- We also have our own scripts to detect out-of-memory jobs.
- Our scripts will always send an email:

    Subject: Job 12770997 terminated due to excessive memory usage
    Your job 12770997 was using a total of 34682176 kb of memory on node sw-2r15-n02.

Summary
Today we learned:
- How priority is assigned to jobs
- How fair share priority is calculated
- How Moab uses priority to decide which job to run
- How backfilling works
- That you should not split up big jobs to save queue time
- That you should sometimes use procs instead of nodes:ppn
- That Moab is now more accurate in killing out-of-memory jobs

Questions
What questions do you have? guillimin@calculquebec.ca