The Moab Scheduler
Dan Mazur, McGill HPC, daniel.mazur@mcgill.ca
Aug 23, 2013

2 Outline
- Fair resource sharing
  - Fairness
  - Priority
  - Maximizing resource usage
  - MAXPS fairness policy
- Minimizing queue times
  - Should I split up my long duration job?
  - Should I use procs=36 or nodes=3:ppn=12?
- Out of Memory


3 Job Scheduling Tetris
[Figure: jobs drawn as coloured blocks on axes Time and Cores; unused cores marked]
- Each colour = one job
- Some jobs can be split on the cores axis

4 Job Scheduling Tetris
[Figure: axes Time and Cores; unused cores marked]

5 Job Scheduling Tetris
[Figure: axes Time and Cores; blocks labelled high priority, lower priority, and low priority; unused cores marked]

6 Job Scheduling Tetris
[Figure: axes Time and Cores; unused cores marked]
- Backfill: a small, low priority job can run when higher priority jobs can't

7 Job Scheduling Tetris
[Figure: axes Time and Cores]
- Job cannot be split horizontally (e.g. nodes=m:ppn=n instead of procs=p)

8 Scheduling Considerations
- Maximize use of resources
  - Cores are kept busy
  - Maximize throughput of jobs
- Fairness
  - Ensure users have access to their allocations (Fairshare)
  - Avoid monopolization by one user/group (MAXPS)


9 Priority
- Moab sorts jobs by priority (showq -i)
- Runs jobs from the list until a job cannot be run immediately
- Moab computes the earliest this job can run
- Runs jobs that can finish before the highest priority job will start (backfill)
[Figure: axes Time and Cores; jobs ordered by priority]
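The priority-then-backfill procedure on this slide can be sketched in Python. This is a toy model under stated assumptions, not Moab's implementation: the `Job` class, the `schedule` function, and the single shared pool of cores are all hypothetical, and the backfill test uses only the rule stated above (a job may backfill if it fits now and finishes before the blocked job's earliest start).

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: float
    cores: int
    walltime: float  # requested walltime, in hours


def schedule(idle, running, total_cores, now=0.0):
    """Return the names of idle jobs to start at time `now`.

    running: list of (end_time, cores) for jobs that are already executing.
    """
    free = total_cores - sum(cores for _, cores in running)
    queue = sorted(idle, key=lambda j: j.priority, reverse=True)
    started = []

    # 1. Start jobs in priority order until one cannot run immediately.
    i = 0
    while i < len(queue) and queue[i].cores <= free:
        job = queue[i]
        free -= job.cores
        running = running + [(now + job.walltime, job.cores)]
        started.append(job.name)
        i += 1
    if i == len(queue):
        return started

    # 2. Compute the earliest time the blocked job can start, by releasing
    #    cores as running jobs reach the end of their requested walltime.
    blocked = queue[i]
    available = free
    earliest_start = now
    for end_time, cores in sorted(running):
        available += cores
        earliest_start = end_time
        if available >= blocked.cores:
            break

    # 3. Backfill: lower priority jobs that fit in the free cores now and
    #    finish before the blocked job is due to start.
    for job in queue[i + 1:]:
        if job.cores <= free and now + job.walltime <= earliest_start:
            free -= job.cores
            started.append(job.name)
    return started
```

For example, on an 8-core cluster with 6 cores busy until hour 10, a high priority 8-core job is blocked, a 2-core job requesting 4 hours is backfilled, and a 2-core job requesting 20 hours is not.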


10 Priority Factors
- On Guillimin:
  - Time in queue (weight = 1)
  - FairShare (i.e. group's historical usage) (weight = 5)
- In total, 41 factors affecting priority are documented in Moab
[Figure: Priority versus Time in queue, with the queue time component and the FairShare component labelled]
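As a rough illustration of how the two weights combine, the sketch below adds a queue time term and a FairShare term. Only the weights (1 and 5) come from the slide. The function name, the units (minutes in queue, percentage points of usage), and the form of the FairShare term (target minus usage) are assumptions made for illustration.

```python
QUEUETIME_WEIGHT = 1   # from the slide
FAIRSHARE_WEIGHT = 5   # from the slide


def job_priority(minutes_in_queue, fs_target_pct, fs_usage_pct):
    """Toy priority: weighted sum of a queue time and a FairShare component.

    Assumption: the FairShare component is positive when the group has used
    less than its target share and negative when it has used more.
    """
    fairshare_component = fs_target_pct - fs_usage_pct
    return (QUEUETIME_WEIGHT * minutes_in_queue
            + FAIRSHARE_WEIGHT * fairshare_component)
```

Under these assumptions, two jobs that have both waited 60 minutes differ only through their groups' recent usage: a group 6 points under its target scores 90, while a group 10 points over scores 10.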

11 Fair Share
- Priority based on the account's (i.e. group's) recent historical usage
- Most heavily weighted component of priority on Guillimin
- Looks at the past 30 days
- Weighted: yesterday's usage is more important than usage 3 weeks ago
- Fair Share target usage = your allocation

12 Fair Share
- Guillimin:
  - Fairshare decay = 0.9
  - Fairshare interval = 1 day
  - Fairshare depth = 30 days
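The three Guillimin settings can be read as a decayed sum over daily usage windows. The sketch below assumes that each one-day interval is weighted by the decay factor raised to its age in days, which is one common reading of these parameters. The function names are hypothetical and this is not taken from Moab's source.

```python
FS_DECAY = 0.9   # Fairshare decay
FS_DEPTH = 30    # Fairshare depth, in one-day intervals


def effective_usage(daily_usage, decay=FS_DECAY, depth=FS_DEPTH):
    """Decay-weighted usage; daily_usage[0] is the most recent day."""
    window = daily_usage[:depth]
    return sum(usage * decay ** age for age, usage in enumerate(window))


def usage_share(group_daily, cluster_daily):
    """Group's decay-weighted share of total cluster usage (0.0 to 1.0)."""
    total = effective_usage(cluster_daily)
    return effective_usage(group_daily) / total if total else 0.0
```

With a decay of 0.9, usage from 21 days ago carries a weight of about 0.11 relative to the most recent day, and anything older than the 30-day depth is ignored entirely.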

13 Showstart
- The showstart command attempts to predict job start time
  - It does not know about jobs with higher priority that haven't been submitted yet but will run before your job
  - It does not know about jobs that will be cancelled or finish before their walltime
  - It does not know about increases to job walltime
  - It is usually very optimistic and inaccurate


14 MAXPS
- We limit the number of outstanding processor seconds a group can schedule
  - Tetris: limit on the total area your group can use
  - Fairness: prevents accumulation of queue time priority for jobs that are beyond a group's quota
- Default MAXPS = 900 core days (soft), 1800 core days (hard)
  - 900 core days = 30 cores x 30 days
- Soft limit: the job is "blocked due to MAXPS limit exceeded" until outstanding scheduled processor time is reduced below MAXPS or the cluster has no other jobs to run
- Hard limit: the job will not run
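The MAXPS arithmetic can be checked with a short Python sketch. It assumes that outstanding processor seconds are the sum of cores times remaining requested walltime over a group's scheduled jobs. The function names and the status strings are hypothetical; only the two default limits come from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SOFT_LIMIT_PS = 900 * SECONDS_PER_DAY    # 900 core days (soft)
HARD_LIMIT_PS = 1800 * SECONDS_PER_DAY   # 1800 core days (hard)


def outstanding_ps(jobs):
    """jobs: iterable of (cores, remaining_walltime_seconds) for one group."""
    return sum(cores * remaining for cores, remaining in jobs)


def maxps_status(jobs):
    """Classify a group's scheduled work against the default limits."""
    ps = outstanding_ps(jobs)
    if ps > HARD_LIMIT_PS:
        return "hard limit exceeded"
    if ps > SOFT_LIMIT_PS:
        return "soft limit exceeded"
    return "within limits"
```

A single 30-core job requesting 30 days sits exactly at the soft limit (30 x 30 = 900 core days), so under this sketch adding even one more core day to the group's scheduled work would exceed it.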


15 MAXPS blocked - What can I do?
- Use the command 'showq -w acct=abc-123-aa'
  - Which running and idle jobs from your group are using up the 900 core days (default) MAXPS window?
  - Cancel large jobs with low priority to your research
  - Contact greedy group members
- Sometimes a single job violates the MAXPS limitation
  - Do you need that much walltime / that many cores?
  - Can the job be split into several smaller jobs?


16 Splitting jobs in time and cores
- Users want short queue times to achieve fast time-to-solution
- Caveats about the following information
  - Based on our aggregate data, not controlled experimentation
  - No control for dependencies between jobs, group priority, etc.
  - Seeking qualitative insight, not quantitative conclusions
  - All axes and colours are logarithmic


17 Should I split up my long job? (Splitting in time)
- Long duration job = long queue time
- Short duration job = short queue time
- Should I split up my long job into shorter jobs to get a faster time-to-solution?

18 Should I split up my long job?
- Long duration job = long queue time
- Short duration job = short queue time
- Should I split up my long job into shorter jobs to get a faster time-to-solution?
  - Submit jobs with a chain of dependencies
  - Jobs don't accumulate queue time priority until all dependencies are resolved
  - We would need the sum of all queue times of the partial jobs to be less than the queue time of the full job
- Note: Without dependencies, users can burst well above allocation for short durations by submitting lots of short duration jobs
- If embarrassingly parallel, splitting up your jobs is usually a great idea!

19 [Figure: panels "Single Core Jobs" and "Multi-core jobs"]
- Compare the slope of the (solid) trendline to the slope of the (dotted) queue time = requested walltime line

20 Should I split up my long job?
- Almost always, the sum of the queue times for the partial jobs will be longer than the queue time for the full job
- Do not split up your long job
- Do enable checkpointing on your long job
- Tip: one last checkpoint
  msub -l signal=sighup@2:00
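On the application side, the "one last checkpoint" tip needs the job to react to the signal. Below is a minimal Python sketch, assuming the SIGHUP requested with the msub option above actually reaches the Python process (depending on how the job script launches the program, the script may have to forward it). The names `on_sighup`, `install_handler`, and `run`, and the `do_step` and `save_checkpoint` callbacks, are hypothetical.

```python
import signal

final_checkpoint_requested = False


def on_sighup(signum, frame):
    """Signal handler: only set a flag; the main loop does the saving."""
    global final_checkpoint_requested
    final_checkpoint_requested = True


def install_handler():
    """Register the handler. Call once at startup, from the main thread."""
    signal.signal(signal.SIGHUP, on_sighup)


def run(total_steps, do_step, save_checkpoint):
    """Run steps until finished or until a final checkpoint is requested.

    Returns the number of steps completed.
    """
    for step in range(total_steps):
        do_step(step)
        if final_checkpoint_requested:
            save_checkpoint(step)  # last checkpoint before walltime expires
            return step + 1
    return total_steps
```

Setting a flag in the handler and saving from the main loop keeps the checkpoint from being written in the middle of a step. The step length then has to be short enough for the save to finish within the warning period.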


21 Procs or nodes:ppn? (Splitting in cores)
- nodes:ppn = Better hardware performance
  - Minimize network traffic
  - Minimize chance of failure
- procs = Less time in queue
  - Job can be split up to fit in awkward spaces
- How can you get the fastest time-to-solution?

22 [Figure]
- Jobs submitted with procs (white trendline)
- Jobs submitted with nodes:ppn (yellow trendline)

23 Procs or nodes:ppn?
- Depends strongly on application and current cluster load
- Example: 10,000 core hour job (big job - cores for 4 days)
  - Spends ~6 extra hours in the queue using nodes:ppn instead of procs
  - Embarrassingly parallel -> use procs
  - Lots of network communication -> use nodes:ppn
- Example: 70 core hours (small job - 6 cores for 12 hours)
  - About the same queue time using nodes:ppn or procs
  - Use nodes:ppn to get better hardware performance
- Example: 10 core hours (very small job - 1 core for 10 hours)
  - Very small jobs are more likely to run immediately with procs than with nodes:ppn
  - If resources aren't available, the wait times are similar

24 Procs or nodes:ppn?
- Use nodes:ppn for most jobs
- For big jobs (~10,000 core hours) with embarrassing parallelism (little or no network communication), results arrive several hours sooner with procs
  - Also consider splitting tasks into separate jobs
- For very small jobs (~10 core hours), there is a greater likelihood of running immediately (backfilling) with procs
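These rules of thumb can be written down as a small helper. The sketch below is illustrative only: the slide gives approximate sizes (~10,000 and ~10 core hours), so the hard cutoffs, the function name, and its parameters are assumptions, and the right choice still depends on the application and the current cluster load.

```python
BIG_JOB_CORE_HOURS = 10_000      # assumed cutoff for "big" (~10,000 on the slide)
VERY_SMALL_JOB_CORE_HOURS = 10   # assumed cutoff for "very small" (~10 on the slide)


def suggest_request(cores, walltime_hours, embarrassingly_parallel):
    """Return "procs" or "nodes:ppn" following the slide's rules of thumb."""
    core_hours = cores * walltime_hours
    if core_hours >= BIG_JOB_CORE_HOURS and embarrassingly_parallel:
        return "procs"       # results several hours sooner
    if core_hours <= VERY_SMALL_JOB_CORE_HOURS:
        return "procs"       # more likely to backfill immediately
    return "nodes:ppn"       # default: better hardware performance
```

For the small-job example from the previous slide, `suggest_request(6, 12, True)` falls through to `"nodes:ppn"`, while a 1-core, 10-hour job gets `"procs"`.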

25 Out of Memory
- Moab seems to have improved its algorithm for detecting memory overuse
- Some previously working jobs will now *correctly* be killed
- Use the -M moab option to get notified:
  PBS Job Id: ########.gm-1r14-n05.guillimin.clumeq.ca
  Job Name: JobName
  Exec host: QQ-#r##-n##/#
  job deleted
  Job deleted at request of job ######## exceeded MEM usage hard limit ([MB Used per reserved core] > [MB Limit per core])

26 Out of Memory
- We also have our own scripts to detect out of memory jobs
- Our scripts will always send an email:
  Subject: Job terminated due to excessive memory usage
  Your job was using a total of kb of memory on node sw-2r15-n02.

27 Summary
Today we learned:
- How priority is assigned to jobs
- How fair share priority is calculated
- How Moab uses priority to decide which job to run
- How backfilling works
- That you should not split up big jobs to save queue time
- That you should sometimes use procs instead of nodes:ppn
- That Moab is now more accurate in killing out-of-memory jobs

28 Questions
- What questions do you have?
