HPC User Information Exchange (HPC-Nutzer Informationsaustausch): The Workload Management System LSF

1 HPC User Information Exchange: The Workload Management System LSF

2 Content
- Cluster facts
- Job submission
- esub messages
- Scheduling strategies
- Tools and security
- Future plans

3 Some facts about the cluster
- about 2,000 hosts (1,400 from the BULL installation in 2011)
- about 35,000 cores
- hosts have between 12 and 128 cores and 24 to 2,048 GB of memory
- 10,000 to several 100,000 jobs in the queues
- waiting times depend on the job size and on how full the cluster is
  - from several minutes up to several weeks
- about 150 active users; ten times as many have been seen on the batch system over time
- about 240 million CPU hours per year (BULL installation)
- planning new additional installations in 2016 and [...]

4 Submission
- bsub < jobscript vs. bsub [options] command
  - please use jobscripts -> they are easier to read, which helps debugging
- the verification script
  - describes rules for jobs
    - project management
    - additional resources (bcs, phi, gpu ...)
  - and informs the user in various cases about decisions made
- we regularly review long-pending jobs
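
To illustrate the jobscript form recommended above, here is a minimal Python sketch that renders a small jobscript. The helper name `render_jobscript`, the file names, and all option values are assumptions made for illustration. The `#BSUB` directives used (-J, -o, -W, -M, -n) are standard LSF options, but the unit of -M and any further site rules depend on the cluster configuration.

```python
def render_jobscript(name, command, slots=1, wall_minutes=60, mem_mb=1024):
    """Return the text of a simple LSF jobscript (hypothetical helper)."""
    lines = [
        "#!/usr/bin/env bash",
        f"#BSUB -J {name}",          # job name
        f"#BSUB -o {name}.%J.log",   # output file; LSF expands %J to the job ID
        f"#BSUB -W {wall_minutes}",  # wall time limit in minutes
        f"#BSUB -M {mem_mb}",        # memory limit (MB assumed; unit is site-configured)
        f"#BSUB -n {slots}",         # number of slots
        "",
        command,                     # the actual program call
    ]
    return "\n".join(lines) + "\n"


script = render_jobscript("demo", "./a.out", slots=4)
print(script)
```

Saved to a file such as `demo.sh` (a hypothetical name), the script would be submitted with `bsub < demo.sh`. Keeping every option in the file is what makes a failed job easier to read and debug later.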

5 esub messages
- mail-related messages
  - ... is not a valid address, please use a correct address!
  - The mailbox is out of quota, ...
- prohibitive messages
  - Using a loginshell (-L) is not allowed!
  - Setting requeue values (-Q) is not allowed!
  - Stacksize must not be bigger than requested memory size.
  - You seem to use hpcwork but did not request it.
- project-related messages
  - The project is not (yet) active! ...
  - The project has used up all of its quota
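
The prohibitive messages above suggest a set of simple checks on each submission. The following Python sketch shows how such checks could look. It is not the actual esub script, which the slides do not show: the dictionary keys and the way hpcwork usage is detected (a plain text search in the jobscript) are assumptions.

```python
def check_job(job):
    """Return esub-style messages for a job request.

    `job` is a dict with hypothetical keys: login_shell, requeue_values,
    stack_mb, mem_mb, script, hpcwork_requested.
    """
    messages = []
    if job.get("login_shell"):
        messages.append("Using a loginshell (-L) is not allowed!")
    if job.get("requeue_values"):
        messages.append("Setting requeue values (-Q) is not allowed!")
    stack, mem = job.get("stack_mb"), job.get("mem_mb")
    if stack is not None and mem is not None and stack > mem:
        messages.append("Stacksize must not be bigger than requested memory size.")
    # Assumption: usage of hpcwork is guessed from the jobscript text.
    uses_hpcwork = "hpcwork" in (job.get("script") or "").lower()
    if uses_hpcwork and not job.get("hpcwork_requested"):
        messages.append("You seem to use hpcwork but did not request it.")
    return messages


for message in check_job({"stack_mb": 2048, "mem_mb": 1024, "login_shell": "/bin/bash"}):
    print(message)
```

An empty result means that none of the sketched rules objected to the request.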

6 Scheduling strategies / rules 1
- set default values if not given by the user
  - job name, output file, memory limit, wall time limit
- put jobs into the right queue depending on the request
  - exception: GPU queue
- request additional resources depending on the job type
  - BCS boards for bcs jobs
  - resource according to the requested project (ih exclusive)
- additional settings, e.g. for mic jobs on phi nodes
- describe the job through the job description (-Jd), which gets parsed
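
The first rule above, filling in defaults the user did not give, can be sketched in a few lines of Python. The default values and key names below are hypothetical. The slides name the four settings but not their actual values.

```python
# Hypothetical defaults; the real values used on the cluster are not in the slides.
DEFAULTS = {
    "job_name": "NONAME",
    "output_file": "output_%J.txt",
    "mem_mb": 512,
    "wall_minutes": 15,
}


def apply_defaults(job):
    """Return a copy of the request with missing settings filled in."""
    completed = dict(DEFAULTS)
    # Only values the user actually supplied override a default.
    completed.update({key: value for key, value in job.items() if value is not None})
    return completed


request = apply_defaults({"job_name": "solver", "mem_mb": None})
print(request)
```

The function returns a new dictionary instead of modifying the request, so the original user input stays available for messages about what was changed.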

7 Scheduling strategies / rules 2
- JARA jobs are scheduled to mpi-s by default
- a ptile can be defined automatically depending on the memory request
  - hosts with larger memory must be requested with -m mpi-l or -a bcs
- jobs requesting 32 or more slots are set exclusive
  - the job cannot be disturbed by other, possibly misbehaving jobs
- jobs get a priority depending on various factors (fairshare)
  - increases while pending
  - decreases with running jobs from the same LSF user group
  - initial priority defined by granted quota (rwth / jara) or provided hardware (ih fairshare)
- jobs of projects / users that are over quota get scheduled to the low queue
  - a cron job (run once a day) takes care of switching jobs between queues
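
The automatic ptile and the exclusivity rule above can be sketched as follows. The host size (12 cores, 24 GB) and the function names are assumptions for illustration. Only the threshold of 32 slots comes from the slides, and the formula actually used on the cluster is not given there.

```python
def auto_ptile(mem_per_slot_mb, host_mem_mb=24 * 1024, host_cores=12):
    """Largest number of slots per host that fits into memory and cores.

    Host size is an assumed example. A result of 0 means the request does
    not fit on such a host and a host with larger memory is needed.
    """
    if mem_per_slot_mb <= 0:
        raise ValueError("memory request must be positive")
    by_memory = host_mem_mb // mem_per_slot_mb
    return min(host_cores, by_memory)


def is_exclusive(slots, threshold=32):
    """Jobs requesting `threshold` or more slots are set exclusive."""
    return slots >= threshold


ptile = auto_ptile(4096)
resource = f"span[ptile={ptile}]"  # LSF resource requirement string for bsub -R
print(resource, is_exclusive(64))
```

With the assumed host, a request of 4,096 MB per slot is limited by memory, while a request of 1,024 MB per slot is limited by the core count.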

8 Tools and security
- no detailed job information for jobs of users of other user groups (projects)
  - but also no summary information in the jara-, rwth-, ... queues
  - -> use the tools jarajobs, rwthjobs, lecturejobs, ...
- r_batch_usage gives a per-month summary of the usage of the batch system
  - add -q to see your batch quota
  - add -p <projectid> to see the usage of a project, if you are allowed to submit to it

9 Future plans
- upgrade to a newer LSF version
  - change of concept from slots to tasks
  - consequences still need to be fully evaluated
- activation of CPU and memory affinity scheduling
  - bad jobs no longer influence good jobs as much
  - should increase the performance of jobs
- activation of power-aware scheduling
  - jobs running at lower frequencies reduce power costs
  - rewarded with fewer accounted core hours
- reporting for projects (first ih, later also jara, rwth, ...)
- lower the threshold for exclusivity of jobs

10 Many thanks for your attention!
