Using the general purpose computing clusters at EPFL

Using the general purpose computing clusters at EPFL scitas.epfl.ch March 5, 2015

1 / 1 Bellatrix Frontend at bellatrix.epfl.ch 16 x 2.2 GHz cores per node 424 nodes with 32GB Infiniband QDR network The batch system is SLURM RedHat 6.5

Castor I Frontend at castor.epfl.ch I 16 x 2.6 GHz cores per node I 50 nodes with 64GB I 2 nodes with 256GB I For sequential jobs (Matlab etc.) I The batch system is SLURM I RedHat 6.5 2/1

3 / 1 Deneb Frontend at deneb.epfl.ch 16 x 2.6 GHz cores per node 376 nodes with 64GB 8 nodes with 256GB 2 nodes with 512GB and 32 cores 16 nodes with 4 Nvidia K40 GPUs Infiniband QDR network The batch system is SLURM RedHat 6.5

4 / 1 Storage /home filesystem has per user quotas will be backed up for important things (source code, results and theses) shared by all clusters /scratch high performance temporary space is not backed up is organised by laboratory local to cluster

5 / 1 Connection Start the X server (automatic on a Mac) Open a terminal ssh -Y username@castor.epfl.ch Try the following commands: id pwd quota ls /scratch/<group>/<username>

6 / 1 The batch system Goal: to take a list of jobs and execute them when appropriate resources become available SCITAS uses SLURM on its clusters: http://slurm.schedmd.com The configuration depends on the purpose of the cluster (serial vs parallel)

7 / 1 sbatch The fundamental command is sbatch sbatch submits jobs to the batch system Suggested workflow: create a short job-script submit it to the batch system

8 / 1 sbatch - exercise Copy the first two examples to your home directory cp /scratch/examples/ex1.run. cp /scratch/examples/ex2.run. Open the file ex1.run with your editor of choice

9 / 1 ex1.run #!/bin/bash #SBATCH --workdir /scratch/<group>/<username> #SBATCH --nodes 1 #SBATCH --ntasks 1 #SBATCH --cpus-per-task 1 #SBATCH --mem 1024 sleep 10 echo "hello from $(hostname)" sleep 10

10 / 1 ex1.run #SBATCH is a directive to the batch system --nodes 1 the number of nodes to use - on Castor this is limited to 1 --ntasks 1 the number of tasks (in an MPI sense) to run per job --cpu-per-task 1 the number of cores per aforementioned task --mem 4096 the memory required per node in MB --time 12:00:00 --time 2-6 the time required # 12 hours # two days and six hours

11 / 1 Running ex1.run The job is assigned a default runtime of 15 minutes $ sbatch ex1.run Submitted batch job 439 $ cat /scratch/<group>/<username>/slurm-439.out hello from c03

12 / 1 What went on? sacct -j <JOB_ID> sacct -l -j <JOB_ID> Or more usefully: Sjob <JOB ID>

13 / 1 Cancelling jobs To cancel a specific job: scancel <JOB_ID> To cancel all your jobs: scancel -u <username>

14 / 1 ex2.run #!/bin/bash #SBATCH --workdir /scratch/<group>/<username> #SBATCH --nodes 1 #SBATCH --ntasks 1 #SBATCH --cpus-per-task 8 #SBATCH --mem 122880 #SBATCH --time 00:30:00 #SBATCH --mail-type=all #SBATCH --mail-user=<email_address> /scratch/examples/linpack/runme_1_45k

15 / 1 What s going on? squeue squeue -u <username> Squeue Sjob <JOB_ID> scontrol -d show job <JOB_ID> sinfo

16 / 1 squeue and Squeue squeue Job states: Pending, Resources, Priority, Running squeue grep <JOB_ID> squeue -j <JOB_ID> Squeue <JOB_ID>

17 / 1 Sjob $ Sjob <JOB_ID> JobID JobName Cluster Account Partition Timelimit User Group ------------ ---------- ---------- ---------- ---------- ---------- --------- --------- 31006 ex1.run castor scitas-ge serial 00:15:00 jmenu scitas-ge 31006.batch batch castor scitas-ge Submit Eligible Start End ------------------- ------------------- ------------------- ------------------- 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08 Elapsed ExitCode State ---------- -------- ---------- 00:00:20 0:0 COMPLETED 00:00:20 0:0 COMPLETED NCPUS NTasks NodeList UserCPU SystemCPU AveCPU MaxVMSize ---------- -------- --------------- ---------- ---------- ---------- ---------- 1 c04 00:00:00 00:00.001 1 1 c04 00:00:00 00:00.001 00:00:00 207016K

18 / 1 scontrol $ scontrol -d show job <JOB_ID> $ scontrol -d show job 400 obid=400 Name=s1.job UserId=user(123456) GroupId=group(654321) Priority=111 Account=scitas-ge QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27 StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27 PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=serial AllocNode:Sid=castor:106310 ReqNodeList=(null) ExcNodeList=(null) NodeList=c03 BatchHost=c03 NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=c03 CPU IDs=0 Mem=1024 MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) Shared=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/<user>/jobs/s1.job WorkDir=/scratch/<group>/<user>

19 / 1 Modules Modules make your life easier module avail module show <take your pick> module load <take your pick> module list module purge module list

20 / 1 ex3.run - Mathematica Copy the following files to your chosen directory: cp /scratch/examples/ex3.run. cp /scratch/examples/mathematica.in. Submit ex3.run to the batch system and see what happens...

21 / 1 ex3.run #!/bin/bash #SBATCH --ntasks 1 #SBATCH --cpus-per-task 1 #SBATCH --nodes 1 #SBATCH --mem 4096 #SBATCH --time 00:05:00 echo STARTING AT date module purge module load mathematica/9.0.1 math < mathematica.in echo FINISHED at date

22 / 1 Compiling ex4.* source files Copy the following files to your chosen directory: /scratch/examples/ex4_readme.txt /scratch/examples/ex4.c /scratch/examples/ex4.cxx /scratch/examples/ex4.f90 /scratch/examples/ex4.run Then compile them with: module load intelmpi/4.1.3 mpiicc -o ex4 c ex4.c mpiicpc -o ex4 cxx ex4.cxx mpiifort -o ex4 f90 ex4.f90

23 / 1 The 3 methods to get interactive access 1/3 In order to schedule an allocation use salloc with exactly the same options for resources as sbatch You will then arrive in a new prompt which is still on the submission node but by using srun you can get access to the allocated resources eroche@castor:hello > salloc -N 1 -n 2 salloc: Granted job allocation 1234 bash-4.1$ hostname castor bash-4.1$ srun hostname c03 c03

24 / 1 The 3 methods to get interactive access 2/3 To get a prompt on the machine one needs to use the --pty option with srun and then bash -i (or tcsh -i ) to get the shell: eroche@castor > salloc -N 1 -n 1 salloc: Granted job allocation 1235 eroche@castor > srun --pty bash -i bash-4.1$ hostname c03

25 / 1 The 3 methods to get interactive access 3/3 This is the least elegant but it is the method by which one can run X11 applications: eroche@bellatrix > salloc -n 1 -c 16 -N 1 salloc: Granted job allocation 1236 bash-4.1$ srun hostname c04 bash-4.1$ ssh -Y c04 eroche@c04 >

26 / 1 The SLURM jobs environment SLURM defines a number of variables, among them: jmenu@castor: > srun env grep SLURM SLURM SUBMIT DIR=/home/jmenu SLURM SUBMIT HOST=castor SLURM JOB ID=90771 SLURM JOB NUM NODES=1 SLURM JOB NODELIST=c05 SLURM JOB CPUS PER NODE=1 SLURM MEM PER CPU=1024 SLURM JOBID=90771... SLURM NODELIST=c05 SLURM TASKS PER NODE=1 SLURM NTASKS=1 SLURM NPROCS=1... SLURM CPUS ON NODE=1 SLURM TASK PID=72868... SLURMD NODENAME=c05

27 / 1 Dynamic libs used in an application ldd displays the libraries an executable file depends on: jmenu@castor: ~ /COURS > ldd ex4 f90 linux-vdso.so.1 => (0x00007fff4b905000) libmpigf.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpigf.so.4 (0x00007f556cf88000) libmpi.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpi.so.4 (0x00007f556c91c000) libdl.so.2 => /lib64/libdl.so.2 (0x0000003807e00000) librt.so.1 => /lib64/librt.so.1 (0x0000003808a00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003808200000) libm.so.6 => /lib64/libm.so.6 (0x0000003807600000) libc.so.6 => /lib64/libc.so.6 (0x0000003807a00000) libgcc s.so.1 => /lib64/libgcc s.so.1 (0x000000380c600000) /lib64/ld-linux-x86-64.so.2 (0x0000003807200000)

28 / 1 ex4.run #!/bin/bash... module purge module load intelmpi/4.1.3 module list echo LAUNCH DIR=/scratch/scitas-ge/jmenu EXECUTABLE="./ex4 f90" echo "--> LAUNCH DIR = ${LAUNCH DIR}" echo "--> EXECUTABLE = ${EXECUTABLE}" echo echo "--> ${EXECUTABLE} depends on the following dynamic libraries:" ldd ${EXECUTABLE} echo cd ${LAUNCH DIR} srun ${EXECUTABLE}...

29 / 1 The debug partition In order to have priority access for debugging sbatch --partition=debug ex1.run Limits on Castor: 1 hours walltime 16 cores between all users

30 / 1 General: cgroups (Castor) cgroups ( control groups ) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups SLURM: Linux cgroups apply contraints to the CPUs and memory that can be used by a job They are automatically generated using the resource requests given to SLURM They are automatically destroyed at the end of the job, thus releasing all resources used Even if there is physical memory available a task will be killed if it tries to exceed the limits of the cgroup!

31 / 1 System process view Two tasks running on the same node with ps auxf root 177873 slurmstepd: [1072] user 177877 \_ /bin/bash /var/spool/slurmd/job01072/slurm_script user 177908 \_ sleep 10 root 177890 slurmstepd: [1073] user 177894 \_ /bin/bash /var/spool/slurmd/job01073/slurm_script user 177970 \_ sleep 10 Check memory, thread and core usage with htop

32 / 1 Fair share 1/3 The scheduler is configured to give all groups a share of the computing power Within each group the members have an equal share by default: jmenu@castor:~ > sacctmgr show association where account=lacal format=account,cluster,user,grpnodes, QOS,DefaultQOS,Share tree Account Cluster User GrpNodes QOS Def QOS Share -------------------- ---------- ---------- -------- -------------------- --------- --------- lacal castor normal 1 lacal castor aabecker debug,normal normal 1 lacal castor kleinjun debug,normal normal 1 lacal castor knikitin debug,normal normal 1 Priority is based on recent usage this is forgotten about with time (half-life) fair share comes into play when the resources are heavily used

33 / 1 Fair share 2/3 Job priority is a weighted sum of various factors: jmenu@castor:~ > sprio -w JOBID PRIORITY AGE FAIRSHARE QOS Weights 1000 10000 100000 jmenu@bellatrix:~ > sprio -w JOBID PRIORITY AGE FAIRSHARE JOBSIZE QOS Weights 1000 10000 100 100000 To compare jobs priorities: jmenu@castor:~ > sprio -j80833,77613 JOBID PRIORITY AGE FAIRSHARE QOS 77613 145 146 0 0 80833 9204 93 9111 0

34 / 1 Fair share 3/3 FairShare values range from 0.0 to 1.0: Value Meaning 0.0 you used much more resources that you were granted 0.5 you got what you paid for 1.0 you used nearly no resources jmenu@bellatrix:~ > sshare -a -A lacal Accounts requested: : lacal Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare -------------------- ---------- ---------- ----------- ----------- ------------- ---------- lacal 666 0.097869 1357691548 0.256328 0.162771 lacal boissaye 1 0.016312 0 0.042721 0.162771 lacal janson 1 0.016312 0 0.042721 0.162771 lacal jetchev 1 0.016312 0 0.042721 0.162771 lacal kleinjun 1 0.016312 1357691548 0.256328 0.000019 lacal pbottine 1 0.016312 0 0.042721 0.162771 lacal saltini 1 0.016312 0 0.042721 0.162771 More information at: http://schedmd.com/slurmdocs/priority_multifactor.html

35 / 1 Helping yourself man pages are your friend! man sbatch man sacct man gcc module load intel/14.0.1 man ifort

36 / 1 Getting help If you still have problems then send a message to: 1234@epfl.ch Please start the subject with HPC for automatic routing to the HPC team Please give as much information as possible including: the jobid the directory location and name of the submission script where the slurm-*.out file is to be found how the sbatch command was used to submit it the output from env and module list commands

37 / 1 Appendix Change your shell at: https://dinfo.epfl.ch/cgi-bin/accountprefs Scitas web site: http://scitas.epfl.ch