Until now:
- access the cluster
- copy data to/from the cluster
- create parallel software
- compile code and use optimized libraries
How to run the software on the full cluster? tl;dr:
- submit a job to the scheduler
What is a job?
What is a job scheduler?
Job scheduler/resource manager: a piece of software which
- manages and allocates resources;
- manages and schedules jobs;
- and sets up the environment for parallel and distributed computing.
Example: two computers are available for 10h. "You go, then you go. You wait."
Resources:
- CPU cores
- Memory
- Disk space
- Network
- Accelerators
- Software
- Licenses
Slurm
- Free and open-source
- Mature
- Very active community
- Many success stories
- Runs 50% of TOP10 systems, including 1st
- Also an intergalactic soft drink
Other job schedulers
- PBSpro
- Torque/Maui
- Oracle (ex Sun) Grid Engine
- Condor
- ...
You will learn how to:
- Create a job
- Monitor the jobs
- Control your own job
- Get job accounting info
with Slurm
1. Make up your mind
- Job parameters: the resources you need, e.g. 1 core, 2GB RAM for 1 hour
- Job steps: the operations you need to perform, e.g. launch 'myprog'
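2. Write a submission script
It is a shell script (Bash):
- lines starting with #SBATCH: Bash sees these as comments, Slurm takes them as commands
- other lines starting with #: regular Bash comments
- lines starting with srun: job step creation
- everything else: regular Bash commands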
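As a minimal sketch of such a script, assume the request from step 1 (1 core, 2GB RAM for 1 hour) and a program called `myprog` in the submission directory. The file name `submit.sh` and the job name `test` are chosen here for illustration only. The snippet writes the script to disk and checks its syntax without submitting anything.

```shell
#!/bin/bash
# Write the submission script to submit.sh (hypothetical file name).
cat > submit.sh <<'EOF'
#!/bin/bash
# Job parameters: Bash ignores these lines, Slurm reads them.
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2048
#SBATCH --time=01:00:00

# Job step: launch the program on the allocated core.
srun ./myprog
EOF

# Parse the script without running it, to catch Bash syntax errors early.
bash -n submit.sh && echo "submit.sh looks syntactically fine"
```

`--mem-per-cpu` is given in megabytes here, and `--time` uses the hours:minutes:seconds form.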
Other useful parameters
You want: You ask
- To set a job name: --job-name=myjobname
- To attach a comment to the job: --comment="Some comment"
- To get emails: --mail-type=BEGIN|END|FAIL --mail-user=my@mail.com
- To set the name of the output file: --output=result-%j.txt --error=error-%j.txt
- To delay the start of your job: --begin=16:00, --begin=now+1hour, --begin=2010-01-20T12:34:00
- To specify an ordering of your jobs: --dependency=after(ok|notok|any):jobids, --dependency=singleton
- To control failure options: --no-kill, --no-requeue, --requeue
Constraints and resources
You want: You ask
- To choose a specific feature (e.g. a processor type or a NIC type): --constraint
- To use a specific resource (e.g. a GPU): --gres
- To reserve a whole node for yourself: --exclusive
- To choose a partition: --partition
3. Submit the script
- I submit with 'sbatch'
- Slurm gives me the JobID
- One more job parameter
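When a submission succeeds, sbatch replies with a line of the form "Submitted batch job 12345". The sketch below shows one way to keep the JobID in a variable for later monitoring. It assumes the script was saved as `submit.sh`; that file name and the helper `extract_jobid` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: pull the JobID out of sbatch's reply,
# which has the form "Submitted batch job 12345".
extract_jobid() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Only call sbatch where it is installed, i.e. on the cluster.
if command -v sbatch >/dev/null 2>&1 && [ -f submit.sh ]; then
    jobid=$(sbatch submit.sh | extract_jobid)
    echo "JobID: $jobid"
fi
```

On recent Slurm versions, `sbatch --parsable` prints the JobID without the surrounding text, which can replace the helper.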
So you can play
- Download http://www.cism.ucl.ac.be/services/formations/slurm.tgz with wget and untar it on hmem
- Compile the 'stress' program; you can use it to burn CPU time and memory: ./stress --cpu 1 --vm-bytes 128M --timeout 30s
- Write a job script
- Submit a job
- See it running
- Cancel it
- Get it killed
4. Monitor your job: squeue, sprio, sstat, sview
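A sketch of how the first three commands might be combined for a single job. The helper name `monitor_job` and the selection of output fields are choices made for this illustration, not something prescribed by Slurm.

```shell
#!/bin/bash
# Hypothetical helper: show queue state, priority and live usage of one job.
monitor_job() {
    local jobid=$1
    if ! command -v squeue >/dev/null 2>&1; then
        echo "Slurm commands not found; run this on the cluster" >&2
        return 1
    fi
    squeue --jobs="$jobid"    # state, elapsed time, nodes or pending reason
    sprio --jobs="$jobid"     # priority factors while the job is pending
    sstat --jobs="$jobid" --format=JobID,AveCPU,MaxRSS   # usage of running steps
}

# To list all of your own jobs instead:
# squeue --user="$USER"
```

Usage: `monitor_job 12345`, where 12345 stands for the JobID returned at submission.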
A word about backfill
The rule: a job with a lower priority can start before a job with a higher priority if it does not delay that job's start time.
[Figure: jobs drawn on a resources-versus-time chart, each labelled with its priority (100, 70, 60, 80, 10)]
A low-priority job with a short maximum run time and smaller requirements starts before a higher-priority job.
4. Monitor your job with sview: http://www.schedmd.com/slurmdocs/slurm_ug_2011/sview-users-guide.pdf
5. Control your job: scancel, scontrol, sview
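The following sketch lists a few typical control operations. So that it can be read and tried safely, it only prints the commands unless `DRY_RUN=0` is set; the `run` wrapper, the `DRY_RUN` variable and the JobID 12345 are placeholders introduced for this example.

```shell
#!/bin/bash
# Hypothetical wrapper: print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

jobid=${1:-12345}   # placeholder JobID for illustration

run scontrol show job "$jobid"                          # full job description
run scontrol hold "$jobid"                              # keep a pending job from starting
run scontrol release "$jobid"                           # let it be scheduled again
run scontrol update JobId="$jobid" TimeLimit=00:30:00   # shorten the time limit
run scancel "$jobid"                                    # remove the job from the queue
```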
5. Control your job with sview: http://www.schedmd.com/slurmdocs/slurm_ug_2011/sview-users-guide.pdf
6. Job accounting: sacct, sreport, sshare
The rules of fairshare
- A share is allocated to you: 1/nbusers
- If your actual usage is above that share, your fairshare value is decreased towards 0.
- If your actual usage is below that share, your fairshare value is increased towards 1.
- The actual usage taken into account decreases over time.
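To make these rules concrete, here is a small numerical sketch. It assumes the formula F = 2^(-usage/share), which is the simplest form of Slurm's classic fair-share calculation; whether a given cluster uses exactly this form depends on its configuration, so treat the numbers as an illustration only. The helper name `fairshare` is hypothetical, and usage is expressed as a fraction of the cluster between 0 and 1.

```shell
#!/bin/bash
# Hypothetical helper: fairshare value for a normalized usage (0..1)
# on a cluster with nbusers equal shares, assuming F = 2^(-usage/share).
fairshare() {
    awk -v usage="$1" -v nbusers="$2" 'BEGIN {
        share = 1 / nbusers
        printf "%.3f\n", 2 ^ (-usage / share)
    }'
}

fairshare 0    3   # no recent usage: the value is at its maximum, 1
fairshare 0.5  2   # usage equal to the share: the value is 0.5
fairshare 0.9  3   # usage well above the share: the value drops below 0.5
```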
A word about fairshare
Assume 3 users and a 3-core cluster:
- Red uses 1 core for a certain period of time
- Blue uses 2 cores for half that period
- Red uses 2 cores afterwards
[Figure: number of nodes used versus time]
Getting cluster info: sinfo, sjstat
Interactive work: salloc
Example: salloc --ntasks=4 --nodes=2
Summary
Explore the environment:
- Get node features (sinfo --Node --long)
- Get node usage (sinfo --summarize)
Submit a job:
- Define the resources you need (job script)
- Determine what the job should do (job script)
- Submit the job script (sbatch)
- View the job status (squeue)
- Get accounting information (sacct)
You will learn how to:
- Create a parallel job
- Request distributed resources
with Slurm
Concurrent - Parallel - Distributed
- Master/slave vs SPMD
- Synchronous vs asynchronous
- Message passing vs shared memory
Typical resource request
You want: You ask
- 16 independent processes (no communication): --ntasks=16
- 16 MPI processes, without caring where the cores are: --ntasks=16
- 16 cores spread across distinct nodes: --ntasks=16 --nodes=16
- 16 cores spread across distinct nodes and nobody else around: --ntasks=16 --nodes=16 --exclusive
- 16 processes spread across 8 nodes: --ntasks=16 --ntasks-per-node=2
- 16 processes on the same node: --ntasks=16 --ntasks-per-node=16
- one process that can use 16 cores for multithreading: --ntasks=1 --cpus-per-task=16
- 4 processes that can each use 4 cores: --ntasks=4 --cpus-per-task=4
- more constrained requests: --distribution=block|cyclic|arbitrary
Use case 1: Random sampling
- Your program draws random numbers and processes them sequentially
- Parallelism is obtained by launching the same program multiple times simultaneously
- Every process does the same thing
- No inter-process communication
- Results appended to one common file
Use case 1: Random sampling
- You want: 16 independent processes (no communication)
- You ask: --ntasks=16
- You use: srun ./myprog
Use case 1: Random sampling
- You want: 16 independent processes (no communication)
- You ask: --array=1-16 --output=res%a
- You merge with: cat res*
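A sketch of the job-array variant, assuming a program `myprog` that writes its samples to standard output. The file names `array.sh` and `all_results.txt` and the helper `merge_results` are hypothetical.

```shell
#!/bin/bash
# Write the array job script to array.sh (hypothetical file name).
cat > array.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --array=1-16
#SBATCH --output=res%a

# Each array task runs one independent copy of the program;
# its output lands in res1, res2, ..., res16.
./myprog
EOF

# Once all array tasks have finished, merge the per-task files.
merge_results() {
    cat res* > "${1:-all_results.txt}"
}
```

After `sbatch array.sh` has completed all 16 tasks, calling `merge_results` in the same directory gathers the output into one file.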
Use case 2: Multiple datafiles
- Your program processes data from one datafile
- Parallelism is obtained by launching the same program multiple times on distinct data files
- Everybody does the same thing on distinct data stored in different files
- No inter-process communication
- Results appended to one common file
Use case 2: Multiple datafiles
- You want: 16 independent processes (no communication)
- You ask: --ntasks=16
- You use: srun ./myprog $SLURM_PROCID
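One point to watch: when `$SLURM_PROCID` appears on the srun command line of a batch script, the batch shell expands it once, before srun starts the tasks, so every task receives the same value. A common way around this is to let each task read the variable itself, for instance in a small wrapper launched with `srun ./wrapper.sh`. The sketch below assumes data files matching `data*`; the wrapper name and the helper `pick_datafile` are hypothetical.

```shell
#!/bin/bash
# wrapper.sh (hypothetical name), launched once per task with: srun ./wrapper.sh
# Pick the data file matching this task's rank; ranks start at 0.
pick_datafile() {
    local rank=$1
    shift
    local files=("$@")
    echo "${files[$rank]}"
}

rank=${SLURM_PROCID:-0}          # set by srun for each task; 0 outside Slurm
datafile=$(pick_datafile "$rank" data*)
if [ -e "$datafile" ] && [ -x ./myprog ]; then
    ./myprog "$datafile"
else
    echo "nothing to run for rank $rank" >&2
fi
```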
Use case 2: Multiple datafiles
Useful commands: xargs and find/ls
- Single node: ls data* | xargs -n1 -P $SLURM_NPROCS myprog
- Multiple nodes: ls data* | xargs -n1 -P $SLURM_NTASKS srun -n1 -c1 myprog
- Safer: find . -maxdepth 1 -name 'data*' -print0 | xargs -0 -n1 -P...
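The "safer" find/xargs pattern can be tried on any machine before using it in a job. In this sketch, `echo processed` stands in for `myprog`, the data files are empty stand-ins created in a scratch directory, and the fallback of 2 parallel processes outside Slurm is an arbitrary choice.

```shell
#!/bin/bash
# Demo in a scratch directory with three empty stand-in data files.
workdir=$(mktemp -d)
touch "$workdir/data1" "$workdir/data2" "$workdir/data 3"

# Number of simultaneous processes: the task count inside a job, 2 otherwise.
nprocs=${SLURM_NTASKS:-2}

# 'echo processed' stands in for myprog; -print0 / -0 keep names with
# spaces intact, -n1 passes one file per invocation, -P runs them in parallel.
processed=$(find "$workdir" -maxdepth 1 -name 'data*' -print0 |
    xargs -0 -n1 -P "$nprocs" echo processed | sort)
echo "$processed"
```

The file named `data 3` shows why the null-separated form is safer: with `ls data* | xargs`, a name containing a space would be split into two arguments.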
Use case 2: Multiple datafiles
- You want: 16 independent processes (no communication)
- You ask: --array=1-16
- You use: $SLURM_ARRAY_TASK_ID
Use case 3: Parameter sweep
- Your program tests something for one particular value of a parameter
- Parallelism is obtained by launching the same program multiple times with a distinct identifier
- Everybody does the same thing except for a given parameter value based on the identifier
- No inter-process communication
- Results appended to one common file
Use case 3: Parameter sweep
- You want: 16 independent processes (no communication)
- You ask: --ntasks=16
- You use: srun ./myprog $SLURM_PROCID
Use case 3: Parameter sweep
- You want: 16 independent processes (no communication)
- You ask: --array=1-16 --output=res%a
- You use: $SLURM_ARRAY_TASK_ID, then cat res* to merge
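The identifier still has to be turned into actual parameter values. A possible sketch, assuming a sweep over a 4 x 4 grid: the parameter names `alpha` and `beta`, their values, and the helper `params_for_task` are all invented for this illustration.

```shell
#!/bin/bash
# Hypothetical parameter grid: 4 values of alpha x 4 values of beta = 16 runs.
alphas=(0.1 0.2 0.5 1.0)
betas=(A B C D)

# Map an array task ID in 1..16 to one (alpha, beta) pair.
params_for_task() {
    local index=$(( $1 - 1 ))                  # switch to 0-based counting
    local alpha=${alphas[$(( index / 4 ))]}    # changes every 4 tasks
    local beta=${betas[$(( index % 4 ))]}      # cycles through the 4 values
    echo "$alpha $beta"
}

# Inside the job script; defaults to task 1 when run outside Slurm.
read -r alpha beta <<< "$(params_for_task "${SLURM_ARRAY_TASK_ID:-1}")"
echo "task ${SLURM_ARRAY_TASK_ID:-1}: alpha=$alpha beta=$beta"
# ./myprog "$alpha" "$beta"
```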
Use case 3: Parameter sweep
Useful command: GNU Parallel
- Single node: parallel -j $SLURM_NPROCS myprog ::: {1..5} ::: {A..D}
- Multiple nodes: parallel -j $SLURM_NTASKS srun -n1 -c1 myprog ::: {1..5} ::: {A..D}
- Useful for checkpointing: parallel --joblog runtask.log --resume
- Useful for building file names: parallel echo data_{1}_{2}.dat ::: 1 2 3 ::: 1 2 3
Use case 4: Multithread
- Your program uses OpenMP or TBB
- Parallelism is obtained by launching a multithreaded program
- One program spawns itself on the node
- Inter-process communication by shared memory
- Results managed in the program, which outputs a summary
Use case 4: Multithread
- You want: one process that can use 16 cores for multithreading
- You ask: --ntasks=1 --cpus-per-task=16
- You use: OMP_NUM_THREADS=16 srun myprog
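Rather than repeating the number 16 in two places, the thread count can be taken from the allocation. This sketch relies on `SLURM_CPUS_PER_TASK`, which Slurm sets when `--cpus-per-task` is given; the fallback to 1 thread outside a job is an assumption made for this example.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 thread when the script runs outside a job.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OpenMP threads: $OMP_NUM_THREADS"

# Job step, to be enabled inside a real job:
# srun ./myprog
```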
Use case 5: Message passing
- Your program uses MPI
- Parallelism is obtained by launching a multi-process program
- One program spawns itself on several nodes
- Inter-process communication by the network
- Results managed in the program, which outputs a summary
Use case 5: Message passing
- You want: 16 processes for use with MPI
- You ask: --ntasks=16
- You use: module load openmpi, then mpirun myprog
Use case 6: Master/slave
- You have two types of programs: master and slave
- Parallelism is obtained by launching several slaves, managed by the master
- The master launches several slaves on distinct nodes
- Inter-process communication by the network or the disk
- Results managed in the master program, which outputs a summary
Use case 6: Master/slave
- You want: 16 processes with 16 threads each
- You ask: --ntasks=16 --cpus-per-task=16
- You use: --multi-prog + conf file
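A sketch of what the configuration file might look like, assuming executables named `master` and `slave` in the submission directory. The file name `multi.conf` and the `--rank` argument are hypothetical; the `%t` placeholder is expanded by srun to the task rank.

```shell
#!/bin/bash
# Write the configuration file for srun --multi-prog (hypothetical names).
# Each line reads: <task ranks> <program> [arguments]; %t expands to the task rank.
cat > multi.conf <<'EOF'
0     ./master
1-15  ./slave --rank=%t
EOF

# Inside a job script requesting --ntasks=16:
# srun --multi-prog multi.conf
```

Rank 0 runs the master and ranks 1 to 15 run the slaves, so the 16 requested tasks are all assigned a program.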
Summary
- Choose the number of processes: --ntasks
- Choose the number of threads: --cpus-per-task
- Launch processes with srun or mpirun
- Set multithreading with OMP_NUM_THREADS
- You can use $SLURM_PROCID and $SLURM_ARRAY_TASK_ID
Try
- Download the MPI hello world from Wikipedia, compile it, write a job script and submit it
- Rewrite the 'Multiple datafiles' example using xargs
- Rewrite the 'Parameter sweep' example using GNU Parallel