Efficient cluster computing
Introduction to the Sun Grid Engine (SGE) queuing system
Markus Rampp (RZG, MIGenAS)
MPI for Evolutionary Anthropology, Leipzig, Feb. 16, 2007
Outline
  - Introduction
  - Basic concepts: queues, jobs, scripts
  - Essential SGE commands and options
  - Advanced topics
      - Job chains
      - Array jobs
      - DRMAA API
  - Tips & Tricks, References
  Not covered: SGE configuration & administration, policies, accounting, grid computing, MPI, ...
Introduction
Sun Grid Engine (SGE): a popular batch-queuing system
  "Software like SGE is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses." (taken from Wikipedia)
Popular batch systems (DRMs, distributed resource managers):
  - Sun Grid Engine (open source)
  - LoadLeveler (IBM)
  - NQS (Cray, NEC)
  - DQS (open source)
  - ...
Introduction (2)
Why should one use a DRM? To increase efficiency.
Operator's perspective:
  - transparent resource management
  - clustering of compute resources
  - load balancing, optimization of resource usage
  - fair (policy-based) distribution of resources
  - accounting
User's perspective:
  - shared usage of system resources
  - optimize throughput
  - organize/simplify handling of ("large") computational tasks
  - enhanced stability (survive system crashes, maintenance, ...)
  - well-defined resource allocation (benchmarking)
  - facilitates (non-interactive) work
Basic concepts
Queues:
[Figure: queues (Queue 1, Queue 2; Queue 1, Queue 2, Queue 3) attached to Resource A and Resource B]

  queuename                      qtype  used/tot.  load_avg  arch        states
  all.q@e01.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86    d
  all.q@e02.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e03.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e04.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e05.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e06.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e07.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  normal@eva001.opt.rzg.mpg.de   B      0/0        0.09      lx26-amd64
  normal@eva002.opt.rzg.mpg.de   B      0/4        0.00      lx26-amd64
  normal@eva003.opt.rzg.mpg.de   B      0/4        0.00      lx26-amd64
  normal@eva004.opt.rzg.mpg.de   B      0/4        0.00      lx26-amd64
  normal@eva005.opt.rzg.mpg.de   B      0/4        0.00      lx26-amd64
Basic concepts (2)
Jobs & scripts
  1. prepare a script of executable commands
  2. specify resources and meta information
  3. submit to the batch system (returns a job ID)
  4. use the job ID for job control (query status, cancel, ...)

  #$ -S /bin/sh
  #$ -cwd
  #$ -M mjr@rzg.mpg.de
  #$ -m e
  #$ -N example
  # begin executable commands (shell specified by #$ -S)
  # note: starting here, a leading # starts
  # a comment, whereas in the above
  # SGE header it does NOT
  echo "starting job..."
  blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt
  blastall -p blastp -d nr -i query_2.fa -o blastout_2.txt
  echo "...done"

  > qsub example_1.sge
  Your job 10404 ("example") has been submitted.
  > qstat
  job-id  prior    name     user  state  submit/start at      queue                    slots  ja-task-id
  -------------------------------------
  10404   0.00000  example  mjr   qw     02/12/2007 11:43:42                           1
  > qstat
  job-id  prior    name     user  state  submit/start at      queue                    slots  ja-task-id
  -------------------------------------
  10404   0.55500  example  mjr   r      02/12/2007 12:19:26  all.q@e13.bc.rzg.mpg.de  1
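Steps 3 and 4 above hinge on the job ID that qsub reports. A minimal sketch of how a wrapper script might pick that ID out of the confirmation message shown on this slide; the function name extract_job_id is hypothetical, and the message format is assumed to match the examples in this talk (it may differ between SGE versions):

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): extract the numeric job ID from
# the confirmation line printed by qsub, for example
#   Your job 10404 ("example") has been submitted.
# For array jobs the ID is followed by the task range (10420.1-500:1);
# only the leading number is returned. Prints nothing if the line does
# not look like a qsub confirmation.
extract_job_id() {
    printf '%s\n' "$1" |
        sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*/\2/p'
}
```

Possible use: jobid=$(extract_job_id "$(qsub example_1.sge)"), followed by qstat -j "$jobid" or qdel "$jobid".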
SGE commands & options
Interacting with the queuing system: SGE's q-commands (cf. the ll-commands of LoadLeveler)

  SGE command    purpose                                                    LoadLeveler
  qsub           submit job                                                 llsubmit
  qstat          query queue/job status                                     llq, llstatus
  qdel           delete job                                                 llcancel
  qhold          hold ("suspend") job (note: user/operator/system holds)    llhold
  qrls           release holds                                              llhold -r
  qalter, qmod   modify job                                                 llmodify
  qhost          provide concise system overview
  qmon           graphical user interface (X)
SGE commands & options (2)
Specify qsub options in the script header and/or on the command line (the command line overrides the script).
Essential options for qsub:
  -S              path to shell
  -m b|e|a|s|n    send mail at beginning (b), end (e), abort (a) or suspension (s) of the job; n: no mail
  -M              e-mail address for notification
  -N              name of job
  -j y            join stdout and stderr
Additional options for qsub:
  -q              queue
  -p              priority (default 0; users may only decrease)
  -P              name of project
  -a              earliest date/time at which a job is eligible for execution
  ...             cf. man qsub
SGE commands & options (3)
Commonly used options for qstat:
  qstat                            displays list of jobs only
  qstat -u <user> | -j <job ID>    displays list of jobs for the specified user/job
  qstat -f                         full format display
  qstat -r                         extended display (incl. resource requirements, scheduling info)

  > qstat -f
  queuename                      qtype  used/tot.  load_avg  arch        states
  all.q@e01.bc.rzg.mpg.de        BIP    0/2        0.04      lx26-x86    d
  all.q@e02.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  ...
  all.q@e07.bc.rzg.mpg.de        BIP    0/2        0.00      lx26-x86
  all.q@e08.bc.rzg.mpg.de        BIP    2/2        3.67      lx26-x86
    10422 0.56000 megablast hfz r 02/13/2007 20:34:12 1 882
    10422 0.56000 megablast hfz r 02/13/2007 20:34:12 1 883
  all.q@e09.bc.rzg.mpg.de        BIP    2/2        4.85      lx26-x86
    10422 0.56000 megablast hfz r 02/13/2007 20:28:27 1 864
    10422 0.56000 megablast hfz r 02/13/2007 20:31:57 1 875
  ...
  ############################################################################
   - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
  ############################################################################
    10422 0.55242 megablast hfz qw 02/13/2007 16:34:29 1 886-5000:1
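For a quick overview it can help to condense the job list into one count per state. The following is a sketch only: count_states is a hypothetical helper, and it assumes the column layout of the plain qstat listings shown in this talk (job ID in column 1, state in column 5), which may differ on other installations.

```shell
#!/bin/sh
# Hypothetical helper: read a qstat job listing on stdin and print one
# line per job state with the number of listed entries in that state.
# Header and separator lines are skipped because their first field is
# not a number. Note that the pending part of an array job (for example
# tasks 210-500:1) appears as a single line and is counted once.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Possible use: qstat -u "$USER" | count_states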
SGE commands & options (4)
  > qstat -r -j 10422
  ==============================================================
  job_number:       10422
  exec_file:        job_scripts/10422
  submission_time:  Tue Feb 13 16:34:29 2007
  owner:            hfz
  uid:              1553
  group:            rzb
  gid:              4131
  sge_o_home:       /afs/ipp/home/h/hfz
  sge_o_log_name:   hfz
  sge_o_path:       /opt/sge6/bin/lx26-x86:/usr/local/bin:/opt/gnome/bin:/usr/games:/usr/bin/x11:/usr/bin:/bin
  sge_o_shell:      /bin/tcsh
  sge_o_workdir:    /bio/tmp/hfz/sargossa
  sge_o_host:       e01
  account:          sge
  cwd:              /bio/tmp/hfz/sargossa
  path_aliases:     /tmp_mnt/ * * /
  mail_list:        hfz@e01.bc.rzg.mpg.de
  notify:           FALSE
  job_name:         megablast
  jobshare:         0
  shell_list:       /bin/sh
  env_list:
  script_file:      /afs/ipp/home/h/hfz/mysql/sequenzen/e01/submit_megablast_test.sge
  project:          gendb
  job-array tasks:  1-5000:1
  usage 808:        cpu=00:08:46, mem=595.19275 GBs, io=0.00000, vmem=1.377g, maxvmem=1.566g
  ..
  usage 875:        cpu=00:00:24, mem=28.67529 GBs, io=0.00000, vmem=1.251g, maxvmem=1.251g
  scheduling info:
    queue instance "all.q@e01.bc.rzg.mpg.de" dropped because it is disabled
    queue instance "all.q@f12.bc.rzg.mpg.de" dropped because it is
    queue instance "all.q@e13.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@f08.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@f01.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@e14.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@f03.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@f05.bc.rzg.mpg.de" dropped because it is full
    queue instance "all.q@e09.bc.rzg.mpg.de" dropped because it is full
    (project gendb) is not allowed to run in host "e07.bc.rzg.mpg.de" based on the excluded project list
    not all array task may be started due to max_aj_instances
Input/Output
Output
  stdout: <job name>.o<job ID>
  stderr: <job name>.e<job ID>
  path: can be specified by qsub -o <stdout path> -e <stderr path>
  paths are relative to
    - the current working directory at submission (with the qsub -cwd option)
    - the user's home directory (if the -cwd option is not specified)

  > ls
  example.e10404 example.o10404 example_1.sge
Input
  arguments: qsub [ options ] [ command | -- [ command_args ]]
  > qsub -p -10 example_1.sge arg1
Advanced topics
  - Job chains: sets of consecutive interdependent jobs
  - Job arrays: sets of similar and independent (parallel) jobs
  - DRMAA: API specification
Job chains: sets of consecutive jobs
Solution 1 (trivial)
  > cat allinone.sge
  #$ -S /bin/sh
  #$ -N allinone
  ./doformatdb
  ./doblastall
  ./dopostprocessing
  > qsub allinone.sge
  Your job 10411 ("allinone") has been submitted.
Solution 2 (modular, nested qsub)
  > cat formatdb.sge
  #$ -S /bin/sh
  #$ -N FormatDB
  ./doFormatDB
  qsub blastall.sge
  > cat blastall.sge
  #$ -S /bin/sh
  #$ -N BlastAll
  ./doBlastAll
  qsub postprocessing.sge
  ...
  > qsub formatdb.sge
  Your job 10421 ("formatdb") has been submitted.
Job chains: sets of consecutive jobs (2)
Solution 3 (optimized, uses -hold_jid <job_id|job_name>)
  > cat formatdb.sge
  #$ -S /bin/sh
  #$ -N FormatDB
  ./doFormatDB
  > cat blastall.sge
  #$ -S /bin/sh
  #$ -N BlastAll
  #$ -hold_jid FormatDB
  ./doBlastAll
  ...
  > qsub formatdb.sge
  Your job 10451 ("formatdb") has been submitted.
  > qsub blastall.sge
  Your job 10452 ("blastall") has been submitted.
  > qsub postprocessing.sge
  Your job 10453 ("postprocessing") has been submitted.
Advantage: all jobs of the chain are submitted at once, so the dependent jobs already accumulate waiting time in the queue while their predecessors run.
Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job).
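The chain of Solution 3 could also be submitted from a small driver script that passes -N and -hold_jid on the command line instead of in the script headers. This is a sketch under assumptions: submit_chain and the QSUB variable are hypothetical names introduced here, and the dependency is expressed by job name, so it is subject to the same-user restriction noted above.

```shell
#!/bin/sh
# QSUB may be overridden for a dry run (any command or shell function
# that prints its arguments will do); by default the real qsub is used.
QSUB=${QSUB:-qsub}

# Hypothetical helper: submit_chain NAME1:SCRIPT1 NAME2:SCRIPT2 ...
# Each script is submitted with -N NAME; every job after the first is
# held (-hold_jid) until the job named in the preceding step has finished.
submit_chain() {
    prev=""
    for step in "$@"; do
        name=${step%%:*}      # text before the first colon
        script=${step#*:}     # text after the first colon
        if [ -z "$prev" ]; then
            "$QSUB" -N "$name" "$script"
        else
            "$QSUB" -N "$name" -hold_jid "$prev" "$script"
        fi
        prev=$name
    done
}
```

Possible use: submit_chain FormatDB:formatdb.sge BlastAll:blastall.sge. Options given on the command line override those in the script header, so the headers of the individual scripts can stay as they are.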
Array jobs
Submit sets of similar and independent "tasks":
  qsub -t 1-500:1 example_3.sge
  - submits 500 instances of the same script
  - each instance ("task") is executed independently
  - all instances are subsumed under a single job ID
  - the variable $SGE_TASK_ID discriminates between instances
  - task numbering scheme: -t <first>-<last>:<stepsize>
  - related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE
Example:
  #$ -S /bin/sh
  #$ -cwd
  #$ -N blastarray
  #$ -t 1-500:1
  QUERY=query_${SGE_TASK_ID}.fa
  OUTPUT=blastout_${SGE_TASK_ID}.txt
  echo "processing query $QUERY..."
  blastall -p blastn -d nt -i $QUERY -o $OUTPUT
  echo "...done"
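To see which values $SGE_TASK_ID takes for a given -t <first>-<last>:<stepsize> specification, the numbering scheme can be reproduced locally. A minimal sketch; task_ids is a hypothetical helper and does not talk to SGE:

```shell
#!/bin/sh
# Hypothetical helper: task_ids FIRST LAST STEP
# Prints the task IDs generated by "-t FIRST-LAST:STEP", one per line:
# FIRST, FIRST+STEP, FIRST+2*STEP, ... as long as the value is <= LAST.
task_ids() {
    i=$1
    while [ "$i" -le "$2" ]; do
        echo "$i"
        i=$((i + $3))
    done
}
```

For example, task_ids 1 10 3 prints 1, 4, 7 and 10, and -t 1-1000:10 corresponds to 100 tasks with IDs 1, 11, ..., 991.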
Array jobs (2)
  > qsub example_3.sge
  Your job 10420.1-500:1 ("blastarray") has been submitted.
  > qstat
  job-id  prior    name        user  state  submit/start at      queue                    slots  ja-task-id
  -------------------------------------
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:56  all.q@e08.bc.rzg.mpg.de  1      198
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:56  all.q@e08.bc.rzg.mpg.de  1      199
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:07:11  all.q@e09.bc.rzg.mpg.de  1      202
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:07:11  all.q@e09.bc.rzg.mpg.de  1      203
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:41  all.q@e10.bc.rzg.mpg.de  1      196
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:41  all.q@e10.bc.rzg.mpg.de  1      197
  10420   0.55241  blastarray  mjr   r      02/13/2007 15:08:41  all.q@e11.bc.rzg.mpg.de  1      208
  10420   0.55241  blastarray  mjr   r      02/13/2007 15:08:41  all.q@e11.bc.rzg.mpg.de  1      209
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:08:11  all.q@e12.bc.rzg.mpg.de  1      204
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:08:11  all.q@e12.bc.rzg.mpg.de  1      206
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:02:11  all.q@e13.bc.rzg.mpg.de  1      176
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:02:11  all.q@e13.bc.rzg.mpg.de  1      177
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:26  all.q@e14.bc.rzg.mpg.de  1      182
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:26  all.q@e14.bc.rzg.mpg.de  1      183
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:07:11  all.q@f01.bc.rzg.mpg.de  1      200
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:07:11  all.q@f01.bc.rzg.mpg.de  1      201
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:11  all.q@f02.bc.rzg.mpg.de  1      193
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:11  all.q@f02.bc.rzg.mpg.de  1      194
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:04:41  all.q@f03.bc.rzg.mpg.de  1      190
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:04:41  all.q@f03.bc.rzg.mpg.de  1      191
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:41  all.q@f04.bc.rzg.mpg.de  1      184
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:41  all.q@f04.bc.rzg.mpg.de  1      185
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:08:11  all.q@f05.bc.rzg.mpg.de  1      205
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:08:11  all.q@f05.bc.rzg.mpg.de  1      207
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:11  all.q@f06.bc.rzg.mpg.de  1      192
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:05:26  all.q@f06.bc.rzg.mpg.de  1      195
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:04:26  all.q@f07.bc.rzg.mpg.de  1      188
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:04:26  all.q@f07.bc.rzg.mpg.de  1      189
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:56  all.q@f08.bc.rzg.mpg.de  1      186
  10420   0.56000  blastarray  mjr   r      02/13/2007 15:03:56  all.q@f08.bc.rzg.mpg.de  1      187
  10420   0.55242  blastarray  mjr   qw     02/13/2007 14:28:34                           1      210-500:1
Array jobs (3)
Benefits:
  - simple organization
  - simple interaction with job (single job ID)
  - optimized throughput (see, e.g., qconf -sconf for jobs-per-user limits, etc.)
  - powerful tool for (trivially) parallel applications
Notes:
  - one stdout/stderr file per task
      stdout: <job name>.o<job ID>.<task ID>
      stderr: <job name>.e<job ID>.<task ID>
  - task-specific $TMPDIR
  - $SGE_TASK_ID (and its relatives) are undefined for non-array jobs
  - allocate reasonable chunks of work to tasks
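With one stderr file per task, a large array job leaves many files to check. The sketch below lists the task IDs whose stderr file is non-empty, relying only on the naming scheme <job name>.e<job ID>.<task ID> given above. tasks_with_stderr is a hypothetical helper, and a non-empty stderr file does not necessarily mean the task failed; it only marks tasks worth a closer look.

```shell
#!/bin/sh
# Hypothetical helper: tasks_with_stderr JOBNAME JOBID [DIR]
# Looks in DIR (default: current directory) for files named
# JOBNAME.eJOBID.<task ID> and prints, in numerical order, the task IDs
# of those files that are not empty.
tasks_with_stderr() {
    name=$1
    jobid=$2
    dir=${3:-.}
    for f in "$dir/$name.e$jobid".*; do
        if [ -s "$f" ]; then
            echo "${f##*.}"   # the suffix after the last dot is the task ID
        fi
    done | sort -n
}
```

Possible use: tasks_with_stderr blastarray 10420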
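Excursus: load balancing
[Figure: the total work is split into chunks (chunk 1 ... chunk 5) that are distributed over processing elements (PE 1, PE 2, PE 3) along a time axis, with overhead and idle time marked; labels: number of PEs, number of chunks, t_tot, t_overhead]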
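The trade-off between idle time (too few chunks) and overhead (too many chunks) can be made concrete with a toy model. Everything in it is an assumption made for illustration and is not taken from the slide: all chunks are equally large, every chunk costs a fixed overhead, and the PEs work through the chunks in synchronized rounds, so that t_tot = ceil(chunks / PEs) * (work / chunks + t_overhead).

```shell
#!/bin/sh
# Toy model (assumption, see above): t_total WORK CHUNKS PES OVERHEAD
# WORK and OVERHEAD are times in seconds; integer arithmetic is used,
# so WORK / CHUNKS is truncated if it does not divide evenly.
t_total() {
    work=$1
    chunks=$2
    pes=$3
    overhead=$4
    rounds=$(( (chunks + pes - 1) / pes ))      # ceil(chunks / pes)
    per_chunk=$(( work / chunks + overhead ))   # run time of one chunk
    echo $(( rounds * per_chunk ))
}
```

For 600 s of work on 3 PEs with 10 s overhead per chunk, the model gives 210 s for 3 chunks, 320 s for 4 chunks (two PEs idle in the second round), 220 s for 6 chunks and 2200 s for 600 chunks (overhead dominates).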
DRMAA
Distributed Resource Management Application API: API specification for the submission and control of jobs to one or more DRM systems (see http://drmaa.org)
Purpose: integration with applications
Advantages:
  - Portability, vendor independence
  - Reliability: avoids error-prone parsing of output from qsub, qstat, ...
  - Efficiency: avoids expensive (and intricate: e.g. Perl) system calls
Implementations:
  - SGE
  - Bindings for Java, C/C++
  - Modules for Perl, Python, ...
DRMAA (2)
Java example (fragment)
  package de.mpg.rzg.drmaa.queue;

  import java.util.List;
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.ggf.drmaa.DrmaaException;
  import org.ggf.drmaa.JobTemplate;
  import org.ggf.drmaa.Session;
  import org.ggf.drmaa.SessionFactory;

  public class DrmaaQueueScheduler {
    ...
    public String submitjob() {
      String jobid = null;
      try {
        /* create DRMAA session */
        Session session = SessionFactory.getFactory().getSession();
        session.init(null);
        /* setup job template */
        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("blastall");
        jt.setArgs(new String[]{"-p","blastp","-d","nr"});
        jt.setJobName("blast");
        /* submit an array job; the returned IDs have the form <job ID>.<task ID> */
        List<String> taskids = session.runBulkJobs(jt, 1, numjobtasks, chunksize);
        jobid = taskids.isEmpty() ? null : taskids.get(0).split("[.]")[0];
      } catch (DrmaaException e) {
        logger.error("submitting DRMAA job failed: " + e.getMessage());
      }
      return jobid;
    }
  }
Tips & Tricks
Submit scripts
  - do not hard-wire SGE logic into your application
  - instead, use SGE scripts only as simple wrappers
  example:
    #$ -S /bin/sh
    #$ -t 1-1000:10
    perl ${HOME}/doMegablastChunk.pl $SGE_TASK_ID $SGE_TASK_STEPSIZE $TMPDIR
  facilitates:
    - (interactive) testing
    - code maintenance
    - portability across different DRMs
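With -t 1-1000:10 each task receives a start index ($SGE_TASK_ID) and a step size ($SGE_TASK_STEPSIZE), and the application has to derive its own range of work items from them. One plausible convention, sketched below, is that a task handles the items from its task ID up to task ID + step size - 1, capped at $SGE_TASK_LAST. How doMegablastChunk.pl actually interprets its arguments is not shown in this talk; chunk_last is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: chunk_last TASK_ID STEPSIZE LAST
# Prints the index of the last work item of the chunk that starts at
# TASK_ID, assuming chunks of STEPSIZE consecutive items and an overall
# upper bound LAST.
chunk_last() {
    end=$(( $1 + $2 - 1 ))
    if [ "$end" -gt "$3" ]; then
        end=$3
    fi
    echo "$end"
}
```

Because such a wrapper depends only on environment variables, it can be tested interactively without the batch system, for example: SGE_TASK_ID=11 SGE_TASK_STEPSIZE=10 SGE_TASK_LAST=1000 sh wrapper.sge (wrapper.sge being a placeholder name).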
Tips & Tricks (2)
Misc
  - do not rely on checkpointing: implement restart capability instead
  - do not rely on the (interactive) environment (e.g. $PATH)
  - choose an appropriate location for stdout, stderr
  - redirect (wanted) stdout to a separate file
  - use a reasonable partitioning of the total computational work:
      - avoid very short jobs/tasks (less than about 1 minute): scheduling overhead
      - avoid very long jobs/large arrays (more than several days, more than about 10000 tasks): manageability
RZG-specific issue
  - save-password (AFS/Kerberos) before submitting your first job or after a change of your RZG password
  - monitor e-mail for SGE error messages
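A simple form of restart capability is to skip every piece of work whose result file already exists, so that a resubmitted job continues where the previous run stopped. The sketch below is one possible approach, not the method used at RZG: run_once is a hypothetical helper, and it assumes that the wanted result is what the command writes to stdout.

```shell
#!/bin/sh
# Hypothetical helper: run_once OUTPUT COMMAND [ARGS...]
# Runs COMMAND only if OUTPUT does not exist yet. The command's stdout
# goes to a temporary file that is renamed to OUTPUT only on success,
# so an interrupted or failed run never leaves a partial OUTPUT behind.
run_once() {
    out=$1
    shift
    if [ -e "$out" ]; then
        echo "skipping: $out already exists"
        return 0
    fi
    if "$@" > "$out.part"; then
        mv "$out.part" "$out"
    else
        rm -f "$out.part"
        return 1
    fi
}
```

Possible use inside a job script: run_once result_1.txt ./process_query query_1.fa (both file names and process_query are placeholders).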
Tips & Tricks (3)
References and further reading
  - Wikipedia: http://en.wikipedia.org/wiki/Sun_Grid_Engine
  - SGE homepage: http://gridengine.sunsource.net/
  - SGE documentation: http://gridengine.sunsource.net/documentation.html
  - SGE man pages
  - SGE documentation on the RZG homepage (section "Computing"): http://www.rzg.mpg.de/
  - SGE configuration on the SUN Linux Cluster of the MPI-EVAn: http://www.rzg.mpg.de/docs/linux/evacluster.html