Efficient cluster computing


1 Efficient cluster computing
Introduction to the Sun Grid Engine (SGE) queuing system
Markus Rampp (RZG, MIGenAS)
MPI for Evolutionary Anthropology, Leipzig, Feb. 16, 2007

2 Outline
- Introduction
- Basic concepts: queues, jobs, scripts
- Essential SGE commands and options
- Advanced topics
  - Job chains
  - Array jobs
  - DRMAA API
- Tips & Tricks, References
Not covered: SGE configuration & administration, policies, accounting, grid computing, MPI, ...

3 Introduction
Sun Grid Engine (SGE): a popular batch-queuing system

"Software like SGE is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses." (taken from Wikipedia)

Popular batch systems (DRMs):
- Sun Grid Engine (open source)
- LoadLeveler (IBM)
- NQS (Cray, NEC)
- DQS (open source)
- ...

4 Introduction (2)
Why should one use a DRM? To increase efficiency.

Operator's perspective:
- transparent resource management
- clustering of compute resources
- load balancing, optimization of resource usage
- fair (policy-based) distribution of resources
- accounting

User's perspective:
- shared usage of system resources
- optimized throughput
- organized and simplified handling of ("large") computational tasks
- enhanced stability (jobs survive system crashes, maintenance, ...)
- well-defined resource allocation (benchmarking)
- facilitates (non-interactive) work

5 Basic concepts
Queues:

[Figure: queues (Queue 1, Queue 2, Queue 3) attached to resources (Resource A, Resource B)]

    queuename qtype used/tot. load_avg arch states
    all.q@e01.bc.rzg.mpg.de BIP 0/ lx26-x86 d
    all.q@e02.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e03.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e04.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e05.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e06.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e07.bc.rzg.mpg.de BIP 0/ lx26-x86
    normal@eva001.opt.rzg.mpg.de B 0/ lx26-amd64
    normal@eva002.opt.rzg.mpg.de B 0/ lx26-amd64
    normal@eva003.opt.rzg.mpg.de B 0/ lx26-amd64
    normal@eva004.opt.rzg.mpg.de B 0/ lx26-amd64
    normal@eva005.opt.rzg.mpg.de B 0/ lx26-amd64

6 Basic concepts (2)
Jobs & scripts
1. prepare a script of executable commands
2. specify resources and meta information
3. submit to the batch system (returns a job ID)
4. use the job ID for job control (query status, cancel, ...)

    #$ -S /bin/sh
    #$ -cwd
    #$ -M mjr@rzg.mpg.de
    #$ -m e
    #$ -N example
    # begin executable commands (shell specified by #$ -S)
    # note: starting here, a leading # starts
    # a comment, whereas in the above
    # SGE header it does NOT
    echo "starting job..."
    blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt
    blastall -p blastp -d nr -i query_2.fa -o blastout_2.txt
    echo "...done"

    > qsub example_1.sge
    Your job ("example") has been submitted.
    > qstat
    job-id prior name user state submit/start at queue slots ja-task-id
    example mjr qw 02/12/ :43:42 1
    > qstat
    job-id prior name user state submit/start at queue slots ja-task-id
    example mjr r 02/12/ :19:26 all.q@e13.bc.rzg.mpg.de 1
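Step 4 needs the job ID that qsub reports in step 3. The following is a minimal sketch of one way to pick it out of the confirmation line in a shell script. The function name extract_job_id and the sample IDs are made up for illustration, and the two message formats are assumptions based on the output shown in these slides; other Grid Engine versions may word the message differently.

```shell
#!/bin/sh
# Sketch: extract the numeric job ID from the confirmation line printed by qsub.
# Assumed message formats (sample IDs are illustrative only):
#   Your job 10404 ("example") has been submitted.
#   Your job-array 10422.1-500:1 ("blastarray") has been submitted.
extract_job_id() {
    # Print the first run of digits after "Your job" or "Your job-array";
    # print nothing if the line does not look like a confirmation.
    printf '%s\n' "$1" |
        sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*/\2/p'
}

# Typical use on a submit host (requires qsub, hence commented out here):
#   JOB_ID=$(extract_job_id "$(qsub example_1.sge)")
#   qstat -j "$JOB_ID"    # query status
#   qdel "$JOB_ID"        # cancel
```

Parsing command output like this is fragile; the DRMAA slides later in this talk describe an API that avoids it.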

7 SGE commands & options
Interacting with the queuing system: SGE's q-commands (cf. the ll-commands of LoadLeveler)

    SGE command    purpose                                              LoadLeveler
    qsub           submit job                                           llsubmit
    qstat          query queue/job status                               llq, llstatus
    qdel           delete job                                           llcancel
    qhold          hold ("suspend") job                                 llhold
                   (note: user/operator/system holds)
    qrls           release holds                                        llhold -r
    qalter, qmod   modify job                                           llmodify
    qhost          provide concise system overview
    qmon           graphical user interface (X)

8 SGE commands & options (2)
Specify qsub options in the script header and/or on the command line (the command line overrides the script).

Essential options for qsub:
- -S: path to shell
- -m b|e|a|s|n...: send mail at beginning|end|... of job
- -M: address for notification
- -N: name of job
- -j y: join stdout and stderr

Additional options for qsub:
- -q: queue
- -p: priority (default 0; users may only decrease)
- -P: name of project
- -a: earliest date/time at which a job is eligible for execution
- ...: cf. man qsub

9 SGE commands & options (3)
Commonly used options for qstat:
- qstat: displays list of jobs only
- qstat -u <user> | -j <job ID>: displays list of jobs for specified user/job
- qstat -f: full format display
- qstat -r: extended display (incl. resource requirements, scheduling info)

    > qstat -f
    queuename qtype used/tot. load_avg arch states
    BIP 0/ lx26-x86 d
    all.q@e02.bc.rzg.mpg.de BIP 0/ lx26-x86
    ...
    all.q@e07.bc.rzg.mpg.de BIP 0/ lx26-x86
    all.q@e08.bc.rzg.mpg.de BIP 2/ lx26-x
        megablast hfz r 02/13/ :34:
        megablast hfz r 02/13/ :34:
    all.q@e09.bc.rzg.mpg.de BIP 2/ lx26-x
        megablast hfz r 02/13/ :28:
        megablast hfz r 02/13/ :31:
    ############################################################################
    - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
    ############################################################################
        megablast hfz qw 02/13/ :34: :1

10 SGE commands & options (4)

    > qstat -r -j
    ==============================================================
    job_number:
    exec_file: job_scripts/10422
    submission_time: Tue Feb 13 16:34:
    owner: hfz
    uid: 1553
    group: rzb
    gid: 4131
    sge_o_home: /afs/ipp/home/h/hfz
    sge_o_log_name: hfz
    sge_o_path: /opt/sge6/bin/lx26-x86:/usr/local/bin:/opt/gnome/bin:/usr/games:/usr/bin/x11:/usr/bin:/bin
    sge_o_shell: /bin/tcsh
    sge_o_workdir: /bio/tmp/hfz/sargossa
    sge_o_host: e01
    account: sge
    cwd: /bio/tmp/hfz/sargossa
    path_aliases: /tmp_mnt/ * * /
    mail_list: hfz@e01.bc.rzg.mpg.de
    notify: FALSE
    job_name: megablast
    jobshare: 0
    shell_list: /bin/sh
    env_list:
    script_file: /afs/ipp/home/h/hfz/mysql/sequenzen/e01/submit_megablast_test.sge
    project: gendb
    job-array tasks: :1
    usage 808: cpu=00:08:46, mem= GBs, io= , vmem=1.377g, maxvmem=1.566g
    ..
    usage 875: cpu=00:00:24, mem= GBs, io= , vmem=1.251g, maxvmem=1.251g
    scheduling info:
        queue instance "all.q@e01.bc.rzg.mpg.de" dropped because it is disabled
        queue instance "all.q@f12.bc.rzg.mpg.de" dropped because it is
        queue instance "all.q@e13.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@f08.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@f01.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@e14.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@f03.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@f05.bc.rzg.mpg.de" dropped because it is full
        queue instance "all.q@e09.bc.rzg.mpg.de" dropped because it is full
        (project gendb) is not allowed to run in host "e07.bc.rzg.mpg.de" based on the excluded project list
        not all array task may be started due to max_aj_instances

11 Input/Output
Output
- stdout: <job name>.o<job ID>
- stderr: <job name>.e<job ID>
- path: can be specified by qsub -o <stdout path> -e <stderr path>
- paths are relative to
  - the current working directory at submission (with the qsub -cwd option)
  - the user's home directory (if the -cwd option is not specified)

    > ls
    example.e10404 example.o10404 example_1.sge

Input
- arguments: qsub [ options ] [ command -- [ command_args ]]

    > qsub -p -10 example_1.sge arg1
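The naming scheme above can be written down as a small helper, for instance to locate the log of a finished job from a script. This is only a sketch: the function name sge_output_name is hypothetical, and the optional task ID suffix follows the array-job convention given on the "Array jobs (3)" slide.

```shell
#!/bin/sh
# Sketch: build the default stdout/stderr file name of a job.
# Usage: sge_output_name <o|e> <job name> <job ID> [task ID]
sge_output_name() {
    kind=$1
    name=$2
    job_id=$3
    task_id=${4:-}
    if [ -n "$task_id" ]; then
        # array job: <job name>.<o|e><job ID>.<task ID>
        printf '%s.%s%s.%s\n' "$name" "$kind" "$job_id" "$task_id"
    else
        # plain job: <job name>.<o|e><job ID>
        printf '%s.%s%s\n' "$name" "$kind" "$job_id"
    fi
}

# Example: show the stderr file of job 10404 named "example"
#   cat "$(sge_output_name e example 10404)"
```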

12 Advanced topics
- Job chains: sets of consecutive interdependent jobs
- Job arrays: sets of similar and independent (parallel) jobs
- DRMAA: API specification

13 Job chains: sets of consecutive jobs
Solution 1 (trivial)

    > cat allinone.sge
    #$ -S /bin/sh
    #$ -N allinone
    ./doformatdb
    ./doblastall
    ./dopostprocessing
    > qsub allinone.sge
    Your job ("allinone") has been submitted.

Solution 2 (modular, nested qsub)

    > cat formatdb.sge
    #$ -S /bin/sh
    #$ -N FormatDB
    ./doFormatDB
    qsub blastall.sge
    > cat blastall.sge
    #$ -S /bin/sh
    #$ -N BlastAll
    ./doBlastAll
    qsub postprocessing.sge
    ...
    > qsub formatdb.sge
    Your job ("formatdb") has been submitted.

14 Job chains: sets of consecutive jobs (2)
Solution 3 (optimized, uses -hold_jid <job_id | job_name>)

    > cat formatdb.sge
    #$ -S /bin/sh
    #$ -N FormatDB
    ./doFormatDB
    > cat blastall.sge
    #$ -S /bin/sh
    #$ -N BlastAll
    #$ -hold_jid FormatDB
    ./doBlastAll
    ...
    > qsub formatdb.sge
    Your job ("formatdb") has been submitted.
    > qsub blastall.sge
    Your job ("blastall") has been submitted.
    > qsub postprocessing.sge
    Your job ("postprocessing") has been submitted.

Advantage: all jobs are queued at once, so dependent jobs already accumulate waiting time while their predecessors are running.
Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job).
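The same chain can also be built from the command line instead of the script headers, since command-line options override the header. The sketch below assumes that approach; submit_chain is a hypothetical helper, the job names are derived from the script file names, and nothing here is taken from the actual RZG setup.

```shell
#!/bin/sh
# Sketch: submit a list of job scripts as a chain. Every job after the first
# is held (-hold_jid) until the job submitted just before it has finished.
# Job names are set explicitly with -N so that -hold_jid can refer to them.
submit_chain() {
    prev=""
    for script in "$@"; do
        name=$(basename "$script" .sge)
        if [ -z "$prev" ]; then
            qsub -N "$name" "$script" || return 1
        else
            qsub -N "$name" -hold_jid "$prev" "$script" || return 1
        fi
        prev=$name
    done
}

# Example (requires qsub on a submit host):
#   submit_chain formatdb.sge blastall.sge postprocessing.sge
```

Referring to the predecessor by name keeps the sketch free of output parsing, at the price that job names must be unique among the user's own jobs.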

15 Array jobs
Submit sets of similar and independent "tasks": qsub -t 1-500:1 example_3.sge
- submits 500 instances of the same script
- each instance ("task") is executed independently
- all instances are subsumed under a single job ID
- the variable $SGE_TASK_ID discriminates between instances
- task numbering scheme: -t <first>-<last>:<stepsize>
- related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE

Example:

    #$ -S /bin/sh
    #$ -cwd
    #$ -N blastarray
    #$ -t 1-500:1
    QUERY=query_${SGE_TASK_ID}.fa
    OUTPUT=blastout_${SGE_TASK_ID}.txt
    echo "processing query $QUERY..."
    blastall -p blastn -d nt -i $QUERY -o $OUTPUT
    echo "...done"
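To see which values $SGE_TASK_ID takes for a given -t specification, the numbering scheme can be reproduced with a short loop. This is an illustrative sketch only; task_ids is a hypothetical helper and not an SGE command.

```shell
#!/bin/sh
# Sketch: list the task IDs generated by "qsub -t <first>-<last>:<stepsize>".
# Usage: task_ids <first> <last> [stepsize]   (stepsize defaults to 1)
task_ids() {
    first=$1
    last=$2
    step=${3:-1}
    id=$first
    # IDs start at <first> and advance by <stepsize> while staying <= <last>
    while [ "$id" -le "$last" ]; do
        printf '%s\n' "$id"
        id=$((id + step))
    done
}

# Example: task_ids 1 10 3 prints 1, 4, 7 and 10, one per line
```

Running it before submitting is a cheap check that every expected input file (query_1.fa, query_2.fa, ... in the example above) has a matching task.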

16 Array jobs (2)

    > qsub example_3.sge
    Your job :1 ("blastarray") has been submitted.
    > qstat
    job-id prior name user state submit/start at queue slots ja-task-id
    blastarray mjr r 02/13/ :05:56 all.q@e08.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:56 all.q@e08.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :07:11 all.q@e09.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :07:11 all.q@e09.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:41 all.q@e10.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:41 all.q@e10.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:41 all.q@e11.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:41 all.q@e11.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:11 all.q@e12.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:11 all.q@e12.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :02:11 all.q@e13.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :02:11 all.q@e13.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:26 all.q@e14.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:26 all.q@e14.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :07:11 all.q@f01.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :07:11 all.q@f01.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:11 all.q@f02.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:11 all.q@f02.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :04:41 all.q@f03.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :04:41 all.q@f03.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:41 all.q@f04.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:41 all.q@f04.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:11 all.q@f05.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :08:11 all.q@f05.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:11 all.q@f06.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :05:26 all.q@f06.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :04:26 all.q@f07.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :04:26 all.q@f07.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:56 all.q@f08.bc.rzg.mpg.de
    blastarray mjr r 02/13/ :03:56 all.q@f08.bc.rzg.mpg.de
    blastarray mjr qw 02/13/ :28: :1

17 Array jobs (3)
Benefits:
- simple organization
- simple interaction with the job (single job ID)
- optimized throughput (see, e.g., qconf -sconf for jobs-per-user limits, etc.)
- powerful tool for (trivially) parallel applications

Notes:
- one stdout/stderr file per task
  - stdout: <job name>.o<job ID>.<task ID>
  - stderr: <job name>.e<job ID>.<task ID>
- task-specific $TMPDIR
- $SGE_TASK_ID (and its relatives) are undefined for non-array jobs
- allocate reasonable chunks of work to tasks

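18 Excursus: load balancing
[Figure: the total work is split into chunks (chunk 1 to chunk 5) that are distributed over processing elements (PE 1 to PE 3) along a time axis, with overhead and idle time marked; plot of t_tot and t_overhead versus the number of chunks for a given number of PEs]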
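The trade-off sketched in the figure can be explored with a toy model. The model is an assumption made here for illustration and is not taken from the slides: the total work W is split into n equal chunks, every chunk costs a fixed extra overhead o, and p PEs process the chunks in rounds, so that t_tot = ceil(n / p) * (W / n + o). The function name total_time and all numbers are made up.

```shell
#!/bin/sh
# Toy model (illustrative assumption): equal chunks, fixed per-chunk overhead.
# Usage: total_time <total work> <number of chunks> <number of PEs> <overhead>
# All quantities are integers in the same time unit; W / n is truncated by
# integer division, so choose values where the work divides evenly.
total_time() {
    work=$1
    chunks=$2
    pes=$3
    overhead=$4
    rounds=$(( (chunks + pes - 1) / pes ))      # ceil(chunks / pes)
    per_chunk=$(( work / chunks + overhead ))   # run time of one chunk
    echo $(( rounds * per_chunk ))
}

# 3600 time units of work on 3 PEs with an overhead of 10 per chunk:
#   total_time 3600 3 3 10     # 1210: one round, no idle PEs
#   total_time 3600 4 3 10     # 1820: second round leaves two PEs idle
#   total_time 3600 300 3 10   # 2200: overhead dominates
```

In this simplified picture both too few chunks (idle PEs in the last round) and too many chunks (accumulated overhead) increase the total time. Real chunks usually differ in run time, which the model ignores.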

19 DRMAA
Distributed Resource Management Application API: an API specification for the submission and control of jobs to one or more DRM systems.

Purpose: integration with applications

Advantages:
- Portability, vendor independence
- Reliability: avoids error-prone parsing of output from qsub, qstat, ...
- Efficiency: avoids expensive (and intricate: e.g. Perl) system calls

Implementations:
- SGE
- Bindings for Java, C/C++
- Modules for Perl, Python
- ...

20 DRMAA (2)
Java example (fragment)

    package de.mpg.rzg.drmaa.queue;

    import java.util.List;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.ggf.drmaa.DrmaaException;
    import org.ggf.drmaa.JobTemplate;
    import org.ggf.drmaa.Session;
    import org.ggf.drmaa.SessionFactory;

    public class DrmaaQueueScheduler {
        ...
        public String submitJob() {
            String jobId = null;
            try {
                /* create DRMAA session */
                Session session = SessionFactory.getFactory().getSession();
                session.init(null);
                /* set up job template */
                JobTemplate jt = session.createJobTemplate();
                jt.setRemoteCommand("blastall");
                jt.setArgs(new String[]{"-p", "blastp", "-d", "nr"});
                jt.setJobName("blast");
                /* submit an array job; the job ID is the part before the "." of a task ID */
                List<String> taskIds = session.runBulkJobs(jt, 1, numJobTasks, chunkSize);
                jobId = taskIds.isEmpty() ? null : taskIds.get(0).split("[.]")[0];
            } catch (DrmaaException e) {
                logger.error("submitting DRMAA job failed: " + e.getMessage());
            }
            return jobId;
        }
    }

21 Tips & Tricks
Submit scripts
- do not wire SGE logic into your application
- instead, use SGE scripts only as simple wrappers

Example:

    #$ -S /bin/sh
    #$ -t :10
    perl ${HOME}/doMegablastChunk.pl $SGE_TASK_ID $SGE_TASK_STEPSIZE $TMPDIR

This facilitates:
- (interactive) testing
- code maintenance
- portability across different DRMs
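A thin wrapper stays testable outside the batch system if it tolerates missing task variables. The helpers below are a sketch under assumptions: task_value and chunk_range are hypothetical names, and treating an empty value or the literal string "undefined" as "not an array job" is an assumption about what a non-array job may see.

```shell
#!/bin/sh
# Sketch of helper functions for an array-job wrapper script.

# task_value <value> <default>
# Print <value>, or <default> when the value is empty or the literal
# string "undefined" (assumed to indicate a non-array or interactive run).
task_value() {
    case ${1:-undefined} in
        undefined) printf '%s\n' "$2" ;;
        *)         printf '%s\n' "$1" ;;
    esac
}

# chunk_range <task ID> <step size> <last task ID>
# Print the first and last work item handled by the task with this ID.
chunk_range() {
    start=$1
    end=$(( $1 + $2 - 1 ))
    if [ "$end" -gt "$3" ]; then
        end=$3      # the last chunk may be shorter than the step size
    fi
    printf '%s %s\n' "$start" "$end"
}

# Possible use inside a wrapper (works under SGE and on the command line):
#   FIRST=$(task_value "${SGE_TASK_ID:-}" 1)
#   STEP=$(task_value "${SGE_TASK_STEPSIZE:-}" 1)
#   LAST=$(task_value "${SGE_TASK_LAST:-}" "$FIRST")
#   chunk_range "$FIRST" "$STEP" "$LAST"
```

With defaults in place, the wrapper can be run by hand on a single chunk before the full array is submitted.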

22 Tips & Tricks (2)
Misc
- do not rely on checkpointing: implement restart capability instead
- do not rely on the (interactive) environment (e.g. $PATH)
- choose an appropriate location for stdout, stderr
- redirect (wanted) stdout to a separate file
- use a reasonable partitioning of the total computational work:
  - avoid very short jobs/tasks (< 1 minute): scheduling overhead
  - avoid very long jobs/large arrays (several days, ... tasks): manageability

RZG-specific issues
- run save-password (AFS/Kerberos) before submitting your first job or after a change of your RZG password
- monitor for SGE error messages
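One simple way to get restart capability in a job script is to make every step skip itself when its result already exists, and to publish a result only once the step has succeeded. The following sketch assumes that pattern; run_step and the file names are hypothetical, and the step's wanted stdout is what gets stored in the result file.

```shell
#!/bin/sh
# Sketch: restartable step. The command's stdout goes to a temporary file
# that is renamed to the final name only if the command succeeds.
# Usage: run_step <output file> <command> [args...]
run_step() {
    out=$1
    shift
    if [ -s "$out" ]; then
        # result from an earlier run is present: nothing to do
        echo "skipping: $out already exists" >&2
        return 0
    fi
    tmp="$out.part.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$out"
    else
        status=$?
        rm -f "$tmp"    # do not leave partial results behind
        return "$status"
    fi
}

# Possible use in a job script (commands are placeholders):
#   PATH=/usr/local/bin:/usr/bin:/bin; export PATH   # do not rely on login $PATH
#   run_step step1.out ./doStep1 || exit 1
#   run_step step2.out ./doStep2 step1.out || exit 1
```

Resubmitting the same script after a crash then repeats only the steps whose output is missing.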

23 Tips & Tricks (3)
References and further reading
- Wikipedia: Grid Engine
- SGE homepage
- SGE documentation
- SGE man pages
- SGE documentation on the RZG homepage (section "Computing")
- SGE configuration on the SUN Linux Cluster of the MPI-EVAn