Efficient cluster computing




Efficient cluster computing
Introduction to the Sun Grid Engine (SGE) queuing system
Markus Rampp (RZG, MIGenAS)
MPI for Evolutionary Anthropology, Leipzig, Feb. 16, 2007

Outline
  Introduction
  Basic concepts: queues, jobs, scripts; essential SGE commands and options
  Advanced topics: job chains, array jobs, DRMAA API
  Tips & tricks, references
  Not covered: SGE configuration & administration, policies, accounting, grid computing, MPI, ...

Introduction
Sun Grid Engine (SGE): a popular batch-queuing system.
"Software like SGE is typically used on a computer farm or computer cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses." (taken from Wikipedia)
Popular batch systems (DRMs):
  Sun Grid Engine (open source)
  LoadLeveler (IBM)
  NQS (Cray, NEC)
  DQS (open source)
  ...

Introduction (2)
Why should one use a DRM? To increase efficiency.
Operator's perspective: transparent resource management
  clustering of compute resources
  load balancing, optimization of resource usage
  fair (policy-based) distribution of resources
  accounting
User's perspective: shared usage of system resources
  optimized throughput
  organize/simplify handling of (large) computational tasks
  enhanced stability (survive system crashes, maintenance, ...)
  well-defined resource allocation (benchmarking)
  facilitates (non-interactive) work

Basic concepts
Queues:
[Diagram: jobs are submitted to queues (Queue 1, Queue 2, ...), which map onto compute resources (Resource A, Resource B).]
Example queue listing:

  queuename                     qtype used/tot. load_avg arch       states
  all.q@e01.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86   d
  all.q@e02.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  all.q@e03.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  all.q@e04.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  all.q@e05.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  all.q@e06.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  all.q@e07.bc.rzg.mpg.de       BIP   0/2       0.00     lx26-x86
  normal@eva001.opt.rzg.mpg.de  B     0/0       0.09     lx26-amd64
  normal@eva002.opt.rzg.mpg.de  B     0/4       0.00     lx26-amd64
  normal@eva003.opt.rzg.mpg.de  B     0/4       0.00     lx26-amd64
  normal@eva004.opt.rzg.mpg.de  B     0/4       0.00     lx26-amd64
  normal@eva005.opt.rzg.mpg.de  B     0/4       0.00     lx26-amd64

Basic concepts (2)
Jobs & scripts:
1. prepare a script of executable commands
2. specify resources and meta information
3. submit to the batch system (returns a job ID)
4. use the job ID for job control (query status, cancel, ...)

Example script (example_1.sge):

  #$ -S /bin/sh
  #$ -cwd
  #$ -M mjr@rzg.mpg.de
  #$ -m e
  #$ -N example
  #begin executable commands (shell specified by #$ -S)
  # note: starting here, a leading # starts
  # a comment, whereas in the above
  # SGE header it does NOT
  echo "starting job..."
  blastall -p blastp -d nr -i query_1.fa -o blastout_1.txt
  blastall -p blastp -d nr -i query_2.fa -o blastout_2.txt
  echo "...done"

  > qsub example_1.sge
  Your job 10404 ("example") has been submitted.
  > qstat
  job-ID prior   name    user state submit/start at     queue                   slots ja-task-ID
  -------------------------------------
  10404  0.00000 example mjr  qw    02/12/2007 11:43:42                         1
  > qstat
  job-ID prior   name    user state submit/start at     queue                   slots ja-task-ID
  -------------------------------------
  10404  0.55500 example mjr  r     02/12/2007 12:19:26 all.q@e13.bc.rzg.mpg.de 1
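Because the SGE directives are ordinary shell comments (lines beginning with #$), the same script file can also be executed directly, e.g. for interactive testing, and the directives are simply ignored. A minimal sketch of this pattern (the blastall calls are replaced by echo placeholders so that it runs on any machine, without SGE):

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N example
# To the shell, the '#$' lines above are plain comments; only qsub parses them,
# so this script can also be run directly (e.g. 'sh example_1.sge') for testing.
START_MSG="starting job..."
END_MSG="...done"
echo "$START_MSG"
echo "(blastall would run here)"   # placeholder for the real commands
echo "$END_MSG"
```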

SGE commands & options
Interacting with the queuing system: SGE's q-commands (cf. the ll-commands of LoadLeveler):

  qsub          submit job                                                (llsubmit)
  qstat         query queue/job status                                    (llq, llstatus)
  qdel          delete job                                                (llcancel)
  qhold         hold ("suspend") job; note: user/operator/system holds    (llhold)
  qrls          release holds                                             (llhold -r)
  qalter, qmod  modify job                                                (llmodify)
  qhost         provide concise system overview
  qmon          graphical user interface (X)

SGE commands & options (2)
Specify qsub options in the script header and/or on the command line (the command line overrides the script).
Essential options for qsub:
  -S: path to shell
  -m b|e|a|s|n...: send mail at beginning/end/... of job
  -M: e-mail address for notification
  -N: name of job
  -j y: join stdout and stderr
Additional options for qsub:
  -q: queue
  -p: priority (default 0; users may only decrease)
  -P: name of project
  -a: earliest date/time at which a job is eligible for execution
  ...: cf. man qsub

SGE commands & options (3)
Commonly used options for qstat:
  qstat                         displays list of jobs only
  qstat -u <user>, -j <job ID>  displays list of jobs for the specified user/job
  qstat -f                      full format display
  qstat -r                      extended display (incl. resource requirements, scheduling info)

  > qstat -f
  queuename                qtype used/tot. load_avg arch     states
  all.q@e01.bc.rzg.mpg.de  BIP   0/2       0.04     lx26-x86 d
  all.q@e02.bc.rzg.mpg.de  BIP   0/2       0.00     lx26-x86
  ...
  all.q@e07.bc.rzg.mpg.de  BIP   0/2       0.00     lx26-x86
  all.q@e08.bc.rzg.mpg.de  BIP   2/2       3.67     lx26-x86
     10422 0.56000 megablast hfz r 02/13/2007 20:34:12 1 882
     10422 0.56000 megablast hfz r 02/13/2007 20:34:12 1 883
  all.q@e09.bc.rzg.mpg.de  BIP   2/2       4.85     lx26-x86
     10422 0.56000 megablast hfz r 02/13/2007 20:28:27 1 864
     10422 0.56000 megablast hfz r 02/13/2007 20:31:57 1 875
  ...
  ############################################################################
   - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
  ############################################################################
     10422 0.55242 megablast hfz qw 02/13/2007 16:34:29 1 886-5000:1

SGE commands & options (4)
  > qstat -r -j 10422
  ==============================================================
  job_number:      10422
  exec_file:       job_scripts/10422
  submission_time: Tue Feb 13 16:34:29 2007
  owner:           hfz
  uid:             1553
  group:           rzb
  gid:             4131
  sge_o_home:      /afs/ipp/home/h/hfz
  sge_o_log_name:  hfz
  sge_o_path:      /opt/sge6/bin/lx26-x86:/usr/local/bin:/opt/gnome/bin:/usr/games:/usr/bin/X11:/usr/bin:/bin
  sge_o_shell:     /bin/tcsh
  sge_o_workdir:   /bio/tmp/hfz/sargossa
  sge_o_host:      e01
  account:         sge
  cwd:             /bio/tmp/hfz/sargossa
  path_aliases:    /tmp_mnt/ * * /
  mail_list:       hfz@e01.bc.rzg.mpg.de
  notify:          FALSE
  job_name:        megablast
  jobshare:        0
  shell_list:      /bin/sh
  env_list:
  script_file:     /afs/ipp/home/h/hfz/mysql/sequenzen/e01/submit_megablast_test.sge
  project:         gendb
  job-array tasks: 1-5000:1
  usage  808:      cpu=00:08:46, mem=595.19275 GBs, io=0.00000, vmem=1.377g, maxvmem=1.566g
  ...
  usage  875:      cpu=00:00:24, mem=28.67529 GBs, io=0.00000, vmem=1.251g, maxvmem=1.251g
  scheduling info: queue instance "all.q@e01.bc.rzg.mpg.de" dropped because it is disabled
                   queue instance "all.q@f12.bc.rzg.mpg.de" dropped because it is
                   queue instance "all.q@e13.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@f08.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@f01.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@e14.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@f03.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@f05.bc.rzg.mpg.de" dropped because it is full
                   queue instance "all.q@e09.bc.rzg.mpg.de" dropped because it is full
                   (project gendb) is not allowed to run in host "e07.bc.rzg.mpg.de" based on the excluded project list
                   not all array tasks may be started due to max_aj_instances

Input/Output
Output:
  stdout: <job name>.o<job ID>
  stderr: <job name>.e<job ID>
  path: can be specified by qsub -o <stdout path> -e <stderr path>
  paths are relative to the current working directory at submission (with the qsub -cwd option)
  or to the user's home directory (if -cwd is not specified):
  > ls
  example.e10404 example.o10404 example_1.sge
Input arguments:
  qsub [options] [command -- [command_args]]
  > qsub -p -10 example_1.sge arg1
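The default file names follow directly from the job name and the job ID returned by qsub. A tiny sketch reconstructing them for the example job above (name "example", ID 10404):

```shell
#!/bin/sh
# Reconstruct SGE's default output file names:
#   stdout: <job name>.o<job ID>
#   stderr: <job name>.e<job ID>
JOB_NAME=example
JOB_ID=10404
STDOUT_FILE="${JOB_NAME}.o${JOB_ID}"
STDERR_FILE="${JOB_NAME}.e${JOB_ID}"
echo "$STDOUT_FILE $STDERR_FILE"   # matches the 'ls' listing above
```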

Advanced topics
  Job chains: sets of consecutive, interdependent jobs
  Job arrays: sets of similar and independent (parallel) jobs
  DRMAA: API specification

Job chains: sets of consecutive jobs
Solution 1 (trivial):
  > cat allinone.sge
  #$ -S /bin/sh
  #$ -N allinone
  ./doFormatDB
  ./doBlastAll
  ./doPostProcessing
  > qsub allinone.sge
  Your job 10411 ("allinone") has been submitted.
Solution 2 (modular, nested qsub):
  > cat formatdb.sge
  #$ -S /bin/sh
  #$ -N FormatDB
  ./doFormatDB
  qsub blastall.sge
  > cat blastall.sge
  #$ -S /bin/sh
  #$ -N BlastAll
  ./doBlastAll
  qsub postprocessing.sge
  ...
  > qsub formatdb.sge
  Your job 10421 ("formatdb") has been submitted.

Job chains: sets of consecutive jobs (2)
Solution 3 (optimized, uses -hold_jid <job id|job name>):
  > cat formatdb.sge
  #$ -S /bin/sh
  #$ -N FormatDB
  ./doFormatDB
  > cat blastall.sge
  #$ -S /bin/sh
  #$ -N BlastAll
  #$ -hold_jid FormatDB
  ./doBlastAll
  ...
  > qsub formatdb.sge
  Your job 10451 ("formatdb") has been submitted.
  > qsub blastall.sge
  Your job 10452 ("blastall") has been submitted.
  > qsub postprocessing.sge
  Your job 10453 ("postprocessing") has been submitted.
Advantage: accumulates waiting time.
Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job).
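When job names are not unique, or when a chain must depend on another user's jobs, the same pattern can be driven by job IDs: qsub's -terse option prints only the job ID, which can be captured and fed to -hold_jid. The sketch below is a dry run (it only echoes the qsub commands it would issue, via a stand-in function, so it runs without an SGE installation); the script names are the ones from the example above, and the job IDs are made up for illustration:

```shell
#!/bin/sh
# Dry-run sketch: build a job chain via job IDs instead of job names.
# With a real SGE installation, replace qsub_dryrun with 'qsub -terse',
# e.g.: jid1=$(qsub -terse formatdb.sge)
qsub_dryrun() {   # stand-in for 'qsub -terse': logs the call, prints a fake job ID
    echo "would run: qsub -terse $*" >&2
    echo "$FAKE_ID"
}
FAKE_ID=10451; jid1=$(qsub_dryrun formatdb.sge)
FAKE_ID=10452; jid2=$(qsub_dryrun -hold_jid "$jid1" blastall.sge)
FAKE_ID=10453; jid3=$(qsub_dryrun -hold_jid "$jid2" postprocessing.sge)
echo "chain: $jid1 -> $jid2 -> $jid3"
```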

Array jobs
Submit sets of similar and independent tasks:
  qsub -t 1-500:1 example_3.sge
  submits 500 instances of the same script
  each instance ("task") is executed independently
  all instances are subsumed under a single job ID
  the variable $SGE_TASK_ID discriminates between instances
  task numbering scheme: -t <first>-<last>:<stepsize>
  related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE
Example (example_3.sge):

  #$ -S /bin/sh
  #$ -cwd
  #$ -N blastarray
  #$ -t 1-500:1
  QUERY=query_${SGE_TASK_ID}.fa
  OUTPUT=blastout_${SGE_TASK_ID}.txt
  echo "processing query $QUERY..."
  blastall -p blastn -d nt -i $QUERY -o $OUTPUT
  echo "...done"

Array jobs (2)
  > qsub example_3.sge
  Your job 10420.1-500:1 ("blastarray") has been submitted.
  > qstat
  job-ID prior   name       user state submit/start at     queue                   slots ja-task-ID
  -------------------------------------
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:56 all.q@e08.bc.rzg.mpg.de 1 198
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:56 all.q@e08.bc.rzg.mpg.de 1 199
  10420 0.56000 blastarray mjr r 02/13/2007 15:07:11 all.q@e09.bc.rzg.mpg.de 1 202
  10420 0.56000 blastarray mjr r 02/13/2007 15:07:11 all.q@e09.bc.rzg.mpg.de 1 203
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:41 all.q@e10.bc.rzg.mpg.de 1 196
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:41 all.q@e10.bc.rzg.mpg.de 1 197
  10420 0.55241 blastarray mjr r 02/13/2007 15:08:41 all.q@e11.bc.rzg.mpg.de 1 208
  10420 0.55241 blastarray mjr r 02/13/2007 15:08:41 all.q@e11.bc.rzg.mpg.de 1 209
  10420 0.56000 blastarray mjr r 02/13/2007 15:08:11 all.q@e12.bc.rzg.mpg.de 1 204
  10420 0.56000 blastarray mjr r 02/13/2007 15:08:11 all.q@e12.bc.rzg.mpg.de 1 206
  10420 0.56000 blastarray mjr r 02/13/2007 15:02:11 all.q@e13.bc.rzg.mpg.de 1 176
  10420 0.56000 blastarray mjr r 02/13/2007 15:02:11 all.q@e13.bc.rzg.mpg.de 1 177
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:26 all.q@e14.bc.rzg.mpg.de 1 182
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:26 all.q@e14.bc.rzg.mpg.de 1 183
  10420 0.56000 blastarray mjr r 02/13/2007 15:07:11 all.q@f01.bc.rzg.mpg.de 1 200
  10420 0.56000 blastarray mjr r 02/13/2007 15:07:11 all.q@f01.bc.rzg.mpg.de 1 201
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:11 all.q@f02.bc.rzg.mpg.de 1 193
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:11 all.q@f02.bc.rzg.mpg.de 1 194
  10420 0.56000 blastarray mjr r 02/13/2007 15:04:41 all.q@f03.bc.rzg.mpg.de 1 190
  10420 0.56000 blastarray mjr r 02/13/2007 15:04:41 all.q@f03.bc.rzg.mpg.de 1 191
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:41 all.q@f04.bc.rzg.mpg.de 1 184
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:41 all.q@f04.bc.rzg.mpg.de 1 185
  10420 0.56000 blastarray mjr r 02/13/2007 15:08:11 all.q@f05.bc.rzg.mpg.de 1 205
  10420 0.56000 blastarray mjr r 02/13/2007 15:08:11 all.q@f05.bc.rzg.mpg.de 1 207
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:11 all.q@f06.bc.rzg.mpg.de 1 192
  10420 0.56000 blastarray mjr r 02/13/2007 15:05:26 all.q@f06.bc.rzg.mpg.de 1 195
  10420 0.56000 blastarray mjr r 02/13/2007 15:04:26 all.q@f07.bc.rzg.mpg.de 1 188
  10420 0.56000 blastarray mjr r 02/13/2007 15:04:26 all.q@f07.bc.rzg.mpg.de 1 189
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:56 all.q@f08.bc.rzg.mpg.de 1 186
  10420 0.56000 blastarray mjr r 02/13/2007 15:03:56 all.q@f08.bc.rzg.mpg.de 1 187
  10420 0.55242 blastarray mjr qw 02/13/2007 14:28:34                         1 210-500:1

Array jobs (3)
Benefits:
  simple organization
  simple interaction with the job (single job ID)
  optimized throughput (see, e.g., qconf -sconf for jobs-per-user limits, etc.)
  a powerful tool for (trivially) parallel applications
Notes:
  one stdout/stderr file per task:
    stdout: <job name>.o<job ID>.<task ID>
    stderr: <job name>.e<job ID>.<task ID>
  task-specific $TMPDIR
  $SGE_TASK_ID (and its relatives) are undefined for non-array jobs
  allocate reasonable chunks of work to tasks
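The "reasonable chunks" advice can be implemented inside the task script itself: with a step size greater than 1, SGE starts one task per chunk and each task handles a whole range of input indices. The sketch below simulates the variables SGE would set for an array submitted with -t 1-1000:10 (here for the task starting at index 11), so it runs without SGE; the query_<i>.fa file names follow the example above:

```shell
#!/bin/sh
# Simulated SGE environment, as it would look inside one task of a
# '-t 1-1000:10' array job (SGE_TASK_ID takes the values 1, 11, 21, ...):
SGE_TASK_ID=11          # first index of this task's chunk
SGE_TASK_STEPSIZE=10    # chunk size
SGE_TASK_LAST=1000      # last index of the whole array
# Each task processes indices [SGE_TASK_ID, SGE_TASK_ID + STEPSIZE - 1],
# clipped to SGE_TASK_LAST for the final chunk.
FIRST=$SGE_TASK_ID
LAST=$((SGE_TASK_ID + SGE_TASK_STEPSIZE - 1))
[ "$LAST" -gt "$SGE_TASK_LAST" ] && LAST=$SGE_TASK_LAST
i=$FIRST
while [ "$i" -le "$LAST" ]; do
    echo "would process query_${i}.fa"   # placeholder for the real per-index command
    i=$((i + 1))
done
```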

Excursus: load balancing
[Figure: the total work is split into chunks (chunk 1 ... chunk 5) and distributed over processing elements (PE 1 ... PE 3); the diagram contrasts per-chunk overhead with idle time, plotting t_tot and t_overhead as functions of the number of PEs and the number of chunks.]
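The trade-off in the figure can be made quantitative with a toy model: splitting total work W into N equal chunks on P PEs takes roughly ceil(N/P) rounds, each costing the chunk work plus a fixed scheduling overhead. Few chunks mean little overhead but possibly poor balance; many chunks mean good balance but accumulated overhead. A sketch of the arithmetic (all numbers are hypothetical):

```shell
#!/bin/sh
# Toy load-balancing model: total work W split into N equal chunks on P PEs.
# Makespan ~ ceil(N/P) * (W/N + overhead), using integer arithmetic.
W=6000    # total work in seconds (hypothetical)
P=3       # number of PEs (hypothetical)
OVH=5     # scheduling overhead per chunk in seconds (hypothetical)
for N in 3 5 30; do
    ROUNDS=$(((N + P - 1) / P))        # ceil(N/P): rounds until all chunks are done
    CHUNK=$((W / N))                   # work per chunk
    MAKESPAN=$((ROUNDS * (CHUNK + OVH)))
    echo "N=$N chunks: makespan ~ ${MAKESPAN}s"
done
# N=3 fits the PEs exactly; N=5 leaves PEs idle in the second round;
# N=30 balances well but pays 10 rounds of overhead.
```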

DRMAA
Distributed Resource Management Application API: an API specification for the submission and control of jobs to one or more DRM systems (see http://drmaa.org).
Purpose: integration with applications.
Advantages:
  Portability, vendor independence
  Reliability: avoids error-prone parsing of output from qsub, qstat, ...
  Efficiency: avoids expensive (and intricate, e.g. in Perl) system calls
Implementations:
  SGE
  Bindings for Java, C/C++
  Modules for Perl, Python, ...

DRMAA (2)
Java example (fragment):

  package de.mpg.rzg.drmaa.queue;

  import java.util.List;
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.ggf.drmaa.DrmaaException;
  import org.ggf.drmaa.JobTemplate;
  import org.ggf.drmaa.Session;
  import org.ggf.drmaa.SessionFactory;

  public class DrmaaQueueScheduler {
      ...
      public String submitJob() {
          String jobId = null;
          try {
              /* create DRMAA session */
              Session session = SessionFactory.getFactory().getSession();
              session.init(null);
              /* set up job template */
              JobTemplate jt = session.createJobTemplate();
              jt.setRemoteCommand("blastall");
              jt.setArgs(new String[]{"-p", "blastp", "-d", "nr"});
              jt.setJobName("blast");
              List<String> taskIds = session.runBulkJobs(jt, 1, numJobTasks, chunkSize);
              jobId = taskIds.isEmpty() ? null : taskIds.get(0).split("[.]")[0];
          } catch (DrmaaException e) {
              logger.error("submitting DRMAA job failed: " + e.getMessage());
          }
          return jobId;
      }
  }

Tips & Tricks
Submit scripts:
  do not wire SGE logic into your application
  instead, use SGE scripts only as simple wrappers, for example:

  #$ -S /bin/sh
  #$ -t 1-1000:10
  perl ${HOME}/doMegablastChunk.pl $SGE_TASK_ID $SGE_TASK_STEPSIZE $TMPDIR

This facilitates:
  (interactive) testing
  code maintenance
  portability across different DRMs

Tips & Tricks (2)
Misc:
  do not rely on checkpointing: implement restart capability instead
  do not rely on the (interactive) environment (e.g. $PATH)
  choose an appropriate location for stdout, stderr
  redirect (wanted) stdout to a separate file
  use a reasonable partitioning of the total computational work:
    avoid very short jobs/tasks (under about 1 minute): scheduling overhead
    avoid very long jobs/large arrays (several days, 10000 tasks): manageability
RZG-specific issues:
  save-password (AFS/Kerberos) before submitting your first job or after a change of your RZG password
  monitor e-mail for SGE error messages

Tips & Tricks (3)
References and further reading:
  Wikipedia: http://en.wikipedia.org/wiki/Sun_Grid_Engine
  SGE homepage: http://gridengine.sunsource.net/
  SGE documentation: http://gridengine.sunsource.net/documentation.html
  SGE man pages
  SGE documentation on the RZG homepage (section "Computing"): http://www.rzg.mpg.de/
  SGE configuration on the SUN Linux cluster of the MPI-EVAn: http://www.rzg.mpg.de/docs/linux/evacluster.html