Using Parallel Computing to Run Multiple Jobs




Beowulf Training: Using Parallel Computing to Run Multiple Jobs
Jeff Linderoth
August 5, 2003

Outline

- Introduction to Scheduling Software
- The Wonderful World of PBS
- The Equally Wonderful World of Condor
- Lab Time

Why do we need scheduling software?

Resource Scheduling

So people don't fight over the resources! Schedulers:

- Locate appropriate resources,
- Manage resources, so multiple processes don't conflict over the same processor,
- Ensure a fairness policy,
- Are integrated with accounting software.

The schedulers on the Beowulf cluster are PBS and Condor.

Mmmmmmmmmmmmmm. Pie

Our first computational task will be to estimate π by numerical integration. Everyone knows

    \int_0^1 \frac{1}{1+x^2}\, dx = \arctan(x)\Big|_{x=0}^{1} = \arctan(1) = \frac{\pi}{4}.

The Rectangle Rule

[Plot of 4/(1+x*x) on the interval [0, 1].]

A Program to Estimate π

I've written a π-calculator for you.

    cd
    mkdir compute-pi
    cd compute-pi
    cp /tmp/training/session2/pi1.c .
    gcc pi1.c -lm -o pi1
    ./pi1 1000

This is not a parallel program, just a simple (one-process) program. Nevertheless, we must submit it through a scheduling system to run it on the Beowulf cluster.
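The source of pi1.c isn't reproduced in this transcript, but the rectangle rule it applies is easy to sketch. The following is my own stand-in, not the actual pi1.c: it sums n midpoint rectangles under 4/(1+x^2) on [0, 1].

```shell
# Sketch of the rectangle rule behind pi1 (not the actual pi1.c):
# approximate the integral of 4/(1+x^2) over [0,1] with n midpoint rectangles.
pi_est=$(awk 'BEGIN {
    n = 100000
    sum = 0
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n       # midpoint of the i-th subinterval
        sum += 4 / (1 + x * x)  # height of the rectangle at that midpoint
    }
    printf "%.10f", sum / n     # width 1/n times the summed heights
}')
echo "pi is about $pi_est"
```

Larger n gives a better estimate, which is exactly the command-line argument pi1 takes.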

Running with PBS

A simple four-step process:

1. Create a PBS submission script.
2. Submit the script to the PBS system using the command qsub.
3. PBS runs the script on the first available resources.
4. PBS collects output for the user's inspection.

The PBS Submission Script: Overview

(1) You make a request for resources. (2) PBS allocates a node pool to fulfill your request. (3) Now you have to tell the node pool what to do!

Both steps (1) and (3) are accomplished through the PBS submission script. The script contains:

- PBS request statements,
- Shell commands that will run your job on the allocated resources.

The shell commands are executed on the first node in your allocated nodes.

Our First PBS Submission Script

    #PBS -q small
    #PBS -l nodes=1:public
    #PBS -l cput=00:05:00
    #PBS -V
    echo "The PBS job ID is: ${PBS_JOBID}"
    echo "The PBS Node File is"
    cat $PBS_NODEFILE
    $HOME/compute-pi/pi1 100

Format of the PBS Submission Script

Lines that begin with #PBS are PBS directives. Everything else is a shell command. Shell commands are just things that you would type at the regular login prompt, but you can also do fancy looping and conditionals: http://www.gnu.org/manual/bash/html_chapter/bashref_toc.html

After the PBS directives, you put any commands you would like. The command to run your program is usually a good one to include. :-) Again, this is executed on the first node.
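To illustrate that "fancy looping and conditionals", here is a small bash sketch of logic a submission script could contain after its #PBS directives. The size threshold and labels are hypothetical, loosely echoing the small/large queue split described below.

```shell
#!/bin/bash
# Hypothetical post-directive logic: decide how to treat each problem size.
for n in 100 1000 10000 100000; do
    if [ "$n" -le 1000 ]; then
        queue=small    # short runs fit a short-CPU-time queue
    else
        queue=large
    fi
    echo "a run with n=$n would belong in the $queue queue"
done
```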

Breaking It Down: PBS Directives

-q  Specifies the queue in which to place the job. We have two queues, small and large:
    small: max CPU time 20 minutes/process.
    large: lower priority than jobs in the small queue.

-l  Defines the resources that are required by the job and establishes a limit on the amount of resource that can be consumed.

-V  Declares that all environment variables in the qsub command's environment are to be exported to the batch job. If you would like the PBS job to inherit the same environment as the one you are currently running in (same PATH variable, etc.), you should include this directive.

The -l Story

For resources, you will typically only need to declare:

- the number of nodes and which class of nodes you request: #PBS -l nodes=4:public
- the maximum CPU time: #PBS -l cput=00:15:00

For the truly brave and curious, the command is: man pbs_resources

PBS: The Big Three

    qsub   Submit a PBS job
    qstat  Check the status of a PBS job
    qdel   Delete a PBS job

man <command> will give you lots more information.

Let's Do It!

    [jtl3@fire1 compute-pi-1]$ qsub run.pbs
    5972.fire1
    [jtl3@fire1 compute-pi-1]$ qstat -a
    fire1:
                                                       Req'd  Req'd   Elap
    Job ID         Username Queue  Jobname  SessID NDS TSK Memory Time  S Time
    -------------- -------- ------ -------- ------ --- --- ------ ----- - -----
    5972.fire1     jtl3     small  run.pbs   27018   1  --     -- 00:20 E    --

Note that the job ID is printed for you when you submit the job.
qstat -a shows the status of all jobs.

Looking at the Output

By default, standard output goes to <scriptname>.o<job number> and standard error goes to <scriptname>.e<job number>.

    [jtl3@fire1 compute-pi-1]$ cat run.pbs.o5972
    The PBS job ID is: 5972.fire1
    The PBS Node File is
    fire34
    pi is about 3.1614997369512658487167300
    Error is 1.9907083361472733e-02

Note how the PBS environment variables are interpreted in the script.

Other Cool PBS Stuff You May Want To Do

    #PBS -N <Name>     : Name your job
    #PBS -o <File.out> : Redirect standard output to File.out
    #PBS -e <File.err> : Redirect standard error to File.err
    #PBS -m, -M        : Mail options

Job dependencies are also available. For a list of all PBS command-file options: man qsub

Any PBS questions?

Condor

For purposes of this discussion, think of Condor as a different scheduler. Condor is a bit more fancy:

- Often used for nondedicated resources (will run only when no one else would use the machine),
- Checkpointing/migration,
- Remote I/O.

Likely, the accounting charge will be less for jobs submitted to the Condor scheduler.

http://www.cs.wisc.edu/condor
http://www.lehigh.edu/~inlts/comp/linux/condor/

Checkpointing/Migration

[Timeline diagram: a job runs on a professor's machine from 5am until the professor arrives at 8am, spends about 5 minutes checkpointing to a checkpoint server, resumes on a grad student's machine at 8:10am after the grad student leaves, and checkpoints again in about 5 minutes when the grad student arrives at 12pm.]

Condor Universes

Condor jobs are submitted to a specific Condor universe:

- Standard: has cool features like checkpointing and migration of jobs; requires special linking of your program.
- Vanilla: no cool Condor features (regular).
- MPI/PVM: not mentioned here today, but they exist.

Compiling for Condor

Standard universe: put the command condor_compile in front of your normal link line.

    [jtl3@fire1 condor]$ condor_compile gcc pi1.c -o pi1-standard -lm

Vanilla universe: do nothing.

Now Condor submission is like PBS submission, with a different command (job description) file and different submission/monitoring commands.

A Sample Condor Submission File

    universe = standard
    executable = pi1-standard
    arguments = 1000000000
    output = pi1.out
    error = pi1.err
    notification = Complete
    notify_user = jtl3@lehigh.edu
    getenv = True
    rank = kflops
    queue

See man condor_submit for the details.

The Big Four

    condor_submit <job.condor>  Submit a job to the Condor scheduler
    condor_q                    Check the status of the queue of Condor jobs
    condor_status               Check the status of the Condor pool
    condor_rm <jobid>           Delete a Condor job

Let's Do It!

    [jtl3@fire1 condor]$ condor_submit run.condor
    Submitting job(s).
    1 job(s) submitted to cluster 16.
    [jtl3@fire1 condor]$ condor_q
    -- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
     ID    OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     16.0  jtl3   8/4  11:22   0+00:00:16 R  0   3.4  pi1-standard 1000000000
    [jtl3@fire1 condor]$ cat pi1.out
    pi is about 3.1415926555921398488635532
    Error is 2.0023467328655897e-09

I could do condor_rm 16.0 to delete the job. Any Condor questions?

Quit Wasting My Time!

OK, Linderoth, I thought today was supposed to be about parallel computing! That will be the focus of the next section(s). For now, let's do some simple parallel computing: suppose I'd like to run the same executable pi1, but with many different input files or parameters. Use the multiple processors to get your work done faster.

Running Many Jobs

We need a way to easily submit many different jobs. We will use the shell's scripting capabilities:

- PBS: use a template command file and the sed utility.
- Condor: use the -a flag to condor_submit.

PBS: Run Multiple Jobs, Step #1

Create a template submission file:

    #!/bin/bash
    #PBS -q small
    #PBS -l nodes=1:public
    #PBS -l walltime=00:05:00
    #PBS -V
    echo "The PBS job ID is: ${PBS_JOBID}"
    echo "The PBS Node File is"
    cat $PBS_NODEFILE
    /home/jtl3/class/pbs/pi1 XXX_N_XXX

PBS: Run Multiple Jobs, Step #2

Create a shell script to do the multiple submissions:

    #!/bin/bash
    for n in 100 1000 10000 100000 1000000
    do
        sed s/XXX_N_XXX/$n/g run.pbs.template > run.pbs.tmp
        qsub run.pbs.tmp
        rm run.pbs.tmp
    done

The sed command replaces all occurrences of the pattern XXX_N_XXX in run.pbs.template with the value of $n.
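To see the substitution step in isolation, here is the same sed call applied to just the template's last line (path as in the slides):

```shell
# Show the sed substitution on its own: the placeholder becomes the argument.
template_line='/home/jtl3/class/pbs/pi1 XXX_N_XXX'
n=1000
expanded=$(printf '%s\n' "$template_line" | sed "s/XXX_N_XXX/$n/g")
echo "$expanded"
```

Running the substitution over the whole template file, as the loop above does, produces one ready-to-qsub script per value of n.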

PBS: Run Multiple Jobs

    [jtl3@fire1 pbs]$ sh run-many.sh
    5989.fire1
    5990.fire1
    5991.fire1
    5992.fire1
    5993.fire1

sh runs the script you created. Any questions about PBS multiple-job submission?

Condor: Run Multiple Jobs

condor_submit allows the user to override statements in the submission file: use the -a flag. This makes our scripting life easier; we don't need to use sed.

Condor: Run Multiple Jobs, Step #1

Create the Condor submission file. Note: no arguments or output lines!

    executable = pi1-standard
    universe = standard
    notification = Complete
    notify_user = jtl3@lehigh.edu
    getenv = True
    rank = kflops
    queue

The Condor Multiple Job Submission Script

Create the Condor multiple-job submission script. Note the use of the -a option!

    #!/bin/bash
    for n in 100 1000 10000 100000 1000000
    do
        condor_submit -a "arguments = $n" -a "output = pi.$n.out" \
            run.condor.many
    done

Multiple Condor Submission Example

    [jtl3@fire1 condor]$ sh run-many.sh
    Submitting job(s).
    1 job(s) submitted to cluster 32.
    Submitting job(s).
    1 job(s) submitted to cluster 33.
    Submitting job(s).
    1 job(s) submitted to cluster 34.
    Submitting job(s).
    1 job(s) submitted to cluster 35.
    Submitting job(s).
    1 job(s) submitted to cluster 36.
    [jtl3@fire1 condor]$ condor_q
    -- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
     ID    OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     33.0  jtl3   8/4  12:16   0+00:00:01 R  0   3.4  pi1-standard 1000
     34.0  jtl3   8/4  12:16   0+00:00:00 R  0   3.4  pi1-standard 10000
     35.0  jtl3   8/4  12:16   0+00:00:00 I  0   3.4  pi1-standard 10000
     36.0  jtl3   8/4  12:16   0+00:00:00 I  0   3.4  pi1-standard 10000
    4 jobs; 2 idle, 2 running, 0 held

The End!

- Schedulers are required for use in a parallel computing environment.
- PBS and Condor are cool.
- You can do parallel computing even without MPI.
- The Beowulf cluster can be a CPU-cycle server for your research!