Minerva Training: Introduction to Load Sharing Facility (LSF)


1 Minerva Training: Introduction to Load Sharing Facility (LSF)
A Distributed Resource Management System
26 Mar 2014

2 Table of Contents
1. Introduction
2. LSF versus PBS
3. LSF command overview
4. bsub
5. Other LSF commands
6. Checkpointing

3 Introduction

4 What is a Distributed Resource Management System?
Controls usage of hard resources:
- CPU cycles
- Memory
- Disk space
- Network bandwidth
The goal of a DRMS is to achieve the best utilization of resources and maximize system throughput.
It can be decomposed into subsystems:
- Job management
- Physical resource management
- Scheduling and queuing

5 Major Functional Blocks of a Job Scheduler
(Diagram of the major functional blocks of a job scheduler, indicating the block this talk focuses on.)

6 Distributed Resource Management System
Other names for DRMS:
- Job Management Systems
- Resource Management Systems
- Schedulers
- Queuing Systems
- Batch Systems
Some popular systems:
- Load Sharing Facility (LSF)
- Portable Batch System (PBS)
- Sun Grid Engine (SGE)
- IBM LoadLeveler
- Condor

7 Why LSF instead of PBS
- LSF can handle more than twice as many job submissions per minute as PBS.
- The LSF system recovers faster from a daemon failure, which minimizes (or eliminates) lost jobs.
- The system is responsive to user commands at all times.
- Order-of-magnitude increase in the speed of job dispatching.
- Significantly better job array handling.
- Allows for a fault-tolerant configuration to ensure availability.
- Bonus: Checkpointing works as advertised.

8 Load Sharing Facility
How to use it

9 Quick LSF vs PBS
User Command           PBS                   LSF
Job Submission         qsub [script file]    bsub [script file]  or  bsub < [script file]
Job Deletion           qdel [job id]         bkill [job id]
Job Status (by job)    qstat [job id]        bjobs [job id]
Job Status (by user)   qstat -u [username]   bjobs -u [username]
Job Hold               qhold [job id]        bstop [job id]
Job Release            qrls [job id]         bresume [job id]
Queue List             qstat -Q              bqueues
Node List              pbsnodes -l           bhosts
Cluster Status         qstat -a              bqueues

10 Common LSF Commands
lsid - a good LSF command to start with
lshosts/bhosts - show all of the nodes that the LSF system is aware of
bsub - submits a job interactively or in batch using the LSF batch scheduling and queue layer of the LSF suite
bjobs - displays information about a recently run job. You can use the -l option to view a more detailed accounting
bqueues - displays information about the batch queues. Again, the -l option will display a more thorough description
bkill <job ID#> - kills the job with job ID number #
bhist -l <job ID#> - displays historical information about jobs. The -a flag displays information about both finished and unfinished jobs
bpeek -f <job ID#> - displays the stdout and stderr output of an unfinished job with job ID #
bhpart - displays information about host partitions
bstop - suspends an unfinished job
bacct -l <job ID#> - accounting statistics for a finished job
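A minimal example session with a few of these commands (a sketch; the job ID and queue name are illustrative, not real Minerva output):
$ lsid                  # show the cluster name and LSF version
$ bsub < my_batch_job   # submit a job script; LSF prints the assigned job ID
Job <123456> is submitted to queue <alloc>.
$ bjobs -l 123456       # detailed status of that job
$ bpeek -f 123456       # follow the job's stdout/stderr while it runs
$ bkill 123456          # kill it if something goes wrong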

11 How to Submit Jobs via LSF on Minerva - bsub
bsub can be invoked in one of two ways:
bsub [options] my_batch_job
  This will submit the script my_batch_job using the options on the command line. It will NOT interpret the #BSUB cookies in the script.
If the job script contains #BSUB cookies:
bsub [options] < my_batch_job
  This will interpret the #BSUB cookies in the script. Options on the command line override what is in the script.
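For example (a sketch; my_batch_job is a placeholder script containing #BSUB cookies):
# Command-line form: any #BSUB lines inside the script are ignored
bsub -q alloc -n 1 -W 01:00 -o run.out my_batch_job
# Redirection form: the #BSUB lines are read, and -q expressalloc overrides any #BSUB -q line
bsub -q expressalloc < my_batch_job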

12 Some bsub options
Option         Use
-q qname       Specify queue.
-n min[,max]   Specify number of cores. This is the total number of cores; they can be allocated anywhere. By default the system will try to fill a node first, cf. the -R, -a and -app options.
-I             Run job interactively.
-W walltime    Wall time in HH:MM. NO SECONDS!
-o path        Append output to the specified file. By default output is mailed; this option specifies that output should be concatenated to the specified file. Can use %J in the path to specify the job ID and %I to specify the job array index.
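As an illustration of -I (a sketch; the queue and limits are arbitrary choices, not site recommendations):
# Run a command interactively on a compute node; its output comes back to your terminal
bsub -I -q expressalloc -n 1 -W 00:30 hostname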

13 Some bsub options
Option               Use
-oo path             Overwrite the output file if it exists.
-e path              Append stderr to the specified file. Will be mailed by default; if not specified, stderr gets merged with stdout.
-eo path             Overwrite the error file if it exists.
-J job-description   "Jobname[index_start-end:increment]", enclosed in quotes. The optional index specification signifies this is a job array. The job index starts at 1. LSB_JOBINDEX is the index of the job.
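A small job array script illustrating -J, %J/%I and LSB_JOBINDEX (a sketch; the queue, limits and echo command are placeholders):
#!/bin/bash
#BSUB -q alloc
#BSUB -n 1
#BSUB -W 00:10
#BSUB -J "myarray[1-10]"   # a 10-element job array; indices start at 1
#BSUB -o arr.out.%J.%I     # one output file per element: %J = job ID, %I = array index
echo "processing chunk $LSB_JOBINDEX"
Submitted with bsub < (script name), each element runs with its own value of LSB_JOBINDEX.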

14 Some bsub options
Option            Use
-x                Specifies exclusive use of the node.
-a esub-script    Specify an external submission script to use. These can be used to change your execution environment at job start. The most common one is probably openmp.
-app app-script   Specify an application profile: preset bsub parameters, e.g. MPI switch configuration, checkpointing.

15 bsub Options: -q [queue_name]
Queue          Description                                                       Default Wall Time   Maximum Wall Time
alloc          Jobs that will be charged against an allocation                   5h                  144h (6d)
expressalloc   High throughput for jobs that will be charged to an allocation    1h                  2h
gpualloc       GPU nodes for users with GPU allocations                          5h                  144h (6d)
scavenger      For jobs that are not to be charged against an allocation         5h                  24h
gpuscavenger   For GPU jobs that are not to be charged against an allocation     5h                  24h

16 Example job: testit.lsf
#!/bin/bash
#BSUB -q alloc
#BSUB -n 1
#BSUB -o t.out
echo Salve Munde!

17 bsub
Script is NOT executable:
bsub test.lsf
Job <764675> is submitted to default queue <scavenger>.
(Output is lost until we fix mail.)
bsub -o t.out test.lsf
Job <764676> is submitted to default queue <scavenger>.
t.out: /tmp/ : line 8: test.lsf: command not found
bsub -o t.out ./test.lsf
/tmp/ : line 8: ./t.lsf: Permission denied

18 bsub
Script is NOT executable:
bsub < t.lsf
Job <764687> is submitted to queue <alloc>.
t.out -> Salve Munde!
Script is executable:
bsub -o t.out ./t.lsf
Job <764689> is submitted to default queue <scavenger>.
t.out -> Salve Munde!
bsub < ./t.lsf
Job <764690> is submitted to queue <alloc>.
t.out -> Salve Munde!

19 bsub
With LSF, you can even bsub a shell command:
bsub -o ls.out ls
tail ls.out
The output (if any) follows:
45.tar out acc_7.txt Aligned.out.sam a.otf arjun.rd_isa audit

20 Specifying a Resource
-R rusage[mem=mem_per_slot_in_mb]
Specify how much memory per slot/core your program will require. The default is 2500 MB.
bsub -n 6 -R rusage[mem=4000]
This will allocate 6*4000 MB = 24000 MB to the job.
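The same request written as #BSUB cookies in a job script (a sketch; the program name is a placeholder):
#!/bin/bash
#BSUB -q alloc
#BSUB -n 6
#BSUB -R rusage[mem=4000]   # 4000 MB per slot/core, so 6*4000 MB = 24000 MB for the job
#BSUB -o mem.out.%J
./my_big_memory_prog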

21 Specifying a Resource
The -R option is used to specify resources. span defines the shape of the set of cores you ask for:
-n 12 -R span[ptile=12]   - all 12 cores must be on 1 node
-n 24 -R span[ptile=12]   - allocate 12 cores per node = 2 nodes
-n 24 -R span[hosts=1]    - allocate all 24 cores to one host
bsub -n 12 -R span[hosts=1] < my_parallel_job
OMP_NUM_THREADS must be set in the script:
export OMP_NUM_THREADS=12
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC   (Dangerous)
Better:
bsub -n 12 -R span[ptile=12] -a openmp < my_parallel_job
LSF sets OMP_NUM_THREADS for you as the number of procs per node.
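A minimal OpenMP job script along these lines (a sketch; ./my_openmp_prog is a placeholder, and the openmp esub is assumed to export OMP_NUM_THREADS as described above):
#!/bin/bash
#BSUB -q alloc
#BSUB -n 12
#BSUB -R span[ptile=12]   # keep all 12 cores on one node
#BSUB -a openmp           # esub sets OMP_NUM_THREADS to the cores per node
#BSUB -W 01:00
#BSUB -o omp.out.%J
./my_openmp_prog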

22 Specifying a Resource
For MPI jobs, you want nodes allocated on one switch:
-R cu[type=switch:maxcus=1:pref=maxavail]
24 nodes per switch is the maximum = 24*12 cores per switch maximum.
bsub -n 20 -R cu[type=switch:maxcus=1:pref=maxavail] < my_mpi_job
But the cores may not be on the same node, so:
bsub -n 20 -R cu[type=switch:maxcus=1:pref=maxavail] -R span[ptile=12]
or
bsub -n 20 -R "cu[type=switch:maxcus=1:pref=maxavail] span[ptile=12]"
or
bsub -n 20 -app 1switch < my_mpi_job
There is also a 2switch app profile.

23 A Bravura Submission - Mixing It All Together
Suppose you want to run a combined MPI-OpenMP job: one MPI process per node, OpenMP within each MPI rank:
bsub -n 160 -R span[ptile=8] -app 1switch -a openmp < my_awesome_job
1switch will insert resource requests for 1 switch and a tile of 12/node.
The command-line span will override the app span, so we will get 8 per node.
The openmp esub script will start only 1 process per node and set OMP_NUM_THREADS on each node to 8.
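The same submission expressed as a job script (a sketch; my_hybrid_prog and the mpirun launch line are placeholders, since the talk does not show how MPI programs are launched on Minerva):
#!/bin/bash
#BSUB -q alloc
#BSUB -n 160              # 160 cores in total
#BSUB -R span[ptile=8]    # 8 cores per node, i.e. 20 nodes
#BSUB -app 1switch        # keep the nodes on one switch
#BSUB -a openmp           # one process per node, OMP_NUM_THREADS=8
#BSUB -W 12:00
#BSUB -o hybrid.out.%J
mpirun ./my_hybrid_prog   # placeholder launch line; use the MPI launcher documented for your site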

24 bhosts
chang]$ bhosts
HOST_NAME      STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
master_hosts   closed
node21-1       ok
node21-10      ok
node21-11      ok
node21-12      ok
node21-13      ok
node21-14      closed
node21-15      closed

25 bjobs
Check your jobs: bjobs
JOBID  USER      JOB_NAME       STAT  QUEUE     FROM_HOST  EXEC_HOST     SUBMIT_TIME   TIME_LEFT
       fludee01  *txt ttt.out   RUN   gpualloc  login1     24*node25-23  Mar 25 12:08  95:28 L
Check all jobs: bjobs -u all
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:
       zhangj21  *minimac.log   PEND  alloc     login1     -             Mar 25 11:41  -

26 bpeek
Check the output while the job is running. The -f option tails the output:
[fludee01@login1 ~]$ bpeek
<< output from stdout >>
test size ****
Dynamic Bayesian Expert System based on Qualitative Hypotheses
***************************************************************
The current working directory is: /sc/orga/scratch/fludee01/chang
C-MYC OCT-4,SOX1,LEF1,FOXO1,SOX9,GATA2,ZFP64 0,0,1,1,1,1, FOXA2 1

27 bkill
Kill jobs in the queue, whether running or not. Lots of ways to get away with murder:
All elements of a job array share the same job name and job ID, so:
Kill by job ID:           bkill <job ID>
Kill by job name:         bkill -J myjob_1
Kill a bunch of jobs:     bkill -J "myjob_*"
Kill an entire job array: bkill <job ID>   or   bkill -J my_array
Kill one job in an array: bkill "<job ID>[42]"   or   bkill -J "my_array[3]"
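For instance, with a job array submitted as below (a sketch; the script name, array size and job ID are illustrative), the kill commands line up like this:
bsub -J "my_array[1-20]" -o arr.out.%J.%I < array_job.lsf
Job <765432> is submitted to default queue <scavenger>.
bkill "765432[3]"        # kill only element 3
bkill -J "my_array[3]"   # the same element, by name
bkill 765432             # kill the whole array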

28 Checkpoint
-k "checkpoint_dir [init=initial_checkpoint_period] [checkpoint_period] [method=method_name]"
Must use method=blcr; the default method does not work.
checkpoint_dir - directory in which checkpoints are to be stored
init - how long (in minutes) to wait until a checkpoint can be taken; default 1 minute
checkpoint_period - take a checkpoint every xxx minutes
method - how to do the checkpoint; must be blcr
-app chkpnt is equivalent to: directory = ./ckpnt, init = 1, method = blcr

29 Checkpoint
Sample job script:
#!/bin/bash
#BSUB -q scavenger
#BSUB -app chkpnt
#BSUB -n 1
#BSUB -W 03
#BSUB -o lsf.out
cr_run ./basic

30 Checkpoint
The program must be dynamically linked.
Serial programs: OK
OpenMP programs: OK
MPI: not OK
Execute your program using cr_run:
cr_run my_long_program
You can checkpoint on demand with bchkpnt.
LSF will checkpoint automatically when the checkpoint period expires.
Restart with brestart.
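A possible on-demand checkpoint and restart sequence (a sketch; the job ID is illustrative and the brestart line mirrors the example on the next slide):
bchkpnt 764700                         # checkpoint job 764700 on demand
bkill 764700                           # optionally stop the job once the checkpoint is written
brestart -q alloc -W 144:00 ./chkpnt   # restart from the checkpoint directory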

31 Checkpoint
After a checkpoint, the chkpnt directory looks like:
ls chkpnt
ls restart chklog context context.4070 echkpnt.out erestart.out out shell echkpnt.err erestart.err chkpnt.log context.
To restart:
brestart -q alloc -W 144:00 ./chkpnt

32 Final Friendly Reminders
- Never run jobs on login nodes. They are for file management, coding, compilation, etc., only.
- Never run jobs outside LSF: fair sharing.
- The scratch disk is not backed up; make efficient use of this limited resource. Old files will automatically be deleted without notification.
- Logging onto compute nodes is no longer allowed.
