Using the SLURM Job Scheduler
1 Using the SLURM Job Scheduler [web] portal.biohpc.swmed.edu
2 Overview Today we're going to cover: the basics of SLURM's architecture and daemons; how to use a basic set of commands; how to write an sbatch script for job submission. This is only an introduction, but it should give you a good start.
3 What is SLURM Simple Linux Utility for Resource Management - Started as a simple resource manager for Linux clusters - About 500,000 lines of C code today. The glue for a parallel computer to execute parallel jobs - Makes a parallel computer almost as easy to use as a PC - Programs typically use MPI to manage communications within the parallel program
4 Cluster Architecture (diagram)
5 BioHPC (current system and partitions) Storage systems (Lysosome & Endosome): /home2, /project, /work. Compute nodes are grouped into partitions: 64GB*, 128GB, 256GB, GPU, 384GB (Nucleus050) and super. Log in via SSH to nucleus.biohpc.swmed.edu (Nucleus005, the login node) and submit jobs via SLURM.
6 BioHPC (future partitions)
64GB* : 8 nodes -> 0 nodes
128GB : 16 nodes -> 32 nodes
256GB : 8 nodes -> 32 nodes
384GB : 1 node -> 2 nodes
super : 41 nodes -> 66 nodes
GPU : 2 nodes -> 8 nodes
total : 41 nodes -> 74 nodes
7 Role of SLURM (resource management & job scheduling) Fair-share resource allocation. Allocates resources within a cluster - Supports hierarchical communications with a configurable format - On each node, sockets : cores per socket : threads per core = 2 : 8 : 2, so the total number of parallel tasks available within a single node = 2*8*2 = 32. When there is more work than resources, SLURM manages queues of work - request resources based on your needs - resource limits can be set per queue, per user, etc. The node layout can be queried directly, as shown below.
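A quick way to check the socket / core / thread layout that SLURM sees is a custom sinfo format (a minimal sketch; the node name in the second command is only an example):
>sinfo -N -o "%N %X %Y %Z %c"              # node name, sockets, cores/socket, threads/core, total CPUs
>scontrol show node Nucleus010             # full detail for one node, including CPUTot, Sockets, CoresPerSocket, ThreadsPerCore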
8 SLURM commands - Before job submission: sinfo, squeue, sview, smap - Submit a job: sbatch, srun, salloc, sattach - While the job is running: squeue, sview, scontrol - After the job has completed: sacct. For interactive work, salloc and srun can be combined, as sketched below.
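A minimal interactive session (a sketch; the partition and time limit are only examples):
>salloc --partition=super --nodes=1 --time=0-01:00:00    # request an allocation and start a shell
>srun hostname                                           # run a command on the allocated node
>exit                                                    # release the allocation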
9 Commands: General Information Man pages are available for all commands. The --help option prints brief descriptions of all options; the --usage option prints a list of the options. Almost all options have two formats - a single-letter option (e.g. -p super) - a verbose option (e.g. --partition=super)
10 sinfo DEMO Reports the status of nodes or partitions (partition-oriented format by default) >sinfo --Node (report status in node-oriented form) >sinfo -p 256GB (report status of nodes in partition 256GB)
11 Life cycle of a job after submission
Submit a job -> Pending -> Configuring (node booting) -> Running (may be Resizing or Suspended) -> Completing -> one of:
Cancelled (scancel), Completed (zero exit code), Failed (non-zero exit code), Timeout (time limit reached)
* BioHPC policy: 16 running nodes per user
12 Demo: submit a job and use squeue and scontrol to check it >squeue >scontrol show job {jobid}
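A few squeue variations that are often handy while monitoring (a sketch; the filters and format string are examples, not part of the demo):
>squeue -u {username}                                   # only your jobs
>squeue -t PD,R                                         # only pending and running jobs
>squeue -o "%.10i %.9P %.20j %.2t %.10M %.6D %R"        # job id, partition, name, state, elapsed time, nodes, reason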
13 Demo: cancel a job with scancel, and use sacct to check previous jobs >scancel {jobid} >sacct -u {username} >sacct -j {jobid}
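sacct can also report selected accounting fields; a minimal sketch (the field list and the start date are only examples):
>sacct -j {jobid} --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS
>sacct -u {username} -S 2015-01-01                      # jobs since a given start date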
14 List of SLURM commands
sbatch : Submit a script for later execution (batch mode)
salloc : Create a job allocation and start a shell to use it (interactive mode)
srun : Create a job allocation (if needed) and launch a job step (typically an MPI job)
sattach : Connect stdin/stdout/stderr to an existing job or job step
squeue : Report job and job step status
smap : Report system, job or step status with topology; less functionality than sview
sview : Report/update system, job, step, partition or reservation status (GTK-based GUI)
scontrol : Administrator tool to view/update system, job, step, partition or reservation status
sacct : Report accounting information by individual job and job step
15 List of valid job states
PENDING (PD) : Job is awaiting resource allocation
RUNNING (R) : Job currently has an allocation
SUSPENDED (S) : Job has an allocation, but execution has been suspended
COMPLETING (CG) : Job is in the process of completing
COMPLETED (CD) : Job has terminated all processes on all nodes
CONFIGURING (CF) : Job has been allocated resources, waiting for them to become ready for use
CANCELLED (CA) : Job was explicitly cancelled by the user or system administrator
FAILED (F) : Job terminated with a non-zero exit code or other failure condition
TIMEOUT (TO) : Job terminated upon reaching its time limit
NODE_FAIL (NF) : Job terminated because one or more of its allocated nodes failed
16 Demo 01 : submit a single job to a single node
#!/bin/bash
#SBATCH --job-name=singlematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=single.%j.out
#SBATCH --error=single.%j.time
module add matlab/2013a       # load software (export path & library)
time(matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot(1), exit")   # command(s) you want to execute
Notes: the #SBATCH lines set up the SLURM environment; --time uses the format D-H:M:S. Alternatively you may use: matlab -nodisplay -nodesktop -nosplash < script.m
Running time ~3.046 s
17 More sbatch options
#SBATCH --begin=now+1hour : Defer the allocation of the job until the specified time
#SBATCH --mail-type=all : Notify the user by e-mail when certain event types occur (BEGIN, END, FAIL, REQUEUE, etc.)
#SBATCH --mail-user=... : Address used to receive notification of state changes as defined by --mail-type
#SBATCH --mem=65536 : Specify the real memory required per node in megabytes. The default value is DefMemPerNode (UNLIMITED)
#SBATCH --nodelist=nucleus0[10-20] : Request a specific list of node names. The order of the node names in the list is not important; the node names will be sorted by SLURM
A combined header using these options is sketched below.
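A sketch of a job header combining these options (all values, including the mail address, are placeholders and examples only):
#!/bin/bash
#SBATCH --job-name=optionsdemo
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-01:00:00
#SBATCH --begin=now+1hour
#SBATCH --mail-type=all
#SBATCH --mail-user=your.name@example.edu
#SBATCH --mem=65536
#SBATCH --nodelist=nucleus0[10-20]
#SBATCH --output=optionsdemo.%j.out

module add matlab/2013a
matlab -nodisplay -nodesktop -nosplash < script.m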
18 Demo 02 : submit multiple tasks to a single node (sequential tasks)
#!/bin/bash
#SBATCH --job-name=multimatlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplejob.%j.out
#SBATCH --error=smultiplejob.%j.time
module add matlab/2013a
time(sh script.m)
Running time ~ s
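The contents of script.m do not survive in this text, so the following is only a hypothetical sketch of what a sequential wrapper might contain (the function name and task count follow the pattern of Demo 01 and are assumptions):
# hypothetical script.m: 16 MATLAB tasks executed one after another
for i in $(seq 1 16); do
    matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($i), exit"
done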
19 Demo 03 : submit multiple tasks to a single node (simultaneous tasks), version 1
#!/bin/bash
#SBATCH --job-name=multicorematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecore.%j.out
#SBATCH --error=smultiplecore.%j.time
module add matlab/2013a
time(sh script.m)
Running time ~6.156 s
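For the simultaneous version, a hypothetical script.m could start all tasks in the background and wait for them to finish (again an assumption about the wrapper, not the original file):
# hypothetical script.m: launch all tasks at once, then wait
for i in $(seq 1 16); do
    matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($i), exit" &
done
wait    # do not exit until every background task has completed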
20 Demo 04 : submit multiple jobs to a single node (simultaneous tasks), version 2
#!/bin/bash
#SBATCH --job-name=multi_matlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --time=0-00:00:30
#SBATCH --output=multiplecoreadv.%j.out
#SBATCH --error=smultiplecoreadv.%j.time
module add matlab/2013a
time(sh script.m)
Running time ~4.904 s
21 Demo 05 : submit multiple jobs to a single node with srun
#!/bin/bash
#SBATCH --job-name=srunsinglenodematlab
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srunsinglenode.%j.out
#SBATCH --error=srunsinglenode.%j.time
module add matlab/2013a
time(sh script.m)
srun will create a resource allocation to run the parallel job. SLURM_LOCALID : environment variable; node-local task ID for the process within a job (zero-based).
Running time ~4.596 s
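A sketch of how the wrapper might use srun and SLURM_LOCALID so that each of the 16 tasks processes a different input (the helper script name and the index mapping are assumptions):
# inside the batch script (or a hypothetical script.m): launch 16 copies of a worker
srun bash worker.sh

# hypothetical worker.sh: pick a task based on the zero-based local task ID
TASK=$((SLURM_LOCALID + 1))
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($TASK), exit"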
22 Why srun? (diagram: srun talks to slurmctld, and slurmd on Node 0 spawns a slurmstepd to run the tasks) srun will create a resource allocation to run the parallel job. SLURM_LOCALID : environment variable; node-local task ID for the process within a job (zero-based).
23 Demo 06 : submit multiple tasks to multiple nodes with srun
#!/bin/bash
#SBATCH --job-name=srun2nodesmatlab
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --time=0-00:00:30
#SBATCH --output=srun2nodes.%j.out
#SBATCH --error=srun2nodes.%j.time
module add matlab/2013a
time(sh script.m)
SLURM_NODEID : the relative node ID of the current node (zero-based). SLURM_NNODES : total number of nodes in the job's resource allocation. SLURM_NTASKS : total number of processes in the current job.
Running time ~4.741 s
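Across two nodes the tasks can be indexed by their global rank rather than the node-local ID; a sketch (SLURM_PROCID is the zero-based global task rank, and the worker script is again hypothetical):
srun bash worker.sh

# hypothetical worker.sh for the two-node case
TASK=$((SLURM_PROCID + 1))          # 1..SLURM_NTASKS across both nodes
echo "task $TASK of $SLURM_NTASKS on node $SLURM_NODEID of $SLURM_NNODES"
matlab -nodisplay -nodesktop -nosplash -r "forbiohpctestplot($TASK), exit"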
24 Demo 06 : submit multiple tasks to multiple nodes with srun Matrices 1-8 executed on Nucleus015, matrices 9-16 executed on Nucleus016. Inside each node, job execution is out of order.
25 Demo 07 : submit a job with a dependency >sbatch test_srun2nodes.sh >sbatch --dependency=afterok:{jobid} gen_gif.sh
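The job ID of the first job can be captured to build the dependency automatically; a sketch (--parsable makes sbatch print just the job ID on most setups):
>JOBID=$(sbatch --parsable test_srun2nodes.sh)      # submit the first job and keep its ID
>sbatch --dependency=afterok:$JOBID gen_gif.sh      # start only after the first job completes successfully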
26 Parallelize your job At the thread level - Shared memory: multiple processors can operate independently but share the same memory - multi-threading - Fast, but lower scalability (cannot cross node boundaries). e.g. POSIX Threads, usually referred to as Pthreads - implemented with a <pthread.h> header and a thread library - for the C programming language - thread management: creating, joining threads, etc.
27 Demo 08 : submit a multi-core job to a single node
#!/bin/bash
#SBATCH --job-name=phenix
#SBATCH --partition=super
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --time=0-20:00:00
#SBATCH --output=phenix.%j.out
#SBATCH --error=phenix.%j.time
module add phenix/1.9
phenix.den_refine model.pdb data.mtz nproc=30
Note: --ntasks does not necessarily need to be specified here. Think of your job from a scientific perspective: can this scientific problem be solved in parallel? Check with the software or read the manual to see whether it can use multithreading, and try to use proper parameters. If you need help, send e-mail to [email protected]
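An alternative way to express a single multithreaded process is one task with many CPUs; a sketch showing only the lines that differ from the demo (whether --ntasks or --cpus-per-task fits better depends on how the application launches its threads):
# one task, 30 CPUs for its threads
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=30

phenix.den_refine model.pdb data.mtz nproc=$SLURM_CPUS_PER_TASK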
28 Parallelize your job At the node level - Distributed memory - processors have their own local memory - memory is scalable with the number of processors. Q: How does a processor access the memory of another node? e.g. Message Passing Interface (MPI) - scales to over 100k processors - distributed - a library of functions (for C, C++, Fortran), not a new language
29 Demo 09 : submit a multi-node job with message passing
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_test.%j.out
module add mvapich2/gcc/1.9
mpirun ./mpi_only
Rules of thumb: total tasks / number of nodes <= 32; memory needed for each task * tasks on each node <= 64GB/128GB/256GB/384GB
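To control how the 8 MPI ranks are spread over the 2 nodes, the per-node task count can be pinned down explicitly; a sketch showing only the lines that differ from the demo:
# 2 nodes x 4 tasks per node = 8 MPI ranks, evenly distributed
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

module add mvapich2/gcc/1.9
mpirun ./mpi_only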
30 Demo 10 : submit a multi-node job (e.g. MPI-pthread)
#!/bin/bash
#SBATCH --job-name=mpi_pthread
#SBATCH --partition=super
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=0-00:00:30
#SBATCH --output=mpi_pthread.%j.out
module add mvapich2/gcc/1.9
NUM_THREADS=$((SLURM_CPUS_ON_NODE / (SLURM_NTASKS / SLURM_NNODES)))
mpirun ./mpi_pthread $NUM_THREADS
Rules of thumb: total MPI tasks * threads per task / number of nodes <= 32; memory needed for each MPI task * MPI tasks on each node <= 64GB/128GB/256GB/384GB. A similar strategy applies to MPI-OpenMP code: set OMP_NUM_THREADS=$NUM_THREADS (see the sketch below).
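For the MPI-OpenMP variant the same idea can be written with --cpus-per-task, which also tells SLURM how many CPUs each rank owns; a sketch (the binary name is hypothetical):
# 4 ranks per node x 8 threads each = 32 CPUs per node
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8

module add mvapich2/gcc/1.9
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./mpi_openmp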
31 Demo 11 : submit a GPU job (e.g. CUDA)
#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=0-00:10:00
#SBATCH --output=cuda.%j.out
#SBATCH --error=cuda.%j.err
module add cuda65
./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320
Jobs will not be allocated any generic resources unless specifically requested at job submit time, using the --gres option supported by sbatch and srun. Format: --gres=gpu:[n], where n is the number of GPUs. Use the GPU partition. (The example multiplies matrices A(320, 10240) and B(10240, 320).)
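To request more than one GPU, or to check which GPUs SLURM has assigned, something like the following can be used (a sketch; it assumes the GPU nodes have at least two GPUs and that SLURM's GPU plugin sets CUDA_VISIBLE_DEVICES, which is the typical configuration):
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2

module add cuda65
echo "GPUs visible to this job: $CUDA_VISIBLE_DEVICES"
nvidia-smi                            # confirm which devices were allocated
./matrixMul -wA=320 -hA=10240 -wB=10240 -hB=320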
32 Getting Effective Help Use the ticket system. What is the problem? Provide any error message and diagnostic output you have. When did it happen? What time? Cluster or client? What job ID? How did you run it? What did you run, what parameters, what do they mean? Any unusual circumstances? Have you compiled your own software? Do you customize startup scripts? Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.