High Performance Compute Cluster

High Performance Compute Cluster: Overview for Researchers and Users
Alces Software Limited, 2013

Copyrights, Licenses and Acknowledgements
Dell, PowerEdge, PowerVault and PowerConnect are recognised trademarks of Dell Corporation. Intel, Intel Xeon and QPI are recognised trademarks of Intel Corp. AMD Opteron is a recognised trademark of Advanced Micro Devices, Inc. The Whamcloud logo and some course content are reproduced with permission from Whamcloud Inc. Lustre and Lustre file system are recognised trademarks of Xyratex. Other product names, logos, brands and other trademarks referred to within this documentation, as well as other products and services, are the property of their respective trademark holders. All rights reserved. These trademark holders are not affiliated with Alces Software, our products, or our services, and may not sponsor or endorse our materials.
This material is designed to support an instructor-led training schedule and is for reference purposes only. This documentation is not designed as a stand-alone training tool; example commands and syntax are intended to demonstrate functionality in a training workshop environment and may cause data loss if executed on live systems. Remember: always take backups of any valuable data.
This documentation is provided as-is and without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. The Alces Software HPC Cluster Toolkit is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. These packages are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details (http://www.gnu.org/licenses/). A copy of the GNU Affero General Public License is distributed along with this product.
For more information on Alces Software, please visit: http://www.alces-software.com/. Please support software developers wherever they work: if you use Open Source software, please consider contributing to the maintaining organisation or project, and crediting their work as part of your research publications.
This work is Copyright 2007-2013 Alces Software Ltd. All Rights Reserved. Unauthorized re-distribution is prohibited except where express written permission is supplied by Alces Software.

Agenda
- HPC cluster technology overview
- Getting started
- Cluster access methods (CLI, graphical, web-portal)
- Data storage (access, retrieval, quotas, performance)
- Software development environment
  - Using compilers, libraries and MPIs
- Software application services
- Job scheduler system
  - Principles and benefits
  - Creating job scripts, submitting jobs and getting results
  - Special cases, tuning
- Documentation review
- Where to get help

HPC Cluster Technology

Beowulf Architecture
- Loosely-coupled Linux supercomputer
- Efficient for a number of use-cases:
  - Embarrassingly parallel / single-threaded jobs
  - SMP / multi-threaded, single-node jobs
  - MPI / parallel multi-node jobs
- Very cost-effective HPC solution:
  - Commodity x86_64 server hardware and storage
  - Linux-based operating system
  - Specialist high-performance interconnect and software

Beowulf Architecture
- Scalable architecture
  - Dedicated management and storage nodes
  - Dedicated login node pair to provide user access
- Multiple compute node types for different jobs:
  - Standard memory compute nodes (4GB/core)
  - High memory compute nodes (16GB/core)
  - Very high memory SMP node (1TB RAM)
- Multiple storage tiers for user data:
  - 2TB local scratch disk on every node
  - 55TB shared Lustre parallel filesystem
  - 4TB home-directory / archive storage

Cluster facilities
- Dual headnodes
- Dual login nodes
- Resilient networks
- Dual Infiniband core
- Dual Lustre servers
- Compute:
  - 20 x standard nodes / 320 cores
  - 16 x high-memory nodes / 256 cores
  - 1 x very high memory node / 1TB RAM

User facilities
- Modern 64-bit Linux operating system
  - Compatible with a wide range of software
- Pre-installed with tuned HPC applications
  - Compiled for the latest CPU architecture
- Comprehensive software development environment
  - C, C++, Fortran, Java, Ruby, Python, Perl, R
- Modules environment management

User facilities
- High throughput job-scheduler for increased utilisation
  - Resource request limits to protect running jobs
  - Fair-share for cluster load-balancing
  - Resource reservation with job backfilling
- Multiple user-interfaces supporting all abilities
  - Command-line access
  - Graphical desktop access
  - Web portal for job monitoring, file transfer and accounting

HPC Cluster Hardware

Service node configuration
- Head and storage nodes are Dell PowerEdge R520

Service node configuration
- Login nodes are Dell PowerEdge R720

Compute node configuration
- Dell C6220 compute nodes
- Dual 8-core CPUs, 64 or 256GB RAM
- QDR IB

Compute node configuration
[Slide diagram: Intel server CPU evolution across the 65nm, 45nm, 32nm and 22nm process generations]
- Intel Core microarchitecture: first high-volume quad-core server CPUs; dedicated high-speed bus per CPU; hardware-assisted virtualization (VT-x)
- Nehalem microarchitecture: up to 6 cores and 12MB cache; integrated memory controller with DDR3 support; Turbo Boost, Intel HT, AES-NI
- Sandy Bridge microarchitecture: up to 8 cores and 20MB cache; end-to-end hardware-assisted virtualization (VT-x, -d, -c); integrated PCI Express; Turbo Boost 2.0; Intel Advanced Vector Extensions (AVX)

Compute node configuration
[Slide diagram: two-socket platform comparison]
- Xeon 5500/5600 platform (Nehalem/Westmere, 4/6 cores): DDR3-1333, 6.4 GT/s QPI, up to 12 DIMMs per two-socket platform, two-chip IOH/ICH (Intel 5500 Series IOH + Intel ICH10) with up to 36 PCIe 2.0 lanes
- Xeon E5-2600 platform (Sandy Bridge/Ivy Bridge, 4/6/8 cores): DDR3-1600, 8.0 GT/s QPI with two QPI links between CPUs, up to 16 DIMMs, up to 80 PCIe 3.0 lanes (up to 40 per socket), one-chip Platform Controller Hub (Intel C600 Series PCH) with 4 x 6Gb/s SAS ports

Very high memory node
- Dell PowerEdge R910
- 4 x 8-core Xeon CPUs
- 1TB RAM
- QDR IB and dual Gb

Infiniband fabric
- High-performance 40Gbps Infiniband fabric
  - 32Gbps effective bandwidth per link
  - 1.23us latency for small messages (measured)
- Scalable core/edge design
- Dual resilient subnet manager

Cluster facilities
- Dual headnodes
- Dual login nodes
- Resilient networks
- Dual Infiniband core
- Dual Lustre servers
- Compute:
  - 20 x standard nodes / 320 cores
  - 16 x high-memory nodes / 256 cores
  - 1 x very high memory node / 1TB RAM

Accessing the cluster

Command-line access
- Login via SSH
  - Use an SSH client on Linux, UNIX and Mac hosts
  - Use PuTTY, Exceed or OpenSSH on Windows
- Connect to one of the login nodes:
  - maxwell1.abdn.ac.uk
  - maxwell2.abdn.ac.uk
- Use your site AD username and password
- Terminal access also available via web portal
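For example, from a Linux or Mac terminal (the username shown is illustrative; use your own site AD account):

  # connect to a login node with your site AD credentials
  ssh yourusername@maxwell1.abdn.ac.uk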

Admin command-line access
- Compute node access provided via the job scheduler
  - Use the qrsh command to get an interactive shell
- Commands requiring special permissions require sudo access
  - Only enabled for admin accounts
  - Privileged access to shared filesystems

Graphical Access
- X-forwarding is supported for graphical apps
- Use the ssh -X option to enable
- Available via PuTTY
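A minimal sketch (login node name from the slides above; the X application shown is illustrative and may not be installed on the cluster):

  ssh -X yourusername@maxwell1.abdn.ac.uk
  xclock &    # any X application; its window should appear on your local display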

Remote Desktop
- FreeNX remote desktop to login nodes
- Use the cluster private key file to connect:
  - /etc/nxserver/client.id_dsa.key

NX Remote Desktop

Web Portal access
- Browse to https://maxwell.abdn.ac.uk
- Designed to supplement command-line access
- User interface provides:
  - Web-based overview of cluster status
  - Access to user documentation
  - Filesystem usage and quota statistics
  - File browser for your cluster filesystems
  - Job scheduler statistics
  - Terminal interface

User documentation

Storage system metrics

File browser

Job submission metrics

HPC Cluster Software

Operating system
- Unified 64-bit Linux installation
  - Scientific Linux 6.3, 2.6.32-279 64-bit kernel
  - Automatic Transparent HugePage support
- Centralised software deployment system
  - Headnodes manage cluster software repositories
  - Compute and login nodes deployed from masters
  - Personality-less software simplifies management
  - Automatic software deployment for system regeneration
- Centralised software patching and update system

Application software
- Supporting software provided by the SL distro
  - Compilers, interpreters, drivers
- Specific software repositories also included
  - Intel/Whamcloud: Lustre software
  - Fedora/EPEL: Red Hat-compatible libraries and tools
  - Open Grid Scheduler: job scheduler software
  - Alces Software: Symphony HPC Toolkit, web portal service
- Application software
  - Commercial applications (various)
  - Alces Software: Gridware application repository

System software
- Infiniband software
  - Red Hat OpenFabrics Enterprise Distribution (OFED)
  - Dual resilient fabric manager
  - Supports the native Infiniband verbs interface
  - Accelerated Intel TrueScale Performance Scaled Messaging (PSM)

System software (storage)
- Deployed as a number of cluster filesystems:
  - /opt/alces (100GB, read-only): configuration information
  - /opt/gridware (500GB, writable by admins): user applications, compilers, libraries
  - /users (3.5TB, writable by owning user): user home directories
  - /nobackup (55TB, writable by owning user): high-performance parallel filesystem

Lustre parallel filesystem
- Open Source codebase (GPL v2)
- A distributed filesystem
- A cluster filesystem
- A POSIX filesystem
- A large-scale filesystem
- A high-performance filesystem
- Hardware agnostic

Cluster Filesystem
- Early users: Beowulf clusters, MPI jobs
- Large shared-file tasks: scientific data, CGI rendering
- Large file-per-process tasks
- Cray XT series main filesystem

Initial goals
From the first SOW:
- Must do everything NFS does
  - Shared namespace
  - Robust, ubiquitous
- Must support full POSIX semantics
- Must support all network types of interest
- Must be scalable
  - Cheaply attach storage
  - Performance scales
- Must be easy to administer

POSIX Compliant
- All client filesystem operations are coherent
  - No stale data or metadata
  - No timeout-based cache coherency like NFS
- Single client passes the full POSIX suite
  - Early Lustre requirement
  - Very important to many customers
- Applications designed to work with POSIX-compliant filesystems should work correctly with Lustre

Distributed Filesystem
- Lustre aggregates multiple servers into a single filesystem namespace
  - The namespace is then exported to clients across a network
- Client/Server design
  - Servers can have client processes
- Modular storage can be added on the fly
- WAN: many groups use Lustre over at least a campus distance; some grid implementations exist

Data file components

Relative file sections

The Big Picture
- The Lustre client splits filesystem operations into metadata and object data
  - Each goes to a dedicated server
  - Metadata: file/directory names, permissions, etc.
  - Data: stored as objects (block data); files may be striped across multiple servers
- The filesystem needs a central registration point

Basic Lustre I/O Operation

Distributed Locking
- Longstanding goal of Lustre: protect data in use
  - Provide POSIX semantics
  - Clients never see stale data or metadata
- Fairly complex locking, reasonable scaling
  - Each target provides a lock service for its objects (data/metadata)
  - Predictive lock growth to reduce load on servers
  - Contended resources are handled directly by the server instead of giving a lock to each client to do the operation
- Lock revocation flushes data from the client cache
  - Data is always written to disk before it is sent to another client

Lustre overview

Lustre file striping
- If only one data object is associated with a Lustre MDS entry, we refer to that object as a singly-striped file
- Striping allows parts of a file to be stored on different OSTs

Lustre file striping
- Striping large files can lead to massive performance increases
  - Each OST can process data independently, allowing parallel I/O from clients
  - More performance-enhancing cache memory available
  - Performance from 10 identical OSSs may be greater than 10x a single OSS due to reduced contention
  - Larger maximum filesize
  - Better performance with smaller OSTs

Lustre file striping
- Use the lfs getstripe and lfs setstripe commands to influence policies
- Forced stripe settings can have unexpected results
  - File availability considerations
  - Extra stripe information stored on the MDS

Lustre file striping
- The lfs setstripe command takes the following parameters:

  lfs setstripe [-d] ([--size|-s stripe-size] [--count|-c stripe-cnt] [--offset|-o start-ost] [--pool|-p <pool>]) <filename|dirname>

- Size is the stripe size in multiples of 64KB
  - Enter zero to use the filesystem default (1024KB)
- Count is the number of OSTs to stripe over
  - Enter zero to use the filesystem default
  - Enter -1 to use all available OSTs
- Offset is the starting OST to write to
  - Enter -1 to allow Lustre to decide the best OST
- Pool is the OST pool name to use
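As a hedged illustration of the syntax above (the directory and values are examples only):

  # stripe new files created in this directory across 4 OSTs with a 1MB stripe size
  lfs setstripe -s 1m -c 4 /nobackup/$USER/bigfiles
  # confirm the layout that will be applied
  lfs getstripe /nobackup/$USER/bigfiles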

Lustre best practice for performance
- Minimise unnecessary metadata traffic
  - Avoid using ls -l; an ls command will sometimes suffice
  - Remove the --color alias to the ls command
  - Avoid repetitive stat operations on the filesystem
- Avoid directories with large numbers of files
  - Invites lock contention for the directory namespace
  - Use a stripe count of 1 for large numbers of small files
  - Huge numbers of small files may not perform well on Lustre
- Tuning for multiple readers/writers
  - Tune your compute jobs to match the number of OSS/OSTs, e.g. with 6 x OSTs a multi-writer job will perform best when run over a subset of 6, 12 or 18 nodes
  - When reading, try to make best use of the OSS cache; align reads of the same data to be simultaneous from clients

Cluster data storage
- Multiple filesystems provided for users
- Software repositories (/opt/gridware) 0.5TB
  - Contains all shared, user-facing software
  - Read-only for users
- Home directories (/users) 3.5TB
  - Default place that users log in to
  - Contains help files and links to available scratch areas
  - Quota enforced to encourage users to remove/compress data
  - Backed up by University service

Cluster data storage
- Multiple filesystems provided for users
- Lustre data storage (/nobackup) 55TB
  - High-performance parallel filesystem
  - Available to all nodes; 1.5GB/sec max throughput
  - Quotas may be set if required
- Shared scratch (~/sharedscratch) 55TB
  - Shared scratch location for parallel jobs
  - Orphan data is automatically deleted
- Local scratch (~/localscratch) 2TB/node
  - Local scratch disk on nodes = fastest available storage
  - Orphan data is automatically deleted

Cluster data storage summary

  Storage type         Location                  Access          Persistence                    Write performance
  Application store    /opt/gridware             Read-only       Shared, permanent              N/A
  Home directory       ~ (/users/<username>)     Quota enabled   Shared, permanent              Medium
  Parallel filesystem  /nobackup                 Quota enabled   Shared, permanent              Fast
  Shared scratch       ~/sharedscratch           Quota enabled   Shared, auto-deletion          Fast
  Local scratch        ~/localscratch            Full            Local to node, auto-deletion   Fastest

Filesystem Quotas
- Use the quota command to show your filesystem quota usage
- Use the -s option for human-readable format

  [alces-cluster@login1(cluster) ~]$ quota -s
  Disk quotas for user alces-cluster (uid 501):
       Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
  storage.ib:/users
                   9405M*   8813M  97657M   7days   68934    100k    200k
  [alces-cluster@login1(cluster) ~]$

- * means that you have exceeded a quota

Filesystem Quotas
- All users have a home-directory quota set
- Soft limit defaults to 50GB and 100K files per user
  - Users can continue to write data when the soft limit is exceeded
  - Default 7-day time limit to reduce the capacity used
  - After the time limit, no further capacity may be used
- Hard limit defaults to 75GB and 150K files per user
  - Users cannot exceed their hard limit
- Group limits may also be enforced as required

Efficient space usage
- Data stored in scratch areas may be removed
  - Compute node disks are persistent during jobs only
  - Old data is automatically removed from scratch areas
- Data stored in /nobackup is not backed up
  - Consider summarising and saving to your home directory
  - Contact your supervisor about data archive facilities
- Consider compressing data stored in your home directory
  - Standard Linux utilities installed on login nodes: gzip, bzip2, zip/unzip, tar
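For example (the paths and file names are illustrative):

  # bundle and compress a finished results directory into the home directory
  tar -czf ~/project-results.tar.gz /nobackup/$USER/myproject/results/
  # or compress individual large files in place
  gzip ~/large_output.dat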

File permissions and sharing
- All files and directories have default permissions
- Contents of your home directory are private

  [alces-cluster@login1(cluster) users]$ ls -ald $HOME
  drwx------ 17 alces-cluster users 24576 Feb 18 17:57 /users/alces-cluster

- To give colleagues access to your files, use the chmod command:

  [alces-cluster@login1(cluster) users]$ chmod g+rx $HOME
  [alces-cluster@login1(cluster) users]$ ls -ald $HOME
  drwxr-x--- 17 alces-cluster users 24576 Feb 18 17:57 /users/alces-cluster

- Scratch directories are more open by default

Software Development Environment
- Compilers
  - Defaults to the distribution version (GCC 4.4.6)
  - GCC version 4.6.3 also available via module
  - Intel Cluster Suite Pro (v117.13.0) also available
- To use, load the module then just call the compiler:

  [alces-cluster@login1(cluster) ~]$ module load compilers/intel
  compilers/intel/117.13.0 OK
  [alces-cluster@login1(cluster) ~]$ which icc
  /opt/gridware/pkg/apps/intelcsxe/2013.0.028/bin/bin/icc
  [alces-cluster@login1(cluster) ~]$ which ifort
  /opt/gridware/pkg/apps/intelcsxe/2013.0.028/bin/bin/ifort
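A minimal sketch of compiling with the Intel compilers once the module is loaded (the source file names are illustrative):

  module load compilers/intel
  icc -O2 -o myprog myprog.c        # C
  ifort -O2 -o myprog myprog.f90    # Fortran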

Software Development Environment
- Message passing environment (MPI)
  - Used for parallel jobs spanning multiple nodes
- Multiple versions available
  - OpenMPI version 1.6.3
  - Intel MPI version 4.1.0.024
- To use, load the module then compile and run:

  [alces-cluster@login1(cluster) ~]$ module load mpi/intelmpi
  mpi/intelmpi/4.1.0.024/bin OK
  [alces-cluster@login1(cluster) ~]$ which mpirun
  /opt/gridware/pkg/apps/intelcsxe/2013.0.028/bin/impi/4.1.0.024/bin64/mpirun
  [alces-cluster@login1(cluster) ~]$
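A hedged example of building and test-running a small MPI program (the file names are illustrative; production runs should be submitted through the job scheduler rather than run on a login node):

  module load mpi/openmpi
  mpicc -O2 -o hello_mpi hello_mpi.c
  mpirun -np 4 ./hello_mpi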

Software Development Environment
- modules environment management
  - The primary method for users to access software
  - Enables a central, shared software library
  - Provides separation between software packages
  - Supports multiple incompatible versions
  - Automatic dependency analysis and module loading
- Available for all users to:
  - Use the centralised application repository
  - Build their own applications and modules in home dirs
  - Ignore modules and set up their user account manually

Modules environment management
- Loading a module does the following:
  - Puts binaries in your $PATH for your current session
  - Puts libraries in your $LD_LIBRARY_PATH
  - Enables manual pages and help files
  - Sets variables to allow you to find and use packages

  [alces-cluster@login1(cluster) ~]$ module load apps/breakdancer
  apps/breakdancer/1.3.5.1/gcc-4.4.6+samtools-0.1.18+boost-1.51.0 OK
  [alces-cluster@login1(cluster) ~]$ echo $BREAKDANCERDIR
  /opt/gridware/pkg/apps/breakdancer/1.3.5.1/gcc-4.4.6+samtools-0.1.18+boost-1.51.0
  [alces-cluster@login1(cluster) ~]$ echo $BREAKDANCERBIN
  /opt/gridware/pkg/apps/breakdancer/1.3.5.1/gcc-4.4.6+samtools-0.1.18+boost-1.51.0/bin
  [alces-cluster@login1(cluster) ~]$ which breakdancer-max
  /opt/gridware/pkg/apps/breakdancer/1.3.5.1/gcc-4.4.6+samtools-0.1.18+boost-1.51.0/bin/breakdancer-max

Modules environment management
- Using the module avail command:

  [alces-cluster@login1(cluster) ~]$ module avail
  ----------------------- /users/alces-cluster/gridware/etc/modules --------------------
  apps/hpl/2.1/gcc-4.4.6+openmpi-1.6.3+atlas-3.10.0
  ---------------------------- /opt/gridware/etc/modules -------------------------------
  apps/abyss/1.2.6/gcc-4.4.6+openmpi-1.6.3+sparsehash-1.10
  apps/artemis/14.0.17/noarch
  apps/augustus/2.6.1/gcc-4.4.6
  apps/blat/35/gcc-4.4.6
  apps/bowtie/0.12.8/gcc-4.4.6
  apps/picard/1.85/noarch
  apps/plink/1.07/gcc-4.4.6
  apps/python/2.7.3/gcc-4.4.6
  apps/r/2.15.2/gcc-4.4.6+lapack-3.4.1+blas-1
  apps/rapid/0.6/noarch
  apps/recon/1.05/gcc-4.4.6
  libs/blas/1/gcc-4.4.6
  libs/boost/1.45.0/gcc-4.4.6+openmpi-1.6.3+python-2.7.3
  mpi/intelmpi/4.1.0.024/bin
  mpi/openmpi/1.6.3/gcc-4.4.6

Modules environment management
Other modules commands available:
- # module unload <module>
  Removes the module from the current environment
- # module list
  Shows currently loaded modules
- # module display <module> / # module whatis <module>
  Shows information about the software
- # module keyword <search term>
  Searches for the supplied term in the available modules' whatis entries

Modules environment management
- By default, modules are loaded for your session only
- Loading modules automatically (all sessions):
  - # module initadd <module>
    Loads the named module every time a user logs in
  - # module initlist
    Shows which modules are automatically loaded at login
  - # module initrm <module>
    Stops a module being loaded on login
  - # module use <new module dir>
    Allows users to supply their own modules branches
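For example, to have the Intel compiler module loaded automatically at every login (module name taken from the earlier compiler slide):

  module initadd compilers/intel
  module initlist                  # confirm it will now load at login
  module initrm compilers/intel    # remove it again if no longer required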

Modules environment management
- Using modules
  - Shared applications are installed with a module
  - New requests are assessed for suitability to be a shared application
  - Available from the /opt/gridware NFS share
  - No applications installed on compute nodes
  - Same application set available throughout the cluster
- Modules are always available to all users
  - Can be loaded on the command line
  - Can be included to load automatically at login
  - Can be included in scheduler job scripts

HPC Applications
- Many popular applications available
  - Centrally hosted to avoid replication
  - Multiple versions supported via modules
- Interactive applications also supported
  - Please run on compute nodes via qrsh
  - Users should not run applications on login nodes
- Open-source and commercial apps supported
  - Commercial application support provided by the license provider

Requesting new applications
- Contact your site admin for information
- Application requests will be reviewed
  - Most open-source applications can be incorporated
  - Commercial applications require appropriate licenses
  - A site license may be required
  - License server available on cluster service nodes
- Users can install their own packages in home dirs
  - Many software tools are available in Linux

Cluster job scheduler
- Open Grid Scheduler
  - Developed from the Sun Grid Engine codebase
  - Virtually identical to the original syntax
  - Open-source with a wide range of contributors
  - Free to download and use; please consider acknowledging copyrights
- Very stable and established application
  - Many command-line utilities and tools
  - Admin GUI also available but rarely used
  - Alces web-portal used to report accounting data

Cluster job scheduler
- Why do we need a job scheduler?
  - Need to allocate compute resources to users
  - Need to prevent user applications from overloading compute nodes
  - Want to queue work to run overnight and out of hours
  - Want to ensure that users each get a fair share of the available resources

Types of job
- Serial jobs
  - Use a single CPU core
  - May be submitted as a task array (many single jobs)
- Multi-threaded jobs
  - Use two or more CPU cores on the same node
  - User must request the number of cores required
- Parallel jobs
  - Use two or more CPU cores on one or more nodes
  - User must request the number of cores required
  - User may optionally request a number of nodes to use
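Illustrative submission commands for each type (script names and core counts are examples; the parallel environments are described later in this deck):

  qsub serial_job.sh                      # serial job, one core
  qsub -t 1-50 serial_job.sh              # task array of 50 serial jobs
  qsub -pe smp 8 threaded_job.sh          # multi-threaded job, 8 cores on one node
  qsub -pe mpislots 32 parallel_job.sh    # MPI job, 32 cores across one or more nodes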

Class of job
- Interactive jobs
  - Started with the qrsh command
  - Will start immediately if resources are available
  - Will exit immediately if there are no resources available
  - May be serial, multi-threaded or parallel type
  - Users must request a maximum runtime
  - Users should specify the resources required to run
  - Input data is the shell session or application started
  - Output data is shown on screen
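A hedged sketch of an interactive request with explicit resources (the values are illustrative; the individual resource options are covered later in this deck):

  qrsh -l h_rt=2:0:0 -l h_vmem=8G -pe smp 4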

Interactive Jobs
- qrsh allows users to run interactive applications directly on a compute node
  - Do not run demanding applications on login nodes
  - Per-user CPU and memory limits are in place to discourage users from running on login nodes
  - Login node CPU time is not included in scheduler accounting
- Users are prevented from logging in directly to compute nodes; use qrsh for access

  [alces-cluster@login1(cluster) ~]$ qrsh
  [alces-cluster@node12(cluster) ~]$ uptime
  17:01:07 up 7 days, 5:12, 1 user, load average: 0.00, 0.00, 0.00
  [alces-cluster@node12(cluster) ~]$

Class of job
- Non-interactive (batch) jobs
  - Submitted via the qsub <job> command
  - Submitted to the scheduler queue and left to run
  - Will start immediately if resources are available
  - Will queue for resources if the job cannot run
  - May be serial, multi-threaded or parallel type
  - Users must request a maximum runtime
  - Users should specify the resources required to run
  - Input data is usually stored on disk
  - Output is stored to a file on disk

Non-interactive jobs
- Jobs are submitted via a job-script
  - Contains the commands to run your job
  - Can contain instructions for the job-scheduler
- Submission of a simple job-script
  - Echo commands to stdout
  - Single-core batch job
  - Runs with default resource requests
  - Assigned a unique job number (2573)
  - Output sent to home directory by default

Simple batch job
Job-script: ~/simple_jobscript.sh

  #!/bin/bash
  echo "Starting new job"
  sleep 120
  echo "Finished my job"

  [alces-cluster@login1(cluster) myjob]$ qsub simple_jobscript.sh
  Your job 2573 ("simple_jobscript.sh") has been submitted
  [alces-cluster@login1(cluster) myjob]$ qstat
  job-id  prior    name        user           state  queue            slots  ja-task-id
  --------------------------------------------------------------------------------------
  2573    2.02734  simple_job  alces-cluster  r      byslot.q@node21  1
  [alces-cluster@login1(cluster) myjob]$ cat ~/simple_jobscript.sh.o2573
  Starting new job
  Finished my job
  [alces-cluster@login1(cluster) myjob]$

Viewing the queue status
- Use the qstat command to view queue status:

  [alces-cluster@login1(cluster) myjob]$ qstat
  job-id  prior    name        user          state  submit/start at      queue                          slots
  ------------------------------------------------------------------------------------------------------------
  2575    2.06725  imb-any.sh  alces-cluste  r      02/14/2013 17:15:38  byslot.q@node08.cluster.local  32
  2576    2.06725  imb-any.sh  alces-cluste  r      02/14/2013 17:15:43  byslot.q@node10.cluster.local  32
  2577    2.15234  imb-any.sh  alces-cluste  r      02/14/2013 17:15:43  byslot.q@node15.cluster.local  48
  2578    1.90234  simple_job  alces-cluste  r      02/14/2013 17:15:58  byslot.q@node11.cluster.local  1
  2579    0.00000  simple_job  alces-cluste  qw     02/14/2013 17:16:25                                 1

- Running jobs (r) are shown with the queues used
- Queuing jobs (qw) are waiting to run
- Use qstat -j <jobid> for more information on why queuing jobs have not started yet

Removing jobs from the queue
- Use the qdel command to remove a job:

  [alces-cluster@login1(cluster) myjob]$ qstat
  job-id  prior    name        user          state  submit/start at      queue                          slots
  ------------------------------------------------------------------------------------------------------------
  2577    1.90234  imb-any.sh  alces-cluste  r      02/14/2013 17:15:43  byslot.q@node15.cluster.local  48
  2579    1.27734  simple_job  alces-cluste  qw     02/14/2013 17:16:25                                 1
  [alces-cluster@login1(cluster) myjob]$ qdel 2577
  alces-cluster has registered the job 2577 for deletion
  [alces-cluster@login1(cluster) myjob]$ qstat
  job-id  prior    name        user          state  submit/start at      queue                          slots
  ------------------------------------------------------------------------------------------------------------
  2579    1.27734  simple_job  alces-cluste  qw     02/14/2013 17:16:25                                 1

- Nodes are automatically cleared up
- Users can delete their own jobs only
- Operators can delete any job

Job-scheduler instructions
- qsub and qrsh can receive instructions from users
- Common instructions include:
  - Provide a name for your job
  - Control how output files are written
  - Request email notification of job status
  - Request additional resources:
    more CPU cores, more memory, a longer run-time, or access to special purpose compute nodes

Job-scheduler instructions
- Job-scheduler instructions may be given as parameters to your qsub or qrsh command
- Use -N <name> to set a job name
- The job name is visible via the qstat command

  [alces-cluster@login1(cluster) ~]$ qrsh -N hello
  [alces-cluster@node06(cluster) ~]$ qstat
  job-id  prior    name   user          state  submit/start at      queue            slots
  ------------------------------------------------------------------------------------------
  2580    1.90234  hello  alces-cluste  r      02/24/2013 17:29:41  byslot.q@node06  1

Job-scheduler instructions
- Non-interactive jobs can provide instructions as part of their job-script
  - Begin instructions with the #$ identifier
  - Multiple lines can be processed

  [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh
  #!/bin/bash
  #$ -N my_job
  echo "Starting new job"
  sleep 120
  echo "Finished my job"
  [alces-cluster@login1(cluster) myjob]$ qsub simple_jobscript.sh
  Your job 2581 ("my_job") has been submitted
  [alces-cluster@login1(cluster) myjob]$ qstat
  job-id  prior    name    user          state  submit/start at      queue            slots
  -------------------------------------------------------------------------------------------
  2581    1.40234  my_job  alces-cluste  r      02/14/2013 17:36:26  byslot.q@node03  1

Setting output file location
- Provide a location to store the output of a job
  - Use the -o <filename> option
  - Supply the full path and filename
  - The variable $JOB_ID is set automatically to your job ID number

  [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh
  #!/bin/bash
  #$ -o ~/outputfiles/myjob.$JOB_ID
  echo "Starting new job"
  sleep 120
  echo "Finished my job"

Requesting email notification
- Provide an email address for job status
  - Use the -M <email address> option
  - Specify conditions to report using -m <b|e|a>
  - Send email when the job Begins, Ends or Aborts

  [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh
  #!/bin/bash
  #$ -o ~/outputfiles/myjob.$JOB_ID
  #$ -M user@work.com
  #$ -m bea
  echo "Starting new job"
  sleep 120
  echo "Finished my job"

Running task arrays
- Used to run an array of serial jobs
  - Input and output files can be separated
  - Individual tasks are independent and may run anywhere
  - Use the -t <start>-<end> option

  [alces-cluster@login1(cluster) myjob]$ cat simple_taskarray.sh
  #!/bin/bash
  #$ -t 1-100 -cwd -o ~/results/taskjob.$JOB_ID
  module load apps/fastapp
  fastapp -i ~/inputdata/input.$SGE_TASK_ID -o \
    ~/results/out.$SGE_TASK_ID

Requesting additional resources
- If no additional resources are requested, jobs are automatically assigned default resources:
  - One CPU core
  - Up to 4GB RAM
  - Maximum of 24-hour runtime (wall-clock)
  - Execute on standard compute nodes
- These limits are automatically enforced
- Jobs must request different limits if required
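For instance, the defaults can be overridden at submission time (values are illustrative; each option is covered on the following slides):

  qsub -pe smp 4 -l h_vmem=8G -l h_rt=48:0:0 jobscript.sh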

Requesting more CPU cores
- Multi-threaded and parallel jobs require users to request the number of CPU cores needed
- The scheduler uses a Parallel Environment (PE)
  - Must be requested by name
  - Number of slots (CPU cores) must be requested
  - Enables the scheduler to prepare nodes for the job
- Three PEs are available on the cluster:
  - smp / smp-verbose
  - mpislots / mpislots-verbose
  - mpinodes / mpinodes-verbose

smp Parallel Environment
- Designed for SMP / multi-threaded jobs
- Job must be contained within a single node
- Requested with the number of slots / CPU-cores
  - Verbose variant shows setup information in the job output
- Memory limit is requested per slot
- Request with -pe smp-verbose <slots>

smp Parallel Environment

  [alces-cluster@login1(cluster) smptest]$ cat runsmp.sh
  #!/bin/bash
  #$ -pe smp-verbose 7 -o ~/smptest/results/smptest.out.$JOB_ID
  ~/smptest/hello
  [alces-cluster@login1(cluster) smptest]$ cat ~/smptest/results/smptest.out.2583
  =======================================================
  SGE job submitted on Mon Feb 14 14:01:31 GMT 2013
  JOB ID: 2583
  JOB NAME: runsmp.sh
  PE: smp-verbose
  QUEUE: byslot.q
  MASTER node17.cluster.local
  =======================================================
  2: Hello World!
  5: Hello World!
  6: Hello World!
  1: Hello World!
  4: Hello World!
  0: Hello World!
  3: Hello World!
  ==================================================
  [alces-cluster@login1(cluster) smptest]$

mpislots Parallel Environment
- Designed for multi-core MPI jobs
- Job may use slots on many nodes
  - MPI hostfile is generated by the scheduler
- Requested with the number of slots / CPU-cores
  - Verbose variant shows setup information in the job output
- Memory limit is requested per slot
  - Default is 4GB per slot (CPU core)
- Request with -pe mpislots <slots>

mpislots Parallel Environment

  [alces-cluster@login1(cluster) ~]$ cat mpijob.sh
  #!/bin/bash
  #$ -pe mpislots-verbose 48 -cwd -N imb-job -o ~/imb/results/imb-job.$JOB_ID -V
  module load apps/imb
  mpirun IMB-MPI1
  [alces-cluster@login1(cluster) ~]$ cat ~/imb/results/imb-job.2584
  =======================================================
  SGE job submitted on Mon Feb 14 14:11:31 GMT 2013
  3 hosts used
  JOB ID: 2584
  JOB NAME: imb-job
  PE: mpislots-verbose
  QUEUE: byslot.q
  MASTER node07.cluster.local
  Nodes used: node07 node24 node02
  =======================================================
  ** A machine file has been written to /tmp/sge.machines.2584 on node07.cluster.local **
  =======================================================
  If an output file was specified on job submission, Job Output Follows:
  =======================================================

mpinodes Parallel Environment
- Designed for MPI jobs that use whole nodes
- Job has a number of entire nodes dedicated to it
  - MPI hostfile is generated by the scheduler
- Requested with the number of nodes (16 cores each)
  - Verbose variant shows setup information in the job output
- Memory limit is requested per slot
  - Default is 64GB per slot (complete node)
- Request with -pe mpinodes <slots>

mpinodes Parallel Environment

  [alces-cluster@login1(cluster) ~]$ cat mpinodesjob.sh
  #!/bin/bash
  #$ -pe mpinodes-verbose 3 -cwd -N imb-job -o ~/imb/results/imb-job.$JOB_ID -V
  module load apps/imb
  mpirun IMB-MPI1
  [alces-cluster@login1(cluster) ~]$ cat ~/imb/results/imb-job.2585
  =======================================================
  SGE job submitted on Mon Feb 14 15:20:41 GMT 2013
  3 hosts used
  JOB ID: 2585
  JOB NAME: imb-job
  PE: mpinodes-verbose
  QUEUE: bynode.q
  MASTER node05.cluster.local
  Nodes used: node05 node04 node22
  =======================================================
  ** A machine file has been written to /tmp/sge.machines.2585 on node05.cluster.local **
  =======================================================
  If an output file was specified on job submission, Job Output Follows:
  =======================================================

Requesting more memory
- Default memory request is 4GB per core
- Request more using -l h_vmem=<amount>

  [alces@login1(cluster) ~]$ qsub -l h_vmem=8g jobscript.sh

- Jobs requesting more memory are likely to queue for longer
  - They need to wait for resources to be available to run
- Jobs requesting less memory are likely to run sooner
  - The scheduler will use all available nodes and memory
- Memory limits are enforced by the scheduler
  - Jobs exceeding their limit will be automatically stopped

How much memory should I request?
- Run the job with a large memory request
- When complete, use qacct -j <jobid>

  [alces@login1(cluster) ~]$ qsub -l h_vmem=64g simple_jobscript.sh
  [alces@login1(cluster) ~]$ qacct -j 2586
  ==============================================================
  qname        byslot.q
  hostname     node13.cluster.local
  group        users
  owner        alces-cluster
  project      default.prj
  department   alces.ul
  jobname      my_job
  jobnumber    2586
  cpu          0.003
  maxvmem      204.180M

Requesting a longer runtime
- Default runtime is 24 hours
  - Time is measured as slot occupancy from when the job starts
- Request more using -l h_rt=<hours:mins:secs>

  [alces@login1(cluster) ~]$ qsub -l h_rt=04:30:00 jobscript.sh

- This is a maximum runtime
  - Your job will be accounted for if you finish early
- Job time limits are enforced by the scheduler
  - Jobs exceeding their limit will be automatically stopped

Why not always request large runtimes?
- The scheduler performs backfilling
  - When resources are being reserved for a large job, the scheduler will automatically run short jobs on idle CPU cores
  - The scheduler needs to know how long your job may run for to allow it to jump the queue
- Users queuing large jobs can request reservation
  - Use the -R y option to request that resources are reserved for large parallel jobs
  - Reserved resources are included in your usage
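A minimal sketch of submitting a large parallel job with resource reservation (slot count, runtime and script name are examples):

  qsub -R y -pe mpislots 128 -l h_rt=24:0:0 big_parallel_job.sh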

Requesting special purpose nodes
- The vhmem01 node has 1TB of memory
- Access is restricted to particular users and groups
- Request this system with -l vhmem=true
- Remember to request the amount of memory needed

  [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh
  #!/bin/bash
  #$ -N bigjob -l h_rt=72:0:0
  #$ -l vhmem=true -l h_vmem=32g
  #$ -pe smp-verbose 32
  module load apps/mysmpjob
  mysmpbin

Requesting node exclusivity
- Ensures that your job has a complete node
- Access is restricted to particular users and groups
- Request exclusivity with -l exclusive=true
- Your job may queue for significantly longer

  [alces-cluster@login1(cluster) myjob]$ cat simple_jobscript.sh
  #!/bin/bash
  #$ -N excjob -l h_rt=2:0:0
  #$ -l exclusive=true -l h_vmem=2g
  module load apps/verybusyapp
  VeryBusyApp -all

Example job-script templates
- Example job-scripts are available on the cluster in /opt/gridware/examples/templates/

  Example job-script   Purpose
  simple.sh            Single-slot serial job
  simple-array.sh      Array of serial jobs
  smp.sh               Single-node, multi-threaded job
  multislot.sh         Multi-node, multi-threaded MPI job (assigned by slot)
  multinode.sh         Multi-node, multi-threaded MPI job (assigned by whole nodes)

Sharing resources
- Job priority affects queuing jobs only
- Functional shares
  - Static definitions of how much resource a user/group can use
- Urgency
  - The amount of time a job has been waiting to run
  - Priority of queuing jobs automatically increases over time
- Fair-share policy
  - Takes past usage of the cluster into account
  - Half-life of 14 days for resource usage
  - Ensures that the cluster is not unfairly monopolised
- Use qstat to show the priority of your job

Sharing resources
- Resource quotas
  - Static limits set by administrators
  - Allow the maximum resource available to a user or group to be controlled
- Default quotas include:
  - Limit access to the vhmem01 node
  - Access to the exclusivity flag
  - Limit on the maximum number of CPU slots
- Use qquota to show your resource quotas

Queue administration
- Administrators can perform queue management
  - Delete jobs (qdel)
  - Enable and disable queues (qmod)
  - Disabled queues finish running jobs but do not start more

  [admin@headnode1(cluster) ~]$ qstat -u \*
  job-id  prior    name        user          state  submit/start at      queue                          slots
  ------------------------------------------------------------------------------------------------------------
  2587    2.03927  imb-any.sh  alces-cluste  r      02/20/2013 12:11:43  byslot.q@node20.cluster.local  24
  2588    2.03927  imb-any.sh  alces-cluste  r      02/21/2013 11:11:21  byslot.q@node19.cluster.local  24
  2589    2.03927  imb-any.sh  alces-cluste  r      02/05/2013 12:41:44  byslot.q@node05.cluster.local  24
  2590    2.15234  imb-any.sh  alces-cluste  r      02/08/2013 08:21:51  byslot.q@node12.cluster.local  43
  2591    1.10234  my_job      alces-cluste  r      02/14/2013 02:42:06  byslot.q@node11.cluster.local  1
  [admin@headnode1(cluster) ~]$ qdel 2587
  admin has registered the job 2587 for deletion
  [admin@headnode1(cluster) ~]$ qmod -d byslot.q@node05
  admin@service.cluster.local changed state of "byslot.q@node05.cluster.local" (disabled)
  [admin@headnode1(cluster) ~]$

Execution host status
- Use the qhost command to view host status:

  [alces-cluster@login1(cluster) ~]$ qhost
  HOSTNAME   ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
  -------------------------------------------------------------------------------
  global     -          -     -      -       -       -       -
  node01     linux-x64  16    0.00   62.9G   1.1G    16.0G   0.0
  node02     linux-x64  16    0.00   62.9G   1.2G    16.0G   0.0
  node03     linux-x64  16    0.00   62.9G   1.2G    16.0G   0.0
  node04     linux-x64  16    1.59   62.9G   4.2G    16.0G   0.0
  node05     linux-x64  16    3.03   62.9G   1.2G    16.0G   0.0
  node06     linux-x64  16    4.01   62.9G   1.2G    16.0G   0.0
  node07     linux-x64  16    16.00  62.9G   8.2G    16.0G   0.0
  node08     linux-x64  16    16.00  62.9G   5.2G    16.0G   0.0
  node09     linux-x64  32    32.00  62.9G   56.2G   16.0G   0.0
  node10     linux-x64  32    32.00  62.9G   56.2G   16.0G   0.0
  node11     linux-x64  32    33.10  62.9G   56.7G   16.0G   0.0
  node12     linux-x64  32    32.94  62.9G   32.2G   16.0G   4.0
  node13     linux-x64  64    0.00   62.9G   1.5G    16.0G   0.0
  node14     linux-x64  64    0.00   62.9G   1.5G    16.0G   0.0
  node15     linux-x64  64    12.40  252.9G  31.2G   16.0G   0.0
  node16     linux-x64  64    64.00  252.9G  250.1G  16.0G   2.0

Requesting Support

Support services
- Site administrator
  - On-site point of contact for all users
  - Service delivery lead for the site
  - Organises, prioritises and referees user requests
  - Determines usage policy, time sharing, etc.
- Responsible for user data management
  - Ability to organise data files stored on the cluster
  - Can set user quotas, change file ownerships and groups
- Requests vendor service via support tickets
  - Changes to the service must be authorized by the site admin
  - Responsibility is delegated to another admin during absence

Requesting assistance
- Open a service desk ticket
- Must contain a description of the problem:
  - Username or group with the problem
  - Example job output files that show the issue
  - Node number or system component fault
  - Method for replicating the problem
  - Request priority and business case justification

Getting started

User accounts
- Requesting a user account
  - Contact your site administrator; please provide:
    your existing IT Services username (but not password) and the name of your supervisor or PI
- Once your account is created, access the system using:
  - Your existing IT Services username
  - Your existing IT Services password

www.alces-software.com Alces Software Limited 2013