1 Using WestGrid. Patrick Mann, Manager, Technical Operations. Jan. 15, 2014

2 Winter 2014 Seminar Series
- 5 February: Gino DiLabio, Molecular Modelling Using HPC and Gaussian
- 26 February: Jonathan Dursi, Responding to Canada's Research Computing Needs
- 12 March: Scott Northrup, Introduction to GPU Computing Using CUDA
- 26 March: Humaira Kamal and Alan Wagner, A Signpost on the Road to Exascale
For more information on these and other seminars, see the WestGrid website.

3 User Basics
To use WestGrid systems effectively you will need to know:
- Where to get help and information
- Which systems are suited to your project
- How to log on
- Basic Linux commands
- How to define and submit batch jobs

4 Help and Support
- WestGrid website: technical specifications, QuickStart guides, software pages, system status and notices, events, colloquia, news, ...
- WestGrid support: from novice to expert, from logon issues to in-depth parallelization. No question is too big or too small.
- Account problems: contact information is on the WestGrid website.

5 WestGrid Cluster Schematic
(Schematic) The user's desktop connects over the Internet by SSH to the cluster's login node(s). Linux scheduler node(s) dispatch work to the compute nodes, which are joined by an internal cluster network. A shared disc system provides /home and /global/scratch, with /home backed up.

6 Cluster Compute Nodes
(Schematic) Each node is a Linux box containing several multi-core CPUs that share the node's RAM; nodes are joined by an interconnect (usually InfiniBand).
- Nodes usually have 2 CPUs, with 6 or 8 cores/CPU.
- Memory is usually about 2 GB/core.
- There are 100s of nodes in one cluster.
- InfiniBand interconnects (with varying bandwidth and latency).
- Specialty systems with MUCH more memory per node.
- Specialty systems that look like a single node with lots of cores.
- Shared memory: 1 node (multicore). Distributed memory: cores on more than 1 node.

7 System Selection 1
Aim: optimally match software requirements and characteristics with systems, for fast turnaround (users!) and efficient use of resources (systems management!).
- Software: packaged or homegrown, parallelizability, scalability, memory, output.
- System: architecture, size, memory, interconnects, storage, batch policy.

8 System Selection 2
- Software? (off-the-shelf, licensed, homegrown)
- Memory requirements?
- Parallelization? Scalability? Shared or distributed memory (or both)?
- Research program characteristics? Lots of little jobs (parameter space and optimization), a few really big jobs (simulations), code development, ...

9 System Selection 3
- Small-memory serial, undemanding parallel: Hermes, Bugaboo, Jasper, Orcinus
- Shared memory (OpenMP): Breezy, Hungabee (larger memory)
- Distributed memory (MPI parallel): Bugaboo, Grex, Jasper, Lattice, Nestor, Orcinus, Parallel. (Bugaboo, Nestor: large associated storage. Lattice: small memory, 1.5 GB/core. Grex: large memory, 4 GB/core.)
- Graphics, visualization or GPU acceleration: Parallel
- Gaussian: Grex (licensed)
- Other special software (MATLAB, ...): check the QuickStart and software guides
- Archive and backup: Silo (very large 3.15 PB storage system)

10 System Selection 4
Lots of systems, some special purpose, some general purpose, each with its own software set. There is lots of generic software, but some packages are only on specific machines (see the software pages). Users may work on multiple systems. It is hard to choose, so we strongly recommend talking to an analyst.

11 Connecting to a Cluster
The login nodes (and all nodes) run Linux, with a command-line shell for typing text commands, so you need to log in via a standard terminal. We use SSH (as does most of the world). Linux and MacOS have built-in clients; on Windows, various packages are available (e.g., PuTTY).
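For example, from a Linux or MacOS terminal (a minimal sketch; the user name and host name here are placeholders, so use the login node listed in the QuickStart guide for your system):
  ssh myusername@bugaboo.westgrid.ca    # open an SSH session on the Bugaboo login node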

12 Linux
You do need to know the basics of Linux and the command line. There are lots of tutorials and books out there; see also the New Users QuickStart guide.
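A few of the everyday shell commands you will meet (illustrative only; the file and directory names are placeholders, and any Linux tutorial covers these in depth):
  pwd                     # print the current working directory
  ls -l                   # list files with details
  cd myproject            # change into a directory
  mkdir results           # create a directory
  cp input.dat results/   # copy a file
  less output.log         # page through a file
  man qsub                # read the manual page for a command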

13 Graphical Applications
Editors, visualization and other graphical interfaces: X-Windows is the Linux windowing system, and Linux editors, visualization packages and anything graphical use X. It is used by MacOS, and can be installed in Windows for free. From Linux, connect with ssh -X to forward graphics.
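For example (a sketch; the user name, host name and file name are placeholders, and on slow links ssh -Y or compression may behave better):
  ssh -X myusername@bugaboo.westgrid.ca   # log in with X11 forwarding enabled
  nedit myscript.pbs &                    # a graphical editor then displays on your desktop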

14 File Transfer
Standard tools based on SSH transport:
- scp: secure copy
- sftp: secure file transfer protocol
- rsync: file synchronization (complicated, but really useful)
Linux and MacOS have these built in; on Windows use WinSCP or FileZilla. There are lots of beautiful graphical front-ends out there! Watch for the annoying issue with line endings in files from Windows.
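For example (a minimal sketch; the user name, host name, paths and file names are placeholders):
  scp results.tar.gz myusername@bugaboo.westgrid.ca:/global/scratch/myusername/   # copy one file to the cluster
  rsync -av data/ myusername@bugaboo.westgrid.ca:data/                            # synchronize a directory, sending only changed files
  dos2unix myscript.pbs                                                           # one common fix for Windows line endings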

15 Inter-Site File Transfer
- WestGrid core network: very fast internal network connecting all sites.
- CANARIE national network connecting Compute Canada sites (and all universities and institutions).
- Especially useful for reaching the Silo backup/archival system.
- Powerful grid tools and Globus Online.

16 Useful Linux Software
Many useful, standard software packages are included on all WestGrid systems:
- Programming editors (nedit, emacs, vi, ...)
- Compilers (Intel, GNU; Fortran, C++, ...)
- Scripting (Python has become a common scientific language)
- Parallel programming (OpenMP, Open MPI)
- Base scientific libraries (BLAS, LAPACK, ...)
As usual: see the QuickStart and software pages, or ask.
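For example, compiling a small program on a login node might look like this (a sketch only; the source file names are placeholders, and compiler names, versions and recommended flags vary by system, so check the software pages first):
  gcc -O2 -fopenmp omp_hello.c -o omp_hello   # GNU C compiler with OpenMP threading enabled
  mpicc -O2 mpi_hello.c -o mpi_hello          # Open MPI compiler wrapper for an MPI program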

17 Job Basics
Login nodes are for data management, editing and compiling code, quick tests, and job management. Real work is done on the worker (compute) nodes:
- Jobs are submitted to the batch system (queued).
- Jobs are dispatched as fairly as possible to worker nodes.
- Jobs are not interactive: create a script and submit it.

18 Batch Jobs
A batch job is defined by a Linux shell script with directives that tell the scheduler what resources the job needs: memory, cores, walltime (and lots of fine detail). Jobs exceeding these pre-defined resource limits may be terminated (e.g., at the walltime limit). Jobs with incompatible requirements (e.g., cores/node) may be queued but never run.
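Typical resource directives look something like the following (a sketch of common TORQUE/PBS requests; the values are placeholders and the allowed limits are system dependent):
  #PBS -l procs=8                # request 8 cores anywhere on the cluster
  #PBS -l nodes=1:ppn=8          # or: all 8 cores packed onto a single node
  #PBS -l mem=4gb                # total memory for the job
  #PBS -l walltime=12:00:00      # maximum run time, hh:mm:ss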

19 Job Management
These are Linux command-line utilities; run them as you would any other Linux command.
- Submit a job: qsub <job script>
- Status of jobs: qstat [-f] <job id>
- Delete a job (queued or running): qdel <job id>
- Predicted start time: showstart <job id>
- Check scheduling: showq [--help] -u <user name>
Documentation: man qsub gives the standard Linux manual page; qsub --help gives a short synopsis.
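For example (a sketch; the job id and user name below are placeholders):
  showstart 12345        # estimated start time for a queued job
  showq -u myusername    # show only my jobs in the scheduler's view
  qdel 12345             # remove a queued job, or kill it if it is running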

20 Sample Job Script: hello.pbs
  #!/bin/bash        # Standard Linux first line
  #PBS -l procs=1    # Scheduling directive (there are lots of these!)
  #PBS -j oe         # join standard and error outputs
  date
  echo Hello World.
  echo This job is running on $(/bin/hostname).
Submit the job: qsub hello.pbs

21 Job Submission
  ~/PresentationTests$ qsub hello.pbs
  b0
The response gives the job id:
  ~/PresentationTests$ qstat
  Job ID   Name        User     Time Use   S   Queue
  b0       hello.pbs   pjmann   0          Q   q1

22 Job Results
The run completes (try a few runs of qstat and/or showstart):
  ~/PresentationTests$ ls
  hello.pbs  hello.pbs.e  hello.pbs.o
  ~/PresentationTests$ cat hello.pbs.o
  Thu Jan 9 12:03:57 PST 2014
  Hello World.
  This job is running on s31

23 Starting Out
Recommendations:
- Run lots of small example test jobs.
- Get a simple one working, and build up from there; we all know the debugging 80:20 (or 90:10, or 99:1) rule.
- Build up iteratively.
- Play with job management (qsub, qstat, showq, ...).

24 Debugging
- Job output can show lots of information.
- Mail the job completion info (there is a lot there; a #PBS directive).
- Explicitly define your information requirements (lots of detailed PBS directives).
- Ask for help...
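For example, the mail-related directives look something like this (a sketch; the address is a placeholder):
  #PBS -m abe                     # send mail when the job aborts (a), begins (b) and ends (e)
  #PBS -M someone@example.ca      # address that receives the notifications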

25 Interactive Jobs
Some nodes are reserved for interactive use: larger or longer test jobs and interactive work (< 3 hours). Request one with qsub -I.
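For example (a sketch; the resource limits are placeholders and interactive policies differ between systems):
  qsub -I -l procs=1,walltime=02:00:00    # wait for an interactive shell on a compute node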

26 Fair-share Job Scheduling
Job scheduling is a complex and difficult task. Each site schedules its own jobs, using MOAB fair-share scheduling.

27 Fair-share Targets
- System utilization targets are set for projects (groups) and their members.
- Fair-share allocates job priority depending on these targets, and depends on resource availability and characteristics.
- Base metric: usage over the last couple of weeks (system dependent).
- If usage > target: priority is decreased proportionally.
- If usage < target: priority is increased proportionally.
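As a rough illustration only (Moab's actual fair-share calculation is more detailed and site-configured): if a group's target is 10% of a system and it has consumed 15% over the fair-share window, its queued jobs lose priority roughly in proportion to the 5-point overage; if it has consumed only 5%, they gain priority in proportion to the 5-point shortfall.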

28 Resource Allocation
The usage targets are defined by the Resource Allocation process (RAC, the Resource Allocation Committee), an annual Compute Canada process run in October:
- Projects (PIs) complete an application.
- Applications are reviewed by technical and scientific panels.
- Decisions come in December, and the targets (allocations) are entered into the systems by Jan. 10.
A default allocation is available for projects which do not have a resource allocation.

29 Visualization and Software
Visualization and graphics (including GPUs). You can install software and packages yourself, but the analysts know about optimization, hardware details and systems details, so ASK!
(Image: Jan Paral, UAlberta, Mercury Solar Wind)

30 Asking for Help
It helps the analysts if you can include this information:
1. The name of the system (lots of folks forget this!).
2. The job id.
3. Your WestGrid user id (especially if you're writing from a different address).
4. The location of the script, job and data files.
5. And of course the details of the errors or issues.

31 Conclusion
Support, system selection, connecting, Linux, jobs.
Thanks for coming! Questions?
