Using WestGrid
Patrick Mann, Manager, Technical Operations
Jan. 15, 2014

Winter 2014 Seminar Series

Date         Speaker                        Topic
5 February   Gino DiLabio                   Molecular Modelling Using HPC and Gaussian
26 February  Jonathan Dursi                 Responding to Canada's Research Computing Needs
12 March     Scott Northrup                 Introduction to GPU Computing Using CUDA
26 March     Humaira Kamal and Alan Wagner  A Signpost on the Road to Exascale

For more information on these and other seminars see https://www.westgrid.ca/support/training

User Basics

To use WestGrid systems effectively you will need to know:
- Where to get help and information
- Which systems are suited to your project
- How to log on
- Basic Linux commands
- How to define and submit batch jobs

Help and Support

WestGrid website: www.westgrid.ca
- Technical specifications, QuickStart guides, software, ...
- System status and notices
- Events, colloquia, news, ...

WestGrid Support: support@westgrid.ca
- Novice to expert
- Logon issues to in-depth parallelization
- No question too big or too small

Account problems: accounts@westgrid.ca

WestGrid Cluster Schematic

[Figure: the user desktop connects by SSH over the Internet to the Linux login node(s). The login and scheduler node(s) connect over the internal cluster network to the cluster's compute nodes and to the shared disc system (/home and /global/scratch), with /home backed up.]

Cluster Compute Nodes

[Figure: two nodes (Linux boxes), each with two multi-core CPUs sharing that node's RAM, joined by an interconnect (usually InfiniBand).]

- Nodes usually have 2 CPUs, with 6 or 8 cores per CPU.
- Usually 12-24 GB per node (2 GB/core).
- Hundreds of nodes in one cluster.
- InfiniBand interconnects (with varying bandwidth and latency).
- Specialty systems with MUCH more memory per node.
- Specialty systems that look like a single node with lots of cores.

Shared memory: cores on 1 node (multicore)
Distributed memory: cores on more than 1 node

System Selection 1

Aim: optimally match software requirements and characteristics with systems.
- Fast turnaround (users!)
- Efficient use of resources (systems management!)

Software: packaged, homegrown, parallelizability, scalability, memory, output
System: architecture, size, memory, interconnects, storage, batch policy

https://www.westgrid.ca/support/quickstart/new_users#choosing_system

System Selection 2

- Software? (off-the-shelf, licensed, homegrown)
- Memory requirements?
- Parallelization?
  - Scalability
  - Shared or distributed memory (or both)
- Research program characteristics?
  - Lots of little jobs (parameter space and optimization)
  - A few really big jobs (simulations)
  - Code development ...

System Selection 3

Requirement                                   Systems
Small-memory serial, undemanding parallel     Hermes, Bugaboo, Jasper, Orcinus
Shared memory (OpenMP)                        Breezy, Hungabee (larger memory)
Distributed memory (MPI parallel)             Bugaboo, Grex, Jasper, Lattice, Nestor, Orcinus, Parallel
                                                Bugaboo, Nestor: large associated storage
                                                Lattice: small memory (1.5 GB/core)
                                                Grex: large memory (4 GB/core)
Graphics, visualization or GPU acceleration   Parallel
Gaussian                                      Grex (licensed)
Other special software (MATLAB, ...)          Check the QuickStart and software guides
Archive and backup                            Silo (very large 3.15 PB storage system)

System Selection 4

- Lots of systems, some special purpose, some general purpose.
- Each has its own software set.
  - Lots of generic software, but some packages are only on specific machines (see the software pages).
- Users may work on multiple systems.
- It is hard to choose: we strongly recommend talking to an analyst.

Connecting to the Cluster

- The login nodes (and all other nodes) run Linux.
- You work in a command-line shell, typing text commands.
- So you need to log in via a standard terminal.
- We use SSH (as does most of the world).
  - Linux and MacOS have built-in clients.
  - Windows: various packages, e.g. PuTTY.

https://www.westgrid.ca/support/quickstart/new_users#connecting

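As a minimal sketch of the login step, the snippet below builds the SSH target for a WestGrid system, assuming the username@system.westgrid.ca host pattern used elsewhere in these slides. The user name jsmith and the helper ssh_target are placeholders invented for illustration, and the command is printed rather than executed so the sketch can be tried on any machine.

```shell
#!/bin/bash
# Hypothetical helper: build the SSH target "user@system.westgrid.ca".
# The user name and system name are supplied by the caller.
ssh_target() {
    local user="$1" system="$2"
    printf '%s@%s.westgrid.ca\n' "$user" "$system"
}

# Print the login command instead of running it.
# "jsmith" is a placeholder user id; "bugaboo" is one of the systems named in the slides.
echo "ssh $(ssh_target jsmith bugaboo)"
```

On a Linux or MacOS terminal the printed command can be pasted directly; on Windows the same host name and user id go into the PuTTY session dialog.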
Linux

- You do need to know the basics of Linux and the command line.
- Lots of tutorials and books are out there.
- See the New Users QuickStart guide: https://www.westgrid.ca/support/quickstart/new_users#working

Graphical Applications

Editors, visualization and other graphical interfaces
- X-Windows (X) is the Linux windowing system.
- Linux editors, visualization packages and anything graphical use X.
- X is also used by MacOS, and can be installed on Windows: http://sourceforge.net/projects/xming (free)
- From Linux: ssh -X username@system.westgrid.ca

https://www.westgrid.ca/support/quickstart/new_users#setting_up

File Transfer

Standard tools based on SSH transport:

scp    Secure copy
sftp   Secure file transfer protocol
rsync  File synchronization (complicated, but really useful)

- Linux, MacOS: built-in
- Windows: WinSCP, FileZilla
- Lots of beautiful graphical front-ends out there!
- Annoying issue with line endings in files from Windows

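The line-ending issue can be illustrated with a short sketch: text files written on Windows end each line with a carriage return plus a newline, and the extra carriage return can break shell and job scripts on Linux. The helper name fix_line_endings and the file names are made up for this example, and the scp and rsync lines are shown only as comments with a placeholder user id.

```shell
#!/bin/bash
# Hypothetical helper: copy a text file, dropping every carriage return (CRLF -> LF).
# Note that tr removes ALL carriage returns, which is fine for ordinary text files.
fix_line_endings() {
    local src="$1" dst="$2"
    tr -d '\r' < "$src" > "$dst"
}

# Demonstration on a temporary file with Windows line endings.
tmpdir=$(mktemp -d)
printf 'echo Hello\r\necho World\r\n' > "$tmpdir/win.sh"
fix_line_endings "$tmpdir/win.sh" "$tmpdir/unix.sh"

# Typical transfers once the file is clean (placeholders, not run here):
#   scp unix.sh jsmith@bugaboo.westgrid.ca:
#   rsync -av results/ jsmith@bugaboo.westgrid.ca:results/
```

Many systems also ship a dos2unix utility that does the same job, but tr is available everywhere.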
Inter-Site File Transfer

- WestGrid core network: very fast internal network connecting all sites
- CANARIE national network connecting Compute Canada sites (and all universities and institutions)
- Especially useful for transfers to the Silo backup/archival system
- Powerful Grid tools and Globus Online

https://www.westgrid.ca/support/file_transfer

Useful Linux Software

Many useful, standard software packages are included on all WestGrid systems:
- Programming editors (nedit, emacs, vi, ...)
- Compilers (Intel, GNU, Fortran, C++, ...)
- Scripting (Python has become a common scientific language)
- Parallel programming (OpenMP, Open MPI)
- Base scientific libraries (BLAS, LAPACK, ...)

As usual: see the QuickStart and software pages, or ask.

Job Basics

Login nodes:
- Data management
- Editing and compiling code
- Quick tests
- Job management

Real work is done on the worker (compute) nodes:
- Jobs are submitted to the batch system (queued)
- Jobs are dispatched as fairly as possible to worker nodes
- Not interactive: create a script and submit it

Batch Jobs

- A batch job is defined by a Linux shell script with directives that tell the scheduler what resources the job needs: memory, cores, walltime (and lots of fine detail).
- Jobs exceeding these pre-defined resource limits may be terminated (e.g. the walltime limit)!
- Jobs with incompatible requirements (e.g. cores/node) may be queued, but never run.

https://www.westgrid.ca/support/running_jobs

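A job script that requests all three resource types might look like the sketch below. The values (4 cores, 2 GB per process, one hour) and the job name are purely illustrative assumptions. Valid limits and the preferred directive forms differ from system to system, so check the running_jobs pages before copying them. The fallbacks on the PBS variables are only there so the script also runs outside the batch system.

```shell
#!/bin/bash
# Illustrative resource requests; the numbers are assumptions, not recommendations.
#PBS -N resource_demo
#PBS -l procs=4
#PBS -l pmem=2gb
#PBS -l walltime=01:00:00

# PBS sets these variables inside a job. The fallbacks let the
# script run on a login node or desktop for a quick test.
workdir="${PBS_O_WORKDIR:-$PWD}"
jobid="${PBS_JOBID:-not-in-a-batch-job}"

# Batch jobs usually start in the home directory, so move to
# the directory the job was submitted from.
cd "$workdir" || exit 1
echo "Job $jobid running in $workdir"
```

Asking for only as much walltime and memory as the job needs generally helps it start sooner, but a job that exceeds what it asked for may be terminated.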
Job Management

Submit a job                      qsub <job script>
Status of jobs                    qstat [-f] <job id>
Delete a job (queued or running)  qdel <job id>
Predicted start time              showstart <job id>
Check scheduling                  showq [--help] -u <user name>

These are Linux command-line utilities. Run them like the usual Linux commands:

man qsub     Standard Linux manual page
qsub --help  Short synopsis

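These commands chain together naturally in a script. The sketch below captures the id that qsub prints and passes its numeric part to qstat and showstart. The helper job_number is a name made up for this example, and the scheduler commands are only called when qsub is actually installed, so the snippet is harmless to run elsewhere.

```shell
#!/bin/bash
# Hypothetical helper: qsub prints an id such as "15298317.b0";
# strip everything from the first dot to get the numeric job id.
job_number() {
    printf '%s\n' "${1%%.*}"
}

if command -v qsub >/dev/null 2>&1; then
    # On a cluster: submit, then query status and predicted start time.
    full_id=$(qsub hello.pbs)
    qstat "$(job_number "$full_id")"
    showstart "$(job_number "$full_id")"
else
    # Elsewhere: just show the id handling on a sample value.
    job_number "15298317.b0"   # prints 15298317
fi
```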
Sample Job Script

hello.pbs:

    #!/bin/bash
    # The line above is the standard Linux first line.
    # Scheduling directive (there are lots of these):
    #PBS -l procs=1
    # Join the standard and error outputs:
    #PBS -j oe
    date
    echo Hello World.
    echo This job is running on $(/bin/hostname).

Submit the job:

    qsub hello.pbs

https://www.westgrid.ca/support/running_jobs#sample
https://www.westgrid.ca/support/running_jobs#directives

Job Submission

    pjmann@bugaboo ~/PresentationTests$ qsub hello.pbs
    15298317.b0

The response gives the job id: 15298317

    pjmann@bugaboo ~/PresentationTests$ qstat 15298317
    Job ID       Name         User            Time Use S Queue
    ------------ ------------ --------------- -------- - -----
    15298317.b0  hello.pbs    pjmann                 0 Q q1

Job Results

Wait until the run completes (try a few qstat and/or showstart commands).

    pjmann@bugaboo ~/PresentationTests$ ls
    hello.pbs  hello.pbs.e15298317  hello.pbs.o15298317
    pjmann@bugaboo ~/PresentationTests$ cat hello.pbs.o15298317
    Thu Jan 9 12:03:57 PST 2014
    Hello World.
    This job is running on s31

Starting Out

https://www.westgrid.ca/support/quickstart/new_users

Recommendations:
- Run lots of small example test jobs.
- Get a simple one working, and build up from there.
  - We all know the debugging 80:20 rule (or 90:10, or 99:01).
  - Build up iteratively.
- Play with job management (qsub, qstat, showq, ...).

Debugging

- Job output can show lots of information.
- Mail job completion info (lots there; set with a #PBS directive).
- Explicitly define information requirements (lots of detailed PBS directives).
- Ask for help ...

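One way to make job output more useful for debugging is sketched below: the script asks the scheduler to send mail when the job begins, ends or aborts, and prints some basic context at the top of its output. The address user@example.com is a placeholder, job_report is a name invented for this example, and whether mail is delivered depends on the site configuration.

```shell
#!/bin/bash
#PBS -l procs=1
#PBS -j oe
# Mail on begin (b), end (e) and abort (a); the address is a placeholder.
#PBS -m bea
#PBS -M user@example.com

# Hypothetical helper: print context that helps when a job misbehaves.
# The fallbacks let it run outside the batch system too.
job_report() {
    echo "Job id:     ${PBS_JOBID:-none (not in a batch job)}"
    echo "Host:       $(hostname)"
    echo "Started:    $(date)"
    echo "Submit dir: ${PBS_O_WORKDIR:-$PWD}"
    # Inside a job, PBS_NODEFILE lists the assigned nodes, one line per core.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "Cores per node:"
        sort "$PBS_NODEFILE" | uniq -c
    fi
}

job_report
```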
Interactive Jobs

- Some nodes are reserved for interactive use.
- Larger/longer test jobs and interactive work (< 3 hours).
- Start an interactive job with: qsub -I

https://www.westgrid.ca/support/running_jobs#interactive

Fair-share Job Scheduling

- Job scheduling is a complex and difficult task.
- Each site schedules its own jobs.
- MOAB fair-share scheduling

Fair-share Targets

- System utilization targets are set for projects (groups) and their members.
- Fair-share allocates job priority depending on these targets.
- Priority also depends on resource availability and characteristics.
- Base metric: usage over the last couple of weeks (system dependent)
  - If usage > target: priority is decreased proportionally
  - If usage < target: priority is increased proportionally

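The proportional rule can be pictured with a toy calculation. This is only an illustration of the idea and NOT the scheduler's actual formula: it assumes target and usage are whole-number percentages and treats the priority adjustment as simply target minus usage. The function name fairshare_adjustment is invented for the example.

```shell
#!/bin/bash
# Toy model only (an assumption, not Moab's real calculation):
# adjustment = target - usage, both given as whole-number percentages.
# A positive result raises priority, a negative result lowers it.
fairshare_adjustment() {
    local target="$1" usage="$2"
    echo $(( target - usage ))
}

fairshare_adjustment 10 15   # over target: prints -5
fairshare_adjustment 10 4    # under target: prints 6
```

The real scheduler combines this kind of term with other factors, such as how long a job has been queued, so the numbers here only show the direction of the effect.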
Resource Allocation

The usage targets are defined by the Resource Allocation process (RAC = Resource Allocation Committee):
- Compute Canada annual process (October)
- Projects (PIs) complete an application
- Reviewed by technical and scientific panels
- Decisions in December
- Targets (allocations) entered into systems Jan. 10

A default allocation is available for projects which do not have a Resource Allocation.

Visualization and Software

Visualization and graphics (including GPUs): https://www.westgrid.ca/support/visualization

You can install software and packages yourself. But analysts know about optimization, hardware details and systems details, so ASK!

[Image: Jan Paral, UAlberta, Mercury Solar Wind]

Asking for Help

support@westgrid.ca

It helps the analysts if you include the following information:
1. The name of the system (lots of folks forget this!).
2. The job id.
3. Your WestGrid user id (especially if you're using a different email address).
4. The location of the script, job and data files.
5. And of course details of the errors or issues.

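Most of the items in that list can be collected with a few commands and pasted into the email. The sketch below is one hypothetical way to do it: support_info is a name invented here, the job id is passed in by hand, and the error details still have to be written by you.

```shell
#!/bin/bash
# Hypothetical helper: print the basics an analyst will ask for.
# Usage: support_info <job id>   (run it in the directory holding the job files)
support_info() {
    local jobid="${1:-unknown}"
    echo "System:   $(hostname)"
    echo "Job id:   $jobid"
    echo "User id:  $(id -un)"
    echo "Location: $PWD"
}

# Example with the job id from the earlier slides.
support_info 15298317
```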
Conclusion

- Support
- System selection
- Connecting
- Linux
- Jobs

www.westgrid.ca
support@westgrid.ca

Thanks for coming! Questions?