Using WestGrid
Patrick Mann, Manager, Technical Operations
Jan. 15, 2014


Winter 2014 Seminar Series
- 5 February, Gino DiLabio: "Molecular Modelling Using HPC and Gaussian"
- 26 February, Jonathan Dursi: "Responding to Canada's Research Computing Needs"
- 12 March, Scott Northrup: "Introduction to GPU Computing Using CUDA"
- 26 March, Humaira Kamal and Alan Wagner: "A Signpost on the Road to Exascale"
For more information on these and other seminars, see https://www.westgrid.ca/support/training

User Basics
To use WestGrid systems effectively, you will need to know:
- where to get help and information
- which systems are suited to your project
- how to log on
- basic Linux commands
- how to define and submit batch jobs

Help and Support
- WestGrid website: www.westgrid.ca
  - technical specifications, QuickStart guides, software, ...
  - system status and notices
  - events, colloquia, news, ...
- WestGrid Support: support@westgrid.ca
  - novice to expert: from logon issues to in-depth parallelization
  - no question too big or too small
- Account problems: accounts@westgrid.ca

WestGrid Cluster Schematic
[Figure: the user's desktop connects by SSH over the Internet to the Linux login node(s); the scheduler node(s) and the compute nodes are linked by an internal cluster network; a shared disk system provides /home and /global/scratch, with a /home backup.]

Cluster Compute Nodes
[Figure: each node is a Linux box with several CPUs, each CPU having several cores; the cores in a node share RAM, and nodes are linked by an interconnect (usually InfiniBand).]
- Nodes usually have 2 CPUs, with 6 or 8 cores per CPU.
- Usually 12-24 GB per node (2 GB/core).
- Hundreds of nodes in one cluster.
- InfiniBand interconnects (with varying bandwidth and latency).
- Specialty systems with MUCH more memory per node.
- Specialty systems that look like a single node with lots of cores.
Shared memory: 1 node (multicore). Distributed memory: cores on more than 1 node.
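To see how a particular node compares with these typical figures, the short script below reports the cores and memory visible to your shell and the resulting memory per core. It is a sketch that assumes a Linux node (it reads /proc/meminfo). Run it on a login node or inside a job. Inside a job, nproc may report only the cores allotted to you, so the per-core figure is just a rough indication.

```shell
#!/bin/bash
# Report the cores and memory visible to this shell, and the memory per core.
# nproc counts the cores this process may use (inside a job this can be
# fewer than the node has); /proc/meminfo gives the node's total RAM in kB.

node_cores() {
    nproc
}

# Convert MemTotal from kB to whole MB.
node_mem_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

cores=$(node_cores)
mem_mb=$(node_mem_mb)
echo "Node:        $(uname -n)"
echo "Cores:       $cores"
echo "Memory (MB): $mem_mb"
echo "MB per core: $(( mem_mb / cores ))"
```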

System Selection (1)
Aim: optimally match software requirements and characteristics with systems.
- Fast turnaround (users!)
- Efficient use of resources (systems management!)
[Diagram: Software (packaged or homegrown; parallelizability, scalability, memory, output) matched against System (architecture, size, memory, interconnects, storage, batch policy).]
https://www.westgrid.ca/support/quickstart/new_users#choosing_system

System Selection (2)
- Software? (off-the-shelf, licensed, homegrown)
- Memory requirements?
- Parallelization? Scalability? Shared or distributed memory (or both)?
- Research program characteristics?
  - lots of little jobs (parameter space and optimization)
  - a few really big jobs (simulations)
  - code development...

System Selection (3)
- Small-memory serial, undemanding parallel: Hermes, Bugaboo, Jasper, Orcinus
- Shared memory (OpenMP): Breezy, Hungabee (larger memory)
- Distributed memory (MPI parallel): Bugaboo, Grex, Jasper, Lattice, Nestor, Orcinus, Parallel
  - Bugaboo, Nestor: large associated storage
  - Lattice: small memory (1.5 GB/core)
  - Grex: large memory (4 GB/core)
- Graphics, visualization or GPU acceleration: Parallel
- Gaussian: Grex (licensed)
- Other special software (MATLAB, ...): check the QuickStart and software guides
- Archive and backup: Silo (very large 3.15 PB storage system)
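The table above can be read as a lookup from workload type to candidate systems. The hypothetical suggest_system function below encodes it literally. The workload keywords (serial, openmp, mpi, ...) are our own labels, not WestGrid terminology, and the system lists reflect January 2014. Treat the output only as a starting point for a conversation with an analyst.

```shell
#!/bin/bash
# Hypothetical lookup encoding the January 2014 system-selection table.
# Systems and policies change: confirm with a WestGrid analyst.
suggest_system() {
    case "$1" in
        serial)   echo "Hermes Bugaboo Jasper Orcinus" ;;
        openmp)   echo "Breezy Hungabee" ;;
        mpi)      echo "Bugaboo Grex Jasper Lattice Nestor Orcinus Parallel" ;;
        gpu|viz)  echo "Parallel" ;;
        gaussian) echo "Grex" ;;
        archive)  echo "Silo" ;;
        *)        echo "unknown workload '$1': check the QuickStart and software guides" >&2
                  return 1 ;;
    esac
}

suggest_system mpi
```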

System Selection (4)
- Lots of systems: some special purpose, some general purpose.
- Each has its own software set. There is lots of generic software, but some packages are only on specific machines (see the software pages).
- Users may work on multiple systems.
- Choosing is hard: we strongly recommend talking to an analyst.

Connecting to a Cluster
- The login nodes (like all nodes) run Linux.
- You work in a command-line shell by typing text commands, so you need to log in from a standard terminal.
- We use SSH (as does most of the world).
  - Linux and macOS have built-in clients.
  - Windows: various packages, e.g. PuTTY.
https://www.westgrid.ca/support/quickstart/new_users#connecting
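Typing the full login each time gets tedious. One option is an OpenSSH client configuration entry, so that `ssh bugaboo` expands to the full user and host. The helper below only prints such an entry for you to review. It assumes the `<system>.westgrid.ca` host-name pattern used in the X-Windows example, so check each system's QuickStart page for the real host name. `myuser` is a placeholder.

```shell
#!/bin/bash
# Print an OpenSSH client config entry for a WestGrid system.
# Assumption: login hosts follow the <system>.westgrid.ca pattern.
ssh_config_entry() {
    local system="$1" user="$2"
    cat <<EOF
Host $system
    HostName $system.westgrid.ca
    User $user
    ForwardX11 yes
EOF
}

# Review the output, then append it yourself if it looks right, e.g.:
#   ssh_config_entry bugaboo myuser >> ~/.ssh/config
ssh_config_entry bugaboo myuser
```

`ForwardX11 yes` has the same effect as `ssh -X`. Leave it out if you never run graphical programs.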

Linux
You do need to know the basics of Linux and the command line. There are lots of tutorials and books out there; see the New Users QuickStart guide: https://www.westgrid.ca/support/quickstart/new_users#working

Graphical Applications
- Editors, visualization packages and other graphical interfaces use X Windows, the Linux windowing system.
- X is used by macOS and can be installed on Windows, e.g. Xming: http://sourceforge.net/projects/xming (free).
- On Linux, log in with X forwarding enabled: ssh -X username@system.westgrid.ca
https://www.westgrid.ca/support/quickstart/new_users#setting_up
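A quick way to confirm that X forwarding is working after `ssh -X` is to check the DISPLAY variable, which the SSH server sets for forwarded sessions. Here is a minimal check, assuming a bash login shell. `username` and `system` are placeholders as above.

```shell
#!/bin/bash
# An empty DISPLAY means X forwarding is not active, so graphical
# programs (editors, visualization tools) cannot open windows.
x_forwarding_active() {
    [ -n "${DISPLAY:-}" ]
}

if x_forwarding_active; then
    echo "X forwarding looks active (DISPLAY=$DISPLAY)"
else
    echo "DISPLAY is not set: reconnect with 'ssh -X username@system.westgrid.ca'"
fi
```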

File Transfer
Standard tools based on SSH transport:
- scp: secure copy
- sftp: secure file transfer protocol
- rsync: file synchronization (complicated, but really useful)
Linux and macOS: built in. Windows: WinSCP, FileZilla. There are lots of good graphical front ends out there!
Watch out for line endings in files from Windows: Windows text files end each line with CR+LF, while Linux expects LF only.
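A stray carriage return in a job script can cause errors such as "/bin/bash^M: bad interpreter" when the script is executed. The sketch below detects and removes carriage returns using only standard tools. Many systems also provide dos2unix for the same job. The comments show typical scp and rsync transfers, with `username` and `system` as placeholders.

```shell
#!/bin/bash
# True if the file contains any carriage return characters.
has_crlf() {
    grep -q $'\r' "$1"
}

# Remove every carriage return in place. Rewriting the original file
# (rather than moving a temp file over it) keeps its permissions.
strip_crlf() {
    local tmp
    tmp=$(mktemp) || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Typical transfers (run from your own machine):
#   scp hello.pbs username@system.westgrid.ca:~/
#   rsync -av --progress results/ username@system.westgrid.ca:~/results/
```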

Inter-Site File Transfer
- WestGrid core network: a very fast internal network connecting all sites.
- CANARIE: the national network connecting Compute Canada sites (and all universities and institutions).
- Especially for transfers to and from the Silo backup/archival system.
- Powerful grid tools and Globus Online.
https://www.westgrid.ca/support/file_transfer

Useful Linux Software
Many useful, standard software packages are included on all WestGrid systems:
- programming editors (nedit, emacs, vi, ...)
- compilers (Intel, GNU; Fortran, C++, ...)
- scripting (Python has become a common scientific language)
- parallel programming (OpenMP, Open MPI)
- base scientific libraries (BLAS, LAPACK, ...)
As usual: see the QuickStart and software pages, or ask.
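To illustrate the OpenMP support, here is a sketch of a shared-memory job script that runs all its threads on one node. It assumes Torque's `nodes=1:ppn=N` syntax and the PBS_NODEFILE file, which lists one line per allocated core. Outside the batch system it falls back to nproc. `my_openmp_program` and `my_mpi_program` are hypothetical executables. Set ppn to match the node size of the system you use.

```shell
#!/bin/bash
# Shared-memory (OpenMP) job: one node, all requested cores on it.
#PBS -l nodes=1:ppn=12
#PBS -l walltime=01:00:00

cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Count allocated cores from PBS_NODEFILE; use nproc outside PBS.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    ncores=$(( $(wc -l < "$PBS_NODEFILE") ))
else
    ncores=$(nproc)
fi
export OMP_NUM_THREADS=$ncores
echo "Running with OMP_NUM_THREADS=$OMP_NUM_THREADS"

# ./my_openmp_program
# For a distributed-memory (MPI) job you might instead request e.g.
# "#PBS -l procs=24" and launch with: mpiexec -n "$ncores" ./my_mpi_program
```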

Job Basics
Login nodes are for:
- data management
- editing and compiling code
- quick tests
- job management
Real work is done on the worker (compute) nodes:
- jobs are submitted to the batch system (queued)
- jobs are dispatched as fairly as possible to worker nodes
- not interactive: create a script and submit it

Batch Jobs
A batch job is defined by a Linux shell script with directives that tell the scheduler what resources the job needs: memory, cores and walltime (plus lots of finer details).
- Jobs exceeding these pre-defined resource limits (e.g. the walltime limit) may be terminated!
- Jobs with incompatible requirements (e.g. more cores per node than the nodes have) may be queued but never run.
https://www.westgrid.ca/support/running_jobs
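Here is a sketch of a job script that states its resource needs explicitly (cores, walltime and memory) using Torque `-l` resource directives. The values are placeholders; request what your job actually needs and check each system's limits. The PBS_O_WORKDIR and PBS_JOBID fallbacks let you test-run the same script with plain bash outside the batch system.

```shell
#!/bin/bash
# Placeholder resource requests: adjust to your job.
#PBS -N resource_demo
#PBS -l procs=1
#PBS -l walltime=00:10:00
#PBS -l mem=2gb
#PBS -j oe

# Under PBS, start in the directory the job was submitted from;
# in a local test run, stay in the current directory.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

job_id="${PBS_JOBID:-local-test}"
echo "Job $job_id started on $(uname -n) at $(date)"
```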

Job Management
- Submit a job: qsub <job script>
- Status of jobs: qstat [-f] <job id>
- Delete a job (queued or running): qdel <job id>
- Predicted start time: showstart <job id>
- Check scheduling: showq [--help] -u <user name>
These are Linux command-line utilities; run them as you would any other Linux command. For help:
- man qsub: standard Linux manual page
- qsub --help: short synopsis
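When you script these commands, it helps to capture the job id that qsub prints. The hypothetical helpers below strip the server suffix (15298317.b0 becomes 15298317, the form used with qstat in the session that follows) and wrap qsub. `submit` needs a login node with qsub available; `job_number` runs anywhere.

```shell
#!/bin/bash
# Strip the server suffix: "15298317.b0" -> "15298317".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit with qsub (any qsub arguments pass through) and print the
# job number only. Requires qsub, i.e. a cluster login node.
submit() {
    local full_id
    full_id=$(qsub "$@") || return 1
    job_number "$full_id"
}

# Typical use on a login node:
#   id=$(submit hello.pbs)
#   showstart "$id"
#   qdel "$id"    # if you change your mind
```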

Sample Job Script: hello.pbs

    #!/bin/bash
    # The first line selects the shell, as in any Linux script.
    # Scheduling directive (there are lots!):
    #PBS -l procs=1
    # Join standard output and standard error:
    #PBS -j oe
    date
    echo Hello World.
    echo This job is running on $(/bin/hostname).

Submit the job:

    qsub hello.pbs

https://www.westgrid.ca/support/running_jobs#sample
https://www.westgrid.ca/support/running_jobs#directives

Job Submission

    pjmann@bugaboo ~/PresentationTests$ qsub hello.pbs
    15298317.b0

The response gives the job id: 15298317.

    pjmann@bugaboo ~/PresentationTests$ qstat 15298317
    Job ID       Name         User            Time Use S Queue
    ------------ ------------ --------------- -------- - -----
    15298317.b0  hello.pbs    pjmann          0        Q q1
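The S (state) column is the one to watch. The hypothetical `job_state` function below extracts it from default qstat output like the session above. `wait_for_job` polls until the job leaves the queued, running or exiting states (Torque's Q, R and E). Note that other states, such as held (H), also end the loop. Keep the polling interval generous so the scheduler is not flooded with queries.

```shell
#!/bin/bash
# Read default qstat output on stdin, skip the two header lines and
# print field 5, the state column. Testable without a cluster.
job_state() {
    awk 'NR > 2 { print $5 }'
}

# Poll qstat until the job is no longer queued (Q), running (R) or
# exiting (E). Requires qstat, i.e. a cluster login node.
wait_for_job() {
    local id="$1" state
    while :; do
        state=$(qstat "$id" 2>/dev/null | job_state)
        case "$state" in
            Q|R|E) sleep 60 ;;
            *)     break ;;
        esac
    done
    echo "job $id: state '${state:-no longer listed}'"
}
```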

Job Results
Once the run completes (check with a few qstat and/or showstart commands):

    pjmann@bugaboo ~/PresentationTests$ ls
    hello.pbs  hello.pbs.e15298317  hello.pbs.o15298317
    pjmann@bugaboo ~/PresentationTests$ cat hello.pbs.o15298317
    Thu Jan  9 12:03:57 PST 2014
    Hello World.
    This job is running on s31
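The output files follow the `<script>.o<job number>` and `<script>.e<job number>` pattern shown above. Here is a hypothetical helper that prints the standard output, flags a non-empty error file and reports when neither file exists yet.

```shell
#!/bin/bash
# Usage: show_job_output <script name> <job number>
# Returns 0 if all is well, 1 if the error file is non-empty,
# 2 if no output files exist yet.
show_job_output() {
    local script="$1" id="$2"
    local out="$script.o$id" err="$script.e$id"
    if [ ! -e "$out" ] && [ ! -e "$err" ]; then
        echo "no output files for job $id yet (still queued or running?)" >&2
        return 2
    fi
    [ -f "$out" ] && cat "$out"
    if [ -s "$err" ]; then
        echo "warning: $err is not empty:" >&2
        cat "$err" >&2
        return 1
    fi
    return 0
}
```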

Starting Out
https://www.westgrid.ca/support/quickstart/new_users
Recommendations:
- Run lots of small example test jobs.
- Get a simple one working, and build up from there. We all know the debugging 80:20 rule (or 90:10, or 99:1): build up iteratively.
- Play with job management (qsub, qstat, showq, ...).
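One way to build up iteratively is to resubmit the same test script at increasing core counts, checking each result before moving on. The hypothetical helper below is a dry run that only prints the qsub commands; run them one at a time. It relies on qsub command-line options taking precedence over the matching #PBS lines in the script. `test.pbs` is a placeholder.

```shell
#!/bin/bash
# Print one qsub command per core count; nothing is submitted.
scaling_plan() {
    local script="$1"; shift
    local n
    for n in "$@"; do
        printf 'qsub -l procs=%s -N %s_p%s %s\n' "$n" "${script%.pbs}" "$n" "$script"
    done
}

scaling_plan test.pbs 1 2 4 8
```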

Debugging
- Job output can show lots of information.
- Have job completion information mailed to you (there is lots there; set it with a #PBS directive).
- Explicitly define the information you need (lots of detailed PBS directives).
- Ask for help...
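Here is a sketch of a job script header that asks for more diagnostic information. `-m abe` requests mail when the job aborts, begins and ends, and `-M` gives the address (a placeholder here). The `job_banner` function records where and how the job ran. `set -x` echoes each command before running it, so the output shows how far the script got.

```shell
#!/bin/bash
#PBS -l procs=1
#PBS -l walltime=00:05:00
#PBS -m abe
#PBS -M your.name@example.com
#PBS -j oe

# Write the job's context at the top of the output file.
job_banner() {
    echo "Job:       ${PBS_JOBID:-not under PBS}"
    echo "Host:      $(uname -n)"
    echo "Directory: ${PBS_O_WORKDIR:-$PWD}"
    echo "Started:   $(date)"
}

job_banner
set -x          # trace each command (written to standard error)
date            # replace with the real work
set +x
```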

Interactive Jobs
Some nodes are reserved for interactive use: larger/longer test jobs and interactive work (< 3 hours). Start an interactive job with: qsub -I
https://www.westgrid.ca/support/running_jobs#interactive
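The < 3 hours limit is easy to check before you request a session. The hypothetical helpers below convert an hh:mm:ss walltime to seconds and print the corresponding `qsub -I` command only when the request is under three hours. The default of one processor is our assumption, not a site requirement.

```shell
#!/bin/bash
# "01:30:00" -> 5400. 10# forces base 10, so "08" is not read as octal.
walltime_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print an interactive request, refusing walltimes of 3 hours or more.
interactive_cmd() {
    local walltime="$1" procs="${2:-1}"
    if [ "$(walltime_seconds "$walltime")" -ge $(( 3 * 3600 )) ]; then
        echo "interactive jobs must be shorter than 3 hours" >&2
        return 1
    fi
    echo "qsub -I -l walltime=$walltime,procs=$procs"
}

interactive_cmd 01:30:00 2
```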

Fair-share Job Scheduling
Job scheduling is a complex and difficult task. Each site schedules its own jobs, using Moab fair-share scheduling.

Fair-share Targets
- System utilization targets are set for projects (groups) and their members.
- Fair-share adjusts job priority according to these targets. Scheduling also depends on resource availability and characteristics.
- Base metric: usage over the last couple of weeks (system dependent).
  - If usage > target: priority is decreased proportionally.
  - If usage < target: priority is increased proportionally.
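To make the proportional rule concrete, here is a simplified model in which the priority adjustment is weight x (target - usage), with target and usage as percentages of the system. This is NOT Moab's actual formula. Recent usage is taken as the plain average of per-period usage figures, and the weight is an arbitrary illustrative constant. Moab itself can also discount older periods with a decay factor.

```shell
#!/bin/bash
# Usage: fairshare_adjustment <target %> <weight> <usage %> [usage % ...]
# Positive result: priority raised (under target). Negative: lowered.
fairshare_adjustment() {
    local target="$1" weight="$2"; shift 2
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk -v t="$target" -v w="$weight" '
        { sum += $1; n++ }
        END { usage = sum / n; printf "%.1f\n", w * (t - usage) }'
}

# Target 10%; last three periods at 12%, 15%, 18% (average 15%):
fairshare_adjustment 10 100 12 15 18
```

Here the group is 5 points over its target, so the model lowers its priority by 500 units. A group averaging 5% against a 20% target would instead gain 1500.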

Resource Allocation
The usage targets are defined by the Resource Allocation process (RAC = Resource Allocation Committee):
- Compute Canada annual process (October).
- Projects (PIs) complete an application.
- Applications are reviewed by technical and scientific panels.
- Decisions in December.
- Targets (allocations) are entered into the systems on Jan. 10.
A default allocation is available for projects that do not have a resource allocation.

Visualization and Software
- Visualization and graphics (including GPUs): https://www.westgrid.ca/support/visualization
- You can install software/packages yourself. But analysts know about optimization, hardware details and system details: ASK!
[Image: Mercury solar wind visualization, Jan Paral, UAlberta]

Asking for Help
Email support@westgrid.ca. It helps the analysts if you include:
1. The name of the system (lots of folks forget this!).
2. The job id.
3. Your WestGrid user id (especially if you're using a different email address).
4. The location of the script/job/data files.
5. And of course, details of the errors or issues.
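Here is a hypothetical helper that drafts these details for an email to support@westgrid.ca. It fills in what it can from the shell (the login node's host name, which usually identifies the system, and your user name) and leaves placeholders for the rest. Check everything before sending.

```shell
#!/bin/bash
# Usage: support_request [job id] [location of script/data]
support_request() {
    local job_id="${1:-<job id>}" location="${2:-$PWD}"
    cat <<EOF
To: support@westgrid.ca
System:      $(uname -n)
Job ID:      $job_id
WestGrid ID: $(id -un)
Location:    $location
Problem:     <describe the error or issue; paste any error messages>
EOF
}

support_request 15298317 ~/PresentationTests
```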

Conclusion
- Support
- System selection
- Connecting
- Linux
- Jobs
www.westgrid.ca
support@westgrid.ca
Thanks for coming! Questions?