Running applications on the Cray XC30 4/12/2015


Running on compute nodes

By default, users do not log in and run applications on the compute nodes directly. Instead they launch jobs on compute nodes in one of three available modes:

1. Extreme Scaling Mode (ESM) - the default high-performance MPP mode.
2. Pre/Post Processing Mode - a throughput mode designed to allow efficient Multiple Application, Multiple User (MAMU) access for smaller applications.
3. Cluster Compatibility Mode (CCM) - an emulation mode designed primarily to support third-party ISV applications which cannot be recompiled.

Extreme Scaling Mode (ESM)

ESM is a high-performance mode designed to run larger applications at scale. Its important features are:

- Dedicated compute nodes for each user job, with no interference from other users sharing the node. The minimum quantum of compute is one node.
- Nodes run the low-noise, low-overhead CNL OS.
- Inter-node communication uses the native Aries API; the best latency and bandwidth are available using ESM.
- Applications need to be linked with the Cray communication libraries (MPI, SHMEM, etc.) or compiled with Cray language support (UPC, coarrays).
- The appropriate parallel runtime environment is automatically set up between nodes.

ESM is expected to be the default mode for the majority of applications that require at least one node to run.

Requesting resources from a batch system

Because resources are dedicated to individual user jobs in ESM mode, most supercomputers share their nodes between users via a batch system; the ECMWF Cray XC30s use PBS Pro to schedule resources.

Users submit batch job scripts from a login node to the scheduler for execution at some point in the future. Each job requests resources and predicts how long it will run. The scheduler (running on an external server) then chooses which jobs to run and when, allocating the appropriate resources at the start. The batch system then executes a copy of the user's job script on one of the MOM nodes. The scheduler monitors the job throughout its lifetime, reclaiming the resources when the job script finishes or is killed for overrunning.

Each user job script will contain two types of command:

1. Serial commands that are executed by the MOM node, e.g. quick setup and post-processing commands (rm, cd, mkdir, etc.).
2. Parallel launches that run on the allocated compute nodes, started with the aprun command.

Example ECMWF batch script

    #!/bin/ksh
    # Request resources from the batch system
    #PBS -N xthi
    #PBS -l EC_total_tasks=256
    #PBS -l EC_threads_per_task=6
    #PBS -j oe
    #PBS -o ./output
    #PBS -l walltime=00:05:00

    export OMP_NUM_THREADS=$EC_threads_per_task

    # Launch the executable on the compute nodes in parallel
    aprun -n $EC_total_tasks \
          -N $EC_tasks_per_node \
          -d $EC_threads_per_task ./a.out

Lifecycle of an ECMWF batch script

[Diagram: the user runs "qsub run.sh" on an eslogin node; the script is queued by the PBS Queue Manager, its serial commands execute on a PBS MOM node, and the aprun line launches the parallel work on the requested Cray XC compute nodes.]

Example batch job script, run.sh:

    #!/bin/bash
    #PBS -l EC_total_tasks=256,EC_threads_per_task=6
    #PBS -l walltime=1:00:00

    cd $WORKDIR                        # serial, runs on the MOM node
    aprun -n 256 -d 6 simulation.exe   # parallel, runs on the compute nodes
    rm -r $WORKDIR/tmp                 # serial, runs on the MOM node

Submitting jobs to the batch system

Job scripts are submitted to the batch system with qsub:

    qsub script.pbs

Once a job is submitted it is assigned a PBS job ID, e.g. 1842232.sdb.

To view the state of all the currently queued and running jobs:

    qstat

To limit the output to jobs owned by a specific user:

    qstat -u <username>

To remove a job from the queue, or cancel a running job:

    qdel <job id>
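A minimal end-to-end session might therefore look like the sketch below (the job ID is the illustrative one from above):

    $ qsub script.pbs
    1842232.sdb                # PBS prints the new job ID
    $ qstat -u $USER           # check the state of your own jobs
    $ qdel 1842232.sdb         # cancel the job if it is no longer needed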

Requesting resources from PBS

Jobs provide a list of requirements as #PBS comments in the header of the submission script, e.g.

    #PBS -l walltime=12:00:00

These can be overridden or supplemented at submission time by adding options to the qsub command line, e.g.

    qsub -l walltime=11:59:59 run.pbs

Common options are:

    -N <name>                  A name for the job.
    -q <queue>                 Submit the job to a specific queue.
    -o <output file>           A file to write the job's stdout stream into.
    -e <error file>            A file to write the job's stderr stream into.
    -j oe                      Join the stderr stream into the stdout stream as a single file.
    -l walltime=<HH:MM:SS>     Maximum wall time the job will occupy.

Jobs must also describe how many compute nodes they will need.
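Putting several of these options together, a job header might look like the following sketch (the queue name and file name are purely illustrative):

    #!/bin/ksh
    #PBS -N my_job                  # job name shown by qstat
    #PBS -q np                      # hypothetical queue name, site specific
    #PBS -o ./my_job.out            # stdout file
    #PBS -j oe                      # merge stderr into stdout
    #PBS -l walltime=12:00:00       # maximum wall time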

Glossary of terms

To understand how many compute nodes a job needs, we need to understand how parallel jobs are described by Cray.

- PE / Processing Element: a discrete software process with an individual address space. One PE is equivalent to 1 MPI rank, 1 coarray image, 1 UPC thread, or 1 SHMEM PE.
- Thread: a logically separate stream of execution inside a parent PE that shares the same address space.
- CPU: the minimum piece of hardware capable of running a PE. It may share some or all of its hardware resources with other CPUs. Equivalent to a single Intel Hyperthread.
- Compute Unit: the individual unit of processing hardware, often described as a core. It may provide one or more CPUs.
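As a worked illustration of how these terms combine (the numbers are examples only), a hybrid job that places 4 PEs on each node with 6 threads per PE needs 4 x 6 = 24 CPUs per node, which exactly fills a 24-Compute-Unit Ivy Bridge node when one CPU per Compute Unit is used:

    # 4 PEs/node x 6 threads/PE = 24 software threads per node
    # -j1 exposes one CPU per Compute Unit, i.e. 24 CPUs per node
    aprun -n 256 -N 4 -d 6 -j1 ./a.out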

Implementing the parallel programming model on hardware

[Diagram: on the programming-model side, a parallel application consists of Processing Elements, each containing one or more threads; on the hardware side, each PE is a Linux process placed on a node (Node 0, Node 1, ...) and each software thread is bound to one hardware CPU.]

Launching ESM parallel applications

ALPS (Application Level Placement Scheduler) provides aprun, the ALPS application launcher.

- aprun must be used to run applications on the XC compute nodes in ESM mode, either interactively or from a batch job.
- If aprun is not used, the application will run on the MOM node instead (and will most likely fail).
- aprun launches sets of PEs on the compute nodes.
- The aprun man page contains several useful examples.

The four most important parameters to set are:

    -n    Total number of PEs used by the application.
    -N    Number of PEs per compute node.
    -d    Number of threads per PE (more precisely, the stride between two PEs on a node).
    -j    Number of CPUs to use per Compute Unit.
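A quick way to see the difference between the MOM node and the compute nodes (a small sketch; the exact node names will vary) is to run a trivial command with and without aprun from inside a batch job:

    hostname               # executes on the MOM node that runs the job script
    aprun -n 1 hostname    # executes on one of the allocated compute nodes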

Running applications on the Cray XC30: some basic examples

These examples assume XC30 nodes with two 12-core Ivy Bridge processors, so each node has 24 Compute Units (cores) and 48 CPUs (Hyperthreads).

Launching an MPI application on all CPUs of 64 nodes: using 1 CPU per Compute Unit means a maximum of 24 PEs per node, and 64 nodes x 24 ranks/node = 1536 ranks:

    $ aprun -n 1536 -N 24 -j1 ./model-mpi.exe

Launching the same MPI application on 64 nodes but with half as many ranks per node, still using 1 CPU per Compute Unit but limiting it to 12 Compute Units, doubles the available memory for each PE on the node:

    $ aprun -n 768 -N 12 -j1 ./model-mpi.exe

To use all available CPUs on 64 nodes, with 2 CPUs per Compute Unit and therefore 48 PEs per node:

    $ aprun -n 3072 -N 48 -j2 ./model-mpi.exe

Some examples of hybrid invocation

To launch a hybrid MPI/OpenMP application on 64 nodes with 256 total ranks, using 1 CPU per Compute Unit (maximum 24 threads), use 4 PEs per node and 6 threads per PE. The thread count is set by exporting OMP_NUM_THREADS:

    $ export OMP_NUM_THREADS=6
    $ aprun -n 256 -N 4 -d $OMP_NUM_THREADS -j1 ./model-hybrid.exe

To launch the same hybrid application with 2 CPUs per Compute Unit, still with 256 total ranks (maximum 48 threads), use 4 PEs per node and 12 threads per PE:

    $ export OMP_NUM_THREADS=12
    $ aprun -n 256 -N 4 -d $OMP_NUM_THREADS -j2 ./model-hybrid.exe

There is much more detail in the later session on advanced placement and binding.

ECMWF PBS job directives

ECMWF have created a bespoke set of job directives for PBS that can be used to define the job. They provide a direct mapping between aprun options and PBS jobs:

    Total PEs         aprun -n <n>    #PBS -l EC_total_tasks=<n>
    PEs per node      aprun -N <N>    #PBS -l EC_tasks_per_node=<N>
    Threads per PE    aprun -d <d>    #PBS -l EC_threads_per_task=<d>
    CPUs per CU       aprun -j <j>    #PBS -l EC_hyperthreads=<j>
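For example, a header that exercises all four directives, including EC_hyperthreads, might look like the sketch below (the values are illustrative and follow the hybrid example given earlier), with each directive feeding the matching aprun option:

    #PBS -l EC_total_tasks=256
    #PBS -l EC_tasks_per_node=4
    #PBS -l EC_threads_per_task=12
    #PBS -l EC_hyperthreads=2

    export OMP_NUM_THREADS=$EC_threads_per_task
    aprun -n $EC_total_tasks -N $EC_tasks_per_node \
          -d $EC_threads_per_task -j $EC_hyperthreads ./a.out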

A note on the mppwidth, mppnppn, mppdepth and select notation

Casual examination of older Cray documentation or internet searches may suggest the alternatives mppwidth, mppnppn and mppdepth for requesting resources. These methods, while still functional, are Cray-specific extensions to PBS Pro which have been deprecated by Altair.

More recent versions of PBS Pro support the select syntax, which is in use at other Cray sites. Support for select has been discontinued at ECMWF in favour of a customised schema.

Watching a launched job on the Cray XC

- xtnodestat: shows how the XC nodes are allocated and the corresponding aprun commands.
- apstat: shows the status of aprun processes.
      apstat             overview
      apstat -a [apid]   information about all applications, or a specific one
      apstat -n          information about the status of the nodes
- The batch qstat command shows batch jobs.
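A simple way to keep an eye on a running job from the login node (a small sketch; the refresh interval is just an example):

    watch -n 30 'qstat -u $USER'   # refresh the batch-level view every 30 seconds
    apstat -a                      # list the applications currently placed by ALPS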

Example qstat output

    Job id           Name             User       Time Use  S  Queue
    ---------------- ---------------- ---------- --------  -  -----
    123557.sdb       g1ma:t_wconsta   rdx        0         Q  of
    123610.sdb       test.sh          syi        0         Q  of
    123654.sdb       getini_1         emos       00:00:09  R  os
    123656.sdb       getini_1         emos       00:00:00  R  os
    123675.sdb       getini_1         emos       00:00:00  R  os
    123677.sdb       getini_1         emos       00:00:00  R  os
    123680.sdb       getini_1         emos       00:00:09  R  os
    123682.sdb       getini_1         emos       00:00:02  R  os
    123686.sdb       job_nf           co5        0         Q  of
    123694.sdb       job_nf           co5        0         Q  of
    123739.sdb       job_nf           co5        0         Q  nf

Example xtnodestat output (cct)

    Current Allocation Status at Thu Feb 06 15:16:14 2014

         C0-0
      n3 --S---------;
      n2 SSS--S----------
      n1 SSS--S----------
    c0n0 SSS----------
         s0123456789abcdef

    Legend:
       nonexistent node                 S  service node
    ;  free interactive compute node    -  free batch compute node
    A  allocated (idle) compute or ccm node
    ?  suspect compute node             W  waiting or non-running job
    X  down compute node                Y  down or admindown service node
    Z  admindown compute node

    Available compute nodes:  1 interactive,  45 batch

Pre/Post Processing Mode

This is a new mode for the Cray XC30, designed to allow multi-user jobs to share compute nodes. It is more efficient for applications running on less than one node, at the cost of possible interference from other users on the node, and it uses the same fully featured OS as the service nodes.

It covers multiple use cases; applications can be:

- entirely serial
- embarrassingly parallel, e.g. fork/exec, spawn + barrier
- shared memory, using OpenMP or another threading model
- MPI (limited to intra-node MPI only*)

Scheduling and launching are handled by PBS, similar to any normal PBS cluster deployment.

Understanding module targets

The wrappers cc, CC and ftn are cross-compilation environments that by default target the compute nodes:

- The compilers will build binaries explicitly targeting the CPU architecture of the compute nodes.
- They will also link the distributed-memory libraries by default.

Binaries built with the default settings will probably not work on serial nodes or pre/post processing nodes. Users may need to switch the CPU target and/or opt to remove network dependencies, for example when compiling serial applications for use on eslogin or pre/post processing nodes. Targets are changed by adding/removing the appropriate modules.

Cray XC30 target modules

CPU architecture targets:

- craype-ivybridge (default): tell the compiler and system libraries to target Intel Ivy Bridge processors.
- craype-sandybridge: tell the compiler and system libraries to build binaries targeting Intel Sandy Bridge processors.

Network targets:

- craype-network-aries (default): link the network libraries into the binary (ESM mode).
- craype-target-local_host: do not link the network libraries into the binary (serial applications, non-ESM mode).

For MAMU nodes, use mpiexec from the cray-smplauncher module instead of aprun if you want support for MPI.
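As a sketch of how this is used in practice (the module swap below uses the module names listed above; the exact defaults and procedure may differ between sites), a serial pre/post-processing tool could be built by removing the network target before invoking the wrapper:

    module swap craype-network-aries craype-target-local_host   # drop the Aries network libraries
    cc -o preprocess.x preprocess.c                             # the wrapper now produces a serial binary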

Cluster Compatibility Mode (CCM)

CCM creates small TCP/IP-connected clusters of compute nodes. It was designed to support legacy ISV applications which cannot be recompiled to use ESM mode.

- There is a small cost in performance due to the overhead of using TCP/IP rather than native Aries calls.
- Users are also responsible for initialising the application on all the participating nodes, including any daemon processes.
- Each node is still used exclusively by a single user, so CCM is only recommended when recompilation is unavailable.
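On a generic Cray installation a CCM job is typically requested through PBS and the application is started with ccmrun; the resource request shown below is only an assumption for illustration, as the exact directive is site specific:

    #!/bin/bash
    #PBS -l ccm=1                # assumed site-specific way of requesting CCM
    #PBS -l walltime=01:00:00

    module load ccm              # provides ccmrun / ccmlogin
    ccmrun ./isv_application     # start the ISV binary inside the TCP/IP cluster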