Introduction to Hybrid Programming

Similar documents

Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ)

RWTH GPU Cluster. Sandra Wienke November Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky

MPI and Hybrid Programming Models. William Gropp

Message Passing with MPI

The RWTH Compute Cluster Environment

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Session 2: MUST. Correctness Checking

Case Study on Productivity and Performance of GPGPUs

Lecture 6: Introduction to MPI programming. Lecture 6: Introduction to MPI programming p. 1

WinBioinfTools: Bioinformatics Tools for Windows Cluster. Done By: Hisham Adel Mohamed

Parallelization: Binary Tree Traversal

Parallel Programming with MPI on the Odyssey Cluster

Debugging with TotalView

OpenACC Basics Directive-based GPGPU Programming

Programming the Intel Xeon Phi Coprocessor

HPCC - Hrothgar Getting Started User Guide MPI Programming

High performance computing systems. Lab 1

MPI Runtime Error Detection with MUST For the 13th VI-HPS Tuning Workshop

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Parallel Programming Survey

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

OpenMP Programming on ScaleMP

To connect to the cluster, simply use a SSH or SFTP client to connect to:

Retargeting PLAPACK to Clusters with Hardware Accelerators

Experiences with HPC on Windows

Using the Intel Inspector XE

Load Balancing. computing a file with grayscales. granularity considerations static work load assignment with MPI

Parallel Computing. Parallel shared memory computing with OpenMP

A quick tutorial on Intel's Xeon Phi Coprocessor

MPI Application Development Using the Analysis Tool MARMOT

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005

Using the Windows Cluster

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters

GPU Tools Sandra Wienke

Multi-Threading Performance on Commodity Multi-Core Processors

Debugging in Heterogeneous Environments with TotalView. ECMWF HPC Workshop 30 th October 2014

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

RA MPI Compilers Debuggers Profiling. March 25, 2009

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist

HPC Wales Skills Academy Course Catalogue 2015

OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware

MPI-Checker Static Analysis for MPI

CUDA Debugging. GPGPU Workshop, August Sandra Wienke Center for Computing and Communication, RWTH Aachen University

Performance Characteristics of Large SMP Machines

Parallel Computing. Shared memory parallel programming with OpenMP

A Case Study - Scaling Legacy Code on Next Generation Platforms

Part I Courses Syllabus

Evaluation of CUDA Fortran for the CFD code Strukti

Big Data Visualization on the MIC

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

OpenACC Programming on GPUs

Running applications on the Cray XC30 4/12/2015

COMP/CS 605: Introduction to Parallel Computing Lecture 21: Shared Memory Programming with OpenMP

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber

Lightning Introduction to MPI Programming

McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC dholmes@epcc.ed.ac.uk

Hybrid Programming with MPI and OpenMP

An HPC Application Deployment Model on Azure Cloud for SMEs

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Introduction. Reading. Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF. CMSC 818Z - S99 (lect 5)

System Software for High Performance Computing. Joe Izraelevitz

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Overview of HPC Resources at Vanderbilt

Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises

Allinea Performance Reports User Guide. Version 6.0.6

Debugging and Profiling Lab. Carlos Rosales, Kent Milfeld and Yaakoub Y. El Kharma

Agenda. Using HPC Wales 2

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

HPC Software Requirements to Support an HPC Cluster Supercomputer

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

GPI Global Address Space Programming Interface

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Alan Kaminsky

High Productivity Computing With Windows

Using the Intel Xeon Phi (with the Stampede Supercomputer) ISC 13 Tutorial

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Parallel Algorithm Engineering

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

MUSIC Multi-Simulation Coordinator. Users Manual. Örjan Ekeberg and Mikael Djurfeldt

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Using WestGrid. Patrick Mann, Manager, Technical Operations Jan.15, 2014

CUDA programming on NVIDIA GPUs

BIG CPU, BIG DATA. Solving the World s Toughest Computational Problems with Parallel Computing. Alan Kaminsky

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

1 Bull, 2011 Bull Extreme Computing

HP-MPI User s Guide. 11th Edition. Manufacturing Part Number : B September 2007

Hodor and Bran - Job Scheduling and PBS Scripts

Overview of HPC systems and software available within

Introduction to Linux and Cluster Basics for the CCR General Computing Cluster

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner

Intel Many Integrated Core Architecture: An Overview and Programming Models

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

Transcription:

Introduction to Hybrid Programming Hristo Iliev Rechen- und Kommunikationszentrum aixcelerate 2012 / Aachen 10. Oktober 2012 Version: 1.1 Rechen- und Kommunikationszentrum (RZ)

Motivation for hybrid programming Clusters of small supercomputers Increasingly complex nodes many cores, GPUs, Intel MIC, etc. Green computing will make things even more complex Network 2

MPI on a single node MPI is sufficiently abstract to run on a single node It doesn t care where its processes are located Message passing implemented using shared memory and IPC The MPI library takes care Usually faster than sending messages over the network but it is far from optimal MPI processes are implemented as separate OS processes Portable data sharing is hard to achieve Lots of control / problem data has to be duplicated Reusing cached data is practically impossible 3

Hybrid programming (Hierarchical) mixing of different programming paradigms OpenCL GPGPU OpenCL GPGPU OpenMP Shared memory OpenMP Shared memory 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 5 6 7 8 9 MPI 4

Best of all worlds OpenMP / pthreads Cache reusage, data sharing Simple programming model (not so simple with POSIX threads) Threaded libraries (e.g. MKL) OpenCL MPI Massively parallel GPGPU accelerators and co-processors Fast transparent networking Scalability Downsides Code could be hard to maintain 5 Harder to debug and tune

Higher levels MPI threads interaction Most MPI implementation are threaded (e.g. for non-blocking requests) but not thread-safe Four levels of threading support Level identifier MPI_THREAD_SINGLE MPI_THREAD_FUNNELED Description Only one thread may execute Only the main thread may make MPI calls MPI_THREAD_SERIALIZED Only one thread may make MPI calls at a time MPI_THREAD_MULTIPLE Multiple threads may call MPI at once with no restrictions All implementations support MPI_THREAD_SINGLE, but some does not support MPI_THREAD_MULTIPLE 6

MPI and threads initialisation Initialise MPI with thread support C/C++: int MPI_Init_thread (int *argc, char ***argv, int required, int *provided) Fortran: SUBROUTINE MPI_INIT_THREAD (required, provided, ierr) required tells MPI what level of thread support is required provided tells us what level of threading MPI actually supports could be less than the required level MPI_INIT same as a call with required = MPI_THREAD_SINGLE The thread that has called MPI_INIT_THREAD becomes the main thread The level of thread support is fixed by this call and cannot be changed later 7

MPI and threads initialisation Obtain current level of thread support MPI_Query_thread (provided) provided is filled with the current level of thread support If MPI_INIT_THREAD was called, provided will equal the value returned by the MPI initialisation call If MPI_INIT was called, provided will equal an implementation specific value Find out if current thread is the main one MPI_Is_thread_main (flag) 8

MPI and OpenMP The most widely adopted hybrid programming model in engineering and scientific codes no MPI MPI no OpenMP serial code MPI code OpenMP OpenMP code hybrid code 9

MPI_THREAD_FUNNELED Only make MPI calls from outside the OpenMP parallel regions double data[], localdata[]; MPI_Scatter(data, count, MPI_DOUBLE, localdata, count, MPI_DOUBLE, 0, MPI_COMM_WORLD); #pragma omp parallel for for (int i = 0; i < count; i++) localdata[i] = exp(localdata[i]); No MPI calls from inside the parallel region MPI_Gather(localData, count, MPI_FLOAT, data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD); 10

MPI_THREAD_FUNNELED Only make MPI calls from outside the OpenMP parallel regions or inside an OpenMP MASTER construct float data[]; #pragma omp parallel { #pragma omp master { MPI_Bcast(data, count, MPI_FLOAT, 0, MPI_COMM_WORLD); } #pragma omp barrier } #pragma omp for for (int i = 0; i < count; i++) data[i] = exp(data[i]); 11

MPI_THREAD_SERIALIZED MPI calls should be serialised within (named) OpenMP CRITICAL constructs or using some other synchronisation primitive #pragma omp parallel { MPI_Status status; #pragma omp critical(mpicomm) MPI_Recv(&data, 1, MPI_FLOAT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status); result = exp(data); } #pragma omp critical(mpicomm) MPI_Send(&result, 1, MPI_FLOAT, status.mpi_source, 0, MPI_COMM_WORLD); 12

MPI_THREAD_MULTIPLE No need to synchronise between MPI calls Watch out for possible data races and MPI race conditions #pragma omp parallel { int count; float *buf; MPI_Status status; MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status); MPI_Get_count(&status, MPI_FLOAT, &count); } buf = new float[count]; MPI_Recv(buf, count, MPI_FLOAT, status.mpi_source, 0, MPI_COMM_WORLD); MPI_Probe could match the same message in multiple threads! 13

Addressing in Hybrid Codes MPI and OpenMP have orthogonal address spaces: MPI rank [0, #procs-1] MPI_Comm_rank() OpenMP thread ID [0, #threads-1] omp_get_thread_num() Hybrid rank:thread [0, #procs-1] [0, #threads-1] MPI does not provide a direct way to address threads in a process Field Value source Remark source rank Sender process rank Fixed, taken from the rank of the sending process destination rank MPI_Send argument Only one rank per process tag MPI_Send argument Free to choose communicator MPI_Send argument Many communicators can coexist 14

Addressing in Hybrid Codes Tags for threads addressing MPI standard requires that tags provide at least 15 bits of user defined information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit) Use tag value to address destination thread only (+) straightforward to implement threads simply supply their ID to MPI_Recv (+) very large number of threads could be addressed (-) not possible to further differentiate messages (-) no information on the sending thread ID 15

Addressing in Hybrid Codes Tags for threads addressing MPI standard requires that tags provide at least 15 bits of user defined information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit) Multiplex destination thread ID with tag value e.g. 7 bits for tag value (0..127), 8 bits for thread ID (up to 256 threads) (+) possible to differentiate messages (-) receive operation must be split into MPI_Probe / MPI_Recv pair if we would like to emulate MPI_TAG_ANY (-) no information on the sending thread ID 16

Addressing in Hybrid Codes Tags for threads addressing MPI standard requires that tags provide at least 15 bits of user defined information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit) Multiplex sender and destination thread IDs with tag value For communication between two BCS nodes @ RWTH 14 bits are required for thread IDs (up to 128 threads 7 bits) Suitable for MPI implementations that provide larger tag space both Open MPI and Intel MPI have MPI_TAG_UB of 2 31-1 (+) sender information retained together with possible message differentiation (-) code might not be portable to some MPI implementations 17

Addressing in Hybrid Codes Sample implementation sender thread #define MAKE_TAG (tag,stid,dtid) \ (((tag) << 16) ((stid) << 8) (dtid)) // Send data to drank:dtid with tag MPI_Send(data, count, MPI_FLOAT, drank, MAKE_TAG(tag, omp_get_thread_num(), dtid), MPI_COMM_WORLD); 18

Addressing in Hybrid Codes Sample implementation receiver thread #define MAKE_TAG (tag,stid,dtid) \ (((tag) << 16) ((stid) << 8) (dtid)) // Receive data from srank:stid with a specific tag MPI_Recv(data, count, MPI_FLOAT, srank, MAKE_TAG(tag, stid, omp_get_thread_num()), MPI_COMM_WORLD, MPI_STATUS_IGNORE); 19

Addressing in Hybrid Codes Sample implementation receiver thread #define GET_TAG(val) \ ((val) >> 16) #define GET_SRC_TID(val) \ (((val) >> 8) & 0xff) #define GET_DST_TID(val) \ ((val) & 0xff) // Receive data from srank:stid with any tag MPI_Probe(srank, MPI_ANY_TAG, MPI_COMM_WORLD, &status); if (GET_SRC_TID(status.MPI_TAG) == stid && GET_DST_TID(status.MPI_TAG) == omp_get_thread_num()) { MPI_Recv(data, count, MPI_FLOAT, srank, status.mpi_tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } 20

Multiple Communicators In some cases it makes more sense to use multiple communicators comms[0] comms[1] comms[2] comms[3] 21

Multiple Communicators Sample implementation MPI_Comm comms[nthreads], tcomm; #pragma omp parallel private(tcomm) num_threads(nthreads) { MPI_Comm_dup(MPI_COMM_WORLD, &comms[omp_get_thread_num()]); tcomm = comms[omp_get_thread_num()];... // Sender side MPI_Send(data, count, MPI_FLOAT, omp_get_thread_num(), drank, comms[dtid]);... // Receiver side MPI_Recv(data, count, MPI_FLOAT, stid, srank, tcomm, &status);... } 22

MPI thread support @ BULL cluster Thread support in various MPI libraries requested Open MPI Open MPI mt Intel MPI w/o -mt_mpi Open MPI Use module versions ending with mt (e.g. openmpi/1.6.1mt *InfiniBand BTL disables itself when thread level is MPI_THREAD_MULTIPLE! Intel MPI Enabled when compiling with -openmp, -parallel, or -mt_mpi Intel MPI w/ -mt_mpi SINGLE SINGLE SINGLE SINGLE SINGLE FUNNELED SINGLE FUNNELED SINGLE FUNNELED SERIALIZED SINGLE SERIALIZED SINGLE SERIALIZED MULTIPLE SINGLE MULTIPLE* SINGLE MULTIPLE 23

Interoperation with CUDA (experimental!) Modern MPI libraries provide some degree of CUDA interoperation Device pointers as buffer arguments in many MPI calls Direct device to device RDMA transfers over InfiniBand Implemented by Open MPI and MVAPICH2 Experimental CUDA build of Open MPI 1.7 in the BETA modules category gpu-cluster$ module load BETA gpu-cluster$ module switch openmpi openmpi/1.7cuda You would need first to obtain access to the GPU cluster (ask RZ ServiceDesk) 24

Running hybrid code with LSF Sample hybrid job script (Open MPI) #!/usr/bin/env zsh # 16 MPI procs x 6 threads = 96 cores #BSUB -x #BSUB -a openmpi #BSUB -n 16 # 12 cores/node / 6 threads = 2 processes per node #BSUB -R span[ptile=2] module switch openmpi openmpi/1.6.1mt # 6 threads per process export OMP_NUM_THREADS=6 # Pass OMP_NUM_THREADS on to all MPI processes $MPIEXEC $FLAGS_MPI_BATCH x OMP_NUM_THREADS \ program.exe <args> For the most up to date information refer to the HPC Primer 25

Running hybrid code with LSF Sample hybrid job script (Intel MPI) #!/usr/bin/env zsh # 16 MPI procs x 6 threads = 96 cores #BSUB -x #BSUB -a intelmpi #BSUB -n 16 # 12 cores/node / 6 threads = 2 processes per node #BSUB -R span[ptile=2] module switch openmpi intelmpi # 6 threads per process # Pass OMP_NUM_THREADS on to all MPI processes $MPIEXEC $FLAGS_MPI_BATCH genv OMP_NUM_THREADS 6 \ program.exe <args> For the most up to date information refer to the HPC Primer 26

Q & A session 27