Hybrid Programming with MPI and OpenMP




Hybrid Programming with MPI and OpenMP
Ricardo Rocha and Fernando Silva
Computer Science Department
Faculty of Sciences, University of Porto
Parallel Computing 2015/2016

R. Rocha and F. Silva (DCC-FCUP) Programming with MPI and OpenMP, Parallel Computing 15/16, 1 / 14

MPI on Clusters

On a cluster of multiprocessors (or multicore processors), we can execute a parallel program by running one process on each processor (or core). Multiple processes may execute on the same multiprocessor (or multicore processor), but the interactions among those processes are still based on message passing.

MPI and OpenMP on Clusters

A different approach is to design a hybrid program in which only one process executes on each multiprocessor (or multicore processor), and then launch a set of threads, equal to the number of processors (or cores) in each machine, to execute the parallel regions of the program.

MPI and OpenMP on Clusters

As an alternative, we can combine both strategies and tune the division between processes and threads so that it optimizes the use of the available resources without violating possible constraints or requirements of the problem being solved.

Hybrid Programming with MPI and OpenMP

Hybrid programming adapts perfectly to current architectures based on clusters of multiprocessors and/or multicore processors, as it induces less communication among different nodes and increases the performance of each node without increasing memory requirements.

Applications with two levels of parallelism may use MPI processes to exploit coarse-grain parallelism, occasionally exchanging messages to synchronize information and/or share work, and use threads to exploit medium/fine-grain parallelism by resorting to a shared address space.

Applications with constraints or requirements that limit the number of processes that can be used (e.g., Fox's algorithm) may take advantage of OpenMP to exploit the remaining computational resources.

Applications for which load balancing is hard to achieve with processes alone may benefit from OpenMP to balance work, by assigning a different number of threads to each process as a function of its load.

Hybrid Programming with MPI and OpenMP

The simplest and safest way to combine MPI with OpenMP is to never make MPI calls inside OpenMP parallel regions. In that case, there is no problem with the MPI calls, given that only the master thread is active during all communications.

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    // master thread only --> MPI calls here
    #pragma omp parallel
    {
      // team of threads --> no MPI calls here
    }
    // master thread only --> MPI calls here
    MPI_Finalize();
  }

Matrix-Vector Product (mpi-omp-matrixvector.c)

The product of a matrix mat[ROWS,COLS] and a column vector vec[COLS] is a vector prod[ROWS] such that each prod[i] is the scalar product of row i of the matrix by the column vector. If we have P processes and T threads per process, then each process can compute ROWS/P (P_ROWS) elements of the result and each thread can compute P_ROWS/T (T_ROWS) elements of the result.

  // distribute the matrix
  MPI_Scatter(mat, P_ROWS * COLS, MPI_INT, submat, P_ROWS * COLS, MPI_INT,
              ROOT, MPI_COMM_WORLD);
  // calculate the matrix-vector product
  #pragma omp parallel for num_threads(NTHREADS)
  for (i = 0; i < P_ROWS; i++)
    subprod[i] = scalar_product(&submat[i * COLS], vec, COLS);
  // gather the partial result vectors
  MPI_Gather(subprod, P_ROWS, MPI_INT, prod, P_ROWS, MPI_INT,
             ROOT, MPI_COMM_WORLD);

Matrix Multiplication with Fox's Algorithm

Given two matrices A[N][N] and B[N][N] and P processes (P = Q*Q), Fox's algorithm divides both matrices into P submatrices of size (N/Q)*(N/Q) and assigns each submatrix to one of the P available processes. To compute the multiplication of its submatrix, each process only communicates with the processes in the same row and in the same column of its own submatrix.

  for (stage = 0; stage < Q; stage++) {
    // pick a submatrix A in each row of processes and
    // send it to all processes in the same row
    MPI_Bcast();
    // multiply the received submatrix A with the current submatrix B
    #pragma omp parallel for private(i,j,k)
    for (i = 0; i < N/Q; i++)
      for (j = 0; j < N/Q; j++)
        for (k = 0; k < N/Q; k++)
          C[i][j] += A[i][k] * B[k][j];
    MPI_Send();  // send submatrix B to the process above and
    MPI_Recv();  // receive the submatrix B from the process below
  }

Safe MPI

If a program is parallelized so that it makes MPI calls inside OpenMP parallel regions, then multiple threads may make the same MPI calls at the same time. For this to be possible, the MPI implementation must be thread safe.

  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == id1)
      mpi_calls_f1();  // thread id1 makes MPI calls here
    else if (tid == id2)
      mpi_calls_f2();  // thread id2 makes MPI calls here
    else
      do_something();
  }

MPI-1 does not support multithreading. Such support was only considered with MPI-2, with the introduction of the MPI_Init_thread() call.

Initializing MPI with Support for Multithreading

  int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)

MPI_Init_thread() initializes the MPI execution environment (similarly to MPI_Init()) and defines the support level for multithreading:
required is the desired support level
provided is the support level provided by the MPI implementation

The support level for multithreading can be:
MPI_THREAD_SINGLE: only one thread will execute (the same as initializing the environment with MPI_Init())
MPI_THREAD_FUNNELED: only the master thread can make MPI calls
MPI_THREAD_SERIALIZED: all threads can make MPI calls, but only one thread at a time can be in such state
MPI_THREAD_MULTIPLE: all threads can make simultaneous MPI calls without any constraints

MPI_THREAD_FUNNELED

With support level MPI_THREAD_FUNNELED, only the master thread can make MPI calls. One way to ensure this is to protect the MPI calls with the omp master directive. However, the omp master directive does not define any implicit synchronization barrier among the threads in the parallel region (neither at the entrance nor at the exit of the directive), so explicit barriers are needed to protect the MPI call.

  #pragma omp parallel
  {
    #pragma omp barrier  // explicit barrier at entrance
    #pragma omp master   // only the master thread makes the MPI call
    mpi_call();
    #pragma omp barrier  // explicit barrier at exit
  }

MPI_THREAD_SERIALIZED

With support level MPI_THREAD_SERIALIZED, all threads can make MPI calls, but only one thread at a time can be in such state. One way to ensure this is to protect the MPI calls with the omp single directive, which defines code blocks that should be executed by only one thread. However, the omp single directive does not define an implicit synchronization barrier at the entrance of the directive. In order to protect the MPI call, it is necessary to set an explicit omp barrier directive at the entrance of the omp single directive.

  #pragma omp parallel
  {
    #pragma omp barrier  // explicit barrier at entrance
    #pragma omp single   // only one thread makes the MPI call
    mpi_call();          // implicit barrier at exit of single
  }

MPI_THREAD_MULTIPLE

With support level MPI_THREAD_MULTIPLE, all threads can make simultaneous MPI calls without any constraints. Given that the MPI implementation is thread safe, there is no need for any additional synchronization mechanism among the threads in the parallel region to protect the MPI calls.

  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == id1)
      mpi_calls_f1();  // thread id1 makes MPI calls here
    else if (tid == id2)
      mpi_calls_f2();  // thread id2 makes MPI calls here
    else
      do_something();
  }

MPI_THREAD_MULTIPLE

Communication among threads of different processes raises the problem of identifying which thread is involved in the communication (as MPI communications only include arguments to identify the ranks of the processes). A simple way to solve this problem is to use the tag argument to identify the thread involved in the communication.

  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  #pragma omp parallel num_threads(NTHREADS) private(tid)
  {
    tid = omp_get_thread_num();
    if (my_rank == 0)       // process 0 sends NTHREADS messages
      MPI_Send(a, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
    else if (my_rank == 1)  // process 1 receives NTHREADS messages
      MPI_Recv(b, 1, MPI_INT, 0, tid, MPI_COMM_WORLD, &status);
  }

R. Rocha and F. Silva (DCC-FCUP) Programming with MPI and OpenMP, Parallel Computing 15/16, 14 / 14