Parallel Computing. Parallel shared memory computing with OpenMP




Thorsten Grahs, 14.07.2014

Table of contents

- Introduction
- Directives
- Scope of data
- Synchronization
- OpenMP vs. MPI
- OpenMP & MPI

OpenMP

- Easier parallelisation with threads; MP stands for Multi Processing
- Standard for explicit shared-memory parallelisation
- Extension to existing programming languages (C/C++/Fortran)
- Incremental parallelisation (parallelisation of an existing serial program)
- Homepage: http://www.openmp.org

OpenMP Idea

- Parallelising a program by special instructions (directives)
- Model: fork-join
- Focus: parallelisation of loops

OpenMP fork/join work-sharing constructs

- for: distributes the iterations of a loop over the team of threads (data parallelism)
- sections: divides the work into sections/work packages, each of which is executed by one thread (functional parallelism)
- single: serial execution of a part of the program by a single thread

OpenMP Goals

- Standardization: establish a standard between the various competing shared-memory platforms
- Lean and mean: create a simple and limited instruction set for programming shared-memory computers
- Ease of use: enable incremental parallelisation of serial programs (unlike the all-or-nothing approach of MPI)
- Portability: support for all common programming languages; open forum for users and developers

OpenMP Pros/Cons

Pros
- Simple parallelisation
- Higher abstraction than raw threads
- The sequential version can still be used
- Standard for shared memory

Cons
- Only shared memory
- Limited use (mainly loop-type parallelism)

OpenMP Architecture Review Board (from the OpenMP homepage)

- The OpenMP Architecture Review Board (the OpenMP ARB, or just the ARB) is the non-profit corporation that owns the OpenMP brand, oversees the OpenMP specification and produces and approves new versions of the specification.
- Several companies are involved: a vendor-independent standard for parallel programming
- Permanent members of the ARB: AMD, Cray, Fujitsu, HP, IBM, Intel, Microsoft, NEC, The Portland Group, SGI, Sun
- Auxiliary members of the ARB: ASC/LLNL, compunity, EPCC, NASA, RWTH Aachen

OpenMP History

Development
- 1997/1998: Version 1.0
- 2000/2002: Version 2.0
- 2008: Version 3.0
- 2011: Version 3.1
- 23.07.2013: Version 4.0

Supported compilers
- Commercial: IBM, Portland, Intel
- Open source: GNU gcc (OpenMP 4.0 from gcc 4.9)

A first OpenMP program: hello.c

  #ifdef _OPENMP
  #include <omp.h>
  #endif
  #include <stdio.h>

  int main(void) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < 8; ++i) {
      int id = omp_get_thread_num();
      printf("Hello World from thread %d\n", id);
      if (id == 0)
        printf("There are %d threads\n", omp_get_num_threads());
    }
    return 0;
  }

Proceeding

- When the program is started, a single process is started on the CPU. This process corresponds to the master thread.
- The master thread can now create and manage additional threads.
- The management of threads (creation, managing and termination) is done by OpenMP without user interaction.
- #pragma omp parallel for instructs that the following for loop is distributed over the available threads.
- omp_get_thread_num() returns the current thread number.
- omp_get_num_threads() returns the total number of threads.

Compile and Run

Compiling
  gcc -fopenmp hello.c

Output
  th@riemann:~$ ./a.out
  Hello World from thread 4
  Hello World from thread 6
  Hello World from thread 3
  Hello World from thread 1
  Hello World from thread 5
  Hello World from thread 2
  Hello World from thread 7
  Hello World from thread 0
  There are 24 threads

Note that the loop has only 8 iterations, while the team consists of 24 threads: the default team size typically corresponds to the number of available cores and can be set via the environment variable OMP_NUM_THREADS.

Compiler directives

- OpenMP is defined mainly via compiler directives.
- Directive format in C/C++:
    #pragma omp directive-name [clauses ...]
- Fortran compilers not enabled for OpenMP process these directives as comments.
- C compilers not enabled for OpenMP ignore unknown directives:
    $ gcc -Wall hello.c   # (gcc version < 4.2)
    hello.c: In function 'main':
    hello.c:12: warning: ignoring #pragma omp parallel
- Thus the program can be compiled by any compiler, even one that is not OpenMP-enabled.

Compiler directives II

Conditional compilation
- In C/C++ the macro _OPENMP is defined:
    #ifdef _OPENMP
    /* OpenMP-specific code, e.g. */
    nummer = omp_get_thread_num();
    #endif

omp parallel
- Creates additional threads, i.e. the work is executed by all threads.
- The original thread (master thread) has thread ID 0.
    #pragma omp parallel [clauses]
    /* structured block (no gotos ...) */

Work-sharing between threads I

Loops
- The work is divided among the threads, e.g. for two threads:
    Thread 1: loop elements 0, ..., (N/2) - 1
    Thread 2: loop elements (N/2), ..., N - 1

    #pragma omp parallel [clauses ...]
    #pragma omp for [clauses ...]
    for (i = 0; i < N; i++)
      a[i] = i * i;

- This can be summarized (omp parallel for):
    #pragma omp parallel for [clauses ...]
    for (i = 0; i < N; i++)
      a[i] = i * i;

Work-sharing between threads II

sections
- The work is distributed into sections; each thread processes one section:

    #pragma omp parallel
    #pragma omp sections
    {
      #pragma omp section [clauses ...]
      [... section A runs parallel to B ...]
      #pragma omp section [clauses ...]
      [... section B runs parallel to A ...]
    }

- Again, the directives can be combined:
    #pragma omp parallel sections [clauses ...]

Data-sharing attribute clauses

Scope of data
- Data clauses specified within OpenMP directives control how variables are handled/shared between threads.
- shared(): the data within a parallel region is shared, i.e. visible and accessible by all threads simultaneously.
- private(): the data within a parallel region is private to each thread, i.e. each thread has a local copy and uses it as a temporary variable. A private variable is not initialized and its value is not maintained for use outside the parallel region.

Scope of data

- Values of private variables are undefined on entry to and exit from a parallel loop/region. The following clauses allow variables to be initialized/finalized:
- default(shared | private | none): specifies the default data-sharing attribute; with none, each variable has to be declared explicitly as shared() or private().
- firstprivate(): like private(), but all copies are initialized with the value the variable had before the parallel loop/region.
- lastprivate(): the variable keeps the value from the (sequentially) last iteration after leaving the loop/section.

A minimal sketch of firstprivate/lastprivate follows below.
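The following sketch is not from the original slides; the variable names are purely illustrative. It shows how firstprivate initializes each thread's private copy and how lastprivate carries the value of the sequentially last iteration out of the loop:

  #include <stdio.h>

  int main(void) {
    int offset = 10;   /* firstprivate: each thread's copy starts at 10 */
    int last = -1;     /* lastprivate: receives the value of the last iteration */
    int i;

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 8; i++) {
      last = i + offset;   /* each thread works on its own copies */
    }

    /* after the loop, last holds the value from iteration i = 7, i.e. 17 */
    printf("last = %d\n", last);
    return 0;
  }

Compiled without -fopenmp the pragma is ignored and the loop runs serially, producing the same final value.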

Example private: initialization

  #include <stdio.h>
  #include <omp.h>

  int main(int argc, char* argv[])
  {
    int t = 2;
    int result = 0;
    int A[100], i = 0, j = 0;
    omp_set_num_threads(t);   // explicitly set 2 threads
    for (i = 0; i < 100; i++) {
      A[i] = i;
      result += A[i];
    }
    printf("Array-Sum BEFORE calculation: %d\n", result);

Example private: parallel section

    i = 0;
    int T[2];
    T[0] = 0;
    T[1] = 0;
    #pragma omp parallel
    {
      #pragma omp for
      for (i = 0; i < 100; i++) {
        for (j = 0; j < 10; j++) {
          A[i] = A[i] * 2;
          T[omp_get_thread_num()]++;
        }
      }
    }

Example private: output

    i = 0; result = 0;
    for (i = 0; i < 100; i++)
      result += A[i];
    printf("Array-Sum AFTER calculation: %d\n", result);
    printf("Thread 1: %d calculations\n", T[0]);
    printf("Thread 2: %d calculations\n", T[1]);
    return 0;
  }

Example: calculation without private(j)

Without private declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 90425960
  Thread 1: 450 calculations
  Thread 2: 485 calculations

- j is shared by default, i.e. the variable j is shared by both threads.
- This is the reason for the wrong result.

Example: modifying the parallel section

Parallel section with private(j):
  #pragma omp parallel
  {
    #pragma omp for private(j)
    for (i = 0; i < 100; i++) {
      for (j = 0; j < 10; j++) {
        A[i] = A[i] * 2;
        T[omp_get_thread_num()]++;
      }
    }
  }

Example: calculation with private(j)

With private declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations

- j is now managed individually for each thread, i.e. privately declared.
- Each thread has its own copy, so the loop counters no longer interfere with each other.

Race conditions

Critical situation (race condition)
- A race condition is a constellation in which the result of an operation depends on the relative timing of individual operations.

Handling in OpenMP: critical section
  #pragma omp critical [(name)]
- Used to resolve a race condition.
- Lets only one thread of the team execute the section at a time.

Example critical: parallel section

  #pragma omp parallel
  {
    #pragma omp for private(j)
    for (i = 0; i < 100; i++) {
      for (j = 0; j < 10; j++) {
        A[i] = A[i] * 2;
        T[omp_get_thread_num()]++;
        NumOfIters++;   /* added */
      }
    }
  }

Example critical

Without critical declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 693

- NumOfIters gives a wrong result: the threads hinder each other while incrementing.
- A private declaration is no solution, since the increments of all threads should be counted.

Example critical

Parallel section with critical declaration:
  #pragma omp parallel
  {
    #pragma omp for private(j)
    for (i = 0; i < 100; i++) {
      for (j = 0; j < 10; j++) {
        A[i] = A[i] * 2;
        T[omp_get_thread_num()]++;
        #pragma omp critical   /* added; in C/C++ it applies to the following statement */
        NumOfIters++;
      }
    }
  }

Example critical

With critical declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 1000

- The increment of NumOfIters is now executed serially.
- The threads no longer hinder each other while incrementing.
- NumOfIters is increased by each thread one after another.

Reduction

Reduction of data
- Critical sections can become a bottleneck of a calculation.
- In our example there is another way to count the number of iterations safely: the reduction.

reduction
- The reduction clause identifies specific, commonly used variables in which multiple threads can accumulate values.
- It collects contributions from the different threads, like a reduction in MPI.

Example reduction

Parallel section with reduction():
  #pragma omp parallel
  {
    #pragma omp for private(j) reduction(+:NumOfIters)
    for (i = 0; i < 100; i++) {
      for (j = 0; j < 10; j++) {
        A[i] = A[i] * 2;
        T[omp_get_thread_num()]++;
        NumOfIters++;
      }
    }
  }

Example reduction

With reduction():
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 1000

- The reduction is accomplished by the reduction clause on the loop parallelisation.
- It requires the arithmetic operation and the reduction variable, separated by a colon: reduction(+:NumOfIters)

Conditional parallelisation

if clause
- It may be desirable to parallelise loops only when the effort is well justified, e.g. to ensure that the run time with multiple threads is actually lower than with serial execution.

Properties
- Allows deciding at runtime whether a loop is executed in parallel (fork-join) or serially.

Example if clause

Parallel section with if:
  #pragma omp parallel if(N > 500)   /* the if clause belongs to the parallel construct */
  {
    #pragma omp for private(j) reduction(+:NumOfIters)
    for (i = 0; i < N; i++) {
      for (j = 0; j < 10; j++) {
        A[i] = A[i] * 2;
        T[omp_get_thread_num()]++;
        NumOfIters++;
      }
    }
  }

Synchronization

- OpenMP provides several ways to coordinate thread execution; an important component is the synchronization of threads.
- One application area we have already met: the race condition in the critical section and its handling via reduction. With critical we had implicit synchronization, so that all threads execute the critical section sequentially.
- The same behaviour is also found in atomic synchronisation.

barrier-synchronisation

barrier
- Threads wait until all threads have reached a common point.

  #pragma omp parallel
  {
    #pragma omp for nowait
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < N; i++) d[i] = a[i] + b[i];
  }

atomic-synchronisation

atomic
- Similar to critical, but only permitted for specific operations, e.g. x++, ++x, x--, --x.

  #pragma omp atomic
  NumOfIters++;

master/single-synchronisation

Only one thread executes a region (e.g. I/O).

master
- The master thread executes the code section:
  #pragma omp master
  {
    [ code which should be executed once (by the master thread) ]
  }

single
- Any one thread executes the code section:
  #pragma omp single
  {
    [ code which should be executed once (not necessarily by the master thread) ]
  }

A small runnable sketch follows below.
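As a minimal sketch (not part of the original slides), the following program contrasts master and single inside one parallel region; the printed messages are purely illustrative:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
    #pragma omp parallel
    {
      #pragma omp master
      printf("executed once by the master (thread %d)\n", omp_get_thread_num());

      #pragma omp single
      printf("executed once by some thread (here thread %d)\n", omp_get_thread_num());
      /* single ends with an implicit barrier; master does not */

      printf("executed by every thread (%d)\n", omp_get_thread_num());
    }
    return 0;
  }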

flush/ordered-synchronisation

flush
- Ensures a consistent view of memory: each thread may keep its own temporary view M_T of the common memory M (similar to a cache), and flush makes it consistent.
  #pragma omp flush (variables)

ordered
- Specifies that the order of execution of the iterations inside the block must be the same as for serial program execution.
- Only permitted within a for loop:
  #pragma omp for ordered
  { ... }

See the sketch below.
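A small sketch (not from the original slides) showing the typical use of ordered: the loop body runs in parallel, but the printf inside the ordered region is executed in serial loop order.

  #include <stdio.h>

  int main(void) {
    int i;
    #pragma omp parallel for ordered
    for (i = 0; i < 8; i++) {
      int result = i * i;   /* may execute in any order, in parallel */
      #pragma omp ordered
      printf("i = %d, i*i = %d\n", i, result);   /* printed in ascending order of i */
    }
    return 0;
  }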

OpenMP vs. MPI I

Advantages of OpenMP
- Code is easier to parallelise and to understand
- Easy access to parallelisation
- Communication is implicit
- Program development is much simpler
- More suitable than MPI for shared memory (when used correctly)
- Run-time scheduling is available
- Both coarse- and fine-grained applications are possible
- Code can still be executed serially

OpenMP vs. MPI II

Disadvantages of OpenMP
- In the form described so far, only applicable to shared-memory systems
- Classical synchronization errors (such as deadlocks and race conditions) can occur
- In coarse-grained parallelism, strategies similar to MPI are often necessary; synchronization has to be implemented explicitly
- Parallelisation is mainly loop parallelisation

OpenMP vs. MPI III

Advantages of MPI
- Properly applied, explicit parallelisation can yield optimal results
- Optimized communication routines are predefined
- Synchronization is implicitly associated with routine calls and therefore less error-prone
- Full control is given to the programmer
- Runs on both shared- and distributed-memory systems

OpenMP vs. MPI IV

Disadvantages of MPI
- Programming with MPI is complex; often large changes in the code are necessary
- Communication between different nodes is relatively slow
- Not very well suited for fine granularity
- Global operations (over all nodes) can be extremely expensive

OpenMP & MPI I

Hello World with MPI & OpenMP
  #include <stdio.h>
  #include "mpi.h"
  #include <omp.h>

  int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

OpenMP & MPI II

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
      np = omp_get_num_threads();
      iam = omp_get_thread_num();
      printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
             iam, np, rank, numprocs, processor_name);
    }
    MPI_Finalize();
  }

OpenMP & MPI III

Compiling
  > mpicc -openmp hello.c -o hello

Executing
  > export OMP_NUM_THREADS=4
  > mpirun -np 2 -machinefile machinefile.morab -x OMP_NUM_THREADS ./hello

OpenMP & MPI IV

Output
  Hello from thread 0 out of 4 from process 0 out of 2 on m006
  Hello from thread 1 out of 4 from process 0 out of 2 on m006
  Hello from thread 2 out of 4 from process 0 out of 2 on m006
  Hello from thread 3 out of 4 from process 0 out of 2 on m006
  Hello from thread 0 out of 4 from process 1 out of 2 on m001
  Hello from thread 3 out of 4 from process 1 out of 2 on m001
  Hello from thread 1 out of 4 from process 1 out of 2 on m001
  Hello from thread 2 out of 4 from process 1 out of 2 on m001

OpenMP & MPI Outlook

Current trend: combined MPI/OpenMP
- OpenMP on the multi-core processors
- MPI for communication between processors/nodes

Advantage
- Optimal utilization of resources

Disadvantages
- Even more complicated to program
- Not necessarily faster