Parallel Computing. Shared memory parallel programming with OpenMP




Parallel Computing: Shared memory parallel programming with OpenMP
Thorsten Grahs, 27.04.2015

Table of contents
- Introduction
- Directives
- Scope of data
- Synchronization

OpenMP
MP stands for Multi Processing.
- De-facto standard Application Program Interface (API) for explicit shared memory parallelism.
- Extension to existing programming languages (C/C++/Fortran).
- Incremental parallelism (parallelization of an existing serial program).
Approach:
- Workers (threads) do the work in parallel and cooperate through shared memory.
- Communication happens through memory accesses instead of explicit messages.
- Local model: parallelization of the serial code, which allows incremental parallelization.

OpenMP Goals
- Standardization: establish a standard between the various competing shared memory platforms.
- Lean and mean: create a simple and limited instruction set for programming shared memory computers.
- Ease of use: enable incremental parallelization of serial programs (unlike the all-or-nothing approach of MPI).
- Portability: support for all common programming languages; open forum for users and developers.

OpenMP History
Open specifications for Multi Processing, maintained by the OpenMP Architecture Review Board, http://www.openmp.org
Supported compilers:
- Commercial: IBM, Portland, Intel
- Open source: GNU gcc (OpenMP 4.0 support from gcc 4.9)
OpenMP 4.0 was released in July 2013.

OpenMP Execution model
- Thread-based parallelism
- Compiler-directive based
- Explicit parallelism
- Fork-join model (focus: parallelism of loops)

OpenMP fork/join
- for: distributes the iterations of a loop over the team of threads (data parallelism).
- sections: divides the work into sections/work packages, each of which is executed by one thread (functional parallelism).
- single: serial execution of a part of the program by a single thread.
A sketch combining the three constructs follows below.
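
To illustrate how the three work-sharing constructs might sit inside one parallel region, here is a minimal sketch (not from the slides; the array sizes and the work done are made up):

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        double a[N], b[N];

        #pragma omp parallel
        {
            /* for: iterations are distributed over the team */
            #pragma omp for
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;

            /* sections: each section is executed by one thread */
            #pragma omp sections
            {
                #pragma omp section
                { b[0] = a[0]; }
                #pragma omp section
                { b[N - 1] = a[N - 1]; }
            }

            /* single: executed by exactly one (arbitrary) thread */
            #pragma omp single
            printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }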

OpenMP Memory model
- All threads have access to the same, globally shared memory.
- Data in private memory is only accessible by the thread owning this memory; no other thread sees its changes.
- Data transfer happens through shared memory and is completely transparent to the application.

OpenMP Pros/Cons
Pros
- Simple parallelism
- Higher abstraction than threads
- The sequential version can still be used
- Standard for shared memory
Cons
- Shared memory only
- Limited applicability (mainly loop-type parallelism)

OpenMP Main components
Compiler directives and clauses appear as comments and are only acted upon when the appropriate OpenMP compiler flag is specified:
- Parallel construct
- Work-sharing constructs
- Synchronization constructs
- Data attribute clauses
Directive format:
  C/C++:               #pragma omp directive-name [clause [clause] ...]
  Fortran (free form): !$omp directive-name [clause [clause] ...]

Compiling
See http://openmp.org/wp/openmp-compilers/ for the full list of OpenMP-enabled compilers and their flags.

Environment Variables
- OMP_NUM_THREADS n: sets the number of threads
- OMP_STACKSIZE size[B|K|M|G]: stack size for threads
- OMP_DYNAMIC TRUE|FALSE: dynamic thread adjustment
- OMP_SCHEDULE schedule[,chunk]: iteration scheduling scheme
- OMP_PROC_BIND TRUE|FALSE: bind threads to processors
- OMP_NESTED TRUE|FALSE: nested parallelism
- ...
To set them:
  csh/tcsh: setenv OMP_NUM_THREADS 4
  sh/bash:  export OMP_NUM_THREADS=4
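
As a small sketch (not from the slides) of how these settings can be inspected from inside a program, the corresponding runtime query functions might be used like this:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Values picked up from the environment, e.g. OMP_NUM_THREADS,
         * OMP_DYNAMIC and OMP_NESTED */
        printf("max threads:        %d\n", omp_get_max_threads());
        printf("dynamic adjustment: %d\n", omp_get_dynamic());
        printf("nested parallelism: %d\n", omp_get_nested());
        return 0;
    }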

Basic functions
Query/specify a specific feature or setting:
- omp_get_thread_num(): get the thread ID (0 for the master thread)
- omp_get_num_threads(): get the number of threads in the team
- omp_set_num_threads(int n): set the number of threads
Manage fine-grained access (locks):
- omp_init_lock(lock_var): initializes the OpenMP lock variable lock_var of type omp_lock_t
Timing functions:
- omp_get_wtime(): returns elapsed wall-clock time
- omp_get_wtick(): returns the timer precision
Function interface:
- C/C++: #include <omp.h>
- Fortran: use omp_lib (or include 'omp_lib.h')
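
A minimal sketch (not from the slides) combining the lock and timing routines; the counter variable and the timed region are made up:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_lock_t lock;
        omp_init_lock(&lock);          /* initialize the lock variable */

        int count = 0;
        double t0 = omp_get_wtime();   /* wall-clock time before the region */

        #pragma omp parallel
        {
            omp_set_lock(&lock);       /* only one thread at a time past this point */
            count++;
            omp_unset_lock(&lock);
        }

        double t1 = omp_get_wtime();
        printf("%d threads, %.6f s (tick: %.3e s)\n",
               count, t1 - t0, omp_get_wtick());

        omp_destroy_lock(&lock);
        return 0;
    }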

A first example (4 threads)

Hello World (hello.c)

    #ifdef _OPENMP
    #include <omp.h>
    #endif
    #include <stdio.h>

    int main(void) {
        int i;
        #pragma omp parallel for
        for (i = 0; i < 8; ++i) {
            int id = omp_get_thread_num();
            printf("Hello World from thread %d\n", id);
            if (id == 0)
                printf("There are %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

Proceeding
- When the program starts, a single process is started on the CPU; this process corresponds to the master thread.
- The master thread can now create and manage additional threads.
- Thread management (creation, scheduling and termination) is done by OpenMP without user interaction.
- #pragma omp parallel for instructs that the following for loop is distributed over the available threads.
- omp_get_thread_num() returns the current thread number.
- omp_get_num_threads() returns the total number of threads.

Compile and Run
Compiling:
  gcc -fopenmp hello.c
Output:
  th@riemann:~$ ./a.out
  Hello World from thread 4
  Hello World from thread 6
  Hello World from thread 3
  Hello World from thread 1
  Hello World from thread 5
  Hello World from thread 2
  Hello World from thread 7
  Hello World from thread 0
  There are 24 threads

OpenMP pthread translation
A sample OpenMP program together with the Pthreads translation that might be performed by an OpenMP compiler.
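
The program and its translation are not reproduced in the transcript. As a rough hand-written sketch, not the output of any actual compiler, a parallel loop might be lowered to Pthreads roughly like this (the chunking scheme, names and sizes are assumptions):

    #include <pthread.h>
    #include <stdio.h>

    #define N 100
    #define NUM_THREADS 4

    static double a[N];

    /* Outlined body of the parallel region: each thread works on its own
     * contiguous chunk of the iteration space. */
    static void *outlined_region(void *arg) {
        int tid = (int)(long)arg;
        int chunk = (N + NUM_THREADS - 1) / NUM_THREADS;
        int lo = tid * chunk;
        int hi = lo + chunk < N ? lo + chunk : N;
        for (int i = lo; i < hi; i++)
            a[i] = i * i;
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        /* "Fork": create the worker threads */
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, outlined_region, (void *)t);
        /* "Join": corresponds to the implicit barrier at the end of the region */
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        printf("a[N-1] = %.0f\n", a[N - 1]);
        return 0;
    }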

Compiler directives
OpenMP is defined mainly by compiler directives.
Directive format in C/C++:
  #pragma omp construct [clauses ...]
- constructs are functionalities of the language
- clauses are parameters to those functionalities
- construct + clauses = directive
A C compiler that is not OpenMP-enabled ignores unknown directives:
  $ gcc -Wall hello.c        # (gcc version < 4.2)
  hello.c: In function 'main':
  hello.c:12: warning: ignoring #pragma omp parallel
Thus the program can be compiled by any compiler, even one that is not OpenMP-enabled.

Compiler directives II
Conditional compilation in C/C++: the macro _OPENMP is defined when compiling with OpenMP:

    #ifdef _OPENMP
    /* OpenMP-specific code, e.g. */
    number = omp_get_thread_num();
    #endif

omp parallel
Creates additional threads, i.e. the work is executed by all threads.
The original thread (master thread) has thread ID 0.

    #pragma omp parallel [clauses]
    /* structured block (no gotos ...) */

Work-sharing between threads I
Loops
The work is divided among the threads, e.g. for two threads:
  Thread 1: loop elements 0, ..., (N/2)-1
  Thread 2: loop elements N/2, ..., N-1

    #pragma omp parallel [clauses ...]
    #pragma omp for [clauses ...]
    for (i = 0; i < N; i++)
        a[i] = i * i;

The two directives can be combined (omp parallel for):

    #pragma omp parallel for [clauses ...]
    for (i = 0; i < N; i++)
        a[i] = i * i;

Work-sharing between threads II
Sections
The work is distributed into sections; each thread processes one section:

    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
        [... section A runs parallel to B ...]
        #pragma omp section
        [... section B runs parallel to A ...]
    }

Again, the two directives can be combined:

    #pragma omp parallel sections [clauses ...]

Data sharing attribute clauses
Scope of data
Data clauses specified within OpenMP directives control how variables are handled/shared between threads:
- shared(): the data in the parallel region is shared, i.e. visible and accessible to all threads simultaneously.
- private(): the data in the parallel region is private to each thread; each thread has a local copy and uses it as a temporary variable. A private variable is not initialized, and its value is not maintained for use outside the parallel region.

Scope of data
Values of private variables are undefined on entry to and exit from loops. The following keywords allow variables to be initialized/finalized:
- default(shared|private|none): specifies the default; with none, each variable has to be declared explicitly as shared() or private().
- firstprivate(): like private(), but all copies are initialized with the value the variable had before the parallel loop/region.
- lastprivate(): the variable keeps the value of the last loop iteration after leaving the section.
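
A minimal sketch (not from the slides) illustrating firstprivate and lastprivate; the variable names and loop bounds are made up:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int offset = 10;   /* value before the region, copied in via firstprivate */
        int last = -1;     /* receives the value of the sequentially last iteration */
        int i;

        #pragma omp parallel for firstprivate(offset) lastprivate(last)
        for (i = 0; i < 8; i++) {
            /* each thread starts with its own copy of offset == 10 */
            last = i + offset;
        }

        /* after the loop: last holds the value from iteration i == 7, i.e. 17 */
        printf("last = %d\n", last);
        return 0;
    }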

Example private: Initialization

    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char* argv[])
    {
        int t = 2;
        int result = 0;
        int A[100], i = 0, j = 0;
        omp_set_num_threads(t);   // explicitly set 2 threads
        for (i = 0; i < 100; i++) {
            A[i] = i;
            result += A[i];
        }
        printf("Array-sum BEFORE calculation: %d\n", result);

Example private: Parallel section

        i = 0;
        int T[2];
        T[0] = 0;
        T[1] = 0;
        #pragma omp parallel
        {
            #pragma omp for
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                }
            }
        }

Example private: Output/print

        i = 0; result = 0;
        for (i = 0; i < 100; i++)
            result += A[i];
        printf("Array-sum AFTER calculation: %d\n", result);
        printf("Thread 1: %d calculations\n", T[0]);
        printf("Thread 2: %d calculations\n", T[1]);
        return 0;
    }

Example: Calculation without private(j)
Output without private declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 90425960
  Thread 1: 450 calculations
  Thread 2: 485 calculations
j is automatically treated as shared, so the variable j is shared by both threads.
This is the reason for the wrong result.

Example: Modifying the parallel section
Parallel section with private(j):

        #pragma omp parallel
        {
            #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                }
            }
        }

Example: Calculation with private(j)
Output with private declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
j is now managed individually for each thread, i.e. declared private.
Each thread has its own copy of the variable, so the threads no longer interfere with each other's calculation.

Critical section
Used (similarly to the Pthreads examples) to avoid ill-defined runtime behaviour.
In C/C++ the directive applies to the following structured block:
    #pragma omp critical
    ... critical section ...
(In Fortran the region is closed with !$OMP END CRITICAL.)
Can be used to resolve a race condition: only one thread of the team at a time executes the critical section.

Example critical: Parallel section

        #pragma omp parallel
        {
            #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;   /* added */
                }
            }
        }

Example critical
Output without critical declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 693
NumOfIters has the wrong value: the threads interfere with each other while incrementing it.
A private declaration is no solution here, since the iterations of all threads should be counted.

Example critical
Parallel section with critical declaration:

        #pragma omp parallel
        {
            #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    /* added: critical applies to the following statement */
                    #pragma omp critical
                    NumOfIters++;
                }
            }
        }

Example critical
Output with critical declaration:
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 1000
The increment of NumOfIters is now executed serially: the threads no longer interfere with each other, and NumOfIters is increased by one thread after another.

Reduction
Reduction of data
Critical sections can become a bottleneck of a calculation. In our example there is another way to count the number of iterations safely: the reduction.
reduction
- A reduction operator is a binary operation (such as addition or multiplication).
- A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.
- All of the intermediate results of the operation are stored in the same variable: the reduction variable.

Reduction Usage
    #pragma omp for private(j) reduction(op: var)
The reduction clause identifies specific, commonly used variables.
We have to give the reduction operator op and the reduction variable var.
The operator op can be one of: +, *, -, &, |, ^, &&, ||
In the variable var multiple threads can accumulate values; the clause collects the contributions of the different threads, like a reduction in MPI.

Example reduction
Parallel section with reduction():

        #pragma omp parallel
        {
            #pragma omp for private(j) reduction(+: NumOfIters)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;
                }
            }
        }

Example reduction
Output with reduction():
  Array-Sum BEFORE calculation: 4950
  Array-Sum AFTER calculation: 5068800
  Thread 1: 500 calculations
  Thread 2: 500 calculations
  NumOfIters: 1000
The reduction is accomplished by the reduction clause on the loop work-sharing directive.
It requires the arithmetic operation and the reduction variable, separated by a colon: reduction(+: NumOfIters)

Conditional parallelism: if clause
It may be desirable to parallelize loops only when the effort is justified, for instance to ensure that the run time with multiple threads is lower than that of the serial execution.
Properties
- The if clause allows to decide at runtime whether a loop is executed in parallel (fork-join) or serially.

Example if clause
Parallel section with if:

        #pragma omp parallel if(N > 500)
        {
            #pragma omp for private(j) reduction(+: NumOfIters)
            for (i = 0; i < N; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;
                }
            }
        }

Synchronization
In a parallel region the threads proceed asynchronously, but sometimes coordination is necessary.
OpenMP provides several ways to coordinate thread execution; an important component is the synchronization of threads.
One application area we have already met: the race condition resolved with the critical section. There we had implicit synchronization, so that all threads execute the critical section sequentially.
The same behaviour is found in the atomic synchronisation.

barrier construct
At a barrier all threads wait and continue only when all threads have reached the barrier.
The barrier guarantees that ALL the code above has been executed.
There are
- explicit barriers: #pragma omp barrier
- implicit barriers: at the end of the work-sharing constructs (i.e. for/do, sections, single); these can be removed with the nowait clause.

barrier synchronisation
barrier: threads wait until all have reached a common point.

    #pragma omp parallel
    {
        #pragma omp for nowait
        for (i = 0; i < N; i++) a[i] = b[i] + c[i];
        #pragma omp barrier
        #pragma omp for
        for (i = 0; i < N; i++) d[i] = a[i] + b[i];
    }

atomic construct
atomic, a similar construct:
    #pragma omp atomic [clause]
    <statement>
- The atomic construct applies only to statements that update the value of a variable.
- It ensures that no other thread updates the variable between reading and writing.
- It is a special, lightweight form of critical: only the read/write is serialized, and only if two or more threads access the same memory address.

atomic synchronisation
The behaviour is related to critical, but it is only permitted for specific operations, e.g. x++, ++x, x--, --x.

    #pragma omp atomic
    NumOfIters++;

master/single synchronisation
    #pragma omp master
    [code which should be executed once, by the master thread]
Only the master thread executes the region (e.g. I/O).

    #pragma omp single [clause]
    [code which should be executed once]
Only one thread executes the region (e.g. I/O); this is not necessarily the master thread!
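
A minimal sketch (not from the slides) contrasting the two constructs; the printed messages are made up. Note that single ends with an implicit barrier while master does not:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            /* single: exactly one (arbitrary) thread executes this;
             * the other threads wait at the implicit barrier at its end */
            #pragma omp single
            printf("input read by thread %d\n", omp_get_thread_num());

            /* ... parallel work by all threads ... */

            /* master: only thread 0 executes this; no implied barrier */
            #pragma omp master
            printf("result written by the master thread\n");
        }
        return 0;
    }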

Summary
OpenMP is the de-facto standard for shared memory parallelism.
- Easy to use; code can be parallelized quickly.
- Supported by all commonly used compilers.
Drawbacks
- If your problem becomes big enough, you have to switch to distributed memory approaches.
- Not as elaborate control over the execution order as in message passing.

Further reading
OpenMP tutorial
- Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/openmp/
Guide into OpenMP: Easy multithreading programming for C++
- Joel Yliluoma, http://bisqwit.iki.fi/story/howto/openmp/
Introduction to High Performance Computing for Scientists and Engineers
- G. Hager & G. Wellein, Chapman & Hall, 2011
- Ch. 6: Shared memory parallel programming with OpenMP
- Ch. 7: Efficient OpenMP programming