OpenMP. Date: 20/03/2012

1 OpenMP. Date: 20/03/2012

2 Introduction
OpenMP (Open Multi-Processing) is an API (application programming interface) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran on most processor architectures and operating systems, including Linux, Unix, AIX, Solaris, Mac OS X, and Microsoft Windows.
OpenMP is managed by the non-profit technology consortium OpenMP Architecture Review Board, and is jointly defined by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, NVIDIA, NEC, Microsoft, Texas Instruments, Oracle Corporation, and more.

3 Introduction (2)
The OpenMP API consists of (1) compiler directives, (2) library routines, and (3) environment variables that influence run-time behavior.
It defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer.
An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and MPI (Message Passing Interface), or more transparently through the use of OpenMP extensions for non-shared-memory systems.

4 OpenMP Parallelism
Fork-join parallelism: the master thread spawns a team of threads as needed.
[Figure: an illustration of multithreading in which the master thread forks off a number of threads that execute blocks of code (A, B, C, D) in parallel tasks (I, II, III).]

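5 Syntax format
Compiler directives
C/C++:
    #pragma omp construct [clause [clause] ...]
Fortran (any of these sentinels):
    C$OMP construct [clause [clause] ...]
    !$OMP construct [clause [clause] ...]
    *$OMP construct [clause [clause] ...]
Strong promise: because OpenMP is expressed through directives, a program needs no changes to build with a compiler that does not support OpenMP. Such a compiler ignores the directives, provided the program does not call the OpenMP runtime library.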
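
The portability promise above can be sketched as follows. This is a minimal illustration, not part of the original slides: the helper names `scale` and `built_with_openmp` are hypothetical. It relies on the `_OPENMP` macro, which OpenMP-enabled compilers define, so the same file should build and give the same result with or without OpenMP.

```c
/* Hypothetical helper: multiply every element of a[] by s. */
void scale(int n, double *a, double s)
{
    int i;
    /* A compiler without OpenMP support ignores this pragma
       and simply runs the loop serially. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] *= s;
}

/* Hypothetical helper: reports whether OpenMP was enabled at compile time. */
int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

Building the file once with `gcc -fopenmp` and once without it should change only the value returned by `built_with_openmp`, not the contents of the array.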

6 OpenMP Programming Model
Directive:
    #pragma omp directive [clause list]
The program executes serially until it encounters a parallel directive:
    #pragma omp parallel [clause list]
    /* structured block of code */
The clause list is used to specify conditions:
- Conditional parallelism: if (cond)
- Degree of concurrency: num_threads(int)
- Data handling: private(vlist), firstprivate(vlist), shared(vlist), and similar clauses
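
As a rough sketch of how these clauses combine, the function below parallelizes only when `n` exceeds a threshold and requests at most four threads. The name `team_size` and its parameters are assumptions made for illustration. The `#ifdef _OPENMP` guard keeps the file buildable without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns how many threads executed the region. */
int team_size(int n, int threshold)
{
    int size = 1;   /* shared: one copy visible to all threads */

    /* Conditional parallelism plus a requested degree of concurrency. */
    #pragma omp parallel if(n > threshold) num_threads(4) shared(size)
    {
#ifdef _OPENMP
        /* Exactly one thread records the team size. */
        #pragma omp single
        size = omp_get_num_threads();
#endif
    }
    return size;
}
```

When the `if` condition is false the region runs on a single thread, so the function returns 1. When it is true the runtime may still grant fewer than the four threads requested.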

7 OpenMP Programming Model (2)
A number of compilers from various vendors or open source communities implement the OpenMP API: GNU (gcc), IBM, Intel, Portland, PathScale, Microsoft, and more. For example, recent versions of the GNU (gcc) compiler on Linux ship with OpenMP support.
In addition to compiler directives, OpenMP provides library routines and environment variables:
- In C/C++, the omp.h header file must be included: #include <omp.h>
- Fortran uses the omp_lib module: USE omp_lib
A trivial test program can be used to test the compiler and the environment (file hello.c):

8 OpenMP Programming Model (3)
    #include <omp.h>
    #include <stdio.h>
    int main()
    {
        /* Every thread in the team executes the printf once. */
        #pragma omp parallel
        printf("Hello world from thread %d, nthreads %d!\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }
To enable OpenMP, the compiler needs the proper option, such as -fopenmp for gcc and gfortran:
    -bash-4.1$ gcc -fopenmp -o hello hello.c
    -bash-4.1$ ./hello
    Hello world from thread 0, nthreads 4!
    Hello world from thread 3, nthreads 4!
    Hello world from thread 2, nthreads 4!
    Hello world from thread 1, nthreads 4!

9 Example: Simple Parallel Loop
Parallelizing for loops is the typical use of OpenMP:
- Find the most time-consuming loops
- Split their iterations up between threads
C/C++:
    /* Original serial code */
    void simple(int n, float *a, float *b)
    {
        int i;
        for (i=1; i<n; i++)
            b[i] = (a[i] + a[i-1]) / 2.0;
    }

10 Example: Simple Parallel Loop (2)
C/C++:
    /* Parallel code with OpenMP */
    void simple(int n, float *a, float *b)
    {
        int i;
        #pragma omp parallel for
        for (i=1; i<n; i++) /* i is private by default */
            b[i] = (a[i] + a[i-1]) / 2.0;
    }

11 Example: Simple Parallel Loop (3)
The same parallel example in Fortran:
          SUBROUTINE SIMPLE(N, A, B)
          INTEGER I, N
          REAL B(N), A(N)
    !$OMP PARALLEL DO  !I is private by default
          DO I=2,N
              B(I) = (A(I) + A(I-1)) / 2.0
          ENDDO
    !$OMP END PARALLEL DO
          END SUBROUTINE SIMPLE

12 Thread Interaction
- OpenMP operates using shared memory; threads communicate via shared variables
- Unintended sharing can lead to race conditions, where the output changes depending on thread scheduling
- Race conditions can be controlled using synchronization, but synchronization is expensive
- Alternatively, the way data is stored can be changed to minimize the need for synchronization
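
The last point, changing how data is stored, might look like the sketch below. Each thread accumulates into its own local counter and touches the shared total only once, so synchronization happens once per thread instead of once per element. The function name `count_even` is hypothetical.

```c
/* Hypothetical helper: count the even values in v[0..n-1]. */
long count_even(const int *v, int n)
{
    long total = 0;                /* shared by all threads */

    #pragma omp parallel
    {
        long local = 0;            /* declared inside the region: one per thread */
        int i;

        #pragma omp for
        for (i = 0; i < n; i++)
            if (v[i] % 2 == 0)
                local++;           /* no synchronization needed here */

        /* One synchronized update per thread, not per element. */
        #pragma omp critical
        total += local;
    }
    return total;
}
```

Incrementing `total` directly inside the loop without protection would be a race condition; protecting every increment would be correct but much more expensive.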

13 OpenMP Directives
5 categories:
- Parallel regions
- Work sharing
- Data environment
- Synchronization
- Runtime functions / environment variables
Basically the same in both C/C++ and Fortran

14 The core elements The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data-environment management, thread synchronization, user-level runtime routines and environment variables.

15 The core elements
Thread creation
- omp parallel: forks additional threads to carry out the work in parallel. The original thread becomes the master thread, with thread ID 0. See the previous code example (the C program displaying "Hello world" using multiple threads).
Work-sharing constructs
Used to specify how to assign independent work to one or all of the threads.
- omp for or omp do (loop constructs): split up loop iterations among the threads.
- sections: assigns consecutive but independent code blocks to different threads.
- single: specifies a code block that is executed by only one thread; a barrier is implied at the end.
- master: similar to single, but the code block is executed by the master thread only, and no barrier is implied at the end.
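
A small sketch of `sections` and `single`, with hypothetical names (`fill_both`, `done`): two independent loops are given to different threads, and one thread then records completion.

```c
/* Hypothetical helper: fills a[] and b[] in two independent sections. */
void fill_both(int n, int *a, int *b, int *done)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            { int i; for (i = 0; i < n; i++) a[i] = i; }

            #pragma omp section
            { int i; for (i = 0; i < n; i++) b[i] = 2 * i; }
        }   /* implicit barrier: both arrays are complete here */

        /* Executed by exactly one thread of the team. */
        #pragma omp single
        *done = 1;
    }
}
```

With only two sections, at most two threads do useful work in the `sections` construct; any other threads wait at its implicit barrier.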

16 The Core Elements (data environment management)
OpenMP is a shared-memory programming model: most variables in OpenMP code are visible to all threads by default. Sometimes private variables are necessary to avoid race conditions, and values need to be passed between the sequential part and the parallel region (the code block executed in parallel). Data-sharing attribute clauses cover both needs; they are appended to the OpenMP directive.
- shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables are shared except the loop iteration counter.
- private: the data within a parallel region is private to each thread. By default, the loop iteration counters in the OpenMP loop constructs are private.
- default: allows the programmer to state that the default data scoping within a parallel region will be either shared or none in C/C++.
- firstprivate: like private, except that each copy is initialized to the original value.
- lastprivate: like private, except that the original value is updated after the construct.
- reduction: a safe way of joining work from all threads after the construct.
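
The `firstprivate` and `lastprivate` clauses can be sketched together as below. The function name `fill_with_offset` is hypothetical, and the sketch assumes `n > 0` so that a sequentially last iteration exists.

```c
/* Hypothetical helper: writes offset + i into out[i] and returns the
   value computed by the sequentially last iteration (assumes n > 0). */
int fill_with_offset(int n, int offset, int *out)
{
    int i;
    int last = -1;

    /* offset: each thread's copy starts with the caller's value.
       last:   the copy from iteration n-1 is written back afterwards. */
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {
        last = offset + i;
        out[i] = last;
    }
    return last;
}
```

With plain `private(offset)` the per-thread copies would start uninitialized, and with plain `private(last)` the value computed inside the loop would not reach the caller.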

17 The Core Elements (synchronization)
Synchronization constructs and clauses
- critical: the enclosed code block is executed by only one thread at a time, never simultaneously by multiple threads. It is often used to protect shared data from race conditions.
- atomic: the memory update (write, or read-modify-write) in the next statement is performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than with critical.
- ordered: the structured block is executed in the order in which iterations would be executed in a sequential loop.
- barrier: each thread waits until all of the other threads of the team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.
- nowait: specifies that threads completing their assigned work can proceed without waiting for all threads in the team to finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work-sharing construct.
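
As an illustrative sketch of `atomic`, the function below protects a single shared counter update. The name `count_above` is hypothetical.

```c
/* Hypothetical helper: counts the values in v[0..n-1] above a limit. */
int count_above(const double *v, int n, double limit)
{
    int hits = 0;   /* shared counter */
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (v[i] > limit) {
            /* Only this read-modify-write is atomic;
               the comparison above runs unsynchronized. */
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

Replacing `atomic` with `critical` should give the same result, though `atomic` is usually the lighter-weight choice for a single scalar update.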

18 An example (synchronization)
    double area, pi, x;
    int i, n;
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

19 The Core Elements (scheduling)
Scheduling clauses
schedule(type, chunk): useful when the work-sharing construct is a do-loop or for-loop. The iterations in the work-sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling covered here are:
1. static: all threads are allocated iterations before they execute the loop. By default, the iterations are divided among the threads as equally as possible. Specifying an integer for the parameter "chunk" instead allocates blocks of "chunk" contiguous iterations to the threads in round-robin order.
2. dynamic: iterations are handed out while the loop runs. Once a thread finishes its allocated iterations, it returns to get more from the iterations that are left. The parameter "chunk" defines the number of contiguous iterations that are allocated to a thread at a time (1 if omitted).
3. guided: large chunks of contiguous iterations are allocated to the threads dynamically (as above). The chunk size decreases with each successive allocation, roughly in proportion to the number of iterations remaining, down to a minimum size specified by the parameter "chunk".
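
To show where a dynamic schedule could help, the sketch below uses a deliberately uneven workload in which iteration `i` costs about `i` steps. The function names `triangle` and `fill_triangles` and the chunk size of 4 are arbitrary choices for illustration.

```c
/* Hypothetical workload: cost grows with i, so iterations are uneven. */
static long triangle(int i)
{
    long s = 0;
    int k;
    for (k = 0; k <= i; k++)
        s += k;
    return s;
}

/* Fills out[i] with 0 + 1 + ... + i. */
void fill_triangles(int n, long *out)
{
    int i;
    /* Chunks of 4 iterations are handed out on demand, so threads that
       get cheap early iterations come back for more work. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        out[i] = triangle(i);
}
```

Swapping in `schedule(static)` or `schedule(guided, 4)` changes only how iterations are distributed, not the values written to `out`.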

20 An example of scheduling and data environment management
    #pragma omp parallel for private(j) schedule(static, 2)
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            x[j][j] = g(i, x[j-1]);
- Data environment management clause: private
- Scheduling clause: schedule(static, 2)
The chunk size (2) can be adjusted to address load-balancing issues, etc.

21 The Core Elements (if condition & initialization)
IF control
- if: the task is parallelized only if the condition is met; otherwise the code block executes serially.
Initialization
- firstprivate: the data is private to each thread, but initialized with the value of the variable of the same name from the master thread.
- lastprivate: the data is private to each thread. If the current iteration is the last iteration of the parallelized loop, the value of this private data is copied to the variable of the same name outside the parallel region. A variable can be both firstprivate and lastprivate.
- threadprivate: the data is global, but each thread has its own private copy at run time. The difference between threadprivate and private is the global scope associated with threadprivate and the value being preserved across parallel regions.

22 An example of conditional execution
- The overhead of fork/join is high
- If a loop is small, you don't want to parallelize it
- But you may not know how big it is until runtime
Conditional clause for parallel execution: if ( expression )
    area = 0.0;
    #pragma omp parallel for private(x) if (n > 5000)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

23 The Core Elements (data copying & reduction)
Data copying
- copyin: does for threadprivate variables what firstprivate does for private variables. Threadprivate variables are not initialized unless copyin is used to pass in the value from the corresponding global variable. No copyout is needed because the value of a threadprivate variable is maintained throughout the execution of the whole program.
- copyprivate: used with single to copy data values from private objects on one thread (the thread executing the single block) to the corresponding objects on the other threads.
Reduction
- reduction(operator | intrinsic : list): each variable in the list has a local copy in each thread, and the values of the local copies are combined (reduced) into a global shared variable. This is very useful when an operation (specified by "operator") is applied to a variable iteratively, so that its value at one iteration depends on its value at the previous iteration. The steps that lead up to the update are parallelized; each thread accumulates into its own copy, and the copies are combined at the end of the construct, which avoids a race condition. A common example is parallelizing the numerical integration of functions and differential equations.
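
The data-copying behaviour described above can be sketched with a threadprivate variable and `copyin`. The names `base` and `add_base` are hypothetical, and this is only one way the clause might be used.

```c
static int base = 0;              /* file-scope (global) data */
#pragma omp threadprivate(base)   /* each thread keeps its own copy */

/* Hypothetical helper: writes base + i into out[i], with base = value. */
void add_base(int n, int value, int *out)
{
    int i;

    base = value;   /* sets the master thread's copy only */

    /* copyin initializes every thread's copy from the master's copy. */
    #pragma omp parallel for copyin(base)
    for (i = 0; i < n; i++)
        out[i] = base + i;
}
```

Without `copyin(base)`, threads other than the master would read their own copies of `base`, which would not reflect the assignment made just before the parallel region.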

24 An Example of Reductions
Sometimes each thread should calculate part of a value, and the parts are then collapsed into a single value. This is done with the reduction clause:
    area = 0.0;
    #pragma omp parallel for private(x) reduction (+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

25 The Core Elements (misc)
Others
- flush: the value of the variable is written back from registers to memory, so that the value can be used outside of a parallel part.
- master: executed only by the master thread (the thread that forked off all the others during the execution of the OpenMP directive). No implicit barrier; other team members (threads) are not required to reach it.
User-level runtime routines
Used to modify or check the number of threads, detect whether the execution context is in a parallel region, find out how many processors are in the current system, set and unset locks, call timing functions, etc.
Environment variables
A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.

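26 OpenMP Functions
The OpenMP functions can be used to get information about the runtime environment and to change its settings:
    int omp_get_num_procs()          /* number of processors available */
    int omp_get_num_threads()        /* number of threads in the current team */
    int omp_get_thread_num()         /* ID of the calling thread (master is 0) */
    void omp_set_num_threads(int)    /* thread count for later parallel regions */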
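
A minimal sketch of calling these routines follows. The helper names `request_threads` and `procs_available` are hypothetical, and the serial fallback definitions are an assumption added so that the file also builds when OpenMP is disabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (an assumption of this sketch, not part of OpenMP). */
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_num_procs(void)    { return 1; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* Hypothetical helper: request `want` threads, report how many were granted. */
int request_threads(int want)
{
    int got = 0;

    omp_set_num_threads(want);      /* applies to later parallel regions */
    #pragma omp parallel
    {
        #pragma omp master
        got = omp_get_num_threads();   /* size of the current team */
    }
    return got;
}

/* Hypothetical helper: number of processors the runtime reports. */
int procs_available(void)
{
    return omp_get_num_procs();
}
```

The runtime is allowed to grant fewer threads than requested, so callers should treat the argument to `omp_set_num_threads` as an upper bound rather than a guarantee.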

27 OpenMP Environment Variables
OpenMP parallelism may be controlled via environment variables:
- OMP_NUM_THREADS: sets the number of threads used in parallel regions
- OMP_DYNAMIC: when TRUE, allows the runtime to adjust the number of threads dynamically
- OMP_NESTED: when TRUE, enables nested parallelism
- OMP_SCHEDULE: controls the scheduling of loops that use schedule(runtime)
Example:
    export OMP_SCHEDULE="static,4"
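
For OMP_SCHEDULE to have an effect, the loop has to defer its scheduling decision to run time. A sketch of such a loop, with the hypothetical name `square_all`:

```c
/* Hypothetical helper: squares every element of a[] in place. */
void square_all(int n, double *a)
{
    int i;
    /* The schedule kind and chunk size are taken from OMP_SCHEDULE
       when the program runs, not fixed at compile time. */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}
```

Assuming a program built from this function, setting OMP_NUM_THREADS=4 and OMP_SCHEDULE="dynamic,2" in the environment before launching it would select four threads and dynamic scheduling without recompiling.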

28 Demo
Monte Carlo estimation of pi.

29 Serial code
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    int main(int argc, char *argv[])
    {
        /* A Monte Carlo algorithm for calculating pi */
        int count;             /* points inside the unit quarter circle */
        unsigned short xi[3];  /* random number seed */
        int i;                 /* loop index */
        int samples;           /* Number of points to generate */
        double x,y;            /* Coordinates of points */
        double pi;             /* Estimate of pi */
        samples = atoi(argv[1]);  /* sample count from the command line */
        xi[0] = 1;             /* These statements set up the random seed */
        xi[1] = 1;
        xi[2] = 0;
        count = 0;
        for (i = 0; i < samples; i++) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) count++;
        }
        pi = 4.0 * count / samples;
        printf("Estimate of pi: %7.5f\n", pi);
        return 0;
    }

30 Parallel version
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    int main(int argc, char *argv[])
    {
        /* A Monte Carlo algorithm for calculating pi */
        int count;             /* points inside the unit quarter circle */
        unsigned short xi[3];  /* random number seed */
        int i;                 /* loop index */
        int samples;           /* Number of points to generate */
        double x,y;            /* Coordinates of points */
        double pi;             /* Estimate of pi */
        samples = atoi(argv[1]);
        count = 0;             /* set once, before the threads are forked */
        /* xi, x and y are private so that threads do not overwrite
           each other's seed and coordinates */
        #pragma omp parallel private(xi, x, y)
        {
            xi[0] = 1;         /* These statements set up the random seed */
            xi[1] = 1;
            xi[2] = omp_get_thread_num();  /* a different seed per thread */
            printf("I am thread %d\n", xi[2]);
            #pragma omp for reduction(+:count)
            for (i = 0; i < samples; i++) {
                x = erand48(xi);
                y = erand48(xi);
                if (x*x + y*y <= 1.0) count++;
            }
        }
        pi = 4.0 * (double)count / (double)samples;
        printf("count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);
        return 0;
    }
