Practical Introduction to OpenMP
Slides: http://tinyurl.com/cq-intro-openmp-20151006
By: Bart Oldeman, Calcul Québec McGill HPC
Bart.Oldeman@calculquebec.ca, Bart.Oldeman@mcgill.ca
Partners and Sponsors
Outline of the workshop
- Theoretical / practical introduction
  - Parallelizing your serial code
  - What is OpenMP? Why do we need it?
  - How do we run OpenMP codes (on the Guillimin cluster)?
  - How to program with OpenMP?
    - Program structure
    - Basic functions
    - Examples
Outline of the workshop
- Practical exercises on Guillimin
  - Log in, set up the environment, launch OpenMP code
  - Analyzing and running examples
  - Modifying and tuning OpenMP codes
Exercise 1: Log in to Guillimin, set up the environment
1) Log in to Guillimin: ssh class##@guillimin.hpc.mcgill.ca
2) Check for loaded software modules: $ module list
3) See all available modules: $ module av
4) Load the necessary modules: $ module add ifort icc
5) Check the loaded modules again
Parallelizing your serial code
Models for parallel computing (as an ordinary user sees them...):
- Implicit parallelization: minimum work for you
  - Threaded libraries (MKL, ACML, GOTO, etc.)
  - Compiler directives (OpenMP)
  - Good for desktops and shared-memory machines
- Explicit parallelization: work is required!
  - You specify what should be done on which CPU
  - Low-level option for shared-memory machines: POSIX Threads (pthreads)
  - Solution for distributed clusters (MPI: shared nothing!)
OpenMP Shared Memory API
Open Multi-Processing: an Application Program Interface for multi-threaded programs in a shared-memory environment. http://www.openmp.org
Consists of:
- Compiler directives
- Runtime library routines
- Environment variables
Allows for relatively simple incremental parallelization. Not distributed, but can be combined with MPI (hybrid: see the Advanced MPI workshop).
Shared memory approach
[Diagram: threads 0 and 1, each with its own private memory, both attached to a common shared memory.]
Most memory is shared by all threads. Each thread also has some private memory: variables explicitly declared private, and local variables in functions and subroutines.
OpenMP: fork/join model
[Diagram: a master thread executes the serial regions; at each fork, worker threads join the master to execute a parallel region, and at each join they synchronize back to serial execution.]
Master + workers = team. Implementations use thread pools, so worker threads sleep from join to fork.
What is OpenMP for a user?
- OpenMP is NOT a language!
- OpenMP is NOT a compiler or a specific product.
- OpenMP is a de-facto industry standard, a specification for an Application Program Interface (API).
- You use its directives, routines, and environment variables.
- You compile and link your code with specific flags.
History: version 1.0 (1997), 2.5 (2005), 3.0 (2008), 3.1 (2011), 4.0 (2013).
Different implementations: GCC (4.2+), Intel, PGI, Visual C++, Solaris Studio, Clang (3.7+), ...
Basic features of an OpenMP program
- Include the basic definitions: #include <omp.h> (C), INCLUDE 'omp_lib.h' or USE omp_lib (Fortran).
- A parallel region is declared by a directive of the form #pragma omp parallel (C) or !$OMP PARALLEL (Fortran), declaring which variables are private.
- Optional: code only compiled for OpenMP: use the _OPENMP preprocessor symbol (C) or the !$ prefix (Fortran).
Example: Hello from N cores

Fortran:

PROGRAM hello
!$ USE omp_lib
  IMPLICIT NONE
  INTEGER rank, size
  rank = 0
  size = 1
!$OMP PARALLEL PRIVATE(rank, size)
!$ size = omp_get_num_threads()
!$ rank = omp_get_thread_num()
  WRITE(*,*) 'Hello from processor', &
       rank, ' of ', size
!$OMP END PARALLEL
END PROGRAM hello

C:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char *argv[])
{
  int rank = 0, size = 1;
#ifdef _OPENMP
#pragma omp parallel private(rank, size)
#endif
  {
#ifdef _OPENMP
    rank = omp_get_thread_num();
    size = omp_get_num_threads();
#endif
    printf("Hello from processor %d"
           " of %d\n", rank, size);
  }
  return 0;
}
POSIX Threads: Hello from N cores

// pthreads.c
#include <stdio.h>
#include <pthread.h>
#define SIZE 4

void *hello(void *arg)
{
  printf("Hello from processor %d of %d\n", *(int *)arg, SIZE);
  return NULL;
}

int main(int argc, char *argv[])
{
  int i, p[SIZE];
  pthread_t threads[SIZE];
  for (i = 1; i < SIZE; i++) {   /* Fork threads */
    p[i] = i;
    pthread_create(&threads[i], NULL, hello, &p[i]);
  }
  p[0] = 0;
  hello(&p[0]);                  /* thread 0 greets as well */
  for (i = 1; i < SIZE; i++)     /* Join threads. */
    pthread_join(threads[i], NULL);
  return 0;
}
Compiling your OpenMP code
NOT defined by the standard. A special compilation flag must be used. On the Guillimin cluster:

module add gcc
gcc -fopenmp hello.c -o hello
gfortran -fopenmp hello.f90 -o hello

module add ifort icc
icc -openmp hello.c -o hello
ifort -openmp hello.f90 -o hello

module add pgi
pgcc -mp hello.c -o hello
pgfortran -mp hello.f90 -o hello
Running your OpenMP code
Important: environment variable OMP_NUM_THREADS.

export OMP_NUM_THREADS=4
./hello
Hello from processor 2 of 4
Hello from processor 0 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4

unset OMP_NUM_THREADS
pgcc -mp hello.c -o hello
./hello
Hello from processor 0 of 1

gcc -fopenmp hello.c -o hello
./hello
Hello from processor 3 of 8
(...)
Running your OpenMP code
On your laptop or desktop, just compile and run your code as above.
On the Guillimin cluster, use the batch system to submit non-trivial OpenMP jobs! Example: hello.pbs:

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00
#PBS -V
#PBS -N hello
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
./hello > hello.out

Submit your job: $ qsub hello.pbs
Exercise 2: Hello, compilation
1) Copy all files to your home directory:
$ cp -a /software/workshop/cq-formation-openmp/* ~/
2) Compile your code:
$ ifort -openmp hello.f90 -o hello
$ icc -openmp hello.c -o hello
Exercise 2: Hello, job submission
3) View the file hello.pbs:

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00
#PBS -V
#PBS -N hello
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
./hello > hello.out
Exercise 2: Hello, job submission
4) Submit your job: $ qsub hello.pbs
5) Check the job status: $ qstat -u $USER or $ showq -u $USER
6) Check the output (hello.out)
Exercise 2: Hello, compile and run
Alternatively, using interactive qsub, or on your own Mac/Linux/Cygwin/MSYS computer:
1) Interactive login:
$ qsub -I -l nodes=1:ppn=2,walltime=7:00:00
or clone the repository (which creates the directory) and enter it:
yourlaptop> git clone -b mcgill \
    https://github.com/calculquebec/cq-formation-intro-openmp.git
> cd cq-formation-intro-openmp
2) Compile your code:
> gfortran -fopenmp hello.f90 -o hello
> gcc -fopenmp hello.c -o hello
3) Run your code:
> # can use any value here; default: number of cores
> export OMP_NUM_THREADS=2
> ./hello
OpenMP directives
Format: sentinel directive [clause ...], where sentinel is #pragma omp (C) or !$OMP (Fortran).
Examples:
- #pragma omp parallel (C), !$OMP PARALLEL ... !$OMP END PARALLEL (Fortran): parallel region construct.
- #pragma omp for: a worksharing construct that makes a loop parallel (!$OMP DO in Fortran).
- #pragma omp parallel for: a combined construct: defines a parallel region that contains only the loop.
- #pragma omp barrier: a synchronization directive: all threads wait for each other here.
OpenMP clauses
Examples of data-scope clauses: private, shared, and default.
- !$OMP PARALLEL PRIVATE(i) SHARED(x)
  The variable i is private to each thread, but the variable x is shared with all other threads.
  Default: all variables are shared except loop variables (C: outer loop only; Fortran: all) and variables declared inside the block.
- !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
  All variables are shared except i.
- !$OMP PARALLEL DEFAULT(NONE) PRIVATE(i)
  Using default(none) requires listing each variable, or else the compiler complains (helps debugging!).
OpenMP important library routines
- int omp_get_max_threads(void);  Get the maximum number of threads used here.
- void omp_set_num_threads(int);  Set the number of threads for the next parallel region.
- int omp_get_thread_num(void);   Get the current thread number in a parallel region.
- int omp_get_num_threads(void);  Get the number of threads in a parallel region.
- double omp_get_wtime(void);     Portable wall-clock timing routine.
More exist, for example for locks and nested regions.
OpenMP main environment variables
- OMP_NUM_THREADS: sets the maximum number of threads used (default: compiler dependent, but often the number of available (hyper)threads).
- OMP_SCHEDULE: used for run-time scheduling.
More exist, for example to control nested parallelism.
parallel for
Example:

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

Here i is automatically made private because it is the loop variable. All other variables are shared. The loop is split between the threads; for example with two threads and n=10, thread 0 does indices 0 to 4 and thread 1 does indices 5 to 9.
parallel for
Eliminate overhead from fork and join (in practice: synchronization) by using just one parallel region for two vector additions:

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  #pragma omp for
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
...
#pragma omp parallel
{
  addvectors(a, b, c, n);
  addvectors(b, c, d, n);
}
parallel for, nowait and barrier

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  #pragma omp for nowait
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
...
#pragma omp parallel
{
  addvectors(a, b, c, n);
  #pragma omp barrier
  addvectors(b, c, d, n/2);
  addvectors(e, f, g, n);
}

omp for by default implies a barrier where all threads wait at the end of the loop. Eliminate this synchronization overhead using the nowait clause. But then you need to add an explicit barrier wherever there is a dependency!
parallel for: if clause
The if clause allows conditional parallel regions: if n is too small, the overhead is not worth it:

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  #pragma omp for
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
...
#pragma omp parallel if (n > 10000)
{
  addvectors(a, b, c, n);
}
parallel for: scheduling
schedule(static, 10000) allocates chunks of 10000 loop iterations to every thread:

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  #pragma omp for schedule(static, 10000)
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

- Use dynamic instead of static to assign chunks dynamically: when a thread finishes a chunk, it is assigned the next one. Useful when the work per iteration is unequal.
- guided instead of dynamic: chunk sizes decrease as less work is left to do.
- runtime: use the OMP_SCHEDULE environment variable.
Nested loops
For perfectly nested rectangular loops we can parallelize multiple loops in the nest with the collapse clause:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    ...
  }
}

- The argument is the number of loops to collapse.
- Will form a single loop of length NxM and then parallelize and schedule that.
- Useful if N is close to the number of threads, so parallelizing only the outer loop may not give good load balance.
- More efficient alternative to (advanced) nested parallelism.
parallel: manual scheduling (SPMD)
SPMD = Single Program Multiple Data, like in MPI.

void addvectors(const int *a, const int *b, int *c, int n)
{
  int i;
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
...
int tid, nthreads, low, high;
#pragma omp parallel default(none) private(tid, nthreads, \
    low, high) shared(a, b, c, n)
{
  tid = omp_get_thread_num();
  nthreads = omp_get_num_threads();
  low = (n * tid) / nthreads;
  high = (n * (tid + 1)) / nthreads;
  addvectors(&a[low], &b[low], &c[low], high - low);
}

Calculate which thread does which loop iterations. Note: no barrier.
SPMD vs. worksharing
- Worksharing (omp for / omp do) is easiest to implement.
- SPMD (do work based on the thread ID) may give better performance but is harder to implement.
- SPMD like in MPI: instead of using large shared arrays, use smaller arrays private to the threads: mark all (non-read-only) global and persistent (static/save) variables threadprivate, and communicate using buffers and barriers.
- Fewer cache misses from using more private data may give better performance.
- More advanced topic: see the Advanced OpenMP workshop in 2016.
sections (SPMD construct)
Example:

#pragma omp parallel sections
{
  #pragma omp section
  addvectors(a, b, c, n);
  #pragma omp section
  printf("Hello world!\n");
  #pragma omp section
  printf("I may or may not be the third thread\n");
}

The sections are individual code blocks that are distributed over the threads. A more flexible alternative (OpenMP 3.0): omp task, useful when traversing dynamic data structures (lists, trees, etc.).
workshare (Fortran)
Example:

integer a(10000), b(10000), c(10000), d(10000)
!$OMP PARALLEL
!$OMP WORKSHARE
c(:) = a(:) + b(:)
!$OMP END WORKSHARE NOWAIT
!$OMP WORKSHARE
d(:) = a(:)
!$OMP END WORKSHARE NOWAIT
!$OMP END PARALLEL

Array assignments in Fortran are distributed among threads like loops. Note that nowait in Fortran comes at the end.
Exercise 3: Modifying Hello
Ask each CPU to do its own computation by inserting code as follows:

IF (rank == 0) THEN
  a = sqrt(2.0)
  b = 0.0
  WRITE(*,*) 'a,b=', a, b, ' on proc', rank
END IF
IF (rank == 1) THEN
  a = 0.0
  b = sqrt(3.0)
  WRITE(*,*) 'a,b=', a, b, ' on proc', rank
END IF

Recompile as before and submit to the queue...
Exercise 4: Modifying Hello
Do (almost) the same thing, now using omp sections:

!$OMP SECTIONS
!$OMP SECTION
  a = sqrt(2.0)
  b = 0.0
  WRITE(*,*) 'a,b=', a, b, ' on proc', rank
!$OMP SECTION
  a = 0.0
  b = sqrt(3.0)
  WRITE(*,*) 'a,b=', a, b, ' on proc', rank
!$OMP END SECTIONS

Recompile as before and submit to the queue...
Race conditions (data races)
Example: innerprod.c, innerprod.f90

ip = 0;
#pragma omp parallel for private(i) shared(ip, a, b)
for (i = 0; i < N; i++)
  ip += a[i] * b[i];

Problem: the update could internally be run as:

for (i = 0; i < N; i++) {
  int reg = ip;           /* load ip into a CPU register */
  reg += a[i] * b[i];
  ip = reg;               /* store the register back */
}

Threads may sum into their private CPU registers at the same time and then overwrite ip, losing the additions from the other threads!
Race conditions
Example: a = {1, 2}, b = {3, 4}, inner product 1*3 + 2*4 = 11, two threads.

                                ip   register0  register1
tid=0  int reg = ip;             0       0       unknown
tid=1  int reg = ip;             0       0          0
tid=0  reg += a[0] * b[0];       0       3          0
tid=1  reg += a[1] * b[1];       0       3          8
tid=0  ip = reg;                 3       3          8
tid=1  ip = reg;                 8       3          8

Wrong result: ip=8. But it may just as well be ip=3: non-deterministic.
Solution: critical section
Example:

ip = 0;
#pragma omp parallel for private(i) shared(ip, a, b)
for (i = 0; i < N; i++) {
  #pragma omp critical
  ip += a[i] * b[i];
}

critical makes sure only one thread can run the (compound) statement at a time.
Problem: slow!
Solution: atomic
Example:

ip = 0;
#pragma omp parallel for private(i) shared(ip, a, b)
for (i = 0; i < N; i++) {
  #pragma omp atomic
  ip += a[i] * b[i];
}

atomic is like critical but can only apply to an update of a specific memory location. Less general, but less overhead.
Solution: local summing
Example:

ip = 0;
#pragma omp parallel private(i, localip) shared(ip, a, b)
{
  localip = 0;
  #pragma omp for nowait
  for (i = 0; i < N; i++)
    localip += a[i] * b[i];
  #pragma omp atomic
  ip += localip;
}

Still needs atomic, but only once per thread instead of N times, greatly reducing overhead and improving performance.
Solution: local summing using an array
Example:

ip = 0;
int *localips = malloc(omp_get_max_threads() * sizeof(*localips));
#pragma omp parallel private(i, tid) shared(ip, a, b, localips)
{
  tid = omp_get_thread_num();
  localips[tid] = 0;
  #pragma omp for
  for (i = 0; i < N; i++)
    localips[tid] += a[i] * b[i];
  #pragma omp single
  for (i = 0; i < omp_get_num_threads(); i++)
    ip += localips[i];
}

One single thread does the final summing. Could also use omp master here, to force the master thread to do the final summing. omp master works like if (tid == 0).
Solution: local summing using an array
(Same code as on the previous slide.)
- master and single are especially useful for I/O.
- Problem here: false sharing of cache lines in the array localips.
Solution: local summing using an array
Example:

ip = 0;
int *localips = malloc(omp_get_max_threads() * sizeof(*localips));
#pragma omp parallel private(i, tid, localip) shared(ip, a, b, localips)
{
  tid = omp_get_thread_num();
  localip = 0;
  #pragma omp for nowait
  for (i = 0; i < N; i++)
    localip += a[i] * b[i];
  localips[tid] = localip;
  #pragma omp barrier
  #pragma omp single
  for (i = 0; i < omp_get_num_threads(); i++)
    ip += localips[i];
}

Minimizes false sharing by moving the shared-array write outside the loop. Note the barrier!
Solution: local summing using an array
Could also use padding in localips, but you need to know the cache line size (e.g. 64 bytes):

#define PAD 16
ip = 0;
int (*localips)[PAD] = malloc(omp_get_max_threads() * sizeof(*localips));
#pragma omp parallel private(i, tid) shared(ip, a, b, localips)
{
  tid = omp_get_thread_num();
  localips[tid][0] = 0;
  #pragma omp for
  for (i = 0; i < N; i++)
    localips[tid][0] += a[i] * b[i];
  #pragma omp single
  for (i = 0; i < omp_get_num_threads(); i++)
    ip += localips[i][0];
}

With PAD 16 ints of 4 bytes, each thread's counter sits on its own 64-byte cache line.
Easiest solution: use reduction
Example:

ip = 0;
#pragma omp parallel for reduction(+:ip)
for (i = 0; i < N; i++)
  ip += a[i] * b[i];

Reduction is the most straightforward solution. Caveats: it only works on scalars, and you cannot control the rounding errors caused by reordering floating-point operations. For arrays, use one of the other methods, Fortran, or OpenMP 4.0 user-defined reductions.
Exercise 5: Compilation error
See omp_bug.c or omp_bug.f90 (courtesy of Blaise Barney, Lawrence Livermore National Laboratory) and find the compilation error in:

#pragma omp parallel for \
    shared(a, b, c, chunk) \
    private(i, tid) \
    schedule(static, chunk)
{
  tid = omp_get_thread_num();
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
  }
}   /* end of parallel for construct */
Exercise 6: Computing π
Consider pi_collect.c (.f90):

π = 4 arctan 1 = 4 Σ_{i=0}^{∞} (-1)^i / (2i + 1) = 4 - 4/3 + 4/5 - 4/7 + 4/9 - ...

Let's add timings:

double t1, t2;
t1 = omp_get_wtime();
...
t2 = omp_get_wtime();
printf("time = %.16f\n", t2 - t1);

Try some of the alternatives to a reduction (atomic, critical) and measure the performance.
Exercise 7: Matrix multiplication
See the file mm.c or mm.f90. Make the initialization and the multiplication parallel and measure the speedup.
Further information
- The standard itself, news, development, links to tutorials: http://www.openmp.org
- Intel tutorial on YouTube (from Tim Mattson): http://tinyurl.com/openmp-tutorial
- New: OpenMP 4.0: thread affinity, SIMD, accelerators (GPUs, coprocessors), in GCC 4.9+ and Intel compilers 14 and 15 (used in the November 12 Xeon Phi and 2016 Advanced OpenMP workshops).
Questions? Write the Guillimin support team at guillimin@calculquebec.ca