Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Michael J. Quinn

Chapter 17 Shared-memory Programming

Outline: OpenMP; the shared-memory model; parallel for loops; declaring private variables; critical sections; reductions; performance improvements; more general data parallelism; functional parallelism.

OpenMP. OpenMP is an application programming interface (API) for parallel programming on multiprocessors. It consists of compiler directives and a library of support functions, and it works in conjunction with Fortran, C, or C++.

What's OpenMP Good For? C + OpenMP is sufficient to program multiprocessors. C + MPI + OpenMP is a good way to program multicomputers built out of multiprocessors, such as the IBM RS/6000 SP, the Fujitsu AP3000, and the Dell High Performance Computing Cluster.

Shared-memory Model. [Figure: several processors connected to a single shared memory.] Processors interact and synchronize with each other through shared variables.

Fork/Join Parallelism. Initially only the master thread is active, and it executes the sequential code. Fork: the master thread creates or awakens additional threads to execute parallel code. Join: at the end of the parallel code, the created threads die or are suspended.

Fork/Join Parallelism. [Figure: a time line in which the master thread repeatedly forks a team of other threads and joins with them at the end of each parallel section.]

Shared-memory Model vs. Message-passing Model (#1). Shared-memory model: the number of active threads is 1 at the start and finish of the program and changes dynamically during execution. Message-passing model: all processes are active throughout the execution of the program.

Incremental Parallelization. A sequential program is a special case of a shared-memory parallel program. A parallel shared-memory program may have as little as a single parallel loop. Incremental parallelization is the process of converting a sequential program to a parallel program a little bit at a time.

Shared-memory Model vs. Message-passing Model (#2). Shared-memory model: execute and profile the sequential program, incrementally make it parallel, and stop when further effort is not warranted. Message-passing model: the sequential-to-parallel transformation requires major effort, and the transformation is done in one giant step rather than many tiny steps.

Parallel for Loops. C programs often express data-parallel operations as for loops, for example: for (i = first; i < size; i += prime) marked[i] = 1; OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel. The compiler takes care of generating code that forks/joins threads and allocates the iterations to threads.

Pragmas. Pragma: a compiler directive in C or C++; the word stands for "pragmatic information". A pragma is a way for the programmer to communicate with the compiler, and the compiler is free to ignore pragmas. Syntax: #pragma omp <rest of pragma>

Parallel for Pragma. Format: #pragma omp parallel for for (i = 0; i < N; i++) a[i] = b[i] + c[i]; The compiler must be able to verify that the run-time system will have the information it needs to schedule the loop iterations.
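
For concreteness, here is a minimal, self-contained sketch (an assumed example, not from the slides) built around a loop of this form; with GCC such a program is typically compiled with the -fopenmp flag so that the pragma takes effect:

#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }   /* initialize inputs */

    #pragma omp parallel for        /* iterations divided among the threads */
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}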

Canonical Shape of for Loop Control Clause. The loop must have the form for (index = start; index < end; index++), where the comparison may be any of <, <=, >, or >=, and the increment may be written as index++, ++index, index--, --index, index += inc, index -= inc, index = index + inc, index = inc + index, or index = index - inc.
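
As a brief sketch (assumed examples, not from the slides), the first loop below matches the canonical shape and may carry a parallel for pragma, while the second does not, because the increment i *= 2 is not one of the allowed forms:

#include <stdio.h>

#define N 16

int main(void) {
    double a[N];
    int i, count = 0;

    /* Canonical: a relational test against a bound and an index += inc step. */
    #pragma omp parallel for
    for (i = 0; i < N; i += 2)
        a[i] = 1.0;

    /* Not canonical: the increment i *= 2 is not an allowed form, so this
       loop cannot be the target of a parallel for pragma. */
    for (i = 1; i < N; i *= 2)
        count++;

    printf("a[0] = %f, count = %d\n", a[0], count);
    return 0;
}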

Execution Context. Every thread has its own execution context: an address space containing all of the variables the thread may access. The contents of an execution context are static variables, dynamically allocated data structures in the heap, variables on the run-time stack, and an additional run-time stack for functions invoked by the thread.

Shared and Private Variables. Shared variable: has same address in execution context of every thread. Private variable: has different address in execution context of every thread. A thread cannot access the private variables of another thread.

Shared and Private Variables. [Figure: the code below together with a diagram of the heap and stacks. The master thread (thread 0) and thread 1 share b, cptr, and the heap block allocated by malloc, while each thread has its own private copy of the loop index i.]
int main (int argc, char *argv[])
{
   int b[3];
   char *cptr;
   int i;
   cptr = malloc(1);
   #pragma omp parallel for
   for (i = 0; i < 3; i++)
      b[i] = i;
}

Function omp_get_num_procs Returns number of physical processors available for use by the parallel program int omp_get_num_procs (void)

Function omp_set_num_threads. Uses the parameter value to set the number of threads to be active in parallel sections of code. May be called at multiple points in a program. void omp_set_num_threads (int t)

Pop Quiz: Write a C program segment that sets the number of threads equal to the number of processors that are available.
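
One possible answer, as a minimal sketch (not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Set the number of threads for subsequent parallel regions equal to
       the number of available processors. */
    omp_set_num_threads(omp_get_num_procs());

    printf("requested %d threads\n", omp_get_num_procs());
    return 0;
}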

Declaring Private Variables.
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j],a[i][k]+tmp);
Either loop could be executed in parallel. We prefer to make the outer loop parallel, to reduce the number of forks/joins. We then must give each thread its own private copy of variable j.

private Clause. Clause: an optional, additional component to a pragma. Private clause: directs compiler to make one or more variables private. private ( <variable list> )

Example Use of private Clause.
#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j],a[i][k]+tmp);

firstprivate Clause. Used to create private variables having initial values identical to the value of the variable controlled by the master thread as the loop is entered. Variables are initialized once per thread, not once per loop iteration. If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value.

Example Use of firstprivate Clause.
x[0] = complex_function();
for (i=0; i<n; i++) {
   for (j=1; j<4; j++)
      x[j] = g(i, x[j-1]);
   answer[i] = x[1] - x[3];
}

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i=0; i<n; i++) {
   for (j=1; j<4; j++)
      x[j] = g(i, x[j-1]);
   answer[i] = x[1] - x[3];
}

lastprivate Clause. Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially. The lastprivate clause is used to copy back to the master thread's copy of a variable the private copy of the variable from the thread that executed the sequentially last iteration.

Example Use of lastprivate Clause.
for (i=0; i<n; i++) {
   x[0] = 1.0;
   for (j=1; j<4; j++)
      x[j] = x[j-1] * (i+1);
   sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];

#pragma omp parallel for private(j) lastprivate(x)
for (i=0; i<n; i++) {
   x[0] = 1.0;
   for (j=1; j<4; j++)
      x[j] = x[j-1] * (i+1);
   sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];

Critical Sections.
double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

Race Condition. Consider this C program segment to compute π using the rectangle rule:
double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

Race Condition (cont.). If we simply parallelize the loop...
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

Race Condition (cont.). ...we set up a race condition in which one process may race ahead of another and not see its change to the shared variable area. [Figure: two threads both read area = 11.667 before executing area += 4.0/(1.0 + x*x); thread A writes 15.432 and thread B then writes 15.230, so one update is lost. The answer should be 18.995.]

Race Condition Time Line. [Figure: time line of the value of area. It starts at 11.667; thread A adds 3.765, storing 15.432; thread B, which also read 11.667, adds 3.563 and stores 15.230, overwriting thread A's update.]

critical Pragma. Critical section: a portion of code that only one thread at a time may execute. We denote a critical section by putting the pragma #pragma omp critical in front of a block of C code.

Correct, But Inefficient, Code.
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   #pragma omp critical
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

Source of Inefficiency. The update to area is inside a critical section, so only one thread at a time may execute the statement; i.e., it is sequential code. The time to execute this statement is a significant part of the loop, so by Amdahl's Law we know the speedup will be severely constrained.
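
As an illustrative calculation (not from the slides): if the serialized update accounts for a fraction f of each iteration's work, Amdahl's Law limits the speedup on p threads to 1 / (f + (1 - f)/p). With f = 0.5, for example, the speedup can never exceed 1/f = 2, no matter how many threads are used.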

Reductions. Reductions are so common that OpenMP provides built-in support for them. We may add a reduction clause to the parallel for pragma, specifying the reduction operation and the reduction variable. OpenMP takes care of storing partial results in private variables and combining the partial results after the loop.

reduction Clause. The reduction clause has this syntax: reduction (<op>:<variable>) Operators: + (sum), * (product), & (bitwise and), | (bitwise or), ^ (bitwise exclusive or), && (logical and), || (logical or).

π-finding Code with Reduction Clause.
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++) {
   x = (i + 0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;

Performance Improvement #1. Too many fork/joins can lower performance. Inverting loops may help performance if the parallelism is in the inner loop, the outer loop can be made parallel after inversion, and inversion does not significantly lower the cache hit rate.
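
A small sketch of the idea (an assumed example, not from the slides): the dependence a[i][j] = 2*a[i-1][j] runs along i, so only the j loop is parallel; exchanging the loops makes the parallel loop the outer one, so threads are forked and joined only once.

#include <stdio.h>

#define M 4
#define N 8

int main(void) {
    static double a[M][N];
    int i, j;

    for (j = 0; j < N; j++) a[0][j] = 1.0;   /* initialize first row */

    /* Before inversion: parallelism is in the inner loop, so threads are
       forked and joined M-1 times. */
    for (i = 1; i < M; i++) {
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            a[i][j] = 2.0 * a[i-1][j];
    }

    /* After inverting the loops: one fork/join for the whole computation.
       i must now be declared private, since it is no longer the parallel
       loop's index. */
    #pragma omp parallel for private(i)
    for (j = 0; j < N; j++)
        for (i = 1; i < M; i++)
            a[i][j] = 2.0 * a[i-1][j];

    printf("a[M-1][0] = %f\n", a[M-1][0]);
    return 0;
}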

Performance Improvement #2. If a loop has too few iterations, the fork/join overhead is greater than the time saved by parallel execution. The if clause instructs the compiler to insert code that determines at run-time whether the loop should be executed in parallel; e.g., #pragma omp parallel for if(n > 5000)
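
As a sketch (an assumed example, not from the slides), the hypothetical helper compute_pi below wraps the earlier π loop and combines the if clause with the private and reduction clauses, so small problem sizes run sequentially:

#include <stdio.h>

/* Hypothetical helper: computes pi by the rectangle rule, in parallel only
   when n is large enough to repay the fork/join overhead. */
static double compute_pi(int n) {
    double area = 0.0, x;
    int i;

    #pragma omp parallel for private(x) reduction(+:area) if(n > 5000)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;
        area += 4.0 / (1.0 + x * x);
    }
    return area / n;
}

int main(void) {
    printf("%f\n", compute_pi(100));      /* executed sequentially */
    printf("%f\n", compute_pi(1000000));  /* executed in parallel */
    return 0;
}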

Performance Improvement #3. We can use the schedule clause to specify how the iterations of a loop should be allocated to threads. Static schedule: all iterations are allocated to threads before any iterations are executed. Dynamic schedule: only some iterations are allocated to threads at the beginning of the loop's execution; the remaining iterations are allocated to threads that complete their assigned iterations.

Static vs. Dynamic Scheduling. Static scheduling: low overhead, but may exhibit high workload imbalance. Dynamic scheduling: higher overhead, but can reduce workload imbalance.

Chunks. A chunk is a contiguous range of iterations. Increasing chunk size reduces overhead and may increase cache hit rate. Decreasing chunk size allows finer balancing of workloads.

schedule Clause. Syntax of the schedule clause: schedule (<type>[,<chunk>]) The schedule type is required; the chunk size is optional. Allowable schedule types: static (static allocation), dynamic (dynamic allocation), guided (guided self-scheduling), runtime (type chosen at run-time based on the value of the environment variable OMP_SCHEDULE).

Scheduling Options. schedule(static): block allocation of about n/t contiguous iterations to each thread. schedule(static,C): interleaved allocation of chunks of size C to threads. schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads. schedule(dynamic,C): dynamic allocation of C iterations at a time to threads.

Scheduling Options (cont.). schedule(guided): guided self-scheduling with minimum chunk size 1. schedule(guided,C): dynamic allocation of chunks to tasks using the guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, and the minimum chunk size is C. schedule(runtime): schedule chosen at run-time based on the value of OMP_SCHEDULE; Unix example: setenv OMP_SCHEDULE static,1
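
A small sketch (an assumed example, not from the slides) of a loop whose iterations do unequal amounts of work; handing out chunks of 16 iterations with a dynamic schedule tends to even out the load across threads:

#include <stdio.h>

#define N 10000

/* Work that grows with i, so a static block schedule would leave the thread
   holding the last block with far more work than the others. */
static double row_work(int i) {
    double s = 0.0;
    int j;
    for (j = 0; j <= i; j++)
        s += 1.0 / (j + 1.0);
    return s;
}

int main(void) {
    static double r[N];
    double total = 0.0;
    int i;

    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (i = 0; i < N; i++) {
        r[i] = row_work(i);      /* iterations handed out 16 at a time */
        total += r[i];
    }

    printf("total = %f\n", total);
    return 0;
}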

More General Data Parallelism. Our focus so far has been on the parallelization of for loops. Other opportunities for data parallelism include processing the items on a "to do" list, and a for loop plus additional code outside of the loop.

Processing a To Do List. [Figure: a shared heap containing the job list, pointed to by the shared variable job_ptr; the master thread and thread 1 each have a private task_ptr pointing to the task they are currently working on.]

Sequential Code (1/2).
int main (int argc, char *argv[])
{
   struct job_struct *job_ptr;
   struct task_struct *task_ptr;
   ...
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
   ...
}

Sequential Code (2/2).
struct task_struct *get_next_task (struct job_struct **job_ptr)
{
   struct task_struct *answer;
   if (*job_ptr == NULL) answer = NULL;
   else {
      answer = (*job_ptr)->task;
      *job_ptr = (*job_ptr)->next;
   }
   return answer;
}

Parallelization Strategy. Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks. We must ensure that no two threads take the same task from the list; i.e., we must declare a critical section.

parallel Pragma. The parallel pragma precedes a block of code that should be executed by all of the threads. Note: execution is replicated among all threads.

Use of parallel Pragma.
#pragma omp parallel private(task_ptr)
{
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
}

Critical Section for get_next_task.
struct task_struct *get_next_task (struct job_struct **job_ptr)
{
   struct task_struct *answer;
   #pragma omp critical
   {
      if (*job_ptr == NULL) answer = NULL;
      else {
         answer = (*job_ptr)->task;
         *job_ptr = (*job_ptr)->next;
      }
   }
   return answer;
}

Functions for SPMD-style Programming. The parallel pragma allows us to write SPMD-style programs. In these programs we often need to know the number of threads and the thread ID number; OpenMP provides functions to retrieve this information.

Function omp_get_thread_num. This function returns the thread identification number. If there are t threads, the ID numbers range from 0 to t-1. The master thread has ID number 0. int omp_get_thread_num (void)

Function omp_get_num_threads. Function omp_get_num_threads returns the number of active threads. If this function is called from a sequential portion of the program, it will return 1. int omp_get_num_threads (void)
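
A minimal SPMD-style sketch (an assumed example, not from the slides) that uses both functions inside a parallel region to divide the iterations of a loop by hand:

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    static double a[N];

    #pragma omp parallel
    {
        int t  = omp_get_num_threads();   /* threads in this team */
        int id = omp_get_thread_num();    /* 0 .. t-1 */
        int lo = id * N / t;              /* this thread's block of iterations */
        int hi = (id + 1) * N / t;
        int i;

        for (i = lo; i < hi; i++)
            a[i] = 2.0 * i;
    }

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}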

for Pragma. The parallel pragma instructs every thread to execute all of the code inside the block. If we encounter a for loop that we want to divide among the threads, we use the for pragma: #pragma omp for

Example Use of for Pragma.
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      printf ("Exiting (%d)\n", i);
      break;
   }
   #pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

single Pragma. Suppose we only want to see the output once. The single pragma directs the compiler that only a single thread should execute the block of code the pragma precedes. Syntax: #pragma omp single

Use of single Pragma.
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      #pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
   #pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

nowait Clause. The compiler puts a barrier synchronization at the end of every parallel for statement. In our example this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread. If we make these private variables, then it is okay to let threads move ahead, which can reduce execution time.

Use of nowait Clause.
#pragma omp parallel private(i,j,low,high)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      #pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
   #pragma omp for nowait
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}

Functional Parallelism. To this point all of our focus has been on exploiting data parallelism. OpenMP also allows us to assign different threads to different portions of code (functional parallelism).

Functional Parallelism Example.
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf ("%6.2f\n", epsilon(x,y));
[Figure: dependence graph with alpha, beta, and delta at the top, gamma depending on alpha and beta, and epsilon depending on gamma and delta.] We may execute alpha, beta, and delta in parallel.

parallel sections Pragma Precedes a block of k blocks of code that may be executed concurrently by k threads Syntax: #pragma omp parallel sections

section Pragma. Precedes each block of code within the encompassing block preceded by the parallel sections pragma. May be omitted for the first parallel section after the parallel sections pragma. Syntax: #pragma omp section

Example of parallel sections.
#pragma omp parallel sections
{
   #pragma omp section   /* optional */
   v = alpha();
   #pragma omp section
   w = beta();
   #pragma omp section
   y = delta();
}
x = gamma(v, w);
printf ("%6.2f\n", epsilon(x,y));

Another Approach. [Figure: the same dependence graph, grouped so that alpha and beta form one phase and gamma and delta form the next, followed by epsilon.] Execute alpha and beta in parallel, then execute gamma and delta in parallel.

sections Pragma. Appears inside a parallel block of code and has the same meaning as the parallel sections pragma. Putting multiple sections pragmas inside one parallel block may reduce fork/join costs.

Use of sections Pragma.
#pragma omp parallel
{
   #pragma omp sections
   {
      v = alpha();
      #pragma omp section
      w = beta();
   }
   #pragma omp sections
   {
      x = gamma(v, w);
      #pragma omp section
      y = delta();
   }
}
printf ("%6.2f\n", epsilon(x,y));

Summary (1/3). OpenMP is an API for shared-memory parallel programming. The shared-memory model is based on fork/join parallelism. Data parallelism: the parallel for pragma and the reduction clause.

Summary (2/3). Functional parallelism (parallel sections pragma). SPMD-style programming (parallel pragma). Critical sections (critical pragma). Enhancing the performance of parallel for loops: inverting loops, conditionally parallelizing loops, and changing loop scheduling.

Summary (3/3).
Characteristic                           OpenMP   MPI
Suitable for multiprocessors             Yes      Yes
Suitable for multicomputers              No       Yes
Supports incremental parallelization     Yes      No
Minimal extra code                       Yes      No
Explicit control of memory hierarchy     No       Yes