
Lecture 7: OpenMP: Control vs. Simplicity Tradeoff
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com

Small and Easy Motivation

#include <stdio.h>

int main() {
    // Do this part in parallel
    printf("Hello, World!\n");
    return 0;
}

Small and Easy Motivation

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);
    // Do this part in parallel
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }
    return 0;
}

Simple! OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence. OpenMP is a small API that hides cumbersome threading calls behind simpler directives.

Interesting Insights About OpenMP These insights come from HPC folks, though! Source: www.sdsc.edu/~allans/cs260/lectures/openmp.ppt

OpenMP In a Nutshell
- An API for multithreaded programming
- Assumes a shared-memory multiprocessor
- Works with C/C++ and Fortran (example: gcc -fopenmp)
- Consists of: compiler directives, library routines, environment variables
- Strengths: portability, simplicity, and scalability
- Weakness: less control for the programmer
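For reference, a typical build-and-run of the hello-world code above might look like this with GCC (a sketch; the file name hello_omp.c is made up, while -fopenmp and OMP_NUM_THREADS are standard):

    gcc -fopenmp hello_omp.c -o hello_omp    # compile with OpenMP support
    OMP_NUM_THREADS=4 ./hello_omp            # request 4 threads via the environment
                                             # (a call to omp_set_num_threads() in the program overrides this)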

Goals of OpenMP
- Standardization among platforms/architectures
- Ease of use
- Portability

OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin with a single thread: the master thread (ID = 0). FORK: the master thread creates a team of parallel threads. JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate.

Isn't Nested Parallelism Interesting? (Diagram: an outer fork/join with a second fork/join nested inside it.)

Important! The following are implementation dependent:
- Nested parallelism
- Dynamically altering the number of threads
It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multithreaded program. Threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time. The programmer is responsible for ensuring that a variable is FLUSHed by all threads as needed.
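A rough sketch of nested parallelism (not from the slides; omp_set_nested() only requests nesting, and an implementation may still serialize the inner regions, as noted above):

    #include <stdio.h>
    #include <omp.h>

    int main() {
        omp_set_nested(1);                   /* request nested parallelism (may be ignored) */
        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)
            {
                /* Each outer thread forks its own inner team -- only if the
                   implementation actually supports nesting. */
                printf("outer %d, inner %d\n", outer, omp_get_thread_num());
            }
        }
        return 0;
    }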

Directives
#pragma omp directive-name [clause[, clause]...]
- Case sensitive
- Only one directive-name may be specified per directive
- Each directive applies to at most one succeeding statement, which must be a structured block.
- Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash ("\") at the end of a directive line.

Directives
Example: the parallel directive
#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
structured_block

#include <omp.h>
#include <stdio.h>

int main () {
    int nthreads, tid;

    /* Fork a team of threads with each thread having a private tid variable */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }   /* All threads join master thread and terminate */
    return 0;
}

Runtime Libraries: in the example above, omp_get_thread_num() and omp_get_num_threads() are the calls into the OpenMP runtime library.

How Many Threads?
- Setting of the num_threads clause
- Use of the omp_set_num_threads() library function
- Setting of the OMP_NUM_THREADS environment variable
- Implementation default, usually the number of cores
Threads are numbered from 0 (master thread) to N-1.
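A small sketch (not from the slides) of the first two mechanisms; the clause on the directive takes precedence over the earlier library call, and OMP_NUM_THREADS is the environment-variable alternative:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        omp_set_num_threads(8);              /* library call: applies to later parallel regions */

        #pragma omp parallel num_threads(4)  /* clause: overrides the call for this region only */
        {
            if (omp_get_thread_num() == 0)
                printf("team size = %d\n", omp_get_num_threads());  /* typically prints 4 */
        }
        return 0;
    }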

Dividing Work Among Threads Important: Work sharing directives do NOT launch new threads.

Dividing Work Among Threads
#pragma omp for [clause ...]
    schedule (type [, chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    collapse (n)
    nowait
for_loop
- Number of iterations must be known in advance
- No gotos into or out of the loop

#pragma omp parallel private(f)
{
    f = 7;
    #pragma omp for
    for (i = 0; i < 20; i++)
        a[i] = b[i] + f * (i+1);
}   /* omp end parallel */

Dividing Work Among Threads

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

int main () {
    int i, chunk;
    float a[N], b[N], c[N];

    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }   /* end of parallel section */
    return 0;
}
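The reduction clause listed earlier deserves its own illustration. A minimal sketch (not from the slides) that sums an array without an explicit critical section:

    #include <stdio.h>
    #include <omp.h>
    #define N 1000

    int main() {
        int i;
        double sum = 0.0, a[N];

        for (i = 0; i < N; i++)
            a[i] = i * 1.0;

        /* Each thread accumulates a private partial sum; the partial sums
           are combined into the shared sum at the end of the loop. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }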

Dividing Work Among Threads
#pragma omp sections [clause ...]
{
    #pragma omp section
        structured_block
    #pragma omp section
        structured_block
}
Implicit barrier at the end of the sections construct, unless you use nowait.

#pragma omp parallel
{
    #pragma omp sections
    {
        { a = ...; b = ...; }
        #pragma omp section
        { c = ...; d = ...; }
        #pragma omp section
        { e = ...; f = ...; }
        #pragma omp section
        { g = ...; h = ...; }
    }   /* omp end sections */
}   /* omp end parallel */

Dividing Work Among Threads
single: specifies that the enclosed code is to be executed by only one thread in the team.
#pragma omp single [clause ...]
    structured_block
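A small sketch (not from the slides) of a typical use of single: one thread initializes shared data while the rest of the team waits at the implicit barrier that ends the single block:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int n = 0;
        #pragma omp parallel shared(n)
        {
            /* Exactly one thread (not necessarily the master) runs this block;
               the others wait at its implicit barrier. */
            #pragma omp single
            {
                n = 42;   /* e.g. read a problem size once */
                printf("initialized by thread %d\n", omp_get_thread_num());
            }
            printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
        }
        return 0;
    }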

Example About Consistency Model
Code (initially A = Flag = 0):

    P1:                       P2:
    A = 23;                   while (Flag != 1) {;}
    Flag = 1;                 ... = A;

Possible execution sequence on each processor:

    P1:                       P2:
    Write A, 23               Read Flag    // get 0
    Write Flag, 1             Read Flag    // get 1
                              Read A       // what do you get?

Do you see the problem?

Example About Consistency Model
Code (initially A = Flag = 0):

    P1:                       P2:
    A = 23;                   while (Flag != 1) {;}
    flush;                    ... = A;
    Flag = 1;

Execution:
- P1 writes data into A
- The flush waits till the write to A is completed
- P1 then writes data to Flag
Therefore, if P2 sees Flag = 1, it is guaranteed that it will read the correct value of A, even if memory operations in P1 before the flush and memory operations after the flush are reordered by the hardware or compiler.

What Does OpenMP Say Here?
#pragma omp flush (list)
- Thread-visible variables are written back to memory at this point.
- A more complicated scenario arises if two threads execute the flush at the same time with common variables between them.
- If you do not specify a list, then all thread-visible variables will be flushed.
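A sketch (not from the slides) of the A/Flag pattern written with OpenMP flush. It mirrors the flush-based idiom the slides describe; modern OpenMP code would typically use atomics instead:

    #include <stdio.h>
    #include <omp.h>

    int A = 0, Flag = 0;

    int main() {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {      /* the P1 role */
                A = 23;
                #pragma omp flush(A, Flag)        /* push A out before Flag */
                Flag = 1;
                #pragma omp flush(Flag)
            } else {                              /* the P2 role */
                int f;
                do {
                    #pragma omp flush(Flag)       /* re-read Flag from memory */
                    f = Flag;
                } while (f != 1);
                #pragma omp flush(A)
                printf("A = %d\n", A);            /* should print 23 */
            }
        }
        return 0;
    }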

Synchronization: Critical Directive
- Enclosed code is executed by all threads, but restricted to only one thread at a time.
C/C++:
#pragma omp critical [ (name) ]
    structured-block
- A thread waits at the beginning of a critical region until no other thread in the team is executing a critical region with the same name.

cnt = 0;
f = 7;
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 20; i++) {
        if (b[i] == 0) {
            #pragma omp critical
            cnt++;
        }   /* endif */
        a[i] = b[i] + f * (i+1);
    }   /* end for */
}   /* omp end parallel */

Runtime Libraries
omp_get_thread_num()
- Returns, as an integer, the calling thread's ID; the master thread has ID 0.
- The number of threads remains unchanged for the duration of a parallel region, but different parallel regions may have different numbers of threads.
omp_get_num_threads()
- Returns, as an integer, the number of threads in the current parallel region.
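A quick sketch (not from the slides) showing how omp_get_num_threads() depends on where it is called:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* Outside any parallel region only the master thread exists. */
        printf("outside: %d thread(s)\n", omp_get_num_threads());   /* prints 1 */

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)        /* master thread has ID 0 */
                printf("inside:  %d thread(s)\n", omp_get_num_threads());
        }
        return 0;
    }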

Runtime Libraries
- Are we in a parallel region? omp_in_parallel()
- How many processors in the system? omp_get_num_procs()
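A tiny sketch (not from the slides) exercising both queries:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        printf("processors available: %d\n", omp_get_num_procs());
        printf("in parallel? %d\n", omp_in_parallel());     /* 0 here */

        #pragma omp parallel
        {
            #pragma omp single
            printf("in parallel? %d\n", omp_in_parallel()); /* nonzero if the team
                                                               has more than one thread */
        }
        return 0;
    }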

Environment Variables
- Environment variables allow the end user to control the parallel code.
- All environment variable names are uppercase.
- Example: OMP_NUM_THREADS sets the maximum number of threads to use during execution, e.g. setenv OMP_NUM_THREADS 8

What OpenMP Does NOT Do
- Not automatic parallelization
  - The user explicitly specifies parallel execution
  - The compiler does not ignore user directives even if they are wrong
- Not meant for distributed-memory parallel systems
- Not necessarily implemented identically by all vendors
- Not guaranteed to make the most efficient use of shared memory


Questions: Pthreads vs. OpenMP
- When will you use Pthreads and when will you use OpenMP?
- Writing the same program in Pthreads and OpenMP, which one do you think will have better performance? Scalability?
- If you are using OpenMP, do you have any freedom to help the underlying hardware?

Conclusions
- OpenMP is an easier way than Pthreads to do multithreaded programming.
- OpenMP depends on compiler directives, a runtime library, and environment variables.
- Many aspects of OpenMP are still implementation dependent, so you need to be careful!