University of Amsterdam - SURFsara. High Performance Computing and Big Data Course


Workshop 7: OpenMP and MPI Assignments

Clemens Grelck <C.Grelck@uva.nl>
Roy Bakker <R.Bakker@uva.nl>
Adam Belloum <A.S.Z.Belloum@uva.nl>

January 26, 2015

1 Getting started with OpenMP

1.1 Skeleton

A very simple skeleton for OpenMP code is the following (let us call it foo.c):

    #include <omp.h>

    int main()
    {
        omp_set_num_threads(8);
        return 0;
    }

We need to set the number of threads explicitly since we execute on a cluster with a batch job submission system. To compile your code, run: gcc -O3 -fopenmp foo.c -o foo

1.2 Hello World

Of course, your first program should be Hello World, but as we are using OpenMP we want each thread to say hello. Write a simple program that starts a parallel section in which each thread says hello and prints its thread id. Also make sure that the master thread (and only the master thread) prints the total number of threads in the parallel section.

1.3 Hello World 2.0

Now let us make Hello World slightly less trivial. Extend your code so that each thread reports that it is thread m of n. As a little challenge, you may request the number of active threads only once in the entire program.

2 Reductions

Consider the code example in Fig. 1.

    const int n = 1000;
    int arr[n];
    int i;
    int result = 0;

    for (i = 0; i < n; i++)
        arr[i] = i;

    for (i = 0; i < n; i++)
        result += arr[i];

Figure 1: Simple reduction code in (sequential) C

Exercise 1: Parallelize the first for-loop.

Exercise 2: Parallelize the second for-loop using a critical section.

Exercise 3: Parallelize the second for-loop using the reduction clause.

Exercise 4: Compare your three implementations. Can you explain the performance difference?

3 Mandelbrot

Fig. 2 shows a simple sequential C program that generates a Mandelbrot image, similar to the code used during the lecture. You can compile this code with gcc -fopenmp mandel.c -o mandel. When running the resulting binary, redirect the standard output to a file, e.g. mandel.ppm. To show your image, run display mandel.ppm. For your convenience, the code and a simple makefile are available at: http://staff.fnwi.uva.nl/r.bakker/uva-sara-hpc-2015/

Exercise 1: Examine the code and try to parallelize it with OpenMP pragmas.

Exercise 2: Parallelizing the outer loop is often a good idea. However, this might require you to reorganize some of the code. Try to do this.

Exercise 3: Try to parallelize your reorganized code as much as possible and experiment with different setups. How do you obtain the best performance?

Exercise 4: Experiment with the different scheduling techniques that OpenMP provides. Can you measure performance differences? Which choice gives you the best runtimes?

Exercise 5: Add a dithering step to the Mandelbrot example, analogous to the example in the lecture. Can you measure a difference in performance if you use the nowait clause?

Exercise 6: Parallelize the middle loop instead of the outer one, and play with the collapse clause. Can you measure a performance difference between the three different versions?

4 MPI collective communication

Collective communication is a form of structured communication where, instead of a dedicated sender and a dedicated receiver, all MPI processes of a given communicator participate. A simple example is broadcast communication, where a given MPI process sends a message to all other MPI processes in the communicator.
Write a C+MPI function

    int MYMPI_Bcast(
        void         *buffer,       /* INOUT : buffer address         */
        int           count,        /* IN    : buffer size            */
        MPI_Datatype  datatype,     /* IN    : datatype of entry      */
        int           root,         /* IN    : root process (sender)  */
        MPI_Comm      communicator  /* IN    : communicator           */
    );

that implements broadcast communication by means of point-to-point communication. Each MPI process of the given communicator is assumed to execute the broadcast function with the same message parameters and root process argument. The root process is supposed to send the content of its own buffer to all other processes of the communicator. Any non-root process uses the given buffer for receiving the corresponding data from the root process.

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>

    void put_pixel(int red, int green, int blue)
    {
        fputc((char)red, stdout);
        fputc((char)green, stdout);
        fputc((char)blue, stdout);
    }

    int main()
    {
        double x, xx, y, cx, cy;
        const int itermax = 10000;     /* how many iterations to do */
        const double magnify = 2.0;    /* no magnification          */
        const int hxres = 1000;        /* horizontal resolution     */
        const int hyres = 1000;        /* vertical resolution       */
        int iter, hy, hx;

        /* header for PPM output */
        printf("P6\n");
        printf("%d %d\n255\n", hxres, hyres);

        for (hy = 1; hy <= hyres; hy++) {
            for (hx = 1; hx <= hxres; hx++) {
                cx = (((float)hx)/((float)hxres)-0.5)/magnify*3.0-0.7;
                cy = (((float)hy)/((float)hyres)-0.5)/magnify*3.0;
                x = 0.0;
                y = 0.0;
                for (iter = 1; iter < itermax; iter++) {
                    xx = x*x-y*y+cx;
                    y = 2.0*x*y+cy;
                    x = xx;
                    if (x*x+y*y>100.0) iter = 999999;
                }
                if (iter<99999)
                    put_pixel(0,255,255);
                else
                    put_pixel(180,0,0);
            }
        }
        return 0;
    }

Figure 2: Simple Mandelbrot implementation in (sequential) C

Exercise 1: Implement the above function MYMPI_Bcast using point-to-point communication. Assume a fully connected network topology, i.e., every node may send messages to any other node at the same cost.

Exercise 2: A particular advantage of collective communication operations is that their implementation can exploit the underlying physical network topology without making an MPI-based application program specific to any such topology. For your implementation of broadcast, assume a 1-dimensional ring topology, where each node has communication links only with its two direct neighbours, the nodes with (circularly) increasing and decreasing MPI process ids. While any MPI process may nonetheless send messages to any other MPI process, such messages may need to be routed through a number of intermediate nodes. Communication cost can be assumed to be linear in the number of nodes involved. Aim for an efficient implementation of your function on this (imaginary) ring network topology. No experiments are required for this assignment, but describe your code in detail in a short report: how it is adapted to the given network topology and how many atomic communication events (sending a message from one node to a neighbouring node) are required to complete one broadcast.

5 Distributed Mandelbrot (optional)

Parallelise the Mandelbrot code of Fig. 2 to run on scalable architectures using MPI message passing.