MPI Hands-On List of the exercises



List of the exercises:

1 MPI Hands-On Exercise 1: MPI Environment
2 MPI Hands-On Exercise 2: Ping-pong
3 MPI Hands-On Exercise 3: Collective communications and reductions
4 MPI Hands-On Exercise 4: Matrix transpose
5 MPI Hands-On Exercise 5: Matrix-matrix product
6 MPI Hands-On Exercise 6: Communicators
7 MPI Hands-On Exercise 7: Read an MPI-IO file
8 MPI Hands-On Exercise 8: Poisson's equation

1 MPI Hands-On Exercise 1: MPI Environment

All the processes print a different message, depending on whether their rank is odd or even. For the odd-ranked processes, the message will be: "I am the odd-ranked process, my rank is M". For the even-ranked processes: "I am the even-ranked process, my rank is N".

Remark: You can use the Fortran intrinsic function mod to test whether the rank is even or odd. The function mod(n,m) returns the remainder of n divided by m.
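A minimal sketch of such a program (the program name and the exact output format are illustrative):

    program who_am_i
      use mpi
      implicit none
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! mod(rank,2) is 0 for even ranks and 1 for odd ranks
      if (mod(rank, 2) == 0) then
         print '("I am the even-ranked process, my rank is ",i0)', rank
      else
         print '("I am the odd-ranked process, my rank is ",i0)', rank
      end if

      call MPI_FINALIZE(ierr)
    end program who_am_i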

2 MPI Hands-On Exercise 2: Ping-pong

Point-to-point communications: ping-pong between two processes.

1 In the first sub-exercise, we will do only a ping (sending a message from process 0 to process 1).
2 In the second sub-exercise, after the ping we will do a pong (process 1 sends back the message received from process 0).
3 In the last sub-exercise, we will do a ping-pong with different message sizes.

This means:

1 Send a message of 1000 reals from process 0 to process 1 (this is only a ping).
2 Create a ping-pong version where process 1 sends back the message received from process 0, and measure the communication time with the MPI_WTIME() function.
3 Create a version where the message size varies in a loop and which measures the communication durations and bandwidths.

Remarks:

Random numbers uniformly distributed in the range [0., 1.[ are generated by calling the Fortran random_number subroutine:

    call random_number(variable)

where variable can be a scalar or an array.

The time measurements can be done like this:

    ...
    time_begin = MPI_WTIME()
    ...
    time_end = MPI_WTIME()
    print '("... in ",f8.6," seconds.")', time_end - time_begin
    ...
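A minimal sketch of the timed ping-pong (the program name, buffer size, tag and variable names are illustrative):

    program ping_pong
      use mpi
      implicit none
      integer, parameter :: nb_values = 1000, tag = 99
      real, dimension(nb_values) :: values
      integer :: rank, ierr
      integer, dimension(MPI_STATUS_SIZE) :: status
      double precision :: time_begin, time_end

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      if (rank == 0) then
         call random_number(values)
         time_begin = MPI_WTIME()
         call MPI_SEND(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, status, ierr)
         time_end = MPI_WTIME()
         print '("Ping-pong of ",i0," reals in ",f8.6," seconds.")', nb_values, time_end - time_begin
      else if (rank == 1) then
         call MPI_RECV(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, ierr)
      end if

      call MPI_FINALIZE(ierr)
    end program ping_pong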

3 MPI Hands-On Exercise 3: Collective communications and reductions

By simulating a toss up on each process, loop until all the processes make the same choice, or until a maximum number of tries is reached.

The Fortran function nint(a) returns the nearest integer to the real a. A loop with an unknown number of iterations is written in Fortran with the do while construct:

    do while (condition(s))
       ...
    end do

Figure 1: Toss up (0 or 1 drawn on each of the processes P0 to P3) repeated until there is unanimity

If each process generates a pseudo-random number using the random_number subroutine, all of them will generate the same number at the first draw, so there will be unanimity at the outset and the problem becomes irrelevant. It is therefore necessary to change this default behavior (which is legitimate for reproducing similar executions of a code on different machines). To do this, we need to set a different seed value on each process to initialize the pseudo-random number generator, by calling the random_seed subroutine. As the values must be different on each process, we use the clock time (although its precision is not sufficient on some machines) and the rank. In addition, the size of the seed of the pseudo-random number generator is not the same depending on the algorithms and compilers used. To be portable, we need to obtain the size of the seed by calling the random_seed subroutine with the size argument; with this size we then allocate an array and initialize it. This array is given to the next call to random_seed with the put argument in order to fix the seed for the future sequences of pseudo-random number generation.
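A minimal sketch combining this per-process seeding with a unanimity test based on MPI_ALLREDUCE (the program name, the seed formula and the maximum number of tries are illustrative):

    program toss_up
      use mpi
      implicit none
      integer, parameter :: max_tries = 100
      integer :: rank, ierr, n, clock, i, choice, lowest, highest, nb_tries
      integer, allocatable :: seed(:)
      real :: x

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! Portable per-process seeding: query the seed size, then build a seed from the clock and the rank
      call random_seed(size=n)
      allocate(seed(n))
      call system_clock(count=clock)
      seed(:) = clock + 37 * (/ (i + rank, i = 1, n) /)
      call random_seed(put=seed)

      ! Toss up until all the processes draw the same value, or until max_tries is reached
      nb_tries = 0
      do while (nb_tries < max_tries)
         call random_number(x)
         choice = nint(x)                    ! 0 or 1
         call MPI_ALLREDUCE(choice, lowest, 1, MPI_INTEGER, MPI_MIN, MPI_COMM_WORLD, ierr)
         call MPI_ALLREDUCE(choice, highest, 1, MPI_INTEGER, MPI_MAX, MPI_COMM_WORLD, ierr)
         nb_tries = nb_tries + 1
         if (lowest == highest) exit         ! unanimity reached
      end do

      if (rank == 0) print '("Unanimity after ",i0," tries.")', nb_tries
      call MPI_FINALIZE(ierr)
    end program toss_up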

4 MPI Hands-On Exercise 4: Matrix transpose

The goal of this exercise is to practise with derived datatypes. A is a matrix with 5 rows and 4 columns defined on process 0. Process 0 sends its matrix A to process 1 and the matrix is transposed during the send.

    Process 0 (A, 5 x 4):      Process 1 (transpose, 4 x 5):
       1.   6.  11.  16.          1.   2.   3.   4.   5.
       2.   7.  12.  17.          6.   7.   8.   9.  10.
       3.   8.  13.  18.         11.  12.  13.  14.  15.
       4.   9.  14.  19.         16.  17.  18.  19.  20.
       5.  10.  15.  20.

Figure 2: Matrix transpose

To do this, we need to create two derived datatypes: a derived datatype type_line and a derived datatype type_transpose.
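A minimal sketch of one way to build and use the two datatypes, assuming default 4-byte reals and that rank has been obtained with MPI_COMM_RANK (the constant names and the tag are illustrative):

    integer, parameter :: nb_lines = 5, nb_columns = 4, tag = 100
    real, dimension(nb_lines, nb_columns) :: a            ! defined on process 0
    real, dimension(nb_columns, nb_lines) :: at           ! transpose, received on process 1
    integer :: type_line, type_transpose, size_real, rank, ierr
    integer, dimension(MPI_STATUS_SIZE) :: status

    ! type_line describes one row of A: nb_columns reals separated by a stride of nb_lines elements
    call MPI_TYPE_VECTOR(nb_columns, 1, nb_lines, MPI_REAL, type_line, ierr)

    ! type_transpose stacks nb_lines such rows, each one shifted by the size of one real in memory
    call MPI_TYPE_SIZE(MPI_REAL, size_real, ierr)
    call MPI_TYPE_CREATE_HVECTOR(nb_lines, 1, int(size_real, MPI_ADDRESS_KIND), &
                                 type_line, type_transpose, ierr)
    call MPI_TYPE_COMMIT(type_transpose, ierr)

    if (rank == 0) then
       call MPI_SEND(a, 1, type_transpose, 1, tag, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
       ! received contiguously, the rows of A become the columns of AT
       call MPI_RECV(at, nb_lines*nb_columns, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, ierr)
    end if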

5 MPI Hands-On Exercise 5: Matrix-matrix product

Collective communications: matrix-matrix product C = A B.

The matrices are square and their size is a multiple of the number of processes. The matrices A and B are defined on process 0. Process 0 sends a horizontal slice of matrix A and a vertical slice of matrix B to each process. Each process then calculates its diagonal block of matrix C. To calculate the non-diagonal blocks, each process sends its own slice of A to the other processes (see figure 3). At the end, process 0 gathers and verifies the results.

Figure 3: Distributed matrix product (the horizontal slices of A and the vertical slices of B are numbered 0 to 3, one per process)
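A minimal sketch of the initial distribution and of the diagonal-block computation, assuming N is the global size, NL = N / nb_procs, and that slice_a (NL x N), slice_b (N x NL) and block_c (NL x NL) are hypothetical local arrays:

    ! Vertical slices of B are contiguous in Fortran's column-major storage, so a plain scatter works
    call MPI_SCATTER(b, n*nl, MPI_REAL, slice_b, n*nl, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

    ! Horizontal slices of A are not contiguous: they need a derived datatype
    ! (or, equivalently, a scatter of the transpose of A)

    ! Once slice_a and slice_b are in place, the diagonal block of C is simply
    block_c = matmul(slice_a, slice_b)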

The algorithm that may seem the most immediate and the easiest to program, in which each process sends its slice of matrix A to each of the others, does not perform well because the communication algorithm is not well balanced. This is easy to see when doing performance measurements and graphically representing the collected traces. See the files produit_matrices_v1_n3200_p4.slog2, produit_matrices_v1_n6400_p8.slog2 and produit_matrices_v1_n6400_p16.slog2, using the Jumpshot tool of MPE (Multi-Processing Environment).

Figure 4: Parallel matrix product on 4 processes, for a matrix size of 3200 (first algorithm)

Figure 5: Parallel matrix product on 16 processes, for a matrix size of 6400 (first algorithm)

By changing the algorithm so that the slices are shifted from process to process, we obtain a perfect balance between calculations and communications, and a speedup of 2 compared to the naive algorithm. See the figure produced from the file produit_matrices_v2_n6400_p16.slog2.

Figure 6: Parallel matrix product on 16 processes, for a matrix size of 6400 (second algorithm)
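A minimal sketch of the ring shift used in this second algorithm (the neighbour ranks, the tag and the block bookkeeping are illustrative):

    ! At each of the nb_procs-1 steps, the local slice of A is shifted one position along the ring
    prev = mod(rank + nb_procs - 1, nb_procs)
    next = mod(rank + 1, nb_procs)
    do k = 1, nb_procs - 1
       call MPI_SENDRECV_REPLACE(slice_a, nl*n, MPI_REAL, prev, tag, next, tag, &
                                 MPI_COMM_WORLD, status, ierr)
       ! the slice now held started on process mod(rank+k, nb_procs); its product with slice_b
       ! fills the corresponding off-diagonal block of C
    end do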

6 MPI Hands-On Exercise 6: Communicators

Using the Cartesian topology shown below, subdivide it into 2 communicators along the rows by calling MPI_COMM_SPLIT().

Figure 7: Subdivision of a 2D topology (2 rows x 4 columns, ranks 0 to 7, with an array v(:)=1,2,3,4 and scalars w=1., 2., 3., 4. per row) and communication using the obtained 1D topology
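A minimal sketch of the split, assuming a 2D Cartesian communicator comm_2d in which the first dimension indexes the rows (comm_2d, coords and comm_row are illustrative names):

    ! Processes sharing the same row coordinate get the same colour,
    ! so MPI_COMM_SPLIT puts them together in the same sub-communicator
    call MPI_CART_COORDS(comm_2d, rank, 2, coords, ierr)
    call MPI_COMM_SPLIT(comm_2d, coords(1), rank, comm_row, ierr)
    ! comm_row can then be used for collective operations restricted to one row,
    ! for example scattering the array v of one process of the row into the scalars w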

7 MPI Hands-On Exercise 7: Read an MPI-IO file

We have a binary file data.dat containing 484 integer values. With 4 processes, the goal is to read the first 121 values on process 0, the next 121 on process 1, and so on. We will use 4 different methods:

Read via explicit offsets, in individual mode
Read via shared file pointers, in collective mode
Read via individual file pointers, in individual mode
Read via shared file pointers, in individual mode

To compile and execute the code, use make; to verify the results, use make verification, which runs a visualisation program corresponding to the four cases.
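A minimal sketch of the first method (read via explicit offsets, in individual mode), assuming 4-byte integers in the file; the variable names are illustrative:

    integer, parameter :: nb_values = 121
    integer, dimension(nb_values) :: values
    integer :: fh, rank, ierr
    integer(kind=MPI_OFFSET_KIND) :: offset

    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_FILE_OPEN(MPI_COMM_WORLD, "data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)

    ! Each process reads its own block of 121 integers; the offset is given in bytes
    offset = int(rank, MPI_OFFSET_KIND) * nb_values * 4
    call MPI_FILE_READ_AT(fh, offset, values, nb_values, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)

    call MPI_FILE_CLOSE(fh, ierr)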

8 MPI Hands-On Exercise 8: Poisson's equation

Resolution of the following Poisson equation:

    \partial^2 u / \partial x^2 + \partial^2 u / \partial y^2 = f(x,y)   in [0,1] x [0,1]
    u(x,y) = 0   on the boundaries
    f(x,y) = 2 (x^2 - x + y^2 - y)

We will solve this equation with a domain decomposition method:

The equation is discretized on the domain with a finite difference method.
The obtained system is solved with a Jacobi solver.
The global domain is split into sub-domains.

The exact solution is known and is u_exact(x,y) = x y (x - 1) (y - 1).

To discretize the equation, we define a grid with a set of points (x_i, y_j):

    x_i = i h_x   for i = 0, ..., ntx+1
    y_j = j h_y   for j = 0, ..., nty+1
    h_x = 1 / (ntx+1)
    h_y = 1 / (nty+1)

where:

    h_x: x-wise step
    h_y: y-wise step
    ntx: number of x-wise interior points
    nty: number of y-wise interior points

In total, there are ntx+2 points in the x direction and nty+2 points in the y direction.

Let u_{i,j} be the estimated solution at position x_i = i h_x and y_j = j h_y. The Jacobi solver consists of computing:

    u_{i,j}^{n+1} = c_0 ( c_1 (u_{i+1,j}^n + u_{i-1,j}^n) + c_2 (u_{i,j+1}^n + u_{i,j-1}^n) - f_{i,j} )

with:

    c_0 = h_x^2 h_y^2 / (2 (h_x^2 + h_y^2))
    c_1 = 1 / h_x^2
    c_2 = 1 / h_y^2
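A minimal sketch of one Jacobi iteration on a local sub-domain (the bounds sx, ex, sy, ey follow the numbering of figure 9; the array names are illustrative):

    ! u holds the current iterate including the ghost cells, u_new the next iterate
    do j = sy, ey
       do i = sx, ex
          u_new(i, j) = c0 * ( c1 * (u(i+1, j) + u(i-1, j)) &
                             + c2 * (u(i, j+1) + u(i, j-1)) - f(i, j) )
       end do
    end do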

In parallel, the interface values of the sub-domains must be exchanged between the neighbours. We use ghost cells as receive buffers.

Figure 8: Exchange of the points on the interfaces with the N, S, E and W neighbours

Figure 9: Numbering of the points in the different sub-domains (the local interior indices run from sx to ex along x and from sy to ey along y; the ghost cells are at indices sx-1, ex+1, sy-1 and ey+1)

Figure 10: Process rank numbering (0 to 3) in the sub-domains

Figure 11: Writing the global matrix u in a file (processes 0 to 3 each write their part)

You need to:

Define a view, to see only the owned part of the global matrix u;
Define a type, in order to write the local part of the matrix u (without the interfaces);
Apply the view to the file;
Write using only one call.
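A minimal sketch of these four steps, using an MPI subarray type as the view and assuming double precision values and global interior sizes ntx and nty; for simplicity the interior section of u is passed directly instead of defining a second datatype for the local part (all names are illustrative):

    integer :: type_view, fh, ierr

    ! View: the sub-domain (sx:ex, sy:ey) placed inside the global ntx x nty interior grid (0-based starts)
    call MPI_TYPE_CREATE_SUBARRAY(2, (/ ntx, nty /), (/ ex-sx+1, ey-sy+1 /), (/ sx-1, sy-1 /), &
                                  MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, type_view, ierr)
    call MPI_TYPE_COMMIT(type_view, ierr)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, "data.dat", MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                       MPI_INFO_NULL, fh, ierr)
    call MPI_FILE_SET_VIEW(fh, 0_MPI_OFFSET_KIND, MPI_DOUBLE_PRECISION, type_view, "native", &
                           MPI_INFO_NULL, ierr)

    ! A single collective call writes the interior of the local u at the right place in the file
    call MPI_FILE_WRITE_ALL(fh, u(sx:ex, sy:ey), (ex-sx+1)*(ey-sy+1), MPI_DOUBLE_PRECISION, &
                            MPI_STATUS_IGNORE, ierr)
    call MPI_FILE_CLOSE(fh, ierr)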

The main steps of the parallel program are:

Initialization of the MPI environment.
Creation of the 2D Cartesian topology.
Determination of the array indices for each sub-domain.
Determination of the 4 neighbour processes for each sub-domain.
Creation of two derived datatypes, type_line and type_column.
Exchange of the values on the interfaces with the other sub-domains (see the sketch below).
Computation of the global error. When the global error is lower than a specified value (machine precision, for example), we consider that we have reached the exact solution.
Collection of the global matrix u (the same one as obtained in the sequential version) into an MPI-IO file data.dat.
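A minimal sketch of one of these interface exchanges, assuming type_line describes one row of the local interior and that north and south are the neighbour ranks in the directions of decreasing and increasing i (all names are illustrative):

    ! Send the first interior row to the north neighbour and receive the southern ghost row
    ! from the south neighbour; the three other directions are handled in the same way
    call MPI_SENDRECV(u(sx, sy),   1, type_line, north, tag, &
                      u(ex+1, sy), 1, type_line, south, tag, &
                      comm_2d, MPI_STATUS_IGNORE, ierr)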

Directory: tp8/poisson

A skeleton of the parallel version is provided. It consists of a main program (poisson.f90) and several subroutines. All the modifications have to be done in the module_parallel_mpi.f90 file. To compile and execute the code, use make; to verify the results, use make verification, which runs a program that reads the data.dat file and compares it with the sequential version.