MPI Hands-On

List of the exercises
1 MPI Hands-On Exercise 1: MPI Environment
2 MPI Hands-On Exercise 2: Ping-pong
3 MPI Hands-On Exercise 3: Collective communications and reductions
4 MPI Hands-On Exercise 4: Matrix transpose
5 MPI Hands-On Exercise 5: Matrix-matrix product
6 MPI Hands-On Exercise 6: Communicators
7 MPI Hands-On Exercise 7: Read an MPI-IO file
8 MPI Hands-On Exercise 8: Poisson's equation
1 MPI Hands-On Exercise 1: MPI Environment
All the processes print a different message, depending on whether their rank is odd or even. For example, for the odd-ranked processes, the message will be:
I am the odd-ranked process, my rank is M
For the even-ranked processes:
I am the even-ranked process, my rank is N
Remark: you could use the Fortran intrinsic function mod to test whether the rank is even or odd. The function mod(n,m) gives the remainder of n divided by m.
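A minimal sketch of what such a program could look like (the program name and the format descriptors are only illustrative):

    program environment
      use mpi
      implicit none
      integer :: rank, code

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! mod(rank,2) is 0 for even ranks and 1 for odd ranks
      if (mod(rank, 2) == 0) then
        print '("I am the even-ranked process, my rank is ",i0)', rank
      else
        print '("I am the odd-ranked process, my rank is ",i0)', rank
      end if

      call MPI_FINALIZE(code)
    end program environment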
2 MPI Hands-On Exercise 2: Ping-pong
Point-to-point communications: ping-pong between two processes.
1 In the first sub-exercise, we will do only a ping (sending a message from process 0 to process 1).
2 In the second sub-exercise, after the ping we will do a pong (process 1 sends back the message received from process 0).
3 In the last sub-exercise, we will do a ping-pong with different message sizes.
This means:
1 Send a message of 1000 reals from process 0 to process 1 (this is only a ping).
2 Create a ping-pong version where process 1 sends back the message received from process 0, and measure the communication time with the MPI_WTIME() function.
3 Create a version where the message size varies in a loop and which measures the communication durations and bandwidths.
Remarks:
The generation of random numbers uniformly distributed in the range [0., 1.[ is done by calling the Fortran random_number subroutine:
    call random_number(variable)
where variable can be a scalar or an array.
The time measurements can be done like this:
    ...
    time_begin = MPI_WTIME()
    ...
    time_end = MPI_WTIME()
    print '("... in ",f8.6," seconds.")', time_end - time_begin
    ...
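As an illustration, the ping-pong of the second sub-exercise, with timing, could look like the following sketch; the message size, the tag and all variable names are an example rather than the expected solution:

    program ping_pong
      use mpi
      implicit none
      integer, parameter :: nb_values = 1000, tag = 99
      real, dimension(nb_values) :: values
      integer :: rank, code
      integer, dimension(MPI_STATUS_SIZE) :: status
      double precision :: time_begin, time_end

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      if (rank == 0) then
        call random_number(values)           ! fill the message with random reals in [0., 1.[
        time_begin = MPI_WTIME()
        call MPI_SEND(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, code)          ! ping
        call MPI_RECV(values, nb_values, MPI_REAL, 1, tag, MPI_COMM_WORLD, status, code)  ! pong
        time_end = MPI_WTIME()
        print '("Ping-pong of ",i0," reals in ",f8.6," seconds.")', nb_values, time_end - time_begin
      else if (rank == 1) then
        call MPI_RECV(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
        call MPI_SEND(values, nb_values, MPI_REAL, 0, tag, MPI_COMM_WORLD, code)
      end if

      call MPI_FINALIZE(code)
    end program ping_pong

For the third sub-exercise, the same send/receive pair is simply placed inside a loop over increasing message sizes, and the bandwidth is deduced from the measured durations.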
3 MPI Hands-On Exercise 3: Collective communications and reductions
By simulating a toss-up on each process, loop until all the processes make the same choice, or until a maximum number of tests is reached.
The Fortran function nint(a) returns the nearest integer to the real a.
A loop with an unknown number of iterations is written in Fortran with the do while syntax:
    do while (condition(s))
      ...
    end do
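A possible structure of the main loop is sketched below; testing unanimity with an MPI_ALLREDUCE and MPI_SUM is only one option among several, and the variable names are illustrative (the seed initialisation is discussed below):

    program toss_up
      use mpi
      implicit none
      integer, parameter :: nb_tests_max = 1000
      integer :: rank, nb_procs, code, toss, total, nb_tests
      real    :: draw

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)

      ! ... initialise a different random seed on each process (see below) ...

      nb_tests = 0
      total    = -1
      do while (total /= 0 .and. total /= nb_procs .and. nb_tests < nb_tests_max)
        nb_tests = nb_tests + 1
        call random_number(draw)
        toss = nint(draw)                 ! 0 or 1
        ! Sum of all the tosses: unanimity if the sum is 0 or nb_procs
        call MPI_ALLREDUCE(toss, total, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, code)
      end do

      if (rank == 0) print '("Unanimity (or stop) after ",i0," draws")', nb_tests
      call MPI_FINALIZE(code)
    end program toss_up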
Figure 1: Do toss up until there is unanimity
If each process generates a pseudo-random number using the random_number subroutine, all of them will generate the same value at the first draw, so there will be unanimity at the outset and the problem becomes irrelevant. It is therefore necessary to change the default behaviour (which is legitimate when reproducing similar executions of a code on different machines). To do this, we need to fix on each process a different seed value used to initialise the pseudo-random number generator, by calling the random_seed subroutine. As the values must be different on each process, we use the clock time (although its precision is not sufficient on some machines) and the rank. In addition, the size of the seed of the pseudo-random number generator differs depending on the algorithms and compilers used. To be portable, we need to obtain the size of the seed by calling the random_seed subroutine with the size argument; with this size we then allocate an array and initialise it. This array is given at the next call to random_seed with the put argument, in order to fix the seed for future sequences of pseudo-random number generation.
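An illustrative implementation of this seed initialisation (the way the clock value and the rank are combined is just one possibility):

    subroutine initialise_seed(rank)
      implicit none
      integer, intent(in) :: rank
      integer :: seed_size, clock, i
      integer, allocatable, dimension(:) :: seed

      call random_seed(size=seed_size)    ! portable: query the size of the seed
      allocate(seed(seed_size))
      call system_clock(count=clock)      ! clock time, different from one run to another
      ! Combine the clock and the rank so that each process gets a different seed
      seed(:) = (/ (clock + 37*(rank+1)*(i+1), i = 1, seed_size) /)
      call random_seed(put=seed)          ! fix the seed of this process
      deallocate(seed)
    end subroutine initialise_seed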
4 MPI Hands-On Exercise 4: Matrix transpose
The goal of this exercise is to practice with derived datatypes. A is a matrix with 5 lines and 4 columns, defined on process 0. Process 0 sends its matrix A to process 1, and the matrix is transposed during the send.

    Process 0 (A, 5 lines x 4 columns)    Process 1 (transposed, 4 lines x 5 columns)
     1.  6. 11. 16.                        1.  2.  3.  4.  5.
     2.  7. 12. 17.                        6.  7.  8.  9. 10.
     3.  8. 13. 18.                       11. 12. 13. 14. 15.
     4.  9. 14. 19.                       16. 17. 18. 19. 20.
     5. 10. 15. 20.

Figure 2: Matrix transpose

To do this, we need to create two derived datatypes: a derived datatype type_line and a derived datatype type_transpose.
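One possible construction of the two datatypes, among several: type_line describes one line of A (4 reals separated by a stride of 5 reals), and type_transpose is the same type with its extent resized to one real, so that the 5 lines of A can be sent with a single MPI_SEND while process 1 receives a contiguous block. The names follow the exercise text, but the details below are only a sketch:

    program transpose
      use mpi
      implicit none
      integer, parameter :: nb_lines = 5, nb_columns = 4, tag = 100
      real, dimension(nb_lines, nb_columns) :: a
      real, dimension(nb_columns, nb_lines) :: at
      integer :: rank, code, i, size_real, type_line, type_transpose
      integer, dimension(MPI_STATUS_SIZE) :: status

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! One line of A: nb_columns reals separated by a stride of nb_lines reals
      call MPI_TYPE_VECTOR(nb_columns, 1, nb_lines, MPI_REAL, type_line, code)
      ! Shrink the extent to one real so that consecutive lines start one real apart
      call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
      call MPI_TYPE_CREATE_RESIZED(type_line, 0_MPI_ADDRESS_KIND, &
                                   int(size_real, MPI_ADDRESS_KIND), type_transpose, code)
      call MPI_TYPE_COMMIT(type_transpose, code)

      if (rank == 0) then
        a = reshape( (/ (real(i), i = 1, nb_lines*nb_columns) /), shape(a) )  ! 1. to 20., column-wise
        ! Sending the 5 lines with this datatype transposes the matrix during the send
        call MPI_SEND(a, nb_lines, type_transpose, 1, tag, MPI_COMM_WORLD, code)
      else if (rank == 1) then
        ! Process 1 receives a plain contiguous block of 20 reals
        call MPI_RECV(at, nb_lines*nb_columns, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
      end if

      call MPI_TYPE_FREE(type_transpose, code)
      call MPI_FINALIZE(code)
    end program transpose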
5 MPI Hands-On Exercise 5: Matrix-matrix product
Collective communications: matrix-matrix product C = A×B.
The matrices are square and their size is a multiple of the number of processes. The matrices A and B are defined on process 0. Process 0 sends a horizontal slice of matrix A and a vertical slice of matrix B to each process. Each process then calculates its diagonal block of matrix C. To calculate the non-diagonal blocks, each process sends its own slice of A to the other processes (see figure 3). At the end, process 0 gathers and verifies the results.
Figure 3: Distributed matrix product (A split into horizontal slices 0-3, B into vertical slices 0-3)
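A sketch of the distribution phase and of the diagonal block computation; the variable names, the use of MPI_SCATTER and the resized slice datatype are illustrative choices, and the circulation of the slices of A as well as the final gathering on process 0 are left to the exercise:

    program matrix_product_sketch
      use mpi
      implicit none
      integer, parameter :: n = 8          ! assumed to be a multiple of the number of processes
      integer :: nl, nb_procs, rank, code, size_real
      integer :: type_slice, type_slice_resized
      real, allocatable, dimension(:,:) :: a, b, al, bl, cl

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      nl = n / nb_procs                    ! number of lines/columns per slice

      allocate(al(nl,n), bl(n,nl), cl(nl,nl))
      if (rank == 0) then
        allocate(a(n,n), b(n,n))
        call random_number(a); call random_number(b)
      else
        allocate(a(1,1), b(1,1))           ! dummy buffers, not used on non-root ranks
      end if

      ! Horizontal slice of A: n blocks of nl reals with a stride of n reals,
      ! resized so that consecutive slices start nl lines apart
      call MPI_TYPE_VECTOR(n, nl, n, MPI_REAL, type_slice, code)
      call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
      call MPI_TYPE_CREATE_RESIZED(type_slice, 0_MPI_ADDRESS_KIND, &
                                   int(nl*size_real, MPI_ADDRESS_KIND), type_slice_resized, code)
      call MPI_TYPE_COMMIT(type_slice_resized, code)

      ! One horizontal slice of A and one (contiguous) vertical slice of B per process
      call MPI_SCATTER(a, 1,    type_slice_resized, al, nl*n, MPI_REAL, 0, MPI_COMM_WORLD, code)
      call MPI_SCATTER(b, n*nl, MPI_REAL,           bl, n*nl, MPI_REAL, 0, MPI_COMM_WORLD, code)

      cl = matmul(al, bl)                  ! diagonal block of C owned by this process

      call MPI_FINALIZE(code)
    end program matrix_product_sketch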
The algorithm that may seem the most immediate and the easiest to program, in which each process sends its slice of matrix A to each of the others, does not perform well because the communication algorithm is not well balanced. It is easy to see this when doing performance measurements and graphically representing the collected traces. See the files produit_matrices_v1_n3200_p4.slog2, produit_matrices_v1_n6400_p8.slog2 and produit_matrices_v1_n6400_p16.slog2, using the Jumpshot viewer of MPE (Multi-Processing Environment).
Figure 4: Parallel matrix product on 4 processes, for a matrix size of 3200 (first algorithm)
Figure 5: Parallel matrix product on 16 processes, for a matrix size of 6400 (first algorithm)
By changing the algorithm so that the slices are shifted from process to process, we obtain a perfect balance between calculations and communications, and a speedup of 2 compared to the naive algorithm. See the figure produced by the file produit_matrices_v2_n6400_p16.slog2.
Figure 6: Parallel matrix product on 16 processes, for a matrix size of 6400 (second algorithm)
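Keeping the notations of the sketch given after figure 3 (al: local slice of A, bl: local slice of B, already distributed), one possible way to write this circulation uses MPI_SENDRECV_REPLACE; here cl holds the full vertical slice of C owned by the process, and all names remain illustrative:

    subroutine circulate_and_multiply(al, bl, cl, n, nl, rank, nb_procs)
      use mpi
      implicit none
      integer, intent(in)    :: n, nl, rank, nb_procs
      real,    intent(inout) :: al(nl, n)
      real,    intent(in)    :: bl(n, nl)
      real,    intent(out)   :: cl(n, nl)
      integer, parameter :: tag = 200
      integer :: previous, next, step, k, code
      integer, dimension(MPI_STATUS_SIZE) :: status

      previous = mod(rank - 1 + nb_procs, nb_procs)
      next     = mod(rank + 1, nb_procs)

      ! Diagonal block first, with the slice of A initially owned by this process
      cl(rank*nl+1:(rank+1)*nl, :) = matmul(al, bl)

      do step = 1, nb_procs - 1
        ! Send my current slice of A to the next process, receive the one from the previous
        call MPI_SENDRECV_REPLACE(al, nl*n, MPI_REAL, next, tag, &
                                  previous, tag, MPI_COMM_WORLD, status, code)
        ! The slice now held was originally owned by process rank-step
        k = mod(rank - step + nb_procs, nb_procs)
        cl(k*nl+1:(k+1)*nl, :) = matmul(al, bl)
      end do
    end subroutine circulate_and_multiply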
6 MPI Hands-On Exercise 6: Communicators
Using the Cartesian topology defined below, subdivide it into 2 communicators, one per line, by calling MPI_COMM_SPLIT().
Figure 7: Subdivision of a 2D topology and communication using the obtained 1D topology (line 0 holds ranks 0, 2, 4, 6 and line 1 holds ranks 1, 3, 5, 7; within each line, the array v(:)=1,2,3,4 is distributed so that each process gets one value w)
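A possible way to perform this splitting; the dimensions of the topology, the use of the line coordinate as the colour argument and the variable names are only illustrative:

    program communicators
      use mpi
      implicit none
      integer, parameter :: ndims = 2
      integer, dimension(ndims) :: dims = (/ 2, 4 /), coords
      logical, dimension(ndims) :: periods = .false.
      integer :: code, rank, comm_2d, comm_line, rank_in_line

      call MPI_INIT(code)
      call MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, .false., comm_2d, code)
      call MPI_COMM_RANK(comm_2d, rank, code)
      call MPI_CART_COORDS(comm_2d, rank, ndims, coords, code)

      ! All the processes of a same line share the same colour: one communicator per line
      call MPI_COMM_SPLIT(comm_2d, coords(1), rank, comm_line, code)
      call MPI_COMM_RANK(comm_line, rank_in_line, code)

      ! comm_line can now be used for collective communications inside each line,
      ! for example an MPI_SCATTER of v(:) so that each process gets its value w

      call MPI_FINALIZE(code)
    end program communicators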
7 MPI Hands-On Exercise 7: Read an MPI-IO file
We have a binary file data.dat containing 484 integer values. With 4 processes, the exercise consists of reading the first 121 values on process 0, the next 121 on process 1, and so on. We will use 4 different methods:
- Read via explicit offsets, in individual mode
- Read via shared file pointers, in collective mode
- Read via individual file pointers, in individual mode
- Read via shared file pointers, in individual mode
To compile and execute the code, use make; to verify the results, use make verification, which runs a visualisation program corresponding to the four cases.
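As an example, the first method (read via explicit offsets, in individual mode) could be sketched as follows; the variable names are illustrative, and the other methods follow the same pattern with the corresponding MPI-IO routines:

    program read_explicit_offsets
      use mpi
      implicit none
      integer, parameter :: nb_values = 121
      integer, dimension(nb_values) :: values
      integer :: rank, code, fh, size_integer
      integer(kind=MPI_OFFSET_KIND) :: offset
      integer, dimension(MPI_STATUS_SIZE) :: status

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      call MPI_FILE_OPEN(MPI_COMM_WORLD, "data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, fh, code)

      ! Each process reads its own 121 values; the offset is given in bytes
      call MPI_TYPE_SIZE(MPI_INTEGER, size_integer, code)
      offset = int(rank, MPI_OFFSET_KIND) * nb_values * size_integer
      call MPI_FILE_READ_AT(fh, offset, values, nb_values, MPI_INTEGER, status, code)

      call MPI_FILE_CLOSE(fh, code)
      call MPI_FINALIZE(code)
    end program read_explicit_offsets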
8 MPI Hands-On Exercise 8: Poisson's equation
Resolution of the following Poisson equation:
    ∂²u/∂x² + ∂²u/∂y² = f(x,y)   in [0,1]x[0,1]
    u(x,y) = 0                   on the boundaries
    f(x,y) = 2(x² - x + y² - y)
We will solve this equation with a domain decomposition method:
- The equation is discretized on the domain with a finite difference method.
- The obtained system is solved with a Jacobi solver.
- The global domain is split into sub-domains.
The exact solution is known and is u_exact(x,y) = x y (x-1)(y-1).
To discretize the equation, we define a grid with a set of points (x_i, y_j):
    x_i = i h_x   for i = 0,...,ntx+1
    y_j = j h_y   for j = 0,...,nty+1
with:
    h_x = 1/(ntx+1) : x-wise step
    h_y = 1/(nty+1) : y-wise step
    ntx : number of x-wise interior points
    nty : number of y-wise interior points
In total, there are ntx+2 points in the x direction and nty+2 points in the y direction.
Let u_ij be the estimated solution at position x_i = i h_x and y_j = j h_y. The Jacobi solver consists of computing:
    u(i,j)^(n+1) = c0 * ( c1*(u(i+1,j)^n + u(i-1,j)^n) + c2*(u(i,j+1)^n + u(i,j-1)^n) - f(i,j) )
with:
    c0 = (1/2) h_x² h_y² / (h_x² + h_y²)
    c1 = 1/h_x²
    c2 = 1/h_y²
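An illustrative Jacobi sweep over the interior points of a sub-domain, using the local bounds sx:ex, sy:ey shown later in figure 9 and assuming that u, u_new and f carry ghost cells:

    subroutine jacobi_step(u, u_new, f, sx, ex, sy, ey, hx, hy)
      implicit none
      integer, intent(in) :: sx, ex, sy, ey
      real(kind=8), intent(in)    :: hx, hy
      real(kind=8), intent(in)    :: u(sx-1:ex+1, sy-1:ey+1), f(sx-1:ex+1, sy-1:ey+1)
      real(kind=8), intent(inout) :: u_new(sx-1:ex+1, sy-1:ey+1)
      real(kind=8) :: c0, c1, c2
      integer :: i, j

      c0 = 0.5d0 * (hx*hx * hy*hy) / (hx*hx + hy*hy)
      c1 = 1.d0 / (hx*hx)
      c2 = 1.d0 / (hy*hy)

      ! Update only the interior points; the ghost cells come from the neighbours
      do j = sy, ey
        do i = sx, ex
          u_new(i, j) = c0 * ( c1*(u(i+1, j) + u(i-1, j)) &
                             + c2*(u(i, j+1) + u(i, j-1)) - f(i, j) )
        end do
      end do
    end subroutine jacobi_step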
In parallel, the interface values of subdomains must be exchanged between the neighbours. We use ghost cells as receive buffers.
Figure 8: Exchange points on the interfaces (with the N, S, E and W neighbours)
Figure 9: Numbering of points in the different sub-domains (interior points u(sx:ex, sy:ey); ghost cells at indices sx-1, ex+1, sy-1 and ey+1)
Figure 10: Process rank numbering in the sub-domains
Figure 11: Writing the global matrix u in a file
You need to:
- Define a view, to see only the owned part of the global matrix u;
- Define a type, in order to write the local part of the matrix u (without the interfaces);
- Apply the view to the file;
- Write using only one call.
A sketch of these four steps is given below.
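A possible sketch of these four steps using MPI_TYPE_CREATE_SUBARRAY, MPI_FILE_SET_VIEW and a single MPI_FILE_WRITE_ALL; the subroutine name, the argument list and the assumption that sx:ex, sy:ey are global indices are illustrative:

    subroutine write_u(u, ntx, nty, sx, ex, sy, ey)
      use mpi
      implicit none
      integer, intent(in) :: ntx, nty, sx, ex, sy, ey
      real(kind=8), intent(in) :: u(sx-1:ex+1, sy-1:ey+1)
      integer :: code, fh, type_interior, type_global
      integer, dimension(2) :: shape_local, shape_global, shape_interior, start
      integer(kind=MPI_OFFSET_KIND), parameter :: zero_offset = 0

      ! Datatype describing the interior of the local array (without the ghost cells)
      shape_local    = (/ ex-sx+3, ey-sy+3 /)
      shape_interior = (/ ex-sx+1, ey-sy+1 /)
      start          = (/ 1, 1 /)            ! skip the first ghost line/column (0-based starts)
      call MPI_TYPE_CREATE_SUBARRAY(2, shape_local, shape_interior, start, &
                                    MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, type_interior, code)
      call MPI_TYPE_COMMIT(type_interior, code)

      ! Datatype describing where this sub-domain sits inside the global ntx x nty matrix
      shape_global = (/ ntx, nty /)
      start        = (/ sx-1, sy-1 /)        ! global indices start at 1, subarray starts at 0
      call MPI_TYPE_CREATE_SUBARRAY(2, shape_global, shape_interior, start, &
                                    MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, type_global, code)
      call MPI_TYPE_COMMIT(type_global, code)

      call MPI_FILE_OPEN(MPI_COMM_WORLD, "data.dat", MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                         MPI_INFO_NULL, fh, code)
      ! The view lets each process see only its own part of the global matrix
      call MPI_FILE_SET_VIEW(fh, zero_offset, MPI_DOUBLE_PRECISION, type_global, &
                             "native", MPI_INFO_NULL, code)
      ! One single collective write of the local interior
      call MPI_FILE_WRITE_ALL(fh, u, 1, type_interior, MPI_STATUS_IGNORE, code)
      call MPI_FILE_CLOSE(fh, code)

      call MPI_TYPE_FREE(type_interior, code); call MPI_TYPE_FREE(type_global, code)
    end subroutine write_u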
The main steps of the parallel program are the following:
- Initialisation of the MPI environment.
- Creation of the 2D Cartesian topology.
- Determination of the array indices for each sub-domain.
- Determination of the 4 neighbour processes of each sub-domain.
- Creation of two derived datatypes, type_line and type_column.
- Exchange of the values on the interfaces with the other sub-domains.
- Computation of the global error. When the global error is lower than a specified value (the machine precision, for example), we consider that we have reached the exact solution.
- Gathering of the global matrix u (the same one as obtained in the sequential version) into an MPI-IO file data.dat.
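An illustrative sketch of the topology, neighbour and datatype steps, together with one of the four interface exchanges; the orientation of the dimensions, the meaning of N/S/E/W and all names are assumptions, and in the real exercise this code goes into module_parallel_mpi.f90:

    subroutine create_topology_and_exchange(u, sx, ex, sy, ey)
      use mpi
      implicit none
      integer, intent(in) :: sx, ex, sy, ey
      real(kind=8), intent(inout) :: u(sx-1:ex+1, sy-1:ey+1)
      integer, parameter :: N = 1, E = 2, S = 3, W = 4, tag = 300
      integer, dimension(2) :: dims = 0
      logical, dimension(2) :: periods = .false.
      integer, dimension(4) :: neighbour
      integer :: comm_2d, type_line, type_column, nb_procs, code
      integer, dimension(MPI_STATUS_SIZE) :: status

      ! 2D Cartesian topology
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_DIMS_CREATE(nb_procs, 2, dims, code)
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .false., comm_2d, code)

      ! The four neighbours (MPI_PROC_NULL on the physical boundaries)
      call MPI_CART_SHIFT(comm_2d, 0, 1, neighbour(N), neighbour(S), code)
      call MPI_CART_SHIFT(comm_2d, 1, 1, neighbour(W), neighbour(E), code)

      ! type_column: one column of the interior, contiguous in memory
      call MPI_TYPE_CONTIGUOUS(ex-sx+1, MPI_DOUBLE_PRECISION, type_column, code)
      ! type_line: one line of the interior, strided by the leading dimension
      call MPI_TYPE_VECTOR(ey-sy+1, 1, ex-sx+3, MPI_DOUBLE_PRECISION, type_line, code)
      call MPI_TYPE_COMMIT(type_column, code); call MPI_TYPE_COMMIT(type_line, code)

      ! Example of one exchange: send my first line to the North neighbour and
      ! receive its last line into my North ghost line; the three other
      ! directions are written in the same way
      call MPI_SENDRECV(u(sx,   sy), 1, type_line, neighbour(N), tag, &
                        u(sx-1, sy), 1, type_line, neighbour(N), tag, &
                        comm_2d, status, code)
    end subroutine create_topology_and_exchange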
Directory: tp8/poisson
A skeleton of the parallel version is provided: it consists of a main program (poisson.f90) and several subroutines. All the modifications have to be done in the module_parallel_mpi.f90 file. To compile and execute the code, use make; to verify the results, use make verification, which runs a program that reads the data.dat file and compares it with the sequential version.