Message Passing Interface (MPI)


1 Message Passing Interface (MPI)
Jalel Chergui, Isabelle Dupays, Denis Girou, Pierre-François Lavallée, Dimitri Lecas, Philippe Wautelet

2 MPI Plan I
1 Introduction
  Availability and updating
  Definitions
  Message Passing concepts
  History
  Library
2 Environment
  Description
  Example
3 Point-to-point Communications
  General Concepts
  Predefined MPI Datatypes
  Other Possibilities
  Example: communication ring
4 Collective communications
  General concepts
  Global synchronization: MPI_BARRIER()
  Broadcast: MPI_BCAST()
  Scatter: MPI_SCATTER()
  Gather: MPI_GATHER()
  Gather-to-all: MPI_ALLGATHER()
  Extended gather: MPI_GATHERV()

3 MPI Plan II
  4.8 All-to-all: MPI_ALLTOALL()
  4.9 Global reduction
  4.10 Additions
5 One-sided Communication
  5.1 Introduction
    Reminder: The Concept of Message-Passing
    The Concept of One-sided Communication
    RMA Approach of MPI
  5.2 Memory Window
  5.3 Data Transfer
  5.4 Completion of the Transfer: the synchronization
    Active Target Synchronization
    Passive Target Synchronization
  Conclusions
6 Derived datatypes
  Introduction
  Contiguous datatypes
  Constant stride
  Other subroutines
  Examples
    The datatype "matrix row"
    The datatype "matrix line"

4 MPI Plan III
    The datatype "matrix block"
  Homogeneous datatypes of variable strides
  Subarray Datatype Constructor
  Heterogeneous datatypes
  Additional subroutines
  Conclusion
7 Optimizations
  Introduction
  Point-to-Point Send Modes
    Key Terms
    Synchronous Sends
    Buffered Sends
    Standard Sends
    Ready Sends
    Performance of Send Modes
  Computation-Communication Overlap
8 Communicators
  Introduction
  Default Communicator
  Example
  Groups and Communicators
  Communicator Resulting from Another

5 MPI Plan IV
  8.6 Topologies
    Cartesian topologies
    Subdividing a cartesian topology
9 MPI-IO
  Introduction
    Presentation
    Issues
    Definitions
  File management
  Reads/Writes: general concepts
  Individual Reads/Writes
    Via explicit offsets
    Via individual file pointers
    Shared file pointers
  Collective Reads/Writes
    Via explicit offsets
    Via individual file pointers
    Via shared file pointers
  Explicit positioning of pointers in a file
  Definition of views
  NonBlocking Reads/Writes
    Via explicit displacements

6 MPI Plan V
    Via individual file pointers
  NonBlocking and collective Reads/Writes
  Advice
  Conclusion

7 7/246 1 Introduction
1 Introduction
  1.1 Availability and updating
  1.2 Definitions
  1.3 Message Passing concepts
  1.4 History
  1.5 Library
2 Environment
3 Point-to-point Communications
4 Collective communications
5 One-sided Communication
6 Derived datatypes
7 Optimizations
8 Communicators
9 MPI-IO
10 Conclusion

8 8/246 1 Introduction 1.1 Availability and updating
This document is likely to be updated regularly. The most recent version is available on the IDRIS Web server, in the section Support de cours:
IDRIS, Institut du développement et des ressources en informatique scientifique
Rue John Von Neumann, Bâtiment 506, BP, ORSAY CEDEX, France
Translated with the help of Yazeed Zabrie.

9 9/246 1 Introduction 1.2 Definitions
➊ The sequential programming model:
  the program is executed by one and only one process;
  all the variables and constants of the program are allocated in the memory of the process;
  a process is executed on a physical processor of the machine.
Figure 1: Sequential programming model

10 10/246 1 Introduction 1.2 Definitions
➋ In the message passing programming model:
  the program is written in a classic language (Fortran, C, C++, etc.);
  all the variables of the program are private and reside in the local memory of each process;
  each process may execute different parts of a program;
  a variable is exchanged between two or more processes via subroutine calls.
Figure 2: Message-Passing Programming Model

11 11/246 1 Introduction 1.2 Definitions
➌ The SPMD execution model: Single Program, Multiple Data;
  the same program is executed by all the processes;
  all machines support this programming model and some support only this one;
  it is a particular case of the more general MPMD model (Multiple Program, Multiple Data), which it can also emulate.
Figure 3: Single Program, Multiple Data

12 12/246 1 Introduction 1.3 Message Passing concepts
If a message is sent to a process, the latter must receive it.
Figure 4: Message Passing

13 13/246 1 Introduction 1.3 Message Passing concepts
A message consists of data chunks passed from the sending process to the receiving process. In addition to the data to be sent (scalar variables, arrays, etc.), a message has to contain the following information:
  the identifier of the sending process;
  the datatype;
  the length;
  the identifier of the receiving process.
Figure 5: Message Construction

14 14/246 1 Introduction 1.3 Message Passing concepts
The exchanged messages are interpreted and managed by an environment comparable to telephony, fax, postal mail, etc. The message is sent to a specified address. The receiving process must be able to classify and interpret the messages which are sent to it.
The environment in question is MPI (Message Passing Interface). An MPI application is a group of independent processes, each executing its own code and communicating via calls to MPI library subroutines.

15 15/246 1 Introduction 1.4 History
Version 1.0: in June 1994, the MPI Forum (Message Passing Interface Forum), with the participation of about forty organizations, arrived at the definition of a set of subroutines forming the MPI message-passing library.
Version 1.1: June 1995, with only minor changes.
Version 1.2: 1997, with minor changes for better consistency in the naming of some subroutines.
Version 1.3: September 2008, with clarifications of MPI 1.2, consistent with the clarifications made by MPI 2.1.
Version 2.0: released in July 1997, this version brought essential additions that had been deliberately left out of MPI 1.0 (dynamic process management, one-sided communications, parallel I/O, etc.).
Version 2.1: June 2008, with only clarifications of MPI 2.0 but no changes.
Version 2.2: September 2009, with only "small" additions.

16 16/246 1 Introduction 1.4 History
Version 3.0: September 2012, with changes and important additions compared to version 2.2; main changes:
  nonblocking collective communications;
  revision of the implementation of one-sided communications;
  Fortran (2008) bindings;
  C++ bindings removed;
  interfacing of external tools (for debugging and performance measurement);
  etc.
See the MPI Forum website.

17 17/246 1 Introduction 1.5 Library
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.0, High Performance Computing Center Stuttgart (HLRS).
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.2, High Performance Computing Center Stuttgart (HLRS).
William Gropp, Ewing Lusk and Anthony Skjellum: Using MPI: Portable Parallel Programming with the Message Passing Interface, second edition, MIT Press, 1999.
William Gropp, Ewing Lusk and Rajeev Thakur: Using MPI-2, MIT Press, 1999.
Peter S. Pacheco: Parallel Programming with MPI, Morgan Kaufmann, 1997.
Additional references:

18 18/246 1 Introduction 1.5 Library
MPI open source implementations: they can be installed on a large number of architectures, but their performance is generally below that of the vendor implementations.
1 MPICH2:
2 Open MPI:

19 19/246 1 Introduction 1.5 Library
The debuggers
1 Totalview
2 DDT
Tools for performance measurements
1 MPE: MPI Parallel Environment
2 Scalasca: Scalable Performance Analysis of Large-Scale Applications
3 Vampir

20 20/246 1 Introduction 1.5 Library
Some open source parallel scientific libraries:
1 ScaLAPACK: solvers for linear algebra problems using direct methods. The sources can be downloaded from the website.
2 PETSc: solvers for linear and non-linear algebra problems using iterative methods. The sources can be downloaded from the website.
3 MUMPS: MUltifrontal Massively Parallel sparse direct Solvers. The sources can be downloaded from the website.
4 FFTW: fast Fourier transforms. The sources can be downloaded from the website.

21 21/246 2 Environment
1 Introduction
2 Environment
  2.1 Description
  2.2 Example
3 Point-to-point Communications
4 Collective communications
5 One-sided Communication
6 Derived datatypes
7 Optimizations
8 Communicators
9 MPI-IO
10 Conclusion

22 22/246 2 Environment 2.1 Description
Every program unit calling MPI subroutines has to include a header file. In Fortran, we must use the mpi module introduced in MPI-2 (in MPI-1, it was the mpif.h file), and in C/C++ the mpi.h file.
The MPI_INIT() subroutine initializes the necessary environment:
  integer, intent(out) :: code
  call MPI_INIT(code)
The MPI_FINALIZE() subroutine disables this environment:
  integer, intent(out) :: code
  call MPI_FINALIZE(code)

23 23/246 2 Environment 2.1 Description
All the operations made by MPI are related to communicators. The default communicator is MPI_COMM_WORLD, which includes all the active processes.
Figure 6: MPI_COMM_WORLD Communicator

24 24/246 2 Environment 2.1 Description
At any moment, we can know the number of processes managed by a given communicator with the MPI_COMM_SIZE() subroutine:
  integer, intent(out) :: nb_procs,code
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
Similarly, the MPI_COMM_RANK() subroutine allows us to obtain the process rank (i.e. its instance number, which is a number between 0 and the value returned by MPI_COMM_SIZE() minus 1):
  integer, intent(out) :: rank,code
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

25 25/246 2 Environment 2.2 Example
program who_am_i
  use mpi
  implicit none
  integer :: nb_procs,rank,code

  call MPI_INIT(code)

  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  print *,'I am the process ',rank,' among ',nb_procs

  call MPI_FINALIZE(code)
end program who_am_i

> mpiexec -n 7 who_am_i
I am the process 3 among 7
I am the process 0 among 7
I am the process 4 among 7
I am the process 1 among 7
I am the process 5 among 7
I am the process 2 among 7
I am the process 6 among 7

26 26/246 3 Point-to-point Communications
1 Introduction
2 Environment
3 Point-to-point Communications
  3.1 General Concepts
  3.2 Predefined MPI Datatypes
  3.3 Other Possibilities
  3.4 Example: communication ring
4 Collective communications
5 One-sided Communication
6 Derived datatypes
7 Optimizations
8 Communicators
9 MPI-IO
10 Conclusion

27 27/246 3 Point-to-point Communications 3.1 General Concepts
A point-to-point communication occurs between two processes: one is called the sender process and the other one is called the receiver process.
Figure 7: Point-to-point communication

31 28/246 3 Point-to-point Communications 3.1 General Concepts
The sender and the receiver are identified by their rank in the communicator. The so-called message envelope is composed of:
  1 the rank of the sender process;
  2 the rank of the receiver process;
  3 the tag of the message;
  4 the communicator, which defines the process group and the communication context.
The exchanged data are of predefined types (integer, real, etc.) or of user-defined derived datatypes. In each case there are several communication modes, using different protocols, which will be discussed in Chapter 7.

32 29/246 3 Point-to-point Communications 3.1 General Concepts
program point_to_point
  use mpi
  implicit none

  integer, dimension(MPI_STATUS_SIZE) :: status
  integer, parameter                  :: tag=100
  integer                             :: rank,value,code

  call MPI_INIT(code)

  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  if (rank == 2) then
    value=1000
    call MPI_SEND(value,1,MPI_INTEGER,5,tag,MPI_COMM_WORLD,code)
  elseif (rank == 5) then
    call MPI_RECV(value,1,MPI_INTEGER,2,tag,MPI_COMM_WORLD,status,code)
    print *,'I, process 5, I have received ',value,' from the process 2'
  end if

  call MPI_FINALIZE(code)

end program point_to_point

> mpiexec -n 7 point_to_point
I, process 5, I have received 1000 from the process 2

33 30/246 3 Point-to-point Communications 3.2 Predefined MPI Datatypes
MPI Type               Fortran Type
MPI_INTEGER            INTEGER
MPI_REAL               REAL
MPI_DOUBLE_PRECISION   DOUBLE PRECISION
MPI_COMPLEX            COMPLEX
MPI_LOGICAL            LOGICAL
MPI_CHARACTER          CHARACTER
Table 1: Predefined MPI Datatypes (Fortran)

34 31/246 3 Point-to-point Communications 3.2 Predefined MPI Datatypes
MPI Type             C Type
MPI_CHAR             signed char
MPI_SHORT            signed short
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
Table 2: Predefined MPI Datatypes (C)

35 32/246 3 Point-to-point Communications 3.3 Other Possibilities
On the reception of a message, the rank of the sender and the tag can be replaced by the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards, respectively.
A communication with the dummy process of rank MPI_PROC_NULL has no effect.
MPI_STATUS_IGNORE is a predefined constant that can be used instead of the status variable.
MPI_SUCCESS is a predefined constant which allows testing the return code of an MPI subroutine.
There are syntactic variants, MPI_SENDRECV() and MPI_SENDRECV_REPLACE(), which combine a send and a receive operation (for the former, the receive memory zone must be different from the send one); a brief sketch of MPI_SENDRECV_REPLACE() is given below.
It is possible to create more complex data structures by using derived datatypes (see Chapter 6).
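The following is a minimal hedged sketch, not part of the original course examples, showing MPI_SENDRECV_REPLACE(): two processes swap a single integer using the same buffer for sending and receiving (the variable names are ours).

program swap_sketch
  use mpi
  implicit none
  integer, parameter :: tag=30
  integer            :: rank,partner,value,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! We suppose that we have exactly 2 processes: 0 exchanges with 1
  partner = mod(rank+1,2)
  value   = rank+1000

  ! The same buffer is sent to the partner and then overwritten
  ! with the value received from it
  call MPI_SENDRECV_REPLACE(value,1,MPI_INTEGER,partner,tag, &
                            partner,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE,code)

  print *,'I, process ',rank,', I now hold ',value

  call MPI_FINALIZE(code)
end program swap_sketch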

36 33/246 3 Point-to-point Communications 3.3 Other Possibilities
Figure 8: sendrecv Communication between the Processes 0 and 1
program sendrecv
  use mpi
  implicit none
  integer            :: rank,value,num_proc,code
  integer, parameter :: tag=100

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! We suppose that we have exactly 2 processes
  num_proc=mod(rank+1,2)

  call MPI_SENDRECV(rank+1000,1,MPI_INTEGER,num_proc,tag,value,1,MPI_INTEGER, &
                    num_proc,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE,code)

  print *,'I, process ',rank,', I have received ',value,' from the process ',num_proc

  call MPI_FINALIZE(code)
end program sendrecv

37 34/246 3 Point-to-point Communications 3.3 Other Possibilities
> mpiexec -n 2 sendrecv
I, process 1, I have received 1000 from the process 0
I, process 0, I have received 1001 from the process 1
Warning!
It must be noted that if the MPI_SEND() subroutine is implemented in a synchronous way (see Chapter 7) in the MPI implementation being used, the previous code would deadlock if, instead of the MPI_SENDRECV() subroutine, we used an MPI_SEND() followed by an MPI_RECV(). In this case, each of the two processes would wait for a receive which will never occur, because the two sends would stay suspended. Therefore, for correctness and portability reasons, it is absolutely necessary to avoid such situations.
  call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc,tag,MPI_COMM_WORLD,code)
  call MPI_RECV(value,1,MPI_INTEGER,num_proc,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE,code)

38 35/246 3 Point-to-point Communications 3.3 Other Possibilities 1 program joker 2 use mpi 3 implicit none 4 integer, parameter :: m=4,tag1=11,tag2=22 5 integer, dimension(m,m) :: A 6 integer :: nb_procs,rank,code 7 integer, dimension( MPI_STATUS_SIZE):: status 8 integer :: nb_elements,i 9 integer, dimension(:), allocatable :: C call MPI_INIT(code) 12 call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 13 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) A(:,:) = 0

39 36/246 3 Point-to-point Communications 3.3 Other Possibilities 16 if (rank == 0) then 17! Initialisation of the matrix A on the process 0 18 A(:,:) = reshape((/ (i,i=1,m*m) /), (/ m,m /)) 19! Sending of 2 elements of the matrix A to the process 1 20 call MPI_SEND(A(1,1),2, MPI_INTEGER,1,tag1, MPI_COMM_WORLD,code) 21! Sending of 3 elements of the matrix A to the process 2 22 call MPI_SEND(A(1,2),3, MPI_INTEGER,2,tag2, MPI_COMM_WORLD,code) 23 else 24! We check before the receive if a message has arrived and from which process 25 call MPI_PROBE( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,status,code) 26! We check how many elements must be received 27 call MPI_GET_COUNT(status, MPI_INTEGER,nb_elements,code) 28! We allocate the reception array C on each process 29 if (nb_elements /= 0 ) allocate (C(1:nb_elements)) 30! We receive the message 31 call MPI_RECV(C,nb_elements, MPI_INTEGER,status( MPI_SOURCE),status( MPI_TAG), & 32 MPI_COMM_WORLD,status,code) 33 print *, I, process,rank,, I have received,nb_elements, & 34 elements from the process, & 35 status( MPI_SOURCE), My array C is, C(:) 36 end if call MPI_FINALIZE(code) 39 end program joker

40 37/246 3 Point-to-point Communications 3.3 Other Possibilities
> mpiexec -n 3 sendrecv1
I, process 1, I have received 2 elements from the process 0 My array C is 1 2
I, process 2, I have received 3 elements from the process 0 My array C is 5 6 7

41 38/246 3 Point-to-point Communications 3.4 Example: communication ring
Figure 9: Communication Ring

43 39/246 3 Point-to-point Communications 3.4 Example: communication ring
If all the processes execute a send followed by a receive, all the communications could potentially start simultaneously and would therefore not occur in a ring way (in addition to the already mentioned portability problem if the implementation of MPI_SEND() is synchronous in the MPI library):
  ...
  value=rank+1000
  call MPI_SEND(value,1,MPI_INTEGER,num_proc_next,tag,MPI_COMM_WORLD,code)
  call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag,MPI_COMM_WORLD, &
                status,code)
  ...

44 40/246 3 Point-to-point Communications 3.4 Example: communication ring
To ensure that the communications really happen in a ring way, as in the exchange of a token between processes, it is necessary to use a different method and to have one process which starts the chain:
Figure 10: Communication Ring

52 41/246 3 Point-to-point Communications 3.4 Example: communication ring
program ring
  use mpi
  implicit none
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer, parameter                  :: tag=100
  integer                             :: nb_procs,rank,value, &
                                         num_proc_previous,num_proc_next,code
  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  num_proc_next=mod(rank+1,nb_procs)
  num_proc_previous=mod(nb_procs+rank-1,nb_procs)

  if (rank == 0) then
    call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc_next,tag, &
                  MPI_COMM_WORLD,code)
    call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag, &
                  MPI_COMM_WORLD,status,code)
  else
    call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag, &
                  MPI_COMM_WORLD,status,code)
    call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc_next,tag, &
                  MPI_COMM_WORLD,code)
  end if

  print *,'I, process ',rank,', I have received ',value,' from process ', &
          num_proc_previous

  call MPI_FINALIZE(code)
end program ring

53 42/246 3 Point-to-point Communications 3.4 Example: communication ring
> mpiexec -n 7 ring
I, process 1, I have received 1000 from process 0
I, process 2, I have received 1001 from process 1
I, process 3, I have received 1002 from process 2
I, process 4, I have received 1003 from process 3
I, process 5, I have received 1004 from process 4
I, process 6, I have received 1005 from process 5
I, process 0, I have received 1006 from process 6

54 43/246 4 Collective communications
1 Introduction
2 Environment
3 Point-to-point Communications
4 Collective communications
  4.1 General concepts
  4.2 Global synchronization: MPI_BARRIER()
  4.3 Broadcast: MPI_BCAST()
  4.4 Scatter: MPI_SCATTER()
  4.5 Gather: MPI_GATHER()
  4.6 Gather-to-all: MPI_ALLGATHER()
  4.7 Extended gather: MPI_GATHERV()
  4.8 All-to-all: MPI_ALLTOALL()
  4.9 Global reduction
  4.10 Additions
5 One-sided Communication
6 Derived datatypes
7 Optimizations
8 Communicators
9 MPI-IO
10 Conclusion

55 44/246 4 Collective communications 4.1 General concepts
Collective communications make a series of point-to-point communications in one single call.
A collective communication always concerns all the processes of the indicated communicator.
For each process, the call ends when its participation in the collective call is completed, in the sense of point-to-point communications (i.e. when the concerned memory area can be changed). It is useless to add a global synchronization (barrier) after a collective call.
The management of tags in these communications is transparent and system-dependent. Therefore, they are never explicitly defined when calling these subroutines. Among other advantages, this means that collective communications never interfere with point-to-point communications.

56 45/246 4 Collective communications 4.1 General concepts
There are three types of subroutines:
❶ the one which ensures global synchronization: MPI_BARRIER();
❷ the ones which only transfer data:
  global distribution of data: MPI_BCAST();
  selective distribution of data: MPI_SCATTER();
  collection of distributed data: MPI_GATHER();
  collection, by all the processes, of distributed data: MPI_ALLGATHER();
  selective distribution, by all the processes, of distributed data: MPI_ALLTOALL();
❸ the ones which, in addition to managing the communications, carry out operations on the transferred data:
  reduction operations (sum, product, maximum, minimum, etc.), whether predefined or user-defined: MPI_REDUCE();
  reduction operations with broadcast of the result (in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()): MPI_ALLREDUCE().

57 46/246 4 Collective communications 4.2 Global synchronization: MPI_BARRIER()
Figure 11: Global Synchronization: MPI_BARRIER()
  integer, intent(out) :: code
  call MPI_BARRIER(MPI_COMM_WORLD,code)

60 47/246 4 Collective communications 4.3 Broadcast: MPI_BCAST()
Figure 12: Broadcast: MPI_BCAST()

65 48/246 4 Collective communications 4.3 Broadcast: MPI_BCAST()
program bcast
  use mpi
  implicit none

  integer :: rank,value,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  if (rank == 2) value=rank+1000

  call MPI_BCAST(value,1,MPI_INTEGER,2,MPI_COMM_WORLD,code)

  print *,'I, process ',rank,' I have received ',value,' of the process 2'

  call MPI_FINALIZE(code)

end program bcast

> mpiexec -n 4 bcast
I, process 2 I have received 1002 of the process 2
I, process 0 I have received 1002 of the process 2
I, process 1 I have received 1002 of the process 2
I, process 3 I have received 1002 of the process 2

66 49/246 4 Collective communications 4.4 Scatter: MPI_SCATTER()
Figure 13: Scatter: MPI_SCATTER()

71 50/246 4 Collective communications 4.4 Scatter : MPI_SCATTER() 1 program scatter 2 3 use mpi implicit none integer, parameter integer :: nb_values=8 :: nb_procs,rank,block_length,i,code 7 real, allocatable, dimension(:) :: values,data 8 9 call MPI_INIT(code) 10 call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) 12 block_length=nb_values/nb_procs 13 allocate(data(block_length)) if (rank == 2) then 16 allocate(values(nb_values)) values(:)=(/(1000.+i,i=1,nb_values)/) print *, I, process,rank, send my values array :,& end if values(1:nb_values) call MPI_SCATTER(values,block_length, MPI_REAL,data,block_length, & 23 MPI_REAL,2, MPI_COMM_WORLD,code) print *, I, process,rank,, I have received, data(1:block_length), & of the process 2 26 call MPI_FINALIZE(code) end program scatter > mpiexec -n 4 scatter I, process 2 send my values array : I, process 0, I have received of the process 2 I, process 1, I have received of the process 2 I, process 3, I have received of the process 2 I, process 2, I have received of the process 2

72 51/246 4 Collective communications 4.5 Gather: MPI_GATHER()
Figure 14: Gather: MPI_GATHER()

77 52/246 4 Collective communications 4.5 Gather : MPI_GATHER() 1 program gather 2 use mpi 3 implicit none 4 integer, parameter :: nb_values=8 5 integer :: nb_procs,rank,block_length,i,code 6 real, dimension(nb_values) :: data 7 real, allocatable, dimension(:) :: values 8 9 call MPI_INIT(code) 10 call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) block_length=nb_values/nb_procs allocate(values(block_length)) values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/) print *, I, process,rank, send my values array :,& 19 values(1:block_length) call MPI_GATHER(values,block_length, MPI_REAL,data,block_length, & 22 MPI_REAL,2, MPI_COMM_WORLD,code) if (rank == 2) print *, I, process 2, have received,data(1:nb_values) call MPI_FINALIZE(code) end program gather > mpiexec -n 4 gather I, process 1 send my values array : I, process 0 send my values array : I, process 2 send my values array : I, process 3 send my values array : I, process 2 have received

78 53/246 4 Collective communications 4.6 Gather-to-all: MPI_ALLGATHER()
Figure 15: Allgather: MPI_ALLGATHER()

83 54/246 4 Collective communications 4.6 Gather-to-all : MPI_ALLGATHER() 1 program allgather 2 3 use mpi implicit none integer, parameter integer :: nb_values=8 :: nb_procs,rank,block_length,i,code 7 real, dimension(nb_values) :: data 8 real, allocatable, dimension(:) :: values 9 10 call MPI_INIT(code) call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 13 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) block_length=nb_values/nb_procs 16 allocate(values(block_length)) values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/) call MPI_ALLGATHER(values,block_length, MPI_REAL,data,block_length, & 21 MPI_REAL, MPI_COMM_WORLD,code) print *, I, process,rank,, I have received,data(1:nb_values) call MPI_FINALIZE(code) end program allgather > mpiexec -n 4 allgather I, process 1, I have received I, process 3, I have received I, process 2, I have received I, process 0, I have received

84 55/246 4 Collective communications 4.7 Extended gather: MPI_GATHERV()
Figure 16: Gatherv: MPI_GATHERV()

89 56/246 4 Collective communications 4.7 Extended gather : MPI_GATHERV() 1 program gatherv 2 use mpi 3 implicit none 4 INTEGER, PARAMETER :: nb_values=10 5 INTEGER :: nb_procs, rank, block_length, i, code 6 REAL, DIMENSION(nb_values) :: data,remainder 7 REAL, ALLOCATABLE, DIMENSION(:) :: values 8 INTEGER, ALLOCATABLE, DIMENSION(:) :: nb_elements_received,displacement 9 10 CALL MPI_INIT(code) 11 CALL MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 12 CALL MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) block_length=nb_values/nb_procs 15 remainder=mod(nb_values,nb_procs) 16 if(remainder > rank) block_length = block_length ALLOCATE(values(block_length)) 18 values(:) = (/(1000.+(rank*(nb_values/nb_procs))+min(rank,remainder)+i, & 19 i=1,block_length)/) PRINT *, I, process, rank, send my values array :,& 22 values(1:block_length) IF (rank == 2) THEN 25 ALLOCATE(nb_elements_received(nb_procs),displacement(nb_procs)) 26 nb_elements_received(1) = nb_values/nb_procs 27 if (remainder > 0) nb_elements_received(1)=nb_elements_received(1) displacement(1) = 0 DO i=2,nb_procs 30 displacement(i) = displacement(i-1)+nb_elements_received(i-1) 31 nb_elements_received(i) = nb_values/nb_procs 32 if (remainder > i-1) nb_elements_received(i)=nb_elements_received(i)+1 33 END DO 34 END IF 35

90 57/246 4 Collective communications 4.7 Extended gather : MPI_GATHERV() CALL MPI_GATHERV(values,block_length, MPI_REAL,data,nb_elements_received,& displacement, MPI_REAL,2, MPI_COMM_WORLD,code) IF (rank == 2) PRINT *, I, processus 2 receives, data(1:nb_values) CALL MPI_FINALIZE(code) end program gatherv > mpiexec -n 4 gatherv I, process 0 send my values array : I, process 2 send my values array : I, process 3 send my values array : I, process 1 send my values array : I, process 2 receives

91 58/246 4 Collective communications 4.8 All-to-all: MPI_ALLTOALL()
Figure 17: Alltoall: MPI_ALLTOALL()

96 59/246 4 Collective communications 4.8 All-to-all : MPI_ALLTOALL() 1 program alltoall 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=8 6 integer :: nb_procs,rank,block_length,i,code 7 real, dimension(nb_values) :: values,data 8 9 call MPI_INIT(code) 10 call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) values(:)=(/(1000.+rank*nb_values+i,i=1,nb_values)/) 14 block_length=nb_values/nb_procs print *, I, process,rank, send my values array :,& 17 values(1:nb_values) call MPI_ALLTOALL(values,block_length, MPI_REAL,data,block_length, & 20 MPI_REAL, MPI_COMM_WORLD,code) print *, I, process,rank,, I have received,data(1:nb_values) call MPI_FINALIZE(code) 25 end program alltoall

97 60/246 4 Collective communications 4.8 All-to-all : MPI_ALLTOALL() > mpiexec -n 4 alltoall I, process 1 send my values array : I, process 0 send my values array : I, process 2 send my values array : I, process 3 send my values array : I, process 0, I have received I, process 2, I have received I, process 1, I have received I, process 3, I have received

98 61/246 4 Collective communications 4.9 Global reduction
A reduction is an operation applied to a set of elements in order to obtain one single value. Classical examples are the sum of the elements of a vector (SUM(A(:))) or the search for the maximum-value element in a vector (MAX(V(:))).
MPI provides high-level subroutines in order to operate reductions on data distributed over a group of processes. The result is obtained on one process (MPI_REDUCE()) or on all of them (MPI_ALLREDUCE(), which is in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()).
If several elements are involved per process, the reduction operation is applied to each one of them.
The MPI_SCAN() subroutine also allows partial (prefix) reductions, by considering, for each process, the previous processes of the group and itself; a short sketch is given below.
The MPI_OP_CREATE() and MPI_OP_FREE() subroutines allow user-defined reduction operations.
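As a complement (not part of the original course examples), here is a minimal hedged sketch of MPI_SCAN() computing a prefix sum: each process obtains the sum of the values held by the processes of rank 0 up to its own rank.

program scan_sketch
  use mpi
  implicit none
  integer :: rank,value,partial_sum,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  value = rank+1

  ! Partial (prefix) reduction: process k receives value(0)+...+value(k)
  call MPI_SCAN(value,partial_sum,1,MPI_INTEGER,MPI_SUM,MPI_COMM_WORLD,code)

  print *,'I, process ',rank,', partial sum = ',partial_sum

  call MPI_FINALIZE(code)
end program scan_sketch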

99 62/246 4 Collective communications 4.9 Global reduction
Table 3: Main Predefined Reduction Operations (there are also other logical operations)
Name         Operation
MPI_SUM      Sum of elements
MPI_PROD     Product of elements
MPI_MAX      Maximum of elements
MPI_MIN      Minimum of elements
MPI_MAXLOC   Maximum of elements and location
MPI_MINLOC   Minimum of elements and location
MPI_LAND     Logical AND
MPI_LOR      Logical OR
MPI_LXOR     Logical exclusive OR
(A sketch illustrating MPI_MAXLOC is given below.)
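The following hedged sketch (ours, not from the course) illustrates MPI_MAXLOC from the table above: each process contributes a (value, rank) pair through the predefined MPI_2INTEGER pair datatype, and process 0 obtains the maximum value together with the rank that holds it.

program maxloc_sketch
  use mpi
  implicit none
  integer, dimension(2) :: in_pair,out_pair   ! (value, location)
  integer               :: rank,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  in_pair(1) = mod(7*rank,5)   ! arbitrary illustrative value held by this process
  in_pair(2) = rank            ! its location, here simply the rank

  ! MPI_MAXLOC and MPI_MINLOC operate on pair datatypes such as MPI_2INTEGER
  call MPI_REDUCE(in_pair,out_pair,1,MPI_2INTEGER,MPI_MAXLOC,0,MPI_COMM_WORLD,code)

  if (rank == 0) print *,'maximum value ',out_pair(1),' held by process ',out_pair(2)

  call MPI_FINALIZE(code)
end program maxloc_sketch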

100 63/246 4 Collective communications 4.9 Global reduction
Figure 18: Distributed reduction (sum)

108 64/246 4 Collective communications 4.9 Global reduction
program reduce
  use mpi
  implicit none
  integer :: nb_procs,rank,value,sum,code

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  if (rank == 0) then
    value=1000
  else
    value=rank
  endif

  call MPI_REDUCE(value,sum,1,MPI_INTEGER,MPI_SUM,0,MPI_COMM_WORLD,code)

  if (rank == 0) then
    print *,'I, process 0, I have the global sum value ',sum
  end if

  call MPI_FINALIZE(code)
end program reduce

> mpiexec -n 7 reduce
I, process 0, I have the global sum value 1021

109 65/246 4 Collective communications 4.9 Global reduction
Figure 19: Distributed reduction (product) with broadcast of the result

118 66/246 4 Collective communications 4.9 Global reduction
program allreduce
  use mpi
  implicit none

  integer :: nb_procs,rank,value,product,code

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  if (rank == 0) then
    value=10
  else
    value=rank
  endif

  call MPI_ALLREDUCE(value,product,1,MPI_INTEGER,MPI_PROD,MPI_COMM_WORLD,code)

  print *,'I, process ',rank,', I have received the value of the global product ',product

  call MPI_FINALIZE(code)

end program allreduce

119 67/246 4 Collective communications 4.9 Global reduction
> mpiexec -n 7 allreduce
I, process 6, I have received the value of the global product 7200
I, process 2, I have received the value of the global product 7200
I, process 0, I have received the value of the global product 7200
I, process 4, I have received the value of the global product 7200
I, process 5, I have received the value of the global product 7200
I, process 3, I have received the value of the global product 7200
I, process 1, I have received the value of the global product 7200

120 68/246 4 Collective communications 4.10 Additions
The MPI_SCATTERV(), MPI_GATHERV(), MPI_ALLGATHERV() and MPI_ALLTOALLV() subroutines extend MPI_SCATTER(), MPI_GATHER(), MPI_ALLGATHER() and MPI_ALLTOALL() to the case where the number of elements to transmit or gather differs from one process to another.
Two new subroutines have been added to extend the possibilities of collective subroutines in some particular cases:
  MPI_ALLTOALLW(): a version of MPI_ALLTOALLV() where the displacements are expressed in bytes and not in elements;
  MPI_EXSCAN(): exclusive version of MPI_SCAN() (a short sketch follows below).
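Here is a minimal hedged sketch (ours, not from the course) of MPI_EXSCAN(): the exclusive prefix sum gives each process the sum of the contributions of the lower-ranked processes only, which is a convenient way of computing offsets; note that the result is undefined on process 0, so we set it to zero explicitly.

program exscan_sketch
  use mpi
  implicit none
  integer :: rank,nb_local,offset,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Number of elements owned locally (arbitrary illustrative value)
  nb_local = rank+1

  ! Exclusive prefix sum: offset = sum of nb_local over the ranks 0..rank-1
  call MPI_EXSCAN(nb_local,offset,1,MPI_INTEGER,MPI_SUM,MPI_COMM_WORLD,code)
  if (rank == 0) offset = 0   ! the result is undefined on process 0

  print *,'I, process ',rank,', my elements start at global index ',offset

  call MPI_FINALIZE(code)
end program exscan_sketch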

121 69/246 5 One-sided Communication
1 Introduction
2 Environment
3 Point-to-point Communications
4 Collective communications
5 One-sided Communication
  5.1 Introduction
    Reminder: The Concept of Message-Passing
    The Concept of One-sided Communication
    RMA Approach of MPI
  5.2 Memory Window
  5.3 Data Transfer
  5.4 Completion of the Transfer: the synchronization
    Active Target Synchronization
    Passive Target Synchronization
  Conclusions
6 Derived datatypes
7 Optimizations
8 Communicators
9 MPI-IO
10 Conclusion

122 70/246 5 One-sided Communication 5.1 Introduction
There are various approaches for transferring data between two different processes. Among the most commonly used are:
➊ point-to-point communications by message-passing (MPI, etc.);
➋ one-sided communications (direct access to the memory of a remote process). Also called RMA, for Remote Memory Access, this is one of the major contributions of MPI.

123 71/246 5 One-sided Communication 5.1 Introduction
Reminder: The Concept of Message-Passing
Figure 20: Message-Passing
In message-passing, a sender (origin) sends a message to a destination process (target), which does what is necessary to receive this message. This requires that the sender as well as the receiver be involved in the communication. This can be restrictive and difficult to implement in some algorithms (for example, when it is necessary to manage a global counter).

124 72/246 5 One-sided Communication 5.1 Introduction
The Concept of One-sided Communication
The concept of one-sided communication is not new, MPI having simply unified the already existing vendor solutions (such as shmem (CRAY), lapi (IBM), etc.) by offering its own RMA primitives.
Through these subroutines, a process has direct access (in read, write or update) to the memory of another, remote process. In this approach, the remote process does not have to participate in the data-transfer process.
The principal advantages are the following:
  enhanced performance when the hardware allows it;
  simpler programming for some algorithms.

125 73/246 5 One-sided Communication 5.1 Introduction
RMA Approach of MPI
The use of MPI RMA is done in three steps:
➊ definition, on each process, of a memory area (local memory window) visible and possibly accessible to remote processes;
➋ start of the data transfer, directly from the memory of one process to the memory of another process. It is therefore necessary to specify the type, the number and the initial and final locations of the data;
➌ completion of the pending transfers by a synchronization step; the data are then available.

126 74/246 5 One-sided Communication 5.2 Memory Window
All the processes participating in a one-sided communication have to specify which part of their memory will be made available to the other processes; this is the notion of the memory window.
More precisely, the MPI_WIN_CREATE() collective operation allows the creation of an MPI window object. This object is composed, for each process, of a specific memory area called the local memory window.
For each process, a local memory window is characterized by its initial address, its size in bytes (which can be zero) and the size of the displacement unit inside this window (in bytes). These characteristics can be different on each process.

127 75/246 5 One-sided Communication 5.2 Memory Window
Example: two MPI window objects, win1 and win2, each composed of one local window per process.
Figure 21: Creation of two MPI window objects, win1 and win2

128 76/246 5 One-sided Communication 5.2 Memory Window
Once the transfers are finished, the window has to be freed with the MPI_WIN_FREE() subroutine.
MPI_WIN_GET_ATTR() allows us to obtain the characteristics of a local memory window by using the keywords MPI_WIN_BASE, MPI_WIN_SIZE or MPI_WIN_DISP_UNIT.
Remark: The choice of the displacement unit associated with the local memory window is important (essential in a heterogeneous environment, and it facilitates coding in all cases). The size of an MPI datatype is obtained by calling the MPI_TYPE_SIZE() subroutine.

129 77/246 5 One-sided Communication 5.2 Memory Window 1 program window 2 3 use mpi 4 implicit none 5 6 integer :: code, rank, size_real, win, n=4 7 integer (kind= MPI_ADDRESS_KIND) :: dim_win, size, base, unit 8 real(kind=kind(1.d0)), dimension(:), allocatable :: win_local 9 logical :: flag call MPI_INIT(code) 12 call MPI_COMM_RANK( MPI_COMM_WORLD, rank, code) 13 call MPI_TYPE_SIZE( MPI_DOUBLE_PRECISION,size_real,code) if (rank==0) n=0 16 allocate(win_local(n)) 17 dim_win = size_real*n call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, & 20 MPI_COMM_WORLD, win, code) 21 call MPI_WIN_GET_ATTR(win, MPI_WIN_SIZE, size, flag, code) 22 call MPI_WIN_GET_ATTR(win, MPI_WIN_BASE, base, flag, code) 23 call MPI_WIN_GET_ATTR(win, MPI_WIN_DISP_UNIT, unit, flag, code) 24 call MPI_WIN_FREE(win,code) 25 print *,"process", rank,"size, base, unit = " & 26,size, base, unit 27 call MPI_FINALIZE(code) 28 end program window > mpiexec -n 3 window process 1 size, base, unit = process 0 size, base, unit = process 2 size, base, unit =

130 78/246 5 One-sided Communication 5.3 Data Transfer
MPI allows a process to read (MPI_GET()), to write (MPI_PUT()) and to update (MPI_ACCUMULATE()) data located in the local memory window of a remote process.
The process which calls the transfer-initialization subroutine is called the origin; the process which owns the local memory window used in the transfer is called the target.
During the initialization of the transfer, the target process does not call any MPI subroutine. All the necessary information is specified as parameters of the MPI subroutine called by the origin.

131 79/246 5 One-sided Communication 5.3 Data Transfer
In particular, we find:
parameters related to the origin:
  the datatype of the elements;
  their number;
  the memory address of the first element;
parameters related to the target:
  the rank of the target as well as the MPI window object, which uniquely determines a local memory window;
  a displacement in this local window;
  the number of elements to be transferred and their datatype.

132 80/246 5 One-sided Communication 5.3 Data Transfer
program example_put
  integer                         :: nb_orig=10, nb_target=10, target=1, win, code
  integer (kind=MPI_ADDRESS_KIND) :: displacement=40
  integer, dimension(10)          :: B
  ...
  call MPI_PUT(B,nb_orig,MPI_INTEGER,target,displacement,nb_target,MPI_INTEGER,win,code)
  ...
end program example_put
Figure 22: Example of an MPI_PUT (displacement of 40 units in the local window of the target)

133 81/246 5 One-sided Communication 5.3 Data Transfer
Remarks:
The MPI_GET() syntax is identical to the MPI_PUT() syntax; only the direction of the data transfer is reversed.
The RMA data-transfer subroutines are nonblocking primitives (a deliberate choice of MPI).
On the target process, the only accessible data are the ones contained in the local memory window.
MPI_ACCUMULATE() admits among its parameters an operation which must be either MPI_REPLACE or one of the predefined reduction operations: MPI_SUM, MPI_PROD, MPI_MAX, etc. It can in no case be an operation defined by the user.

134 82/246 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization
The data transfer starts after the call to one of the nonblocking subroutines (MPI_PUT(), ...). But when is the transfer completed and when are the data really available? After a synchronization by the programmer.
These synchronizations can be classified into two types:
➊ active target synchronization (a collective operation: all the processes associated with the window are involved in the synchronization);
➋ passive target synchronization (only the origin process calls the synchronization subroutine).

135 83/246 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization
Active Target Synchronization
It is done by using the MPI_WIN_FENCE() subroutine.
MPI_WIN_FENCE() is a collective operation on all the processes associated with the MPI window object.
MPI_WIN_FENCE() acts as a synchronization barrier. It waits for the end of all the data transfers (RMA or not) using the local memory window and initiated since the last call to MPI_WIN_FENCE().
This primitive helps to separate the computation parts of the code (where the data of the local memory window are used via loads or stores) from the RMA data-transfer parts.
The assert argument of the MPI_WIN_FENCE() primitive, of integer type, allows its behavior to be refined for better performance. Several values are predefined: MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE, MPI_MODE_NOSUCCEED. A zero value for this argument is always valid. A short sketch using these assertions follows below.
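As an illustration of the assert argument (our hedged sketch, not from the course), the predefined assertions can be combined with IOR(): the first fence opens an epoch that no RMA call precedes, and the last fence closes an epoch during which the local windows were not modified by local stores and after which no RMA call follows. The sketch assumes at least 2 processes.

program fence_assert_sketch
  use mpi
  implicit none
  integer, parameter              :: n=4
  integer                         :: code, rank, size_real, win, target
  integer (kind=MPI_ADDRESS_KIND) :: displacement, dim_win
  real(kind=kind(1.d0)), dimension(n) :: win_local, buffer

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION,size_real,code)

  dim_win = size_real*n
  call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, code)

  buffer(:) = rank; win_local(:) = -1.d0

  ! Open an access epoch: no RMA call precedes this fence
  call MPI_WIN_FENCE(MPI_MODE_NOPRECEDE, win, code)

  if (rank == 0) then
    target = 1; displacement = 0
    call MPI_PUT(buffer, n, MPI_DOUBLE_PRECISION, target, displacement, &
                 n, MPI_DOUBLE_PRECISION, win, code)
  end if

  ! Close the epoch: the local windows were not touched by local stores
  ! during the epoch, and no RMA call follows this fence
  call MPI_WIN_FENCE(IOR(MPI_MODE_NOSTORE, MPI_MODE_NOSUCCEED), win, code)

  call MPI_WIN_FREE(win, code)
  call MPI_FINALIZE(code)
end program fence_assert_sketch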

136 84/246 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization
Remarks:
The choice of nonblocking RMA subroutines for initiating the transfers, together with a synchronization for their completion, allows the implementation to gather, during the execution, different transfers towards the same target into one single transfer. The latency effects are therefore reduced and the performance is improved.
The collective character of the synchronization means that we are no longer really doing one-sided communication: all the processes of the communicator have to participate in the synchronization, which makes it less interesting!

137 85/246 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization
The Proper Use of MPI_WIN_FENCE()
It is important to make sure that, between two successive calls to MPI_WIN_FENCE(), there are either only local operations (loads/stores) on variables that are in the local memory window of the process, or only RMA operations (MPI_PUT() or MPI_ACCUMULATE()), but never both at once!
[Figure: processes 0 and 1 separated by successive MPI_WIN_FENCE() calls; between fences appear a local update of the local window (win_loc(:) = win_loc(:) ...), an MPI_PUT(), and a code portion labelled 1.]
Is the previous program compliant with the proper use of MPI_WIN_FENCE()? It all depends on the code portion represented by 1. If the latter does not produce a load/store on the local window (assignment or use of a variable stored in the local window), then it is correct; otherwise, the result is undefined.

138 86/246 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization
[Figure: timelines of P0 (working on the array tab) and P1 (working on its local window win_local), separated by the successive MPI_WIN_FENCE() calls of the ex_fence program (l. 30, 37, 44, 50, 57, 65); between the fences appear the setup (l. 1-30), the MPI_PUT (l. 31-35), local computations (l. 38-42), the MPI_GET (l. 45-49), further local computations (l. 51-55) and the MPI_ACCUMULATE (l. 58-63).]
Figure 23: Example corresponding to the ex_fence program

139 87/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization

 1 program ex_fence
 2   use mpi
 3   implicit none
 4
 5   integer, parameter :: assert=0
 6   integer :: code, rank, size_real, win, i, nb_elements, target, m=4, n=4
 7   integer(kind=MPI_ADDRESS_KIND) :: displacement, dim_win
 8   real(kind=kind(1.d0)), dimension(:), allocatable :: win_local, tab
 9
10   call MPI_INIT(code)
11   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
12   call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION, size_real, code)
13
14   if (rank==0) then
15     n=0
16     allocate(tab(m))
17   endif
18
19   allocate(win_local(n))
20   dim_win = size_real*n
21
22   call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
23                       MPI_COMM_WORLD, win, code)

140 88/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization

24   if (rank==0) then
25     tab(:) = (/ (i, i=1,m) /)
26   else
27     win_local(:) =
28   end if
29
30   call MPI_WIN_FENCE(assert,win,code)
31   if (rank==0) then
32     target = 1; nb_elements = 2; displacement = 1
33     call MPI_PUT(tab, nb_elements, MPI_DOUBLE_PRECISION, target, displacement, &
34                  nb_elements, MPI_DOUBLE_PRECISION, win, code)
35   end if
36
37   call MPI_WIN_FENCE(assert,win,code)
38   if (rank==0) then
39     tab(m) = sum(tab(1:m-1))
40   else
41     win_local(n) = sum(win_local(1:n-1))
42   endif
43
44   call MPI_WIN_FENCE(assert,win,code)
45   if (rank==0) then
46     nb_elements = 1; displacement = m-1
47     call MPI_GET(tab, nb_elements, MPI_DOUBLE_PRECISION, target, displacement, &
48                  nb_elements, MPI_DOUBLE_PRECISION, win, code)
49   end if

141 89/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization

50   call MPI_WIN_FENCE(assert,win,code)
51   if (rank==0) then
52     tab(m) = sum(tab(1:m-1))
53   else
54     win_local(:) = win_local(:)
55   endif
56
57   call MPI_WIN_FENCE(assert,win,code)
58   if (rank==0) then
59     nb_elements = m-1; displacement = 1
60     call MPI_ACCUMULATE(tab(2), nb_elements, MPI_DOUBLE_PRECISION, target, &
61                         displacement, nb_elements, MPI_DOUBLE_PRECISION, &
62                         MPI_SUM, win, code)
63   end if
64
65   call MPI_WIN_FENCE(assert,win,code)
66   call MPI_WIN_FREE(win,code)
67
68   if (rank==0) then
69     print *,"process", rank, "tab=",tab(:)
70   else
71     print *,"process", rank, "win_local=",win_local(:)
72   endif
73
74   call MPI_FINALIZE(code)
75 end program ex_fence

142 90/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization Some Details and Restrictions... It is possible to work on different local memory windows which overlap, even if this is not recommended (such use entails many restrictions). In what follows we will assume that this is never the case. A store and a call to the MPI_PUT() or MPI_ACCUMULATE() subroutine that access the same local memory window must always be separated by a call to MPI_WIN_FENCE(), even if they touch different locations that do not overlap. Between two successive calls to the MPI_WIN_FENCE() subroutine, we have the following constraints : the MPI_PUT() subroutines do not allow overlapping inside the same local memory window. In other words, the memory areas involved in several MPI_PUT() calls that modify the same local memory window must not overlap ;

143 91/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization the MPI_ACCUMULATE() subroutines allow overlapping inside the same local memory window, on the condition that the datatypes and the reduction operations used be identical in all of these calls ; consecutive uses of the MPI_PUT() and MPI_ACCUMULATE() subroutines do not allow overlapping inside the same local memory window ; a load and a call to the MPI_GET() subroutine can concurrently access any part of the local window, provided that this part was not previously updated either by a store or by a call to the MPI_PUT() or MPI_ACCUMULATE() subroutine.

144 92/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization Passive Target Synchronization It is made via calls to the MPI_WIN_LOCK() and MPI_WIN_UNLOCK() subroutines. Unlike the synchronization by MPI_WIN_FENCE() (which is a collective operation of barrier type), here only the origin process participates in the synchronization. Consequently, all the calls necessary for the data transfer (initialization of the transfer, synchronization) are made by the origin process only ; this is true one-sided communication. The lock and unlock operations apply only to a given local memory window (i.e. identified by a target process number and an MPI window object). The period which starts at the lock and ends at the unlock is called an access period to the local memory window. It is only during this period that the origin process has access to the local memory window of the target process.

145 93/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization To use it, the origin process only has to surround the calls to the RMA data transfer initialization primitives with MPI_WIN_LOCK() and MPI_WIN_UNLOCK(). For the target process, no MPI subroutine call is necessary. When MPI_WIN_UNLOCK() returns, all the data transfers initiated after the MPI_WIN_LOCK() are completed. The first MPI_WIN_LOCK() argument specifies whether several simultaneous RMA accesses to the same local memory window are authorized (MPI_LOCK_SHARED) or not (MPI_LOCK_EXCLUSIVE). A basic use of passive target synchronizations consists of creating blocking versions of the RMA operations (put, get, accumulate) without the target having to call any MPI subroutine, as in the example below.

146 94/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization

subroutine get_blocking(orig_addr, orig_count, orig_datatype, target_rank, &
                        target_disp, target_count, target_datatype, win, code)
  integer, intent(in)  :: orig_count, orig_datatype, target_rank, target_count, &
                          target_datatype, win
  integer, intent(out) :: code
  integer(kind=MPI_ADDRESS_KIND), intent(in) :: target_disp
  real(kind=kind(1.d0)), dimension(:) :: orig_addr

  call MPI_WIN_LOCK(MPI_LOCK_SHARED, target_rank, 0, win, code)
  call MPI_GET(orig_addr, orig_count, orig_datatype, target_rank, target_disp, &
               target_count, target_datatype, win, code)
  call MPI_WIN_UNLOCK(target_rank, win, code)
end subroutine get_blocking
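By symmetry, a blocking put can be written in the same way. In this hedged sketch (the name put_blocking is hypothetical, not part of the course), MPI_LOCK_EXCLUSIVE is used instead of MPI_LOCK_SHARED so that no other RMA operation may access the same local memory window during the epoch:

subroutine put_blocking(orig_addr, orig_count, orig_datatype, target_rank, &
                        target_disp, target_count, target_datatype, win, code)
  ! Sketch of a blocking put built on passive target synchronization.
  use mpi
  integer, intent(in)  :: orig_count, orig_datatype, target_rank, target_count, &
                          target_datatype, win
  integer, intent(out) :: code
  integer(kind=MPI_ADDRESS_KIND), intent(in) :: target_disp
  real(kind=kind(1.d0)), dimension(:) :: orig_addr

  ! Exclusive access period: no concurrent RMA access to this window
  call MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE, target_rank, 0, win, code)
  call MPI_PUT(orig_addr, orig_count, orig_datatype, target_rank, target_disp, &
               target_count, target_datatype, win, code)
  call MPI_WIN_UNLOCK(target_rank, win, code)   ! transfer completed on return
end subroutine put_blocking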

147 95/246 5 One-sided Communication 5.4 Completion of the Transfer : the synchronization Remark Concerning the Fortran Codes For portability, when passive target synchronizations (MPI_WIN_LOCK(), MPI_WIN_UNLOCK()) are used, the memory window must be allocated with MPI_ALLOC_MEM(). In Fortran, this function takes a C-type pointer as argument (i.e. a Cray pointer, which does not belong to the Fortran 95 standard). If Cray pointers are not available, a C routine must be used to allocate the memory window...
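As an illustration only, and assuming a compiler that supports Cray pointers, the allocation of the window could be sketched as follows (the program name and sizes are hypothetical, not from the course):

program alloc_window
  ! Sketch only, assuming Cray-pointer support (a nonstandard extension
  ! required by the Fortran binding of MPI_ALLOC_MEM).
  use mpi
  implicit none
  integer, parameter :: n = 10
  integer :: code, size_real, win
  integer(kind=MPI_ADDRESS_KIND) :: dim_win
  real(kind=kind(1.d0)) :: win_local(n)
  pointer (p_win, win_local)            ! Cray pointer to the window memory

  call MPI_INIT(code)
  call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION, size_real, code)
  dim_win = n*size_real

  ! Memory suitable for passive target synchronization
  call MPI_ALLOC_MEM(dim_win, MPI_INFO_NULL, p_win, code)
  call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, code)

  ! ... RMA operations surrounded by MPI_WIN_LOCK()/MPI_WIN_UNLOCK() ...

  call MPI_WIN_FREE(win, code)
  call MPI_FREE_MEM(win_local, code)    ! releases memory from MPI_ALLOC_MEM
  call MPI_FINALIZE(code)
end program alloc_window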

148 96/246 5 One-sided Communication 5.5 Conclusions 5 One-sided Communication 5.5 Conclusions The RMA concepts of MPI are difficult to implement in non-trivial applications. An in-depth knowledge of the standard is necessary in order to avoid its many traps. Performance can vary widely from one implementation to another. The advantage of the RMA concept of MPI resides essentially in the passive target approach. It is only in this case that the use of the RMA subroutines is really essential (applications requiring that a process access data belonging to a remote process without interrupting the latter...)

149 97/246 6 Derived datatypes 1 Introduction 2 Environment 3 Point-to-point Communications 4 Collective communications 5 One-sided Communication 6 Derived datatypes 6.1 Introduction 6.2 Contiguous datatypes 6.3 Constant stride 6.4 Other subroutines 6.5 Examples The datatype "matrix column" The datatype "matrix line" The datatype "matrix block" 6.6 Homogenous datatypes of variable strides 6.7 Subarray Datatype Constructor 6.8 Heterogenous datatypes 6.9 Subroutines annexes 6.10 Conclusion 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

150 98/246 6 Derived datatypes 6.1 Introduction 6 Derived datatypes 6.1 Introduction In the communications, the exchanged data have datatypes : MPI_INTEGER, MPI_REAL, MPI_COMPLEX, etc. We can create more complex data structures by using subroutines such as MPI_TYPE_CONTIGUOUS(), MPI_TYPE_VECTOR(), MPI_TYPE_CREATE_HVECTOR(). Each newly created derived datatype must be validated with the MPI_TYPE_COMMIT() subroutine before being used in a communication. If we wish to reuse the same name to define another derived datatype, we have to free it first with the MPI_TYPE_FREE() subroutine.

151 99/246 6 Derived datatypes 6.1 Introduction

Figure 24: Hierarchy of the MPI constructors (from the most general to the simplest):
  MPI_TYPE_CREATE_STRUCT
  MPI_TYPE_[CREATE_H]INDEXED
  MPI_TYPE_[CREATE_H]VECTOR
  MPI_TYPE_CONTIGUOUS
  MPI_REAL, MPI_INTEGER, MPI_LOGICAL

152 100/246 6 Derived datatypes 6.2 Contiguous datatypes 6 Derived datatypes 6.2 Contiguous datatypes MPI_TYPE_CONTIGUOUS() creates a data structure from a homogeneous set of elements of an existing datatype that are contiguous in memory.

call MPI_TYPE_CONTIGUOUS(5, MPI_REAL,new_type,code)

Figure 25: MPI_TYPE_CONTIGUOUS subroutine

integer, intent(in)  :: count, old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_CONTIGUOUS(count,old_type,new_type,code)

153 101/246 6 Derived datatypes 6.3 Constant stride 6 Derived datatypes 6.3 Constant stride MPI_TYPE_VECTOR() creates a data structure from a homogeneous set of data of an existing datatype separated by a constant stride in memory. The stride is given in number of elements.

call MPI_TYPE_VECTOR(6,1,5, MPI_REAL,new_type,code)

Figure 26: MPI_TYPE_VECTOR subroutine

integer, intent(in)  :: count, block_length
integer, intent(in)  :: stride        ! given in elements
integer, intent(in)  :: old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_VECTOR(count,block_length,stride,old_type,new_type,code)

154 102/246 6 Derived datatypes 6.3 Constant stride MPI_TYPE_CREATE_HVECTOR() creates a data structure from a homogeneous set of data of an existing datatype separated by a constant stride in memory. The stride is given in bytes. This call is useful when the old type is no longer a base datatype (MPI_INTEGER, MPI_REAL,...) but a more complex datatype constructed with MPI subroutines, because in this case the stride can no longer be given in number of elements.

integer, intent(in)  :: count, block_length
integer(kind=MPI_ADDRESS_KIND), intent(in) :: stride   ! given in bytes
integer, intent(in)  :: old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_CREATE_HVECTOR(count,block_length,stride,old_type,new_type,code)
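For instance, here is a hedged sketch (not from the course; names such as type_column and type_columns2 are chosen for the illustration) that builds a "one column out of two" datatype from a column datatype, computing the stride in bytes portably with MPI_TYPE_GET_EXTENT():

program hvector_sketch
  ! Sketch: one column out of two of a real matrix, built from a "column"
  ! derived datatype, hence a stride that must be given in bytes.
  use mpi
  implicit none
  integer, parameter :: nb_lines=5, nb_columns=6
  integer :: code, type_column, type_columns2
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent_real, stride

  call MPI_INIT(code)

  ! The old type is itself a derived datatype: a matrix column
  call MPI_TYPE_CONTIGUOUS(nb_lines, MPI_REAL, type_column, code)
  call MPI_TYPE_COMMIT(type_column, code)

  ! Stride in bytes = 2 columns = 2*nb_lines elements of MPI_REAL
  call MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent_real, code)
  stride = 2*nb_lines*extent_real

  call MPI_TYPE_CREATE_HVECTOR(nb_columns/2, 1, stride, type_column, &
                               type_columns2, code)
  call MPI_TYPE_COMMIT(type_columns2, code)

  ! ... use type_columns2 in a send/receive ...

  call MPI_TYPE_FREE(type_columns2, code)
  call MPI_TYPE_FREE(type_column, code)
  call MPI_FINALIZE(code)
end program hvector_sketch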

155 103/246 6 Derived datatypes 6.4 Other subroutines 6 Derived datatypes 6.4 Other subroutines Before using a new derived datatype, it is necessary to validate it with the MPI_TYPE_COMMIT() subroutine.

integer, intent(inout) :: new_type
integer, intent(out)   :: code
call MPI_TYPE_COMMIT(new_type,code)

The freeing of a derived datatype is made by using the MPI_TYPE_FREE() subroutine.

integer, intent(inout) :: new_type
integer, intent(out)   :: code
call MPI_TYPE_FREE(new_type,code)

156 104/246 6 Derived datatypes 6.5 Examples 6 Derived datatypes 6.5 Examples The datatype "matrix column"

program column
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE)  :: status
  integer :: rank,code,type_column

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Definition of the type_column datatype
  call MPI_TYPE_CONTIGUOUS(nb_lines,MPI_REAL,type_column,code)

  ! Validation of the type_column datatype
  call MPI_TYPE_COMMIT(type_column,code)

157 105/246 6 Derived datatypes 6.5 Examples

  ! Sending of the first column
  if ( rank == 0 ) then
    call MPI_SEND(a(1,1),1,type_column,1,tag,MPI_COMM_WORLD,code)

  ! Reception in the last column
  elseif ( rank == 1 ) then
    call MPI_RECV(a(1,nb_columns),nb_lines,MPI_REAL,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Free the datatype
  call MPI_TYPE_FREE(type_column,code)

  call MPI_FINALIZE(code)

end program column

158 106/246 6 Derived datatypes 6.5 Examples 6 Derived datatypes 6.5 Examples The datatype "matrix line"

program line
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE)  :: status
  integer :: rank,code,type_line

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Definition of the datatype type_line
  call MPI_TYPE_VECTOR(nb_columns,1,nb_lines,MPI_REAL,type_line,code)

  ! Validation of the datatype type_line
  call MPI_TYPE_COMMIT(type_line,code)

159 107/246 6 Derived datatypes 6.5 Examples

  ! Sending of the second line
  if ( rank == 0 ) then
    call MPI_SEND(a(2,1),nb_columns,MPI_REAL,1,tag,MPI_COMM_WORLD,code)

  ! Reception in the next to last line
  elseif ( rank == 1 ) then
    call MPI_RECV(a(nb_lines-1,1),1,type_line,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Free the datatype type_line
  call MPI_TYPE_FREE(type_line,code)

  call MPI_FINALIZE(code)

end program line

160 108/246 6 Derived datatypes 6.5 Examples 6 Derived datatypes 6.5 Examples The datatype "matrix block"

program block
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  integer, parameter :: nb_lines_block=2,nb_columns_block=3
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE)  :: status
  integer :: rank,code,type_block

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Creation of the datatype type_block
  call MPI_TYPE_VECTOR(nb_columns_block,nb_lines_block,nb_lines,&
                       MPI_REAL,type_block,code)

  ! Validation of the datatype type_block
  call MPI_TYPE_COMMIT(type_block,code)

161 109/246 6 Derived datatypes 6.5 Examples

  ! Sending of a block
  if ( rank == 0 ) then
    call MPI_SEND(a(1,1),1,type_block,1,tag,MPI_COMM_WORLD,code)

  ! Reception of the block
  elseif ( rank == 1 ) then
    call MPI_RECV(a(nb_lines-1,nb_columns-2),1,type_block,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Freeing of the datatype type_block
  call MPI_TYPE_FREE(type_block,code)

  call MPI_FINALIZE(code)

end program block

162 110/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides 6 Derived datatypes 6.6 Homogenous datatypes of variable strides MPI_TYPE_INDEXED() creates a data structure composed of a sequence of blocks, each containing a variable number of elements and separated by a variable stride in memory. The stride is given in number of elements. MPI_TYPE_CREATE_HINDEXED() has the same functionality as MPI_TYPE_INDEXED(), except that the strides separating two data blocks are given in bytes. This subroutine is useful when the old datatype is not an MPI base datatype (MPI_INTEGER, MPI_REAL,...): in that case the stride cannot be given in number of elements of the old datatype. For MPI_TYPE_CREATE_HINDEXED(), as for MPI_TYPE_CREATE_HVECTOR(), use MPI_TYPE_SIZE() or MPI_TYPE_GET_EXTENT() in order to obtain the size of the stride in bytes in a portable way.

166 111/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

Figure 27: The MPI_TYPE_INDEXED constructor, with nb=3, blocks_lengths=(2,1,3), displacements=(0,3,7) (layout of the old_type elements and of the resulting new_type)

integer,intent(in)               :: nb
integer,intent(in),dimension(nb) :: block_lengths
! Attention: the displacements are given in elements
integer,intent(in),dimension(nb) :: displacements
integer,intent(in)               :: old_type
integer,intent(out)              :: new_type,code
call MPI_TYPE_INDEXED(nb,block_lengths,displacements,old_type,new_type,code)

171 112/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

Figure 28: The MPI_TYPE_CREATE_HINDEXED constructor, with nb=4, blocks_lengths=(2,1,2,1), displacements=(2,10,14,24) (layout of the old_type elements and of the resulting new_type)

integer,intent(in)               :: nb
integer,intent(in),dimension(nb) :: block_lengths
! Attention: the displacements are given in bytes
integer(kind=MPI_ADDRESS_KIND),intent(in),dimension(nb) :: displacements
integer,intent(in)               :: old_type
integer,intent(out)              :: new_type,code
call MPI_TYPE_CREATE_HINDEXED(nb,block_lengths,displacements,old_type,new_type,code)
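As a complement, here is a minimal sketch (not from the course; the block lengths and displacements are arbitrary values chosen for the illustration) showing how the byte displacements of MPI_TYPE_CREATE_HINDEXED() can be computed portably from the extent of the old datatype:

program hindexed_sketch
  ! Sketch: displacements chosen in numbers of elements of the old datatype,
  ! then converted to bytes via the extent returned by MPI_TYPE_GET_EXTENT().
  use mpi
  implicit none
  integer, parameter :: nb=4
  integer, dimension(nb) :: block_lengths
  integer(kind=MPI_ADDRESS_KIND), dimension(nb) :: displacements
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent
  integer :: code, new_type

  call MPI_INIT(code)

  call MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, code)
  block_lengths(:) = (/ 2, 1, 2, 1 /)
  displacements(:) = (/ 0, 3, 7, 10 /) * extent    ! elements -> bytes

  call MPI_TYPE_CREATE_HINDEXED(nb, block_lengths, displacements, MPI_REAL, &
                                new_type, code)
  call MPI_TYPE_COMMIT(new_type, code)

  ! ... use new_type in communications ...

  call MPI_TYPE_FREE(new_type, code)
  call MPI_FINALIZE(code)
end program hindexed_sketch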

172 113/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides In the following example, each of the two processes : ➀ initializes its matrix (positive increasing numbers on process 0 and negative decreasing numbers on process 1) ; ➁ constructs its datatype : a triangular matrix (upper for process 0 and lower for process 1) ; ➂ sends its triangular matrix to the other process and, via the MPI_SENDRECV_REPLACE() subroutine, receives a triangular matrix which it stores in place of the one it sent ; ➃ frees its resources and exits MPI.

174 114/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides Figure 29: Exchange between the two processes (contents of the matrices of process 0 and process 1, before and after the exchange of the triangular parts)

175 115/246 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

program triangle
  use mpi
  implicit none
  integer,parameter :: n=8,tag=100
  real,dimension(n,n) :: a
  integer,dimension(MPI_STATUS_SIZE) :: status
  integer :: i,code
  integer :: rank,type_triangle
  integer,dimension(n) :: block_lengths,displacements

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = reshape( (/ (sign(i,-rank),i=1,n*n) /), (/n,n/))

  ! Creation of the upper triangular matrix datatype for process 0
  ! and of the lower triangular matrix datatype for process 1
  if (rank == 0) then
    block_lengths(:) = (/ (i-1,i=1,n) /)
    displacements(:) = (/ (n*(i-1),i=1,n) /)
  else
    block_lengths(:) = (/ (n-i,i=1,n) /)
    displacements(:) = (/ (n*(i-1)+i,i=1,n) /)
  endif

  call MPI_TYPE_INDEXED(n,block_lengths,displacements,MPI_REAL,type_triangle,code)
  call MPI_TYPE_COMMIT(type_triangle,code)

  ! Permutation of the lower and upper triangular matrices
  call MPI_SENDRECV_REPLACE(a,1,type_triangle,mod(rank+1,2),tag,mod(rank+1,2), &
                            tag,MPI_COMM_WORLD,status,code)

  ! Freeing of the triangle datatype
  call MPI_TYPE_FREE(type_triangle,code)

  call MPI_FINALIZE(code)
end program triangle

176 116/246 6 Derived datatypes 6.7 Subarray Datatype Constructor 6 Derived datatypes 6.7 Subarray Datatype Constructor The MPI_TYPE_CREATE_SUBARRAY() subroutine allows creating a subarray datatype from an array.

integer,intent(in)                    :: nb_dims
integer,dimension(nb_dims),intent(in) :: shape_array,shape_sub_array,coord_start
integer,intent(in)                    :: order,old_type
integer,intent(out)                   :: new_type,code
call MPI_TYPE_CREATE_SUBARRAY(nb_dims,shape_array,shape_sub_array,coord_start, &
                              order,old_type,new_type,code)

177 117/246 6 Derived datatypes 6.7 Subarray Datatype Constructor Reminder of the vocabulary relative to arrays in Fortran 95: The rank of an array is its number of dimensions. The extent of an array is its number of elements in a given dimension. The shape of an array is the vector whose elements are the extents of the array in each dimension. For example, for the T(10,0:5,-10:10) array: its rank is 3, its extent is 10 in the first dimension, 6 in the second and 21 in the third, so its shape is the vector (10,6,21).
nb_dims : rank of the array
shape_array : shape of the array from which a subarray will be extracted
shape_sub_array : shape of the subarray
coord_start : start coordinates of the subarray, assuming the indices of the array start at 0. For example, if we want the subarray to start at element array(2,3), we must set coord_start(:)=(/ 1,2 /)
order : storage order of the elements: MPI_ORDER_FORTRAN for the ordering used by Fortran arrays (column-major order), MPI_ORDER_C for the ordering used by C arrays (row-major order)

178 118/246 6 Derived datatypes 6.7 Subarray Datatype Constructor Figure 30: Exchanges between the two processes (contents of the tab array of process 0 and of process 1, before and after the exchange of the 2x2 subarrays)

179 119/246 6 Derived datatypes 6.7 Subarray Datatype Constructor

program subarray
  use mpi
  implicit none

  integer,parameter :: nb_lines=4,nb_columns=3,&
                       tag=1000,nb_dims=2
  integer :: code,rank,type_subarray,i
  integer,dimension(nb_lines,nb_columns) :: tab
  integer,dimension(nb_dims) :: shape_array,shape_subarray,coord_start
  integer,dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the tab array on each process
  tab(:,:) = reshape( (/ (sign(i,-rank),i=1,nb_lines*nb_columns) /), &
                      (/ nb_lines,nb_columns /) )

180 120/246 6 Derived datatypes 6.7 Subarray Datatype Constructor

  ! Shape of the tab array from which a subarray will be extracted
  shape_array(:) = shape(tab)
  ! The F95 shape function returns the shape of the array passed as argument.
  ! CAUTION: if the array has not been allocated on all the processes, its shape
  ! must be given explicitly so that it is known on all the processes,
  ! e.g. shape_array(:) = (/ nb_lines,nb_columns /)

  ! Shape of the subarray
  shape_subarray(:) = (/ 2,2 /)

  ! Start coordinates of the subarray
  ! For process 0 we start from the tab(2,1) element
  ! For process 1 we start from the tab(3,2) element
  coord_start(:) = (/ rank+1,rank /)

  ! Creation of the type_subarray derived datatype
  call MPI_TYPE_CREATE_SUBARRAY(nb_dims,shape_array,shape_subarray,coord_start,&
                                MPI_ORDER_FORTRAN,MPI_INTEGER,type_subarray,code)
  call MPI_TYPE_COMMIT(type_subarray,code)

  ! Exchange of the subarrays
  call MPI_SENDRECV_REPLACE(tab,1,type_subarray,mod(rank+1,2),tag,&
                            mod(rank+1,2),tag,MPI_COMM_WORLD,status,code)
  call MPI_TYPE_FREE(type_subarray,code)
  call MPI_FINALIZE(code)
end program subarray

181 121/246 6 Derived datatypes 6.8 Heterogenous datatypes 6 Derived datatypes 6.8 Heterogenous datatypes The MPI_TYPE_CREATE_STRUCT() subroutine is the constructor of the most general datatypes. It has the same functionality as MPI_TYPE_INDEXED() but, in addition, it allows the replication of blocks of different datatypes. The MPI_TYPE_CREATE_STRUCT() parameters are the same as those of MPI_TYPE_INDEXED(), with in addition : the old_types field is now a vector of MPI datatypes ; to take into account the heterogeneity of the data and their alignment in memory, the displacement between two elements is computed as the difference of their addresses ; MPI provides, via MPI_GET_ADDRESS(), a portable subroutine which returns the address of a variable.

187 122/246 6 Derived datatypes 6.8 Heterogenous datatypes

Figure 31: The MPI_TYPE_CREATE_STRUCT constructor, with nb=5, blocks_lengths=(3,1,5,1,1), displacements=(0,7,11,21,26), old_types=(type1,type2,type3,type1,type3) (layout of the old_types type 1, type 2, type 3 and of the resulting new_type)

integer,intent(in)               :: nb
integer,intent(in),dimension(nb) :: blocks_lengths
integer(kind=MPI_ADDRESS_KIND),intent(in),dimension(nb) :: displacements
integer,intent(in),dimension(nb) :: old_types
integer,intent(out)              :: new_type,code
call MPI_TYPE_CREATE_STRUCT(nb,blocks_lengths,displacements,old_types,new_type,code)

<type>,intent(in)                          :: variable
integer(kind=MPI_ADDRESS_KIND),intent(out) :: address_variable
integer,intent(out)                        :: code
call MPI_GET_ADDRESS(variable,address_variable,code)

188 123/246 6 Derived datatypes 6.8 Heterogenous datatypes

program Interaction_Particles
  use mpi
  implicit none

  integer, parameter :: n=1000,tag=100
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer :: rank,code,type_particle,i
  integer, dimension(4) :: types,blocks_lengths
  integer(kind=MPI_ADDRESS_KIND), dimension(4) :: displacements,addresses

  type Particule
    character(len=5)   :: category
    integer            :: mass
    real, dimension(3) :: coords
    logical            :: class
  end type Particule
  type(Particule), dimension(n) :: p,temp_p

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Construction of the datatype
  types = (/ MPI_CHARACTER, MPI_INTEGER, MPI_REAL, MPI_LOGICAL /)
  blocks_lengths = (/5,1,3,1/)

189 124/246 6 Derived datatypes 6.8 Heterogenous datatypes

  call MPI_GET_ADDRESS(p(1)%category,addresses(1),code)
  call MPI_GET_ADDRESS(p(1)%mass,addresses(2),code)
  call MPI_GET_ADDRESS(p(1)%coords,addresses(3),code)
  call MPI_GET_ADDRESS(p(1)%class,addresses(4),code)

  ! Calculation of displacements relative to the start address
  do i=1,4
    displacements(i) = addresses(i) - addresses(1)
  end do
  call MPI_TYPE_CREATE_STRUCT(4,blocks_lengths,displacements,types,type_particle, &
                              code)
  ! Validation of the structured datatype
  call MPI_TYPE_COMMIT(type_particle,code)

  ! Initialization of particles for each process

  ! Sending of particles from 0 towards 1
  if (rank == 0) then
    call MPI_SEND(p(1)%category,n,type_particle,1,tag,MPI_COMM_WORLD,code)
  else
    call MPI_RECV(temp_p(1)%category,n,type_particle,0,tag,MPI_COMM_WORLD, &
                  status,code)
  endif

  ! Freeing of the datatype
  call MPI_TYPE_FREE(type_particle,code)
  call MPI_FINALIZE(code)
end program Interaction_Particles

190 125/246 6 Derived datatypes 6.9 Subroutines annexes 6 Derived datatypes 6.9 Subroutines annexes The total size of a datatype : MPI_TYPE_SIZE()

integer, intent(in)  :: type_data
integer, intent(out) :: size, code
call MPI_TYPE_SIZE(type_data,size,code)

The extent as well as the lower bound of a derived datatype, taking possible memory alignments into account : MPI_TYPE_GET_EXTENT()

integer, intent(in)  :: type_derived
integer(kind=MPI_ADDRESS_KIND),intent(out) :: bound_inf_align,size_align
integer, intent(out) :: code
call MPI_TYPE_GET_EXTENT(type_derived,bound_inf_align,size_align,code)

We can modify the lower bound of a derived datatype and its extent in order to create a new datatype adapted from the previous one : MPI_TYPE_CREATE_RESIZED()

integer, intent(in)  :: old_type
integer(kind=MPI_ADDRESS_KIND),intent(in) :: new_bound_inf,new_size
integer, intent(out) :: new_type,code
call MPI_TYPE_CREATE_RESIZED(old_type,new_bound_inf,new_size,new_type,code)

191 126/246 6 Derived datatypes 6.9 Subroutines annexes

PROGRAM half_line
  USE mpi
  IMPLICIT NONE
  INTEGER,PARAMETER :: nb_lines=5,nb_columns=6,&
                       half_line=nb_columns/2,tag=
  INTEGER,DIMENSION(nb_lines,nb_columns) :: A
  INTEGER :: typehalfline,typehalfline2
  INTEGER :: code,size_integer,rank,i
  INTEGER(kind=MPI_ADDRESS_KIND) :: boundinf=0, sizedisplacement
  INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status

  CALL MPI_INIT(code)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  ! Initialization of the A matrix on each process
  A(:,:) = RESHAPE( (/ (SIGN(i,-rank),i=1,nb_lines*nb_columns) /), &
                    (/ nb_lines,nb_columns /) )

  ! Construction of the derived datatype typehalfline
  CALL MPI_TYPE_VECTOR(half_line,1,nb_lines,MPI_INTEGER,typeHalfLine,code)
  ! Get the size of the MPI_INTEGER datatype
  CALL MPI_TYPE_SIZE(MPI_INTEGER, size_integer, code)
  ! Construction of the derived datatype typehalfline2
  sizedisplacement = size_integer
  CALL MPI_TYPE_CREATE_RESIZED(typeHalfLine,boundInf,sizeDisplacement,&
                               typehalfline2,code)

192 127/246 6 Derived datatypes 6.9 Subroutines annexes

  ! Validation of the datatype typehalfline2
  CALL MPI_TYPE_COMMIT(typeHalfLine2,code)

  IF (rank == 0) THEN
    ! Sending of the A matrix to the process 1 with the derived datatype typehalfline2
    CALL MPI_SEND(A(1,1), 2, typehalfline2, 1, tag, &
                  MPI_COMM_WORLD, code)
  ELSE
    ! Reception for the process 1 in the A matrix
    CALL MPI_RECV(A(1,nb_columns-1), 6, MPI_INTEGER, 0, tag,&
                  MPI_COMM_WORLD,status, code)

    PRINT *, "A matrix on the process 1"
    DO i=1,nb_lines
      PRINT *,A(i,:)
    END DO

  END IF
  CALL MPI_FINALIZE(code)
END PROGRAM half_line

> mpiexec -n 4 half_ligne
A matrix on the process

193 128/246 6 Derived datatypes 6.10 Conclusion 6 Derived datatypes 6.10 Conclusion The MPI derived datatypes are powerful and portable mechanisms for describing data. Combined with subroutines like MPI_SENDRECV(), they simplify the writing of interprocess exchanges. The combination of derived datatypes and topologies (described in one of the next chapters) makes MPI an ideal tool for all domain decomposition problems on regular or irregular meshes.

194 129/246 7 Optimizations 1 Introduction 2 Environment 3 Point-to-point Communications 4 Collective communications 5 One-sided Communication 6 Derived datatypes 7 Optimizations 7.1 Introduction 7.2 Point-to-Point Send Modes Key Terms Synchronous Sends Buffered Sends Standard Sends Ready Sends Performances of Send Modes 7.3 Computation-Communication Overlap 8 Communicators 9 MPI-IO 10 Conclusion

195 130/246 7 Optimizations 7.1 Introduction 7 Optimizations 7.1 Introduction The optimization of MPI communications must be a primary concern when their share of the execution time becomes significant compared to that of the computations. Beyond the choice of an algorithm that is as efficient as possible, the optimization of communications can be accomplished at many levels including, for example: by selecting the most suitable communication mode; by overlapping communications with computations.

196 131/246 7 Optimizations 7.2 Point-to-Point Send Modes

Mode               Blocking    Non-blocking
Standard send      MPI_Send    MPI_Isend
Synchronous send   MPI_Ssend   MPI_Issend
Buffered send      MPI_Bsend   MPI_Ibsend
Ready send         MPI_Rsend   MPI_Irsend
Receive            MPI_Recv    MPI_Irecv

197 132/246 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Key Terms It is important to understand the definitions of a few MPI terms well.
Blocking call: a call is blocking if the memory space used for the communication can be reused immediately after the call returns. The data that have been or will be sent are the data that were in this space at the moment of the call. For a receive, the data must already have been received into this space (if the return code is MPI_SUCCESS).
Non-blocking call: a non-blocking call returns very quickly, but it does not authorize the immediate reuse of the memory space used in the communication. It is necessary to make sure that the communication is fully completed (with MPI_Wait for example) before using this space again.
Synchronous send: a synchronous send involves a synchronization between the processes involved. There can be no communication before the two processes are ready to communicate. The send cannot complete until the matching receive has been posted.
Buffered send: a buffered send implies the copying of the data into an intermediate memory space. There is then no coupling between the two communicating processes. Consequently, the return of this type of send does not mean that the receive has occurred.
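To make these definitions concrete, here is a minimal sketch (not from the course, to be run with 2 processes) contrasting a blocking and a non-blocking send; with MPI_Isend, the send buffer must not be modified before the corresponding MPI_Wait has returned:

program blocking_vs_nonblocking
  ! Sketch contrasting MPI_Send (blocking) and MPI_Isend (non-blocking).
  use mpi
  implicit none
  integer, parameter :: n=1000, tag=99
  real, dimension(n) :: buf
  integer :: code, rank, request
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
  buf(:) = real(rank)

  if (rank == 0) then
     ! Blocking send: on return, buf can be reused immediately
     call MPI_SEND(buf, n, MPI_REAL, 1, tag, MPI_COMM_WORLD, code)

     ! Non-blocking send: buf must not be modified before MPI_WAIT returns
     call MPI_ISEND(buf, n, MPI_REAL, 1, tag, MPI_COMM_WORLD, request, code)
     call MPI_WAIT(request, status, code)
  else if (rank == 1) then
     call MPI_RECV(buf, n, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
     call MPI_RECV(buf, n, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
  end if

  call MPI_FINALIZE(code)
end program blocking_vs_nonblocking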

198 133/246 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Synchronous Sends A synchronous send is made by calling the MPI_Ssend or MPI_Issend subroutine. Rendezvous Protocol The rendezvous protocol is generally the protocol used for synchronous sends (implementation-dependent). The return receipt is optional. [Diagram: over time, process 0 sends the message header to process 1, process 1 replies that it is ready to receive, the data are then transferred, and an optional return receipt is sent back.]

199 134/246 7 Optimizations 7.2 Point-to-Point Send Modes
Advantages
  Uses fewer resources (no buffer)
  Faster if the receiver is ready (no copying into a buffer)
  Guarantee of receipt thanks to the synchronization
Disadvantages
  Waiting time if the receiver is not there / not ready
  Risks of deadlock

200 135/246 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Buffered Sends A buffered send is made by calling the MPI_Bsend or MPI_Ibsend subroutine. The buffers have to be managed manually (with calls to MPI_Buffer_attach and MPI_Buffer_detach). They have to be allocated by taking the message header size into account (by adding the constant MPI_BSEND_OVERHEAD for each message instance).
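A minimal sketch (not from the course, to be run with 2 processes) of a buffered send with explicit buffer management could look as follows; the buffer is sized for one message plus MPI_BSEND_OVERHEAD:

program buffered_send
  ! Sketch of MPI_Bsend with a user-managed buffer.
  use mpi
  implicit none
  integer, parameter :: n=1000, tag=11
  real, dimension(n) :: msg
  character, dimension(:), allocatable :: buffer
  integer :: code, rank, buffer_size, size_real
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
  msg(:) = real(rank)

  if (rank == 0) then
     ! Buffer sized for one message plus its header
     call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
     buffer_size = n*size_real + MPI_BSEND_OVERHEAD
     allocate(buffer(buffer_size))
     call MPI_BUFFER_ATTACH(buffer, buffer_size, code)

     call MPI_BSEND(msg, n, MPI_REAL, 1, tag, MPI_COMM_WORLD, code)

     ! Detach waits until the buffered messages have been delivered
     call MPI_BUFFER_DETACH(buffer, buffer_size, code)
     deallocate(buffer)
  else if (rank == 1) then
     call MPI_RECV(msg, n, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, code)
  end if

  call MPI_FINALIZE(code)
end program buffered_send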

201 136/246 7 Optimizations 7.2 Point-to-Point Send Modes Protocol with User Buffer on the Sender Side This approach is the one generally used for MPI_Bsend or MPI_Ibsend. In this approach, the buffer is on the sender side and is managed explicitly by the application. A buffer managed by MPI can also exist on the receiver side. Many variants are possible. The return receipt is optional. [Diagram: over time, process 0 copies the message into its user buffer, the message is then transferred to process 1, and an optional return receipt is sent back.]

202 137/246 7 Optimizations 7.2 Point-to-Point Send Modes Eager Protocol The eager protocol is often used for standard sends of small messages. It can also be used for MPI_Bsend sends of small messages (implementation-dependent), bypassing the user buffer on the sender side. In this approach, the buffer is on the receiver side. The return receipt is optional. [Diagram: over time, the message is sent directly from process 0 into a buffer on process 1; an optional return receipt is sent back.]

203 138/246 7 Optimizations 7.2 Point-to-Point Send Modes
Advantages
  No need to wait for the receiver (copying into a buffer)
  No risk of deadlocks
Disadvantages
  Uses more resources (memory occupied by buffers, with saturation risks)
  The send buffers used in MPI_Bsend or MPI_Ibsend calls have to be managed manually (it is often hard to choose a suitable size)
  A little slower than synchronous sends if the receiver is ready
  There is no guarantee that the message has been received (send-receive decoupling)
  Risk of wasted memory space if the buffers are oversized
  The application will crash if the buffer is too small
  There are often also hidden buffers managed by the MPI implementation on the sender side and/or on the receiver side (which also use memory resources)

204 139/246 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Standard Sends A standard send is made by calling the MPI_Send or MPI_Isend subroutine. In most implementations, this mode switches from a buffered mode to a synchronous mode when the size of the messages grows.
Advantages
  Often the most efficient (because the vendor chose the best parameters and algorithms)
  The most portable with respect to performance
Disadvantages
  Little control over the mode actually used (often adjustable via environment variables)
  Risk of deadlock depending on the actual mode
  Behavior that can vary with the architecture and the problem size

205 140/246 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Ready Sends A ready send is made by calling the MPI_Rsend or MPI_Irsend subroutine. Warning: these calls may only be made when the matching receive has already been posted. Their use is not recommended.
Advantages
  Slightly more efficient than the synchronous mode because the synchronization protocol can be simplified
Disadvantages
  Errors if the receiver is not ready at the time of the send

206 141/246 7 Optimizations 7.2 Point-to-Point Send Modes [Plot: "Ping pong extranode balanced" — bandwidth (MiB/s) as a function of message size (bytes) for the standard, synchronous and buffered send modes on the Vargas and Babel machines.]

207 142/246 7 Optimizations 7.3 Computation-Communication Overlap 7 Optimizations 7.3 Computation-Communication Overlap Presentation The overlap of communications by computations is a method which allows communication operations to be executed in the background while the program continues to run. It is thus possible, if the hardware and software architecture allows it, to hide all or part of the communication costs. The computation-communication overlap can be seen as an additional level of parallelism. In MPI, this approach relies on the use of non-blocking subroutines (i.e. MPI_Isend, MPI_Irecv and MPI_Wait).

208 143/246 7 Optimizations 7.3 Computation-Communication Overlap [Diagram: partial overlap; process 0 posts the send request, the sending proceeds in the background while the process keeps working, and the process later checks whether the send has finished ("finished? yes") before going on.]
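A minimal sketch (not from the course; variable names are chosen for the illustration, to be run with 2 processes) of this overlap pattern: the non-blocking receive and send are posted first, computation on independent data is performed, and MPI_Waitall is called before the received data are used:

program overlap_sketch
  ! Sketch: initiate the communications, compute on independent data, then wait.
  use mpi
  implicit none
  integer, parameter :: n=100000, tag=7
  real, dimension(n) :: to_send, received, work
  integer :: code, rank, other
  integer, dimension(2) :: requests
  integer, dimension(MPI_STATUS_SIZE,2) :: statuses

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
  other = mod(rank+1, 2)
  to_send(:) = real(rank); work(:) = 1.0

  ! 1) Initiate the non-blocking receive and send
  call MPI_IRECV(received, n, MPI_REAL, other, tag, MPI_COMM_WORLD, requests(1), code)
  call MPI_ISEND(to_send, n, MPI_REAL, other, tag, MPI_COMM_WORLD, requests(2), code)

  ! 2) Computation that touches neither to_send nor received
  work(:) = sqrt(work(:)) + real(rank)

  ! 3) Wait for the completion of both communications
  call MPI_WAITALL(2, requests, statuses, code)

  ! 4) The received data can now be used
  work(:) = work(:) + received(:)

  call MPI_FINALIZE(code)
end program overlap_sketch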


More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

Glossary of Object Oriented Terms

Glossary of Object Oriented Terms Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction

More information

Moving from CS 61A Scheme to CS 61B Java

Moving from CS 61A Scheme to CS 61B Java Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you

More information

McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC dholmes@epcc.ed.ac.uk

McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC dholmes@epcc.ed.ac.uk McMPI Managed-code MPI library in Pure C# Dr D Holmes, EPCC dholmes@epcc.ed.ac.uk Outline Yet another MPI library? Managed-code, C#, Windows McMPI, design and implementation details Object-orientation,

More information

Dynamic Software Testing of MPI Applications with Umpire

Dynamic Software Testing of MPI Applications with Umpire Dynamic Software Testing of MPI Applications with Umpire Jeffrey S. Vetter Bronis R. de Supinski Center for Applied Scientific Computing Lawrence Livermore National Laboratory Livermore, California, USA

More information

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

The programming language C. sws1 1

The programming language C. sws1 1 The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan

More information

1 The Java Virtual Machine

1 The Java Virtual Machine 1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This

More information

A Java Based Tool for Testing Interoperable MPI Protocol Conformance

A Java Based Tool for Testing Interoperable MPI Protocol Conformance A Java Based Tool for Testing Interoperable MPI Protocol Conformance William George National Institute of Standards and Technology 100 Bureau Drive Stop 8951 Gaithersburg MD 20899 8951 1 301 975 4943 william.george@nist.gov

More information

Monitors, Java, Threads and Processes

Monitors, Java, Threads and Processes Monitors, Java, Threads and Processes 185 An object-oriented view of shared memory A semaphore can be seen as a shared object accessible through two methods: wait and signal. The idea behind the concept

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin A Comparison Of Shared Memory Parallel Programming Models Jace A Mogill David Haglin 1 Parallel Programming Gap Not many innovations... Memory semantics unchanged for over 50 years 2010 Multi-Core x86

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Kommunikation in HPC-Clustern

Kommunikation in HPC-Clustern Kommunikation in HPC-Clustern Communication/Computation Overlap in MPI W. Rehm and T. Höfler Department of Computer Science TU Chemnitz http://www.tu-chemnitz.de/informatik/ra 11.11.2005 Outline 1 2 Optimize

More information

GPI Global Address Space Programming Interface

GPI Global Address Space Programming Interface GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1 GPI Global address space programming

More information

Parallel I/O on Mira Venkat Vishwanath and Kevin Harms

Parallel I/O on Mira Venkat Vishwanath and Kevin Harms Parallel I/O on Mira Venkat Vishwanath and Kevin Harms Argonne Na*onal Laboratory venkat@anl.gov ALCF-2 I/O Infrastructure Mira BG/Q Compute Resource Tukey Analysis Cluster 48K Nodes 768K Cores 10 PFlops

More information

Debugging with TotalView

Debugging with TotalView Tim Cramer 17.03.2015 IT Center der RWTH Aachen University Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again and...... enrich

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Application Note: AN00121 Using XMOS TCP/IP Library for UDP-based Networking

Application Note: AN00121 Using XMOS TCP/IP Library for UDP-based Networking Application Note: AN00121 Using XMOS TCP/IP Library for UDP-based Networking This application note demonstrates the use of XMOS TCP/IP stack on an XMOS multicore micro controller to communicate on an ethernet-based

More information

Developing Parallel Applications with the Eclipse Parallel Tools Platform

Developing Parallel Applications with the Eclipse Parallel Tools Platform Developing Parallel Applications with the Eclipse Parallel Tools Platform Greg Watson IBM STG grw@us.ibm.com Parallel Tools Platform Enabling Parallel Application Development Best practice tools for experienced

More information

THE NAS KERNEL BENCHMARK PROGRAM

THE NAS KERNEL BENCHMARK PROGRAM THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures

More information

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing

More information

Informatica e Sistemi in Tempo Reale

Informatica e Sistemi in Tempo Reale Informatica e Sistemi in Tempo Reale Introduction to C programming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa October 25, 2010 G. Lipari (Scuola Superiore Sant Anna)

More information

Mathematical Libraries and Application Software on JUROPA and JUQUEEN

Mathematical Libraries and Application Software on JUROPA and JUQUEEN Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUROPA and JUQUEEN JSC Training Course May 2014 I.Gutheil Outline General Informations Sequential Libraries Parallel

More information

How To Understand The Concept Of A Distributed System

How To Understand The Concept Of A Distributed System Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz ens@ia.pw.edu.pl, akozakie@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of

More information

Advanced Computational Software

Advanced Computational Software Advanced Computational Software Scientific Libraries: Part 2 Blue Waters Undergraduate Petascale Education Program May 29 June 10 2011 Outline Quick review Fancy Linear Algebra libraries - ScaLAPACK -PETSc

More information

SFIO, Parallel File Striping for MPI-I/O

SFIO, Parallel File Striping for MPI-I/O Published in EPFL Supercomputing Review No. 12 - Nov. 2, pp. 17-21, http://sawww.epfl.ch/sic/sa/publications/scr/scr12-page17.html SFIO, Système de fichiers distribués pour MPI-I/O Résumé Emin Gabrielyan

More information

The C Programming Language course syllabus associate level

The C Programming Language course syllabus associate level TECHNOLOGIES The C Programming Language course syllabus associate level Course description The course fully covers the basics of programming in the C programming language and demonstrates fundamental programming

More information

18-447 Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

18-447 Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013 18-447 Computer Architecture Lecture 3: ISA Tradeoffs Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013 Reminder: Homeworks for Next Two Weeks Homework 0 Due next Wednesday (Jan 23), right

More information

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp

Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

It is the thinnest layer in the OSI model. At the time the model was formulated, it was not clear that a session layer was needed.

It is the thinnest layer in the OSI model. At the time the model was formulated, it was not clear that a session layer was needed. Session Layer The session layer resides above the transport layer, and provides value added services to the underlying transport layer services. The session layer (along with the presentation layer) add

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly

More information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.

More information

Parallel and Distributed Computing Programming Assignment 1

Parallel and Distributed Computing Programming Assignment 1 Parallel and Distributed Computing Programming Assignment 1 Due Monday, February 7 For programming assignment 1, you should write two C programs. One should provide an estimate of the performance of ping-pong

More information

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming

More information

Caml Virtual Machine File & data formats Document version: 1.4 http://cadmium.x9c.fr

Caml Virtual Machine File & data formats Document version: 1.4 http://cadmium.x9c.fr Caml Virtual Machine File & data formats Document version: 1.4 http://cadmium.x9c.fr Copyright c 2007-2010 Xavier Clerc cadmium@x9c.fr Released under the LGPL version 3 February 6, 2010 Abstract: This

More information

Java (12 Weeks) Introduction to Java Programming Language

Java (12 Weeks) Introduction to Java Programming Language Java (12 Weeks) Topic Lecture No. Introduction to Java Programming Language 1 An Introduction to Java o Java as a Programming Platform, The Java "White Paper" Buzzwords, Java and the Internet, A Short

More information

SMTP-32 Library. Simple Mail Transfer Protocol Dynamic Link Library for Microsoft Windows. Version 5.2

SMTP-32 Library. Simple Mail Transfer Protocol Dynamic Link Library for Microsoft Windows. Version 5.2 SMTP-32 Library Simple Mail Transfer Protocol Dynamic Link Library for Microsoft Windows Version 5.2 Copyright 1994-2003 by Distinct Corporation All rights reserved Table of Contents 1 Overview... 5 1.1

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

Analysis and Implementation of Cluster Computing Using Linux Operating System

Analysis and Implementation of Cluster Computing Using Linux Operating System IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661 Volume 2, Issue 3 (July-Aug. 2012), PP 06-11 Analysis and Implementation of Cluster Computing Using Linux Operating System Zinnia Sultana

More information

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12 Universität Dortmund 12 Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: Alexandra Nolte, Gesine Marwedel, 2003 2010 年 11 月 26 日 These slides use Microsoft clip arts. Microsoft copyright

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Exceptions in MIPS. know the exception mechanism in MIPS be able to write a simple exception handler for a MIPS machine

Exceptions in MIPS. know the exception mechanism in MIPS be able to write a simple exception handler for a MIPS machine 7 Objectives After completing this lab you will: know the exception mechanism in MIPS be able to write a simple exception handler for a MIPS machine Introduction Branches and jumps provide ways to change

More information

HSL and its out-of-core solver

HSL and its out-of-core solver HSL and its out-of-core solver Jennifer A. Scott j.a.scott@rl.ac.uk Prague November 2006 p. 1/37 Sparse systems Problem: we wish to solve where A is Ax = b LARGE Informal definition: A is sparse if many

More information