Message Passing Interface (MPI)
Jalel Chergui, Isabelle Dupays, Denis Girou, Pierre-François Lavallée, Dimitri Lecas, Philippe Wautelet
Plan

1 Introduction
   1.1 Availability and updating
   1.2 Definitions
   1.3 Message Passing concepts
   1.4 History
   1.5 Library
2 Environment
   2.1 Description
   2.2 Example
3 Point-to-point Communications
   3.1 General Concepts
   3.2 Predefined MPI Datatypes
   3.3 Other Possibilities
   3.4 Example: communication ring
4 Collective communications
   4.1 General concepts
   4.2 Global synchronization: MPI_BARRIER()
   4.3 Broadcast: MPI_BCAST()
   4.4 Scatter: MPI_SCATTER()
   4.5 Gather: MPI_GATHER()
   4.6 Gather-to-all: MPI_ALLGATHER()
   4.7 Extended gather: MPI_GATHERV()
   4.8 All-to-all: MPI_ALLTOALL()
   4.9 Global reduction
   4.10 Additions
5 One-sided Communication
   5.1 Introduction (Reminder: The Concept of Message-Passing; The Concept of One-sided Communication)
   RMA Approach of MPI (Memory Window; Data Transfer)
   Completion of the Transfer: the Synchronization (Active Target Synchronization; Passive Target Synchronization)
   Conclusions
6 Derived datatypes
   Introduction; Contiguous datatypes; Constant stride; Other subroutines;
   Examples (the "matrix row", "matrix column" and "matrix block" datatypes);
   Homogeneous datatypes of variable strides; Subarray Datatype Constructor;
   Heterogeneous datatypes; Ancillary subroutines; Conclusion
7 Optimizations
   Introduction; Point-to-Point Send Modes; Key Terms; Synchronous Sends;
   Buffered Sends; Standard Sends; Ready Sends; Performance of Send Modes;
   Computation-Communication Overlap
8 Communicators
   Introduction; Default Communicator; Example; Groups and Communicators;
   Communicator Resulting from Another
   8.6 Topologies (Cartesian topologies; Subdividing a Cartesian topology)
9 MPI-IO
   Introduction (Presentation; Issues; Definitions); File management;
   Reads/Writes: general concepts;
   Individual Reads/Writes (via explicit offsets; via individual file pointers; shared file pointers);
   Collective Reads/Writes (via explicit offsets; via individual file pointers; via shared file pointers);
   Explicit positioning of pointers in a file; Definition of views;
   Nonblocking Reads/Writes (via explicit displacements; via individual file pointers);
   Nonblocking and collective Reads/Writes; Advice; Conclusion
1 Introduction
1.1 Availability and updating

This document is likely to be updated regularly. The most recent version is available on the Web server of IDRIS, section Support de cours:

IDRIS
Institut du développement et des ressources en informatique scientifique
Rue John Von Neumann, Bâtiment 506, BP, ORSAY CEDEX, France

Translated with the help of Yazeed Zabrie.
1.2 Definitions

➊ The sequential programming model:
   - the program is executed by one and only one process;
   - all the variables and constants of the program are allocated in the memory of the process;
   - a process is executed on a physical processor of the machine.

Figure 1: Sequential programming model
➋ In the message-passing programming model:
   - the program is written in a classic language (Fortran, C, C++, etc.);
   - all the variables of the program are private and reside in the local memory of each process;
   - each process may execute different parts of a program;
   - a variable is exchanged between two or more processes via a call to subroutines.

Figure 2: Message-Passing Programming Model
➌ The SPMD execution model: Single Program, Multiple Data;
   - the same program is executed by all the processes;
   - all machines support this programming model and some support only this one;
   - it is a particular case of the more general MPMD model (Multiple Program, Multiple Data), which it can also emulate.

Figure 3: Single Program, Multiple Data
1.3 Message Passing concepts

If a message is sent to a process, the latter must receive it.

Figure 4: Message Passing
A message consists of data chunks passing from the sending process to the receiving process. In addition to the data (scalar variables, arrays, etc.) to be sent, a message has to contain the following information:
   - the identifier of the sending process;
   - the datatype;
   - the length;
   - the identifier of the receiving process.

Figure 5: Message Construction
The exchanged messages are interpreted and managed by an environment comparable to telephony, fax, postal mail, e-mail, etc. The message is sent to a specified address. The receiving process must be able to classify and interpret the messages which are sent to it.

The environment in question is MPI (Message Passing Interface). An MPI application is a group of independent processes, each executing its own code and communicating via calls to MPI library subroutines.
1.4 History

   - Version 1.0: in June 1994, the MPI Forum (Message Passing Interface Forum), with the participation of about forty organizations, came to the definition of a set of subroutines concerning the MPI message-passing library.
   - Version 1.1: June 1995, with only minor changes.
   - Version 1.2: 1997, with minor changes for a better consistency of the naming of some subroutines.
   - Version 1.3: September 2008, with clarifications in MPI 1.2, consistent with the clarifications made by MPI-2.1.
   - Version 2.0: released in July 1997, this version brought essential additions that had been deliberately left out of MPI 1.0 (dynamic process management, one-sided communications, parallel I/O, etc.).
   - Version 2.1: June 2008, with only clarifications of MPI 2.0 but without any changes.
   - Version 2.2: September 2009, with only "small" additions.
   - Version 3.0: September 2012, with changes and important additions compared to version 2.2; main changes:
       - nonblocking collective communications;
       - revision of the implementation of one-sided communications;
       - Fortran (2008) bindings;
       - C++ bindings removed;
       - interfacing of external tools (for debugging and performance measurement);
       - etc.
1.5 Library

   - Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.0, High Performance Computing Center Stuttgart (HLRS)
   - Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.2, High Performance Computing Center Stuttgart (HLRS)
   - William Gropp, Ewing Lusk and Anthony Skjellum: Using MPI: Portable Parallel Programming with the Message Passing Interface, second edition, MIT Press, 1999
   - William Gropp, Ewing Lusk and Rajeev Thakur: Using MPI-2, MIT Press, 1999
   - Peter S. Pacheco: Parallel Programming with MPI, Morgan Kaufmann, 1997
MPI open source implementations: they can be installed on a large number of architectures, but their performance is generally below that of the vendor implementations.
   1 MPICH2
   2 Open MPI
The debuggers
   1 Totalview
   2 DDT

Tools for performance measurement
   1 MPE: MPI Parallel Environment
   2 Scalasca: Scalable Performance Analysis of Large-Scale Applications
   3 Vampir
Some open source parallel scientific libraries:
   1 ScaLAPACK: linear algebra problem solvers using direct methods. The sources can be downloaded from the website.
   2 PETSc: linear and non-linear algebra problem solvers using iterative methods. The sources can be downloaded from the website.
   3 MUMPS: MUltifrontal Massively Parallel sparse direct Solvers. The sources can be downloaded from the website.
   4 FFTW: fast Fourier transforms. The sources can be downloaded from the website.
2 Environment
2.1 Description

Every program unit calling MPI subroutines has to include a header file. In Fortran, we must use the mpi module introduced in MPI-2 (in MPI-1, it was the mpif.h file), and in C/C++ the mpi.h file.

The MPI_INIT() subroutine initializes the necessary environment:

   integer, intent(out) :: code
   call MPI_INIT(code)

The MPI_FINALIZE() subroutine disables this environment:

   integer, intent(out) :: code
   call MPI_FINALIZE(code)
All the operations made by MPI are related to communicators. The default communicator is MPI_COMM_WORLD, which includes all the active processes.

Figure 6: MPI_COMM_WORLD Communicator
At any moment, we can know the number of processes managed by a given communicator with the MPI_COMM_SIZE() subroutine:

   integer, intent(out) :: nb_procs,code
   call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)

Similarly, the MPI_COMM_RANK() subroutine allows us to obtain the process rank (i.e. its instance number, which is a number between 0 and the value returned by MPI_COMM_SIZE() minus 1):

   integer, intent(out) :: rank,code
   call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
2.2 Example

   program who_am_i
     use mpi
     implicit none
     integer :: nb_procs,rank,code

     call MPI_INIT(code)

     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     print *,'I am the process ',rank,' among ',nb_procs

     call MPI_FINALIZE(code)
   end program who_am_i

> mpiexec -n 7 who_am_i
I am the process 3 among 7
I am the process 0 among 7
I am the process 4 among 7
I am the process 1 among 7
I am the process 5 among 7
I am the process 2 among 7
I am the process 6 among 7
3 Point-to-point Communications
3.1 General Concepts

A point-to-point communication occurs between two processes: one is called the sender process and the other one is called the receiver process.

Figure 7: Point-to-point communication
The sender and the receiver are identified by their rank in the communicator. The so-called message envelope is composed of:
   1 the rank of the sender process;
   2 the rank of the receiver process;
   3 the tag of the message;
   4 the communicator which defines the process group and the communication context.

The exchanged data are of predefined types (integer, real, etc.) or of user-defined derived datatypes. In each case there are several communication modes, calling different protocols, which will be discussed in Chapter 7.
   program point_to_point
     use mpi
     implicit none

     integer, dimension(MPI_STATUS_SIZE) :: status
     integer, parameter                  :: tag=100
     integer                             :: rank,value,code

     call MPI_INIT(code)

     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     if (rank == 2) then
       value=1000
       call MPI_SEND(value,1,MPI_INTEGER,5,tag,MPI_COMM_WORLD,code)
     elseif (rank == 5) then
       call MPI_RECV(value,1,MPI_INTEGER,2,tag,MPI_COMM_WORLD,status,code)
       print *,'I, process 5, I have received ',value,' from the process 2'
     end if

     call MPI_FINALIZE(code)

   end program point_to_point

> mpiexec -n 7 point_to_point
I, process 5, I have received 1000 from the process 2
3.2 Predefined MPI Datatypes

   MPI Type               Fortran Type
   --------------------   ----------------
   MPI_INTEGER            INTEGER
   MPI_REAL               REAL
   MPI_DOUBLE_PRECISION   DOUBLE PRECISION
   MPI_COMPLEX            COMPLEX
   MPI_LOGICAL            LOGICAL
   MPI_CHARACTER          CHARACTER

Table 1: Predefined MPI Datatypes (Fortran)
   MPI Type             C Type
   ------------------   ------------------
   MPI_CHAR             signed char
   MPI_SHORT            signed short
   MPI_INT              signed int
   MPI_LONG             signed long int
   MPI_UNSIGNED_CHAR    unsigned char
   MPI_UNSIGNED_SHORT   unsigned short
   MPI_UNSIGNED         unsigned int
   MPI_UNSIGNED_LONG    unsigned long int
   MPI_FLOAT            float
   MPI_DOUBLE           double
   MPI_LONG_DOUBLE      long double

Table 2: Predefined MPI Datatypes (C)
3.3 Other Possibilities

   - On the reception of a message, the rank of the sender and the tag can be replaced respectively by the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards.
   - A communication with the dummy process of rank MPI_PROC_NULL has no effect.
   - MPI_STATUS_IGNORE is a predefined constant that can be used instead of the status variable.
   - MPI_SUCCESS is a predefined constant which allows testing the return code of an MPI subroutine.
   - There are syntactic variants, MPI_SENDRECV() and MPI_SENDRECV_REPLACE(), which combine a send and a receive operation (for the former, the receive memory zone must be different from the send one).
   - It is possible to create more complex data structures using derived datatypes (see Chapter 6).
Figure 8: sendrecv Communication between the Processes 0 and 1

   program sendrecv
     use mpi
     implicit none
     integer            :: rank,value,num_proc,code
     integer, parameter :: tag=100

     call MPI_INIT(code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     ! We suppose that we have exactly 2 processes
     num_proc=mod(rank+1,2)

     call MPI_SENDRECV(rank+1000,1,MPI_INTEGER,num_proc,tag,value,1,MPI_INTEGER, &
                       num_proc,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE,code)

     print *,'I, process ',rank,', I have received ',value,' from the process ',num_proc

     call MPI_FINALIZE(code)
   end program sendrecv
> mpiexec -n 2 sendrecv
I, process 1, I have received 1000 from the process 0
I, process 0, I have received 1001 from the process 1

Warning! It must be noticed that if the MPI_SEND() subroutine is implemented in a synchronous way (see Chapter 7) in the MPI implementation used, the previous code would deadlock if, rather than using the MPI_SENDRECV() subroutine, we used an MPI_SEND() followed by an MPI_RECV(). In this case, each of the two processes would wait for a receive which would never occur, because the two sends would stay suspended. Therefore, for correctness and portability reasons, it is absolutely necessary to avoid such situations.

   call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc,tag,MPI_COMM_WORLD,code)
   call MPI_RECV(value,1,MPI_INTEGER,num_proc,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE,code)
   program joker
     use mpi
     implicit none
     integer, parameter                  :: m=4,tag1=11,tag2=22
     integer, dimension(m,m)             :: A
     integer                             :: nb_procs,rank,code
     integer, dimension(MPI_STATUS_SIZE) :: status
     integer                             :: nb_elements,i
     integer, dimension(:), allocatable  :: C

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     A(:,:) = 0

     if (rank == 0) then
       ! Initialisation of the matrix A on the process 0
       A(:,:) = reshape((/ (i,i=1,m*m) /), (/ m,m /))
       ! Sending of 2 elements of the matrix A to the process 1
       call MPI_SEND(A(1,1),2,MPI_INTEGER,1,tag1,MPI_COMM_WORLD,code)
       ! Sending of 3 elements of the matrix A to the process 2
       call MPI_SEND(A(1,2),3,MPI_INTEGER,2,tag2,MPI_COMM_WORLD,code)
     else
       ! We check before the receive if a message has arrived and from which process
       call MPI_PROBE(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status,code)
       ! We check how many elements must be received
       call MPI_GET_COUNT(status,MPI_INTEGER,nb_elements,code)
       ! We allocate the reception array C on each process
       if (nb_elements /= 0) allocate (C(1:nb_elements))
       ! We receive the message
       call MPI_RECV(C,nb_elements,MPI_INTEGER,status(MPI_SOURCE),status(MPI_TAG), &
                     MPI_COMM_WORLD,status,code)
       print *,'I, process ',rank,', I have received ',nb_elements, &
               ' elements from the process ',status(MPI_SOURCE),' My array C is ',C(:)
     end if

     call MPI_FINALIZE(code)
   end program joker
> mpiexec -n 3 joker
I, process 1, I have received 2 elements from the process 0 My array C is 1 2
I, process 2, I have received 3 elements from the process 0 My array C is 5 6 7
3.4 Example: communication ring

Figure 9: Communication Ring
If all the processes execute a send followed by a receive, all the communications could potentially start simultaneously and therefore will not occur in a ring way (besides the already-mentioned portability problem, if the implementation of MPI_SEND() is synchronous in the MPI library):

   ...
   value=rank+1000
   call MPI_SEND(value,1,MPI_INTEGER,num_proc_next,tag,MPI_COMM_WORLD,code)
   call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag,MPI_COMM_WORLD, &
                 status,code)
   ...
To ensure that the communications really happen in a ring way, like for the exchange of a token between processes, it is necessary to use a different method and to have one process which starts the chain:

Figure 10: Communication Ring
   program ring
     use mpi
     implicit none
     integer, dimension(MPI_STATUS_SIZE) :: status
     integer, parameter                  :: tag=100
     integer                             :: nb_procs,rank,value, &
                                            num_proc_previous,num_proc_next,code
     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     num_proc_next=mod(rank+1,nb_procs)
     num_proc_previous=mod(nb_procs+rank-1,nb_procs)

     if (rank == 0) then
       call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc_next,tag, &
                     MPI_COMM_WORLD,code)
       call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag, &
                     MPI_COMM_WORLD,status,code)
     else
       call MPI_RECV(value,1,MPI_INTEGER,num_proc_previous,tag, &
                     MPI_COMM_WORLD,status,code)
       call MPI_SEND(rank+1000,1,MPI_INTEGER,num_proc_next,tag, &
                     MPI_COMM_WORLD,code)
     end if

     print *,'I, process ',rank,', I have received ',value,' from process ', &
             num_proc_previous

     call MPI_FINALIZE(code)
   end program ring
> mpiexec -n 7 ring
I, process 1, I have received 1000 from process 0
I, process 2, I have received 1001 from process 1
I, process 3, I have received 1002 from process 2
I, process 4, I have received 1003 from process 3
I, process 5, I have received 1004 from process 4
I, process 6, I have received 1005 from process 5
I, process 0, I have received 1006 from process 6
4 Collective communications
4.1 General concepts

   - Collective communications allow a series of point-to-point communications to be made in one single call.
   - A collective communication always concerns all the processes of the indicated communicator.
   - For each process, the call ends when its participation in the collective call is completed, in the sense of point-to-point communications (when the concerned memory area can be changed). It is useless to add a global synchronization (barrier) after a collective call.
   - The management of tags in these communications is transparent and system-dependent. Therefore, they are never explicitly defined during the calling of these subroutines. This has, among other advantages, the consequence that collective communications never interfere with point-to-point communications.
There are three types of subroutines:
   ❶ the one which ensures the global synchronizations: MPI_BARRIER();
   ❷ the ones which only transfer data:
       - global distribution of data: MPI_BCAST();
       - selective distribution of data: MPI_SCATTER();
       - collection of distributed data: MPI_GATHER();
       - collection by all the processes of distributed data: MPI_ALLGATHER();
       - selective distribution, by all the processes, of distributed data: MPI_ALLTOALL();
   ❸ the ones which, in addition to the communications management, carry out operations on the transferred data:
       - reduction operations (sum, product, maximum, minimum, etc.), whether of a predefined or user-defined type: MPI_REDUCE();
       - reduction operations with broadcast of the result (in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()): MPI_ALLREDUCE().
4.2 Global synchronization: MPI_BARRIER()

Figure 11: Global Synchronization: MPI_BARRIER()

   integer, intent(out) :: code
   call MPI_BARRIER(MPI_COMM_WORLD,code)
4.3 Broadcast: MPI_BCAST()

Figure 12: Broadcast: MPI_BCAST()
   program bcast
     use mpi
     implicit none

     integer :: rank,value,code

     call MPI_INIT(code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     if (rank == 2) value=rank+1000

     call MPI_BCAST(value,1,MPI_INTEGER,2,MPI_COMM_WORLD,code)

     print *,'I, process ',rank,' I have received ',value,' of the process 2'

     call MPI_FINALIZE(code)

   end program bcast

> mpiexec -n 4 bcast
I, process 2 I have received 1002 of the process 2
I, process 0 I have received 1002 of the process 2
I, process 1 I have received 1002 of the process 2
I, process 3 I have received 1002 of the process 2
4.4 Scatter: MPI_SCATTER()

Figure 13: Scatter: MPI_SCATTER()
   program scatter
     use mpi
     implicit none
     integer, parameter              :: nb_values=8
     integer                         :: nb_procs,rank,block_length,i,code
     real, allocatable, dimension(:) :: values,data

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
     block_length=nb_values/nb_procs
     allocate(data(block_length))

     if (rank == 2) then
       allocate(values(nb_values))
       values(:)=(/(1000.+i,i=1,nb_values)/)
       print *,'I, process ',rank,' send my values array : ', &
               values(1:nb_values)
     end if

     call MPI_SCATTER(values,block_length,MPI_REAL,data,block_length, &
                      MPI_REAL,2,MPI_COMM_WORLD,code)

     print *,'I, process ',rank,', I have received ',data(1:block_length), &
             ' of the process 2'
     call MPI_FINALIZE(code)

   end program scatter

> mpiexec -n 4 scatter
I, process 2 send my values array : 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 0, I have received 1001. 1002. of the process 2
I, process 1, I have received 1003. 1004. of the process 2
I, process 3, I have received 1007. 1008. of the process 2
I, process 2, I have received 1005. 1006. of the process 2
4.5 Gather: MPI_GATHER()

Figure 14: Gather: MPI_GATHER()
   program gather
     use mpi
     implicit none
     integer, parameter              :: nb_values=8
     integer                         :: nb_procs,rank,block_length,i,code
     real, dimension(nb_values)      :: data
     real, allocatable, dimension(:) :: values

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     block_length=nb_values/nb_procs
     allocate(values(block_length))
     values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/)
     print *,'I, process ',rank,' send my values array : ', &
             values(1:block_length)

     call MPI_GATHER(values,block_length,MPI_REAL,data,block_length, &
                     MPI_REAL,2,MPI_COMM_WORLD,code)

     if (rank == 2) print *,'I, process 2, have received ',data(1:nb_values)

     call MPI_FINALIZE(code)

   end program gather

> mpiexec -n 4 gather
I, process 1 send my values array : 1003. 1004.
I, process 0 send my values array : 1001. 1002.
I, process 2 send my values array : 1005. 1006.
I, process 3 send my values array : 1007. 1008.
I, process 2 have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
4.6 Gather-to-all: MPI_ALLGATHER()

Figure 15: Allgather: MPI_ALLGATHER()
   program allgather
     use mpi
     implicit none
     integer, parameter              :: nb_values=8
     integer                         :: nb_procs,rank,block_length,i,code
     real, dimension(nb_values)      :: data
     real, allocatable, dimension(:) :: values

     call MPI_INIT(code)

     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     block_length=nb_values/nb_procs
     allocate(values(block_length))

     values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/)

     call MPI_ALLGATHER(values,block_length,MPI_REAL,data,block_length, &
                        MPI_REAL,MPI_COMM_WORLD,code)

     print *,'I, process ',rank,', I have received ',data(1:nb_values)

     call MPI_FINALIZE(code)

   end program allgather

> mpiexec -n 4 allgather
I, process 1, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 3, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 2, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 0, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
4.7 Extended gather: MPI_GATHERV()

Figure 16: Gatherv: MPI_GATHERV()
   PROGRAM gatherv
     USE mpi
     IMPLICIT NONE
     INTEGER, PARAMETER                 :: nb_values=10
     INTEGER                            :: nb_procs,rank,block_length,i,code,remainder
     REAL, DIMENSION(nb_values)         :: data
     REAL, ALLOCATABLE, DIMENSION(:)    :: values
     INTEGER, ALLOCATABLE, DIMENSION(:) :: nb_elements_received,displacement

     CALL MPI_INIT(code)
     CALL MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     CALL MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     block_length=nb_values/nb_procs
     remainder=mod(nb_values,nb_procs)
     if (remainder > rank) block_length=block_length+1
     ALLOCATE(values(block_length))
     values(:) = (/(1000.+(rank*(nb_values/nb_procs))+min(rank,remainder)+i, &
                  i=1,block_length)/)

     PRINT *,'I, process ',rank,' send my values array : ', &
             values(1:block_length)

     IF (rank == 2) THEN
       ALLOCATE(nb_elements_received(nb_procs),displacement(nb_procs))
       nb_elements_received(1) = nb_values/nb_procs
       if (remainder > 0) nb_elements_received(1)=nb_elements_received(1)+1
       displacement(1) = 0
       DO i=2,nb_procs
         displacement(i) = displacement(i-1)+nb_elements_received(i-1)
         nb_elements_received(i) = nb_values/nb_procs
         if (remainder > i-1) nb_elements_received(i)=nb_elements_received(i)+1
       END DO
     END IF

     CALL MPI_GATHERV(values,block_length,MPI_REAL,data,nb_elements_received, &
                      displacement,MPI_REAL,2,MPI_COMM_WORLD,code)

     IF (rank == 2) PRINT *,'I, process 2 receives ',data(1:nb_values)

     CALL MPI_FINALIZE(code)
   end program gatherv

> mpiexec -n 4 gatherv
I, process 0 send my values array : 1001. 1002. 1003.
I, process 2 send my values array : 1007. 1008.
I, process 3 send my values array : 1009. 1010.
I, process 1 send my values array : 1004. 1005. 1006.
I, process 2 receives 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008. 1009. 1010.
4.8 All-to-all: MPI_ALLTOALL()

Figure 17: Alltoall: MPI_ALLTOALL()
   program alltoall
     use mpi
     implicit none

     integer, parameter         :: nb_values=8
     integer                    :: nb_procs,rank,block_length,i,code
     real, dimension(nb_values) :: values,data

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     values(:)=(/(1000.+rank*nb_values+i,i=1,nb_values)/)
     block_length=nb_values/nb_procs

     print *,'I, process ',rank,' send my values array : ', &
             values(1:nb_values)

     call MPI_ALLTOALL(values,block_length,MPI_REAL,data,block_length, &
                       MPI_REAL,MPI_COMM_WORLD,code)

     print *,'I, process ',rank,', I have received ',data(1:nb_values)

     call MPI_FINALIZE(code)
   end program alltoall
> mpiexec -n 4 alltoall
I, process 1 send my values array : 1009. 1010. 1011. 1012. 1013. 1014. 1015. 1016.
I, process 0 send my values array : 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 2 send my values array : 1017. 1018. 1019. 1020. 1021. 1022. 1023. 1024.
I, process 3 send my values array : 1025. 1026. 1027. 1028. 1029. 1030. 1031. 1032.
I, process 0, I have received 1001. 1002. 1009. 1010. 1017. 1018. 1025. 1026.
I, process 2, I have received 1005. 1006. 1013. 1014. 1021. 1022. 1029. 1030.
I, process 1, I have received 1003. 1004. 1011. 1012. 1019. 1020. 1027. 1028.
I, process 3, I have received 1007. 1008. 1015. 1016. 1023. 1024. 1031. 1032.
4.9 Global reduction

   - A reduction is an operation applied to a set of elements in order to obtain one single value. Classical examples are the sum of the elements of a vector (SUM(A(:))) or the search of the maximum-value element in a vector (MAX(V(:))).
   - MPI proposes high-level subroutines in order to operate reductions on data distributed over a group of processes. The result is obtained on one process (MPI_REDUCE()) or on all of them (MPI_ALLREDUCE(), which is in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()).
   - If several elements are involved per process, the reduction function is applied to each one of them.
   - The MPI_SCAN() subroutine also allows partial reductions to be made by considering, for each process, the previous processes of the group and itself.
   - The MPI_OP_CREATE() and MPI_OP_FREE() subroutines allow user-defined reduction operations.
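As an illustration of MPI_SCAN(), here is a minimal sketch in the style of the previous examples (the program name and the choice of the rank as the contribution of each process are ours, not from the course):

```fortran
program scan
  use mpi
  implicit none
  integer :: rank,partial_sum,code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Each process contributes its rank; MPI_SCAN() returns, on the process
  ! of rank i, the reduction of the contributions of processes 0 to i
  ! (an inclusive prefix sum here, since the operation is MPI_SUM).
  call MPI_SCAN(rank,partial_sum,1,MPI_INTEGER,MPI_SUM,MPI_COMM_WORLD,code)

  print *,'I, process ',rank,', I have the partial sum ',partial_sum

  call MPI_FINALIZE(code)
end program scan
```

With 4 processes, process 3 would obtain 6 (0+1+2+3) while process 1 would obtain only 1 (0+1).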
Table 3: Main Predefined Reduction Operations (there are also other logical operations)

   Name         Operation
   ----------   --------------------------------
   MPI_SUM      Sum of elements
   MPI_PROD     Product of elements
   MPI_MAX      Maximum of elements
   MPI_MIN      Minimum of elements
   MPI_MAXLOC   Maximum of elements and location
   MPI_MINLOC   Minimum of elements and location
   MPI_LAND     Logical AND
   MPI_LOR      Logical OR
   MPI_LXOR     Logical exclusive OR
Figure 18: Distributed reduction (sum)
   program reduce
     use mpi
     implicit none
     integer :: nb_procs,rank,value,sum,code

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     if (rank == 0) then
       value=1000
     else
       value=rank
     endif

     call MPI_REDUCE(value,sum,1,MPI_INTEGER,MPI_SUM,0,MPI_COMM_WORLD,code)

     if (rank == 0) then
       print *,'I, process 0, I have the global sum value ',sum
     end if

     call MPI_FINALIZE(code)
   end program reduce

> mpiexec -n 7 reduce
I, process 0, I have the global sum value 1021
Figure 19: Distributed reduction (product) with broadcast of the result
   program allreduce
     use mpi
     implicit none

     integer :: nb_procs,rank,value,product,code

     call MPI_INIT(code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
     call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

     if (rank == 0) then
       value=10
     else
       value=rank
     endif

     call MPI_ALLREDUCE(value,product,1,MPI_INTEGER,MPI_PROD,MPI_COMM_WORLD,code)

     print *,'I, process ',rank,', I have received the value of the global product ',product

     call MPI_FINALIZE(code)

   end program allreduce
> mpiexec -n 7 allreduce
I, process 6, I have received the value of the global product 7200
I, process 2, I have received the value of the global product 7200
I, process 0, I have received the value of the global product 7200
I, process 4, I have received the value of the global product 7200
I, process 5, I have received the value of the global product 7200
I, process 3, I have received the value of the global product 7200
I, process 1, I have received the value of the global product 7200
64/236 4 Collective communications 4.10 Additions

The MPI_SCATTERV(), MPI_GATHERV(), MPI_ALLGATHERV() and MPI_ALLTOALLV() subroutines extend MPI_SCATTER(), MPI_GATHER(), MPI_ALLGATHER() and MPI_ALLTOALL() to the case where the number of elements to send or gather varies from process to process.

Two other subroutines extend the possibilities of the collective subroutines in some particular cases:
MPI_ALLTOALLW() : a version of MPI_ALLTOALLV() in which the displacements are expressed in bytes rather than in elements;
MPI_EXSCAN() : an exclusive version of MPI_SCAN().
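As a minimal sketch of the variable-count case (an illustrative program, not one of the course examples): process i contributes i+1 elements to an MPI_GATHERV(), and the root supplies the per-process counts and the displacements at which each contribution is placed in the receive buffer.

```fortran
program gatherv_sketch
  use mpi
  implicit none
  integer :: nb_procs, rank, code, i
  integer, dimension(:), allocatable :: send_buf, counts, displs, recv_buf

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Process of rank i sends i+1 copies of its rank
  allocate(send_buf(rank+1)); send_buf(:) = rank

  ! The root must know how many elements each process sends (counts)
  ! and where each contribution starts in the receive buffer (displs)
  allocate(counts(nb_procs), displs(nb_procs))
  counts(:) = (/ (i, i=1,nb_procs) /)
  displs(1) = 0
  do i = 2, nb_procs
    displs(i) = displs(i-1) + counts(i-1)
  end do
  allocate(recv_buf(sum(counts)))

  call MPI_GATHERV(send_buf, rank+1, MPI_INTEGER, recv_buf, counts, displs, &
                   MPI_INTEGER, 0, MPI_COMM_WORLD, code)

  if (rank == 0) print *, recv_buf

  call MPI_FINALIZE(code)
end program gatherv_sketch
```

With n processes, the root receives 0, then 1 1, then 2 2 2, and so on, packed back to back thanks to the displacements.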
65/236 5 One-sided Communication 5.1 Introduction

There are various approaches for transferring data between two different processes. Among the most commonly used are:
➊ point-to-point communication by message-passing (MPI, etc.);
➋ one-sided communication (direct access to the memory of a remote process). Also called RMA, for Remote Memory Access, it is one of the major contributions of MPI.
66/236 5 One-sided Communication 5.1 Introduction

Reminder: The Concept of Message-Passing

Figure 20: Message-Passing ("I send" on the source process, "I receive" on the target process)

In message-passing, a sender (origin) sends a message to a destination process (target), which must do whatever is necessary to receive it. This requires both the sender and the receiver to be involved in the communication, which can be restrictive and difficult to implement in some algorithms (for example, when a global counter has to be managed).
67/236 5 One-sided Communication 5.1 Introduction

The Concept of One-sided Communication

The concept of one-sided communication is not new; MPI simply unified the existing vendor solutions (such as SHMEM (Cray), LAPI (IBM), ...) by offering its own RMA primitives. Through these subroutines, a process has direct access (read, write or update) to the memory of another, remote process. In this approach, the remote process does not have to participate in the data transfer. The principal advantages are the following:
better performance when the hardware allows it;
simpler programming for some algorithms.
68/236 5 One-sided Communication 5.1 Introduction

RMA Approach of MPI

The use of MPI RMA is done in three steps:
➊ definition, on each process, of a memory area (the local memory window) visible and potentially accessible to the remote processes;
➋ start of the data transfer, directly from the memory of one process to the memory of another; it is therefore necessary to specify the type and number of data as well as their source and destination locations;
➌ completion of the pending transfers by a synchronization step; the data are then available.
69/236 5 One-sided Communication 5.2 Memory Window

All the processes participating in a one-sided communication have to specify which part of their memory will be made available to the other processes; this is the notion of the memory window.
More precisely, the collective operation MPI_WIN_CREATE() creates an MPI window object. This object is composed, for each process, of a specific memory area called the local memory window. Each local memory window is characterized by its initial address, its size in bytes (which can be zero) and the size of the displacement unit inside the window (in bytes). These characteristics can differ from one process to another.
70/236 5 One-sided Communication 5.2 Memory Window

Figure 21: Creation of two MPI window objects, win1 and win2 (each of processes 0 and 1 contributes a first local window to win1 and a second local window to win2)
71/236 5 One-sided Communication 5.2 Memory Window

Once the transfers are finished, the window has to be freed with the MPI_WIN_FREE() subroutine.
MPI_WIN_GET_ATTR() returns the characteristics of a local memory window via the keywords MPI_WIN_BASE, MPI_WIN_SIZE and MPI_WIN_DISP_UNIT.

Remark: The choice of the displacement unit associated with the local memory window is important (essential in a heterogeneous environment, and it simplifies the coding in all cases). The size of an MPI datatype is obtained by calling the MPI_TYPE_SIZE() subroutine.
72/236 5 One-sided Communication 5.2 Memory Window

program window
  use mpi
  implicit none

  integer :: code, rank, size_real, win, n=4
  integer (kind=MPI_ADDRESS_KIND) :: dim_win, size, base, unit
  real(kind=kind(1.d0)), dimension(:), allocatable :: win_local
  logical :: flag

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
  call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION,size_real,code)

  if (rank==0) n=0
  allocate(win_local(n))
  dim_win = size_real*n

  call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, code)
  call MPI_WIN_GET_ATTR(win, MPI_WIN_SIZE, size, flag, code)
  call MPI_WIN_GET_ATTR(win, MPI_WIN_BASE, base, flag, code)
  call MPI_WIN_GET_ATTR(win, MPI_WIN_DISP_UNIT, unit, flag, code)
  call MPI_WIN_FREE(win,code)
  print *,"process", rank,"size, base, unit = ", size, base, unit
  call MPI_FINALIZE(code)
end program window

> mpiexec -n 3 window
process 1 size, base, unit =
process 0 size, base, unit =
process 2 size, base, unit =
73/236 5 One-sided Communication 5.3 Data Transfer

MPI allows a process to read (MPI_GET()), write (MPI_PUT()) and update (MPI_ACCUMULATE()) data located in the local memory window of a remote process.
The process that calls the transfer-initiation subroutine is called the origin; the process whose local memory window is used in the transfer is called the target.
During the initiation of the transfer, the target process does not call any MPI subroutine. All the necessary information is specified as parameters of the MPI call made by the origin.
74/236 5 One-sided Communication 5.3 Data Transfer

In particular, we find:
parameters related to the origin:
  the datatype of the elements;
  their number;
  the memory address of the first element.
parameters related to the target:
  the rank of the target and the MPI window object, which together uniquely determine a local memory window;
  a displacement in this local window;
  the number of elements to transfer and their datatype.
75/236 5 One-sided Communication 5.3 Data Transfer

program example_put
  use mpi
  implicit none
  integer :: nb_orig=10, nb_target=10, target=1, win, code
  integer (kind=MPI_ADDRESS_KIND) :: displacement=40
  integer, dimension(10) :: B
  ...
  call MPI_PUT(B,nb_orig,MPI_INTEGER,target,displacement,nb_target,MPI_INTEGER,win,code)
  ...
end program example_put

Figure 22: Example of an MPI_PUT (B(1:10) of process 0 is written at a displacement of 40 units in the first local window of process 1)
76/236 5 One-sided Communication 5.3 Data Transfer

Remarks:
The MPI_GET() syntax is identical to the MPI_PUT() syntax; only the direction of the data transfer is reversed.
The RMA data-transfer subroutines are nonblocking primitives (a deliberate choice of MPI).
On the target process, the only accessible data are the ones contained in the local memory window.
MPI_ACCUMULATE() takes as one of its parameters an operation, which must be either MPI_REPLACE or one of the predefined reduction operations: MPI_SUM, MPI_PROD, MPI_MAX, etc. It can never be a user-defined operation.
77/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

The data transfer starts after the call to one of the nonblocking subroutines (MPI_PUT(), ...). But when is the transfer completed and the data really available? After a synchronization by the programmer. These synchronizations can be classified into two types:
➊ active target synchronization (a collective operation: all the processes associated with the window take part in the synchronization);
➋ passive target synchronization (only the origin process calls the synchronization subroutine).
78/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Active Target Synchronization

It is done with the MPI_WIN_FENCE() subroutine.
MPI_WIN_FENCE() is a collective operation over all the processes associated with the MPI window object. It acts as a synchronization barrier: it waits for the end of all the data transfers (RMA or not) that use the local memory window and were initiated since the last call to MPI_WIN_FENCE(). This primitive helps separate the computation parts of the code (where data of the local memory window are used via load or store) from the RMA data-transfer parts.
An integer assert argument of the MPI_WIN_FENCE() primitive allows its behavior to be refined for better performance. Several values are predefined: MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE, MPI_MODE_NOSUCCEED. A zero value for this argument is always valid.
79/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Remarks:
Choosing nonblocking RMA subroutines for the initiation of the transfers, with a synchronization for their completion, allows the implementation to gather, at run time, several transfers towards the same target into a single one. Latency effects are therefore reduced and performance is improved.
The collective character of the synchronization means that this is no longer really one-sided communication... All the processes of the communicator have to participate in the synchronization, which makes it less interesting!
80/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

The Proper Use of MPI_WIN_FENCE()

It is important to make sure that, between two successive calls to MPI_WIN_FENCE(), there are either only local operations (load/store) on variables in the local memory window of the process, or only RMA operations (MPI_PUT(), MPI_ACCUMULATE()), but never both at once!

(Diagram: processes 0 and 1 separated by successive MPI_WIN_FENCE() calls; process 1 performs a store win_loc(:) = ..., process 0 performs an MPI_PUT() towards process 1, followed by a code portion marked 1.)

Is the previous program compliant with the proper use of MPI_WIN_FENCE()? It all depends on the code portion marked 1. If it does not perform a load/store on the local window (assignment or use of a variable stored in the local window), then it is correct; otherwise, the result is undefined.
81/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Figure 23: Example corresponding to the ex_fence program (state of tab on P0 and of win_local on P1 between the successive MPI_WIN_FENCE() calls; the MPI_PUT() corresponds to lines 31-35, the local sums to lines 38-42, the MPI_GET() to lines 45-49, the local updates to lines 51-55 and the MPI_ACCUMULATE() to lines 58-63 of the listing)
82/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

 1 program ex_fence
 2   use mpi
 3   implicit none
 4
 5   integer, parameter :: assert=0
 6   integer :: code, rank, size_real, win, i, nb_elements, target, m=4, n=4
 7   integer (kind=MPI_ADDRESS_KIND) :: displacement, dim_win
 8   real(kind=kind(1.d0)), dimension(:), allocatable :: win_local, tab
 9
10   call MPI_INIT(code)
11   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
12   call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION,size_real,code)
13
14   if (rank==0) then
15     n=0
16     allocate(tab(m))
17   endif
18
19   allocate(win_local(n))
20   dim_win = size_real*n
21
22   call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
23                       MPI_COMM_WORLD, win, code)
83/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

24   if (rank==0) then
25     tab(:) = (/ (i, i=1,m) /)
26   else
27     win_local(:) = 0.
28   end if
29
30   call MPI_WIN_FENCE(assert,win,code)
31   if (rank==0) then
32     target = 1; nb_elements = 2; displacement = 1
33     call MPI_PUT(tab, nb_elements, MPI_DOUBLE_PRECISION, target, displacement, &
34                  nb_elements, MPI_DOUBLE_PRECISION, win, code)
35   end if
36
37   call MPI_WIN_FENCE(assert,win,code)
38   if (rank==0) then
39     tab(m) = sum(tab(1:m-1))
40   else
41     win_local(n) = sum(win_local(1:n-1))
42   endif
43
44   call MPI_WIN_FENCE(assert,win,code)
45   if (rank==0) then
46     nb_elements = 1; displacement = m-1
47     call MPI_GET(tab, nb_elements, MPI_DOUBLE_PRECISION, target, displacement, &
48                  nb_elements, MPI_DOUBLE_PRECISION, win, code)
49   end if
84/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

50   call MPI_WIN_FENCE(assert,win,code)
51   if (rank==0) then
52     tab(m) = sum(tab(1:m-1))
53   else
54     win_local(:) = win_local(:)
55   endif
56
57   call MPI_WIN_FENCE(assert,win,code)
58   if (rank==0) then
59     nb_elements = m-1; displacement = 1
60     call MPI_ACCUMULATE(tab(2), nb_elements, MPI_DOUBLE_PRECISION, target, &
61                         displacement, nb_elements, MPI_DOUBLE_PRECISION, &
62                         MPI_SUM, win, code)
63   end if
64
65   call MPI_WIN_FENCE(assert,win,code)
66   call MPI_WIN_FREE(win,code)
67
68   if (rank==0) then
69     print *,"process", rank, "tab=",tab(:)
70   else
71     print *,"process", rank, "win_local=",win_local(:)
72   endif
73
74   call MPI_FINALIZE(code)
75 end program ex_fence
85/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Some Details and Restrictions...

It is possible to work on different local memory windows which overlap, even though this is not recommended (such use implies many restrictions). In what follows, we will assume that this is never the case.
It is always necessary to separate, by a call to MPI_WIN_FENCE(), a store from a call to MPI_PUT() or MPI_ACCUMULATE() that accesses the same local memory window, even at different locations that do not overlap.
Between two successive calls to MPI_WIN_FENCE(), the following constraints hold:
the MPI_PUT() subroutines do not allow overlapping inside the same local memory window; in other words, the memory areas involved in several MPI_PUT() calls that modify the same local memory window must not overlap;
86/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

the MPI_ACCUMULATE() subroutines allow overlapping inside the same local memory window, provided that the datatypes and the reduction operations used are identical in all of these calls;
consecutive calls to MPI_PUT() and MPI_ACCUMULATE() do not allow overlapping inside the same local memory window;
a load and a call to the MPI_GET() subroutine may concurrently access any part of the local window, provided that this part has not been updated beforehand either by a store or by a call to MPI_PUT() or MPI_ACCUMULATE().
87/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Passive Target Synchronization

It is done via calls to the MPI_WIN_LOCK() and MPI_WIN_UNLOCK() subroutines. Unlike synchronization by MPI_WIN_FENCE() (which is a collective operation of barrier type), here only the origin process participates in the synchronization. Consequently, all the calls necessary to the data transfer (initiation of the transfer, synchronization) are made by the origin process alone; this is true one-sided communication.
The lock and unlock operations apply only to a given local memory window (i.e. one identified by a target process rank and an MPI window object). The period which starts at the lock and ends at the unlock is called an access period to the local memory window. It is only during this period that the origin process can access the local memory window of the target process.
88/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

To use it, the origin process simply surrounds the calls to the RMA transfer-initiation primitives with MPI_WIN_LOCK() and MPI_WIN_UNLOCK(). For the target process, no MPI subroutine call is necessary. When MPI_WIN_UNLOCK() returns control, all the data transfers initiated after the MPI_WIN_LOCK() are completed.
The first argument of MPI_WIN_LOCK() specifies whether several simultaneous accesses via RMA operations on the same local memory window are allowed (MPI_LOCK_SHARED) or not (MPI_LOCK_EXCLUSIVE).
A basic use of passive target synchronization consists of creating blocking RMA versions (put, get, accumulate) without the target having to call MPI subroutines.
89/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

subroutine get_blocking(orig_addr, orig_count, orig_datatype, target_rank, &
                        target_disp, target_count, target_datatype, win, code)
  integer, intent(in) :: orig_count, orig_datatype, target_rank, target_count, &
                         target_datatype, win
  integer, intent(out) :: code
  integer(kind=MPI_ADDRESS_KIND), intent(in) :: target_disp
  real(kind=kind(1.d0)), dimension(:) :: orig_addr

  call MPI_WIN_LOCK(MPI_LOCK_SHARED, target_rank, 0, win, code)
  call MPI_GET(orig_addr, orig_count, orig_datatype, target_rank, target_disp, &
               target_count, target_datatype, win, code)
  call MPI_WIN_UNLOCK(target_rank, win, code)
end subroutine get_blocking
90/236 5 One-sided Communication 5.4 Completion of the Transfer: the synchronization

Remark Concerning Fortran Codes

For portability, when using passive target synchronization (MPI_WIN_LOCK(), MPI_WIN_UNLOCK()), the memory window should be allocated with MPI_ALLOC_MEM(). This function takes as argument a C-style pointer (i.e. a Cray pointer in Fortran, which does not belong to the Fortran 95 standard). Where Cray pointers are not available, a C program must be used to allocate the memory window...
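With a more recent MPI library, the TYPE(C_PTR) overload of MPI_ALLOC_MEM() introduced by MPI-3 offers a standard-Fortran alternative to Cray pointers. A hedged sketch (illustrative names, assuming that overload is available in the mpi module):

```fortran
program alloc_window
  use mpi
  use, intrinsic :: iso_c_binding
  implicit none

  integer, parameter :: n=10
  integer :: code, size_real, win
  integer(kind=MPI_ADDRESS_KIND) :: dim_win
  type(c_ptr) :: ptr
  real(kind=kind(1.d0)), dimension(:), pointer :: win_local

  call MPI_INIT(code)
  call MPI_TYPE_SIZE(MPI_DOUBLE_PRECISION, size_real, code)
  dim_win = size_real*n

  ! Allocate memory suitable for RMA, then view it as a Fortran array
  call MPI_ALLOC_MEM(dim_win, MPI_INFO_NULL, ptr, code)
  call c_f_pointer(ptr, win_local, (/ n /))

  call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, code)

  ! ... MPI_WIN_LOCK() / MPI_PUT() / MPI_WIN_UNLOCK() ...

  call MPI_WIN_FREE(win, code)
  call MPI_FREE_MEM(win_local, code)
  call MPI_FINALIZE(code)
end program alloc_window
```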
91/236 5 One-sided Communication 5.5 Conclusions

The RMA concepts of MPI are difficult to implement in non-trivial applications. An in-depth knowledge of the standard is necessary to avoid its many pitfalls, and performance can vary widely from one implementation to another.
The advantage of the RMA concept of MPI lies essentially in the passive target approach. It is only in this case that the use of RMA subroutines is really essential (e.g. an application requiring that a process access data belonging to a remote process without interrupting the latter...).
92/236 6 Derived datatypes 6.1 Introduction

In communications, the exchanged data are typed: MPI_INTEGER, MPI_REAL, MPI_COMPLEX, etc.
More complex data structures can be created by using subroutines such as MPI_TYPE_CONTIGUOUS(), MPI_TYPE_VECTOR(), MPI_TYPE_CREATE_HVECTOR().
Every derived datatype must be validated with the MPI_TYPE_COMMIT() subroutine before it can be used. To reuse the same name for another derived datatype, it must first be freed with the MPI_TYPE_FREE() subroutine.
93 93/236 6 Derived datatypes 6.1 Introduction MPI_TYPE_CREATE_STRUCT MPI_TYPE_[CREATE_H]INDEXED MPI_TYPE_[CREATE_H]VECTOR MPI_TYPE_CONTIGUOUS MPI_REAL, MPI_INTEGER, MPI_LOGICAL Figure 24: Hierarchy of the MPI constructors
94/236 6 Derived datatypes 6.2 Contiguous datatypes

MPI_TYPE_CONTIGUOUS() creates a data structure from a homogeneous set of existing datatypes contiguous in memory.

call MPI_TYPE_CONTIGUOUS(5,MPI_REAL,new_type,code)

Figure 25: The MPI_TYPE_CONTIGUOUS subroutine

integer, intent(in) :: count, old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_CONTIGUOUS(count,old_type,new_type,code)
95/236 6 Derived datatypes 6.3 Constant stride

MPI_TYPE_VECTOR() creates a data structure from a homogeneous set of existing datatypes separated by a constant stride in memory. The stride is given in number of elements.

call MPI_TYPE_VECTOR(6,1,5,MPI_REAL,new_type,code)

Figure 26: The MPI_TYPE_VECTOR subroutine

integer, intent(in) :: count, block_length
integer, intent(in) :: stride        ! given in elements
integer, intent(in) :: old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_VECTOR(count,block_length,stride,old_type,new_type,code)
96/236 6 Derived datatypes 6.3 Constant stride

MPI_TYPE_CREATE_HVECTOR() creates a data structure from a homogeneous set of existing datatypes separated by a constant stride in memory, the stride being given in bytes. This call is useful when the old type is no longer a base datatype (MPI_INTEGER, MPI_REAL, ...) but a more complex datatype constructed with MPI subroutines, because in that case the stride can no longer be expressed as a number of elements.

integer, intent(in) :: count, block_length
integer(kind=MPI_ADDRESS_KIND), intent(in) :: stride   ! given in bytes
integer, intent(in) :: old_type
integer, intent(out) :: new_type, code
call MPI_TYPE_CREATE_HVECTOR(count,block_length,stride,old_type,new_type,code)
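As a hedged illustration (the names type_column, type_every_other, nb_lines and nb_columns are invented for this sketch), the byte stride can be computed from the size of the base datatype, here to pick out one column out of every two of an nb_lines x nb_columns real matrix:

```fortran
! Sketch: a derived type selecting every other column of a real matrix
integer :: type_column, type_every_other, size_real, code
integer(kind=MPI_ADDRESS_KIND) :: stride

! One column is nb_lines contiguous reals
call MPI_TYPE_CONTIGUOUS(nb_lines, MPI_REAL, type_column, code)

! The stride between two selected columns, expressed in bytes
call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
stride = 2 * nb_lines * size_real

call MPI_TYPE_CREATE_HVECTOR(nb_columns/2, 1, stride, type_column, &
                             type_every_other, code)
call MPI_TYPE_COMMIT(type_every_other, code)
```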
97/236 6 Derived datatypes 6.4 Other subroutines

Before using a new derived datatype, it is necessary to validate it with the MPI_TYPE_COMMIT() subroutine.

integer, intent(inout) :: new_type
integer, intent(out) :: code
call MPI_TYPE_COMMIT(new_type,code)

The freeing of a derived datatype is done with the MPI_TYPE_FREE() subroutine.

integer, intent(inout) :: new_type
integer, intent(out) :: code
call MPI_TYPE_FREE(new_type,code)
98/236 6 Derived datatypes 6.5 Examples

The datatype "matrix column"

program column
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer :: rank,code,type_column

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Definition of the type_column datatype
  call MPI_TYPE_CONTIGUOUS(nb_lines,MPI_REAL,type_column,code)

  ! Validation of the type_column datatype
  call MPI_TYPE_COMMIT(type_column,code)
99/236 6 Derived datatypes 6.5 Examples

  ! Sending of the first column
  if ( rank == 0 ) then
    call MPI_SEND(a(1,1),1,type_column,1,tag,MPI_COMM_WORLD,code)

  ! Reception in the last column
  elseif ( rank == 1 ) then
    call MPI_RECV(a(1,nb_columns),nb_lines,MPI_REAL,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Free the datatype
  call MPI_TYPE_FREE(type_column,code)

  call MPI_FINALIZE(code)

end program column
100/236 6 Derived datatypes 6.5 Examples

The datatype "matrix line"

program line
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer :: rank,code,type_line

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Definition of the datatype type_line
  call MPI_TYPE_VECTOR(nb_columns,1,nb_lines,MPI_REAL,type_line,code)

  ! Validation of the datatype type_line
  call MPI_TYPE_COMMIT(type_line,code)
101/236 6 Derived datatypes 6.5 Examples

  ! Sending of the second line
  if ( rank == 0 ) then
    call MPI_SEND(a(2,1),1,type_line,1,tag,MPI_COMM_WORLD,code)

  ! Reception in the next-to-last line
  elseif ( rank == 1 ) then
    call MPI_RECV(a(nb_lines-1,1),1,type_line,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Free the datatype type_line
  call MPI_TYPE_FREE(type_line,code)

  call MPI_FINALIZE(code)

end program line
102/236 6 Derived datatypes 6.5 Examples

The datatype "matrix block"

program block
  use mpi
  implicit none

  integer, parameter :: nb_lines=5,nb_columns=6
  integer, parameter :: tag=100
  integer, parameter :: nb_lines_block=2,nb_columns_block=3
  real, dimension(nb_lines,nb_columns) :: a
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer :: rank,code,type_block

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = real(rank)

  ! Creation of the datatype type_block
  call MPI_TYPE_VECTOR(nb_columns_block,nb_lines_block,nb_lines,&
                       MPI_REAL,type_block,code)

  ! Validation of the datatype type_block
  call MPI_TYPE_COMMIT(type_block,code)
103/236 6 Derived datatypes 6.5 Examples

  ! Sending of a block
  if ( rank == 0 ) then
    call MPI_SEND(a(1,1),1,type_block,1,tag,MPI_COMM_WORLD,code)

  ! Reception of the block
  elseif ( rank == 1 ) then
    call MPI_RECV(a(nb_lines-1,nb_columns-2),1,type_block,0,tag,&
                  MPI_COMM_WORLD,status,code)
  end if

  ! Freeing of the datatype type_block
  call MPI_TYPE_FREE(type_block,code)

  call MPI_FINALIZE(code)

end program block
104/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

MPI_TYPE_INDEXED() creates a data structure composed of a sequence of blocks containing a variable number of elements separated by variable strides in memory, the strides being given in numbers of elements.
MPI_TYPE_CREATE_HINDEXED() has the same functionality as MPI_TYPE_INDEXED(), except that the strides separating two data blocks are given in bytes. This subroutine is useful when the old datatype is not an MPI base datatype (MPI_INTEGER, MPI_REAL, ...), in which case the strides cannot be given as numbers of elements of the old datatype.
For MPI_TYPE_CREATE_HINDEXED(), as for MPI_TYPE_CREATE_HVECTOR(), use MPI_TYPE_SIZE() or MPI_TYPE_GET_EXTENT() to obtain the strides in bytes in a portable way.
105/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

nb=3, block_lengths=(2,1,3), displacements=(0,3,7)

Figure 27: The MPI_TYPE_INDEXED constructor

integer,intent(in) :: nb
integer,intent(in),dimension(nb) :: block_lengths
! Attention: the displacements are given in elements
integer,intent(in),dimension(nb) :: displacements
integer,intent(in) :: old_type
integer,intent(out) :: new_type,code
call MPI_TYPE_INDEXED(nb,block_lengths,displacements,old_type,new_type,code)
106/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

nb=4, block_lengths=(2,1,2,1), displacements=(2,10,14,24)

Figure 28: The MPI_TYPE_CREATE_HINDEXED constructor

integer,intent(in) :: nb
integer,intent(in),dimension(nb) :: block_lengths
! Attention: the displacements are given in bytes
integer(kind=MPI_ADDRESS_KIND),intent(in),dimension(nb) :: displacements
integer,intent(in) :: old_type
integer,intent(out) :: new_type,code
call MPI_TYPE_CREATE_HINDEXED(nb,block_lengths,displacements,old_type,new_type,code)
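A hedged fragment (the names displ_elements, old_type and new_type are invented for this sketch) showing how the byte displacements can be obtained portably from element displacements via MPI_TYPE_GET_EXTENT():

```fortran
! Sketch: converting element displacements to byte displacements
integer :: nb, old_type, new_type, code
integer, dimension(nb) :: block_lengths, displ_elements
integer(kind=MPI_ADDRESS_KIND) :: lb, extent
integer(kind=MPI_ADDRESS_KIND), dimension(nb) :: displacements

! Extent of the old datatype, in bytes
call MPI_TYPE_GET_EXTENT(old_type, lb, extent, code)

! Elements -> bytes
displacements(:) = displ_elements(:) * extent

call MPI_TYPE_CREATE_HINDEXED(nb, block_lengths, displacements, &
                              old_type, new_type, code)
call MPI_TYPE_COMMIT(new_type, code)
```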
107/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

In the following example, each of the two processes:
➀ initializes its matrix (positive increasing numbers on process 0 and negative decreasing numbers on process 1);
➁ constructs its datatype: a triangular matrix (upper for process 0 and lower for process 1);
➂ sends its triangular matrix to the other process and receives a triangular matrix, which it stores in place of the one it has sent, via the MPI_SENDRECV_REPLACE() subroutine;
➃ frees its resources and exits MPI.
108/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

Figure 29: Exchange between the two processes (matrices of processes 0 and 1, before and after the exchange)
109/236 6 Derived datatypes 6.6 Homogenous datatypes of variable strides

program triangle
  use mpi
  implicit none
  integer,parameter :: n=8,tag=100
  real,dimension(n,n) :: a
  integer,dimension(MPI_STATUS_SIZE) :: status
  integer :: i,code
  integer :: rank,type_triangle
  integer,dimension(n) :: block_lengths,displacements

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the matrix on each process
  a(:,:) = reshape( (/ (sign(i,-rank),i=1,n*n) /), (/n,n/))

  ! Creation of the upper triangular datatype for process 0
  ! and of the lower triangular datatype for process 1
  if (rank == 0) then
    block_lengths(:) = (/ (i-1,i=1,n) /)
    displacements(:) = (/ (n*(i-1),i=1,n) /)
  else
    block_lengths(:) = (/ (n-i,i=1,n) /)
    displacements(:) = (/ (n*(i-1)+i,i=1,n) /)
  endif

  call MPI_TYPE_INDEXED(n,block_lengths,displacements,MPI_REAL,type_triangle,code)
  call MPI_TYPE_COMMIT(type_triangle,code)

  ! Swap of the lower and upper triangular matrices
  call MPI_SENDRECV_REPLACE(a,1,type_triangle,mod(rank+1,2),tag,mod(rank+1,2), &
                            tag,MPI_COMM_WORLD,status,code)

  ! Freeing of the triangle datatype
  call MPI_TYPE_FREE(type_triangle,code)

  call MPI_FINALIZE(code)
end program triangle
110/236 6 Derived datatypes 6.7 Subarray Datatype Constructor

The MPI_TYPE_CREATE_SUBARRAY() subroutine creates a datatype describing a subarray of an array.

integer,intent(in) :: nb_dims
integer,dimension(nb_dims),intent(in) :: shape_array,shape_sub_array,coord_start
integer,intent(in) :: order,old_type
integer,intent(out) :: new_type,code
call MPI_TYPE_CREATE_SUBARRAY(nb_dims,shape_array,shape_sub_array,coord_start, &
                              order,old_type,new_type,code)
111/236 6 Derived datatypes 6.7 Subarray Datatype Constructor

Reminder of the vocabulary relative to arrays in Fortran 95:
The rank of an array is its number of dimensions.
The extent of an array is its number of elements in a dimension.
The shape of an array is the vector whose elements are the extents of the array in each dimension.
For example, for the array T(10,0:5,-10:10): its rank is 3; its extent is 10 in the first dimension, 6 in the second and 21 in the third; its shape is the vector (10,6,21).

nb_dims : rank of the array
shape_array : shape of the array from which a subarray will be extracted
shape_sub_array : shape of the subarray
coord_start : start coordinates of the subarray, assuming that the indices of the array start at 0. For example, if we want the subarray to start at array(2,3), we must set coord_start(:)=(/ 1,2 /)
order : storage order of the elements
  MPI_ORDER_FORTRAN for the ordering used by Fortran arrays (column-major order)
  MPI_ORDER_C for the ordering used by C arrays (row-major order)
112/236 6 Derived datatypes 6.7 Subarray Datatype Constructor

Figure 30: Exchanges between the two processes (arrays of processes 0 and 1, before and after the exchange)
113/236 6 Derived datatypes 6.7 Subarray Datatype Constructor

program subarray
  use mpi
  implicit none

  integer,parameter :: nb_lines=4,nb_columns=3,&
                       tag=1000,nb_dims=2
  integer :: code,rank,type_subarray,i
  integer,dimension(nb_lines,nb_columns) :: tab
  integer,dimension(nb_dims) :: shape_array,shape_subarray,coord_start
  integer,dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialization of the tab array on each process
  tab(:,:) = reshape( (/ (sign(i,-rank),i=1,nb_lines*nb_columns) /), &
                      (/ nb_lines,nb_columns /) )
114/236 6 Derived datatypes 6.7 Subarray Datatype Constructor

  ! Shape of the tab array from which a subarray will be extracted
  shape_array(:) = shape(tab)
  ! The F95 shape function gives the shape of the array passed as argument.
  ! ATTENTION: if the concerned array has not been allocated on all the processes,
  ! it is necessary to set the shape of the array explicitly so that it is
  ! known on all the processes: shape_array(:) = (/ nb_lines,nb_columns /)

  ! Shape of the subarray
  shape_subarray(:) = (/ 2,2 /)

  ! Start coordinates of the subarray
  ! For process 0, we start from the tab(2,1) element
  ! For process 1, we start from the tab(3,2) element
  coord_start(:) = (/ rank+1,rank /)

  ! Creation of the type_subarray derived datatype
  call MPI_TYPE_CREATE_SUBARRAY(nb_dims,shape_array,shape_subarray,coord_start,&
                                MPI_ORDER_FORTRAN,MPI_INTEGER,type_subarray,code)
  call MPI_TYPE_COMMIT(type_subarray,code)

  ! Exchange of the subarrays
  call MPI_SENDRECV_REPLACE(tab,1,type_subarray,mod(rank+1,2),tag,&
                            mod(rank+1,2),tag,MPI_COMM_WORLD,status,code)

  call MPI_TYPE_FREE(type_subarray,code)
  call MPI_FINALIZE(code)
end program subarray
115 115/236 6 Derived datatypes 6.8 Heterogeneous datatypes 6 Derived datatypes 6.8 Heterogeneous datatypes The MPI_TYPE_CREATE_STRUCT() subroutine is the constructor of the most general datatypes. It has the same functionalities as MPI_TYPE_INDEXED(), but it additionally allows the replication of blocks of different datatypes. The MPI_TYPE_CREATE_STRUCT() parameters are the same as those of MPI_TYPE_INDEXED(), with in addition : the old_types field is now a vector of MPI datatypes ; to take into account the heterogeneity of the data and their alignment in memory, the displacement between two elements is calculated as the difference of their addresses ; MPI provides, via MPI_GET_ADDRESS(), a portable subroutine which returns the address of a variable.
116 116/236 6 Derived datatypes 6.8 Heterogeneous datatypes nb=5, blocks_lengths=(3,1,5,1,1), displacements=(0,7,11,21,26), old_types=(type1,type2,type3,type1,type3) old_types type 1 type 2 type 3 new_type Figure 31: The MPI_TYPE_CREATE_STRUCT constructor integer,intent(in) :: nb integer,intent(in),dimension(nb) :: blocks_lengths integer(kind= MPI_ADDRESS_KIND),intent(in),dimension(nb) :: displacements integer,intent(in),dimension(nb) :: old_types integer,intent(out) :: new_type,code call MPI_TYPE_CREATE_STRUCT(nb,blocks_lengths,displacements,old_types,new_type,code) <type>,intent(in) :: variable integer(kind= MPI_ADDRESS_KIND),intent(out) :: address_variable integer,intent(out) :: code call MPI_GET_ADDRESS(variable,address_variable,code)
117 117/236 6 Derived datatypes 6.8 Heterogeneous datatypes 1 program Interaction_Particles 2 use mpi 3 implicit none 4 5 integer, parameter :: n=1000,tag=100 6 integer, dimension( MPI_STATUS_SIZE) :: status 7 integer :: rank,code,type_particle,i 8 integer, dimension(4) :: types,blocks_lengths 9 integer(kind= MPI_ADDRESS_KIND), dimension(4) :: displacements,addresses 10 11 type Particule 12 character(len=5) :: category 13 integer :: mass 14 real, dimension(3) :: coords 15 logical :: class 16 end type Particule 17 type(particule), dimension(n) :: p,temp_p 18 19 call MPI_INIT(code) 20 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) 21 22! Construction of the datatype 23 types = (/ MPI_CHARACTER, MPI_INTEGER, MPI_REAL, MPI_LOGICAL/) 24 blocks_lengths= (/5,1,3,1/)
118 118/236 6 Derived datatypes 6.8 Heterogeneous datatypes 25 call MPI_GET_ADDRESS(p(1)%category,addresses(1),code) 26 call MPI_GET_ADDRESS(p(1)%mass,addresses(2),code) 27 call MPI_GET_ADDRESS(p(1)%coords,addresses(3),code) 28 call MPI_GET_ADDRESS(p(1)%class,addresses(4),code) 29 30! Calculation of displacements relative to the start address 31 do i=1,4 32 displacements(i)=addresses(i) - addresses(1) 33 end do 34 call MPI_TYPE_CREATE_STRUCT(4,blocks_lengths,displacements,types,type_particle, & 35 code) 36! Validation of the structured datatype 37 call MPI_TYPE_COMMIT(type_particle,code) 38! Initialization of particles for each process 39 40! Sending of particles from 0 towards 1 41 if (rank == 0) then 42 call MPI_SEND(p(1)%category,n,type_particle,1,tag, MPI_COMM_WORLD,code) 43 else 44 call MPI_RECV(temp_p(1)%category,n,type_particle,0,tag, MPI_COMM_WORLD, & 45 status,code) 46 endif 47 48! Freeing of the datatype 49 call MPI_TYPE_FREE(type_particle,code) 50 call MPI_FINALIZE(code) 51 end program Interaction_Particles
119 119/236 6 Derived datatypes 6.9 Subroutines annexes 6 Derived datatypes 6.9 Subroutines annexes The total size of a datatype : MPI_TYPE_SIZE() integer, intent(in) :: type_data integer, intent(out) :: size, code call MPI_TYPE_SIZE(type_data,size,code) The extent as well as the lower bound of a derived datatype, taking into account possible memory alignments : MPI_TYPE_GET_EXTENT() integer, intent(in) :: type_derived integer(kind= MPI_ADDRESS_KIND),intent(out):: bound_inf_align,size_align integer, intent(out) :: code call MPI_TYPE_GET_EXTENT(type_derived,bound_inf_align,size_align,code) We can modify the lower bound of a derived datatype and its extent in order to create a new datatype adapted from the previous one integer, intent(in) :: old_type integer(kind= MPI_ADDRESS_KIND),intent(in) :: new_bound_inf,new_size integer, intent(out) :: new_type,code call MPI_TYPE_CREATE_RESIZED(old_type,new_bound_inf,new_size, new_type,code)
120 120/236 6 Derived datatypes 6.9 Subroutines annexes 1 PROGRAM half_line 2 USE mpi 3 IMPLICIT NONE 4 INTEGER,PARAMETER :: nb_lines=5,nb_columns=6,& 5 half_line=nb_columns/2,tag=1000 6 INTEGER,DIMENSION(nb_lines,nb_columns) :: A 7 INTEGER :: typehalfline,typehalfline2 8 INTEGER :: code,size_integer,rank,i 9 INTEGER(kind= MPI_ADDRESS_KIND) :: boundinf=0, sizedisplacement 10 INTEGER, DIMENSION( MPI_STATUS_SIZE) :: status 11 12 CALL MPI_INIT(code) 13 CALL MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) 14!Initialization of the A matrix on each process 15 A(:,:) = RESHAPE( (/ (SIGN(i,-rank),i=1,nb_lines*nb_columns) /), & 16 (/ nb_lines,nb_columns /) ) 17 18!Construction of the derived datatype typehalfline 19 CALL MPI_TYPE_VECTOR(half_line,1,nb_lines, MPI_INTEGER,typeHalfLine,code) 20!Know the size of the datatype MPI_INTEGER 21 CALL MPI_TYPE_SIZE( MPI_INTEGER, size_integer, code) 22!Construction of the derived datatype typehalfline2 23 sizedisplacement = size_integer 24 CALL MPI_TYPE_CREATE_RESIZED(typeHalfLine,boundInf,sizeDisplacement,& 25 typehalfline2,code)
121 121/236 6 Derived datatypes 6.9 Subroutines annexes 26!Validation of the datatype typehalfline2 27 CALL MPI_TYPE_COMMIT(typeHalfLine2,code) 28 29 IF (rank == 0) THEN 30!Sending of the A matrix to the process 1 with the derived datatype typehalfline2 31 CALL MPI_SEND(A(1,1), 2, typehalfline2, 1, tag, & 32 MPI_COMM_WORLD, code) 33 ELSE 34!Reception for the process 1 in the A matrix 35 CALL MPI_RECV(A(1,nb_columns-1), 6, MPI_INTEGER, 0, tag,& 36 MPI_COMM_WORLD,status, code) 37 38 PRINT *,"A matrix on the process 1" 39 DO i=1,nb_lines 40 PRINT *,A(i,:) 41 END DO 42 43 END IF 44 CALL MPI_FINALIZE(code) 45 END PROGRAM half_line > mpiexec -n 4 half_line A matrix on the process 1
122 122/236 6 Derived datatypes 6.10 Conclusion 6 Derived datatypes 6.10 Conclusion The MPI derived datatypes are powerful portable mechanisms to describe data. They allow, when they are combined with subroutines like MPI_SENDRECV(), to simplify the writing of interprocess exchanges. The combination of derived datatypes and topologies (described in one of the next chapters) makes MPI the ideal tool for all the problems of domain decomposition with regular or irregular meshes.
123 123/236 7 Optimizations 7.1 Introduction 7 Optimizations 7.1 Introduction The optimization of MPI communications must be a primary concern when their share of the total time becomes significant compared to that of the computations. Beyond the choice of an algorithm as efficient as possible, the optimization of communications can be accomplished at several levels, for example: by selecting the most suitable communication mode; by overlapping communications with computations.
124 124/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes
Mode               Blocking    Non-blocking
Standard send      MPI_Send    MPI_Isend
Synchronous send   MPI_Ssend   MPI_Issend
Buffered send      MPI_Bsend   MPI_Ibsend
Ready send         MPI_Rsend   MPI_Irsend
Receive            MPI_Recv    MPI_Irecv
125 125/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Key Terms It is important to understand the definitions of some MPI terms. Blocking call: a call is blocking if the memory space used for the communication can be reused immediately after the call returns. The data that have been or will be sent are the data that were in this space at the moment of the call. If it is a receive, the data must have already been received in this space (if the return code is MPI_SUCCESS). Non-blocking call: a non-blocking call returns very quickly, but it does not authorize the immediate reuse of the memory space used in the communication. It is necessary to make sure that the communication is fully completed (with MPI_Wait, for example) before using it again. Synchronous send: a synchronous send involves a synchronization between the involved processes. There can be no communication before the two processes are ready to communicate. A send cannot start until its matching receive has been posted. Buffered send: a buffered send implies copying the data into an intermediate memory space. There is then no coupling between the two processes of the communication. So the return of this type of send does not mean that the receive has occurred.
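These definitions can be summarized in a short sketch (buf, n, dest and tag are illustrative names, not from the slides; the point is the moment at which the buffer may be reused):

```fortran
! Blocking send: buf can be reused as soon as the call returns
call MPI_SEND(buf, n, MPI_REAL, dest, tag, MPI_COMM_WORLD, code)
buf(:) = 0.   ! safe

! Non-blocking send: buf must not be modified before completion
call MPI_ISEND(buf, n, MPI_REAL, dest, tag, MPI_COMM_WORLD, request, code)
! ... computations that do not touch buf ...
call MPI_WAIT(request, MPI_STATUS_IGNORE, code)
buf(:) = 0.   ! safe only after MPI_WAIT
```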
126 126/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Synchronous Sends A synchronous send is made by calling the MPI_Ssend or MPI_Issend subroutine. Rendezvous Protocol The rendezvous protocol is generally the protocol used for synchronous sends (implementation-dependent). The return receipt is optional. Figure: timeline of the rendezvous protocol between process 0 and process 1 (header, ready to receive, data, return receipt).
127 127/236 7 Optimizations 7.2 Point-to-Point Send Modes Advantages Uses fewer resources (no buffer) Faster if the receiver is ready (no copying into a buffer) Guarantee of receipt through the synchronization Disadvantages Waiting time if the receiver is not there/not ready Risks of deadlocks
128 128/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Buffered Sends Buffered Sends A buffered send is made by calling the MPI_Bsend or MPI_Ibsend subroutine. The buffers have to be managed manually (with calls to MPI_Buffer_attach and MPI_Buffer_detach). They have to be allocated by taking into account the header size of messages (by adding the constant MPI_BSEND_OVERHEAD for each message instance).
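A minimal sketch of this manual buffer management (using MPI_TYPE_SIZE to dimension the buffer is one possible approach; n, data, dest and tag are illustrative names):

```fortran
integer :: size_real, size_buffer, code
character, dimension(:), allocatable :: buffer

! Dimension the buffer for one message of n reals plus its header
call MPI_TYPE_SIZE(MPI_REAL, size_real, code)
size_buffer = n*size_real + MPI_BSEND_OVERHEAD
allocate(buffer(size_buffer))
call MPI_BUFFER_ATTACH(buffer, size_buffer, code)

call MPI_BSEND(data, n, MPI_REAL, dest, tag, MPI_COMM_WORLD, code)

! MPI_BUFFER_DETACH blocks until all buffered messages have been transmitted
call MPI_BUFFER_DETACH(buffer, size_buffer, code)
deallocate(buffer)
```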
129 129/236 7 Optimizations 7.2 Point-to-Point Send Modes Protocol with User Buffer on the Sender Side This approach is the one generally used for MPI_Bsend or MPI_Ibsend. In this approach, the buffer is on the sender side and is managed explicitly by the application. A buffer managed by MPI can exist on the receiver side. Many variants are possible. The return receipt is optional. Figure: timeline of the protocol (copying into the buffer on process 0, message, return receipt).
130 130/236 7 Optimizations 7.2 Point-to-Point Send Modes Eager Protocol The eager protocol is often used for standard sends of small-size messages. It can also be used for sends with MPI_Bsend with small messages (implementation-dependent), bypassing the user buffer on the sender side. In this approach, the buffer is on the receiver side. The return receipt is optional. Figure: timeline of the eager protocol (message, return receipt).
131 131/236 7 Optimizations 7.2 Point-to-Point Send Modes Advantages No need to wait for the receiver (copying into a buffer) No risks of deadlocks Disadvantages Use of more resources (memory used by buffers, with saturation risks) The send buffers used in the MPI_Bsend or MPI_Ibsend calls have to be managed manually (it is often hard to choose a suitable size) A little bit slower than synchronous sends if the receiver is ready No guarantee that the message was received (send-receive decoupling) Risk of wasted memory space if the buffers are oversized The application will crash if the buffer is too small There are often also hidden buffers managed by the MPI implementation on the sender side and/or on the receiver side (and using memory resources)
132 132/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Standard Sends Standard Sends A standard send is made by calling the MPI_Send or MPI_Isend subroutine. In most implementations, this mode switches from a buffered mode to a synchronous mode when the size of the messages grows. Advantages Often the most efficient (because the vendor chose the best parameters and algorithms) The most portable for performance Disadvantages Little control over the mode actually used (often accessible via environment variables) Risk of deadlock depending on the actual mode Behavior that can vary according to the architecture and the problem size
133 133/236 7 Optimizations 7.2 Point-to-Point Send Modes 7 Optimizations 7.2 Point-to-Point Send Modes Ready Sends Ready Sends A ready send is made by calling the MPI_Rsend or MPI_Irsend subroutine. Warning: these calls may only be made when the matching receive has already been posted. Their use is not recommended. Advantages Slightly more efficient than the synchronous mode because the synchronization protocol can be simplified Disadvantages Errors if the receiver is not ready during the send
134 134/236 7 Optimizations 7.2 Point-to-Point Send Modes Figure: balanced extra-node ping-pong bandwidth (MiB/s) as a function of message size (bytes), comparing the standard, synchronous and buffered modes on the Vargas and Babel machines.
135 135/236 7 Optimizations 7.3 Computation-Communication Overlap 7 Optimizations 7.3 Computation-Communication Overlap Presentation The overlap of communications by computations is a method which allows communication operations to be executed in the background while the program continues to run. It is thus possible, if the hardware and software architecture allows it, to hide all or part of the communication costs. The computation-communication overlap can be seen as an additional level of parallelism. This approach is used in MPI through the non-blocking subroutines (i.e. MPI_Isend, MPI_Irecv and MPI_Wait).
136 136/236 7 Optimizations 7.3 Computation-Communication Overlap Figure: timelines of a send with partial overlap (the process posts the request, computes, periodically tests whether the send has finished, and continues once it has).
137 137/236 7 Optimizations 7.3 Computation-Communication Overlap Advantages Possibility of hiding all or part of communications costs (if the architecture allows it) No risks of deadlock Disadvantages Greater additional costs (several calls for one single send or receive, management of requests) Higher complexity and more complicated maintenance Less efficient on some machines (for example with transfer starting only at the MPI_Wait call) Performance-loss risk on the computational kernels (for example differentiated management between the area near the border of a domain and the interior area resulting in less efficient use of memory caches) Limited to point-to-point communications (it is extended to collective communications in MPI 3.0)
138 138/236 7 Optimizations 7.3 Computation-Communication Overlap Use The message send is made in two steps: Initiate the send or the receive by a call to MPI_Isend or MPI_Irecv (or one of their variants) Wait for the end of the local contribution by a call to MPI_Wait (or one of its variants). The communications overlap with all the operations that occur between these two steps. Access to the data being received is not permitted before the end of the MPI_Wait (access to the data being sent is also not permitted for MPI implementations prior to version 2.2).
139 139/236 7 Optimizations 7.3 Computation-Communication Overlap 1 integer, dimension(2) :: req 2 do i=1,niter 3! Initiate communications 4 call MPI_Irecv(data_ext, sz, MPI_REAL,dest,tag,comm, & 5 req(1),ierr) 6 call MPI_Isend(data_bound,sz, MPI_REAL,dest,tag,comm, & 7 req(2),ierr) 8 9! Compute the interior domain (data_ext and data_bound 10! are unused) during communications 11 call compute_interior_domain(data_int) 12 13! Wait for the end of communications 14 call MPI_Waitall(2,req, MPI_STATUSES_IGNORE,ierr) 15 16! Compute the exterior domain 17 call compute_domaine_exterieur(data_int,data_bound,data_ext) 18 end do
140 140/236 7 Optimizations 7.3 Computation-Communication Overlap Overlap Level on Different Machines
Machine                         Level
Blue Gene/P DCMF_INTERRUPT=0     34%
Blue Gene/P DCMF_INTERRUPT=1    100%
Power6 InfiniBand                38%
NEC SX-8                         10%
CURIE                             0%
Measurements were done by overlapping a computational kernel with a communication kernel which have the same execution time, using different communication schemes (intra/extra-node, by pairs, random processes...). Depending on the communication scheme, the results can be totally different. An overlap of 0% means that the total execution time is 2x the time of a computational (or communication) kernel. An overlap of 100% means that the total execution time is 1x the time of a computational (or communication) kernel.
141 141/236 8 Communicators 8.1 Introduction 8 Communicators 8.1 Introduction The usage of communicators consists of partitioning a group of processes in order to create subgroups on which we can carry out operations such as collective or point-to-point communications. Each created subgroup has its own communication space. Figure 32: Communicator partitioning
142 142/236 8 Communicators 8.2 Default Communicator 8 Communicators 8.2 Default Communicator It is the story of the chicken and the egg... We can only create a communicator from another communicator. Luckily, this has been resolved by assuming that the chicken existed first. In fact, a communicator is provided by default, whose identifier MPI_COMM_WORLD is an integer value defined in the header files. This initial MPI_COMM_WORLD communicator exists for the whole execution period of the program: it is created by the call to the MPI_INIT() subroutine and can only be destroyed by calling MPI_FINALIZE(). By default, it therefore sets the scope of collective and point-to-point communications to all the processes of the application.
143 143/236 8 Communicators 8.3 Example 8 Communicators 8.3 Example In the following example, we will : put together on one hand the even-ranked processes and on the other hand the odd-ranked processes ; broadcast a collective message only to even-ranked processes and another only to odd-ranked processes. $ mpirun -np 8 CommPairImpair call MPI_INIT(...) call MPI_COMM_SPLIT(...) call MPI_BCAST(...) call MPI_COMM_FREE(...) Figure 33: Communicator creation/destruction
144 144/236 8 Communicators 8.3 Example What should be done so that the rank 2 process broadcasts this message only to the even-ranked process subgroup, for example? Looping on send/recv calls can be very detrimental, especially if the number of processes is high. Moreover, a test would be compulsory in the loop in order to know whether the rank of the process to which the rank 2 process must send the message is even or odd. The solution is to create a communicator containing these processes, so that the rank 2 process broadcasts the message only to them. Figure 34: A new communicator
145 145/236 8 Communicators 8.4 Groups and Communicators 8 Communicators 8.4 Groups and Communicators A communicator consists : 1 of a group, which is an ordered set of processes ; 2 of a communication context, created when the communicator construction subroutine is called, which defines the communication space. The communication contexts are managed by MPI (the programmer has no action on them : it is an opaque attribute). In practice, in order to build a communicator, there are two ways to do this : 1 through a group of processes ; 2 directly from another communicator.
146 146/236 8 Communicators 8.4 Groups and Communicators In the MPI library, many subroutines exist in order to build communicators : MPI_CART_CREATE(), MPI_CART_SUB(), MPI_COMM_CREATE(), MPI_COMM_DUP(), MPI_COMM_SPLIT(). The communicator constructors are collective calls. The communicators that the programmer creates can be freed by using the MPI_COMM_FREE() subroutine.
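For example, MPI_COMM_DUP() creates a copy of a communicator with the same group of processes but a new communication context; a typical (sketched) use is to isolate the communications of a library from those of the calling code:

```fortran
integer :: comm_lib, code

! Same processes as MPI_COMM_WORLD, but a distinct communication context:
! a message posted in comm_lib can never match a receive posted
! in MPI_COMM_WORLD, and conversely
call MPI_COMM_DUP(MPI_COMM_WORLD, comm_lib, code)

! ... communications internal to the library use comm_lib ...

call MPI_COMM_FREE(comm_lib, code)
```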
147 147/236 8 Communicators 8.5 Communicator Resulting from Another 8 Communicators 8.5 Communicator Resulting from Another The direct use of groups presents many disadvantages in this case, because it requires us to : give different names to the two communicators (for example comm_even and comm_odd) ; go through the groups to build these two communicators ; leave the ordering of the process ranks to MPI in these two communicators ; make conditional tests when calling the MPI_BCAST() subroutine : if (comm_even /= MPI_COMM_NULL) then...! Broadcast of the message only to even-ranked processes call MPI_BCAST(a,m, MPI_REAL,rank_in_even,comm_even,code) elseif (comm_odd /= MPI_COMM_NULL) then...! Broadcast of the message only to odd-ranked processes call MPI_BCAST(a,m, MPI_REAL,rank_in_odd,comm_odd,code) end if
148 148/236 8 Communicators 8.5 Communicator Resulting from Another The MPI_COMM_SPLIT() subroutine allows a given communicator to be partitioned into as many communicators as we want... integer, intent(in) :: comm, color, key integer, intent(out) :: new_comm, code call MPI_COMM_SPLIT(comm,color,key,new_comm,code) Figure 35: Construction of communicators with MPI_COMM_SPLIT() (table of the process, rank_world, color, key and rank_new_com values) A process that is assigned a color equal to the MPI_UNDEFINED value will belong only to its initial communicator.
149 149/236 8 Communicators 8.5 Communicator Resulting from Another Let's see how to proceed in order to build the communicator which will subdivide the communication space between odd-ranked and even-ranked processes via the MPI_COMM_SPLIT() constructor. Figure 36: Construction of the CommEvenOdd communicator with MPI_COMM_SPLIT() (table of the process, rank_world, color, key and rank_even_odd values) In practice, this is set up very simply...
150 150/236 8 Communicators 8.5 Communicator Resulting from Another 1 program EvenOdd 2 use mpi 3 implicit none 4 5 integer, parameter :: m=16 6 integer :: key,commevenodd 7 integer :: rank_in_world,code 8 real, dimension(m) :: a 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank_in_world,code) 12 13! Initialization of the A vector 14 a(:)=0. 15 if(rank_in_world == 2) a(:)=2. 16 if(rank_in_world == 5) a(:)=5. 17 18 key = rank_in_world 19 if (rank_in_world == 2 .OR. rank_in_world == 5) then 20 key=-1 21 end if 22 23! Creation of even and odd communicators by giving them the same name 24 call MPI_COMM_SPLIT( MPI_COMM_WORLD,mod(rank_in_world,2),key,CommEvenOdd,code) 25 26! Broadcast of the message by the rank 0 process of each communicator to the processes 27! of its group 28 call MPI_BCAST(a,m, MPI_REAL,0,CommEvenOdd,code) 29 30! Destruction of the communicators 31 call MPI_COMM_FREE(CommEvenOdd,code) 32 call MPI_FINALIZE(code) 33 end program EvenOdd
151 151/236 8 Communicators 8.6 Topologies 8 Communicators 8.6 Topologies In most applications, especially in domain decomposition methods where we match the calculation domain to the grid of processes, it is interesting to be able to arrange the processes according to a regular topology. MPI makes it possible to define cartesian or graph virtual topologies. Cartesian topologies : each process is defined in a grid ; the grid can be periodic or not ; the processes are identified by their coordinates in the grid. Graph topologies : generalization to more complex topologies.
152 152/236 8 Communicators 8.6 Topologies 8 Communicators 8.6 Topologies Cartesian topologies A cartesian topology is defined when a group of processes belonging to a given communicator comm_old calls the MPI_CART_CREATE() subroutine. integer, intent(in) :: comm_old, ndims integer, dimension(ndims),intent(in) :: dims logical, dimension(ndims),intent(in) :: periods logical, intent(in) :: reorganization integer, intent(out) :: comm_new, code call MPI_CART_CREATE(comm_old, ndims,dims,periods,reorganization,comm_new,code)
153 153/236 8 Communicators 8.6 Topologies Example on a grid having 4 domains along x and 2 along y, periodic in y. use mpi integer :: comm_2d, code integer, parameter :: ndims = 2 integer, dimension(ndims) :: dims logical, dimension(ndims) :: periods logical :: reorganization... dims(1) = 4 dims(2) = 2 periods(1) =.false. periods(2) =.true. reorganization =.false. call MPI_CART_CREATE( MPI_COMM_WORLD,ndims,dims,periods,reorganization,comm_2D,code) If reorganization =.false. then the rank of the processes in the new communicator (comm_2d) is the same as in the old communicator (MPI_COMM_WORLD). If reorganization =.true., the MPI implementation chooses the order of the processes.
154 154/236 8 Communicators 8.6 Topologies Figure 37: 2D cartesian topology, periodic in y
155 155/236 8 Communicators 8.6 Topologies Example on a 3D grid having 4 domains along x, 2 along y and 2 along z, non periodic. use mpi integer :: comm_3d,code integer, parameter :: ndims = 3 integer, dimension(ndims) :: dims logical, dimension(ndims) :: periods logical :: reorganization... dims(1) = 4 dims(2) = 2 dims(3) = 2 periods(:) =.false. reorganization =.false. call MPI_CART_CREATE( MPI_COMM_WORLD,ndims,dims,periods,reorganization,comm_3D,code)
156 156/236 8 Communicators 8.6 Topologies Figure 38: Non-periodic 3D cartesian topology (planes z = 0 and z = 1)
157 157/236 8 Communicators 8.6 Topologies The MPI_DIMS_CREATE() subroutine returns the number of processes in each dimension of the grid according to the total number of processes. integer, intent(in) :: nb_procs, ndims integer, dimension(ndims),intent(inout) :: dims integer, intent(out) :: code call MPI_DIMS_CREATE(nb_procs,ndims,dims,code) Remark : if the values of dims on entry are all 0, this means that we leave to MPI the choice of the number of processes in each direction according to their total number.
dims on entry    call MPI_DIMS_CREATE    dims on exit
(0,0)            (8,2,dims,code)         (4,2)
(0,0,0)          (16,3,dims,code)        (4,2,2)
(0,4,0)          (16,3,dims,code)        (2,4,2)
(0,3,0)          (16,3,dims,code)        error
158 158/236 8 Communicators 8.6 Topologies In a cartesian topology, the MPI_CART_RANK() subroutine returns the rank of the associated process to the coordinates in the grid. integer, intent(in) :: comm_new integer, dimension(ndims),intent(in) :: coords integer, intent(out) :: rank, code call MPI_CART_RANK(comm_new,coords,rank,code)
159 159/236 8 Communicators 8.6 Topologies Figure 39: 2D cartesian topology, periodic in y coords(1)=dims(1)-1 do i=0,dims(2)-1 coords(2) = i call MPI_CART_RANK(comm_2D,coords,rank(i),code) end do... i=0: on entry coords=(3,0), on exit rank(0)=6. i=1: on entry coords=(3,1), on exit rank(1)=7.
160 160/236 8 Communicators 8.6 Topologies In a cartesian topology, the MPI_CART_COORDS() subroutine returns the coordinates of a process of a given rank in the grid. integer, intent(in) :: comm_new, rank, ndims integer, dimension(ndims),intent(out) :: coords integer, intent(out) :: code call MPI_CART_COORDS(comm_new, rank, ndims, coords, code)
161 161/236 8 Communicators 8.6 Topologies Figure 40: 2D cartesian topology, periodic in y if (mod(rank,2) == 0) then call MPI_CART_COORDS(comm_2D,rank,2,coords,code) end if... On entry, the rank values are : 0,2,4,6. On exit, the coords values are : (0,0),(1,0),(2,0),(3,0)
162 162/236 8 Communicators 8.6 Topologies In a cartesian topology, a process that calls the MPI_CART_SHIFT() subroutine can get the rank of its neighboring processes in a given direction. integer, intent(in) :: comm_new, direction, step integer, intent(out) :: rank_previous,rank_next integer, intent(out) :: code call MPI_CART_SHIFT(comm_new, direction, step, rank_previous, rank_next, code) The direction parameter corresponds to the displacement axis (xyz). The step parameter corresponds to the displacement step.
163 163/236 8 Communicators 8.6 Topologies Figure 41: Call of the MPI_CART_SHIFT() subroutine (4x2 grid, direction 0 along x, direction 1 along y) call MPI_CART_SHIFT(comm_2D,0,1,rank_left,rank_right,code)... For the process 2, rank_left=0, rank_right=4 call MPI_CART_SHIFT(comm_2D,1,1,rank_low,rank_high,code)... For the process 2, rank_low=3, rank_high=3
164 164/236 8 Communicators 8.6 Topologies Figure 42: Call of the MPI_CART_SHIFT() subroutine (3D grid, direction 0 along x, direction 1 along y, direction 2 along z) call MPI_CART_SHIFT(comm_3D,0,1,rank_left,rank_right,code)... For the process 0, rank_left=-1, rank_right=4 call MPI_CART_SHIFT(comm_3D,1,1,rank_low,rank_high,code)... For the process 0, rank_low=-1, rank_high=2 call MPI_CART_SHIFT(comm_3D,2,1,rank_ahead,rank_before,code)... For the process 0, rank_ahead=-1, rank_before=1
165 165/236 8 Communicators 8.6 Topologies Program Example : 1 program decomposition 2 use mpi 3 implicit none 4 5 integer :: rank_in_topo,nb_procs 6 integer :: code,comm_2d 7 integer, dimension(4) :: neighbor 8 integer, parameter :: N=1,E=2,S=3,W=4 9 integer, parameter :: ndims = 2 10 integer, dimension (ndims) :: dims,coords 11 logical, dimension (ndims) :: periods 12 logical :: reorganization 13 14 call MPI_INIT(code) 15 16 call MPI_COMM_SIZE( MPI_COMM_WORLD,nb_procs,code) 17 18! Know the number of processes along x and y 19 dims(:) = 0 20 call MPI_DIMS_CREATE(nb_procs,ndims,dims,code)
166 166/236 8 Communicators 8.6 Topologies 22! 2D y-periodic grid creation 23 periods(1) =.false. 24 periods(2) =.true. 25 reorganization =.false. 26 27 call MPI_CART_CREATE( MPI_COMM_WORLD,ndims,dims,periods,reorganization,comm_2D,code) 28 29! Know my coordinates in the topology 30 call MPI_COMM_RANK(comm_2D,rank_in_topo,code) 31 call MPI_CART_COORDS(comm_2D,rank_in_topo,ndims,coords,code) 32 33! Initialization of the neighbor array to the MPI_PROC_NULL value 34 neighbor(:) = MPI_PROC_NULL 35 36! Search of my West and East neighbors 37 call MPI_CART_SHIFT(comm_2D,0,1,neighbor(W),neighbor(E),code) 38 39! Search of my South and North neighbors 40 call MPI_CART_SHIFT(comm_2D,1,1,neighbor(S),neighbor(N),code) 41 42 call MPI_FINALIZE(code) 43 end program decomposition
167 167/236 8 Communicators 8.6 Topologies 8 Communicators 8.6 Topologies Subdividing a cartesian topology The goal is, for example, to degenerate a 2D or 3D cartesian topology into, respectively, a 1D or 2D cartesian topology. For MPI, degenerating a 2D (resp. 3D) cartesian topology creates as many communicators as there are rows or columns (resp. planes) in the initial cartesian grid. The major advantage is to be able to carry out collective operations limited to a subgroup of processes belonging to : the same row (or column), if the initial topology is 2D ; the same plane, if the initial topology is 3D.
168 168/236 8 Communicators 8.6 Topologies Figure 43: Two examples of data distribution in a degenerated 2D topology
169 169/236 8 Communicators 8.6 Topologies There are two ways to degenerate a topology : by using the MPI_COMM_SPLIT() general subroutine ; by using the MPI_CART_SUB() subroutine designed for this purpose. logical,intent(in),dimension(ndim) :: remain_dims integer,intent(in) :: CommCart integer,intent(out) :: CommCartD, code call MPI_CART_SUB(CommCart,remain_dims,CommCartD,code) Figure 44: Initial representation of the V array in the 2D grid and final representation after its distribution on the degenerated 2D grid
170 170/236 8 Communicators 8.6 Topologies 1 program CommCartSub 2 use mpi 3 implicit none 4 5 integer :: Comm2D,Comm1D,rank,code 6 integer,parameter :: NDim2D=2 7 integer,dimension(ndim2d) :: Dim2D,Coord2D 8 logical,dimension(ndim2d) :: Period,remain_dims 9 logical :: Reorder 10 integer,parameter :: m=4 11 real, dimension(m) :: V=0. 12 real :: W=0.
171 171/236 8 Communicators 8.6 Topologies 13 call MPI_INIT(code) 14 15! Creation of the initial 2D grid 16 Dim2D(1) = 4 17 Dim2D(2) = 3 18 Period(:) =.false. 19 ReOrder =.false. 20 call MPI_CART_CREATE( MPI_COMM_WORLD,NDim2D,Dim2D,Period,ReOrder,Comm2D,code) 21 call MPI_COMM_RANK(Comm2D,rank,code) 22 call MPI_CART_COORDS(Comm2D,rank,NDim2D,Coord2D,code) 23 24! Initialization of the V vector 25 if (Coord2D(1) == 1) V(:)=real(rank) 26 27! Every row of the grid must be a 1D cartesian topology 28 remain_dims(1) =.true. 29 remain_dims(2) =.false. 30! Subdivision of the 2D cartesian grid 31 call MPI_CART_SUB(Comm2D,remain_dims,Comm1D,code) 32 33! The processes of the column 2 distribute the V vector to the processes of their row 34 call MPI_SCATTER(V,1, MPI_REAL,W,1, MPI_REAL,1,Comm1D,code) 35 36 print '("Rank : ",I2," ; Coordinates : (",I1,",",I1,") ; W = ",F2.0)', & 37 rank,Coord2D(1),Coord2D(2),W 38 39 call MPI_FINALIZE(code) 40 end program CommCartSub
172 172/236 8 Communicators 8.6 Topologies > mpiexec -n 12 CommCartSub Rank : 0 ; Coordinates : (0,0) ; W = 3. Rank : 1 ; Coordinates : (0,1) ; W = 4. Rank : 3 ; Coordinates : (1,0) ; W = 3. Rank : 8 ; Coordinates : (2,2) ; W = 5. Rank : 4 ; Coordinates : (1,1) ; W = 4. Rank : 5 ; Coordinates : (1,2) ; W = 5. Rank : 6 ; Coordinates : (2,0) ; W = 3. Rank : 10 ; Coordinates : (3,1) ; W = 4. Rank : 11 ; Coordinates : (3,2) ; W = 5. Rank : 9 ; Coordinates : (3,0) ; W = 3. Rank : 2 ; Coordinates : (0,2) ; W = 5. Rank : 7 ; Coordinates : (2,1) ; W = 4.
173 173/236 9 MPI-IO 9.1 Introduction 9 MPI-IO 9.1 Introduction Presentation Quite logically, applications that perform large computations also handle large amounts of data, and therefore generate a significant number of I/O operations. Their efficient treatment can thus strongly affect the overall performance of applications.
174 174/236 9 MPI-IO 9.1 Introduction Optimizing the I/O of parallel codes combines : their parallelization, in order to avoid the bottleneck caused by their serialization ; techniques implemented explicitly at the programming level (nonblocking reads/writes) ; specific operations supported by the operating system (grouping of requests, I/O buffer management, etc.). The goals of MPI-IO, via the high-level interface it proposes, are to provide simplicity, expressiveness and flexibility, while allowing efficient implementations that take into account the software and hardware specificities of the I/O devices of the target machines. MPI-IO provides an interface modeled on the one used for message passing. The data accessed by each process are described with (basic or derived) MPI datatypes. As for the notions of nonblocking and collective operations, they are managed similarly to what MPI proposes for messages. MPI-IO allows both sequential and random accesses.
175 175/236 9 MPI-IO 9.1 Introduction 9 MPI-IO 9.1 Introduction Issues It is a high-level interface where, as we said, some optimization techniques are accessible to the users, but where many basic optimizations can also be implemented transparently. Two important examples are : the case of multiple accesses, by a single process, to small discontiguous blocks : this can be treated by a data sieving mechanism, which groups a set of requests, reads one big contiguous block from the disk into a buffer memory area, then copies the relevant subsets of data into the user memory areas ;
176 176/236 9 MPI-IO 9.1 Introduction Requests on small non-contiguous blocks of a file Reading of a big contiguous block and transfer into a buffer memory area Memory copies of the required elements into the variables of the program Figure 45: Data sieving mechanism in the case of multiple accesses, by one single process, to small discontiguous blocks
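The data sieving mechanism of Figure 45 can be sketched in a language-neutral way. This is a minimal Python illustration (not MPI code): the function name, the request list and the file content are ours, chosen only to show the idea of replacing many small reads by one covering read.

```python
# Minimal sketch of data sieving: instead of issuing one small read per
# requested block, read a single contiguous span covering all requests,
# then copy only the wanted pieces out of the buffer.

def data_sieving_read(file_bytes, requests):
    """file_bytes: the whole file as bytes; requests: list of (offset, length)."""
    lo = min(off for off, _ in requests)
    hi = max(off + length for off, length in requests)
    buffer = file_bytes[lo:hi]          # one big contiguous read from "disk"
    # memory copies of the required elements into the user's variables
    return [bytes(buffer[off - lo:off - lo + length]) for off, length in requests]

file_content = bytes(range(32))         # a small made-up "file"
# three small discontiguous requests (offset, length)
pieces = data_sieving_read(file_content, [(2, 2), (10, 3), (20, 1)])
```

A real implementation additionally bounds the buffer size and may issue several covering reads; this sketch only shows the grouping idea.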
177 177/236 9 MPI-IO 9.1 Introduction the case of accesses, by a group of processes, to discontiguous blocks (case of distributed arrays, for example) : this can be treated by collective I/O, by dividing the operations into two phases.
Process domain 0 Process domain 1 Process domain 2
Request proc. 0 Request proc. 1 Request proc. 2
File Read Read Read
Memory buffer process 0 Memory buffer process 1 Memory buffer process 2
Communications
Variable of the process 0 Variable of the process 1 Variable of the process 2
Figure 46: Read in two phases, by a group of processes
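The two-phase read of Figure 46 can be sketched as follows. This is a plain Python illustration (not MPI code) with a made-up round-robin data distribution: phase 1 does large contiguous reads, phase 2 redistributes the elements so that each process ends up with its own, noncontiguous, data.

```python
# Two-phase collective read, simulated for NPROC "processes":
# phase 1 reads big contiguous chunks, phase 2 redistributes them.

NPROC = 3
file_data = list(range(12))            # the "file", 12 elements
owner = lambda i: i % NPROC            # made-up distribution: round-robin

# Phase 1: each process reads one big contiguous chunk into its buffer
chunk = len(file_data) // NPROC
buffers = {p: file_data[p * chunk:(p + 1) * chunk] for p in range(NPROC)}

# Phase 2: communications move each element to the process that owns it
variables = {p: [] for p in range(NPROC)}
for p in range(NPROC):
    for j, value in enumerate(buffers[p]):
        variables[owner(p * chunk + j)].append(value)
```

Each process thus touches the disk only once, with a large request, and the fine-grained redistribution is done in memory, which is much cheaper than many small disk accesses.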
178 178/236 9 MPI-IO 9.1 Introduction 9 MPI-IO 9.1 Introduction Definitions file : an MPI file is an ordered group of typed data. A file is opened collectively by all the processes of a communicator. All the later collective I/O operations are made in this context. displacement : an absolute address, in bytes, relative to the beginning of the file, from which a view starts. etype : the data unit used to compute positions and to access data. It can be any MPI datatype, predefined or created as a derived datatype. filetype : a mask that constitutes the basis of the partitioning of a file between processes (if the file is seen as a one-dimensional tiling, the filetype is the elementary tile used for the tiling). It is either an etype, or an MPI derived datatype constructed as a repetition of occurrences of such an etype (the holes, i.e. the non-accessed parts of the file, must also span a multiple of the etype used).
179 179/236 9 MPI-IO 9.1 Introduction view : it defines the group of visible (and therefore accessible) data of a file, once this file is opened. It is an ordered group of elementary datatypes. offset : the position in the file, expressed as a number of etypes, relative to the current view (the holes defined in the view are not taken into account when computing offsets). file handle : an opaque object created at the opening and destroyed at the closing of a file. All the operations on a file are made by specifying its file handle as a reference. file pointers : they are automatically updated by MPI and determine offsets inside the file. There are two types : the individual file pointers, which are specific to each process that has opened the file ; the shared file pointer, which is common to all the processes that have opened the file. file size : the size of an MPI file is measured in bytes.
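To illustrate how these definitions fit together, the following Python sketch (not MPI code) computes the absolute byte position in the file corresponding to a given offset. The view is described by a displacement, an etype size, and a filetype reduced to its extent in etypes plus the list of its accessible etype slots; all these names and values are ours, for illustration only.

```python
# Byte position of the n-th accessible etype of a view.
# view = initial displacement + repetition of a filetype; the holes of the
# filetype are skipped when counting offsets (offsets are in etypes).

def byte_position(offset, displacement, etype_size, extent, slots):
    """offset: in etypes, relative to the view (holes not counted);
    extent: filetype length in etypes; slots: accessible etype indices."""
    tile, rank_in_tile = divmod(offset, len(slots))
    return displacement + (tile * extent + slots[rank_in_tile]) * etype_size

# Illustration: filetype of extent 3 etypes with a hole at slot 0,
# 4-byte etype, view starting 8 bytes into the file
positions = [byte_position(n, 8, 4, 3, [1, 2]) for n in range(4)]
```

For this made-up view, successive offsets 0, 1, 2, 3 land at bytes 12, 16, 24, 28: the hole at the start of each tile is simply never addressed.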
180 180/236 9 MPI-IO 9.2 File management 9 MPI-IO 9.2 File management The file management tasks are collective operations made by all the processes of the indicated communicator. We only describe here the principal subroutines (opening, closing), but others are available (deletion, etc.). The attributes (describing the access rights, the opening mode, the possible destruction at closing, etc.) are specified by summing predefined constants. All the processes of the communicator within which a file is open will participate in the later collective data access operations. The opening of a file returns a file handle, which will then be used in all the operations relative to this file. The information available via the MPI_FILE_SET_INFO() subroutine varies from one implementation to another.
181 181/236 9 MPI-IO 9.2 File management
Table 4: Attributes that can be set when opening a file
Attribute                  Meaning
MPI_MODE_RDONLY            read only
MPI_MODE_RDWR              reading and writing
MPI_MODE_WRONLY            write only
MPI_MODE_CREATE            create the file if it does not exist
MPI_MODE_EXCL              error if the file already exists
MPI_MODE_UNIQUE_OPEN       error if the file is already open by another application
MPI_MODE_SEQUENTIAL        sequential access
MPI_MODE_APPEND            pointers at the end of the file (append mode)
MPI_MODE_DELETE_ON_CLOSE   delete the file after the closing
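A side note on the "sum of predefined constants": the attributes are distinct bits, so summing them (as in the Fortran examples below) is equivalent to a bitwise OR, as long as each flag appears at most once. A Python sketch with made-up flag values (the actual values of the MPI_MODE_* constants are implementation-dependent):

```python
# Opening attributes are distinct bits, so summing them (Fortran style)
# is equivalent to a bitwise OR (C style) when each flag is used once.

MPI_MODE_RDWR = 8          # illustration values only: each flag is one bit
MPI_MODE_CREATE = 1

amode = MPI_MODE_RDWR + MPI_MODE_CREATE   # same as MPI_MODE_RDWR | MPI_MODE_CREATE

def has_mode(amode, flag):
    """Check whether an opening mode contains a given attribute."""
    return amode & flag == flag
```

Summing the same flag twice would however corrupt the mode word, which is why each attribute must appear only once in the sum.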
182 182/236 9 MPI-IO 9.2 File management
1 program open01
2
3 use mpi
4 implicit none
5
6 integer :: fh,code
7
8 call MPI_INIT(code)
9
10 call MPI_FILE_OPEN( MPI_COMM_WORLD,"file.data", &
11 MPI_MODE_RDWR + MPI_MODE_CREATE, MPI_INFO_NULL,fh,code)
12
13 call MPI_FILE_CLOSE(fh,code)
14 call MPI_FINALIZE(code)
15
16 end program open01
> ls -l file.data
-rw name grp 0 Feb 08 12:13 file.data
183 183/236 9 MPI-IO 9.3 Reads/Writes : general concepts 9 MPI-IO 9.3 Reads/Writes : general concepts The data transfers between files and the memory areas of processes are made via explicit calls to read and write subroutines. We distinguish three aspects of file access : the positioning, which can be explicit (by specifying, for example, the desired number of bytes from the beginning of the file) or implicit, via pointers managed by the system (these pointers can be of two types : either individual to each process, or shared by all the processes) ; the synchronism : accesses can be blocking or nonblocking ; the coordination : accesses can be collective (that is to say, made by all the processes of the communicator within which the file is opened) or local to one or several processes. Many variants are available : we will describe some of them.
184 184/236 9 MPI-IO 9.3 Reads/Writes : general concepts
Table 5: Summary of possible access types

Positioning: explicit offsets
Synchronism    Individual coordination    Collective coordination
blocking       MPI_FILE_READ_AT           MPI_FILE_READ_AT_ALL
               MPI_FILE_WRITE_AT          MPI_FILE_WRITE_AT_ALL
nonblocking    MPI_FILE_IREAD_AT          MPI_FILE_READ_AT_ALL_BEGIN
                                          MPI_FILE_READ_AT_ALL_END
               MPI_FILE_IWRITE_AT         MPI_FILE_WRITE_AT_ALL_BEGIN
                                          MPI_FILE_WRITE_AT_ALL_END
see next page
185 185/236 9 MPI-IO 9.3 Reads/Writes : general concepts

Positioning: individual file pointers
Synchronism    Individual coordination    Collective coordination
blocking       MPI_FILE_READ              MPI_FILE_READ_ALL
               MPI_FILE_WRITE             MPI_FILE_WRITE_ALL
nonblocking    MPI_FILE_IREAD             MPI_FILE_READ_ALL_BEGIN
                                          MPI_FILE_READ_ALL_END
               MPI_FILE_IWRITE            MPI_FILE_WRITE_ALL_BEGIN
                                          MPI_FILE_WRITE_ALL_END

Positioning: shared file pointers
Synchronism    Individual coordination    Collective coordination
blocking       MPI_FILE_READ_SHARED       MPI_FILE_READ_ORDERED
               MPI_FILE_WRITE_SHARED      MPI_FILE_WRITE_ORDERED
nonblocking    MPI_FILE_IREAD_SHARED      MPI_FILE_READ_ORDERED_BEGIN
                                          MPI_FILE_READ_ORDERED_END
               MPI_FILE_IWRITE_SHARED     MPI_FILE_WRITE_ORDERED_BEGIN
                                          MPI_FILE_WRITE_ORDERED_END
186 186/236 9 MPI-IO 9.3 Reads/Writes : general concepts It is possible to mix the access types performed on the same file within an application. The accessed memory areas are described by three quantities : the initial address of the area concerned ; the number of elements ; the datatype, which must correspond to a sequence of contiguous copies of the etype of the current view.
187 187/236 9 MPI-IO 9.4 Individual Reads/Writes 9 MPI-IO 9.4 Individual Reads/Writes Via explicit offsets The offset is expressed as a number of occurrences of a datatype, which must be a multiple of the etype of the current view. The file must not have been opened with the MPI_MODE_SEQUENTIAL attribute.
188 188/236 9 MPI-IO 9.4 Individual Reads/Writes 1 program write_at 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=10 6 integer :: i,rank,fh,code,nb_bytes_integer 7 integer(kind= MPI_OFFSET_KIND) :: offset 8 integer, dimension(nb_values) :: values 9 integer, dimension( MPI_STATUS_SIZE) :: status call MPI_INIT(code) 12 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) values(:)= (/(i+rank*100,i=1,nb_values)/) 15 print *, "Write process",rank, ":",values(:) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_WRONLY + MPI_MODE_CREATE, & 18 MPI_INFO_NULL,fh,code) call MPI_TYPE_SIZE( MPI_INTEGER,nb_bytes_integer,code) offset=rank*nb_values*nb_bytes_integer 23 call MPI_FILE_WRITE_AT(fh,offset,values,nb_values, MPI_INTEGER, & 24 status,code) call MPI_FILE_CLOSE(fh,code) 27 call MPI_FINALIZE(code) 28 end program write_at
189 189/236 9 MPI-IO 9.4 Individual Reads/Writes
Figure 47: MPI_FILE_WRITE_AT()
> mpiexec -n 2 write_at
Write process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Write process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110
190 190/236 9 MPI-IO 9.4 Individual Reads/Writes 1 program read_at 2 3 use mpi 4 implicit none 5 6 integer, parameter :: nb_values=10 7 integer :: rank,fh,code,nb_bytes_integer 8 integer(kind= MPI_OFFSET_KIND) :: offset 9 integer, dimension(nb_values) :: values 10 integer, dimension( MPI_STATUS_SIZE) :: status call MPI_INIT(code) 13 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 16 fh,code) call MPI_TYPE_SIZE( MPI_INTEGER,nb_bytes_integer,code) offset=rank*nb_values*nb_bytes_integer 21 call MPI_FILE_READ_AT(fh,offset,values,nb_values, MPI_INTEGER, & 22 status,code) 23 print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 26 call MPI_FINALIZE(code) end program read_at
191 191/236 9 MPI-IO 9.4 Individual Reads/Writes
Figure 48: MPI_FILE_READ_AT()
> mpiexec -n 2 read_at
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110
192 192/236 9 MPI-IO 9.4 Individual Reads/Writes 9 MPI-IO 9.4 Individual Reads/Writes Via Individual File Pointers In this case, an individual pointer is managed by the system, per file and per process. For a given process, two successive accesses to the same file therefore automatically reach consecutive elements of it. In all these subroutines, the shared pointer is never accessed or modified. After each access, the pointer is positioned on the next elementary datatype. The file must not have been opened with the MPI_MODE_SEQUENTIAL attribute.
193 193/236 9 MPI-IO 9.4 Individual Reads/Writes
1 program read01
2
3 use mpi
4 implicit none
5
6 integer, parameter :: nb_values=10
7 integer :: rank,fh,code
8 integer, dimension(nb_values) :: values
9 integer, dimension( MPI_STATUS_SIZE) :: status
10
11 call MPI_INIT(code)
12 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code)
13
14 call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &
15 fh,code)
16
17 call MPI_FILE_READ(fh,values,6, MPI_INTEGER,status,code)
18 call MPI_FILE_READ(fh,values(7),4, MPI_INTEGER,status,code)
19
20 print *, "Read process",rank,":",values(:)
21
22 call MPI_FILE_CLOSE(fh,code)
23 call MPI_FINALIZE(code)
24
25 end program read01
194 194/236 9 MPI-IO 9.4 Individual Reads/Writes
Figure 49: Example 1 of MPI_FILE_READ()
> mpiexec -n 2 read01
Read process 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
195 195/236 9 MPI-IO 9.4 Individual Reads/Writes 1 program read02 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=10 6 integer :: rank,fh,code 7 integer, dimension(nb_values) :: values=0 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 14 fh,code) if (rank == 0) then 17 call MPI_FILE_READ(fh,values,5, MPI_INTEGER,status,code) 18 else 19 call MPI_FILE_READ(fh,values,8, MPI_INTEGER,status,code) 20 call MPI_FILE_READ(fh,values,5, MPI_INTEGER,status,code) 21 end if print *, "Read process",rank,":",values(1:8) call MPI_FILE_CLOSE(fh,code) 26 call MPI_FINALIZE(code) 27 end program read02
196 196/236 9 MPI-IO 9.4 Individual Reads/Writes
Figure 50: Example 2 of MPI_FILE_READ()
> mpiexec -n 2 read02
Read process 0 : 1, 2, 3, 4, 5, 0, 0, 0
Read process 1 : 9, 10, 101, 102, 103, 6, 7, 8
197 197/236 9 MPI-IO 9.4 Individual Reads/Writes 9 MPI-IO 9.4 Individual Reads/Writes Shared File Pointers There is one and only one shared pointer per file, common to all the processes of the communicator in which the file is opened. All the processes that make an I/O operation using the shared pointer must employ the same view of the file in order to do this. If the noncollective variants of the subroutines are used, the read order is not deterministic. If the treatment must be deterministic, the collective variants have to be used. After each access, the pointer is positioned on the next etype. In all of these subroutines, the individual pointers are never accessed or modified.
198 198/236 9 MPI-IO 9.4 Individual Reads/Writes
1 program read_shared01
2
3 use mpi
4 implicit none
5
6 integer :: rank,fh,code
7 integer, parameter :: nb_values=10
8 integer, dimension(nb_values) :: values
9 integer, dimension( MPI_STATUS_SIZE) :: status
10
11 call MPI_INIT(code)
12 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code)
13
14 call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &
15 fh,code)
16
17 call MPI_FILE_READ_SHARED(fh,values,4, MPI_INTEGER,status,code)
18 call MPI_FILE_READ_SHARED(fh,values(5),6, MPI_INTEGER,status,code)
19
20 print *, "Read process",rank,":",values(:)
21
22 call MPI_FILE_CLOSE(fh,code)
23 call MPI_FINALIZE(code)
24
25 end program read_shared01
199 199/236 9 MPI-IO 9.4 Individual Reads/Writes
Figure 51: Example of MPI_FILE_READ_SHARED()
> mpiexec -n 2 read_shared01
Read process 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 0 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110
200 200/236 9 MPI-IO 9.5 Collective Reads/Writes 9 MPI-IO 9.5 Collective Reads/Writes All the processes of the communicator within which a file is open participate in the collective data access operations. Collective operations generally perform better than individual ones, because they allow more optimization techniques to be implemented automatically (such as the two-phase accesses presented in the introduction). In collective operations, the accesses are executed in the order of the process ranks. The treatment is therefore deterministic in this case.
201 201/236 9 MPI-IO 9.5 Collective Reads/Writes 9 MPI-IO 9.5 Collective Reads/Writes Via explicit offsets 1 program read_at_all 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=10 6 integer :: rank,fh,code,nb_bytes_integer 7 integer(kind= MPI_OFFSET_KIND) :: offset_file 8 integer, dimension(nb_values) :: values 9 integer, dimension( MPI_STATUS_SIZE) :: status call MPI_INIT(code) 12 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 15 fh,code) call MPI_TYPE_SIZE( MPI_INTEGER,nb_bytes_integer,code) 18 offset_file=rank*nb_values*nb_bytes_integer 19 call MPI_FILE_READ_AT_ALL(fh,offset_file,values,nb_values, & 20 MPI_INTEGER,status,code) 21 print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 24 call MPI_FINALIZE(code) 25 end program read_at_all
202 202/236 9 MPI-IO 9.5 Collective Reads/Writes
Figure 52: Example of MPI_FILE_READ_AT_ALL()
> mpiexec -n 2 read_at_all
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110
203 203/236 9 MPI-IO 9.5 Collective Reads/Writes 9 MPI-IO 9.5 Collective Reads/Writes Via individual file pointers 1 program read_all01 2 use mpi 3 implicit none 4 5 integer :: rank,fh,code 6 integer, parameter :: nb_values=10 7 integer, dimension(nb_values) :: values 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 14 fh,code) call MPI_FILE_READ_ALL(fh,values,4, MPI_INTEGER,status,code) 17 call MPI_FILE_READ_ALL(fh,values(5),6, MPI_INTEGER,status,code) print *, "Read process ",rank, ":",values(:) call MPI_FILE_CLOSE(fh,code) 22 call MPI_FINALIZE(code) 23 end program read_all01
204 204/236 9 MPI-IO 9.5 Collective Reads/Writes
Figure 53: Example 1 of MPI_FILE_READ_ALL()
> mpiexec -n 2 read_all01
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
205 205/236 9 MPI-IO 9.5 Collective Reads/Writes 1 program read_all02 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=10 6 integer :: rank,fh,index1,index2,code 7 integer, dimension(nb_values) :: values=0 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) 12 call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 13 fh,code) if (rank == 0) then 16 index1=3 17 index2=6 18 else 19 index1=5 20 index2=9 21 end if call MPI_FILE_READ_ALL(fh,values(index1),index2-index1+1, & 24 MPI_INTEGER,status,code) 25 print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 28 call MPI_FINALIZE(code) 29 end program read_all02
206 206/236 9 MPI-IO 9.5 Collective Reads/Writes
Figure 54: Example 2 of MPI_FILE_READ_ALL()
> mpiexec -n 2 read_all02
Read process 1 : 0, 0, 0, 0, 1, 2, 3, 4, 5, 0
Read process 0 : 0, 0, 1, 2, 3, 4, 0, 0, 0, 0
207 207/236 9 MPI-IO 9.5 Collective Reads/Writes 1 program read_all03 2 use mpi 3 implicit none 4 5 integer, parameter :: nb_values=10 6 integer :: rank,fh,code 7 integer, dimension(nb_values) :: values=0 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 14 fh,code) if (rank == 0) then 17 call MPI_FILE_READ_ALL(fh,values(3),4, MPI_INTEGER,status,code) 18 else 19 call MPI_FILE_READ_ALL(fh,values(5),5, MPI_INTEGER,status,code) 20 end if print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 25 call MPI_FINALIZE(code) 26 end program read_all03
208 208/236 9 MPI-IO 9.5 Collective Reads/Writes
Figure 55: Example 3 of MPI_FILE_READ_ALL()
> mpiexec -n 2 read_all03
Read process 1 : 0, 0, 0, 0, 1, 2, 3, 4, 5, 0
Read process 0 : 0, 0, 1, 2, 3, 4, 0, 0, 0, 0
209 209/236 9 MPI-IO 9.5 Collective Reads/Writes 9 MPI-IO 9.5 Collective Reads/Writes Via shared file pointers 1 program read_ordered 2 use mpi 3 implicit none 4 5 integer :: rank,fh,code 6 integer, parameter :: nb_values=10 7 integer, dimension(nb_values) :: values 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 14 fh,code) call MPI_FILE_READ_ORDERED(fh,values,4, MPI_INTEGER,status,code) 17 call MPI_FILE_READ_ORDERED(fh,values(5),6, MPI_INTEGER,status,code) print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 22 call MPI_FINALIZE(code) 23 end program read_ordered
210 210/236 9 MPI-IO 9.5 Collective Reads/Writes
Figure 56: Example of MPI_FILE_READ_ORDERED()
> mpiexec -n 2 read_ordered
Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110
Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104
211 211/236 9 MPI-IO 9.6 Explicit positioning of pointers in a file 9 MPI-IO 9.6 Explicit positioning of pointers in a file The MPI_FILE_GET_POSITION() and MPI_FILE_GET_POSITION_SHARED() subroutines return respectively the current value of the individual pointer and the value of the shared pointer. It is possible to explicitly position the individual pointers with the MPI_FILE_SEEK() subroutine, and the shared pointer with the MPI_FILE_SEEK_SHARED() subroutine. There are three possible modes to set the offset of a pointer : MPI_SEEK_SET sets an absolute offset ; MPI_SEEK_CUR sets an offset relative to the current position ; MPI_SEEK_END positions the pointer at the end of the file, to which a possible displacement is added. With MPI_SEEK_CUR, a negative offset can be specified, which allows moving backwards in the file.
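The semantics of the three modes can be summarized by a small Python sketch (not MPI code; the function and constant values are ours, with offsets counted in etypes):

```python
# New value of a file pointer according to the positioning mode.
# A negative offset with SEEK_CUR moves the pointer backwards.

SEEK_SET, SEEK_CUR, SEEK_END = range(3)    # illustration values only

def seek(current, end_of_file, offset, mode):
    """Return the new pointer position (offsets in etypes)."""
    if mode == SEEK_SET:
        return offset                      # absolute position
    if mode == SEEK_CUR:
        return current + offset            # relative to the current position
    return end_of_file + offset            # relative to the end of the file
```

For instance, after reading 3 elements, a relative seek of 8 skips to position 11, and a relative seek of -2 from position 4 returns to position 2, as in the Fortran example that follows.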
212 212/236 9 MPI-IO 9.6 Explicit positioning of pointers in a file 1 program seek 2 use mpi 3 implicit none 4 integer, parameter :: nb_values=10 5 integer :: rank,fh,nb_bytes_integer,code 6 integer(kind= MPI_OFFSET_KIND) :: offset 7 integer, dimension(nb_values) :: values 8 integer, dimension( MPI_STATUS_SIZE) :: status 9 10 call MPI_INIT(code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) 12 call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 13 fh,code) call MPI_FILE_READ(fh,values,3, MPI_INTEGER,status,code) 16 call MPI_TYPE_SIZE( MPI_INTEGER,nb_bytes_integer,code) 17 offset=8*nb_bytes_integer 18 call MPI_FILE_SEEK(fh,offset, MPI_SEEK_CUR,code) 19 call MPI_FILE_READ(fh,values(4),3, MPI_INTEGER,status,code) 20 offset=4*nb_bytes_integer 21 call MPI_FILE_SEEK(fh,offset, MPI_SEEK_SET,code) 22 call MPI_FILE_READ(fh,values(7),4, MPI_INTEGER,status,code) print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 27 call MPI_FINALIZE(code) 28 end program seek
213 213/236 9 MPI-IO 9.6 Explicit positioning of pointers in a file
Figure 57: Example of MPI_FILE_SEEK()
> mpiexec -n 2 seek
Read process 1 : 1, 2, 3, 102, 103, 104, 5, 6, 7, 8
Read process 0 : 1, 2, 3, 102, 103, 104, 5, 6, 7, 8
214 214/236 9 MPI-IO 9.7 Definition of views 9 MPI-IO 9.7 Definition of views Views are a flexible and powerful mechanism for describing the areas accessed in files. They are constructed with MPI derived datatypes. Each process has its own view (or its own views) of a file, defined by three components : a displacement, an etype and a filetype. A view is defined as a repetition of the filetype, once the initial positioning is made. It is possible to define holes in a view, by leaving out some parts of the data. Different processes may well have different views of the file, in order to access complementary parts of it. A given process can define and use several different views of the same file. A shared pointer can be used with a view only if all the processes have the same view.
215 215/236 9 MPI-IO 9.7 Definition of views If the file is open for writing, the areas described by the etypes and the filetypes cannot overlap, even partially. The default view consists of a simple sequence of bytes (zero initial displacement, etype and filetype equal to MPI_BYTE).
Figure 58: etype and filetype (initial displacement, holes and accessible data of a file)
216 216/236 9 MPI-IO 9.7 Definition of views
Figure 59: Example of definition of different filetypes according to the processes (one filetype per process, same etype and same initial displacement)
217 217/236 9 MPI-IO 9.7 Definition of views
initial_disp 2 integers, etype MPI_INTEGER, filetype with holes
Figure 60: Filetype used in the example 1 of MPI_FILE_SET_VIEW()
1 program read_view01
2
3 use mpi
4 implicit none
5
6 integer, parameter :: nb_values=10
7 integer :: rank,fh,filetype_temp1,filetype_temp2
8 integer :: filetype,code,size_integer
9 integer(kind= MPI_OFFSET_KIND) :: initial_displacement
10 integer(kind= MPI_ADDRESS_KIND), dimension(2) :: displacements
11 integer, dimension(nb_values) :: values
12 integer, dimension( MPI_STATUS_SIZE) :: status
218 218/236 9 MPI-IO 9.7 Definition of views 13 call MPI_INIT(code) 14 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 17 fh,code) call MPI_TYPE_CREATE_SUBARRAY(1,(/3/),(/2/),(/1/),MPI_ORDER_FORTRAN, & 20 MPI_INTEGER, filetype_temp1, code) 21 call MPI_TYPE_CREATE_SUBARRAY(1,(/2/),(/1/),(/1/),MPI_ORDER_FORTRAN, & 22 MPI_INTEGER, filetype_temp2, code) 23 call MPI_TYPE_SIZE( MPI_INTEGER,size_integer,code) displacements(1) = 0 26 displacements(2) = 3*size_integer 27 call MPI_TYPE_CREATE_STRUCT(2,(/1,1/),displacements,& 28 (/filetype_temp1,filetype_temp2/),filetype, code) 29 call MPI_TYPE_COMMIT (filetype,code) 30 31! Do not omit to go through an intermediate variable of kind 32! MPI_OFFSET_KIND, for portability reasons 33 initial_displacement=0 34 call MPI_FILE_SET_VIEW(fh,initial_displacement, MPI_INTEGER,filetype, & 35 "native", MPI_INFO_NULL,code) call MPI_FILE_READ(fh,values,7, MPI_INTEGER,status,code) 38 call MPI_FILE_READ(fh,values(8),3, MPI_INTEGER,status,code) print *,"Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 43 call MPI_FINALIZE(code) end program read_view01
219 219/236 9 MPI-IO 9.7 Definition of views
Figure 61: Example 1 of MPI_FILE_SET_VIEW()
> mpiexec -n 2 read_view01
Read process 1 : 2, 3, 5, 7, 8, 10, 102, 103, 105, 107
Read process 0 : 2, 3, 5, 7, 8, 10, 102, 103, 105, 107
220 220/236 9 MPI-IO 9.7 Definition of views
initial_disp 0, etype MPI_INTEGER, filetype proc. 0, filetype proc. 1
Figure 62: Filetype used in example 2 of MPI_FILE_SET_VIEW()
1 program read_view02
2
3 use mpi
4 implicit none
5
6 integer, parameter :: nb_values=10
7 integer :: rank,fh,coord,filetype,code
8 integer(kind= MPI_OFFSET_KIND) :: initial_displacement
9 integer, dimension(nb_values) :: values
10 integer, dimension( MPI_STATUS_SIZE) :: status
221 221/236 9 MPI-IO 9.7 Definition of views 11 call MPI_INIT(code) 12 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code) call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, & 15 fh,code) if (rank == 0) then 18 coord=1 19 else 20 coord=3 21 end if call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/2/),(/coord - 1/), & 24 MPI_ORDER_FORTRAN, MPI_INTEGER,filetype,code) 25 call MPI_TYPE_COMMIT(filetype,code) initial_displacement=0 28 call MPI_FILE_SET_VIEW(fh,initial_displacement, MPI_INTEGER,filetype, & 29 "native", MPI_INFO_NULL,code) call MPI_FILE_READ(fh,values,nb_values, MPI_INTEGER,status,code) print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 36 call MPI_FINALIZE(code) end program read_view02
222 222/236 9 MPI-IO 9.7 Definition of views
Figure 63: Example 2 of MPI_FILE_SET_VIEW()
> mpiexec -n 2 read_view02
Read process 1 : 3, 4, 7, 8, 101, 102, 105, 106, 109, 110
Read process 0 : 1, 2, 5, 6, 9, 10, 103, 104, 107, 108
223 223/236 9 MPI-IO 9.7 Definition of views
(a) filetype_1 : initial_disp 0, etype MPI_INTEGER ; (b) filetype_2 : initial_disp 0, etype MPI_INTEGER ; (c) filetype_3 : initial_disp 2 integers, etype MPI_INTEGER
Figure 64: Filetypes used in the example 3 of MPI_FILE_SET_VIEW()
1 program read_view03
2
3 use mpi
4 implicit none
5
6 integer, parameter :: nb_values=10
7 integer :: rank,fh,code, &
8 filetype_1,filetype_2,filetype_3,nb_bytes_integer
9 integer(kind= MPI_OFFSET_KIND) :: initial_displacement
10 integer, dimension(nb_values) :: values
11 integer, dimension( MPI_STATUS_SIZE) :: status
12
13 call MPI_INIT(code)
14 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code)
15 call MPI_FILE_OPEN( MPI_COMM_WORLD,"data.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &
16 fh,code)
224 224/236 9 MPI-IO 9.7 Definition of views 17 call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/2/),(/0/), & 18 MPI_ORDER_FORTRAN, MPI_INTEGER,filetype_1,code) 19 call MPI_TYPE_COMMIT(filetype_1,code) call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/2/),(/2/), & 22 MPI_ORDER_FORTRAN, MPI_INTEGER,filetype_2,code) 23 call MPI_TYPE_COMMIT(filetype_2,code) call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/1/),(/3/), & 26 MPI_ORDER_FORTRAN, MPI_INTEGER,filetype_3,code) 27 call MPI_TYPE_COMMIT(filetype_3,code) initial_displacement=0 30 call MPI_FILE_SET_VIEW(fh,initial_displacement, MPI_INTEGER,filetype_1, & 31 "native", MPI_INFO_NULL,code) 32 call MPI_FILE_READ(fh,values,4, MPI_INTEGER,status,code) call MPI_FILE_SET_VIEW(fh,initial_displacement, MPI_INTEGER,filetype_2, & 35 "native", MPI_INFO_NULL,code) 36 call MPI_FILE_READ(fh,values(5),3, MPI_INTEGER,status,code) call MPI_TYPE_SIZE( MPI_INTEGER,nb_bytes_integer,code) 39 initial_displacement=2*nb_bytes_integer 40 call MPI_FILE_SET_VIEW(fh,initial_displacement, MPI_INTEGER,filetype_3, & 41 "native", MPI_INFO_NULL,code) 42 call MPI_FILE_READ(fh,values(8),3, MPI_INTEGER,status,code) print *, "Read process",rank,":",values(:) call MPI_FILE_CLOSE(fh,code) 47 call MPI_FINALIZE(code) 48 end program read_view03
225 225/236 9 MPI-IO 9.7 Definition of views
Figure 65: Example 3 of MPI_FILE_SET_VIEW()
> mpiexec -n 2 read_view03
Read process 1 : 1, 2, 5, 6, 3, 4, 7, 6, 10, 104
Read process 0 : 1, 2, 5, 6, 3, 4, 7, 6, 10, 104
226 226/236 9 MPI-IO 9.8 NonBlocking Reads/Writes 9 MPI-IO 9.8 NonBlocking Reads/Writes Nonblocking I/O operations are implemented on the model of nonblocking communications. A nonblocking access must later be completed by an explicit completion test or a wait (via MPI_TEST(), MPI_WAIT(), etc.), similarly to the management of nonblocking messages. The advantage is to overlap computations with I/O.
227 227/236 9 MPI-IO 9.8 NonBlocking Reads/Writes 9 MPI-IO 9.8 NonBlocking Reads/Writes Via explicit displacements 1 program iread_at 2 3 use mpi 4 implicit none 5 6 integer, parameter :: nb_values=10 7 integer :: i,nb_iterations=0,rank,nb_bytes_integer, & 8 fh,request,code 9 integer(kind= MPI_OFFSET_KIND) :: offset 10 integer, dimension(nb_values) :: values 11 integer, dimension( MPI_STATUS_SIZE) :: status 12 logical :: finish call MPI_INIT(code) 15 call MPI_COMM_RANK( MPI_COMM_WORLD,rank,code)
  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
                     fh,code)

  call MPI_TYPE_SIZE(MPI_INTEGER,nb_bytes_integer,code)

  offset=rank*nb_values*nb_bytes_integer
  call MPI_FILE_IREAD_AT(fh,offset,values,nb_values, &
                         MPI_INTEGER,request,code)

  do while (nb_iterations < 5000)
    nb_iterations=nb_iterations+1
    ! Computations overlapping the time consumed by the read operation
    call MPI_TEST(request,finish,status,code)
    if (finish) exit
  end do
  print *,"After",nb_iterations,"iterations, read process",rank,":",values

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)

end program iread_at
Figure 66: Example of MPI_FILE_IREAD_AT()

> mpiexec -n 2 iread_at
After 1 iterations, read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
After 1 iterations, read process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110
Via individual file pointers

program iread
  use mpi
  implicit none

  integer, parameter                  :: nb_values=10
  integer                             :: rank,fh,request,code
  integer, dimension(nb_values)       :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
                     fh,code)

  call MPI_FILE_IREAD(fh,values,nb_values,MPI_INTEGER,request,code)
  ! Computation overlapping the time consumed by the read operation

  call MPI_WAIT(request,status,code)
  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program iread
Figure 67: Example of MPI_FILE_IREAD()

> mpiexec -n 2 iread
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
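Both processes read the same values here because each individual file pointer starts at the beginning of the file. As an illustrative sketch (the program name and structure are assumptions, not one of the course's numbered examples), each rank can be made to read its own block by first moving its individual pointer with MPI_FILE_SEEK():

```fortran
program iread_seek
  use mpi
  implicit none

  integer, parameter                  :: nb_values=10
  integer                             :: rank,fh,request,code,nb_bytes_integer
  integer(kind=MPI_OFFSET_KIND)       :: offset
  integer, dimension(nb_values)       :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
                     fh,code)

  ! Move this process's individual pointer to its own block of nb_values integers
  call MPI_TYPE_SIZE(MPI_INTEGER,nb_bytes_integer,code)
  offset=rank*nb_values*nb_bytes_integer
  call MPI_FILE_SEEK(fh,offset,MPI_SEEK_SET,code)

  ! The nonblocking read now starts at the per-rank position
  call MPI_FILE_IREAD(fh,values,nb_values,MPI_INTEGER,request,code)
  ! ... computation overlapping the read ...
  call MPI_WAIT(request,status,code)
  print *,"Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program iread_seek
```

With this sketch, on two processes, rank 0 would read the first ten integers of the file and rank 1 the next ten, as in the MPI_FILE_IREAD_AT() example.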
It is possible to carry out operations that are both collective and nonblocking, via a particular form of collective operation that requires calls to two distinct subroutines: one to start the operation and one to complete it. The memory area concerned must not be modified between the two phases of the operation. It is, however, possible during this time to perform noncollective operations on the file. Only one such operation can be in progress at a time per process.
program read_ordered_begin_end

  use mpi
  implicit none

  integer                             :: rank,fh,code
  integer, parameter                  :: nb_values=10
  integer, dimension(nb_values)       :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
                     fh,code)

  call MPI_FILE_READ_ORDERED_BEGIN(fh,values,4,MPI_INTEGER,code)
  print *, "Process number :",rank
  call MPI_FILE_READ_ORDERED_END(fh,values,status,code)

  print *, "Read process",rank,":",values(1:4)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)

end program read_ordered_begin_end
Figure 68: Example of MPI_FILE_READ_ORDERED_BEGIN()

> mpiexec -n 2 read_ordered_begin_end
Process number : 0
Read process 0 : 1, 2, 3, 4
Process number : 1
Read process 1 : 5, 6, 7, 8
9 MPI-IO 9.9 Advice

As we have seen, MPI-IO provides a very rich set of functionalities through a high-level interface. While remaining portable, this interface hides complex operations from users and transparently implements optimizations specific to the target machines. Some choices are clearly advisable:

- when the operations involve all the processes, or a group of processes that can be gathered in a dedicated communicator, the collective form of the operations should generally be preferred;
- the subroutines for explicit positioning in files should be employed only in particular cases; the implicit use of individual or shared pointers provides a higher-level interface;
- exactly as for message handling, when I/O represents an important part of the application, nonblocking operations are a preferred optimization for programmers to implement, but only after the correct behavior of the application has been verified in blocking mode.
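To illustrate the first piece of advice, switching from an individual read to its collective counterpart only changes the subroutine name, e.g. MPI_FILE_READ_ALL() instead of MPI_FILE_READ(). A minimal sketch (the program name is an assumption; it is not one of the course's numbered examples):

```fortran
program read_all_sketch
  use mpi
  implicit none

  integer, parameter                  :: nb_values=10
  integer                             :: rank,fh,code
  integer, dimension(nb_values)       :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
                     fh,code)

  ! Collective variant: every process of the communicator attached to fh
  ! must make this call, which lets the MPI library merge the requests
  ! into fewer, larger accesses to the file system
  call MPI_FILE_READ_ALL(fh,values,nb_values,MPI_INTEGER,status,code)
  print *,"Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program read_all_sketch
```

The collective form gives the implementation the global picture of the access pattern, which is what enables optimizations such as two-phase I/O.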
10 Conclusion

- Use blocking point-to-point communications first, before moving to nonblocking communications; it will then be worth trying to overlap computations with communications.
- Use the blocking I/O subroutines first, before moving to nonblocking I/O; similarly, then try to overlap computations with I/O.
- Write the communications as if the sends were synchronous (MPI_SSEND()).
- Avoid synchronization barriers (MPI_BARRIER()), especially around blocking collective functions.
- Hybrid MPI/OpenMP programming can bring gains in scalability; for this approach to work well, it is obviously necessary to have good OpenMP performance inside each MPI process. A course on this is given at IDRIS.
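The computation/communication overlap recommended above can be sketched as follows (an illustrative pattern on a ring of processes; the program name and buffer sizes are assumptions, not part of the course text): post the nonblocking operations, compute on data that does not depend on them, then wait before using the received data.

```fortran
program overlap_sketch
  use mpi
  implicit none

  integer, parameter                    :: n=1000
  integer                               :: rank,nb_procs,previous,next,code
  integer, dimension(2)                 :: requests
  integer, dimension(MPI_STATUS_SIZE,2) :: statuses
  real, dimension(n)                    :: to_send,received

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nb_procs,code)

  ! Neighbors on a ring
  previous=mod(rank-1+nb_procs,nb_procs)
  next    =mod(rank+1,nb_procs)
  to_send =real(rank)

  ! Start the communications...
  call MPI_IRECV(received,n,MPI_REAL,previous,0,MPI_COMM_WORLD,requests(1),code)
  call MPI_ISEND(to_send,n,MPI_REAL,next,0,MPI_COMM_WORLD,requests(2),code)

  ! ... do computations that touch neither to_send nor received ...

  ! ... then wait for completion before reusing the buffers
  call MPI_WAITALL(2,requests,statuses,code)

  call MPI_FINALIZE(code)
end program overlap_sketch
```

The same pattern applies to I/O by replacing MPI_ISEND()/MPI_IRECV() with MPI_FILE_IWRITE()/MPI_FILE_IREAD() and MPI_WAITALL() with MPI_WAIT().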