Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer

Transcription

1 Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer Steffen Christgau, Bettina Schnor Potsdam University Institute of Computer Science Operating Systems and Distributed Systems 21. May 2012

2 Outline The Single-Chip Cloud Computer (SCC) RCKMPI MPI on the SCC Topology-Awareness for RCKMPI The concept Topology-Awareness for RCKMPI Evaluation Summary and Future Work Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 2 of 19

7 1. The Single-Chip Cloud Computer (SCC) Many-Core Architecture Research Community (MARC) established by Intel research of universities together with Intel world-wide community (dominated by European institutions) Website, regular symposia (every 6 months, 5 up to now) Our group is MARC member Focus: application scalability Experiences with parallel ASP, climate simulation Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 3 of 19

8 Single-Chip Cloud Computer L2 Cache 256 KB Router Core1 MPB 16 KB MC 1 MC 3 L2 Cache 256 KB Core0 MC 0 MC 2 VRC SIF 24 tiles, 48 P54C cores, connected via Network-on-Chip, no Cache-Coherence fast 16 KB tile SRAM on each tile Message Passing Buffer (MPB) Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 4 of 19

9 2. RCKMPI: MPI on the SCC fork of MPICH2 Process Management Interface Sock Application Message Passing Interface MPICH2 ROMIO Abstract Device Interface (ADI3) Channel-3 Device BG Cray Nemesis SCCMPB SCCSHM SCCMulti (MPI IO Implementation) ADIO MPI Implementation MPI Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 5 of 19

10 2. RCKMPI: MPI on the SCC fork of MPICH2 Process Management Interface Sock Application Message Passing Interface MPICH2 ROMIO Abstract Device Interface (ADI3) Channel-3 Device BG Cray Nemesis SCCMPB SCCSHM SCCMulti (MPI IO Implementation) ADIO MPI Implementation MPI Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 5 of 19

11 SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS) = remote write, local read MPI_COMM_WORLD MPI Rank 0 MPI Rank n Core 0 Core n rank 1 rank rank n-1 rank n MPB of core/rank 0 Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 6 of 19

12 SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS) = remote write, local read MPI_COMM_WORLD MPI Rank 0 MPI Rank n Core 0 Core n rank 1 rank rank n-1 rank n MPB of core/rank 0 Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 6 of 19

13 Comparison of different CH3-devices at maximum Manhattan distance bandwidth / MByte/s RCKMPI sccmulti CH device RCKMPI sccmpb CH device RCKMPI sccshm CH device Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi message size / Byte Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 7 of 19

14 Bandwidths for Manhattan distance 0, 5 and 8 (two processes started) Core 00 and 01 Core 00 and 10 Core 00 and bandwidth / MByte/s Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi message size / Byte Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 8 of 19

15 Bandwidths for maximum Manhattan distance 8, and varied number of MPI processes MPI processes 12 MPI processes 24 MPI processes 48 MPI processes 50 bandwidth / MByte/s Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi message size / Byte Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 9 of 19

16 Remember: SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (total 384 KB) MPI_COMM_WORLD MPI Rank 0 MPI Rank n Core 0 Core n rank 1 rank rank n-1 rank n MPB of core/rank 0 The MPB is equally devided in n sections = depending on the number of started MPI processes. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 10 of 19

17 3. Topology Awareness for RCKMPI: The Concept The bandwidth between 2 RCKMPI processes depends on: the number of started MPI processes since a fully-connected network between all MPI processes is managed. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 11 of 19

18 Application Behaviour Tv process 1 qv Tu process 2 qu uu process n vv uv Goal: The bandwidth between communicating processes, so-called neighbors in the Task Interaction Graph should be increased. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 12 of 19

19 Application Behaviour Tv process 1 qv Tu process 2 qu uu process n vv uv Goal: The bandwidth between communicating processes, so-called neighbors in the Task Interaction Graph should be increased. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 12 of 19

20 Requirements: 1 An improved MPB layout must consider both: communication neighbors and group communication (barriers, broadcasts, gather/scatter,...) 2 Each MPI process has to know its new offset within all remote MPBs. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 13 of 19

21 Putting things together... message payload channel header original layout proc. 1 proc. 2 proc. n-1 proc. n write section new layout without topology information p 1 p 2 p n proc. 1 proc. 2 proc. n-1 proc. n channel headers payload area with topology information p 1 p 2 p n p1 proc. 2 pn Internal barrier for recalculation phase of new MPB addresses. Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 14 of 19

22 MPI offers API to specify virtual process topology 1 #d e f i n e NUM_DIMS i n t grid_dims [NUM_DIMS] ; 4 i n t grid_periods [NUM_DIMS] ; 5 MPI_Comm comm_topo ; 6 7 /* f o r a grid, s e t a l l items o f grid_periods to 0 */ 8 f o r ( i n t i = 0 ; i < NUM_DIMS; i++) 9 grid_periods [ i ] = 0 ; 0 1 MPI_Dims_create ( numprocs, NUM_DIMS, grid_dims ) ; 2 MPI_Cart_create (MPI_COMM_WORLD, NUM_DIMS, grid_dims, 3 grid_periods, true, &comm_topo ) ; Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 15 of 19

23 MPI offers API to specify virtual process topology 1 #d e f i n e NUM_DIMS i n t grid_dims [NUM_DIMS] ; 4 i n t grid_periods [NUM_DIMS] ; 5 MPI_Comm comm_topo ; 6 7 /* f o r a grid, s e t a l l items o f grid_periods to 0 */ 8 f o r ( i n t i = 0 ; i < NUM_DIMS; i++) 9 grid_periods [ i ] = 0 ; 0 1 MPI_Dims_create ( numprocs, NUM_DIMS, grid_dims ) ; 2 MPI_Cart_create (MPI_COMM_WORLD, NUM_DIMS, grid_dims, 3 grid_periods, true, &comm_topo ) ; Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 15 of 19

24 4. Topology Awareness for RCKMPI: Evaluation bandwidth / MByte/s Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi message size / Byte enhanced RCKMPI with 1D topology (48 procs, 2 Cache lines) enhanced RCKMPI with 1D topology (48 procs, 3 Cache lines) enhanced RCKMPI without topology (48 procs) Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 16 of 19

25 Results for 2D CFD application with ring topology: process 1 process 2 process n Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 17 of 19

26 speedup enhanced RCKMPI with topology information, 2 CL original RCKMPI number of processes Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 18 of 19

27 Summary and Future work SCC is equipped with a fast NoC. Message Passing Buffer on tile is beneficial for fast communication. Bandwidth performance gain for MPI applications by using MPI s virtual process topologies and rearranging the MPB layout. Current/Future Work: Comparison with I. C. Ureña, and M. Gerndt: Improved RCKMPI s SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012 Fixed the One-Sided Communication in RCKMPI = support of applications based on Global Arrays Bettina Schnor (Potsdam University) MPI Topology Awareness on the SCC Frame 19 of 19