GPI Global Address Space Programming Interface

Size: px

Start display at page:

Download "GPI Global Address Space Programming Interface"

Derrick Calvin Joseph
10 years ago
Views:

1 GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1

GPI Global address space programming interface PGAS Partitioned global address space read/write global data, one-sided, non-blocking, asynchron, zero-copy

2 GPI Global address space programming interface PGAS Partitioned global address space read/write global data, one-sided, non-blocking, asynchron, zero-copy send/recv messages, passive mode atomic counters barrier, allreduce queues ranks fault tolerance daemon concept environment checks auto detect and auto select network 2

send/recv messages, passive mode atomic counters barrier, allreduce queues ranks

3 GPI Example code: Alltoall in C #include <GPI.h> extern void dump (const char *, const int *); int main (int argc, char *argv[]) { startgpi (argc, argv, "", (1UL << 30)); // 1 GiB DMA enabled memory per node const int iproc = getrankgpi (); const int nproc = getnodecountgpi (); int *mem = (int *) getdmamemptrgpi (); // begin of DMA enabled memory int *src = mem; // offset 0 int *dst = mem + nproc; // offset nproc * sizeof(int) for (int p = 0; p < nproc; ++p) { src[p] = iproc * nproc + p; const unsigned long locoff = p * sizeof (int); const unsigned long remoff = (nproc + iproc) * sizeof (int); writedmagpi (locoff, remoff, sizeof (int), p, GPIQueue0); } waitdmagpi (GPIQueue0); barriergpi (); dump ( src, src); dump ( dst, dst); shutdowngpi (); } 3

getrankgpi (); const int nproc = getnodecountgpi (); int *mem = (int *) getdmamemptrgpi (); // begin of DMA enabled memory int *src = mem; // offset 0 int *dst = mem + nproc; // offset

4 GPI Example code: Alltoall in C #include <GPI.h> extern void dump (const char *, const int *); int main (int argc, char *argv[]) { startgpi (argc, argv, "", (1UL << 30)); // 1 GiB DMA enabled memory per node const int iproc = getrankgpi (); const int nproc = getnodecountgpi (); int *mem = (int *) getdmamemptrgpi (); // begin of DMA enabled memory int *src = mem; // offset 0 int *dst = mem + nproc; // offset nproc * sizeof(int) for (int p = 0; p < nproc; ++p) { src[p] = iproc * nproc + p; const unsigned long locoff = p * sizeof (int); const unsigned long remoff = (nproc + iproc) * sizeof (int); writedmagpi (locoff, remoff, sizeof (int), p, GPIQueue0); } waitdmagpi (GPIQueue0); barriergpi (); dump ( src, src); dump ( dst, dst); shutdowngpi (); } 4 > gcc alltoall.c -I $GPI_HOME/include -L $GPI_HOME/lib64 -lgpi -libverbs15 > cp a.out $GPI_HOME/bin > getnode -n 4 > $GPI_HOME/bin/a.out [...collected and sorted output...] src 0: dst 0: src 1: dst 1: src 2: dst 2: src 3: dst 3:

const int nproc = getnodecountgpi (); int *mem = (int *) getdmamemptrgpi (); // begin of DMA enabled memory int *src = mem; // offset 0 int *dst = mem + nproc; // offset nproc * sizeof(int) for (int

5 GPI Microbenchmark 5 Xeon 5460, ConnectX, mvapich2 1.2p1

6 GPI Permute 5 secs for reordering: From latency bound to compute bound again 6 2x2 woodcrest, MHGS18-DDR

7 GPI 2D-FFT: Perfect overlap 7

8 GPI BQCD: 30% better than MPI 8

9 GPI instead of MPI, MCTP instead of OpenMP Easy to use: Single programming model Fast: Wirespeed, zero-copy Scalable: Perfect overlap, no cycles for communication Robust: Fault tolerance Extends to heterogeneous systems: Read/write to accelerators like GPUs MCTP: Full hardware awareness (NUMA-Barrier) Coming next: Collectives offload, relaxed collectives, sub-collectives Coming next: auto thread-awareness for wait [x] done Coming next: Queue-pair auto-recovery Coming next: reliable HW multicast 9

like GPUs MCTP: Full hardware awareness (NUMA-Barrier) Coming next: Collectives offload, relaxed collectives, sub-collectives

10 GPIspace Architecture overview Coordination Aggregator Virtual Memory 10

11 GPIspace Future programming platform separate coordination and programming (David Gelernter) Programming: algorithms in modules, deal with data Tools: C++, Fortran,... Coordination: load/save data, schedule algorithm to data, take care of system state, fault tolerance, (everything automatic) Tools: (graphical) (functional) (domain specific) programming language based on modified Petri nets (typesafe, hierarchical, builtin expressions) Automatic parallelism, fault tolerance, throughput optimization Graphical programming, explorative development 11

.. Coordination: load/save data, schedule algorithm to data, take care of system state, fault tolerance, (everything automatic) Tools:

12 backup 12

13 MCTP and GPI onmousepressed: for (tid = 1; tid < mctpgetnumberofactivatedcores(); ++tid) mctpsuspend(tid); // do visualization onmousereleased: for (tid = 1; tid < mctpgetnumberofactivatedcores(); ++tid) mctpresume(tid); // continue with workflow 13

do visualization onmousereleased: for (tid = 1; tid <

14 GPI Microbenchmark 14 2x2 woodcrest, MHGS18-DDR

15 MCTP many core thread package platform independent (Linux/Windows 32/64bit) fast, very thin abstraction layer threadpool fabric + synchronization/communication between pools full thread control (init, start, stop, wait, exit) full system control wrt. HW/Core mappings (logical and physical, SSE level, NUMA layouts, cache affinity,...) thread affinity, get/set mask atomic operations fetchadd, cmpswap fast lock/unlock active/passive synchronization, mixable, NUMA-Barrier aligned thread safe alloc/free, special pages for global variables high resolution timer 15

HW/Core mappings (logical and physical, SSE level, NUMA layouts, cache affinity,.

16 MCTP barrier synchronization 16

17 MCTP critical section latency Tile based rendering (fps) MCTP 1.52: TBB3: (-5.7%) OMP_parfor: (-5.5%) 17

GASPI A PGAS API for Scalable and Fault Tolerant Computing

GASPI A PGAS API for Scalable and Fault Tolerant Computing Specification of a general purpose API for one-sided and asynchronous communication and provision of libraries, tools, examples and best practices