Ecole Doctorale Mathématiques, Sciences et Technologies de l'information, Informatique



1 Data optimization for linear algebra and compact stencils on the Kalray MPPA-56 many-core processor
Student:
Team: DRAKKAR
Thesis supervisor: Bernard TOURANCHEAU
Co-supervisor: Christian OBRECHT


2 Outline
- Context
- Applications
- Work done this year
- Conclusion and perspectives

3 Context
Exascale (8) by : Power =< MW, hence >= 5 Gflops/W
Processors compared: Intel Xeon E5-687W, Intel Xeon Phi co-processor SP, Tesla KX, Kalray MPPA-56
Metrics compared: Gflops/W (FP6), number of cores per node, frequency (MHz)


4 Applications
Compact stencils
- PDE solvers (e.g. Lattice Boltzmann Method)
- Irregular access pattern
- Access to neighbor nodes
- Non-contiguous memory access
Linear algebra (BLAS)
- Vector & Matrix operations
Cache coherence: how and where?
- Hardware: complexity & latency increase with # of cores
- Incoherence-aware software design: non-trivial
Memory
- Non-Uniform-Memory-Access (NUMA): difficult to do and maintain
- Global Memory: adds overhead
- Bandwidth can hardly follow # of cores and clock
Programming framework?
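The "access to neighbor nodes" and "non-contiguous memory access" points can be illustrated with a small sketch. It assumes a 2D grid stored row-major in a flat array and a 3x3 compact stencil; the function names `neighbor_offsets` and `stencil_average` are hypothetical and do not come from the slides. Neighbors in the same row sit next to each other in memory, while neighbors in the rows above and below are a full row width away.

```python
def neighbor_offsets(nx):
    """Flat-index offsets of the 3x3 compact stencil on a row-major grid of width nx."""
    return [dy * nx + dx for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def stencil_average(grid, nx, ny, x, y):
    """Average the 9 stencil points around interior node (x, y)."""
    assert 0 < x < nx - 1 and 0 < y < ny - 1, "interior nodes only"
    center = y * nx + x
    return sum(grid[center + off] for off in neighbor_offsets(nx)) / 9


# On a grid 1024 nodes wide, the three rows of the stencil are 1024 elements apart.
offsets = neighbor_offsets(1024)

# Tiny 4x4 example: average around node (1, 1) of a grid holding 0..15.
avg = stencil_average(list(range(16)), 4, 4, 1, 1)
```

With a wide grid, each stencil update therefore touches three separate memory regions, which is one way to read the non-contiguous access issue listed on the slide.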


5 Work done
- Many-core as Stand-alone node: MPI design and implementation; MPI optimizations comparison
- Many-core as Accelerator: clblas, clmagma
- Benchmark: High Performance Linpack (HPL)


6 Many-core as Stand-alone node
Issues: isolated cores, limited memory
Design: 6 + MPI processes per Many-core
- MPI compute ranks: 6 x MB on-chip memory
- MPI I/O rank: x GB DDR
MPPAMPI components (from the architecture diagram)
- Main thread (PE): local buffers, datatype, comm, MPI implementation, MPI_NOC_API
- Request queues: pending_isend, pending_irecv, send_post, recv_post
- DMA manager, NoC barrier
- Server (RM): callback handler, control message buffers (incoming and outgoing)
- Transport: MPPAIPC over the Network-on-Chip, carrying data and control messages


7 Many-core as Stand-alone node
Optimizations:
- Eager send for short messages
- Lazy send for medium messages
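The choice between the two send modes can be sketched as a size-based protocol switch. This is a toy model under stated assumptions, not the MPPAMPI implementation: the threshold `EAGER_LIMIT`, the `Receiver` class and the `send` function are hypothetical, and "lazy" is interpreted here as a transfer deferred until the matching receive is posted. The slides give neither the thresholds nor the handling of large messages.

```python
from collections import deque

EAGER_LIMIT = 256  # bytes; hypothetical threshold, not a value from the slides


class Receiver:
    """Toy receiver that matches incoming messages in arrival order."""

    def __init__(self):
        self.inbox = deque()     # arrival-ordered (kind, data) entries
        self.delivered = []      # payloads handed to the application
        self.staged_copies = 0   # messages that went through the eager buffer

    def post_recv(self):
        """Match the oldest message and return which protocol carried it."""
        kind, data = self.inbox.popleft()
        if kind == "eager":
            # The payload was already staged on the receiver side at send time.
            self.staged_copies += 1
            self.delivered.append(data)
        else:
            # Lazy path: the payload is copied from the sender's buffer only now.
            self.delivered.append(bytes(data))
        return kind


def send(receiver, payload):
    """Pick a protocol from the message size and return its name."""
    if len(payload) <= EAGER_LIMIT:
        # Eager: copy the payload into the receiver-side buffer immediately.
        receiver.inbox.append(("eager", bytes(payload)))
        return "eager"
    # Lazy: post only a reference; the data stays in the sender's buffer.
    receiver.inbox.append(("lazy", payload))
    return "lazy"
```

In this model the eager path spends an extra staging copy to let the sender return immediately, while the lazy path avoids the staging buffer at the price of waiting for the receiver, which is the usual trade-off between the two modes.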


8 Many-core as Accelerator: CLBLAS
- Offloading computation from Host to Many-core
- Static kernels for level -
- Dynamic code generation for level
- .6 Gflops (SGEMM), . Gflops (DGEMM)
Host-side call sequence: A, B, ... -> do_gemm(A, B, C, ...); do_something() { ... }; get_gemm(C) -> C
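The do_gemm / do_something / get_gemm sequence describes an asynchronous offload in which host work overlaps with the device computation. The following is a minimal sketch of that pattern only: a single worker thread stands in for the many-core device, the naive `_gemm` helper is a placeholder for the offloaded kernel, and the simplified signatures (no C argument, no scaling factors) are assumptions that differ from the clblas interface.

```python
from concurrent.futures import ThreadPoolExecutor

_device = ThreadPoolExecutor(max_workers=1)  # stand-in for the many-core device


def _gemm(a, b):
    """Naive matrix product on lists of lists (placeholder for the offloaded kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def do_gemm(a, b):
    """Start C = A * B on the 'device' and return immediately with a handle."""
    return _device.submit(_gemm, a, b)


def do_something():
    """Host-side work that can run while the offloaded GEMM is in flight."""
    return sum(range(1000))


def get_gemm(handle):
    """Block until the offloaded product is ready and return C."""
    return handle.result()


handle = do_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
host_result = do_something()   # overlaps with the offloaded product
c = get_gemm(handle)           # [[19, 22], [43, 50]]
```

The host only blocks in `get_gemm`, so any work placed between the two calls is hidden behind the offloaded computation as long as it does not depend on C.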


9 Many-core as Accelerator: CLMAGMA
- LAPACK-based interface (QR, LU, Cholesky, ...)
- Auxiliary kernels written in OpenCL
- Uses clblas to offload computation to the device(s)

10 Many-core as Accelerator: CLBLAS / CLMAGMA issues
- Block decomposition too small compared to the page size
- Concurrent accesses to the same memory block
- Kernels written for GPU, non-optimal for Many-core
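The first two issues are related: when a tile is much smaller than a memory page, several tiles processed by different cores end up in the same page. A rough sketch of the arithmetic follows, assuming a row-major matrix whose rows start on a page boundary, 8-byte elements and a 4096-byte page. These values and the function names are illustrative assumptions, not figures from the slides.

```python
def tiles_sharing_page(tile_width, elem_bytes=8, page_bytes=4096):
    """Number of horizontally adjacent tiles whose row segments fall in one page."""
    row_segment = tile_width * elem_bytes
    return max(1, page_bytes // row_segment)


def page_sized_tile_width(elem_bytes=8, page_bytes=4096):
    """Smallest tile width (in elements) whose row segment fills a whole page."""
    return -(-page_bytes // elem_bytes)  # ceiling division


small = tiles_sharing_page(32)         # 32 doubles = 256 bytes per row segment
width = page_sized_tile_width()        # elements needed to cover one page
aligned = tiles_sharing_page(width)    # tiles per page once the width matches
```

Under these assumptions a 32-element tile shares each page with 15 neighbors, while a 512-element tile owns its pages, which is the kind of size-aware tuning the issue list points toward.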

11 Benchmark
- HPL benchmark with 6 MPI ranks, BLAS / OpenBLAS
- Limited local memory space (6 x MB)
- Matrix size 5 x 5
- MPI Eager send gains % performance

12 Conclusion
Many-core as Stand-alone node
- MPI mono-many-core, BLAS/OpenBLAS and HPL
- Next: MPI + DSM to increase memory space via I/O DDR, plus optimization modeling
- Next: MPI multi-many-core for scalability
Many-core as Accelerator
- clblas + clmagma
- Next: block decomposition tuning for Many-core
- Next: {Page, Cache}-size-aware kernels
- Next: LBM design and acceleration
References
[] M. Q. Ho, B. Tourancheau, C. Obrecht et al., MPI communication on MPPA Manycore NoC: design, modeling and performance issues, PARCO, Edinburgh, Scotland, UK, - September 5.
[] Kalray Inc., ManycoreLabs report, vol. 6, pp. 9-, 5


13 Questions?

