Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems


1 Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems. Murilo Boratto, Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado da Bahia, Brazil. July 2, 2016.


2 Outline: Motivation; TRSM; Auto-Tuning; Modelling; Experimental Results; Perspectives.


5 Objective Auto-tuning: automatically decide how to run a routine for low execution time. Predict the routine behavior depending on the problem size and the computational system. Use of parameterized routines and models, with theoretical or empirical estimation of the parameters of the system and selection of the routine parameters at execution time.


6 Hybrid parallelism. Routines combine different sources of parallelism: multilevel parallelism with OpenMP; OpenMP+BLAS+MAGMA parallelism; CPU+GPU+Coprocessor parallelism. Possible extensions: MPI+OpenMP+BLAS+MAGMA+Multi-GPU+Multi-Coprocessor... Both the systems and the computation are hybrid, heterogeneous and hierarchical.


7 Applications. Collaboration with the Scientific Computing and Parallel Programming (SCPP) group at the University of Murcia: auto-tuning of parallel routines and applications of parallelism (http://dis.um.es/ domingo/investigacion.html). This presentation summarizes our ongoing work on auto-tuning a hybrid-parallel linear algebra routine: the triangular linear system solver (TRSM).

8 Applications: Linear algebra. Basic routines (matrix multiplication, factorizations...) with OpenMP+BLAS+MAGMA parallelism, to be used in large computational problems (electromagnetism, statistical models...). Related work: J. Cuenca, L. P. García, D. Giménez, F. J. Herrera: Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU+multiGPUs Platforms, CMMSE16. M. Boratto, P. Alonso, D. Giménez, A. Lastovetsky: Automatic Tuning to Performance Modelling of Matrix Polynomials on Multicore and Multi-GPU Systems, CMMSE15, The Journal of Supercomputing.

9 Computational systems: CPU+GPU+Coprocessor. Combination of multicores + GPU + coprocessor. Preliminary analysis of the combination of the three types of systems. Modeling or empirical installation? Goal: an auto-tuning methodology for nested parallelism (OpenMP+BLAS+MAGMA), with experiments on the TRSM routine.


10 TRSM

11 TRSM. The TRSM routine solves the linear system AX = B, where A is an upper or lower triangular matrix and B is a known matrix, called the right-hand side matrix. It is implemented as a parallel block algorithm: small triangular systems are solved with TRSM and the remaining blocks are updated with matrix multiplications (GEMM). The blocks are distributed to the computational components through a structure of tasks. [Figure: block task graph with TRSM tasks on the diagonal blocks (00, 11, ...) and GEMM tasks on the blocks below them (10, 20, ...), assigned to threads T1, T2 and T3. Legend: task done; 0 = task waiting to be done; 1 = task with a single dependency; 2 = task with a double dependency; producer thread; consumer thread; thread waiting for a task.]
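The block algorithm and its task structure can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the presentation's implementation: the function name `solve_lower_tasks`, the use of Python threads, and the dependency scheme are all hypothetical. The scheme assumed here gives each GEMM on block (i, k) up to two predecessors (the TRSM of block row k and the previous GEMM on block row i) and each TRSM at most one, which is one way to obtain the 0/1/2 dependency counts named in the figure legend. Plain nested loops stand in for the BLAS/MAGMA kernels.

```python
import queue
import threading


def solve_lower_tasks(L, B, w, workers=3):
    """Solve L X = B for lower-triangular L using block tasks; returns X."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]            # overwritten block by block with X
    nb = (n + w - 1) // w                # number of block rows

    def rows(b):                         # global indices covered by block b
        return range(b * w, min((b + 1) * w, n))

    def trsm(i):                         # X_i <- inv(L_ii) X_i (forward substitution)
        for r in rows(i):
            for c in range(i * w, r):
                for j in range(m):
                    X[r][j] -= L[r][c] * X[c][j]
            for j in range(m):
                X[r][j] /= L[r][r]

    def gemm(i, k):                      # X_i <- X_i - L_ik X_k
        for r in rows(i):
            for c in rows(k):
                for j in range(m):
                    X[r][j] -= L[r][c] * X[c][j]

    # Dependency counters (assumed scheme): 0, 1 or 2 pending predecessors.
    deps = {}
    for i in range(nb):
        deps[("TRSM", i, i)] = 0 if i == 0 else 1
        for k in range(i):
            deps[("GEMM", i, k)] = 1 if k == 0 else 2

    def successors(task):
        kind, i, k = task
        if kind == "TRSM":               # X_i is final: GEMMs in column i may use it
            return [("GEMM", r, i) for r in range(i + 1, nb)]
        # next update of block row i, or its TRSM once all updates are done
        return [("GEMM", i, k + 1)] if k + 1 < i else [("TRSM", i, i)]

    ready = queue.Queue()                # tasks whose counter has reached 0
    ready.put(("TRSM", 0, 0))
    lock = threading.Lock()
    remaining = [len(deps)]

    def consumer():
        while True:
            task = ready.get()           # blocks while waiting for a task
            if task is None:             # sentinel: every task is done
                return
            kind, i, k = task
            if kind == "TRSM":
                trsm(i)
            else:
                gemm(i, k)
            with lock:                   # release successors whose counter hits 0
                for s in successors(task):
                    deps[s] -= 1
                    if deps[s] == 0:
                        ready.put(s)
                remaining[0] -= 1
                if remaining[0] == 0:
                    for _ in range(workers):
                        ready.put(None)

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return X
```

In this sketch every thread both consumes ready tasks and produces new ones when it releases successors; a real hybrid version would instead route each ready task to a CPU core, GPU or coprocessor. The sketch assumes a non-empty system with a nonzero diagonal.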


12 Auto-Tuning

13 Auto-Tuning. Auto-tuning is a software technique by which a routine is expected to meet its requirements at runtime, adapting to changes in the system (different system configurations) or in the problem (size and particular input). The principal objectives are self-configuration and self-optimization.


14 Auto-Tuning Methodology

15 Empirical Installation. Some selected executions are run at installation time to experiment with the parameters of the model; a long installation time may be needed. For some problem sizes, the best combination of parameters is searched for, by exhaustive search or by guided search (searching in the most promising directions). The experiments use some problem sizes (the installation set) and some values of the algorithmic parameters (the parameters set). As a result, either the equation that predicts the execution time is generated or the preferred parameters are stored.
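The exhaustive-search variant of the installation can be sketched as follows. This is an illustrative outline only: the names `install`, `wall_clock` and `measure` are hypothetical, and the sketch stores the preferred parameters per problem size rather than generating a predictive equation. A guided search would replace the inner loop over all combinations with a walk along the most promising directions.

```python
import itertools
import time


def wall_clock(routine):
    """Wrap routine(n, *params) into a function returning elapsed seconds."""
    def measure(n, params):
        start = time.perf_counter()
        routine(n, *params)
        return time.perf_counter() - start
    return measure


def install(measure, installation_set, parameter_values):
    """Exhaustive search: return {n: (best_params, best_time)}.

    measure(n, params) must return the execution time for problem size n;
    parameter_values holds one list of candidate values per parameter.
    """
    preferred = {}
    for n in installation_set:
        best = None
        for params in itertools.product(*parameter_values):
            elapsed = measure(n, params)
            if best is None or elapsed < best[1]:
                best = (params, elapsed)
        preferred[n] = best
    return preferred
```

Passing the measurement function as an argument keeps the search independent of how time is obtained: `wall_clock(routine)` times real executions, while a theoretical model of the routine could be plugged in the same way.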


16 Experimental Results

17 Table: Execution Environment: Hardware and Software
Processor: Intel(R) Core(TM)
Memory: 12GB
Clock: 3.60GHz
Number of Processors: 2
Cores per Processor: 8
GPU Type: NVIDIA TESLA C2070
Number of GPUs: 2
CUDA cores per GPU: 2496
GPU Memory per GPU: 5GB GRR3
Version: CUDA 4.0
MIC: Intel Xeon Phi
Number of MIC: 1
Cores per MIC: 57


18 Installation. Algorithmic parameters: c, the number of threads, and w, the block size. Installation set = {100, 200, 300, 400, 500}. Parameters set = {2, 4, 6, 8, 10, 12, 14, 16} × {16, 32, 64}. The best combination of parameters obtained is (c, w) = (12, 64), and the equation that predicts the execution time is generated.
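To give a sense of the cost of this installation, the sets on this slide can be enumerated directly. The sketch reads the parameters set as the Cartesian product of the thread counts and the block sizes, which is an assumption about the slide's notation. The helper `nearest_installed` is hypothetical: it shows one simple rule for choosing stored parameters at execution time and is not taken from the presentation.

```python
from itertools import product

installation_set = [100, 200, 300, 400, 500]    # problem sizes n
thread_counts = [2, 4, 6, 8, 10, 12, 14, 16]    # values of c
block_sizes = [16, 32, 64]                      # values of w

# Every (c, w) combination tried by an exhaustive search.
parameters_set = list(product(thread_counts, block_sizes))
total_runs = len(installation_set) * len(parameters_set)


def nearest_installed(n, sizes=installation_set):
    """Installed problem size closest to n (ties go to the smaller size)."""
    return min(sizes, key=lambda s: (abs(s - n), s))


print(len(parameters_set), total_runs)  # 24 120
```

With 24 combinations and 5 problem sizes, an exhaustive installation needs 120 timed executions, which is why a guided search can be worthwhile for larger sets. Under the assumed nearest-size rule, a problem of size 260 would reuse the parameters stored for size 300.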


19 Execution Time. Execution time (in seconds) with different parameter values, for four problem sizes n.
Platform   n =            n =            n =            n =
c          w   time       w   time       w   time       w   time
2          16  52.90      16  70.06      32  105.80     32  173.19
4          16  28.79      32  39.23      32  59.22      64  85.46
6          32  17.46      32  22.87      32  33.60      64  48.42
8          32  14.

20 Perspectives

21 Perspectives. Modelling can help in the auto-tuning of basic parallel routines and scientific codes, thereby contributing to the efficient use of parallel programs. Hybrid parallelism (different types of parallelism, different paradigms...) introduces additional difficulties. Sometimes the theoretical models are combined with empirical analysis. The results with the TRSM routine are satisfactory, but better modeling techniques are needed, especially for complex scientific problems on more complex (hybrid, heterogeneous, hierarchical) systems.
