Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado da Bahia, Brazil July 2, 2016
Outline Motivation TRSM Auto-Tuning Modelling Experimental Results Perspectives
Objective Auto-tuning: automatically decide how to run a routine to obtain a low execution time. Predict the routine behavior depending on the problem size and the computational system. Use parameterized routines and models, with theoretical or empirical estimation of the parameters of the system and selection of the routine parameters at execution time.
Hybrid parallelism Routines combining different sources of parallelism: Multilevel parallelism with OpenMP OpenMP+BLAS+MAGMA parallelism CPU+GPU+Coprocessor parallelism There are possible extensions: MPI+OpenMP+BLAS+MAGMA+Multi-GPU+Multi-Coprocessor... Hybrid, heterogeneous, hierarchical systems and computation.
Applications Collaboration with the Scientific Computing and Parallel Programming (SCPP) group at the University of Murcia: Auto-tuning of parallel routines and parallel applications (http://dis.um.es/domingo/investigacion.html) In this presentation we summarize our ongoing work on auto-tuning a hybrid-parallel linear algebra routine: the Triangular Linear Systems Solver (TRSM)
Applications: Linear algebra Basic routines: matrix multiplication, factorizations... with OpenMP+BLAS+MAGMA parallelism, to be used in large computational problems (electromagnetism, statistical models...) Related publications: J. Cuenca, L. P. García, D. Giménez, F. J. Herrera: Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU+multiGPUs Platforms, CMMSE16. M. Boratto, P. Alonso, D. Giménez, A. Lastovetsky: Automatic Tuning to Performance Modelling of Matrix Polynomials on Multicore and Multi-GPU Systems, CMMSE15, The Journal of Supercomputing.
Computational systems: CPU+GPU+Coprocessor Combination of Multicores + GPU + Coprocessor Preliminary analysis on the combination of the three types of systems Modeling or empirical installation? Goal Auto-Tuning methodology for nested parallelism (OpenMP+BLAS+MAGMA), experiments with the TRSM routine.
TRSM
TRSM The TRSM routine solves the linear equation AX = B, where A is an upper or lower triangular matrix and B is a known matrix, called the right-hand side matrix. It is implemented with a parallel block algorithm: small triangular systems are solved with TRSM and the updates are matrix multiplications with GEMM. Blocks are distributed to the computational components through a structure of tasks. [Figure: task dependency diagram of the blocked algorithm. Legend: -1 = task done, 0 = task waiting to be done, 1 = task with single dependency, 2 = task with double dependency. A producer thread generates TRSM and GEMM tasks (T1, T2, T3, ...) that consumer threads execute; a thread may wait for a task.]
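The block structure described above can be sketched as follows. This is a minimal NumPy sketch, not the actual tuned routine: np.linalg.solve stands in for the BLAS TRSM kernel on each diagonal block, and the @ product stands in for GEMM; in the real routine these tasks are dispatched to CPU, GPU, or coprocessor workers.

```python
import numpy as np

def blocked_trsm(A, B, b):
    """Solve A X = B for a lower-triangular A with a blocked algorithm.

    Each diagonal block is solved with a small triangular solve (the
    TRSM task) and the trailing right-hand-side blocks are updated with
    a matrix multiplication (the GEMM tasks). Here both kernels are
    plain NumPy calls so that only the block structure is shown.
    """
    n = A.shape[0]
    X = B.astype(float).copy()
    for i in range(0, n, b):
        j = min(i + b, n)
        # TRSM task: solve the diagonal block A[i:j, i:j] X[i:j] = X[i:j]
        X[i:j] = np.linalg.solve(A[i:j, i:j], X[i:j])
        # GEMM tasks: update the remaining right-hand-side blocks
        X[j:] -= A[j:, i:j] @ X[i:j]
    return X
```

The block size b corresponds to the algorithmic parameter w discussed later; the dependency structure between the TRSM and GEMM tasks is exactly the one in the task diagram above.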
Auto-Tuning
Auto-Tuning Auto-tuning is a software technique by which a routine adapts at runtime to changes in the system (different system configurations) or in the problem (size and particular input) in order to keep meeting its performance requirements. The principal objectives are: Self-configuration Self-optimization
Auto-Tuning Methodology
Empirical Installation Run some selected executions at installation time: Experiment with the parameters of the model A large installation time may be needed For some problem sizes, search for the best parameter combination: exhaustive search / guided search (search in the most promising directions). Experiments are run with some problem sizes (the Installation set) and some values of the algorithmic parameters (the Parameters set). The equation that predicts the execution time is generated, or the preferred parameters are stored.
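A guided search, as opposed to the exhaustive one, can be sketched as a greedy walk over the parameter grid: move to a neighbouring parameter combination while the measured time improves. This is a hedged illustration: the function name guided_search and the stand-in measure callback are hypothetical, not part of the actual installation software.

```python
def guided_search(measure, cs, ws):
    """Greedy search over the (c, w) parameter grid: start from the
    first pair and move to a neighbouring pair while the measured
    time improves (search in the most promising direction).

    `measure(c, w)` stands in for timing one execution of the routine
    at installation time; `cs` and `ws` play the role of the
    Parameters set (number of threads and block sizes).
    """
    i = j = 0
    best = measure(cs[i], ws[j])
    improved = True
    while improved:
        improved = False
        # try the four grid neighbours of the current combination
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < len(cs) and 0 <= nj < len(ws):
                t = measure(cs[ni], ws[nj])
                if t < best:
                    best, i, j = t, ni, nj
                    improved = True
                    break
    return cs[i], ws[j], best
```

Such a search times far fewer combinations than the exhaustive one, at the risk of stopping in a local minimum of the parameter grid.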
Experimental Results
Table: Execution Environment: Hardware and Software

  Processor              Intel Core(TM)
  Memory                 12 GB
  Clock                  3.60 GHz
  Number of Processors   2
  Cores per Processor    8
  GPU Type               NVIDIA TESLA C2070
  Number of GPUs         2
  CUDA Cores per GPU     2496
  Memory per GPU         5 GB
  CUDA Version           4.0
  MIC                    Intel Xeon Phi
  Number of MICs         1
  Cores per MIC          57
Installation Algorithmic parameters: c: number of threads w: block size Installation set = {100, 200, 300, 400, 500} Parameters set: c in {2, 4, 6, 8, 10, 12, 14, 16}, w in {16, 32, 64} The best parameter combination is obtained, (12, 64), and the equation that predicts the execution time is generated
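The exhaustive installation step can be sketched as follows: time every (c, w) pair for each size in the Installation set, keep the best pair, and fit the coefficients of a prediction model. This is only an illustration under stated assumptions: the function name install and the measure callback are hypothetical, and a cubic model t(n) = a·n³ + b·n² + c·n is assumed as the prediction equation, since TRSM performs O(n³) work.

```python
import numpy as np

def install(measure, sizes, cs, ws):
    """Exhaustive installation: time every (c, w) pair for each problem
    size in the Installation set, record the best pair, and fit by
    least squares a cubic model t(n) = a*n^3 + b*n^2 + c*n that
    predicts the execution time.

    `measure(n, c, w)` stands in for one timed execution of the routine.
    Returns {n: (time, c, w)} and the fitted coefficients (a, b, c).
    """
    best = {}
    for n in sizes:
        # try every combination in the Parameters set, keep the fastest
        best[n] = min((measure(n, c, w), c, w) for c in cs for w in ws)
    times = np.array([best[n][0] for n in sizes], dtype=float)
    basis = np.array([[n**3, n**2, n] for n in sizes], dtype=float)
    coeffs, *_ = np.linalg.lstsq(basis, times, rcond=None)
    return best, coeffs
```

At execution time the fitted equation is evaluated for the actual problem size and the stored preferred parameters are selected, avoiding any new search.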
Execution Time Execution time with different parameter values (in seconds).

        n = 10240      n = 11200      n = 12800      n = 14400
  c     w    time      w    time      w    time      w    time
  1    16   97.86     16  125.64     16  193.59     32  270.78
  2    16   52.90     16   70.06     32  105.80     32  173.19
  4    16   28.79     32   39.23     32   59.22     64   85.46
  6    32   17.46     32   22.87     32   33.60     64   48.42
  8    32   14.25     32   19.09     32   29.71     64   41.99
  10   64   13.24     64   17.85     64   27.87     64   38.93
  12   64   12.28     64   14.74     64   23.24     64   32.16
  14   64   13.76     64   16.50     64   26.45     64   35.65
  16   64   15.04     64   17.73     64   28.04     64   33.86
Perspectives
Perspectives Modelling can help in the auto-tuning of basic parallel routines and scientific codes, thus contributing to the efficient use of parallel programs. Hybrid parallelism (different types of parallelism, different paradigms...) introduces additional difficulties. Sometimes the theoretical models are combined with empirical analysis. Satisfactory results were obtained with the TRSM routine, but better modeling techniques are needed, especially for complex scientific problems on more complex (hybrid, heterogeneous, hierarchical) systems.