On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations

7th European Conference on Antennas and Propagation (EuCAP 2013)
Swedish Exhibition & Congress Centre, Gothenburg, Sweden, 8-12 April 2013

Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI de Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications
Outline
1 Introduction and motivation
2 Computation of Green's functions on hybrid systems
3 Parallelization in CC-NUMA at MoM level of a VIE technique
4 Autotuning parallel codes
5 Conclusions
Introduction and motivation

Motivation of the work
1 High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2 Need for efficient software tools that allow the optimization of complex devices in real time.
3 The complexity of the devices increases the computational time as the cube of the problem size.

Identification of bottlenecks
Two important elements in integral equation formulations:
1 The calculation of Green's functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2 In Volume Integral Equation formulations, the size of the MoM matrices increases as N³.
Introduction and motivation

Objectives of the work
1 Increase efficiency using parallel computing.
2 The application of several hybrid-heterogeneous parallelism strategies is proposed in this context.

Strategies explored
1 At a low level, application of hybrid parallelism (MPI+OpenMP+CUDA) to the computation of Green's functions in rectangular waveguides.
2 At a higher level, combination of two-level parallelism (OpenMP and MKL multithread routines) in cc-NUMA systems, applied to accelerate MoM solutions in a VIE formulation.
3 Possibilities of using autotuning strategies.
Computation of Green's functions on hybrid systems

Hybrid parallelism
1 MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green's functions.

For each MPI process P_k, 0 <= k < p:
    omp_set_num_threads(h + g)
    for i = k*m/p to (k + 1)*m/p - 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for

As seen, p MPI processes are started. In addition, h + g threads run inside each process. Threads 0 to h - 1 work on the CPU (OpenMP). The remaining threads, h to h + g - 1, work on the GPU, calling CUDA kernels.
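The work split above can be sketched in a few lines. This is a minimal illustration of the static partitioning scheme described on the slide, not the actual code: `process_slice` and `worker_kind` are hypothetical names introduced here.

```python
# Sketch of the static work split described above (names are illustrative):
# m tasks are divided among p MPI processes, and within each process the
# first h thread ids compute on the CPU while the remaining g launch CUDA
# kernels.

def process_slice(k, p, m):
    """Contiguous block of task indices handled by MPI process k."""
    return list(range(k * m // p, (k + 1) * m // p))

def worker_kind(node, h):
    """Thread ids 0..h-1 work on the CPU; ids h..h+g-1 call CUDA kernels."""
    return "cpu" if node < h else "gpu"

# All p slices together cover the m tasks exactly once, even when p
# does not divide m evenly.
p, m = 4, 103
tasks = [i for k in range(p) for i in process_slice(k, p, m)]
assert tasks == list(range(m))
```

The integer division makes consecutive slices share their boundaries, so no task is dropped or duplicated when m is not a multiple of p.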
Computation of Green's functions on hybrid systems

Routines developed

    p \ h+g |  1+0  |   h+0   |   0+g    |     h+g
    --------+-------+---------+----------+---------------
        1   |  SEQ  |   OMP   |   CUDA   | OMP+CUDA
        p   |  MPI  | MPI+OMP | MPI+CUDA | MPI+OMP+CUDA

Computational systems tested
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) machines at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each; these machines are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.
Computation of Green's functions on hybrid systems

Comparison between the use of CPU and the use of GPU
Test of computational speed when CPUs or GPUs are used. The CPU version uses a number of threads equal to the number of cores.
Results are presented as a function of the problem size (#images, #points), as the ratio S = T(#threads=#cores) / T(#kernels=3). S > 1 means the GPU is preferred over the CPU.
Computation of Green's functions on hybrid systems

Comparison between GPU and optimum parameters
The selection of the optimum values for p, h and g produces lower execution times than blind GPU use.
Results are presented as a function of the problem size (#images, #points), as the ratio S = T(#kernels=3) / T(lowest). S > 1 means the GPU alone is worse than the lowest time. A speed-up of two is obtained for large problems using the optimum parameters.
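The selection step can be pictured as a search over measured configurations. A minimal sketch, with invented timing numbers (not from the slides), of how the best (p, h, g) triple beats the "blind GPU" default:

```python
# Illustrative selection of the best (p, h, g) configuration from measured
# execution times; all timing values below are made up for the example.

def best_config(timings):
    """Return the (p, h, g) tuple with the lowest measured time."""
    return min(timings, key=timings.get)

timings = {
    (1, 0, 3): 10.0,  # blind GPU use: 3 kernels, no CPU threads
    (1, 6, 0): 12.5,  # CPU only, one thread per core
    (2, 4, 2): 5.1,   # hybrid configuration with the lowest time
}
cfg = best_config(timings)
speedup_vs_gpu = timings[(1, 0, 3)] / timings[cfg]  # ~2x in this toy case
```

In practice the timings come from test executions, as discussed in the autotuning section.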
Computation of Green's functions on hybrid systems

Comparison between homogeneous and heterogeneous clusters
The combination of nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time, with different values of p, h and g for the different nodes.
Results are presented as a function of the problem size (#images, #points), as the ratio S = T(#kernels=3*#nodes) / T(lowest). An important reduction of the execution time is obtained with the heterogeneous cluster, with execution times closer to the lowest experimental ones.
Parallelization in CC-NUMA at MoM level of a VIE technique

Set-up of the problem
Free-space Green's functions are used, so parallelism is applied at the MoM level.
A cc-NUMA system is used, with shared memory but a large memory hierarchy; it is difficult to achieve the maximum theoretical speed-up.
Two-level parallelism is used: OpenMP, plus the implicit multithread parallelism of MKL.

Systems used in the experiments
Saturno, with 24 cores in four hexa-cores.
Ben and Bscsmp, an HP Integrity Superdome SX2000 and an SGI Altix 4700, with 128 Intel Itanium-2 dual-core Montvale and Montecito cores (1.6 GHz, with 18 MB and 8 MB of L3 cache) and 1.5 TB of shared memory.
Parallelization in CC-NUMA at MoM level of a VIE technique

Parallelism of MKL for the solution of large linear systems
zsysv/dsysv: used for complex/real symmetric matrices.
zgesv/dgesv: used for general complex/real matrices; slower, but exhibits better scalability.
For two different problem sizes: zsysv/dsysv saturate faster; zgesv/dgesv achieve lower times for larger numbers of cores.
Parallelization in CC-NUMA at MoM level of a VIE technique

Test in a large cc-NUMA system (Bscsmp machine)
In Bscsmp, for two different problem sizes: zsysv/dsysv saturate faster; zgesv/dgesv achieve lower times for larger numbers of cores.
In large systems it is preferable not to use all the available cores. An autotuning engine is needed to select the optimum number of threads at each parallelism level.
Parallelization in CC-NUMA at MoM level of a VIE technique

Two-level parallelism code
Multithreaded linear algebra routines have been combined with OpenMP in a two-level parallelism code.

    omp_set_nested(1)
    mkl_set_dynamic(0)
    omp_set_num_threads(ntomp)
    mkl_set_num_threads(ntmkl)
    for i = 0 to num_freq - 1 do
        fillmatrix(i, init_freq, step)
    end for
    #pragma omp parallel for private(i)
    for i = 0 to num_freq - 1 do
        solvesystem(i)
    end for
    for i = 0 to num_freq - 1 do
        circuitalparameters(i, init_freq, step)
    end for

A number of frequencies is assigned to a group of ntomp OpenMP threads. Inside each thread, the linear system is solved with an MKL routine using ntmkl threads.
The maximum speed-up in Ben is 35 using 64 cores, superior to the speed-up of 6 obtained with only multithreaded MKL.
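A minimal Python analogue of the outer level of the scheme above: a pool of ntomp workers, one frequency per task, each solving its own small linear system. The 2x2 Cramer solver here is only a stand-in for the multithreaded MKL call, and the frequency-dependent system is invented for the example.

```python
# Toy analogue of the two-level code: the outer parallel-for over
# frequencies is a thread pool; the per-frequency solve stands in for the
# MKL routine (which would itself run ntmkl threads).
from concurrent.futures import ThreadPoolExecutor

def solve2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] x = [e, f] by Cramer's rule."""
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - c * e) / det)

def solve_for_frequency(i):
    # Hypothetical frequency-dependent system; in the real code this is the
    # MoM matrix filled for frequency i and solved with zsysv/zgesv.
    freq = 1.0 + 0.1 * i
    return solve2x2(2.0 + freq, 1.0, 1.0, 3.0 + freq, 1.0, 2.0)

num_freq, ntomp = 8, 4
with ThreadPoolExecutor(max_workers=ntomp) as pool:
    parallel = list(pool.map(solve_for_frequency, range(num_freq)))
sequential = [solve_for_frequency(i) for i in range(num_freq)]
assert parallel == sequential  # same results, frequencies solved concurrently
```

Since the frequency points are independent, the outer loop is embarrassingly parallel; the trade-off studied on the next slide is how to split the cores between ntomp and ntmkl.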
Parallelization in CC-NUMA at MoM level of a VIE technique

Final test with hybrid OpenMP and MKL parallelism
Computation of 128 frequencies with 64 cores in Ben with zsysv or zgesv, for three mesh sizes:

    Routine  Mesh compl. | ntomp-ntmkl:  64-1     32-2     16-4
    ---------------------+---------------------------------------
    zsysv    Simple      |               3.08     2.22     2.85
             Medium      |              48.53    63.66    97.65
             Complex     |             114.68   152.91   241.79
    zgesv    Simple      |               4.94     4.93     6.12
             Medium      |              96.49    81.36    89.04
             Complex     |             222.01   171.42   193.46

The lowest times are obtained with zsysv, due to the low number of threads in MKL. The use of nested parallelism is definitely better in some cases, especially when using zgesv.
Autotuning parallel codes

Autotuning strategies
The high complexity of today's hybrid, heterogeneous and hierarchical parallel systems makes it difficult to estimate the optimum parameters leading to the lowest execution times.
The solution is to develop codes with autotuning engines, which try to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.

Types of autotuning techniques
Empirical autotuning.
Modeling of the execution time.
Autotuning parallel codes

Empirical autotuning
Some test executions are run during the initial installation phase of the routine (the installation set, which is kept small). This information is used at run time, when a particular problem is being solved (the validation set).

    images-points   1000-25   100000-25   1000-100   100000-100
    AUTO-TUNING       0.155       5.012      1.706       87.814
    LOWEST            0.114       5.012      1.656       79.453
    DEVIATION        35.96%          0%      3.02%       10.52%

Waveguide Green's functions: different problem sizes (images, number of points). Execution times with the autotuning technique and with the optimum parameters (lowest). The autotuning routine performs well for all problem sizes investigated.
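The install-then-lookup cycle above can be sketched as follows. Everything here is invented for illustration: the configurations, the nearest-size lookup rule, and the toy timing model (the slides do not specify how the installed information is interpolated at run time).

```python
# Sketch of empirical autotuning: time a small installation set of problem
# sizes for each candidate configuration, then at run time reuse the
# configuration recorded for the nearest installed size.

def install(timing_fn, sizes, configs):
    """Keep, for each installed size, the configuration with lowest time."""
    return {n: min(configs, key=lambda c: timing_fn(n, c)) for n in sizes}

def choose(table, n):
    """Runtime lookup: configuration of the nearest installed size."""
    nearest = min(table, key=lambda m: abs(m - n))
    return table[nearest]

# Invented timing model for configurations (p, h, g): the GPU pays a fixed
# start-up cost per kernel but processes work faster per worker.
def fake_time(n, cfg):
    p, h, g = cfg
    workers = p * (h + 2 * g)
    return 5.0 * g + n / workers

table = install(fake_time, sizes=[100, 10000], configs=[(1, 4, 0), (1, 2, 2)])
# Small problems pick the CPU-only configuration, large ones the hybrid one.
```

The same trade-off appears in the table above: the only size with a large deviation (1000-25) is a small problem, where the fixed costs the installation set cannot fully capture matter most.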
Autotuning parallel codes

Modeling of linear algebra routines
Automatic selection of the preferred routine (zsysv/zgesv) helps to reduce the execution time of the VIE code. The execution time is modeled as

    T(n, h) = k1·n³/h + k2·n²·h + k3·n² + k4·n²/h + k5·n·h + k6·n    (1)

where n is the matrix size and h the number of threads. The model and the experiments are in good agreement. The model is used to automatically select the best routine and the number of MKL threads.
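Once the coefficients of Eq. (1) are known for each routine, the selection is a direct minimization over the candidate thread counts. A sketch with invented coefficients (the real k1..k6 are fitted to measurements on each machine):

```python
# Evaluate the model of Eq. (1) and keep the thread count h with the lowest
# predicted time. Coefficients k = (k1, ..., k6) are invented for the
# example; in practice they are fitted per routine and per machine.

def model_time(n, h, k):
    k1, k2, k3, k4, k5, k6 = k
    return (k1 * n**3 / h + k2 * n**2 * h + k3 * n**2
            + k4 * n**2 / h + k5 * n * h + k6 * n)

def best_threads(n, k, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Thread count minimizing the modeled time for matrix size n."""
    return min(candidates, key=lambda h: model_time(n, h, k))

# With these made-up coefficients the n^3/h term dominates for large n, so
# the optimal h grows with the problem size, while the h-proportional
# overhead terms keep it small for small problems.
k = (1e-9, 1e-9, 1e-9, 1e-9, 1e-6, 1e-6)
```

Running `best_threads` for each routine's coefficient set and comparing the two minima gives the automatic zsysv/zgesv choice mentioned above.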
Conclusions
The combination of several parallelism paradigms allows the efficient solution of electromagnetic problems in today's computational systems, which are hybrid, heterogeneous and hierarchical.
The calculation of Green's functions inside waveguides has been adapted to heterogeneous clusters with CPUs and GPUs of different speeds.
The large systems arising in the VIE formulation were solved by combining OpenMP and MKL parallelism.
Further improvements are expected if large systems are solved by combining MPI+OpenMP+GPU with out-of-core techniques and an efficient use of multithreaded and multi-GPU linear algebra routines.
Autotuning techniques can be incorporated so that non-experts in parallelism can use the routines efficiently in complex computational systems.