On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations


On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations
7th European Conference on Antennas and Propagation (EuCAP 2013)
Swedish Exhibition & Congress Centre, Gothenburg, Sweden, 8-12 April 2013

Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications

Álvarez/Giménez et al. (UPCT/UMU), EuCAP 2013, 8-12 April 2013

Outline
1. Introduction and motivation
2. Computation of Green's functions on hybrid systems
3. Parallelization in CC-NUMA at MoM level of a VIE technique
4. Autotuning parallel codes
5. Conclusions

Introduction and motivation: Motivation of the work
1. High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2. Need for efficient software tools that allow optimization of complex devices in real time.
3. The complexity of the devices increases the computational time as the cube of the problem size.

Identification of bottlenecks
Two important elements in integral equation formulations:
1. Calculation of Green's functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2. In Volume Integral Equation formulations, the size of the MoM matrices increases as N³.

Introduction and motivation: Objectives of the work
1. Increase efficiency using parallel computing.
2. Application of several hybrid-heterogeneous parallelism strategies is proposed in this context.

Strategies explored
1. At a low level, application of hybrid parallelism (MPI+OpenMP+CUDA) for the computation of Green's functions in rectangular waveguides.
2. At a higher level, combination of two-level parallelism (OpenMP and MKL multithreaded routines) in cc-NUMA systems, applied to accelerate MoM solutions in the VIE formulation.
3. Possibilities to use autotuning strategies.

Computation of Green's functions on hybrid systems: Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green's functions.

For each MPI process P_k, 0 <= k < p:

  omp_set_num_threads(h + g)
  for i = k*m/p to (k+1)*m/p - 1 do
    node = omp_get_thread_num()
    if node < h then
      compute with OpenMP thread
    else
      call CUDA kernel
    end if
  end for

In total, p MPI processes are started, and h + g threads run inside each process. Threads 0 to h-1 work on the CPU (OpenMP), while the remaining threads, h to h+g-1, work on the GPU by calling CUDA kernels.
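The index partition used above can be sketched as follows. This is an illustrative model (not the authors' code): the m work items are split into contiguous blocks among the p MPI processes, and within each process a thread's role (CPU or GPU) is decided by its thread number.

```python
# Hypothetical sketch of the hybrid work partition: m items, p MPI
# processes, h CPU threads and g GPU threads per process.

def process_block(k, p, m):
    """Contiguous block [lo, hi) of work items owned by MPI process k."""
    lo = k * m // p
    hi = (k + 1) * m // p
    return lo, hi

def worker_kind(node, h):
    """Thread `node` computes on the CPU if node < h, otherwise on the GPU."""
    return "cpu" if node < h else "gpu"
```

The integer division guarantees that the p blocks are disjoint and cover all m items even when m is not a multiple of p.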

Computation of Green's functions on hybrid systems: Routines developed

  p \ h+g | 1+0 | h+0     | 0+g      | h+g
  1       | SEQ | OMP     | CUDA     | OMP+CUDA
  p       | MPI | MPI+OMP | MPI+CUDA | MPI+OMP+CUDA

Computational systems tested
- Saturno: a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
- Marte and Mercurio: AMD Phenom II X6 1075T (hexa-core) at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each; these machines are connected in a homogeneous cluster.
- Luna: an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
- All of them are connected in a heterogeneous cluster.

Computation of Green's functions on hybrid systems: Comparison between use of CPU versus use of GPU
Test of computational speed when CPUs or GPUs are used. The CPU version uses a number of threads equal to the number of cores.
Results are presented as a function of the problem size (#images, #points), with S = T(#threads = #cores) / T(#kernels = 3). S > 1 means the GPU is preferred over the CPU.

Computation of Green's functions on hybrid systems: Comparison between GPU and optimum parameters
Selecting the optimum values of p, h and g produces lower execution times than blind GPU use.
Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3) / T(lowest). S > 1 means blind GPU use is worse than the lowest time. A speed-up of two is obtained for large problems using the optimum configuration.

Computation of Green's functions on hybrid systems: Comparison homogeneous - heterogeneous cluster
Combining nodes with different computational speeds, numbers of cores and GPUs produces an additional reduction of the execution time, with different values of p, h and g for the different nodes.
Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3·#nodes) / T(lowest). An important reduction of the execution time is obtained with the heterogeneous cluster, with execution times closer to the lowest experimental one.

Parallelization in CC-NUMA at MoM level of a VIE technique: Set-up of the problem
Free-space Green's functions are used, so parallelism is applied at the MoM level. A cc-NUMA system is used, with shared memory but a large memory hierarchy, which makes it difficult to achieve the maximum theoretical speed-up.
Two-level parallelism is used: OpenMP, and the implicit multithreaded parallelism of MKL.

Systems used in experiments
- Saturno, with 24 cores in four hexa-core processors.
- Ben and Bscsmp, an HP Integrity Superdome SX2000 and an SGI Altix 4700, with 128 cores of Intel Itanium-2 dual-core Montvale and Montecito (1.6 GHz, with 18 MB and 8 MB of L3 cache) and 1.5 TB of shared memory.

Parallelization in NUMA at MoM level of a VIE: Parallelism of MKL for inversion of large matrices
- zsysv/dsysv: used for complex/real symmetric matrices.
- zgesv/dgesv: used for general complex/real matrices; slower, but exhibits better scalability.
For two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores.
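To make the routine choice concrete: the MoM system Z·x = b has a complex *symmetric* matrix (Z equal to its transpose, not its conjugate transpose), which is exactly the case zsysv exploits; zgesv ignores the symmetry but factors the same matrix. A minimal numpy sketch (not the authors' code; numpy's solve wraps the general LAPACK driver) verifies that a general solver handles such a system correctly:

```python
import numpy as np

# Build a small complex symmetric system, as arises in the MoM matrices of
# the talk: Z == Z.T but Z != Z.conj().T. LAPACK zsysv would exploit this
# symmetry (roughly half the flops); the general driver (zgesv, which
# np.linalg.solve calls) works on any square matrix.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Z = A + A.T                       # complex symmetric
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

x = np.linalg.solve(Z, b)         # general driver
residual = np.linalg.norm(Z @ x - b) / np.linalg.norm(b)
assert residual < 1e-10
```

With MKL both drivers are multithreaded, which is what the scalability comparison on the slide measures.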

Parallelization in cc-NUMA at MoM level of a VIE: Test in large cc-NUMA (Bscsmp machine)
In Bscsmp, for two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores. In large systems it is preferable not to use all the available cores, so some autotuning engine is necessary to select the optimum number of threads at each parallelism level.

Parallelization in cc-NUMA at MoM level of a VIE: Two-level parallelism code
Multithreaded linear algebra routines have been combined with OpenMP in a two-level parallelism code:

  omp_set_nested(1)
  mkl_set_dynamic(0)
  omp_set_num_threads(ntomp)
  mkl_set_num_threads(ntmkl)
  for i = 0 to num_freq do
    fillmatrix(i, init_freq, step)
  end for
  #pragma omp parallel for private(i)
  for i = 0 to num_freq do
    solvesystem(i)
  end for
  for i = 0 to num_freq do
    circuitalparameters(i, init_freq, step)
  end for

The frequencies are assigned to a group of ntomp OpenMP threads, and inside each thread the linear system is solved with an MKL routine using ntmkl threads. The maximum speed-up in Ben is 35 using 64 cores, superior to the speed-up of 6 obtained with multithreaded MKL alone.
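The two-level structure above can be modeled in a few lines. This is a hypothetical sketch with invented names, not the authors' code: an outer pool of ntomp workers handles the frequency sweep, while the per-frequency solve stands in for the inner (MKL-multithreaded) level.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def solve_one_frequency(Z, b):
    # Inner level: in the talk this is an MKL routine running with
    # ntmkl threads; here plain numpy.linalg.solve stands in for it.
    return np.linalg.solve(Z, b)

def frequency_sweep(systems, ntomp=4):
    """Outer level: solve one linear system per frequency,
    ntomp frequencies at a time."""
    with ThreadPoolExecutor(max_workers=ntomp) as pool:
        return list(pool.map(lambda s: solve_one_frequency(*s), systems))
```

Because each frequency is an independent system, the outer loop parallelizes trivially; the trade-off studied on the next slide is how to split a fixed core count between ntomp and ntmkl.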

Parallelization in cc-NUMA at MoM level of a VIE: Final test with hybrid OpenMP and MKL parallelism
Computation of 128 frequencies with 64 cores in Ben, with zsysv or zgesv, for three mesh complexities. Execution times for each ntomp-ntmkl combination:

          ntomp-ntmkl:   64-1     32-2     16-4
  zsysv   Simple         3.08     2.22     2.85
          Medium        48.53    63.66    97.65
          Complex      114.68   152.91   241.79
  zgesv   Simple         4.94     4.93     6.12
          Medium        96.49    81.36    89.04
          Complex      222.01   171.42   193.46

The lowest times are obtained with zsysv, due to the low number of threads in MKL. The use of nested parallelism is definitely better in some cases, especially when using zgesv.

Autotuning parallel codes: Autotuning strategies
The high complexity of today's hybrid, heterogeneous and hierarchical parallel systems makes it difficult to estimate the optimum parameters leading to the lowest execution times. The solution is to develop codes with autotuning engines, which try to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.

Types of autotuning techniques
- Empirical autotuning.
- Modeling of the execution time.

Autotuning parallel codes: Empirical autotuning
Some test executions are run during the initial installation phase of the routine (the installation set, which is kept small). This information is then used at run time, when a particular problem is being solved (the validation set).

Waveguide Green's functions, for different problem sizes (images - number of points). Execution times with the autotuning technique and with the optimum parameters (lowest):

  images-points   1000-25   100000-25   1000-100   100000-100
  AUTO-TUNING       0.155       5.012      1.706       87.814
  LOWEST            0.114       5.012      1.656       79.453
  DEVIATION        35.96%          0%      3.02%       10.52%

The autotuning routine performs well for all problem sizes investigated.
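The install-then-reuse scheme can be sketched as follows. This is an illustrative model with invented names, not the authors' engine: at installation time every candidate configuration is timed on a small set of problem sizes, and at run time the configuration recorded for the nearest installed size is reused.

```python
import time

def install(kernel, sizes, configs):
    """Installation phase: time every (size, config) pair on the
    installation set and keep the fastest config per size."""
    best = {}
    for n in sizes:
        timings = {}
        for cfg in configs:
            t0 = time.perf_counter()
            kernel(n, cfg)
            timings[cfg] = time.perf_counter() - t0
        best[n] = min(timings, key=timings.get)
    return best

def tuned_config(best, n):
    """Run time: reuse the config of the closest installed size."""
    nearest = min(best, key=lambda s: abs(s - n))
    return best[nearest]
```

Keeping the installation set small bounds the one-off installation cost, at the price of the deviations from the optimum seen in the table above.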

Autotuning parallel codes: Modeling of linear algebra routines
Automatic selection of the preferred routine (zsysv/zgesv) helps to reduce the execution time of the VIE code. The execution time is modeled, for problem size n and h threads, as

  T(n, h) = k_1 n^3/h + k_2 n^2 h + k_3 n^2 + k_4 n^2/h + k_5 n h + k_6 n    (1)

Model and experiments are in good agreement. The model is used to automatically select the best routine and the number of MKL threads.
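Since (1) is linear in the coefficients, k_1..k_6 can be fitted by least squares from a few timed runs and the fitted model evaluated to pick the thread count with the lowest predicted time. A minimal sketch under those assumptions (the fitting procedure is not described in the slides):

```python
import numpy as np

def features(n, h):
    # The six terms of model (1): T(n,h) = k · features(n,h)
    return np.array([n**3 / h, n**2 * h, n**2, n**2 / h, n * h, n])

def fit(samples):
    """samples: iterable of (n, h, measured_time); returns k_1..k_6."""
    X = np.array([features(n, h) for n, h, _ in samples])
    t = np.array([t for _, _, t in samples])
    k, *_ = np.linalg.lstsq(X, t, rcond=None)
    return k

def best_threads(k, n, h_candidates):
    """Pick the thread count with the lowest predicted time."""
    return min(h_candidates, key=lambda h: features(n, h) @ k)
```

One model per routine (zsysv, zgesv) then lets the code compare predicted times and select both the routine and h automatically.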

Conclusions
- The combination of several parallelism paradigms allows the efficient solution of electromagnetic problems on today's computational systems, which are hybrid, heterogeneous and hierarchical.
- The calculation of Green's functions inside waveguides has been adapted for heterogeneous clusters with CPUs and GPUs of different speeds.
- Large systems arising in the VIE formulation were solved combining OpenMP and MKL parallelism.
- Further improvements are expected if large systems are solved combining MPI+OpenMP+GPU with out-of-core techniques and efficient use of multithreaded and multi-GPU linear algebra routines.
- Autotuning techniques can be incorporated so that non-experts in parallelism can use the routines efficiently on complex computational systems.