On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations


On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations
7th European Conference on Antennas and Propagation (EuCAP 2013)
Swedish Exhibition & Congress Centre, Gothenburg, Sweden, 8-12 April 2013
Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI de Telecomunicación / Facultad de Informática
Dpto. de Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications
April 2013

Outline
1 Introduction and motivation
2 Computation of Green's functions on hybrid systems
3 Parallelization in CC-NUMA at MoM level of a VIE technique
4 Autotuning parallel codes
5 Conclusions


Introduction and motivation

Motivation of the work
1 High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2 Need for efficient software tools that allow the optimization of complex devices in real time.
3 The complexity of the devices increases the computational time as the cube of the problem size.

Identification of bottlenecks
Two important elements in integral equation formulations:
1 The calculation of Green's functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2 In Volume Integral Equation formulations, the size of the MoM matrices grows rapidly with the mesh, so the solution time increases as N^3.

Introduction and motivation

Objectives of the work
1 Increase efficiency using parallel computing.
2 Apply several hybrid-heterogeneous parallelism strategies in this context.

Strategies explored
1 At a low level, hybrid parallelism (MPI+OpenMP+CUDA) for the computation of Green's functions in rectangular waveguides.
2 At a higher level, two-level parallelism (OpenMP plus multithreaded MKL routines) in cc-NUMA systems, applied to accelerate MoM solutions of a VIE formulation.
3 Possibilities of using autotuning strategies.

Computation of Green's functions on hybrid systems

Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green's functions.

For each MPI process P_k, 0 <= k < p:

    omp_set_num_threads(h + g)
    for i = k*m/p to (k+1)*m/p - 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for

As seen, p MPI processes are started, and h + g threads run inside each process. Threads 0 to h-1 work on the CPU (OpenMP), while the remaining threads, h to h+g-1, work on the GPU by calling CUDA kernels. A C sketch of this pattern is given below.
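A minimal C sketch of this work-splitting pattern, under the stated assumptions (h CPU threads and g GPU-feeding threads per process); compute_on_cpu and launch_cuda_kernel are hypothetical stand-ins for the Green's function kernels:

    #include <omp.h>

    /* Hypothetical work items for the m evaluation chunks. */
    void compute_on_cpu(int i);      /* CPU summation of the series */
    void launch_cuda_kernel(int i);  /* launches the CUDA kernel for chunk i */

    /* Work executed by MPI process P_k out of p processes. */
    void process_block(int k, int p, int m, int h, int g)
    {
        int first = k * m / p;           /* local index range [first, last) */
        int last  = (k + 1) * m / p;

        omp_set_num_threads(h + g);
        #pragma omp parallel
        {
            int node = omp_get_thread_num();
            #pragma omp for              /* split local range over h+g threads */
            for (int i = first; i < last; i++) {
                if (node < h)
                    compute_on_cpu(i);       /* threads 0..h-1: CPU work */
                else
                    launch_cuda_kernel(i);   /* threads h..h+g-1: feed the GPU */
            }
        }
    }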

Computation of Green's functions on hybrid systems

Routines developed

    p \ h+g |  1  |    h    |    g     |     h+g
    --------+-----+---------+----------+--------------
       1    | SEQ |   OMP   |   CUDA   |   OMP+CUDA
       p    | MPI | MPI+OMP | MPI+CUDA | MPI+OMP+CUDA

Computational systems tested
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB of memory and a 1.15 GHz clock.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) machines at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio) of memory, plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each; these machines are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB of memory, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.

Computation of Green's functions on hybrid systems

Comparison between use of CPU and use of GPU
Test of computational speed when CPUs or GPUs are used. The CPU version uses a number of threads equal to the number of cores.
Results are presented as a function of the problem size (#images, #points), with S = T(#threads = #cores) / T(#kernels = 3). S > 1 means the GPU is preferred over the CPU.

Computation of Green's functions on hybrid systems

Comparison between GPU and optimum parameters
Selecting the optimum values of p, h and g produces lower execution times than blind GPU use.
Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3) / T(lowest). S > 1 means the GPU alone is slower than the lowest time. A speed-up of two is obtained for large problems using the optimum parameters.

Computation of Green's functions on hybrid systems

Comparison between homogeneous and heterogeneous clusters
Combining nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time, using different values of p, h and g for the different nodes.
Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3·#nodes) / T(lowest). An important reduction of the execution time is obtained with the heterogeneous cluster, with execution times closer to the lowest experimental time.

Parallelization in CC-NUMA at MoM level of a VIE technique

Set-up of the problem
Free-space Green's functions are used, so parallelism is applied at the MoM level.
A cc-NUMA system is used, with shared memory but a deep memory hierarchy, which makes it difficult to achieve the maximum theoretical speed-up.
Two-level parallelism is used: OpenMP, plus the implicit multithreaded parallelism of MKL.

Systems used in the experiments
Saturno, with 24 cores in four hexa-core processors.
Ben and Bscsmp, an HP Integrity Superdome SX2000 and an SGI Altix 4700, with 128 cores of Intel Itanium-2 dual-core Montvale and Montecito processors (1.6 GHz, with 18 MB and 8 MB of L3 cache) and 1.5 TB of shared memory.

Parallelization in CC-NUMA at MoM level of a VIE technique

Parallelism of MKL for the inversion of large matrices
zsysv/dsysv: used for complex/real symmetric matrices.
zgesv/dgesv: used for general complex/real matrices; slower, but exhibit better scalability.
For two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores. A sketch of the two solver calls is given below.
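A minimal C sketch of the two MKL solver calls, assuming the LAPACKE interface distributed with MKL; the thread count nt and the choice of a single right-hand side are illustrative:

    #include <mkl.h>

    /* Solve A x = b (n unknowns, b overwritten by x) with nt MKL threads.
       sym != 0 selects zsysv (symmetric A); otherwise zgesv (general A). */
    lapack_int solve_mom_system(MKL_Complex16 *A, MKL_Complex16 *b,
                                lapack_int n, int nt, int sym)
    {
        lapack_int *ipiv = (lapack_int *) mkl_malloc(n * sizeof(lapack_int), 64);
        mkl_set_num_threads(nt);   /* MKL-level (inner) parallelism */
        lapack_int info = sym
            ? LAPACKE_zsysv(LAPACK_ROW_MAJOR, 'U', n, 1, A, n, ipiv, b, 1)
            : LAPACKE_zgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
        mkl_free(ipiv);
        return info;               /* 0 on success */
    }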

Parallelization in cc-NUMA at MoM level of a VIE technique

Test in a large cc-NUMA system (Bscsmp machine)
In Bscsmp, for two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores.
In large systems it is preferable not to use all the available cores.
An autotuning engine is necessary to select the optimum number of threads at each parallelism level.

Parallelization in cc-NUMA at MoM level of a VIE technique

Two-level parallelism code
Multithreaded linear algebra routines have been combined with OpenMP in a two-level parallelism code:

    omp_set_nested(1)
    mkl_set_dynamic(0)
    omp_set_num_threads(ntomp)
    mkl_set_num_threads(ntmkl)
    for i = 0 to num_freq do
        fillmatrix(i, init_freq, step)
    end for
    #pragma omp parallel for private(i)
    for i = 0 to num_freq do
        solvesystem(i)
    end for
    for i = 0 to num_freq do
        circuitalparameters(i, init_freq, step)
    end for

A number of frequencies is assigned to a group of ntomp OpenMP threads; inside each thread, the linear system is solved with an MKL routine using ntmkl threads. A C sketch of this scheme follows.
The maximum speed-up in Ben is 35 using 64 cores, superior to the speed-up of 6 obtained with multithreaded MKL alone.
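A minimal C rendering of the nested scheme, under the stated assumptions; fill_matrix, solve_system and circuital_parameters are hypothetical stand-ins for the corresponding VIE routines:

    #include <omp.h>
    #include <mkl.h>

    /* Hypothetical application routines provided elsewhere by the VIE code. */
    void fill_matrix(int i, double init_freq, double step);
    void solve_system(int i);            /* calls zsysv/zgesv internally */
    void circuital_parameters(int i, double init_freq, double step);

    void frequency_sweep(int num_freq, double init_freq, double step,
                         int ntomp, int ntmkl)
    {
        omp_set_nested(1);           /* let OpenMP threads spawn MKL teams */
        mkl_set_dynamic(0);          /* force MKL to honor the requested count */
        mkl_set_num_threads(ntmkl);  /* inner level: MKL threads per solve */

        for (int i = 0; i < num_freq; i++)
            fill_matrix(i, init_freq, step);       /* MoM matrix per frequency */

        /* Outer level: ntomp OpenMP threads, one frequency per iteration. */
        #pragma omp parallel for num_threads(ntomp)
        for (int i = 0; i < num_freq; i++)
            solve_system(i);

        for (int i = 0; i < num_freq; i++)
            circuital_parameters(i, init_freq, step);
    }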

Parallelization in cc-NUMA at MoM level of a VIE technique

Final test with hybrid OpenMP and MKL parallelism
Computation of 128 frequencies with 64 cores in Ben, with zsysv or zgesv, for three mesh sizes.
[Table: execution times for each ntomp-ntmkl combination and mesh complexity (Simple, Medium, Complex), for both zsysv and zgesv; the numeric values were not recovered in the transcription.]
The lowest times are obtained with zsysv, due to the low number of threads in MKL.
The use of nested parallelism is clearly better in some cases, especially when using zgesv.

Autotuning parallel codes

Autotuning strategies
The high complexity of today's hybrid, heterogeneous and hierarchical parallel systems makes it difficult to estimate the optimum parameters leading to the lowest execution times.
The solution is to develop codes with auto-tuning engines, which try to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.

Types of autotuning techniques
Empirical autotuning.
Modeling of the execution time.

Autotuning parallel codes

Empirical autotuning
Some test executions are run during the initial installation phase of the routine (the installation set, which is kept small). This information is then used at run time, when a particular problem is being solved (the validation set). A sketch of the installation phase is given below.
[Table: waveguide Green's functions for different problem sizes (images, number of points), comparing execution times obtained with the autotuning technique against the optimum parameters (lowest); the deviations are 35.96%, 0%, 3.02% and 10.52%.]
The autotuning routine performs well for all problem sizes investigated.
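A minimal C sketch of the installation phase, assuming a hypothetical timing wrapper run_green_function(images, points, p, h, g) around the waveguide Green's function routine; every candidate (p, h, g) configuration is timed on one problem of the installation set and the fastest one is recorded:

    /* Hypothetical wrapper: runs the routine once and returns seconds. */
    double run_green_function(int images, int points, int p, int h, int g);

    typedef struct { int p, h, g; } Config;

    /* Installation phase: exhaustive timing for one installation-set size. */
    Config tune_for_size(int images, int points,
                         const Config *cands, int ncands)
    {
        Config best = cands[0];
        double best_t = run_green_function(images, points,
                                           cands[0].p, cands[0].h, cands[0].g);
        for (int i = 1; i < ncands; i++) {
            double t = run_green_function(images, points,
                                          cands[i].p, cands[i].h, cands[i].g);
            if (t < best_t) { best_t = t; best = cands[i]; }
        }
        return best;  /* stored, and reused at run time for nearby sizes */
    }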

Autotuning parallel codes

Modeling of linear algebra routines
Automatic selection of the preferred routine (zsysv/zgesv) helps to reduce the execution time of the VIE code. The execution time is modeled as a function of the matrix size n and the number of threads h:

    T(n, h) = k_1 n^3/h + k_2 n^2 h + k_3 n^2 + k_4 n^2/h + k_5 n h + k_6 n    (1)

The model and the experiments are in good agreement. The model is used to automatically select the best routine and the number of MKL threads, as sketched below.
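A minimal C sketch of how such a model could be used at run time; the coefficients k[0..5] are assumed to have been fitted during installation (no fitted values are reproduced here):

    /* Execution-time model T(n, h) of Eq. (1); k[] are fitted coefficients. */
    double model_time(double n, double h, const double k[6])
    {
        return k[0]*n*n*n/h + k[1]*n*n*h + k[2]*n*n
             + k[3]*n*n/h   + k[4]*n*h   + k[5]*n;
    }

    /* Pick the thread count in 1..max_threads minimizing the modeled time. */
    int best_thread_count(double n, int max_threads, const double k[6])
    {
        int best_h = 1;
        double best_t = model_time(n, 1.0, k);
        for (int h = 2; h <= max_threads; h++) {
            double t = model_time(n, (double) h, k);
            if (t < best_t) { best_t = t; best_h = h; }
        }
        return best_h;
    }

Evaluating both routines' fitted coefficient sets and picking the smaller modeled time gives the automatic zsysv/zgesv selection described above.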

Conclusions
The combination of several parallelism paradigms allows the efficient solution of electromagnetic problems on today's computational systems, which are hybrid, heterogeneous and hierarchical.
The calculation of Green's functions inside waveguides has been adapted to heterogeneous clusters with CPUs and GPUs of different speeds.
The large systems arising in the VIE formulation were solved by combining OpenMP and MKL parallelism.
Further improvements are expected if large systems are solved combining MPI+OpenMP+GPU with out-of-core techniques and efficient use of multithreaded and multi-GPU linear algebra routines.
Auto-tuning techniques can be incorporated so that non-experts in parallelism can use the routines efficiently on complex computational systems.
