On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations
7th European Conference on Antennas and Propagation (EuCAP 2013), Swedish Exhibition & Congress Centre, Gothenburg, Sweden, 8-12 April 2013
Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications, April 2013
Álvarez/Giménez et al. (UPCT/UMU), EuCAP 2013, 8-12 April 2013
Outline
1 Introduction and motivation
2 Computation of Green's functions on hybrid systems
3 Parallelization in CC-NUMA at MoM level of a VIE technique
4 Autotuning parallel codes
5 Conclusions
Introduction and motivation

Motivation of the work
1 High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2 Need for efficient software tools that allow the optimization of complex devices in real time.
3 The complexity of the devices makes the computational time grow as the cube of the problem size.

Identification of bottlenecks
Two elements of integral equation formulations are important:
1 The calculation of Green's functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2 In Volume Integral Equation formulations, the cost of handling the MoM matrices increases as N^3.
Introduction and motivation

Objectives of the work
1 Increase efficiency using parallel computing.
2 The application of several hybrid, heterogeneous parallelism strategies is proposed in this context.

Strategies explored
1 At a low level, hybrid parallelism (MPI+OpenMP+CUDA) is applied to the computation of Green's functions in rectangular waveguides.
2 At a higher level, two-level parallelism (OpenMP plus MKL multithreaded routines) is used on cc-NUMA systems to accelerate MoM solutions of the VIE formulation.
3 Possibilities to use autotuning strategies are studied.
Computation of Green's functions on hybrid systems

Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green's functions.

For each MPI process P_k, 0 <= k < p:
  omp_set_num_threads(h + g)
  for i = k*m/p to (k+1)*m/p - 1 do
    node = omp_get_thread_num()
    if node < h then
      compute with an OpenMP thread
    else
      call a CUDA kernel
    end if
  end for

As seen, p MPI processes are started, and h + g threads run inside each process. Threads 0 to h-1 work on the CPU (OpenMP), while the remaining threads, h to h+g-1, work on the GPU by calling CUDA kernels.
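The dispatch scheme above can be sketched in runnable form. The following Python sketch is illustrative only (the compute and kernel calls are stand-ins, and the names p, h, g and m follow the slide's notation): it shows how the m work items are statically partitioned among the p processes, and how each of the h + g workers inside a process is routed to the CPU or the GPU.

```python
# Illustrative sketch, not the authors' code: static partition of m work
# items among p MPI-like processes, plus the CPU/GPU dispatch rule used
# inside each process (workers 0..h-1 -> CPU, workers h..h+g-1 -> GPU).

def partition(m, p, k):
    """Iteration range [lo, hi) handled by process k of p
    (slide: i = k*m/p .. (k+1)*m/p - 1)."""
    return k * m // p, (k + 1) * m // p

def dispatch(worker_id, h):
    """Workers with index below h compute on the CPU (OpenMP);
    the remaining workers launch CUDA kernels on the GPU."""
    return "CPU" if worker_id < h else "GPU"

if __name__ == "__main__":
    m, p, h, g = 12, 2, 2, 1   # 12 items, 2 processes, 2 CPU + 1 GPU worker each
    for k in range(p):
        lo, hi = partition(m, p, k)
        print(f"process {k}: items {lo}..{hi - 1}")
    print([dispatch(w, h) for w in range(h + g)])
```

With these example values, each process handles six consecutive items, and inside each process two workers compute on the CPU while one forwards its items to the GPU.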
Computation of Green's functions on hybrid systems

Routines developed
p \ threads    1      h          g           h + g
1              SEQ    OMP        CUDA        OMP+CUDA
p              MPI    MPI+OMP    MPI+CUDA    MPI+OMP+CUDA

Computational systems tested
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB of memory and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) systems at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio) of memory, plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each; these machines are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB of memory, with an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.
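The table of routines developed can be read as a simple rule: the parallel variant follows from whether several MPI processes (p > 1), several OpenMP threads (h > 1) or any GPU threads (g > 0) are requested. The following helper is hypothetical, not part of the authors' code, and only encodes that rule:

```python
# Hypothetical helper encoding the "routines developed" table:
# each enabled level of parallelism contributes one component to the name.

def routine_name(p, h, g):
    """Name of the routine variant for p processes, h CPU threads, g GPU threads."""
    parts = []
    if p > 1:
        parts.append("MPI")    # several MPI processes
    if h > 1:
        parts.append("OMP")    # several OpenMP CPU threads per process
    if g > 0:
        parts.append("CUDA")   # GPU threads calling CUDA kernels
    return "+".join(parts) if parts else "SEQ"
```

For example, routine_name(2, 4, 2) returns "MPI+OMP+CUDA", the bottom-right entry of the table, and routine_name(1, 1, 0) returns "SEQ".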
Computation of Green's functions on hybrid systems

Comparison between use of CPU versus use of GPU
Computational speed is tested when CPUs or GPUs are used; the CPU version uses a number of threads equal to the number of cores. Results are presented as a function of the problem size (#images, #points), with S = T(#threads = #cores) / T(#kernels = 3). S > 1 means the GPU is preferred over the CPU.
Computation of Green's functions on hybrid systems

Comparison between GPU and optimum parameters
The selection of the optimum values of p, h and g produces lower execution times than blind GPU use. Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3) / T(lowest). S > 1 means the GPU is worse than the lowest time. A speed-up of two is obtained for large problems using the optimum parameters.
Computation of Green's functions on hybrid systems

Comparison of homogeneous and heterogeneous clusters
The combination of nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time, using different values of p, h and g for each node. Results are presented as a function of the problem size (#images, #points), with S = T(#kernels = 3·#nodes) / T(lowest). An important reduction of the execution time is obtained with the heterogeneous cluster, whose execution time is closer to the lowest experimental one.
Parallelization in cc-NUMA at MoM level of a VIE technique

Set-up of the problem
Free-space Green's functions are used, so parallelism is applied at the MoM level. A cc-NUMA system is used, with shared memory but a large memory hierarchy, which makes it difficult to achieve the maximum theoretical speed-up. Two-level parallelism is used: OpenMP, plus the implicit multithreaded parallelism of MKL.

Systems used in experiments
Saturno, with 24 cores in four hexa-core processors. Ben and Bscsmp, an HP Integrity Superdome SX2000 and an SGI Altix 4700, with 128 cores of Intel Itanium-2 dual-core Montvale and Montecito processors (1.6 GHz, with 18 MB and 8 MB of L3 cache) and 1.5 TB of shared memory.
Parallelization in cc-NUMA at MoM level of a VIE technique

Parallelism of MKL for inversion of large matrices
zsysv/dsysv: used for complex/real symmetric matrices.
zgesv/dgesv: used for general complex/real matrices; slower, but exhibits better scalability.
For two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores.
Parallelization in cc-NUMA at MoM level of a VIE technique

Test in a large cc-NUMA system (Bscsmp machine)
In Bscsmp, for two different problem sizes: zsysv/dsysv saturate faster, while zgesv/dgesv achieve lower times for larger numbers of cores. In large systems it is preferable not to use all the available cores, so some autotuning engine is necessary to select the optimum number of threads at each parallelism level.
Parallelization in cc-NUMA at MoM level of a VIE technique

Two-level parallelism code
Multithreaded linear algebra routines have been combined with OpenMP in a two-level parallelism code:

omp_set_nested(1)
mkl_set_dynamic(0)
omp_set_num_threads(ntomp)
mkl_set_num_threads(ntmkl)
for i = 0 to num_freq do
  fillmatrix(i, init_freq, step)
end for
#pragma omp parallel for private(i)
for i = 0 to num_freq do
  solvesystem(i)
end for
for i = 0 to num_freq do
  circuitalparameters(i, init_freq, step)
end for

A number of frequencies is assigned to a group of ntomp OpenMP threads. Inside each thread, the linear system is solved with an MKL routine using ntmkl threads. The maximum speed-up in Ben is 35 using 64 cores, superior to the speed-up of 6 obtained with multithreaded MKL alone.
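The structure of this two-level scheme can be mimicked in Python. The sketch below is only an analogy with invented names and random stand-in matrices, not the authors' code: the outer OpenMP level is played by a thread pool of ntomp workers, one frequency per task, while the inner MKL level (ntmkl threads per solve) has no direct NumPy equivalent and is only noted in a comment.

```python
# Two-level analogy: outer pool over frequencies (role of the OpenMP level),
# one dense solve per frequency (role of the MKL level; in the real code each
# solve would itself run with ntmkl MKL threads).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def solve_one_frequency(seed, n=50):
    """Build and solve the dense system for one frequency.
    A random diagonally dominant matrix stands in for the MoM matrix."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned stand-in
    b = rng.standard_normal(n)
    return np.linalg.solve(A, b)

def sweep(num_freq, ntomp):
    """Solve num_freq independent frequencies with ntomp outer workers."""
    with ThreadPoolExecutor(max_workers=ntomp) as pool:
        return list(pool.map(solve_one_frequency, range(num_freq)))

if __name__ == "__main__":
    xs = sweep(num_freq=8, ntomp=4)
    print(len(xs), xs[0].shape)
```

Because the frequencies are independent, the outer level scales until memory bandwidth saturates, which is exactly why the balance between ntomp and ntmkl matters on a cc-NUMA machine.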
Parallelization in cc-NUMA at MoM level of a VIE technique

Final test with hybrid OpenMP and MKL parallelism
Computation of 128 frequencies with 64 cores in Ben, with zsysv or zgesv, for three mesh sizes (simple, medium and complex) and different ntomp-ntmkl combinations. [Table of execution times not recoverable from the transcription.] The lowest times are obtained with zsysv, due to the low number of threads in MKL. The use of nested parallelism is definitely better in some cases, especially when using zgesv.
Autotuning parallel codes

Autotuning strategies
The high complexity of today's hybrid, heterogeneous and hierarchical parallel systems makes it difficult to estimate the optimum parameters leading to the lowest execution times. The solution is to develop codes with auto-tuning engines, which try to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.

Types of autotuning techniques
Empirical autotuning.
Modeling of the execution time.
Autotuning parallel codes

Empirical autotuning
Some test executions are run during the initial installation phase of the routine (the installation set, which is kept small). This information is then used at run time, when a particular problem is being solved (the validation set). For the waveguide Green's functions, different problem sizes (images, number of points) were tested, comparing the execution times obtained with the autotuning technique against those with the optimum parameters (lowest); the measured deviations were 35.96%, 0%, 3.02% and 10.52%. The autotuning routine performs well for all problem sizes investigated.
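A minimal sketch of the empirical autotuning idea, with hypothetical installation data (the sizes and parameter triples below are invented, not the measured ones): the parameters recorded at installation time for the nearest problem size in the installation set are reused at run time.

```python
# Hypothetical installation set: (images, points) -> best (p, h, g) measured
# during the installation phase. Values are invented for illustration.
INSTALLED = {
    (100, 1000): (1, 4, 0),
    (500, 5000): (2, 4, 2),
    (1000, 10000): (2, 6, 3),
}

def choose_parameters(images, points):
    """At run time, reuse the parameters of the closest installed problem size."""
    key = min(INSTALLED,
              key=lambda s: (s[0] - images) ** 2 + (s[1] - points) ** 2)
    return INSTALLED[key]
```

Keeping the installation set small bounds the installation cost, at the price of the deviations from the lowest times reported on the slide.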
Autotuning parallel codes

Modeling of linear algebra routines
Automatic selection of the preferred routine (zsysv/zgesv) helps to reduce the execution time of the VIE code. The execution time is modeled as

T(n, h) = k1 n^3/h + k2 n^2 h + k3 n^2 + k4 n^2/h + k5 n h + k6 n    (1)

where n is the problem size and h the number of threads. Model and experiments are in good agreement, and the model is used to automatically select the best routine and the number of MKL threads.
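The use of model (1) can be sketched as follows, with synthetic data (the k values below are invented, not the fitted ones): the six coefficients are obtained by linear least squares on measured (n, h, time) samples, and the fitted model then selects the thread count with the lowest predicted time.

```python
import numpy as np

def features(n, h):
    # The six terms of the timing model T(n, h) in Eq. (1)
    return [n**3 / h, n**2 * h, n**2, n**2 / h, n * h, n]

def fit(samples):
    """Least-squares fit of k1..k6 from a list of (n, h, T) measurements."""
    X = np.array([features(n, h) for n, h, _ in samples])
    y = np.array([t for _, _, t in samples])
    k, *_ = np.linalg.lstsq(X, y, rcond=None)
    return k

def best_h(k, n, candidates):
    """Thread count with the lowest predicted execution time for size n."""
    return min(candidates, key=lambda h: float(np.dot(k, features(n, h))))

if __name__ == "__main__":
    true_k = [2e-9, 1e-9, 5e-8, 1e-8, 1e-6, 1e-5]   # invented coefficients
    samples = [(n, h, float(np.dot(true_k, features(n, h))))
               for n in (500, 1000, 2000) for h in (1, 2, 4, 8)]
    k = fit(samples)
    print(best_h(k, 1500, range(1, 17)))
```

The same fitted model, evaluated for each candidate routine, also supports the automatic zsysv/zgesv selection described on the slide.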
Conclusions
The combination of several parallelism paradigms allows the efficient solution of electromagnetic problems on today's computational systems, which are hybrid, heterogeneous and hierarchical.
The calculation of Green's functions inside waveguides has been adapted to heterogeneous clusters with CPUs and GPUs of different speeds.
The large systems arising in the VIE formulation were solved by combining OpenMP and MKL parallelism.
Further improvements are expected if large systems are solved combining MPI+OpenMP+GPU with out-of-core techniques and efficient use of multithreaded and multi-GPU linear algebra routines.
Auto-tuning techniques can be incorporated so that non-experts in parallelism can use the routines efficiently on complex computational systems.
More informationGeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
More informationParallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
More informationWorkshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012
Scientific Application Performance on HPC, Private and Public Cloud Resources: A Case Study Using Climate, Cardiac Model Codes and the NPB Benchmark Suite Peter Strazdins (Research School of Computer Science),
More informationUsing the Intel Xeon Phi (with the Stampede Supercomputer) ISC 13 Tutorial
Using the Intel Xeon Phi (with the Stampede Supercomputer) ISC 13 Tutorial Bill Barth, Kent Milfeld, Dan Stanzione Tommy Minyard Texas Advanced Computing Center Jim Jeffers, Intel June 2013, Leipzig, Germany
More informationConcurrent Solutions to Linear Systems using Hybrid CPU/GPU Nodes
Concurrent Solutions to Linear Systems using Hybrid CPU/GPU Nodes Oluwapelumi Adenikinju1, Julian Gilyard2, Joshua Massey1, Thomas Stitt 1 Department of Computer Science and Electrical Engineering, UMBC
More informationPart I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
More information5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model
5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model C99, C++, F2003 Compilers Optimizing Vectorizing Parallelizing Graphical parallel tools PGDBG debugger PGPROF profiler Intel, AMD, NVIDIA
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationComparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationParallel Processing and Software Performance. Lukáš Marek
Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel
More informationModern Platform for Parallel Algorithms Testing: Java on Intel Xeon Phi
I.J. Information Technology and Computer Science, 2015, 09, 8-14 Published Online August 2015 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2015.09.02 Modern Platform for Parallel Algorithms
More informationPerformance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems
Performance Study of Parallel Programming Paradigms on a Multicore Clusters using Ant Colony Optimization for Job-flow scheduling problems Nagaveni V # Dr. G T Raju * # Department of Computer Science and
More informationOpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
More informationThe Spanish Parallel Programming Contest and its use as an educational resource
The Spanish Parallel Programming Contest and its use as an educational resource Francisco Almeida Departamento de Estadística, Investigación Operativa y Computación, University of La Laguna Javier Cuenca,
More informationKeys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
More informationImproving System Scalability of OpenMP Applications Using Large Page Support
Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline
More informationGraphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
More informationIntroduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
More informationDense Linear Algebra Solvers for Multicore with GPU Accelerators
Dense Linear Algebra Solvers for Multicore with GPU Accelerators Stanimire Tomov, Rajib Nath, Hatem Ltaief, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee,
More informationGPU Accelerated Monte Carlo Simulations and Time Series Analysis
GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationSymmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
More informationGPU Usage. Requirements
GPU Usage Use the GPU Usage tool in the Performance and Diagnostics Hub to better understand the high-level hardware utilization of your Direct3D app. You can use it to determine whether the performance
More informationThree Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
More informationBuilding an energy dashboard. Energy measurement and visualization in current HPC systems
Building an energy dashboard Energy measurement and visualization in current HPC systems Thomas Geenen 1/58 thomas.geenen@surfsara.nl SURFsara The Dutch national HPC center 2H 2014 > 1PFlop GPGPU accelerators
More informationAccelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
More informationAccelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More information2: Computer Performance
2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12
More informationDavid Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
More informationChapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
More informationIntroduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
More informationGPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer
GU Computing 1 2 3 The GU Advantage To ExaScale and Beyond The GU is the Computer The GU Advantage The GU Advantage A Tale of Two Machines Tianhe-1A at NSC Tianjin Tianhe-1A at NSC Tianjin The World s
More informationAeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications
AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications Dr. Bjoern Landmann Dr. Kerstin Wieczorek Stefan Bachschuster 18.03.2015 FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching
More informationIntroduction to Linux and Cluster Basics for the CCR General Computing Cluster
Introduction to Linux and Cluster Basics for the CCR General Computing Cluster Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY 14203 Phone: 716-881-8959
More informationExperiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
More information