2IP WP8 Material Science Activity Report, March 6, 2013

2 Codes involved in this task
- ABINIT (M. Torrent)
- Quantum ESPRESSO (F. Affinito)
- YAMBO + Octopus (F. Nogueira)
- SIESTA (G. Huhs)
- EXCITING/ELK (A. Kozhevnikov)

3 ABINIT: PRACE 2IP-WP8
Developer groups:
- CEA (TGCC, Paris): Marc Torrent, Florent Dahm
- CEA (INAC, Grenoble): Luigi Genovese, Brice Videau
- UCL (Louvain-la-Neuve): Xavier Gonze, Matteo Giantomassi
- BSC (Barcelona): Georg Huhs
PRACE 2IP-WP8 task list:
- Ground state + plane waves
  - Introduce shared-memory parallelism
  - Improve load balancing
  - Automate processor topology and use of libraries
- Ground state + wavelets
  - Implement a complete solution for automatic code generation for the convolutions
- Excited states
  - Implement a hybrid MPI-OpenMP approach
  - Use ScaLAPACK for the inversion of the dielectric matrix
  - Implement a new MPI distribution for the orbitals
  - Use MPI-IO routines to read the wave functions
- Response function
  - Remove bottlenecks, distribute wave functions
  - Parallelize several sections that currently run sequentially
  - Parallelize the outer loop over perturbations
Everything is done and committed in the trunk of the development version (v7.3.0).

4 ABINIT: Hybrid MPI-OpenMP version, Fourier transforms
[Figure: BEFORE and AFTER panels; MKL native implementation. TGCC-CURIE, Intel Sandy Bridge]

5 ABINIT: Hybrid MPI-OpenMP version, non-local operator (Hamiltonian)
# threads   Speedup
1           1.00
4           3.85
8           6.81
- Specific kernels (re-)written
- Up to 8 threads: 85% efficiency
- More than 8 threads: thread synchronization issues
- Under development: OpenACC version
Test case: 107 gold atoms, 128 MPI processes (TGCC-CURIE, Intel Sandy Bridge)
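The 85% figure follows directly from the speedup table: parallel efficiency is the measured speedup divided by the number of threads. A minimal sketch of that arithmetic, using only the three data points reported on the slide (the function and variable names are illustrative):

```python
# Thread counts and measured speedups as reported on the slide.
measured_speedup = {1: 1.00, 4: 3.85, 8: 6.81}


def parallel_efficiency(speedup: float, n_threads: int) -> float:
    """Efficiency = speedup / number of threads (1.0 means ideal scaling)."""
    return speedup / n_threads


efficiency = {n: parallel_efficiency(s, n) for n, s in measured_speedup.items()}

for n_threads, value in efficiency.items():
    print(f"{n_threads} threads: efficiency {value:.3f}")
```

With 8 threads this gives 6.81 / 8, which is about 0.85, matching the efficiency quoted above.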

6 ABINIT: Ground-state + plane-wave section, load balancing
- Load balancing on bands: improved
- Load balancing on plane waves: needs a small amount of communication
- Waiting time has decreased
[Figure: three panels: before, with load balancing on bands, with load balancing on plane waves. Test case: 107 gold atoms; TGCC-CURIE, Intel Sandy Bridge]
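The slide does not say how the band distribution was rebalanced. As a hedged illustration of the general idea only, the sketch below assigns contiguous band ranges so that the shares of any two processes differ by at most one band. The helper names `split_bands` and `imbalance` and the numbers used are hypothetical, not taken from ABINIT.

```python
from typing import List


def split_bands(n_bands: int, n_procs: int) -> List[range]:
    """Assign contiguous band ranges to processes; sizes differ by at most one."""
    base, extra = divmod(n_bands, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional band.
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def imbalance(ranges: List[range]) -> float:
    """Largest share divided by the average share (1.0 is perfectly balanced)."""
    sizes = [len(r) for r in ranges]
    return max(sizes) / (sum(sizes) / len(sizes))


# Illustrative numbers only (not taken from the slides).
shares = split_bands(n_bands=10, n_procs=4)
print([len(r) for r in shares], imbalance(shares))
```

The imbalance ratio is a rough proxy for waiting time: the process with the largest share finishes last, and the others idle until it does.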

7 ABINIT: Response function section
New parallelization level over perturbations:
- Mostly re-uses work done for the image parallelization (NEB, path-integral MD)
- Divide-and-conquer scheme prepared
Improvement of wave-function initialization:
- Orthogonalization suppressed
- New MPI-IO routines
[Table: "# MPI proc." and "Speedup vs ..." columns; values not recovered]
Test case: BaTiO3, 29 atoms, 16 irreducible perturbations, 16 k-points, 120 bands (TGCC-CURIE, Intel Sandy Bridge)
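An extra parallelization level over perturbations can be pictured as splitting the MPI processes into groups, each group working through its own subset of perturbations. The sketch below is an assumption-laden illustration in plain Python (no MPI calls): the round-robin assignment, the function name `assign_perturbations`, and the process counts are hypothetical; only the 16 perturbations come from the test case above.

```python
from typing import Dict, List


def assign_perturbations(
    n_perturbations: int, n_procs: int, procs_per_group: int
) -> Dict[int, Dict[str, List[int]]]:
    """Split ranks into equal groups and deal perturbations out round-robin."""
    if n_procs % procs_per_group != 0:
        raise ValueError("n_procs must be a multiple of procs_per_group")
    n_groups = n_procs // procs_per_group
    plan = {}
    for group in range(n_groups):
        first_rank = group * procs_per_group
        plan[group] = {
            # Ranks in a group share the work on one perturbation at a time.
            "ranks": list(range(first_rank, first_rank + procs_per_group)),
            # Group g handles perturbations g, g + n_groups, g + 2 * n_groups, ...
            "perturbations": list(range(group, n_perturbations, n_groups)),
        }
    return plan


# 16 perturbations as in the test case; 64 processes in groups of 16 is hypothetical.
plan = assign_perturbations(n_perturbations=16, n_procs=64, procs_per_group=16)
for group, work in plan.items():
    print(group, work["perturbations"])
```

Round-robin assumes every perturbation costs about the same. If costs differ, a cost-aware assignment would be needed.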

8 ABINIT: Response function section
[Figure: BEFORE and AFTER panels]
Remaining functions under study

9 ABINIT: Automatic distribution of processes among parallelization levels
Adapt the process topology to the problem size and the architecture:
- First level: a simple heuristic used to predict the scaling factor
- Second level: micro-benchmarks used to choose libraries (CUDA, ScaLAPACK, ...) and adjust the process topology
[Figure: real speedup vs. predicted speedup. Test case: 107 gold atoms; TGCC-CURIE, Intel Sandy Bridge]
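The slides do not give the heuristic itself. As a rough sketch of what a first-level predictor could look like, the code below enumerates every way of factoring the process count into three levels (named here after the ABINIT input variables `npkpt`, `npband`, `npfft`) and ranks them with an assumed model: Amdahl's law per level, multiplied across levels. The serial fractions, the k-point count, and all function names are invented for illustration and are not ABINIT's actual model.

```python
from typing import Dict, Iterator, Tuple

Topology = Tuple[int, int, int]  # (npkpt, npband, npfft)

# Assumed serial fractions per parallelization level (illustrative values only).
SERIAL_FRACTION: Dict[str, float] = {"kpt": 0.0, "band": 0.02, "fft": 0.10}


def amdahl(p: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup on p processes for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def topologies(n_procs: int, n_kpt: int) -> Iterator[Topology]:
    """Yield every factorization of n_procs whose k-point level divides n_kpt."""
    for npkpt in range(1, n_procs + 1):
        if n_procs % npkpt or n_kpt % npkpt:
            continue
        rest = n_procs // npkpt
        for npband in range(1, rest + 1):
            if rest % npband == 0:
                yield npkpt, npband, rest // npband


def predicted_speedup(topology: Topology) -> float:
    """Crude assumption: the levels scale independently, so speedups multiply."""
    npkpt, npband, npfft = topology
    return (
        amdahl(npkpt, SERIAL_FRACTION["kpt"])
        * amdahl(npband, SERIAL_FRACTION["band"])
        * amdahl(npfft, SERIAL_FRACTION["fft"])
    )


def choose_topology(n_procs: int, n_kpt: int) -> Topology:
    """Pick the factorization with the highest predicted speedup."""
    return max(topologies(n_procs, n_kpt), key=predicted_speedup)


# 128 processes as in the hybrid test case; 4 k-points is a hypothetical value.
best = choose_topology(n_procs=128, n_kpt=4)
print(best, round(predicted_speedup(best), 1))
```

A second level, as described on the slide, would then run short micro-benchmarks on the top few candidates instead of trusting the model alone.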

10 ABINIT-BigDFT: Automatic code generator for convolutions (wavelet basis)
BOAST (Bringing Optimization through Automatic Source-to-Source Transformations): a new parametrized generator that can optimize the BigDFT convolutions, which compilers optimize and vectorize poorly
- Simple version: 0.3 GFLOPS
- Hand-vectorized version: 6.5 GFLOPS
- Used to generate multiple versions of the reference convolution
- Architecture-dependent optimization parameters: optimal unroll degree, resource usage pattern, ...
- Ported to Tibidabo (PRACE prototype at BSC)
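To make the idea of a parametrized generator concrete, here is a toy sketch that emits several source versions of a 1D convolution, one per unroll degree, and checks each against a reference. It does not use BOAST or its API; the function names are hypothetical, and Python is used only to show the source-to-source principle.

```python
from typing import Callable, List


def reference_convolution(signal: List[float], kernel: List[float]) -> List[float]:
    """Plain 1D convolution over the valid range (no padding)."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(kernel[k] * signal[i + k] for k in range(len(kernel)))
        for i in range(n_out)
    ]


def generate_convolution(kernel_size: int, unroll: int) -> str:
    """Return source code for a convolution whose output loop is unrolled."""
    lines = [
        "def conv(signal, kernel):",
        f"    n_out = max(len(signal) - {kernel_size} + 1, 0)",
        "    out = [0.0] * n_out",
        f"    main = n_out - n_out % {unroll}",
        f"    for i in range(0, main, {unroll}):",
    ]
    for u in range(unroll):
        terms = " + ".join(
            f"kernel[{k}] * signal[i + {u + k}]" for k in range(kernel_size)
        )
        lines.append(f"        out[i + {u}] = {terms}")
    # Remainder loop for outputs not covered by the unrolled body.
    terms = " + ".join(f"kernel[{k}] * signal[i + {k}]" for k in range(kernel_size))
    lines.append("    for i in range(main, n_out):")
    lines.append(f"        out[i] = {terms}")
    lines.append("    return out")
    return "\n".join(lines)


def build(kernel_size: int, unroll: int) -> Callable[[List[float], List[float]], List[float]]:
    """Compile one generated variant into a callable."""
    namespace: dict = {}
    exec(generate_convolution(kernel_size, unroll), namespace)
    return namespace["conv"]


signal = [float(i * i % 7) for i in range(23)]
kernel = [0.25, 0.5, 0.25]
expected = reference_convolution(signal, kernel)
for unroll in (1, 2, 4):
    assert build(len(kernel), unroll)(signal, kernel) == expected
```

An autotuner would time each generated variant on the target machine and keep the fastest, which is how an architecture-dependent parameter such as the unroll degree gets chosen.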

11 Quantum ESPRESSO
- Parallelization on bands (PW, CP, GIPAW, PHonon)
- OpenMP parallelization
- EPW parallelization
- ELPA implementation
- PP testing (G2 test set)
- Improvement of portability (MIC)
Groups involved:
- CINECA (Italy)
- ICHEC (Ireland)
- University of Sofia (Bulgaria)
- IPB (Serbia)

12 Quantum ESPRESSO - ELPA
Implemented 1-stage ELPA for symmetric matrices (gamma point)

13 Quantum ESPRESSO - MIC/GPUs
- Several modules have already been ported to GPUs with phiGEMM (outside PRACE activity)
- CP and PW have now been ported in native mode (no offload) to KNC

14 Quantum ESPRESSO - PHonon
- Introducing a new level of MPI parallelism into the PHonon code
- Will allow PHonon to scale on petascale machines
- Parallelising over bands: similar implementation in PWscf and GIPAW
- PHonon is more complicated than GIPAW: more dependencies
- So far, implemented for the gamma and non-gamma parts of the code
- Currently: (debugging) testing and benchmarking
- Will rely on more input data sets from the community

15 Quantum ESPRESSO - EPW

16 Quantum ESPRESSO - benchmarking
Benchmarking and testing on the G2 test set. Checking the accuracy of DFT calculations using B3LYP functionals.
Differences in the B3LYP total electronic energies (with QE) between single-point calculations with the MP2-optimized geometry and structures optimized with B3LYP (QE) for selected molecules.
[Table: Molecule vs. Difference (eV) for neutrals and cations. Rows as extracted, with some subscripts and all values lost: H2O, CH, NH, PH, SiH, H2S, CO, C2H, C2H, C2H, Average]

17 Yambo
Work has been shared on two tasks:
- University of Coimbra
- CINECA
The University of Coimbra was not able to fulfill its commitment because of administrative issues. CINECA took part with the involvement of the Italian community of developers.

18 Yambo
The code needed a deep refactoring to be able to run on Tier-0 architectures:
- Multilevel parallelization
- OpenMP parallelization
- Distribution of data structures
Before this work, scaling was limited to several hundred cores (mainly due to memory limitations).

19 Yambo

22 Siesta
Work done:
- MAGMA solver implemented; range of applications is limited
- Sakurai-Sugiura algorithm: prototype implemented and tested
- FEAST library tested
- New method: PEXSI
The Sakurai-Sugiura and FEAST methods show a load balancing problem, so these algorithms were dropped.
- Prototype phase finished
- Implementation into Siesta started

23 PEXSI
- 2 levels of parallelization:
  - Independent nodes (good number: 80)
  - Per node, e.g. 16, 64, 256, ... processes
  => Thousands of cores still efficient
- Favorable computational complexity, without additional simplifications:
  - O(n^2) for 3D systems
  - O(n^(3/2)) for quasi-2D systems
  - O(n) for 1D systems
- Targets are huge systems (tens of thousands of atoms)
- Cooperation with a group working on layered systems of this size
[Diagram labels: Siesta, PEXSI]
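The two points on this slide can be checked with simple arithmetic: the core count is the product of the two parallelization levels, and the complexity exponents determine how the cost grows with system size. The sketch below uses only numbers from the slide, plus one labeled assumption: conventional dense diagonalization is taken as the O(n^3) baseline for comparison. Function names are illustrative.

```python
# Exponents of the cost scaling O(n^exponent) listed on the slide.
PEXSI_EXPONENT = {"3D": 2.0, "quasi-2D": 1.5, "1D": 1.0}

# Assumption for comparison: dense diagonalization scales as O(n^3).
CUBIC_EXPONENT = 3.0


def cost_growth(size_factor: float, exponent: float) -> float:
    """Factor by which the cost grows when the system size grows by size_factor."""
    return size_factor ** exponent


def total_cores(independent_units: int, processes_per_unit: int) -> int:
    """Two-level parallelization: independent units times processes per unit."""
    return independent_units * processes_per_unit


# 80 independent units with 16, 64 or 256 processes each, as on the slide.
core_counts = [total_cores(80, p) for p in (16, 64, 256)]
print(core_counts)

# Cost growth when the system becomes 8 times larger.
for dimensionality, exponent in PEXSI_EXPONENT.items():
    print(dimensionality, cost_growth(8, exponent))
print("dense diagonalization", cost_growth(8, CUBIC_EXPONENT))
```

For a system 8 times larger, the quasi-2D exponent gives roughly a 23-fold cost increase, against 512-fold for a cubic method.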

24 Siesta: PEXSI results
PEXSI outperforms ScaLAPACK when applied to ADNA-10, but not for a small C-BN-C (layered system) example.
When increasing the problem size by stacking unit cells:
- Effort grows with O(n^(3/2)), then even linearly (see table, times in seconds)
=> Example with 8 unit cells, meaning more than [number not recovered] atoms, becomes solvable

25 Conclusions so far:
- Work has been successfully accomplished for all the codes involved in this task
- Meetings during the work package have been useful to compare and validate the results obtained and to share ideas and perspectives
- We all agree that, rather than working towards a common code, we can look forward to common rules that facilitate exchange between different codes

26 Validation strategy
- We need to show that the results obtained are relevant to the communities
- We need feedback on the interplay between communities and PRACE computing centers
- We want to highlight the importance of initiatives like WP8, where communities work together with scientists to produce improvements to the most relevant codes
We proposed to introduce into the final deliverable a short section (1-2 pages) written by one representative per code from the scientific communities. In this section the work done in WP8 will be assessed by stressing the importance of the results obtained for the community.

27 Work to do:
- Complete the documentation on the HPC-Forge wiki
- Identify people from the communities to write the deliverable
- Complete the WP8 work (synchronization of the repos, documentation, benchmarking, reintegration) where needed


More information

HPC enabling of OpenFOAM R for CFD applications

HPC enabling of OpenFOAM R for CFD applications HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Integrated Communication Systems

Integrated Communication Systems Integrated Communication Systems Courses, Research, and Thesis Topics Prof. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de

More information

Performance Improvement of Application on the K computer

Performance Improvement of Application on the K computer Performance Improvement of Application on the K computer November 13, 2011 Kazuo Minami Team Leader, Application Development Team Research and Development Group Next-Generation Supercomputer R & D Center

More information

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist

The Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist The Top Six Advantages of CUDA-Ready Clusters Ian Lumb Bright Evangelist GTC Express Webinar January 21, 2015 We scientists are time-constrained, said Dr. Yamanaka. Our priority is our research, not managing

More information

Introduction to Linux and Cluster Basics for the CCR General Computing Cluster

Introduction to Linux and Cluster Basics for the CCR General Computing Cluster Introduction to Linux and Cluster Basics for the CCR General Computing Cluster Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY 14203 Phone: 716-881-8959

More information

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Search Strategies for Automatic Performance Analysis Tools

Search Strategies for Automatic Performance Analysis Tools Search Strategies for Automatic Performance Analysis Tools Michael Gerndt and Edmond Kereku Technische Universität München, Fakultät für Informatik I10, Boltzmannstr.3, 85748 Garching, Germany gerndt@in.tum.de

More information

Rule-Based Program Transformation for Hybrid Architectures CSW Workshop Towards Portable Libraries for Hybrid Systems

Rule-Based Program Transformation for Hybrid Architectures CSW Workshop Towards Portable Libraries for Hybrid Systems Rule-Based Program Transformation for Hybrid Architectures CSW Workshop Towards Portable Libraries for Hybrid Systems M. Carro 1,2, S. Tamarit 2, G. Vigueras 1, J. Mariño 2 1 IMDEA Software Institute,

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Relations with ISV and Open Source. Stephane Requena GENCI Stephane.requena@genci.fr

Relations with ISV and Open Source. Stephane Requena GENCI Stephane.requena@genci.fr Relations with ISV and Open Source Stephane Requena GENCI Stephane.requena@genci.fr Agenda of this session 09:15 09:30 Prof. Hrvoje Jasak: Director, Wikki Ltd. «HPC Deployment of OpenFOAM in an Industrial

More information

Performance analysis with Periscope

Performance analysis with Periscope Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München September 2010 Outline Motivation Periscope architecture Periscope performance analysis

More information

IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13

IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13 Center for Information Services and High Performance Computing (ZIH) IUmd Performance Analysis of a Molecular Dynamics Code Thomas William Dresden, 5/27/13 Overview IUmd Introduction First Look with Vampir

More information

A Multi-layered Domain-specific Language for Stencil Computations

A Multi-layered Domain-specific Language for Stencil Computations A Multi-layered Domain-specific Language for Stencil Computations Christian Schmitt, Frank Hannig, Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Workshop ExaStencils 2014,

More information

Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

More information

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner

Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner Research Group Scientific Computing Faculty of Computer Science University of Vienna AUSTRIA http://www.par.univie.ac.at

More information

Mathematical Libraries on JUQUEEN. JSC Training Course

Mathematical Libraries on JUQUEEN. JSC Training Course Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries on JUQUEEN JSC Training Course May 10, 2012 Outline General Informations Sequential Libraries, planned Parallel Libraries and Application Systems:

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

PeerMon: A Peer-to-Peer Network Monitoring System

PeerMon: A Peer-to-Peer Network Monitoring System PeerMon: A Peer-to-Peer Network Monitoring System Tia Newhall, Janis Libeks, Ross Greenwood, Jeff Knerr Computer Science Department Swarthmore College Swarthmore, PA USA newhall@cs.swarthmore.edu Target:

More information

PRACE: access to Tier-0 systems and enabling the access to ExaScale systems Dr. Sergi Girona Managing Director and Chair of the PRACE Board of

PRACE: access to Tier-0 systems and enabling the access to ExaScale systems Dr. Sergi Girona Managing Director and Chair of the PRACE Board of PRACE: access to Tier-0 systems and enabling the access to ExaScale systems Dr. Sergi Girona Managing Director and Chair of the PRACE Board of Directors PRACE aisbl, a persistent pan-european supercomputing

More information

64-Bit versus 32-Bit CPUs in Scientific Computing

64-Bit versus 32-Bit CPUs in Scientific Computing 64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples

More information

Pedraforca: ARM + GPU prototype

Pedraforca: ARM + GPU prototype www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of

More information

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Jonatan Ward Sergey Andreev Francisco Heredia Bogdan Lazar Zlatka Manevska Eindhoven University of Technology,

More information