Case Study on Productivity and Performance of GPGPUs

Size: px
Start display at page:

Download "Case Study on Productivity and Performance of GPGPUs"

Transcription

1 Case Study on Productivity and Performance of GPGPUs Sandra Wienke ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ)

2 RWTH GPU-Cluster 56 Nvidia Quadro 6000 (Fermi) 2 GPUs per host Host: 12-core Westmere-CPU High utilization of resources Foto: C. Iwainsky Daytime VR: new CAVE (48 GPUs) HPC: interactive software development (8 GPUs) 2 Nighttime HPC: processing of GPGPU compute jobs (54-56 GPUs) CAVE, VR, RWTH Aachen, since 2004

3 Agenda Introduction Real-world Applications Performance Productivity Conclusion & Outlook 3

4 Introduction Today s GPUs usable for scientific applications Double precision computations ECC Performance Speedup over serial/parallel CPU version (including data transfers) (Performance per Watt) Programmability, productivity Modified lines of code Manpower Ratio of development effort to performance 4

5 Introduction Investigation of 2 real-world software packages using different programming models OpenMP (C/Fortran) By OpenMP ARB: industry standard, shared-memory programming, CPUs CUDA (C/C++) By NVIDIA: GPU programming model, NVIDIA GPUs OpenCL (C) By Khronos Group: open standard, heterogeneous programming, CPU/GPU/ PGI Accelerator Model (C/Fortran) By PGI: directive-based GPU programming model, NVIDIA GPUs OpenACC (C/Fortran) By PGI, Cray, CAPS, NVIDIA: directive-based accelerator programming model, 5 industry standard published in Nov. 2011, NVIDIA GPUs (currently)

6 Real-World Applications: KegelSpan 3D simulation of bevel gear cutting process 1 Compute key values (i.a. chip thickness) to analyze tool load and tool wear Fortran code (chip thickness computation) Loop nest Dependencies in inner loop (minimum computation) Implementation Source: BMW, ZF, Klingelnberg simple: outer loop in parallel on threads, inner loop serially (CPU,GPU) + optimized data access pattern + reduction of data transfers (GPU) vec: simple + code restructuring for auto-vectorization (CPU) smem: simple + storing input/intermediate data in shared memory (GPU) 6 1 C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume of VDI-Berichte, pages , D usseldorf, VDI Verlag.

7 Real-World Applications: NINA Software 2 for the solution of Neuromagnetic INverse large-scale problems Matlab code (main program), C code (objective function, 1 st - and 2 nd -order derivatives) Source 2 Matrix-vector multiplications, vector operations Implementation simple: outer loop in parallel on threads, inner loop serially (CPU) blocked: simple + blocked matrix-vector multiplication (CPU,GPU) vec: blocked + code restructuring for auto-vectorization (CPU) l2p: blocked + two level of parallelism (GPU) advanced: blocked + async data transfer, async kernel execution, pinned memory, spcification of constant values as predprocessed macros (GPU) Loop unrolling important (pragma vs. manual unrolling) 7 2 M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6): , 2008.

8 Performance Setup OpenMP, Serial GPU Host Compiler - Intel Westmere EP 12-core processor Scientific Linux 6.1 Intel CUDA NVIDIA Tesla C2050 Intel Westmere GCC OpenCL ECC on 4-core processor Intel PGI Accelerator CUDA Toolkit 4.0 Scientific Linux 6.1 PGI experimental system from Cray 2 comprises early implementation of OpenACC Some results were removed as they have not been published yet. 8

9 Speedup Performance KegelSpan Single Precision Double Precision 1.0

10 Productivity Kegelspan Added + modified lines of code (host/kernel) CUDA (smem) OpenCL (smem) Original serial version: PGI Acc (simple) ~150 kernel code lines OpenMP (simple) 152/58 183/58 84/14 -/4 =210 =241 =98 =4 Manpower estimated 1st time effort* estimated 2nd time effort** CUDA OpenCL PGI Acc OpenMP 30 days 40 days 25 days 5 days 4.5 days 5 days 1.5 days 0.5 days ** Effort to understand the architecture + programming model, to develop (+code reorganization), to debug ** Effort to just develop the application (assuming knowledge of architecture + programming model) 10

11 Conclusion KegelSpan No real surprises CUDA implementation 2x faster than highly-vectorized OpenMP code Vectorization for DP? Both: lot of code restructuring (high effort) Memory bound? CUDA speedup over simple OpenMP version: 6.3x (SP) and 3.5x (DP) PGI Accelerator: good ratio of effort to performance, especially in DP 11

12 Conclusion Programming model matters 12 Many assumptions hold Low-level GPU programming models (CUDA, OpenCL) Good performance Most effort Directive-based GPU programming model (PGI Acc or OpenACC) Often good ratio of effort to performance (still potential) Essential for further growth and acceptance of accelerators Important step towards (OpenMP-) standardization (OpenMP for Accelerators) (Auto-) Vectorization on CPUs Gets more important in future (e.g. AVX on Sandy Bridge) Performance benefit for double precision floating point operations uncertain Increasing development effort, but better understanding of architecture

13 Outlook Advance of programming models OpenMP for Accelerators Better compiler for auto-vectorization? Advance of computer architectures NVIDIA Kepler Intel MIC Aim: Comprehensive TCO calculation Manpower Performance (runtime) Power consumption Thank you for your attention! 13

RWTH GPU Cluster. Sandra Wienke wienke@rz.rwth-aachen.de November 2012. Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky

RWTH GPU Cluster. Sandra Wienke wienke@rz.rwth-aachen.de November 2012. Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky RWTH GPU Cluster Fotos: Christian Iwainsky Sandra Wienke wienke@rz.rwth-aachen.de November 2012 Rechen- und Kommunikationszentrum (RZ) The RWTH GPU Cluster GPU Cluster: 57 Nvidia Quadro 6000 (Fermi) innovative

More information

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

OpenACC Basics Directive-based GPGPU Programming

OpenACC Basics Directive-based GPGPU Programming OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. wienke@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Introduction to OpenACC Directives. Duncan Poole, NVIDIA Thomas Bradley, NVIDIA

Introduction to OpenACC Directives. Duncan Poole, NVIDIA Thomas Bradley, NVIDIA Introduction to OpenACC Directives Duncan Poole, NVIDIA Thomas Bradley, NVIDIA GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

OpenACC Programming on GPUs

OpenACC Programming on GPUs OpenACC Programming on GPUs Directive-based GPGPU Programming Sandra Wienke, M.Sc. wienke@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

GPU Tools Sandra Wienke

GPU Tools Sandra Wienke Sandra Wienke Center for Computing and Communication, RWTH Aachen University MATSE HPC Battle 2012/13 Rechen- und Kommunikationszentrum (RZ) Agenda IDE Eclipse Debugging (CUDA) TotalView Profiling (CUDA

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks OpenACC Parallelization and Optimization of NAS Parallel Benchmarks Presented by Rengan Xu GTC 2014, S4340 03/26/2014 Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman HPC Tools

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model

5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model 5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model C99, C++, F2003 Compilers Optimizing Vectorizing Parallelizing Graphical parallel tools PGDBG debugger PGPROF profiler Intel, AMD, NVIDIA

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

ASC Workshop Catalogue Brochure CSIRO ASC Version 1.0 August 2, 2013

ASC Workshop Catalogue Brochure CSIRO ASC Version 1.0 August 2, 2013 INFORMATION MANAGEMENT AND TECHNOLOGY www.csiro.au ASC Workshop Catalogue Brochure CSIRO ASC Version 1.0 August 2, 2013 Commercial In Confidence CSIRO Advanced Scientific Computing GPO Box 1289, Melbourne,

More information

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158

More information

Introduction to Hybrid Programming

Introduction to Hybrid Programming Introduction to Hybrid Programming Hristo Iliev Rechen- und Kommunikationszentrum aixcelerate 2012 / Aachen 10. Oktober 2012 Version: 1.1 Rechen- und Kommunikationszentrum (RZ) Motivation for hybrid programming

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Introduction to the CUDA Toolkit for Building Applications. Adam DeConinck HPC Systems Engineer, NVIDIA

Introduction to the CUDA Toolkit for Building Applications. Adam DeConinck HPC Systems Engineer, NVIDIA Introduction to the CUDA Toolkit for Building Applications Adam DeConinck HPC Systems Engineer, NVIDIA ! What this talk will cover: The CUDA 5 Toolkit as a toolchain for HPC applications, focused on the

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

CARMA CUDA on ARM Architecture. Developing Accelerated Applications on ARM

CARMA CUDA on ARM Architecture. Developing Accelerated Applications on ARM CARMA CUDA on ARM Architecture Developing Accelerated Applications on ARM CARMA is an architectural prototype for high performance, energy efficient hybrid computing Schedule Motivation System Overview

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

A quick tutorial on Intel's Xeon Phi Coprocessor

A quick tutorial on Intel's Xeon Phi Coprocessor A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be damien.francois@uclouvain.be Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed

More information

HPC Wales Skills Academy Course Catalogue 2015

HPC Wales Skills Academy Course Catalogue 2015 HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses

More information

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, and Barbara Chapman Department of Computer Science, University

More information

ST810 Advanced Computing

ST810 Advanced Computing ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview

More information

Accelerating CFD using OpenFOAM with GPUs

Accelerating CFD using OpenFOAM with GPUs Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide

More information

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics

More information

The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming

The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming Kamran Karimi Neak Solutions Calgary, Alberta, Canada kamran@neak-solutions.com Abstract OpenCL, along with CUDA, is one of

More information

Performance Characteristics of Large SMP Machines

Performance Characteristics of Large SMP Machines Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark

More information

OpenCL for programming shared memory multicore CPUs

OpenCL for programming shared memory multicore CPUs Akhtar Ali, Usman Dastgeer and Christoph Kessler. OpenCL on shared memory multicore CPUs. Proc. MULTIPROG-212 Workshop at HiPEAC-212, Paris, Jan. 212. OpenCL for programming shared memory multicore CPUs

More information

Accelerating CST MWS Performance with GPU and MPI Computing. CST workshop series

Accelerating CST MWS Performance with GPU and MPI Computing.  CST workshop series Accelerating CST MWS Performance with GPU and MPI Computing www.cst.com CST workshop series 2010 1 Hardware Based Acceleration Techniques - Overview - Multithreading GPU Computing Distributed Computing

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Analysis of GPU Parallel Computing based on Matlab

Analysis of GPU Parallel Computing based on Matlab Analysis of GPU Parallel Computing based on Matlab Mingzhe Wang, Bo Wang, Qiu He, Xiuxiu Liu, Kunshuai Zhu (School of Computer and Control Engineering, University of Chinese Academy of Sciences, Huairou,

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13

IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13 Center for Information Services and High Performance Computing (ZIH) IUmd Performance Analysis of a Molecular Dynamics Code Thomas William Dresden, 5/27/13 Overview IUmd Introduction First Look with Vampir

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Introduction to GPU Computing

Introduction to GPU Computing Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture

More information

Programming the Intel Xeon Phi Coprocessor

Programming the Intel Xeon Phi Coprocessor Programming the Intel Xeon Phi Coprocessor Tim Cramer cramer@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Motivation Many Integrated Core (MIC) Architecture Programming Models Native

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Kalray MPPA Massively Parallel Processing Array

Kalray MPPA Massively Parallel Processing Array Kalray MPPA Massively Parallel Processing Array Next-Generation Accelerated Computing February 2015 2015 Kalray, Inc. All Rights Reserved February 2015 1 Accelerated Computing 2015 Kalray, Inc. All Rights

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

HPC usage in geo-resources exploration and modeling

HPC usage in geo-resources exploration and modeling HPC usage in geo-resources exploration and modeling Oscar Peredo Associate Researcher ALGES Lab, Advanced Mining Technology Center, U. de Chile BioMed-HPC 2014, January, 2014 - Santiago, Chile Outline

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

GPUs: Doing More Than Just Games. Mark Gahagan CSE 141 November 29, 2012

GPUs: Doing More Than Just Games. Mark Gahagan CSE 141 November 29, 2012 GPUs: Doing More Than Just Games Mark Gahagan CSE 141 November 29, 2012 Outline Introduction: Why multicore at all? Background: What is a GPU? Quick Look: Warps and Threads (SIMD) NVIDIA Tesla: The First

More information

Building Blocks. CPUs, Memory and Accelerators

Building Blocks. CPUs, Memory and Accelerators Building Blocks CPUs, Memory and Accelerators Outline Computer layout CPU and Memory What does performance depend on? Limits to performance Silicon-level parallelism Single Instruction Multiple Data (SIMD/Vector)

More information

Part I Courses Syllabus

Part I Courses Syllabus Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment

More information

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Official Use Only Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Christian Trott Unclassified, Unlimited release Sandia National Laboratories is a multi-program

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Project INF BigData. Figure 1: Plot of the learned function from the checker board data set.

Project INF BigData. Figure 1: Plot of the learned function from the checker board data set. Project INF BigData Roberto Fontanarosa, Tobias Rupp, and Steffen Hirschmann Figure 1: Plot of the learned function from the checker board data set. Abstract Prediction and forecasting has become very

More information

Resource Scheduling Best Practice in Hybrid Clusters

Resource Scheduling Best Practice in Hybrid Clusters Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING GPU COMPUTING VISUALISATION XENON Accelerating Exploration Mineral, oil and gas exploration is an expensive and challenging

More information

Overview of HPC systems and software available within

Overview of HPC systems and software available within Overview of HPC systems and software available within Overview Available HPC Systems Ba Cy-Tera Available Visualization Facilities Software Environments HPC System at Bibliotheca Alexandrina SUN cluster

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

#OpenPOWERSummit. Join the conversation at #OpenPOWERSummit 1

#OpenPOWERSummit. Join the conversation at #OpenPOWERSummit 1 XLC/C++ and GPU Programming on Power Systems Kelvin Li, Kit Barton, John Keenleyside IBM {kli, kbarton, keenley}@ca.ibm.com John Ashley NVIDIA jashley@nvidia.com #OpenPOWERSummit Join the conversation

More information

Experiments in Unstructured Mesh Finite Element CFD Using CUDA

Experiments in Unstructured Mesh Finite Element CFD Using CUDA Experiments in Unstructured Mesh Finite Element CFD Using CUDA Graham Markall Software Performance Imperial College London http://www.doc.ic.ac.uk/~grm08 grm08@doc.ic.ac.uk Joint work with David Ham and

More information

A PORTABLE BENCHMARK SUITE FOR HIGHLY PARALLEL DATA INTENSIVE QUERY PROCESSING

A PORTABLE BENCHMARK SUITE FOR HIGHLY PARALLEL DATA INTENSIVE QUERY PROCESSING A PORTABLE BENCHMARK SUITE FOR HIGHLY PARALLEL DATA INTENSIVE QUERY PROCESSING Ifrah Saeed, Sudhakar Yalamanchili, School of ECE Jeff Young, School of Computer Science February 8, 2015 The Need for Accelerated

More information

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

GPU Acceleration of the SENSEI CFD Code Suite

GPU Acceleration of the SENSEI CFD Code Suite GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science)

More information

Multicore Parallel Computing with OpenMP

Multicore Parallel Computing with OpenMP Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

Performance Portability Study of Linear Algebra Kernels in OpenCL

Performance Portability Study of Linear Algebra Kernels in OpenCL Performance Portability Study of Linear Algebra Kernels in OpenCL Karl Rupp 1,2, Philippe Tillet 1, Florian Rudolf 1, Josef Weinbub 1, Ansgar Jüngel 2, Tibor Grasser 1 rupp@iue.tuwien.ac.at @karlrupp 1

More information

Shattering the 1U Server Performance Record. Figure 1: Supermicro Product and Market Opportunity Growth

Shattering the 1U Server Performance Record. Figure 1: Supermicro Product and Market Opportunity Growth Shattering the 1U Server Performance Record Supermicro and NVIDIA recently announced a new class of servers that combines massively parallel GPUs with multi-core CPUs in a single server system. This unique

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Graphic Processing Units: a possible answer to High Performance Computing?

Graphic Processing Units: a possible answer to High Performance Computing? 4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development

Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Official Use Only Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development Christian Trott Unclassified, Unlimited release SAND2014-19833 C Sandia National Laboratories

More information

Debugging with TotalView

Debugging with TotalView Tim Cramer cramer@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again

More information

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics

More information

A Case Study - Scaling Legacy Code on Next Generation Platforms

A Case Study - Scaling Legacy Code on Next Generation Platforms Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 00 (2015) 000 000 www.elsevier.com/locate/procedia 24th International Meshing Roundtable (IMR24) A Case Study - Scaling Legacy

More information

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,

More information

Introduction to HPC Workshop. Center for e-research (eresearch@nesi.org.nz)

Introduction to HPC Workshop. Center for e-research (eresearch@nesi.org.nz) Center for e-research (eresearch@nesi.org.nz) Outline 1 About Us About CER and NeSI The CS Team Our Facilities 2 Key Concepts What is a Cluster Parallel Programming Shared Memory Distributed Memory 3 Using

More information

COSCO 2015 Heterogeneous Computing Programming

COSCO 2015 Heterogeneous Computing Programming COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology

More information

GPU-based Decompression for Medical Imaging Applications

GPU-based Decompression for Medical Imaging Applications GPU-based Decompression for Medical Imaging Applications Al Wegener, CTO Samplify Systems 160 Saratoga Ave. Suite 150 Santa Clara, CA 95051 sales@samplify.com (888) LESS-BITS +1 (408) 249-1500 1 Outline

More information

How to program efficient optimization algorithms on Graphics Processing Units - The Vehicle Routing Problem as a case study

How to program efficient optimization algorithms on Graphics Processing Units - The Vehicle Routing Problem as a case study How to program efficient optimization algorithms on Graphics Processing Units - The Vehicle Routing Problem as a case study Geir Hasle, Christian Schulz Department of, SINTEF ICT, Oslo, Norway Seminar

More information

Best Practice mini-guide accelerated clusters

Best Practice mini-guide accelerated clusters Using General Purpose GPUs Alan Gray, EPCC Anders Sjöström, LUNARC Nevena Ilieva-Litova, NCSA Partial content by CINECA: http://www.hpc.cineca.it/content/gpgpu-general-purpose-graphics-processing-unit

More information

HP Workstations graphics card options

HP Workstations graphics card options Family data sheet HP Workstations graphics card options Quick reference guide Leading-edge professional graphics February 2013 A full range of graphics cards to meet your performance needs compare features

More information