Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka


1 Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka. René Widera (1), Erik Zenker (1,2), Guido Juckeland (1), Benjamin Worpitz (1,2), Axel Huebl (1,2), Andreas Knüpfer (2), Wolfgang E. Nagel (2), Michael Bussmann (1). (1) Helmholtz-Zentrum Dresden-Rossendorf, (2) Technische Universität Dresden 1

2 PIConGPU: Electron Acceleration with Lasers, Ion Acceleration with Lasers, Plasma Instabilities, Compact X-Ray Sources, Tumor Therapy, Astrophysics 2

3 Domain Decomposition Field and Particle Domain 3

4 Domain Decomposition: Field and Particle Domain. Moving particles create fields, particles change cells, and fields act back on the particles. 4
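
To make the field-particle cycle from this slide concrete, here is a minimal C++-style sketch of one particle-in-cell time step; the function names are illustrative placeholders, not PIConGPU's actual routines.

    // Hedged sketch of one PIC time step (illustrative names, not PIConGPU's API)
    for (int step = 0; step < numSteps; ++step)
    {
        depositCurrent(particles, currentField);     // moving particles create fields
        solveMaxwell(currentField, fieldE, fieldB);  // update E and B from the deposited current
        pushParticles(fieldE, fieldB, particles);    // fields act back on the particles
        sortParticlesIntoCells(particles);           // particles change cells
    }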

5-9 Creating Vectorized Data Structures for Particles and Fields: in the field domain, cells are chunked into supercells and stored line-wise aligned; in the particle domain, the particles of each supercell are stored in fixed-size frames, each frame being a struct of aligned arrays (a sketch of such a frame follows below).
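
A minimal sketch of what such a fixed-size frame might look like as a struct of aligned arrays; the frame size and member names are illustrative assumptions, not the actual PIConGPU definitions.

    // Illustrative frame layout: one frame holds a fixed number of particles of one
    // supercell, stored as a struct of aligned arrays so that accesses vectorize.
    constexpr int frameSize = 256;  // assumed number of particle slots per frame

    struct alignas(64) ParticleFrame
    {
        float positionX[frameSize];
        float positionY[frameSize];
        float positionZ[frameSize];
        float momentumX[frameSize];
        float momentumY[frameSize];
        float momentumZ[frameSize];
        float weighting[frameSize];
        ParticleFrame* nextFrame;   // the frames of one supercell are chained together
    };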

10-13 Algorithm Driven Cache Strategy: the cells of a supercell (Cell 1 ... Cell 4) are staged from global memory into shared memory, so caching follows the access pattern of the algorithm.

14 High Utilization of Threads: one thread block works on a supercell, with one thread per cell (Thread 1 ... Thread 4 for Cell 1 ... Cell 4) operating on the shared-memory copy of the data staged from global memory; a CUDA-style sketch follows below.
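
A hedged CUDA-style sketch of the strategy on slides 10-14: one thread block per supercell, one thread per cell, with the field data of the supercell staged in shared memory before the particles are pushed. This is illustrative pseudocode, not the actual PIConGPU kernel.

    // One thread block per supercell, one thread per cell (illustrative only)
    __global__ void pushSupercell(float const * fieldGlobal, int cellsPerSupercell)
    {
        extern __shared__ float fieldShared[];

        int const cell   = threadIdx.x;                      // this thread's cell
        int const offset = blockIdx.x * cellsPerSupercell;   // start of this supercell

        fieldShared[cell] = fieldGlobal[offset + cell];      // stage global -> shared
        __syncthreads();

        // ... interpolate fieldShared[] to the particles of this cell and push them ...
    }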

15 Task-Parallel Execution of Kernels Asynchronous Communication 15
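
A minimal sketch of the pattern behind this slide, assuming hypothetical helper routines: kernels for the border of the local domain run first so that the halo exchange can proceed asynchronously while the bulk is computed.

    // Illustrative overlap of computation and communication (placeholder calls)
    computeBorder(streamBorder, fields, particles);  // update the domain border first
    startHaloExchange(streamBorder, fields);         // asynchronous guard-cell exchange (e.g. MPI)
    computeBulk(streamBulk, fields, particles);      // bulk kernel runs while messages are in flight
    waitForHaloExchange();                           // synchronize before the next time step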

16-19 PIConGPU Scales up to 16,384 GPUs: strong scaling (speedup versus the ideal curve over the number of GPUs, for several base configurations) and weak scaling (efficiency in % versus the number of GPUs, PIConGPU versus ideal) both stay close to ideal; the efficiency remains above 95% and the runs reach 6.9 PFlop/s (single precision).

20-24 More Physics, More Computations, More Power! Atom-physical effects couple the atomic states s1, s2, ...: the new atomic state vector is obtained from the old one through a transition matrix, s_new = T * s_old with T = (t_i,j) and state vector entries s_1,1, s_1,2, s_1,3, ..., s_n,m. This is a really big data task: random access on large amounts of data (> 100 GB), a good job for powerful CPUs, and it requires efficient CPU/GPU cooperation; a small worked sketch of the state update follows below.
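
As a worked illustration of the state update above, a plain C++ dense matrix-vector product; the dimensions and names are placeholders only.

    #include <cstddef>
    #include <vector>

    // s_new = T * s_old for one atomic state vector (illustrative sketch)
    void updateAtomicState(std::vector<double> const & T,     // n*n transition matrix, row-major
                           std::vector<double> const & sOld,  // old state occupations
                           std::vector<double> & sNew)        // new state occupations
    {
        std::size_t const n = sOld.size();
        for (std::size_t i = 0; i < n; ++i)
        {
            double sum = 0.0;
            for (std::size_t j = 0; j < n; ++j)
                sum += T[i * n + j] * sOld[j];
            sNew[i] = sum;
        }
    }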

25 Small Open Source Communities need Maintainable Codes. Heterogeneity: porting implies minimal code changes. Testability: validate once, get correct results everywhere. Sustainability: write once, execute everywhere. Optimizability: tune for good performance at minimum coding effort. Openness: open source and open standards. Single Source. 25

26 Alpaka 26

27 Good News: there are Alpakas on the Compute Meadow. Existing parallelism models (threads, fibers, ...) and the next parallelism model are covered by a single zero-overhead interface; kernels are single-source C++11; the memory model is data-agnostic. 27

28-31 Abstract Hierarchical Redundant Parallelism Model: work is organized in the hierarchy Grid > Block > Thread > Element; each level can execute its children in parallel or sequentially and offers synchronization among them, and the element level is an explicit sequential layer.

32-34 Data Structure Agnostic Memory Model: data moves between host memory and device global memory via explicit deep copies; on the device, the grid level works on global memory, the block level on shared memory, and the thread level on register memory.

35-39 Map the Abstraction Model to your Desired Acceleration Back-End: on a CPU back-end the grid maps to the whole CPU with RAM as global memory, a block maps to a package with the L3 cache as shared memory, a thread maps to a core with its L1/L2 caches and registers as register memory, and the elements map to the AVX vector lanes. Explicit mapping of parallelization levels to hardware.

40 Map the Abstraction Model to your Desired Acceleration Back-End: specific unsupported levels of the model can be ignored, and the abstract interface allows the set of mappings to be extended. 40
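
To illustrate what selecting a back-end looks like in code: in the initialization example on the following slides only the accelerator (and stream) alias needs to change. The alternative type names below follow the pattern of AccCpuSerial used in this talk and are assumptions that should be checked against the Alpaka version at hand.

    // Back-end selection by changing a single type alias (names assumed, see above)
    // using Acc = alpaka::acc::AccCpuSerial<Dim, Size>;      // sequential CPU back-end
    // using Acc = alpaka::acc::AccCpuOmp2Blocks<Dim, Size>;  // OpenMP 2 blocks on the CPU
    using Acc    = alpaka::acc::AccGpuCudaRt<Dim, Size>;      // CUDA runtime back-end
    using Stream = alpaka::stream::StreamCudaRtAsync;         // matching asynchronous stream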

41-43 Alpaka : Vector Addition Kernel

    struct VectorAdd
    {
        template<typename TAcc, typename TElem, typename TSize>
        ALPAKA_FN_ACC auto operator()(
            TAcc const & acc,
            TSize const & numElements,
            TElem const * const X,
            TElem * const Y) const
        -> void
        {
            namespace alp = alpaka;
            auto const globalIdx      = alp::idx::getIdx<alp::Grid, alp::Threads>(acc)[0u];
            auto const elemsPerThread = alp::workdiv::getWorkDiv<alp::Thread, alp::Elems>(acc)[0u];

            auto const begin = globalIdx * elemsPerThread;
            auto const end   = min(begin + elemsPerThread, numElements);
            for(TSize i = begin; i < end; ++i)
            {
                Y[i] = X[i] + Y[i];   // element-wise vector addition
            }
        }
    };

44-46 Alpaka : Initialization

    // Configure Alpaka
    using Dim     = alpaka::dim::DimInt<3u>;
    using Size    = std::size_t;
    using Acc     = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Host    = alpaka::acc::AccCpuSerial<Dim, Size>;
    using Stream  = alpaka::stream::StreamCpuSync;
    using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
    using Elem    = float;

    // Retrieve devices and stream (DevHost and DevAcc are the device types of Host and Acc)
    DevHost devHost( alpaka::dev::DevMan<Host>::getDevByIdx(0) );
    DevAcc  devAcc ( alpaka::dev::DevMan<Acc>::getDevByIdx(0) );
    Stream  stream ( devAcc );

    // Specify work division
    auto elementsPerThread( alpaka::Vec<Dim, Size>::ones() );
    auto threadsPerBlock  ( alpaka::Vec<Dim, Size>::all(2u) );
    auto blocksPerGrid    ( alpaka::Vec<Dim, Size>(4u, 8u, 16u) );
    WorkDiv workDiv( alpaka::workdiv::WorkDivMembers<Dim, Size>(blocksPerGrid, threadsPerBlock, elementsPerThread) );

47-50 Alpaka : Call the Kernel

    // Memory allocation and host-to-device memory copy
    auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
    auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc,  extent);
    auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc,  extent);

    alpaka::mem::view::copy(stream, X_d, X_h, extent);
    alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

    // Kernel creation and execution
    VectorAdd kernel;
    auto const exec( alpaka::exec::create<Acc>(
        workDiv,
        kernel,
        numElements,
        alpaka::mem::view::getPtrNative(X_d),
        alpaka::mem::view::getPtrNative(Y_d)) );
    alpaka::stream::enqueue(stream, exec);

    // Copy memory back to host
    alpaka::mem::view::copy(stream, Y_h, Y_d, extent);
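
One step worth adding after the enqueue above: before the host reads Y_h, the stream should be waited on. A minimal sketch, assuming the alpaka::wait interface of the Alpaka version shown here.

    // Block the host until the kernel and all enqueued copies have finished
    alpaka::wait::wait(stream);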

51 Cupla to the rescue: very fast porting! Alpaka 51

52 Cupla to the rescue: very fast porting! Fast enough for a live hack session! Alpaka 52

53 Live Cupla Hack Session: live port of a CUDA application to Cupla. Starting point: the tiling matrix-matrix multiplication algorithm from the CUDA samples. Aim: a single source code executed on GPU and CPU hardware. 53
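
A hedged sketch of what the result of such a port typically looks like: the CUDA __global__ kernel becomes a functor with an accelerator parameter, while CUDA-style built-ins and the kernel launch are mapped by Cupla. The header, macro and built-in mappings shown here are assumptions about Cupla and should be checked against the repository.

    // Assumed Cupla port of a CUDA-style matrix multiplication kernel (illustrative)
    #include <cuda_to_cupla.hpp>   // assumed Cupla compatibility header

    struct MatrixMulKernel
    {
        template<typename TAcc>
        ALPAKA_FN_ACC void operator()(TAcc const & acc,
                                      float const * A, float const * B, float * C,
                                      int width) const
        {
            // blockIdx/blockDim/threadIdx are assumed to be mapped onto acc by Cupla
            int const col = blockIdx.x * blockDim.x + threadIdx.x;
            int const row = blockIdx.y * blockDim.y + threadIdx.y;
            if (row < width && col < width)
            {
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)
                    sum += A[row * width + k] * B[k * width + col];
                C[row * width + col] = sum;
            }
        }
    };

    // The CUDA launch kernel<<<grid, block>>>(...) becomes (assumed syntax):
    // CUPLA_KERNEL(MatrixMulKernel)(grid, block, 0, 0)(A_d, B_d, C_d, width);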

54 Single Source Alpaka DGEMM Kernel on Various Architectures: measured 480 GFLOPS, 560 GFLOPS, 540 GFLOPS, 150 GFLOPS and 1450 GFLOPS against the theoretical peak performance of each architecture. DGEMM: C = αAB + βC

55-59 What happened so far...

60 PIConGPU Runtime on Various Architectures 60

61 PIConGPU Efficiency on Various Architectures 61

62 Clone us from GitHub git clone git clone git clone Alpaka paper pre-print: 62

63 63
