Programming models for heterogeneous computing. Manuel Ujaldón, NVIDIA CUDA Fellow and A/Prof., Computer Architecture Department, University of Malaga

Talk outline [30 slides]
1. Introduction [5 slides]
2. The GPU evolution [5 slides]
3. Programming [11 slides]
   1. Libraries [2 slides]
   2. Switching among hardware platforms [4 slides]
   3. Accessing CUDA from other languages [1 slide]
   4. OpenACC [4 slides]
4. The new hardware [9 slides]
   1. Kepler [8 slides]
   2. Echelon [1 slide]

I. Introduction

An application which favours CPUs: task parallelism and/or intensive I/O. When applications are bags of (few) tasks, apply task parallelism. [Figure: tasks P1-P4 distributed across CPU cores.] Try to balance the tasks while keeping their affinity to disk files.

An application which favours GPUs: data parallelism [+ large scale]. When applications are not streaming workflows, combine task and data parallelism. [Figure: tasks P1-P4, each exploiting data parallelism, running under task parallelism.]

The heterogeneous case is more likely, and requires a wise programmer to exploit each processor. When applications are streaming workflows, combine task parallelism, data parallelism, and pipelining. [Figure: stages P1-P4 pipelined, each stage exploiting data parallelism under task parallelism.]

Hardware resources and scope of application for the heterogeneous model. The CPU (sequential computing, 4 cores) handles control, communication, and productivity-based applications; the GPU (parallel computing, 512 cores) handles graphics and highly parallel, data-intensive applications: oil & gas, finance, medical, biophysics, numerics, audio, video, imaging. Use CPU and GPU together: each processor executes those parts where it is more effective.

There is a hardware platform for each end user: hundreds of researchers use large-scale clusters (more than a million dollars); thousands of researchers use clusters of Tesla servers (between $50,000 and $1,000,000); millions of researchers use a Tesla graphics card (less than $5,000).

II. The GPU evolution

The graphics card within the domestic hardware marketplace (regular PCs). GPUs sold per quarter: 114 million [Q4 2010], 138.5 million [Q3 2011], 124 million [Q4 2011]. The marketplace keeps growing despite the global crisis. Compared to the 93.5 million CPUs sold [Q4 2011], there are 1.5 GPUs out there for each CPU, and this factor has grown relentlessly over the last decade (it was barely 1.15x in 2001).

In barely 5 years, CUDA programming has grown to become ubiquitous: more than 500 research papers are published each year; more than 500 universities teach CUDA programming; more than 350 million GPUs can be programmed with CUDA; more than 150,000 active programmers; more than a million compiler and toolkit downloads.

The three generations of processor design: before 2005, 2005-2007, and 2008-2012.

... and how they are connected to programming trends.

We also have OpenCL, which extends GPU programming to non-NVIDIA platforms.

III. Programming

III. 1. Libraries

A brief example: a Google search is a must before starting an implementation, since an optimized GPU library may already exist for your problem.
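For instance, a SAXPY needs no hand-written kernel at all once cuBLAS is found. A minimal sketch (array sizes and values are illustrative, not from the talk):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1 << 20;
    const float alpha = 2.0f;
    float *h_x = (float*)malloc(n * sizeof(float));
    float *h_y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);   // host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);      // y = alpha*x + y on the GPU

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);   // device -> host
    printf("y[0] = %f\n", h_y[0]);                       // expect 4.0

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```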

The developer ecosystem enables application growth.

III. 2. Switching among hardware platforms

Compiling for other target platforms

Ocelot http://code.google.com/p/gpuocelot A dynamic compilation environment for PTX code on heterogeneous systems, which allows extensive analysis of the PTX code and its migration to other platforms. The latest version (2.1, as of April 2012) targets:
- GPUs from multiple vendors.
- x86-64 CPUs from AMD/Intel.

Swan http://www.multiscalelab.org/swan A source-to-source translator from CUDA to OpenCL:
- Provides a common API which abstracts the runtime support of CUDA and OpenCL.
- Preserves the convenience of launching CUDA kernels (<<<grid,block>>>) by generating C source code for the entry-point kernel functions... but the conversion process is not automatic and requires human intervention.
Useful for:
- Evaluating OpenCL performance for an already existing CUDA code.
- Reducing the dependency on nvcc when compiling host code.
- Supporting multiple CUDA compute capabilities in a single binary.
- Serving as a runtime library to manage OpenCL kernels in new developments.
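To make the <<<grid,block>>> convenience concrete, here is a minimal CUDA kernel and launch of the kind Swan maps onto OpenCL entry points (the kernel itself is illustrative, not taken from the Swan distribution):

```cuda
// Minimal CUDA kernel: each thread scales one element.
__global__ void scale(float *v, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= factor;
}

void launch_scale(float *d_v, float factor, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    // This launch syntax is the convenience Swan preserves: in OpenCL it
    // becomes an explicit clSetKernelArg / clEnqueueNDRangeKernel sequence.
    scale<<<grid, block>>>(d_v, factor, n);
}
```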

PGI CUDA x86 compiler http://www.pgroup.com Major differences with the previous tools: it is not a translator of the source code; it works at run time. In 2012, it will allow building a unified binary, which will simplify software distribution. Main advantages:
- Speed: the compiled code can run on an x86 platform even without a GPU. This enables the compiler to vectorize code for SSE instructions (128 bits) or the more recent AVX (256 bits).
- Transparency: even applications which use GPU-native resources like texture units will behave identically on CPU and GPU.

III. 3. Accessing CUDA from other languages

Some possibilities. CUDA can be incorporated into any language that provides a mechanism for calling C/C++. To simplify the process, we can use general-purpose interface generators. SWIG [http://swig.org] (Simplified Wrapper and Interface Generator) is the most renowned approach in this respect: actively supported, widely used, and already successful with AllegroCL, C#, CFFI, CHICKEN, CLISP, D, Go, Guile, Java, Lua, MzScheme/Racket, OCaml, Octave, Perl, PHP, Python, R, Ruby and Tcl/Tk. A connection with Matlab is also available: on a single GPU, use Jacket, a numerical computing platform; on multiple GPUs, use the MathWorks Parallel Computing Toolbox.
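A hedged sketch of the typical recipe: hide the CUDA launch behind a C-linkage function, then let SWIG wrap that function. The names gpu_saxpy and gpusaxpy are hypothetical, not taken from the talk:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy_kernel(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

/* extern "C" gives the function C linkage, which is what SWIG (and most
   foreign-function interfaces) can wrap. */
extern "C" void gpu_saxpy(float a, const float *x, float *y, int n) {
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);
    saxpy_kernel<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}

/* A matching SWIG interface file would be as small as:
     %module gpusaxpy
     %{ extern void gpu_saxpy(float a, const float *x, float *y, int n); %}
     extern void gpu_saxpy(float a, const float *x, float *y, int n);
   and "swig -python gpusaxpy.i" generates the Python binding. */
```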

III. 4. OpenACC

The OpenACC initiative

OpenACC is an alternative to the computer scientist's CUDA, aimed at average programmers. The idea: introduce a parallel programming standard for accelerators based on directives (like OpenMP), which:
- Are inserted into C, C++ or Fortran programs to direct the compiler to parallelize certain code sections.
- Provide a common code base: multi-platform and multi-vendor.
- Enhance portability across other accelerators and multicore CPUs.
- Bring an ideal way to preserve the investment in legacy applications, by enabling an easy migration path to accelerated computing.
- Relax the programming effort (and the expected performance).
First supercomputing customers: in the United States, Oak Ridge National Lab; in Europe, the Swiss National Supercomputing Centre.

OpenACC: The way it works
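To give the flavor: a minimal OpenACC SAXPY in C, a hedged sketch rather than code from the talk. A single directive marks the loop for offloading, and data clauses describe the traffic:

```c
/* Build with an OpenACC compiler, e.g. the PGI one: pgcc -acc saxpy_acc.c */
#include <stdio.h>

#define N (1 << 20)
static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* copyin: x is only read on the device; copy: y is read and written back. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```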

OpenACC: Results

IV. Hardware designs

IV. 1. Kepler

The Kepler architecture: Die and block diagram

A brief reminder of CUDA
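As a refresher, the canonical CUDA pattern: each thread handles one array element, and the host sizes the grid from the problem size. A minimal sketch, not code from the talk:

```cuda
#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // enough blocks of 256 threads

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);                     // expect 30.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```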

Differences in memory hierarchy

Kepler resources and limitations vs. the previous GPU generation:

                                      Fermi      Fermi      Kepler       Kepler       Limitation  Impact
Hardware model                        GF100      GF104      GK104        GK110
Compute Capability (CCC)              2.0        2.1        3.0          3.5
Max. cores (multiprocessors)          512 (16)   336 (7)    1536 (8)     2880 (15)    Hardware    Scalability
Cores / multiprocessor                32         48         192          192          Hardware    Scalability
Threads / warp (the warp size)        32         32         32           32           Software    Throughput
Max. warps / multiprocessor           48         48         64           64           Software    Throughput
Max. thread-blocks / multiprocessor   8          8          16           16           Software    Throughput
Max. threads / thread-block           1024       1024       1024         1024         Software    Parallelism
Max. threads / multiprocessor         1536       1536       2048         2048         Software    Parallelism
Max. 32-bit registers / thread        63         63         63           255          Software    Working set
32-bit registers / multiprocessor     32 K       32 K       64 K         64 K         Hardware    Working set
Shared memory / multiprocessor        16-48 KB   16-48 KB   16-32-48 KB  16-32-48 KB  Hardware    Working set
Max. X grid dimension                 2^16-1     2^16-1     2^32-1       2^32-1       Software    Problem size
Dynamic parallelism                   No         No         No           Yes          Hardware    Problem structure
Hyper-Q                               No         No         No           Yes          Hardware    T. scheduling
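Most of the limits in this table can be queried at run time, so code can adapt to whichever generation it lands on. A brief sketch using the CUDA runtime API (device 0 is assumed):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:       %d\n",    prop.multiProcessorCount);
    printf("Warp size:             %d\n",    prop.warpSize);
    printf("Max threads / block:   %d\n",    prop.maxThreadsPerBlock);
    printf("Max threads / SM:      %d\n",    prop.maxThreadsPerMultiProcessor);
    printf("Registers / block:     %d\n",    prop.regsPerBlock);
    printf("Shared memory / block: %zu B\n", prop.sharedMemPerBlock);
    printf("Max grid X dimension:  %d\n",    prop.maxGridSize[0]);
    return 0;
}
```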

Dynamic Parallelism in Kepler: Kepler GPUs adapt dynamically to data, launching new threads at run time.

Dynamic Parallelism (2): It makes GPU computing easier and broadens its reach.
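A hedged sketch of what this looks like in code (kernel names are illustrative): a parent kernel launches child grids sized at run time, without returning control to the CPU. Dynamic parallelism requires CCC 3.5 hardware such as GK110; build with nvcc -arch=sm_35 -rdc=true -lcudadevrt:

```cuda
__global__ void child(float *region, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) region[i] *= 2.0f;                      // per-element work
}

__global__ void parent(float **regions, const int *sizes, int nRegions) {
    int r = threadIdx.x;
    if (r < nRegions && sizes[r] > 0) {
        // Each parent thread sizes and launches a child grid for its own
        // region, so the amount of parallelism adapts to the data.
        child<<<(sizes[r] + 255) / 256, 256>>>(regions[r], sizes[r]);
    }
}
```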

Hyper-Q: CPU cores simultaneously run tasks on Kepler.

Hyper-Q (cont.)
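A hedged sketch of the pattern Hyper-Q speeds up: independent kernels submitted to separate CUDA streams (for instance, fed by different CPU threads or MPI ranks). On GK110's 32 hardware work queues these launches can overlap instead of serializing on a single queue; the kernel and sizes here are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i] + 1.0f;
}

int main(void) {
    const int nStreams = 8, n = 1 << 16;
    cudaStream_t streams[8];
    float *d_buf[8];

    for (int s = 0; s < nStreams; s++) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_buf[s], n * sizeof(float));
        // Each launch targets its own stream; with Hyper-Q the streams
        // occupy separate hardware queues and need not serialize.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; s++) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```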

IV. 2. Echelon

A look ahead: the Echelon execution model. Swift operations: thread array creation, messages, block transfers, collective operations. [Figure: objects and threads A and B in a global address space, communicating through load/store, bulk transfers, and active messages across the memory hierarchy.]

Thanks for your attention! My coordinates: email: ujaldon@uma.es. My web page at the University of Malaga: http://manuel.ujaldon.es. My web page at Nvidia: http://research.nvidia.com/users/manuel-ujaldon