GPU computing. Jochen Gerhard, Institut für Informatik, Frankfurt Institute for Advanced Studies


Overview How is a GPU structured? (Roughly.) How does manycore programming work compared to multicore? How can one access the GPU from Python? Some details about the structure of OpenCL programs. How to make the GPU do what you want?

Hardware A modern computer has more than just a CPU: more than one socket and more than one core, anyway, but also graphics cards, sometimes even more than one. Wouldn't it be nice to harvest all that computing power?

HPC for the poor man: GPUs As an example, let's take my MacBook Pro 5.1: 1x Intel Core 2 Duo @ 2.4 GHz, Galaxy benchmark ~12 GFLOPs; NVIDIA GeForce 9600M GT, Galaxy benchmark ~40 GFLOPs. Why not harvest both of them? With just one program?

The CPU n cores, with n rather small (less than 32), each with a private L1 cache, pairwise/quadwise shared L2 caches, a shared L3 cache, and slow access to system memory. It is good at many different things.

Multicore Computes different (rather complicated) tasks. Each core may even compute a different program. Complicated hardware. Cores sometimes share information (messages).

The GPU Lots of cores. Not so much memory. Pretty simple. Good at number crunching.

Manycore Computes (almost) always the same task. Different groups can work on slightly different branches. Simpler hardware. Coordination happens within groups only.

Ultra-Threaded Dispatch Processor Hardware overview from AMD's Programming Guide (*). [Figure: the ultra-threaded dispatch processor feeds instruction and control flow to the compute units; each compute unit contains 16 stream cores, and each stream core consists of processing elements, a T-processing element, general-purpose registers, and a branch execution unit.] To use the former picture: 5 chickens always stay together (the processing elements of one stream core), and each coop (a compute unit) contains 16 of these cliques. Much of this is transparent to the programmer.

The GPU's hierarchy Lots of cores (e.g. ATI Radeon 5870): 20 compute units, each with 16 stream cores, each with (4+1) processing elements. That makes 20 x 16 x 5 = 1600 SP units, plus 320 DP units and 320 SF units (one of each per stream core).

GPU / CPU The levels of the two hierarchies correspond roughly as follows:

    GPU                ~  Mainboard
    Compute Unit       ~  CPU / Socket
    Stream Core        ~  Core
    Processing Element ~  FPU

The GPU Performance Processing at 850 MHz gives a theoretical peak performance of 1.36 TFLOPs, for $299 (1600 SP units x 0.85 GHz = 1360 GFLOPs). Not so much memory (1/2 GB accessible without tricks), but fast global memory on the GPU and very fast local memory: 32 kB with almost no latency for each compute unit, plus 8 kB of L1 (read-only) cache per compute unit.

The OpenCL Platform One host, various compute devices (GPUs and CPUs). Each consists of compute units (cores / SIMD engines), which are again divided into processing elements (processing elements / FPUs). Platform overview from AMD's Programming Guide (*).

Organizing OpenCL First a platform has to be chosen: a platform is an implementation of OpenCL, and there may be more than one (like Apple + NVIDIA on my laptop). Then you query the devices that can be accessed by means of this platform. In a context, devices are tied together; it is used to manage buffers, programs, and kernels. You perform actions on these objects in queues.

Common usage Take the first platform you get! Put your GPU as the only DEVICE into the CONTEXT. Have one command QUEUE connected with your GPU.
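
In PyOpenCL this common setup is only a few lines; a minimal sketch (assuming a GPU device is present):

    import pyopencl as cl

    # take the first platform you get
    platform = cl.get_platforms()[0]
    # put the GPU as the only device into the context
    gpu = platform.get_devices(device_type=cl.device_type.GPU)[0]
    ctx = cl.Context([gpu])
    # have one command queue connected with the GPU
    queue = cl.CommandQueue(ctx, device=gpu)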

Organizing OpenCL II Though possible, I would not recommend using more than one platform. If you want more than one GPU to work on the same memory, they have to share a context! When different devices share a context, the buffers share the device constraints.

Memory Memory is managed in so-called buffers. Buffers are bound to a context. They have to be declared (size and specifiers). You get your data in and out via a copy command in the queue. You may also just give a pointer to host memory.
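
A minimal PyOpenCL sketch of this buffer workflow (names are illustrative, not the original code):

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    host_data = np.arange(32, dtype=np.float32)
    mf = cl.mem_flags
    # declare size and specifiers; COPY_HOST_PTR initializes the buffer
    # from host memory (USE_HOST_PTR would instead make OpenCL use the
    # host pointer directly)
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host_data)

    # ... enqueue kernels working on buf ...

    # get the data back out via a copy command in the queue
    result = np.empty_like(host_data)
    cl.enqueue_copy(queue, result, buf).wait()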

Execution model OpenCL programs are sets of functions written in a C99 derivative. The functions that are executed directly from the queue are called kernels. Kernels operate on every element of an input stream independently; this is orchestrated by the NDRange argument.

Kernels Kernels are functions you put into the command queue. Essentially, within a kernel you explain what each chicken (work unit) has to do! All work units will do the same thing, as written in the kernel!

Orchestrating the kernels Kernels are put into the command queue. Before enqueueing a kernel, one has to specify where the kernel parameters point to. Kernels are enqueued with an NDRange argument, which gives an N-dimensional range.

The NDRange Gives the number of work items for the kernel. It can be organized geometrically, e.g. 1024 x 1024 work-items suited to the problem size, and it can be subdivided into workgroups, e.g. 128 x 128 workgroups, each having 8 x 8 work-items.

The NDRange: Chicken Version The NDRange specifies how many chickens you want to put to work. You can organize them geometrically (16 = 4 x 4). You can also group them together.
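
A sketch of how the NDRange is passed in PyOpenCL (the kernel and all names are illustrative): the second argument of the kernel call is the global NDRange, the third is the workgroup shape.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    src = """
    __kernel void fill(__global float *out, int width) {
        int x = get_global_id(0);
        int y = get_global_id(1);
        out[y * width + x] = (float)(x + y);
    }
    """
    prg = cl.Program(ctx, src).build()

    n = 1024
    out = np.empty((n, n), dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)
    # NDRange: 1024 x 1024 work-items, subdivided into 8x8 workgroups
    # (which yields a 128 x 128 grid of workgroups)
    prg.fill(queue, (n, n), (8, 8), buf, np.int32(n)).wait()
    cl.enqueue_copy(queue, out, buf).wait()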

Why workgroups? All work items of a workgroup are executed on the same compute unit. They share the local memory, which is tremendously fast ("chickens within a coop"). Only within a workgroup can you synchronize. The next finer granularity is the wavefront: the execution stream within a wavefront is uniform, so branching within one is extremely expensive.

Why workgroups: Chicken version All chickens within the same group reside in the same coop. They share the same bowl, which is much nearer than the global bowl for everyone. They wait for each other when going to the local or global bowl (synchronization). Next finer granularity: chickens in a wavefront will all do the same! So if, in the same wavefront, one chicken has to add and another has to subtract, they all will do both!

Hands on

The OpenCL part It is a Python string that contains only one function, which is a kernel. The kernel has one parameter, data, a globally reachable array of int. Each work unit first gets its global ID in the x-direction (dimension 0) and then sets its entry to its GID.
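
The code slide itself is not part of this transcription; a minimal reconstruction of such a kernel string (the name set_gid is made up here) could be:

    src = """
    __kernel void set_gid(__global int *data) {
        // dimension 0 is the x-direction
        int gid = get_global_id(0);
        // each work unit sets its entry to its GID
        data[gid] = gid;
    }
    """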

The Python part I Platform / device / context are all managed by magic: create_some_context(). The queue is initialized with the given context. Then declare how many work units you want; here we use 32 x 1 work units. We need representations of the data on host and on device. (A complete sketch covering Parts I-III follows after Part III.)

The Python part II First we build the program from source and according to its context. From the context, the compiler knows the device architecture. One can also pass compiler options here (e.g. include files!). Every kernel becomes a method of the program object.

The Python part III We pass queue, NDRange, and kernel parameters. The .wait() ensures we wait for completion. The last step is getting the data out of the data buffer into the NumPy array data. We .wait() till this is finished, too.
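
Putting Parts I-III together, a minimal self-contained sketch (buffer and kernel names are illustrative, not the original code):

    import numpy as np
    import pyopencl as cl

    # the kernel from above: each work unit writes its GID into its slot
    src = """
    __kernel void set_gid(__global int *data) {
        int gid = get_global_id(0);
        data[gid] = gid;
    }
    """

    # Part I: platform / device / context managed by create_some_context()
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # host and device representations of the data; 32 x 1 work units
    data = np.zeros(32, dtype=np.int32)
    data_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, data.nbytes)

    # Part II: build for this context (compiler options could go into build())
    prg = cl.Program(ctx, src).build()

    # Part III: pass queue, NDRange, and kernel parameters; wait for completion
    prg.set_gid(queue, data.shape, None, data_buf).wait()

    # copy the result from the device buffer into the NumPy array and wait again
    cl.enqueue_copy(queue, data, data_buf).wait()
    print(data)   # -> [ 0  1  2 ... 31]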

Backup

Synchronization Within a workgroup: barrier(CLK_LOCAL_MEM_FENCE), barrier(CLK_GLOBAL_MEM_FENCE). In a queue: .wait() waits for the event being computed.
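
A sketch of a kernel using a workgroup barrier (illustrative, not from the talk): each workgroup of 64 items reverses its block of the array via local memory.

    __kernel void reverse_in_groups(__global float *data) {
        __local float tmp[64];          // assumes a workgroup size of 64
        int lid = get_local_id(0);
        int base = get_group_id(0) * get_local_size(0);
        tmp[lid] = data[base + lid];
        // wait until every work item of the group has stored its element
        barrier(CLK_LOCAL_MEM_FENCE);
        data[base + lid] = tmp[get_local_size(0) - 1 - lid];
    }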

Synchronization There is no global synchronization between work units. Chickens never wait for chickens in other groups.

A template

A practical example Naive matrix multiplication (using only global memory). Still approximately 300 times faster than numpy.dot(A, B) for A, B 1024 x 1024 single-precision matrices (on an ATI Radeon 5870).

Global Matrix Multiplication
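
The kernel slide is not transcribed; a naive global-memory kernel of the kind described could look like this (names illustrative, row-major n x n matrices, enqueued with global size (n, n)):

    __kernel void matmul_global(__global const float *A,
                                __global const float *B,
                                __global float *C,
                                int n) {
        int col = get_global_id(0);
        int row = get_global_id(1);
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }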

Local Matrix Multiplication

1st step Copy data from global memory to local memory. Each work item copies one entry per matrix (A, B) per round (k++) from global to local memory.

2nd step Now all memory accesses are within local memory. Each work item in the workgroup computes like in the global example. (Both steps are sketched below.)
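
Again the code slides are missing; a sketch of both steps, assuming LDIM x LDIM workgroups and n divisible by LDIM (LDIM is the constant set via metaprogramming below):

    #define LDIM 16

    __kernel void matmul_local(__global const float *A,
                               __global const float *B,
                               __global float *C,
                               int n) {
        __local float Asub[LDIM][LDIM];
        __local float Bsub[LDIM][LDIM];
        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);
        float acc = 0.0f;
        for (int t = 0; t < n / LDIM; t++) {
            // 1st step: each work item copies one entry per matrix per round
            Asub[ly][lx] = A[gy * n + t * LDIM + lx];
            Bsub[ly][lx] = B[(t * LDIM + ly) * n + gx];
            barrier(CLK_LOCAL_MEM_FENCE);
            // 2nd step: all accesses now go to local memory
            for (int k = 0; k < LDIM; k++)
                acc += Asub[ly][k] * Bsub[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[gy * n + gx] = acc;
    }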

Metaprogramming We can use Python to modify the OpenCL source before compiling. Instead of hard-coding the workgroup dimension,

    src = "#define LDIM 16\n"
    src += open("matmul.cl").read()

we can substitute it from the host side,

    src = "#define LDIM %i\n" % ldim
    src += open("matmul.cl").read()

where ldim is set in Python before the program is built...

Résumé Accessing the GPU from Python is quite easy. PyOpenCL works perfectly with NumPy. If you are considering porting some slow routines to C (e.g. using Cython), you should probably consider OpenCL instead. First (even practical!) routines are easily implemented.

Introductory Documents (*) Programming Guide: AMD Accelerated Parallel Processing OpenCL
OpenCL overview: http://www.khronos.org/developers/library/overview/opencl_overview.pdf
PyOpenCL: http://mathema.tician.de/software/pyopencl
OpenCL registry: http://www.khronos.org/registry/cl/