GPGPUs, CUDA and OpenCL




GPGPUs, CUDA and OpenCL. Timo Lilja, January 21, 2010

Course arrangements
- Course code: T-106.5800 Seminar on Software Techniques
- Credits: 3
- Thursdays 15-16 at A232, lecture period III only
- Mandatory attendance, but you can skip 1 session
- Presentation
  - One-hour presentation
  - Two presentations per session
- Programming project
  - A small programming project on a given topic, or on your own topic if you have not received credits for it from some other course
  - The goal is to parallelize the given program
  - You can choose whether to use CUDA or OpenCL
  - We provide a development environment for the project. More information will be announced later; check the wiki page
- Course wiki page: http://wiki.tkk.fi/display/gpgpuk2010/home

Contents
1. Introduction
2. NVIDIA
   - Hardware
   - CUDA
3. OpenCL
4. Conclusion

Why GPGPU?
- In many scientific computing efforts, GPGPU can offer a hundredfold increase in performance, a tenfold decrease in price and a threefold increase in power efficiency over a traditional CPU
- Business opportunities in various fields: medical technology, data mining, ...

What is a GPGPU?
- General-Purpose Computing on Graphics Processing Units
- Original application in computer graphics and games
- Origins in programmable vertex and fragment shaders
- The first GPGPU programs were written using normal graphics APIs in the late 1990s
- Early 2000s: first programmable shaders, then fully programmable GPU cores
- Ca. 2005: first fully programmable shaders

Parallel Computing Architectures
- According to Flynn's taxonomy, defined in 1966 by Michael J. Flynn
[Figure: the classes of Flynn's taxonomy. Pictures by Colin M.L. Burnett]

Stream Processing
- Programming paradigm related to SIMD
- Given a stream of data and a series of operations, called kernel functions
- The kernel function is applied to all elements of a stream concurrently
- Memory is very hierarchical: local memory is easily accessible, global memory is much more expensive
- Memory accesses are usually done in bulk, so memory is optimized for high bandwidth rather than low latency
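As a rough illustration of the stream model, the plain C sketch below applies a kernel function to every element of a stream. It is host-side C only, and the names `kernel_fn`, `stream_map` and `scale_and_shift` are hypothetical. On a GPU the iterations would run concurrently; here they run one after another.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel function maps one input element to one output element. */
typedef float (*kernel_fn)(float);

/* Hypothetical example kernel: y = 2x + 1. */
static float scale_and_shift(float x)
{
    return 2.0f * x + 1.0f;
}

/* Apply the kernel to all n elements of the input stream.
 * No iteration depends on another, which is what makes the
 * pattern suitable for SIMD-style execution. */
static void stream_map(kernel_fn k, const float *in, float *out, size_t n)
{
    assert(k != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = k(in[i]);
}
```

Calling `stream_map(scale_and_shift, in, out, n)` fills `out` with the transformed stream. Because each output element depends only on its own input element, the loop body can be handed to independent threads unchanged.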

GPU vs. CPU
- To support SIMD parallelism, ALUs must be abundant, whereas control logic and data caches are needed much less

NVIDIA GPU
- Implementation of a stream processor system
- Unified architecture: vertex, pixel and other shaders use the same GPU facilities
- Highly hierarchical hardware
  - Streaming processor core (SP)
  - Streaming multiprocessor (SM)
  - Texture/processor cluster (TPC)
  - Streaming processor array (SPA)
- Limitations and differences when compared to a CPU

Streaming Multiprocessor (SM)
- 8 streaming processor (SP) cores
  - scalar multiply-add (MAD) and ALU units
  - single-precision floats and ALU operations in 4 cycles
- Fused multiply-add unit (FMAD)
  - IEEE 754R double-precision floating point
  - 1 per multiprocessor: double-precision floats are slow
- 2 special function units (SFU)
  - provide transcendental functions
  - other complex functions: reciprocal
  - slow: latencies of 16-32 cycles or more
- Low-latency interconnect network between SPs and shared-memory banks
- Multi-threaded instruction fetch and issue unit
- Caches: instruction cache and read-only constant cache
- 16 KB read/write shared memory

Texture/Processor Cluster (TPC)
- The geometry controller maps the operations onto streaming multiprocessors
- Provides a 2-dimensional texture cache that exploits (x, y) spatial locality
- Streaming multiprocessor (SM) controller
- Older NVIDIA cards (G80) have 2 SMs per TPC, newer ones (GT200) have 3 SMs per TPC

Streaming Processor Array
[Figure: streaming processor array]

Memory and other features
- Memory is highly hierarchical and cached
  - Thread-local memory
  - Shared memory, which is shared inside a streaming multiprocessor (SM)
  - Global memory, which is accessible to all threads
- Other units are mainly used for computer graphics
  - Texture unit
  - Rasterization: raster operations processor (ROP)

Die micrograph
[Figure: die micrograph]

Hardware limitations
- Branching can cause the program to run fully sequentially
- Double-precision floating-point numbers are slow
- Bus bandwidth between CPU and GPU can become a bottleneck

Current hardware specifications
[Table: current NVIDIA hardware specifications]

CUDA
- Compute Unified Device Architecture
- NVIDIA's proprietary stream programming language
- Available for Linux, Mac OS X and Windows
- Current release 2.3, first release in 2007
- C for CUDA
  - Compiled through Pathscale's Open64 C compiler
  - Standard C with kernel extensions
- CUDA driver API
  - Standard C API interface
  - Kernels are explicitly loaded
- CUDA toolkit
  - Includes compiler, profiler, debugger, manual pages, runtime libraries
- CUDA SDK
  - Various code examples, some extra libraries

Programming CUDA (1/2)
Consider adding two vectors A and B and storing the result in C.
In ordinary C:
    /* N is the vector length, assumed to be defined elsewhere */
    void VecAdd(float *A, float *B, float *C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
In CUDA:
    /* Each thread adds one element; the thread index replaces the loop */
    __global__ void VecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

Programming CUDA (2/2)
- In order to run a parallel program
  1. Data must be copied to the GPU
  2. The kernel must be invoked from the CPU code with special syntax
  3. The data must be copied back to the CPU
- The language used in CUDA kernels is limited
  - recursion is not supported
  - function pointers cannot be used
  - a few other restrictions are documented in the CUDA programming manual
- See example/cuda/vec.cu

Processing flow on CUDA
[Figure: processing flow on CUDA. Picture by Tosaka]

Threads, Blocks and Grids (1/2)
- Threads
  - perform a single scalar operation per cycle
- Thread blocks
  - can be 1-, 2- or 3-dimensional
  - can communicate through shared memory
  - can synchronize through __syncthreads()
  - at most 512 threads per block
  - are executed in warps of 32 threads on a single SM
- Grids
  - a kernel can be executed by multiple thread blocks
  - thread blocks are organized into a 1- or 2-dimensional grid, which can be used for indexing the block
- Kernel invocation syntax is kernel<<<dimGrid, dimBlock>>>(args);
  - dimGrid can be 1- or 2-dimensional
  - dimBlock can be 1D, 2D or 3D
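The block and warp limits above translate into simple arithmetic. The following C sketch (plain host code, not CUDA) computes how many 32-thread warps a block occupies. The helper names `block_threads` and `warps_per_block` are hypothetical, and the constants are the ones quoted on the slide.

```c
#include <assert.h>

enum { WARP_SIZE = 32, MAX_THREADS_PER_BLOCK = 512 };

/* Total number of threads in a block with the given dimensions. */
static unsigned block_threads(unsigned x, unsigned y, unsigned z)
{
    return x * y * z;
}

/* Number of 32-thread warps needed to execute one block.
 * A partially filled last warp still counts as a whole warp. */
static unsigned warps_per_block(unsigned threads)
{
    assert(threads >= 1 && threads <= MAX_THREADS_PER_BLOCK);
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

For example, a 16 x 16 block has 256 threads and runs as 8 warps, while a block of 100 threads still occupies 4 warps, the last of which is only partly used.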

Threads, Blocks and Grids (2/2)
[Figure: threads, blocks and grids]

Example: matrix addition (1/2)
In normal C:
    void addmatrix(float *a, float *b, float *c, int N)
    {
        int i, j, idx;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                idx = i + j*N;
                c[idx] = a[idx] + b[idx];
            }
        }
    }
    int main(void)
    {
        ...
        addmatrix(a, b, c, N);
    }

Example: matrix addition (2/2)
In CUDA:
    __global__ void addmatrixg(float *a, float *b, float *c, int N)
    {
        /* global (i, j) position of this thread in the N x N matrix */
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = i + j*N;
        if (i < N && j < N)   /* threads outside the matrix do nothing */
            c[idx] = a[idx] + b[idx];
    }
    int main(void)
    {
        dim3 dimBlock(blocksize, blocksize);
        /* round up so the grid also covers N not divisible by blocksize */
        dim3 dimGrid((N + dimBlock.x - 1)/dimBlock.x,
                     (N + dimBlock.y - 1)/dimBlock.y);
        addmatrixg<<<dimGrid, dimBlock>>>(a, b, c, N);
    }
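The grid size has to be rounded up when N is not a multiple of the block size, and the `if (i < N && j < N)` guard then keeps the surplus threads from writing out of bounds. The C sketch below emulates the kernel's index calculation on the CPU to make this visible. It is not CUDA code, and `blocks_for` and `add_matrix_emulated` are hypothetical names.

```c
#include <assert.h>

/* Blocks needed to cover n elements with block_size threads each,
 * rounded up so that a partial last block is included. */
static int blocks_for(int n, int block_size)
{
    assert(n > 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Sequential emulation of the 2D kernel launch for an n x n matrix.
 * The two outer loops play the role of blockIdx, the two inner loops
 * the role of threadIdx. Returns the number of elements written. */
static int add_matrix_emulated(const float *a, const float *b, float *c,
                               int n, int block_size)
{
    int grid = blocks_for(n, block_size);
    int written = 0;
    for (int by = 0; by < grid; by++)
        for (int bx = 0; bx < grid; bx++)
            for (int ty = 0; ty < block_size; ty++)
                for (int tx = 0; tx < block_size; tx++) {
                    int i = bx * block_size + tx; /* blockIdx.x*blockDim.x + threadIdx.x */
                    int j = by * block_size + ty; /* blockIdx.y*blockDim.y + threadIdx.y */
                    if (i < n && j < n) {         /* same guard as in the kernel */
                        int idx = i + j * n;
                        c[idx] = a[idx] + b[idx];
                        written++;
                    }
                }
    return written;
}
```

With n = 5 and a block size of 4, truncating division would give a 1 x 1 grid that covers only 16 of the 25 elements. Rounding up gives a 2 x 2 grid of 64 emulated threads, of which exactly 25 pass the guard.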

Compiling CUDA
- C for CUDA code is compiled with the nvcc compiler; its file extension is .cu
- The host code is compiled to native x86
- The device code is first compiled to Parallel Thread Execution (PTX) assembler and then to the cubin binary format
  - PTX is a pseudo-assembler with an arbitrarily large register set
  - almost entirely in SSA form
- The NVIDIA graphics card driver loads the cubin code, and compiles and executes the PTX code
- With the CUDA C driver API it is possible to upload your own, non-nvcc-generated cubin code to the driver
  - Used by some high-level languages to provide CUDA support

Other features
- Asynchronous execution
- Memory hierarchy: device memory, shared memory, page-locked host memory
- Error handling
- Multiple devices
- Debugger, profiler and the device emulation mode
- Performance tuning
- Check the SDK examples!

OpenCL
- Open Computing Language was initially developed by Apple
- Now developed by the Khronos Group; OpenCL 1.0 was published on December 8, 2008
- Both AMD and NVIDIA support OpenCL 1.0 as of late 2009
- Apple's implementation is based on the LLVM compiler framework
- OpenCL is a fully open standard
- The goal is to support GPGPUs, Cells, DSPs
- The OpenCL language is based on C99

Terminology
- An OpenCL host is the machine controlling one or more OpenCL devices
- A device consists of one or more computing cores
- A computing core consists of one or more processing elements
- Processing elements execute code as SIMD or SPMD (Single Program, Multiple Data; ordinary OpenMP-style multitasking)
- A program consists of one or more kernels
- Computation domains can be 1-, 2- or 3-dimensional
- Work-items execute kernels in parallel and are grouped into local workgroups
- Synchronization can be done only within a workgroup

Memory Model
- Like in CUDA, memory is hierarchical
  - Private memory is per work-item
  - Local memory is shared within a workgroup
  - Global/constant memory (unsynchronized)
  - Host memory
- Memory must be copied between host, global and local memory

Objects and Running OpenCL
- Setup
  1. Choose the device (GPU, CPU, Cell)
  2. Create a context, which is a collection of devices
  3. Create and submit work into a queue
- Memory consists of
  - buffers, which can be accessed freely (read/write)
  - images, which can be either read or written in a kernel, not both, and can be accessed only by specific functions
- Work is run asynchronously; synchronous access requires blocking API calls

OpenCL kernel language
- Based on ISO C99
  - No function pointers, recursion, variable-length arrays, bit fields
- Syntax and other additions
  - work-items and workgroups
  - vector types and operations
  - synchronization
  - address space qualifiers

OpenCL, CUDA and Linux
- Compiled with ordinary GCC and linked against CUDA's libOpenCL
- Kernels must be embedded into C strings or loaded from external files through the OpenCL API
  - In CUDA, kernels are precompiled and linked into the binary
- The NVIDIA CUDA SDK (at least in 3.0) has lots of OpenCL examples
- Kernel syntax is different!

Conclusion
- GPGPUs provide one form of parallelism, namely SIMD
- Multi-core CPUs provide MIMD parallelism
- Will the future merge these two into a single platform?
- NVIDIA CUDA is strongly a stream-processing SIMD implementation, whereas OpenCL is far more generic, supporting both SIMD and SPMD/MIMD
- What kind of applications, and who, will benefit from GPGPU stream processing?
  - Will it make Office applications run faster?
  - Will it benefit the average user? The average programmer? The average scientist?
  - At least it will benefit the average gamer

References
- GPUs and CUDA. http://www.cis.temple.edu/~ingargio/ /cis307/readings/cuda.html
- NVIDIA's GT200: Inside a Parallel Processor. http://www.realworldtech.com/page.cfm?articleid=rwt090808195242&p=1
- NVIDIA CUDA Programming Guide 2.3
- Building NVIDIA's GT200. http://www.anandtech.com/video/showdoc.aspx?i=3334&p=2
- ixbt Labs: NVIDIA CUDA. http://ixbtlabs.com/articles3/video/cuda-1-p6.html
- Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, vol. 28, no. 2, March/April 2008
- NVIDIA OpenCL JumpStart Guide
- Khronos Group's OpenCL overview

Possible topics (1/2)
- How to optimize matrix multiplications in CUDA/OpenCL
  - Section 3.2.2 in the CUDA Programming Guide
  - Starting from CPU multiplication and ending up in GPGPU
  - Benchmarking after each optimization step
- Performance tuning and best practices
  - CUDA/OpenCL Best Practices Guide
- CUDA and OpenCL
  - API comparison
  - Performance evaluation
- User experiences and example applications
  - NVIDIA's CUDA/OpenCL SDK
  - Other applications
  - Libraries and tools already ported to CUDA/OpenCL

Possible topics (2/2)
- AMD
  - Hardware overview and comparison to NVIDIA
  - Overview of AMD's implementation of OpenCL
  - AMD currently leading in GPU performance
- High-level languages and GPGPU
  - Python bindings for both CUDA and OpenCL
  - C++, FP languages, Matlab
- GPGPU IDEs and development tools
- Future GPGPU trends
  - Merging CPU and GPGPU: will it happen?

Arrangements once more
- Need two presentations for the next session
  - Next session on Jan 28th or Feb 4th?
- For the rest: e-mail me (timo.lilja@tkk.fi) suitable times and topic suggestions ASAP
- You can also suggest your own programming project topic by e-mailing me
- Check the wiki pages; I will add instructions on how to use CUDA/OpenCL in the course server environment: http://wiki.tkk.fi/display/gpgpuk2010/running+cuda+and+OpenCL+in+course+server
- It would be interesting to get some instructions on running AMD's OpenCL stack into the course wiki as well

Helmholtz Differential Equation (1/2)
- An elliptic partial DE, in general form:
  $$\nabla^2 \psi + k^2 \psi = 0$$
- The height of the wave $V$ at coordinates $(x, y)$ accelerates towards the wave height of the adjacent places $(x-d, y)$, $(x+d, y)$, $(x, y-d)$, $(x, y+d)$:
  $$D_t^2 V(x, y) = C \left( \frac{V(x-d, y) + V(x+d, y) + V(x, y-d) + V(x, y+d)}{4} - V(x, y) \right)$$
- Add a little friction, $-F\, D_t V(x, y)$, and an impulse, $+I(t, x, y)$:
  $$D_t^2 V(x, y) = C \left( \frac{V(x-d, y) + V(x+d, y) + V(x, y-d) + V(x, y+d)}{4} - V(x, y) \right) - F\, D_t V(x, y) + I(t, x, y)$$

Helmholtz Differential Equation (2/2)
- The coefficient $C$ corresponds to the conductivity of the material; with $C = 0$ the wave can't penetrate the material
- The coefficient $F$ corresponds to the friction of the material; with $F = 0$ there is no friction and the wave continues forever
- $d$ is the distance between two points in the discretized space
- We set $d = 1$ and adjust $C$ and $F$ correspondingly, and store $V(x, y)$ in a two-dimensional table
- For a numerical solution to be at least somewhat accurate, $C < 1$ and the wavelength $> 4$

Solving the DE Numerically (1/3)
- Given a (set of) DE $y' = y'(t, y)$
- Euler's algorithm: $y_{t+h} = y_t + h\, y'(t, y_t)$
  - follow the tangent for a step $h$
  - very inaccurate
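Euler's algorithm can be sketched in a few lines of C. The names `ode_rhs`, `euler_step`, `euler_integrate` and the test problem `rhs_exp` (y' = y, with exact solution e^t) are hypothetical choices for illustration, not part of the course material.

```c
#include <assert.h>

/* Right-hand side y' = f(t, y) of a first-order ODE. */
typedef double (*ode_rhs)(double t, double y);

/* One Euler step: follow the tangent at (t, y) for a step h. */
static double euler_step(ode_rhs f, double t, double y, double h)
{
    return y + h * f(t, y);
}

/* Take n Euler steps of size h, starting from y(t0) = y0. */
static double euler_integrate(ode_rhs f, double t0, double y0, double h, int n)
{
    assert(n >= 0);
    double t = t0, y = y0;
    for (int i = 0; i < n; i++) {
        y = euler_step(f, t, y, h);
        t += h;
    }
    return y;
}

/* Hypothetical test problem: y' = y, exact solution y(t) = e^t. */
static double rhs_exp(double t, double y)
{
    (void)t; /* this right-hand side does not depend on t */
    return y;
}
```

Integrating `rhs_exp` from y(0) = 1 to t = 1 with two steps of h = 0.5 gives 2.25, while the exact value is e, about 2.718. This shows how inaccurate the method is at large step sizes.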

Solving the DE Numerically (2/3)
- Runge-Kutta algorithm, one variation:
  $$y_{t+h} = y_t + h\, \frac{\alpha + 2\beta + 2\gamma + \delta}{6}$$
  where
  $$\alpha = y'(t, y_t), \quad \beta = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\alpha\right), \quad \gamma = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\beta\right), \quad \delta = y'(t + h,\; y_t + h\gamma)$$
- (Even better algorithms exist, especially for stiff problems like Helmholtz, but they are ignored here.)
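The Runge-Kutta variation above maps directly onto code. This is a minimal C sketch; the names `ode_rhs`, `rk4_step`, `rk4_integrate` and the test problem `rhs_exp` (y' = y) are hypothetical.

```c
#include <assert.h>

/* Right-hand side y' = f(t, y) of a first-order ODE. */
typedef double (*ode_rhs)(double t, double y);

/* One Runge-Kutta step of size h, using the four slopes
 * alpha, beta, gamma and delta from the formula. */
static double rk4_step(ode_rhs f, double t, double y, double h)
{
    double k_alpha = f(t, y);
    double k_beta  = f(t + h / 2.0, y + h / 2.0 * k_alpha);
    double k_gamma = f(t + h / 2.0, y + h / 2.0 * k_beta);
    double k_delta = f(t + h, y + h * k_gamma);
    return y + h * (k_alpha + 2.0 * k_beta + 2.0 * k_gamma + k_delta) / 6.0;
}

/* Take n Runge-Kutta steps of size h, starting from y(t0) = y0. */
static double rk4_integrate(ode_rhs f, double t0, double y0, double h, int n)
{
    assert(n >= 0);
    double t = t0, y = y0;
    for (int i = 0; i < n; i++) {
        y = rk4_step(f, t, y, h);
        t += h;
    }
    return y;
}

/* Hypothetical test problem: y' = y, exact solution y(t) = e^t. */
static double rhs_exp(double t, double y)
{
    (void)t; /* this right-hand side does not depend on t */
    return y;
}
```

With two steps of h = 0.5 from y(0) = 1, this sketch gives about 2.7173, which is much closer to e (about 2.7183) than a tangent-following method at the same step size, at the cost of four right-hand-side evaluations per step.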

Solving the DE Numerically (3/3)
- If the DE is of a higher order, we can normalize it into a system of first-order equations by introducing $v = D_t V$:
  $$V'_t(x, y) = v_t(x, y)$$
  $$v'_t(x, y) = C \left( \frac{V(x-d, y) + \dots}{4} - V(x, y) \right) - F\, D_t V(x, y)$$
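A minimal C sketch of the normalized system with d = 1, advanced by one explicit Euler step. Several things here are assumptions made for illustration: the names `cell` and `wave_euler_step`, the velocity array `U` holding D_t V, the rule that cells outside the grid read as 0, and the omission of the impulse term.

```c
#include <assert.h>
#include <stddef.h>

/* Value of V at (x, y) on a w-by-h grid. Cells outside the grid read
 * as 0, which is a boundary assumption made for this sketch only. */
static double cell(const double *V, int w, int h, int x, int y)
{
    if (x < 0 || x >= w || y < 0 || y >= h)
        return 0.0;
    return V[(size_t)y * (size_t)w + (size_t)x];
}

/* One explicit Euler step of size dt for the first-order system:
 *   D_t V = U
 *   D_t U = C * (average of the 4 neighbours - V) - F * U
 * V and U are updated in place. accel is caller-provided scratch
 * space of w*h doubles, so every acceleration uses the old state. */
static void wave_euler_step(double *V, double *U, double *accel,
                            int w, int h, double C, double F, double dt)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            size_t idx = (size_t)y * (size_t)w + (size_t)x;
            double avg = (cell(V, w, h, x - 1, y) + cell(V, w, h, x + 1, y) +
                          cell(V, w, h, x, y - 1) + cell(V, w, h, x, y + 1)) / 4.0;
            accel[idx] = C * (avg - V[idx]) - F * U[idx];
        }
    }
    for (size_t idx = 0; idx < (size_t)w * (size_t)h; idx++) {
        V[idx] += dt * U[idx];     /* uses the old velocity */
        U[idx] += dt * accel[idx];
    }
}
```

Each cell's acceleration depends only on the previous state of the cell and its four neighbours, so the first loop nest is the part that would naturally become a GPU kernel with one thread per cell.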