Introduction to GPU hardware and to CUDA


Philip Blakely, Laboratory for Scientific Computing, University of Cambridge

Course outline
- Introduction to GPU hardware structure and the CUDA programming model
- Writing and compiling a simple CUDA code
- Example application: 2D Euler equations
- Optimization strategies and performance tuning
- In-depth coverage of CUDA features
- Combining CUDA with MPI, and OpenCL

Introduction to GPUs
- References
- Why use GPUs?
- Short history of GPUs
- Hardware structure
- CUDA language
- Installing CUDA

References
- http://www.nvidia.com/cuda - official site
- http://developer.nvidia.com/cuda-downloads - drivers, compilers, manuals
- CUDA C Programming Guide
- CUDA C Best Practices Guide
- Programming Massively Parallel Processors, David B. Kirk and Wen-mei W. Hwu (2010)
- CUDA by Example, Jason Sanders and Edward Kandrot (2010)
- The CUDA Handbook, Nicholas Wilt (2013)

Course caveats
- Only applicable to NVIDIA GPUs (except the OpenCL material), although the material is relevant to all CUDA-enabled GPUs
- Assumes a CUDA-enabled graphics card of compute capability 2.0 or above
- Assumes a good knowledge of C/C++, including simple optimization, and some knowledge of C++ (simple classes and templates)
- Assumes basic knowledge of the Linux command line; some knowledge of compilation is useful
- Experience of parallel algorithms and their issues is useful
- Linux is assumed, although CUDA is also available under Windows and Mac OS

The power of GPUs
Almost all PCs have dedicated graphics cards, usually used for realistic rendering in games, visualization, etc. However, when programmed effectively, GPUs can provide a lot of computing power.
Live demos: fluidGL, particles, Mandelbrot

Why should we use GPUs?
- Can provide a large speed-up over CPU performance: for some applications a factor of 100 (although that was comparing a single Intel core with a full NVIDIA GPU); a more typical improvement is around 10x
- Relatively cheap - a few hundred pounds - since NVIDIA's volume comes from the gaming market
- Widely available
- Easy to install

Improved GFLOPS
Theoretical GFLOPS are higher than Intel's processors (it is not entirely clear which processor is being compared to).

Improved bandwidth
To get good performance out of any processor, we need high bandwidth as well.

Why shouldn't we use GPUs?
- Speed-up with double precision is less good than with single (a factor of 8 difference in performance, as opposed to a CPU difference of a factor of 2)
- Not suitable for all problems - depends on the algorithm
- Can require a lot of work to get very good performance
- May be better to use multi-core processors with MPI or OpenMP
- Current programming models do not protect the programmer from him/herself
- For some applications, a GPU may only give a 2.5x advantage over a fully-used multi-core CPU
- A commercial-grade GPU will cost around £2,500, around the same as a 16-core Xeon server

Brief history of GPUs
- Graphics cards for home PCs were first developed in the 1990s to take care of graphics rendering, leaving the main CPU free for complex calculations; the CPU hands over a scene to the graphics card for rendering
- Early 2000s: graphics operations become more complex and programmable, via APIs such as Microsoft's DirectX and OpenGL; scientific computing researchers start performing calculations using the built-in vector operations to do useful work
- Late 2000s: graphics card companies develop ways to program GPUs for general purposes; advent of GPGPUs and general programming languages, e.g. CUDA (NVIDIA, 2006); supercomputing with GPUs takes off
- 2010: NVIDIA release Fermi cards - more versatile
- 2012: NVIDIA release Kepler cards - even more scope for dynamic parallelism
- 2014: NVIDIA release Maxwell cards - more efficient, allowing for more cores per GPU

Hardware structure
Typical CPU: a large number of transistors is associated with flow control, cache, etc.

Typical GPU: a larger proportion of transistors is associated with floating-point operations. There is no branch prediction or speculative execution on GPU cores. This results in higher theoretical FLOPS.

GPU programming
The GPU programming model is typically SIMD (or SIMT). Programming for GPUs requires rethinking algorithms: the best algorithm for a CPU is not necessarily the best for a GPU, and knowledge of the GPU hardware is required for best performance. GPU programming works best if we:
- Perform the same operation simultaneously on multiple pieces of data
- Organise operations to be as independent as possible
- Arrange data in GPU memory to maximise the rate of data access (see the sketch below)
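As a minimal sketch of the last point (an assumed example, not from the slides): adjacent threads reading adjacent addresses let the hardware coalesce the accesses into a few memory transactions, whereas strided access does not.

    // Coalesced: thread i touches element i, so a warp reads a contiguous block.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses 'stride' elements apart,
    // so each warp needs many separate memory transactions.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }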

CUDA language
The CUDA language taught in this course is applicable to NVIDIA GPUs only; other GPU manufacturers exist, e.g. AMD. OpenCL (see the last lecture) aims to be applicable to all GPUs. CUDA (Compute Unified Device Architecture) is enhanced C++: it contains extensions corresponding to hardware features, and is easy to use (if you know C++) but contains much more potential for errors than C++.

Alternatives to CUDA
Caveat: I have not studied any of these...
- Microsoft's DirectCompute (following DirectX)
- AMD Stream Computing SDK (AMD-hardware specific)
- PyCUDA (Python wrapper for CUDA)
- CLyther (Python wrapper for OpenCL)
- Jacket (MATLAB or C++ wrapper - commercial)

Installing CUDA
Drivers: instructions for the latest driver can be found at https://developer.nvidia.com/cuda-downloads. For Ubuntu, proprietary drivers are in the standard repositories but may not be the latest version. The Toolkit from NVIDIA provides a compiler, libraries, documentation, and a wide range of examples.

CUDA example
First, the kernel, which multiplies a single element of a vector by a scalar:

    __global__ void multiplyVector(float* a, float x, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
        {
            a[i] = a[i] * x;
        }
    }

Second, the main function, which allocates memory on the GPU and calls the kernel:

    int main(void)
    {
        const int N = 1024;       // vector length (not shown on the slide)
        const float x = 2.0f;     // scalar multiplier (not shown on the slide)
        float *a, *aHost;
        aHost = new float[N];
        cudaMalloc((void**)&a, N * sizeof(float));
        cudaMemcpy(a, aHost, N * sizeof(float), cudaMemcpyHostToDevice);
        multiplyVector<<<(N + 511) / 512, 512>>>(a, x, N);
        cudaMemcpy(aHost, a, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(a);
        delete[] aHost;
        return 0;
    }
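The slides omit error checking; a common pattern (an assumption on my part, not shown in the slides) is to check the runtime's error state after the launch, inserted after the kernel call in main above (with <cstdio> included for fprintf):

    // Check for launch errors and wait for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();    // blocks until the kernel completes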

Compiling a CUDA program
NVIDIA have their own compiler, nvcc (based on Clang), which uses either gcc or cl for compiling the CPU code. nvcc is needed to compile any CUDA code; CUDA API functions can be compiled with a normal compiler (gcc / cl).
Compilation:

    nvcc helloworld.cu -o helloworld
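Since the course assumes compute capability 2.0 or above, it may be useful to target that architecture explicitly (a suggested default on my part, not a command from the slides):

    nvcc -arch=sm_20 helloworld.cu -o helloworld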

Programming model
CUDA works in terms of threads (Single Instruction Multiple Thread):
- Threads are separate processing elements within a program
- They run the same set of instructions, typically differing only by their thread number, which can lead to differing data dependencies and execution paths
- Threads execute independently of each other unless explicitly synchronized (or part of the same warp)
- Typically threads have access to the same block of memory, as well as some data local to each thread, i.e. a shared-memory paradigm with all the potential issues that implies

Conceptual diagram
- Each thread has internal registers and access to global and constant memory
- Threads arranged in blocks have access to a common block of shared memory; threads can communicate within blocks
- A block of threads is run entirely on a single streaming multiprocessor on the GPU

- Blocks are arranged into a grid; separate blocks cannot (easily) communicate, as they may be run on separate streaming multiprocessors
- Communication with the CPU is only via global and constant memory
- The GPU cannot request data from the CPU; the CPU must send data to it

Conceptual elaboration
When running a program, you launch one or more blocks of threads, specifying the number of threads in each block and the total number of blocks. Each block of threads is allocated to a streaming multiprocessor by the hardware. Each thread has a unique index, made up of a thread index and a block index, which are available inside your code and are used to distinguish threads in a kernel. Communication between threads in the same block is allowed; communication between blocks is not.
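As a minimal sketch of how these indices are used (a hypothetical kernel, not from the slides; printf in device code requires compute capability 2.0, which the course assumes):

    #include <cstdio>

    __global__ void printIndices()
    {
        // Combine the block index and thread index into a unique global index.
        int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
        printf("block %d, thread %d -> global index %d\n",
               blockIdx.x, threadIdx.x, globalIdx);
    }

    int main()
    {
        printIndices<<<4, 8>>>();    // a grid of 4 blocks, each of 8 threads
        cudaDeviceSynchronize();     // wait so the output actually appears
        return 0;
    }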

Hardware specifics
A GPU is made up of several streaming multiprocessors (4 for a Quadro K3100M, 13 for a Kepler K20c), each of which consists of 192 cores; the number of cores per SM is lower in older cards. Each SM runs a warp of 32 threads at once. Global memory access is relatively slow (400-800 clock cycles). An SM can have many thread blocks allocated at once, but only runs one at any one time; the SM can switch rapidly to running other threads to hide memory latency. Thread registers and shared memory for a thread block remain on the multiprocessor until the thread block has finished.

Streaming Multiprocessor
When launching a kernel, a block of threads is assigned to a streaming multiprocessor:
- All threads in the block run concurrently
- Within a block, threads are split into warps of 32 threads which run in lock-step, all executing the same instruction at the same time
- Different warps within a block are not necessarily run in lock-step, but can be synchronized at any point
- All threads in a block have access to some common shared memory
- Multiple thread blocks can be allocated on an SM at once, assuming sufficient memory
- Execution can switch to other blocks if necessary, while holding other running blocks in memory

Hardware limitations: Tesla C2050 (compute capability 2.0)
- 14 multiprocessors, each with 32 cores = 448 cores
- Per streaming multiprocessor (SM): 32,768 32-bit registers; 16-48kB shared memory; 8 active blocks
- Global: 64kB constant memory; 3GB global memory
- Maximum threads per block: 1024

Hardware limitations: Tesla K20 (compute capability 3.5)
- 13 multiprocessors, each with 192 cores = 2496 cores
- Per streaming multiprocessor (SM): 65,536 32-bit registers; 16-48kB shared memory; 16 active blocks
- Global: 64kB constant memory; 5GB global memory
- Maximum threads per block: 1024 (2048 resident threads per SM)

Hardware limitations
If you have a recent NVIDIA GPU in your own PC, it will have CUDA capability; if bought recently, this will probably be 2.1 or 3.0. The commercial-grade GPUs differ only in that they have more RAM, and that it is ECC. The same applications demonstrated here will run on your own hardware, so you can often use a local machine for testing. (For LSC users - see http://www-internal.lsc.phy.cam.ac.uk/internal/systems.shtml for CUDA capability details for our machines.)
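You can query these limits programmatically; a minimal sketch using the CUDA runtime API (an assumed example, not from the slides):

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // properties of device 0
        printf("%s: compute capability %d.%d, %d SMs, "
               "%zu bytes of global memory, max %d threads per block\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount,
               prop.totalGlobalMem, prop.maxThreadsPerBlock);
        return 0;
    }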

Terminology
- Host: the PC's main CPU and/or RAM
- Device: a GPU being used for CUDA operations (usually the GPU global memory)
- Kernel: a single block of code to be run on a GPU
- Thread: a single running instance of a kernel
- Block: an indexed collection of threads

Terminology continued
- Global memory: RAM on a GPU that can be read/written by any thread
- Constant memory: RAM on a GPU that can be read by any thread
- Shared memory: RAM on a GPU that can be read/written by any thread in a block
- Registers: memory local to each thread
- Latency: the time between issuing an instruction and the instruction being completed

Terminology ctd.
- Compute capability: NVIDIA's label for what hardware features a graphics card has. Most cards: 2.0-3.0; latest cards: 3.5, 5.0
- Warp: a consecutive set of 32 threads within a block that are executed simultaneously (in lock-step). If branching occurs within a warp, then the different code branches execute serially.
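A minimal sketch of such divergence (an assumed example, not from the slides): even and odd lanes of each warp take different branches, so the two paths run one after the other rather than in parallel.

    __global__ void divergent(float* d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            if (threadIdx.x % 2 == 0)    // even/odd lanes diverge within a warp
                d[i] = 2.0f * d[i];
            else
                d[i] = d[i] + 1.0f;      // executed serially after the even branch
        }
    }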

Importance of blocks
All threads in a block have access to the same shared memory. Threads in a block can communicate using shared memory and some special CUDA instructions:
- __syncthreads() - waits until all threads in a block reach this point
Blocks may be run in any order, and cannot (ordinarily/easily/safely) communicate with each other. A sketch of block-level communication follows.
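A minimal sketch of communication within a block (an assumed example, not from the slides), in which each block reverses its segment of an array through shared memory; it assumes n is a multiple of the block size of 256:

    __global__ void reverseInBlock(float* d, int n)
    {
        __shared__ float s[256];             // one element per thread in the block
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        if (i < n)
            s[t] = d[i];
        __syncthreads();                     // all writes to s[] now visible block-wide
        if (i < n)
            d[i] = s[blockDim.x - 1 - t];    // read an element another thread wrote
    }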

Parallel correctness
Remember that threads are essentially independent: you cannot make any assumptions about execution order. Blocks may be run in any order within a grid, so an algorithm must be designed to work without block-ordering assumptions. Any attempt to use global memory to pass information between blocks will probably fail (or at least give unpredictable results)...

Simple race condition
The simple instruction i++ usually results in the operations:
- Read the value of i into a processor register
- Increment the register
- Write i back to memory
If the same instruction occurs on separate threads, and i refers to the same place in global memory, then the outcome depends on how the two threads interleave. Starting with i = 0:
- If Thread 1 reads, increments, and writes before Thread 2 starts, we end with i = 2.
- If both threads read i = 0 before either writes, each writes back 1, and we end with i = 1.

Global memory read/write
Atomic instructions: using an atomic instruction will ensure that the reads and writes occur serially. This ensures correctness at the expense of performance.
From the Programming Guide: "If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device, and which thread performs the final write is undefined."
In other words: here be dragons! Each thread should write to its own global memory location, and ideally, global memory that a kernel reads should not also be written by the same kernel.
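A minimal sketch of the difference (an assumed example, not from the slides): every thread increments a single counter; the plain increment races as described above, while atomicAdd serializes the read-modify-write and gives the correct count.

    __global__ void countRacy(int* counter)
    {
        (*counter)++;             // read-modify-write races between threads
    }

    __global__ void countAtomic(int* counter)
    {
        atomicAdd(counter, 1);    // hardware serializes the update
    }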

Summary
- Introduced GPUs - their history and advantages
- Described the hardware model and how it relates to the programming model
- Defined some terminology for later use