CUDA. Multicore machines



CUDA GPU vs multicore computers

Multicore machines
- Emphasize multiple full-blown processor cores, each implementing the complete instruction set of the CPU
- The cores are out-of-order, meaning they can be working on different tasks at the same time
- They may additionally support hyperthreading, with two hardware threads per core
- Designed to maximize the execution speed of sequential programs

GPUs (Nvidia)
- Typically have hundreds of cores
- Cores are heavily multithreaded, in-order, single-instruction-issue processors
- Each core shares control logic and instruction cache with seven other cores
- Achieve about 10X the performance of single-core machines

Design philosophy of general-purpose multicore CPUs
- CPU design is optimized for sequential code performance
- Emphasis on parallelizing different parts of an instruction (pipelining) to improve the performance of sequential processes
- Large cache memories to reduce instruction and data latency
- Core i7 (announced for Q1 2012)
  - Two load/store operations per cycle for each memory channel
  - 32KB data and 32KB instruction L1 cache at 3 clocks, and 256KB L2 cache at 8 clocks, per core
  - Up to 8 physical cores and 16 logical cores through hyperthreading

Design philosophy of many-core GPUs
- Shaped by the video game industry: a large number of floating point calculations per video frame
- Maximize chip area and power budget for floating point computation
- Optimize for the execution throughput of a large number of threads
- Large bandwidth to move data (10X compared to CPU)
  - The Nvidia GeForce 8800 GTX moves data at 85GB/s in and out of its main DRAM; GT200 chips support 150GB/s; CPU-based systems support about 50GB/s
- Simpler memory models
  - No legacy applications, OS, or I/O devices to restrict increases in memory bandwidth
- Small cache memories to control the bandwidth requirements of applications
- Designed as numeric computing engines
  - Applications are designed to execute the sequential part on the CPU and the numerically intensive part on the GPU

Compute Unified Device Architecture (CUDA)
- Designed to support joint CPU/GPU execution of applications
- A set of development tools to create applications that execute on the GPU
- The CUDA compiler uses a variation of C with some C++ extensions
- Avoids the performance overhead of graphics-layer APIs by compiling software directly to the hardware

- Includes a unified shader pipeline, allowing each ALU on the GPU to be used for general-purpose computations
- ALUs built to comply with the IEEE single-precision floating point standard
- Execution units on the GPU are allowed arbitrary read/write access to memory, as well as to a software-managed cache known as shared memory

Modern GPU architecture
- An array of highly threaded streaming multiprocessors (SMs)
  - SMs share control logic and instruction cache
  - A set of SMs is combined into a building block
- The GeForce 8800 contains 16 highly threaded SMs
  - 128 FPUs; each has a multiply-add unit and an additional multiply unit
  - Special function units perform floating point functions such as square root
  - Each SP can run multiple threads (768 threads/SM)
  - 367 GFLOPS
  - 768 MB DRAM
  - 86.4 GB/s memory bandwidth
  - 8 GB/s bandwidth to the CPU: 4GB/s from device to system and 4GB/s from system to device
- 4GB of graphics double data rate (GDDR) DRAM serves as global memory on the GPU
  - GDDR DRAM differs from the system DRAM on the CPU motherboard

- GDDR DRAM is frame buffer memory used for graphics
- For computing, it functions as very high bandwidth off-chip memory

CUDA-enabled GPUs
- Operate as a co-processor within the host computer
- Each GPU has its own memory and PEs
- Data needs to be transferred from host memory to device memory and from device memory back to host memory
  - Memory transfers affect performance times
- Use the nvcc compiler to convert C code to run on a GPU
  - The preferred extension for CUDA C source code is .cu
- The program must move data between host and device memory

Fixed-function graphics pipeline
- Configurable but non-programmable graphics hardware
- Graphics APIs to use software or hardware functionality
- Send commands to a GPU to display an object being drawn

Graphics pipeline
- Host interface receives graphics commands and data from the CPU
  - Commands are given by the host through an API
  - DMA hardware in the host interface efficiently transfers bulk data between host and device memory
- Vertex: corner of a polygon
  - The graphics pipeline is optimized to render triangles, so a vertex in GeForce refers to a vertex of a triangle
  - The surface of an object is drawn as a set of triangles; smaller triangles imply better image quality
  - Addressing modes for limited texture size/dimension
- Vertex control stage
  - Receives parameterized triangle data from the CPU
  - Converts it to a form understood by the hardware
  - Places the prepared data into the vertex cache
- Vertex shading, transform, and lighting stage
  - Transforms vertices and assigns per-vertex values: color, normal, texture coordinates, tangents
  - Shading done by pixel shading hardware
  - The vertex shader assigns a color to each vertex, but the color is not immediately applied to triangle pixels
  - Edge equations are used to interpolate colors and other per-vertex data (texture coordinates) across the pixels touched by the triangle

- Raster stage
  - Determines the pixels contained in each triangle
  - Interpolates per-vertex values to shade the pixels
  - Performs color raster operations to blend the colors of overlapping/adjacent objects for transparency and antialiasing
  - Determines the visible objects for a given viewpoint and discards occluded pixels
  - Antialiasing gives each pixel a color blended from the colors of the objects that partially overlap the pixel
- Shader stage
  - Determines the final color of each pixel
  - Combined effect of many techniques: interpolation of vertex colors, texture mapping, per-pixel lighting mathematics, reflections, and so on
- Frame buffer interface stage
  - Memory reads/writes to the display frame buffer memory
  - High bandwidth requirement for high-resolution displays; two techniques:
    1. Special memory design for higher bandwidth
    2. Simultaneously manage multiple memory channels connected to multiple memory banks

Evolution of programmable real-time graphics
- CPU die area is dominated by cache memories
- GPU die area is dominated by floating point datapath and fixed-function logic
- GPU memory interfaces emphasize bandwidth over latency
  - Latency can be hidden by massively parallel execution

CUDA programming model
- General purpose programming model
  - The user kicks off batches of threads on the GPU
  - The GPU is a dedicated super-threaded, massively data parallel co-processor
- Targeted software stack
  - Compute-oriented drivers, language, and tools
- Driver for loading computation programs onto the GPU
  - Standalone driver optimized for computation
  - Interface designed for compute: a graphics-free API
  - Data sharing with OpenGL buffer objects
  - Guaranteed maximum download and readback speeds
  - Explicit GPU memory management
  - C with no shader limitations
- Integrated host+device application: a single C program
  - Kernel call
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
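A minimal sketch of such an integrated host+device program; the hello-world example discussed next has exactly this shape. The printf on the host is an illustrative addition, not part of the notes.

    // Minimal integrated host+device program: an empty __global__ kernel compiled
    // for the device, launched from ordinary host C code with <<<1,1>>>.
    #include <stdio.h>

    __global__ void kernel(void)
    {
        // Empty: exists only to show a function compiled to run on the device.
    }

    int main(void)
    {
        kernel<<<1, 1>>>();          // execution configuration: 1 block of 1 thread
        printf("Hello, World!\n");   // serial host code, handled by the host compiler
        return 0;
    }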

Hello world program
- An empty function named kernel(), qualified with __global__
- A call to the empty function, embellished with <<<1,1>>>
- Nvidia tools feed the rest of the code to the host compiler (gcc)
- The __global__ qualifier alerts the compiler that a function should be compiled to run on the device instead of the host
- A kernel is a function callable from the host and executed on the CUDA device simultaneously by many threads in parallel
- The host calls the kernel by specifying the name of the kernel and an execution configuration
  - The execution configuration defines the number of parallel threads in a group and the number of groups to use when running the kernel on the CUDA device
  - The execution configuration is given in angle brackets
  - The angle brackets are not arguments to the device code; they are parameters that influence how the runtime will launch the device code

Code for addnum
- Parameters can be passed to a kernel just like any C function
- Memory must be allocated on the device to do anything useful

Code for incr_arr
- No loop in the device code: the function is simultaneously executed by an array of threads on the device
- Each thread is provided with a unique ID that is used to compute different array indices or to make control decisions
  - The unique ID, calculated in the register variable idx, is used to refer to each element of the array
  - The number of threads can be larger than the size of the array; idx is checked against n to see whether the indexed element needs to be operated on
- The call to the kernel incr_arr_dev queues the launch of the kernel function on the device; the call is asynchronous
  - The execution configuration contains the number of blocks and the block size
  - Arguments to the kernel are passed by the standard C parameter list
  - Since the device is idle, the kernel immediately starts to execute as per the execution configuration and function arguments
  - Both host and device then execute their separate code concurrently
  - The host calls cudaMemcpy, which waits until all threads have finished on the device (returned from incr_arr_dev)
- Specifying the kernel configuration by number of blocks and block size allows the code to be portable without recompilation
- Built-in variables in the kernel
  - blockIdx: block index within the grid
  - threadIdx: thread index within the block
  - blockDim: number of threads in a block
  - These are structures containing integer components
    - Blocks have x, y, and z components because they are 3D
    - Grids are 2D and contain only x and y components
  - We used only the x component because the input array is 1D
- We added an extra block if n was not evenly divisible by blk_sz; this may leave some threads without any work in the last block
- Important: each thread should be able to access the entire array a_d on the device; there is no data partitioning when the kernel is launched
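A sketch of the host-side driver for the incr_arr example just described, using the kernel and launch parameters that appear later in these notes; the host array initialization and printout are illustrative.

    // Host driver sketch for incr_arr: allocate device memory, copy the array to
    // the device, launch the kernel with an extra block when n is not divisible by
    // blk_sz, and copy the result back (cudaMemcpy waits for the kernel to finish).
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>

    __global__ void incr_arr_dev(float *arr, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
        if (idx < n)                                      // guard against extra threads
            arr[idx] += 1.0f;
    }

    int main(void)
    {
        int    n        = 256;                              // number of elements
        int    blk_sz   = 4;                                // threads per block
        int    num_blks = (int) ceil((float) n / blk_sz);   // blocks in the grid
        size_t bytes    = n * sizeof(float);

        float *a_h = (float *) malloc(bytes);               // host copy of the array
        float *a_d = NULL;                                   // device copy of the array
        for (int i = 0; i < n; i++) a_h[i] = (float) i;      // illustrative initialization

        cudaMalloc((void **) &a_d, bytes);
        cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);

        incr_arr_dev<<<num_blks, blk_sz>>>(a_d, n);          // asynchronous launch

        cudaMemcpy(a_h, a_d, bytes, cudaMemcpyDeviceToHost); // blocks until the kernel is done
        printf("a_h[0] = %f, a_h[%d] = %f\n", a_h[0], n - 1, a_h[n - 1]);

        cudaFree(a_d);
        free(a_h);
        return 0;
    }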

cudaMalloc
- The host code should not try to dereference the pointer returned by cudaMalloc
  - Host code can pass this pointer around, perform arithmetic on it, or cast it to a different type
  - It just cannot read or write memory through this pointer
- Memory allocated with cudaMalloc must be freed with a call to cudaFree
- Pointers within device code are used exactly as you would on the host, e.g. *c = a + b;
- In the same way, do not try to access host memory from within device code

CUDA error handling
- Every CUDA call, with the exception of kernel launches, returns an error of type cudaError_t
  - Successful calls return cudaSuccess; a failure returns an error code
  - The error code can be converted to human-readable form by the function
    const char * cudaGetErrorString ( cudaError_t code );
- cudaGetLastError reports the last error for any previous runtime call in the host thread
- Kernel launches are asynchronous, so they cannot be checked explicitly with cudaGetLastError alone
  - Use cudaThreadSynchronize to block until the device has completed all previous calls, including kernel calls
  - It returns an error if one of the preceding calls failed
  - Queuing multiple kernel launches implies that error checking can only be done after all the kernels have completed
- If the host runs multiple threads on different CUDA devices, each error is reported to the correct host thread
- If there are multiple errors, only the last error is reported
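A sketch of the error-checking pattern implied above. The check() helper, its messages, and the do_nothing kernel are illustrative additions, not part of the notes.

    // Illustrative error-checking pattern around cudaMalloc and a kernel launch.
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: print a readable message and exit if a CUDA call failed.
    static void check(cudaError_t code, const char *what)
    {
        if (code != cudaSuccess) {
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(code));
            exit(EXIT_FAILURE);
        }
    }

    __global__ void do_nothing(void) { }   // stand-in kernel for the example

    int main(void)
    {
        float *a_d = NULL;
        check(cudaMalloc((void **) &a_d, 256 * sizeof(float)), "cudaMalloc");

        do_nothing<<<1, 64>>>();
        check(cudaGetLastError(), "kernel launch");          // launch-time errors
        check(cudaThreadSynchronize(), "kernel execution");  // errors raised while the kernel ran

        check(cudaFree(a_d), "cudaFree");
        return 0;
    }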

Querying devices
- Find out the memory capacity and other capabilities of the CUDA device
- Some machines may have more than one CUDA-capable device
  - The class machine gpu.umsl.edu has two such devices
- Iterate through each device to get the relevant information on each
- Properties are returned in a structure of type cudaDeviceProp, defined in /usr/local/cuda/include/driver_types.h

    struct cudaDeviceProp {
        char   name[256];                // ASCII string to identify the device
        size_t totalGlobalMem;           // Device global memory in bytes
        size_t sharedMemPerBlock;        // Maximum amount of shared memory in bytes
                                         //   usable in a single block
        int    regsPerBlock;             // Number of 32-bit registers per block
        int    warpSize;                 // Number of threads in a warp
        size_t memPitch;                 // Maximum pitch allowed for memory copies in bytes
        int    maxThreadsPerBlock;       // Maximum number of threads in a block
        int    maxThreadsDim[3];         // Max number of threads along each dimension
        int    maxGridSize[3];           // Number of blocks along each side of the grid
        size_t totalConstMem;            // Available constant memory
        int    major;                    // Major revision of device compute capability
        int    minor;                    // Minor revision of device compute capability
        int    clockRate;                // Clock frequency in kHz
        size_t textureAlignment;         // Device requirement for texture alignment
        int    deviceOverlap;            // Can the device simultaneously perform a
                                         //   cudaMemcpy and kernel execution?
        int    multiProcessorCount;      // Number of PEs on the device
        int    kernelExecTimeoutEnabled; // Is there a runtime limit for kernels
                                         //   executed on this device?
        int    integrated;               // Is the device an integrated GPU, i.e. part
                                         //   of the chipset rather than a discrete GPU?
        int    canMapHostMemory;         // Can the device map host memory into the
                                         //   CUDA address space?
        int    computeMode;              // Device computing mode: default,
                                         //   exclusive, or prohibited
        int    maxTexture1D;             // Max size for 1D textures
        int    maxTexture2D[2];          // Max dimensions for 2D textures
        int    maxTexture3D[3];          // Max dimensions for 3D textures
        int    maxTexture2DArray[3];     // Max dimensions for 2D texture arrays
        int    concurrentKernels;        // Does the device support executing multiple
                                         //   kernels within the same context simultaneously?
    };
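A sketch of the device-query loop described above, using cudaGetDeviceCount and cudaGetDeviceProperties and printing only a few of the fields:

    // Illustrative device query: iterate over all CUDA devices and report
    // a handful of cudaDeviceProp fields for each.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; dev++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s\n", dev, prop.name);
            printf("  Compute capability:     %d.%d\n", prop.major, prop.minor);
            printf("  Global memory (bytes):  %lu\n", (unsigned long) prop.totalGlobalMem);
            printf("  Multiprocessors:        %d\n", prop.multiProcessorCount);
            printf("  Max threads per block:  %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }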

Using device properties
- Device properties are useful in optimizing the code to take advantage of the underlying hardware and software
- Double precision floating point math is supported on cards with compute capability 1.3 or higher
  - To use double precision floating point, we need to work with a device that has compute capability 1.3 or higher
  - Find the device that can achieve the goals and use it:

    int dev;
    cudaDeviceProp prop;

    memset ( &prop, 0, sizeof ( cudaDeviceProp ) );
    prop.major = 1;
    prop.minor = 3;
    cudaChooseDevice ( &dev, &prop );
    printf ( "CUDA device closest to revision 1.3 is: %d\n", dev );
    cudaSetDevice ( dev );

CUDA parallel programming
- The __global__ qualifier
  - Adding this qualifier to a function and calling it with the special angle-bracket syntax causes the function to be executed on the GPU
- The kernel to increment an array is defined as follows:

    __global__ void incr_arr_dev (
        float * arr,   // Array to be incremented
        int     n      // Number of elements in array
    )
    {
        int idx;       // Thread index

        idx = blockIdx.x * blockDim.x + threadIdx.x;
        if ( idx < n )
            arr[idx] += 1.0f;
    }

Kernel launch
- Requires specification of an execution configuration
  - The number of threads that compose a block
  - The number of blocks that form a grid
- A block can only be processed on a single multiprocessor
- The call to launch the kernel is in terms of:

    int     n = 256;                                  // Size of array
    float * a_d;                                      // Array in device memory (from cudaMalloc)
    int     blk_sz = 4;
    int     num_blks = (int) ceil ( (float) n / blk_sz );

    incr_arr_dev <<< num_blks, blk_sz >>> ( a_d, n );

- This creates num_blks copies of the kernel and executes them concurrently
- Within the kernel, the block is identified through the CUDA built-in variable blockIdx, a variable defined by the CUDA runtime
- Blocks can be defined in two dimensions

- Since we are working with a 1D array, we use blockIdx.x to get the first dimension; the second one is 1
- A collection of parallel blocks is called a grid
- The above call interprets the kernel as a grid of 1D blocks
- Scalar values are interpreted as 1D

Data parallelism
- Many arithmetic operations can be concurrently and safely performed on data structures
- Example: matrix-matrix multiplication C = A B

Shared memory
- Each element of the product matrix C is the result of a dot product of a row from A and a column from B
- The dot product for each element of C can be performed concurrently
- Global memory can deliver over 60GB/s, or 15 GFLOP/s for single-touch use of data
- Reuse of local data can achieve higher performance
  - It may hide global memory latency and global memory bandwidth restrictions (see the tiled sketch after this outline)
- Kernel launch requires specification of an execution configuration:
  - The number of threads that compose a block
  - The number of blocks that compose a grid
- A block can only be processed on a single multiprocessor
- Threads within a block can communicate with each other through local multiprocessor resources such as local shared memory

Balancing hardware and software resources against cost
- Developers want large amounts of local multiprocessor resources such as registers and shared memory
- Hardware needs to be inexpensive, but fast local multiprocessor memory is expensive
- Different hardware comes with different capabilities and cost
  - CUDA occupancy calculator: http://developer.download.nvidia.com/compute/cuda/cuda_occupancy_calculator.xls
  - Autoconfiguring the application by querying the device helps in optimization

CUDA execution model
- Each hardware multiprocessor can concurrently process multiple blocks
  - The capability depends on the number of registers per thread and the amount of shared memory needed by the kernel
  - Blocks processed by one multiprocessor at one time are called active
  - Kernels with minimal resource requirements are better, because multiprocessor resources are split among all threads of the active blocks
  - If there are not enough resources to process at least one block, the kernel will fail to launch; the failure can be caught by checking for errors
- Warp
  - Each active block is split into SIMD groups of threads called warps
  - Each warp has the same number of threads (the warp size) and is executed in SIMD fashion
  - Efficient and cost-effective model from the hardware point of view
  - Serializes conditionals, in the sense that both branches of the condition must be evaluated
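A generic tiled matrix-multiplication kernel illustrating the shared-memory reuse argument above. This is a standard sketch, not code from the course: the tile width and names are arbitrary, and the matrices are assumed square with a width that is a multiple of the tile size.

    // Generic tiled matrix multiply: each block stages TILE x TILE tiles of A and B
    // in shared memory and reuses each loaded element TILE times, cutting global
    // memory traffic compared with a one-touch implementation.
    #define TILE 16

    __global__ void matmul_tiled ( const float * A, const float * B, float * C, int width )
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        // width is assumed to be a multiple of TILE in this sketch.
        for ( int t = 0; t < width / TILE; t++ ) {
            As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
            __syncthreads();                           // tiles fully loaded

            for ( int k = 0; k < TILE; k++ )
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                           // done with this pair of tiles
        }

        C[row * width + col] = sum;
    }

A launch of this kernel would use a 2D execution configuration, e.g. dim3 blk(TILE, TILE) and dim3 grid(width/TILE, width/TILE), so that each block computes one TILE x TILE tile of C.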