CUDA SKILLS
Yu-Hang Tang
June 23-26, 2015, CSRC, Beijing




Slides: day1.pdf at /home/ytang/slides
Reference solutions coming soon
Online CUDA API documentation: http://docs.nvidia.com/cuda/index.html

Yu-Hang Tang @ Karniadakis Group, Summer School on Parallel Computing, CSRC, Beijing, June 23-26, 2015

RECAP
- GPUs are massively parallel, energy-efficient processors
- There are a variety of ways to harness the power of GPUs: language extensions, directives, libraries, scripts
- NVCC is the C++ compiler for CUDA GPUs
- Kernels are functions that run in parallel on GPUs; they need to be launched from the host CPU (or from the device via Dynamic Parallelism)
- Threads are organized in a grid of blocks
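As a reminder of these pieces working together, a minimal kernel and launch might look like the following sketch (the kernel name and sizes are illustrative, not from the course material):

```cuda
// A trivial kernel: each thread writes its global index into the output array.
__global__ void fill_index(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;          // guard against threads past the end
}

int main()
{
    const int n = 256;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // Launch a grid of blocks: 2 blocks of 128 threads each.
    fill_index<<<2, 128>>>(d_out, n);
    cudaDeviceSynchronize();        // kernel launches are asynchronous

    cudaFree(d_out);
    return 0;
}
```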

THE SIMT ARCHITECTURE
Software
- Kernels run in warps of 32 parallel threads
- All threads in a warp must execute the same instruction at any given time
Hardware (Kepler architecture)
- Each GPU contains several (e.g. 15) Streaming Multiprocessors (SMs)
- Each SM contains 192 cores, divided into 6 groups (i.e. 32 cores per group)
- Each SM can hold up to 2048 CUDA threads, i.e. 64 warps: 2 blocks if block size is 1024, 4 blocks if block size is 512, etc.
- On each cycle, the SM can pick up to 4 warps and issue up to 2 instructions per warp
- Why: to hide instruction & memory latency

OCCUPANCY
- Occupancy = (number of actual threads per SM) / (maximum number of threads per SM)
- Why high occupancy is crucial for reaching peak performance: to hide latency
  - Instruction latency: 9+ cycles
  - Memory latency: 8-2000+ cycles
- CUDA cores execute instructions in order
- How much occupancy do we actually need?
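The CUDA runtime can estimate this ratio for you. A sketch using the occupancy API available since CUDA 6.5 (the kernel `foo` and block size are illustrative assumptions):

```cuda
#include <cstdio>

__global__ void foo(float *x) { /* ... some kernel ... */ }

int main()
{
    int numBlocks;       // resident blocks per SM for this kernel/block size
    int blockSize = 256;

    // Ask the runtime how many blocks of `foo` can be active per SM at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, foo, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = actual resident threads / hardware maximum per SM.
    float occupancy = (float)(numBlocks * blockSize)
                    / prop.maxThreadsPerMultiProcessor;
    printf("Occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}
```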

BRANCH DIVERGENCE
- A warp can only execute ONE common instruction at a time
- Divergence happens when threads of a warp disagree on their execution path due to a data-dependent conditional branch: if, for, while/do, switch
- The warp serially executes each branch path, disabling the threads that are not on that path, until all paths complete

QUIZ
Divergent or not?

    __global__ void foo( ... ) {
        if ( threadIdx.x % 2 ) { ... }
        else { ... }
    }

    __global__ void foo( int *bar ) {
        if ( bar[threadIdx.x] ) { ... }
        else { ... }
    }

    __global__ void foo( ... ) {
        if ( ( threadIdx.x / warpSize ) % 2 ) { ... }
        else { ... }
    }

    __global__ void foo( int *bar ) {
        int tid = threadIdx.x;
        for ( int i = 0; i < bar[tid]; i++ ) { ... }
    }

NOT ALL MEMORIES ARE BORN EQUAL
Hardware
- Register files (on-chip)
- On-chip L1/L2 cache
- On-chip texture units
- Off-chip GRAM
Software
- Registers: per-thread private local memory
- Shared memory: per-block
- Global memory: accessible to all threads
- Constant and texture cache: global read-only access

GPU MEMORY FACT SHEET

                      Reg (32-bit)   Global    Shared        Const     Texture
    Capacity          255/thread     2-12 GB   16-48 KB/SM   ~10s KB   ~10s KB
    Latency (cycles)  2              ~1000     8             8 (hit)   ~60 (hit)
    Bandwidth         -              High      Very High     Low       Very High
    Scope             thread         global    block         global    global

GLOBAL MEMORY
- Allocatable from host/device
    cudaError_t cudaMalloc ( void** devPtr, size_t size );
    cudaError_t cudaFree ( void* devPtr );
  plus device-side malloc/new/free/delete
- Accessible from device
    ptr[ index ] = value;
- Copyable from host
    cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind );
    cudaError_t cudaMemset ( void* devPtr, int value, size_t count );
- UVA: a single address space for the host and all devices
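Putting these calls together, a typical host-side round trip might be sketched like this (array name and size are illustrative):

```cuda
int main()
{
    const size_t n = 1024;
    float h_data[1024], *d_data;

    for (size_t i = 0; i < n; i++) h_data[i] = (float)i;

    // Allocate global memory on the device and copy the host array in.
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that operate on d_data ...

    // Copy the results back and release the device allocation.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}
```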

ACCESS PATTERN
- Coalesced: adjacent threads access consecutive memory locations
- Aligned: the starting address of the memory access is a multiple of 32 bytes (write, non-caching read) or 128 bytes (caching read)
- Strided: memory accesses spaced uniformly

(Diagrams: coalesced & aligned; strided; coalesced but unaligned; uncoalesced & unaligned)
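The diagrammed patterns correspond to kernels like the following sketch (kernel and array names are illustrative):

```cuda
// Coalesced and aligned: thread i touches element i; one 128-byte
// transaction serves a whole warp.
__global__ void coalesced(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = 0.0f;
}

// Strided: adjacent threads are 2 elements apart, so half of every
// 128-byte transaction is wasted.
__global__ void strided(float *a)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    a[i] = 0.0f;
}

// Coalesced but unaligned: consecutive addresses shifted off the
// 128-byte boundary; the warp may need one extra transaction.
__global__ void unaligned(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    a[i] = 0.0f;
}
```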

QUIZ
Coalesced or not?

    __global__ void foo( int *bar )    { bar[thread_id()] = ...; }
    __global__ void foo( double *bar ) { double e = bar[thread_id()+16]; }
    __global__ void foo( int *bar )    { bar[thread_id()+8] = ...; }
    __global__ void foo( int2 *bar )   { int e = bar[thread_id()].x; }
    __global__ void foo( int *bar )    { bar[thread_id()+13] = ...; }
    __global__ void foo( float4 *bar ) { float e = bar[thread_id()].z; }
    __global__ void foo( int *bar )    { int e = bar[thread_id()+16]; }
    __global__ void foo( int *map, int *bar ) { int e = bar[ map[thread_id()] ]; }

CONSTANT MEMORY
Constant memory ( __constant__ )
- Low latency on hit, low bandwidth (broadcast only)
- const *: the compiler automatically offloads loads to the constant cache
Non-coherent cache ( const * __restrict__ )
- High bandwidth, medium latency
- The compiler automatically offloads loads to the non-coherent cache
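A minimal sketch of declaring, filling, and reading constant memory (the symbol and kernel names are illustrative):

```cuda
// Coefficients in constant memory: cached on-chip, and broadcast to the
// whole warp when all threads read the same address.
__constant__ float coeff[16];

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // blockIdx.x is uniform across the block, so every thread in a warp
    // reads the same constant address -> a single broadcast.
    if (i < n) data[i] *= coeff[blockIdx.x % 16];
}

int main()
{
    float h_coeff[16];
    for (int i = 0; i < 16; i++) h_coeff[i] = 1.0f / (i + 1);

    // Constant memory is written from the host via cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    return 0;
}
```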

SHARED MEMORY
- Visible to all threads of the block, and lives within the lifetime of the block
- Allocated using the __shared__ qualifier
- Much faster than global memory
- Banked access, broadcasting
Static allocation:
    __shared__ int array[32];
Dynamic allocation:
    kernel<<< numBlocks, threadsPerBlock, sharedMemSize >>>( ... );
Template allocation
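Both allocation styles can be sketched as follows (the kernels reverse an array in place; names and sizes are illustrative):

```cuda
// Static: the shared array size is fixed at compile time.
__global__ void reverse_static(int *d, int n)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();               // all writes land before any thread reads
    d[t] = s[n - t - 1];
}

// Dynamic: the size comes from the third <<<>>> launch parameter.
__global__ void reverse_dynamic(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[n - t - 1];
}

int main()
{
    int *d_arr;
    cudaMalloc(&d_arr, 64 * sizeof(int));
    reverse_static<<<1, 64>>>(d_arr, 64);
    reverse_dynamic<<<1, 64, 64 * sizeof(int)>>>(d_arr, 64);
    cudaFree(d_arr);
    return 0;
}
```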

READ-ONLY DATA CACHE (TEXTURE CACHE)
- The underlying memory region is assumed to be immutable during the kernel launch
- 48 KB per SM
- Better random-access performance: 32-byte load granularity, cached
- Hardware takes care of multi-dimensional data locality
Usage:

    __global__ void foo( int *bar, int *map ) {
        int x = __ldg( bar + map[ threadIdx.x ] );
    }

    __global__ void foo2( const int* __restrict__ bar, int *map ) {
        int x = bar[ map[ threadIdx.x ] ];
    }

EXAMPLE 5: IMAGE FILTERING REVISITED
- Shared memory version: copy tiles to shared memory first, then __syncthreads()
- Non-coherent cache version: decorate the image as const * __restrict__

ATOMICS
Race condition: the innocent-looking update

    __shared__ int sum;
    int b = ...;
    sum += b;

actually compiles to a read-modify-write sequence

    __shared__ int sum;
    int b = ...;
    register r = sum;
    r += b;
    sum = r;

so two threads may interleave and lose one update:

    thread 0                  thread 1
    int b0 = ...;             int b1 = ...;
    register r0 = sum;        register r1 = sum;
    r0 += b0;                 r1 += b1;
    sum = r0;                 sum = r1;

- Atomicity: a guarantee that the operation will be performed without interference from other threads
- An atomic performs a read-modify-write operation on one 32-bit or 64-bit word residing in global or shared memory
- modify = add, sub, exchange, etc.
- Only atomicExch() and atomicAdd() are available for float values
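A minimal sketch of fixing such a race with atomicAdd (the kernel name and launch shape are illustrative):

```cuda
#include <cstdio>

// Every thread contributes 1; atomicAdd serializes the read-modify-write
// so no update is lost.
__global__ void count(int *sum)
{
    atomicAdd(sum, 1);
}

int main()
{
    int *d_sum, h_sum = 0;
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_sum, &h_sum, sizeof(int), cudaMemcpyHostToDevice);

    count<<<64, 256>>>(d_sum);   // 64 * 256 = 16384 increments

    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_sum);       // all increments accounted for
    cudaFree(d_sum);
    return 0;
}
```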

WARP SHUFFLE
- Exchanges a variable between threads within a warp (compute capability 3.0 and above)
- Signature: type __shfl( type var, int srcLane, int width = warpSize );
- type = int / float
Variants:
- __shfl(): direct copy from indexed lane
- __shfl_up(): copy from a lane with lower ID relative to the caller
- __shfl_down(): copy from a lane with higher ID relative to the caller
- __shfl_xor(): copy from a lane based on bitwise XOR of own lane ID
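For example, __shfl_down() allows summing across a warp without touching shared memory. A sketch using the pre-CUDA-9 non-`_sync` intrinsics that this 2015 course targets (the function name is illustrative):

```cuda
// Warp-level sum: each step pulls a value from the lane `offset` above
// and halves the number of lanes still carrying partial sums.
__device__ int warp_sum(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;   // lane 0 ends up holding the total for the warp
}
```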

EXAMPLE 6: PARALLEL REDUCTION
- Reduction: a summary of data; summary = summation, mean, max, min, etc.
- Parallel summation: S_n = Σ_{i=0}^{n-1} a_i
- The serial way:
    for ( int i = 0; i < n; i++ ) sum += a[i];
- How to reduce in parallel?
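One common answer is a shared-memory tree reduction, in the spirit of Mark Harris's classic NVIDIA slides; the kernel below is a sketch (names illustrative), producing one partial sum per block that a second pass or the host then combines:

```cuda
// Each block reduces blockDim.x input elements to a single partial sum.
__global__ void block_sum(const int *in, int *out, int n)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0;   // pad the tail with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = s[0];   // one partial sum per block
}
```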

Thank you for coming to this workshop!

ACKNOWLEDGEMENT
Y.H.T. appreciates the invitation from Professor Wei Cai and the support from CSRC.