GPU Optimization

Topics
- Threading details
  - Wavefronts and warps
  - Thread scheduling for both AMD and NVIDIA GPUs
  - Predication
- Optimization
  - Thread mapping
  - Device occupancy
  - Vectorization

Work Groups to HW Threads
- OpenCL kernels are structured into work groups that map to device compute units
- Compute units on GPUs consist of SIMT processing elements
- Work groups are automatically broken down into hardware-schedulable groups of threads for the SIMT hardware
- This schedulable unit is known as a warp (NVIDIA) or a wavefront (AMD)

Work-Item Scheduling
- Hardware creates wavefronts by grouping threads of a work group, along the X dimension first
- All threads in a wavefront execute the same instruction; threads within a wavefront move in lockstep
- Threads have their own register state and are free to execute different control paths
  - Thread masking is used by the hardware; predication can be set by the compiler
[Figure: grouping of a work group into wavefronts - work-items (0,0), (0,1), ... are packed along the X dimension first, filling Wavefront 0, then Wavefront 1, 2, 3]
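
As a concrete illustration (not from the original slides), the linear position of a work-item inside its work group, and therefore its wavefront/warp index, can be computed in the kernel. WAVEFRONT_SIZE is an assumed build-time constant (64 on the AMD hardware discussed here, 32 on NVIDIA):

    __kernel void show_wavefront_id(__global int *wf_id)
    {
        // Flatten the 2D local ID along the X dimension first,
        // matching how the hardware packs work-items into wavefronts.
        int flat_lid = get_local_id(1) * get_local_size(0) + get_local_id(0);

        // WAVEFRONT_SIZE is an assumed compile-time define (e.g. -DWAVEFRONT_SIZE=64).
        int wave = flat_lid / WAVEFRONT_SIZE;

        int flat_gid = get_global_id(1) * get_global_size(0) + get_global_id(0);
        wf_id[flat_gid] = wave;
    }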

Wavefront Scheduling - AMD
- Wavefront size is 64 threads
- Each thread executes a 5-way VLIW instruction issued by the common issue unit
- A Stream Core (SC) executes one VLIW instruction
- 16 stream cores execute 16 VLIW instructions on each cycle
- A quarter wavefront is executed on each cycle; the entire wavefront is executed in four consecutive cycles
[Figure: SIMD engine with an issue and branch control unit, stream cores SC 0 - SC 15, and the Local Data Share]

Warp Scheduling - Nvidia
- Work groups are divided into 32-thread warps, which are scheduled by an SM
- On Nvidia GPUs, half-warps are issued at a time and interleave their execution through the pipeline
- The number of warps available for scheduling depends on the resources used by each block
- Similar to wavefronts in AMD hardware, except for the size difference
[Figure: a work group divided into Warp 0 (t0-t31), Warp 1 (t32-t63), Warp 2 (t64-t95), ...; streaming multiprocessor with instruction fetch/dispatch, SPs, and shared memory]

Divergent Control Flow
- Instructions are issued in lockstep within a wavefront/warp for both AMD and Nvidia
- However, each work-item can execute a different path from the other threads in its wavefront
- If work-items within a wavefront take divergent control-flow paths, the hardware masks off the work-items on the paths that are not taken
- Branching should be limited to wavefront granularity to prevent issuing wasted instructions

Predication and Control Flow
- How do we handle threads going down different execution paths when the same instruction is issued to all the work-items in a wavefront?
- Predication is a method for mitigating the costs associated with conditional branches
  - Beneficial for branches to short sections of code
  - Based on the fact that executing an instruction and squashing its result may be as efficient as evaluating the conditional
- Compilers may replace switch or if-then-else statements with branch predication

Predication for GPUs
- A predicate is a condition code that is set to true or false based on a conditional
- Both cases of the conditional flow get scheduled for execution
  - Instructions with a true predicate are committed
  - Instructions with a false predicate do not write results or read operands
- Benefits performance only for very short conditionals

    kernel void test()
    {
        int tid = get_local_id(0);
        if (tid % 2 == 0)
            Do_Some_Work();
        else
            Do_Other_Work();
    }

- Predicate = true for threads 0, 2, 4, ...; predicate = false for threads 1, 3, 5, ...
- Predicates are switched for the else branch
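
For very short conditionals like the one above, the same effect can be expressed explicitly without a branch. The sketch below is not from the original slides; it assumes both paths reduce to computing a value and uses the ternary operator so every work-item evaluates both results, which is essentially what hardware predication does:

    __kernel void test_branchless(__global float *out)
    {
        int tid = get_local_id(0);

        // Hypothetical per-path results standing in for Do_Some_Work / Do_Other_Work.
        float even_result = (float)tid * 2.0f;
        float odd_result  = (float)tid + 100.0f;

        // Both values are computed by every work-item; the condition only
        // selects which one is kept.
        out[get_global_id(0)] = (tid % 2 == 0) ? even_result : odd_result;
    }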

Divergent Control Flow
- Case 1: all even work-items execute the if branch while all odd work-items execute the else branch. Both the if and else blocks need to be issued for every wavefront
- Case 2: all threads of the first wavefront execute the if case while the other wavefronts execute the else case. Only one of the if or else blocks is issued per wavefront

Case 1 - conditional with divergence:

    int tid = get_local_id(0);
    if (tid % 2 == 0)          // Even work-items
        DoSomeWork();
    else
        DoSomeWork2();

Case 2 - conditional with no divergence (wavefront size 64):

    int tid = get_local_id(0);
    if (tid / 64 == 0)         // Full first wavefront
        DoSomeWork();
    else if (tid / 64 == 1)    // Full second wavefront
        DoSomeWork2();

Effect of Predication on Performance
- Time for Do_Some_Work() = t1 (if case); time for Do_Other_Work() = t2 (else case)
- At T = t_start, the predicate for if (tid % 2 == 0) has been set
- From T = t_start to T = t_start + t1, Do_Some_Work() is issued to the whole wavefront; only threads with a true predicate keep valid results, the invalid results are squashed and the mask is inverted
- From T = t_start + t1 to T = t_start + t1 + t2, Do_Other_Work() is issued; now only the threads with the inverted (false) predicate keep valid results, the rest are squashed
- Total time taken = t_start + t1 + t2

Pitfalls of using Wavefronts
- The OpenCL specification does not address warps/wavefronts or provide a means to query their size across platforms
  - AMD GPUs (5870) have 64 threads per wavefront while NVIDIA has 32 threads per warp
  - NVIDIA's OpenCL extensions (discussed later) return the warp size only for Nvidia hardware
- Maintaining performance and correctness across devices becomes harder
  - Code hardwired to 32 threads per warp will waste execution resources when run on AMD hardware with 64-thread wavefronts
  - Code hardwired to 64 threads per warp can lead to races and affects the local memory budget when run on Nvidia hardware
  - We have only discussed GPUs; the Cell does not have wavefronts
- Maintaining portability: assign the warp size at JIT time (see the sketch below)
  - Check whether we are running on AMD or Nvidia hardware and add a -DWARP_SIZE=<size> define to the build command
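
A minimal host-side sketch of this idea (not from the original slides): query the device vendor with clGetDeviceInfo and pass the corresponding warp/wavefront size to clBuildProgram as a preprocessor define. The 32/64 defaults and the vendor-string matching are assumptions; real code should handle other vendors as well.

    #include <CL/cl.h>
    #include <string.h>

    /* Build 'program' for 'device' with -DWARP_SIZE chosen per vendor (sketch). */
    cl_int build_with_warp_size(cl_program program, cl_device_id device)
    {
        char vendor[256] = {0};
        clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);

        /* Assumed defaults: 64 for AMD wavefronts, 32 for NVIDIA warps. */
        const char *options = "-DWARP_SIZE=32";
        if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
            options = "-DWARP_SIZE=64";

        return clBuildProgram(program, 1, &device, options, NULL, NULL);
    }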

Topics
- Threading details
  - Wavefronts and warps
  - Thread scheduling for both AMD and NVIDIA GPUs
  - Predication
- Optimization
  - Thread mapping
  - Device occupancy
  - Vectorization

Thread Mapping
- Thread mapping determines which threads will access which data
- Proper mappings can align with hardware and provide large performance benefits
- Improper mappings can be disastrous to performance

Thread Mapping
- By using different mappings, the same thread can be assigned to access different data elements
- The examples below show three different possible mappings of threads to data (assuming the thread ID is used to access an element)

    int group_size = get_local_size(0) * get_local_size(1);

Mapping 1:

    int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

Thread IDs:
     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15

Mapping 2:

    int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

Thread IDs:
     0  4  8 12
     1  5  9 13
     2  6 10 14
     3  7 11 15

Mapping 3 (*assuming 2x2 work groups):

    int tid = get_group_id(1) * get_num_groups(0) * group_size +
              get_group_id(0) * group_size +
              get_local_id(1) * get_local_size(0) + get_local_id(0);

Thread IDs:
     0  1  4  5
     2  3  6  7
     8  9 12 13
    10 11 14 15

Thread Mapping
- Consider a serial matrix multiplication algorithm (sketched below)
- This algorithm is suited for output data decomposition
- We will create N×M threads, effectively removing the outer two loops
- Each thread will perform P calculations; the inner loop will remain as part of the kernel
- Should the index space be M×N or N×M?
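
For reference, a minimal serial version of the computation being discussed (not from the original slides); the loop bounds N, M, P and the row-major indexing are assumptions matching the discussion:

    /* C[i][j] = sum_k A[i][k] * B[k][j]; row-major data,
       C is N x M, A is N x P, B is P x M (assumed shapes). */
    void matmul_serial(const float *A, const float *B, float *C,
                       int N, int M, int P)
    {
        for (int i = 0; i < N; i++) {          /* outer loops: removed when   */
            for (int j = 0; j < M; j++) {      /* each output gets its thread */
                float sum = 0.0f;
                for (int k = 0; k < P; k++)    /* inner loop: stays in kernel */
                    sum += A[i * P + k] * B[k * M + j];
                C[i * M + j] = sum;
            }
        }
    }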

Thread Mapping
- Thread mapping 1: with an M×N index space, the kernel is written as in the first sketch below
- Thread mapping 2: with an N×M index space, the kernel is written as in the second sketch below
- Both mappings produce functionally equivalent versions of the program
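
The kernel bodies did not survive the transcription; the following sketches reconstruct what the two mappings typically look like for C = A × B with row-major data (the variable names and exact loop structure are assumptions):

    /* Mapping 1: get_global_id(0) indexes rows, get_global_id(1) indexes columns. */
    __kernel void matmul_mapping1(__global const float *A, __global const float *B,
                                  __global float *C, int N, int M, int P)
    {
        int i = get_global_id(0);   /* row of C    */
        int j = get_global_id(1);   /* column of C */
        float sum = 0.0f;
        for (int k = 0; k < P; k++)
            sum += A[i * P + k] * B[k * M + j];
        C[i * M + j] = sum;
    }

    /* Mapping 2: get_global_id(0) indexes columns, so consecutive work-items
       (consecutive tx) touch consecutive elements of B and C. */
    __kernel void matmul_mapping2(__global const float *A, __global const float *B,
                                  __global float *C, int N, int M, int P)
    {
        int j = get_global_id(0);   /* column of C */
        int i = get_global_id(1);   /* row of C    */
        float sum = 0.0f;
        for (int k = 0; k < P; k++)
            sum += A[i * P + k] * B[k * M + j];
        C[i * M + j] = sum;
    }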

Thread Mapping
[Figure: execution times of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs]
- Notice that mapping 2 is far superior in performance on both GPUs

Why?
- Due to data accesses on the global memory bus
- Assuming row-major data, the data in a row (i.e., elements in adjacent columns) are stored sequentially in memory
- To ensure coalesced accesses, consecutive threads in the same wavefront should be mapped to columns (the second dimension) of the matrices
  - This gives coalesced accesses to Matrices B and C
  - For Matrix A, the inner-loop iterator i3 determines the access pattern for row-major data, so thread mapping does not affect it

Thread Mapping
- In mapping 1, consecutive threads (tx) are mapped to different rows of Matrix C, and non-consecutive threads (ty) are mapped to columns of Matrix B
- This mapping causes inefficient memory accesses

Thread Mapping
- In mapping 2, consecutive threads (tx) are mapped to consecutive elements in Matrices B and C
- Accesses to both of these matrices will be coalesced
- The degree of coalescing depends on the work-group and data sizes

Thread Mapping
- In general, threads can be created and mapped to any data element by manipulating the values returned by the thread identifier functions
- The following matrix transpose example shows how thread IDs can be modified to achieve efficient memory accesses

Matrix Transpose
- A matrix transpose is a straightforward operation: Out(x,y) = In(y,x)
- No matter which thread mapping is chosen, one operation (the read or the write) will produce coalesced accesses while the other produces uncoalesced accesses
- Note that the data must be read into a temporary location (such as a register) before being written to the new location (see the naive kernel sketched below)
[Figure: threads 0-3 either read In coalesced and write Out uncoalesced, or read In uncoalesced and write Out coalesced]
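
A minimal naive transpose kernel of the kind being described (a sketch, not taken from the slides; explicit width/height parameters and row-major storage are assumed):

    /* Naive transpose: the read from 'in' is coalesced (consecutive work-items
       read consecutive elements of a row), but the write to 'out' is strided
       and therefore uncoalesced. */
    __kernel void transpose_naive(__global const float *in, __global float *out,
                                  int width, int height)
    {
        int x = get_global_id(0);   /* column of in */
        int y = get_global_id(1);   /* row of in    */
        if (x < width && y < height)
            out[x * height + y] = in[y * width + x];
    }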

Matrix Transpose
- If local memory is used to buffer the data between the read and the write, we can rearrange the thread mapping to provide coalesced accesses in both directions (see the sketch below)
- Note that the work group must be square
[Figure: threads 0-3 read In coalesced into local memory and write Out coalesced; the global memory index stays 0 1 2 3 while the local memory index becomes 0 4 8 12, i.e. the transpose happens in local memory]
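
A sketch of the local-memory version (not verbatim from the slides): each square work group stages a TILE×TILE block in local memory and writes it back transposed, so both the global read and the global write are coalesced. TILE is an assumed compile-time constant equal to the work-group dimension.

    #define TILE 16   /* assumed square work-group size */

    __kernel void transpose_local(__global const float *in, __global float *out,
                                  int width, int height)
    {
        __local float tile[TILE][TILE];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_group_id(0) * TILE + lx;   /* column in 'in' */
        int gy = get_group_id(1) * TILE + ly;   /* row in 'in'    */

        /* Coalesced read: consecutive lx reads consecutive addresses. */
        if (gx < width && gy < height)
            tile[ly][lx] = in[gy * width + gx];

        barrier(CLK_LOCAL_MEM_FENCE);

        /* Swap tile coordinates so the global write is also coalesced. */
        int ox = get_group_id(1) * TILE + lx;   /* column in 'out' */
        int oy = get_group_id(0) * TILE + ly;   /* row in 'out'    */
        if (ox < height && oy < width)
            out[oy * height + ox] = tile[lx][ly];
    }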

Matrix Transpose
- The following figure shows a performance comparison of the two transpose kernels for matrices of size N×M on an AMD 5870 GPU
- "Optimized" uses local memory and thread remapping
[Figure: time (s) versus matrix order (1024, 2048, 3072, 4096) for the Naive and Optimized kernels]

Occupancy
- On current GPUs, work groups get mapped to compute units
  - When a work group is mapped to a compute unit, it cannot be swapped off until all of its threads complete their execution
- If there are enough resources available, multiple work groups can be mapped to the same compute unit at the same time
  - Wavefronts from another work group can be swapped in to hide latency
- Resources are fixed per compute unit (number of registers, local memory size, maximum number of threads)
- The term occupancy describes how well the resources of the compute unit are being utilized

Occupancy - Registers
- The availability of registers is one of the major limiting factors for larger kernels
- The maximum number of registers required by a kernel must be available for all threads of a work group
- Example: consider a GPU with 16384 registers per compute unit running a kernel that requires 35 registers per thread
  - Each compute unit can execute at most 468 threads (16384 / 35)
  - This affects the choice of work-group size: is a work group of 512 possible? Of 256? Of 128? (Worked through below)
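
One way to check these work-group sizes against the register budget (a worked example using the numbers above, not part of the original slides):

    512 threads x 35 registers = 17,920 > 16,384  -> a 512-thread work group does not fit
    256 threads x 35 registers =  8,960           -> one 256-thread work group fits (a second would need 17,920)
    128 threads x 35 registers =  4,480           -> three 128-thread work groups fit (13,440); a fourth would need 17,920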

Occupancy - Registers
- Consider another example:
  - A GPU has 16384 registers per compute unit
  - The work-group size of a kernel is fixed at 256 threads
  - The kernel currently requires 17 registers per thread
- Given this information, each work group requires 4352 registers
  - This allows for 3 active work groups if registers are the only limiting factor
- If the code can be restructured to use only 16 registers, then 4 active work groups would be possible

Occupancy - Local Memory
- GPUs have a limited amount of local memory on each compute unit
  - 32KB of local memory on AMD GPUs
  - 32-48KB of local memory on NVIDIA GPUs
- Local memory limits the number of active work groups per compute unit
- Depending on the kernel, the data per work group may be fixed regardless of the number of threads (e.g., histograms), or may vary with the number of threads (e.g., matrix multiplication, convolution)
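
As a quick worked example with assumed numbers (not from the slides): the local-memory transpose sketch above stages a 16×16 tile of floats, i.e. 16 × 16 × 4 bytes = 1 KB of local memory per work group, so on a 32 KB compute unit local memory alone would allow up to 32 active work groups; registers or the thread limit would bind first.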

Occupancy - Threads
- GPUs have hardware limitations on the maximum number of threads per work group
  - 256 threads per work group on AMD GPUs
  - 512 threads per work group on NVIDIA GPUs
- NVIDIA GPUs have per-compute-unit limits on the number of active threads and work groups (depending on the GPU model)
  - 768 or 1024 threads per compute unit
  - 8 or 16 warps per compute unit
- AMD GPUs have GPU-wide limits on the number of wavefronts
  - 496 wavefronts on the 5870 GPU (~25 wavefronts or ~1600 threads per compute unit)
- These limits can be queried at run time (see the sketch below)
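
The limits differ per device and per kernel, so portable code usually queries them at run time. A small host-side sketch (not from the slides) using standard OpenCL queries:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Print the device-wide and kernel-specific work-group limits. */
    void print_wg_limits(cl_device_id device, cl_kernel kernel)
    {
        size_t dev_max_wg = 0, kern_max_wg = 0;
        cl_ulong local_mem = 0;

        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(dev_max_wg), &dev_max_wg, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, NULL);
        /* Limit for this particular kernel (accounts for its resource usage). */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kern_max_wg), &kern_max_wg, NULL);

        printf("device max work-group size : %zu\n", dev_max_wg);
        printf("device local memory        : %llu bytes\n",
               (unsigned long long)local_mem);
        printf("kernel max work-group size : %zu\n", kern_max_wg);
    }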

Occupancy - Limiting Factors
- The minimum of these three factors is what limits the number of active threads (the occupancy) of a compute unit
- The interactions between the factors are complex
  - The limiting factor may have either thread or wavefront granularity
  - Changing the work-group size may affect register or shared memory usage
  - Reducing any factor (such as register usage) slightly may allow another work group to be active
- The CUDA occupancy calculator from NVIDIA plots these factors, allowing the tradeoffs to be visualized

CUDA Occupancy Calculator
1. Enter the hardware model and kernel requirements
2. Resource usage and limiting factors are displayed
3. Graphs are shown to visualize the limiting factors
http://developer.download.nvidia.com/compute/cuda/cuda_occupancy_calculator.xls

Vectorization
- On AMD GPUs, each processing element executes a 5-way VLIW instruction
  - 5 scalar operations, or 4 scalar operations + 1 transcendental operation
[Figure: a compute unit with processing elements PE 0 ... PE n-1; each PE contains registers, four ALUs plus an ALU/T-unit for transcendentals, and a branch unit receiving the incoming instruction]

Vectorization
- Vectorization allows a single thread to perform multiple operations at once
- Explicit vectorization is achieved by using vector datatypes (such as float4) in the source program (see the sketch below)
  - When a number is appended to a datatype, the datatype becomes an array of that length
  - Operations can be performed on vector datatypes just like regular datatypes
  - Each ALU operates on a different element of the float4 data
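
A small sketch of explicit vectorization (not from the slides): the same element-wise addition written with scalar float and with float4, which processes four elements per work-item. The array length is assumed to be a multiple of 4 in the vectorized version.

    /* Scalar version: one float per work-item. */
    __kernel void add_scalar(__global const float *a, __global const float *b,
                             __global float *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    /* Vectorized version: one float4 (four elements) per work-item.
       Launched with a global size of n/4. */
    __kernel void add_float4(__global const float4 *a, __global const float4 *b,
                             __global float4 *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];   /* the four lanes map onto the VLIW ALUs */
    }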

Vectorization
- Vectorization improves memory performance on AMD GPUs
- The AMD Accelerated Parallel Processing OpenCL Programming Guide compares float to float4 memory bandwidth

Summary
- Although writing a simple OpenCL program is relatively easy, optimizing the code can be very difficult
- Improperly mapping loop iterations to OpenCL threads can significantly degrade performance
- When creating work groups, hardware limitations (number of registers, size of local memory, etc.) need to be considered
  - Work groups must be sized appropriately to maximize the number of active threads and properly hide latencies
- Vectorization is an important optimization for AMD GPU hardware
  - Though not covered here, vectorization may also help performance when targeting CPUs