GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile




Outline of lecture
- Recap of Lecture 3
- Control flow
- Coalescing
- Latency hiding
- Occupancy

Recap
Shared memory
- Small on-chip memory
- Fast (roughly 100x faster than global memory)
- Allows communication among threads within a block
Tiling
- Use shared memory as a cache

Recap
Tiling examples for FD
- Solution 1: fetch from global memory at block boundaries
- Solution 2: use halo nodes; all threads load to shared memory, only some operate
- Solution 3: use halo nodes; load to shared memory in multiple steps, all threads operate

Recap
Bank conflicts
- Arrays in shared memory are divided into banks
- Access to different data in the same bank by more than one thread of the same warp creates a bank conflict
- A bank conflict serializes the access into the minimum number of conflict-free requests

Control flow
If statement
- Threads are executed in warps
- Within a warp, the hardware cannot execute the if and else branches at the same time!

__global__ void function()
{
    ...
    if (condition) {
        ...
    } else {
        ...
    }
}

Control flow
How does the hardware deal with an if statement?
[Figure: a warp reaches a branch instruction; some threads take path A and others path B - warp divergence!]

Control flow
The hardware serializes the different execution paths.
Recommendations
- Try to make every thread in the same warp do the same thing. If the if statement cuts at a multiple of the warp size, there is no warp divergence and the instruction can be done in one pass.
- Remember that threads are placed consecutively into warps (t0-t31, t32-t63, ...), but we cannot rely on any execution order among warps.
- If you can't avoid branching, try to make as many consecutive threads as possible do the same thing (a sketch follows below).
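A minimal sketch (kernel names and the arithmetic are illustrative, not from the slides) contrasting a branch that diverges inside a warp with one that splits cleanly at a warp boundary:

__global__ void divergent(float *data)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd and even threads of the same warp take different paths,
    // so the warp executes both paths one after the other.
    if (t % 2 == 0)
        data[t] *= 2.0f;
    else
        data[t] += 1.0f;
}

__global__ void uniform(float *data)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    // The condition changes only every warpSize (32) threads, so each
    // warp takes a single path and there is no divergence.
    if ((t / warpSize) % 2 == 0)
        data[t] *= 2.0f;
    else
        data[t] += 1.0f;
}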

Control flow - An illustration
Let's implement a sum reduction. In a serial code, we would just loop over all elements and add.
Idea:
- To implement it in parallel, let's take every other value and add it in place to its neighbor
- Do the same thing to the resulting array until only one value is left

extern __shared__ float partialSum[];
int t = threadIdx.x;
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

[Figure (David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign): even-numbered threads (0, 2, 4, 6, 8, 10, ...) add neighboring array elements in place (0+1, 2+3, 4+5, ...); after each iteration the partial sums cover wider ranges (0...3, 4...7, 8...11, then 0...7, 8...15) until one value remains.]

Control flow - An illustration
Advantages
- Gives the right result
- Runs in parallel
Disadvantages
- The number of active threads decreases every iteration, but we are using many more warps than needed: there is warp divergence in every iteration
- No more than half of the threads in the block are executing in any iteration
Let's change the algorithm a little bit...

Control flow - An illustration
Improved version
- Instead of adding neighbors, let's add values that sit half a section away
- Divide the stride by two after each iteration

extern __shared__ float partialSum[];   // holds 2*blockDim.x elements
int t = threadIdx.x;
for (int stride = blockDim.x; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
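A hedged sketch of how this loop might sit inside a complete kernel (the load and write-back around it are assumptions, not shown on the slides): each block reduces a section of 2*blockDim.x elements of input and writes one partial sum.

__global__ void blockReduce(float *input, float *blockSums)
{
    extern __shared__ float partialSum[];             // 2*blockDim.x floats

    unsigned int t     = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Each thread loads two elements of the section
    partialSum[t]              = input[start + t];
    partialSum[t + blockDim.x] = input[start + t + blockDim.x];

    for (unsigned int stride = blockDim.x; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];        // one result per block
}

Launch sketch: blockReduce<<<nBlocks, nThreads, 2*nThreads*sizeof(float)>>>(d_in, d_sums);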

[Figure: improved reduction for a 32-element section - in the first iteration thread 0 computes 0+16 and thread 15 computes 15+31; subsequent iterations halve the stride until element 0 holds the sum.]

Control flow - An illustration
512 elements

Iteration   Exec. threads   Warps
1           256             16
2           128             8
3           64              4
4           32              2
5           16              1
6           8               1
7           4               1
8           2               1

Upper rows: more executing threads than a warp. Lower rows: fewer executing threads than a warp - warp divergence!

Control flow - An illustration
We get warp divergence only in the last 5 iterations.
Warps are shut down as the iterations progress
- This happens much faster than in the previous case
- Resources are better utilized
- For the last 5 iterations, only 1 warp is still active

Control flow - Loop divergence
Work per thread is data dependent

__global__ void per_thread_sum(int *indices, float *data, float *sums)
{
    ...
    for (int j = indices[i]; j < indices[i+1]; j++) {
        sum += data[j];
    }
    sums[i] = sum;
}

David Tarjan - NVIDIA

Control flow - Loop divergence
The warp won't finish until its last thread finishes
- The whole warp gets dragged along
Possible solution:
- Try to flatten the peaks by making each thread work on multiple data items (a sketch follows below)
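One possible way to flatten the peaks (a sketch under assumptions: the kernel name, ROWS_PER_THREAD and num_rows are illustrative, not from the slides) is to give each thread several consecutive rows, so that one long row is averaged out over the rest of that thread's work:

#define ROWS_PER_THREAD 4

__global__ void per_thread_sum_flattened(int *indices, float *data,
                                         float *sums, int num_rows)
{
    int first = (blockDim.x * blockIdx.x + threadIdx.x) * ROWS_PER_THREAD;
    for (int r = first; r < first + ROWS_PER_THREAD && r < num_rows; r++) {
        float sum = 0.0f;
        for (int j = indices[r]; j < indices[r + 1]; j++)
            sum += data[j];
        sums[r] = sum;
    }
}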

Memory coalescing
Global memory is accessed in aligned chunks of 32, 64 or 128 bytes.
The following protocol is used to issue a memory transaction for a half warp (valid for 1.X):
- Find the memory segment that contains the address requested by the lowest numbered active thread
- Find the other active threads whose requested addresses lie in the same segment, and reduce the transaction size if possible
- Do the transaction; mark the serviced threads as inactive
- Repeat until all threads in the half warp are serviced
Worst case: fetch 32 bytes and use only 4 bytes - 7/8 of the bandwidth is wasted!

Memory coalescing
Access pattern visualization
- Thread 0 is the lowest active thread and accesses address 116, which belongs to the 128-byte segment 0-127
[Figure (David Tarjan - NVIDIA): threads t0-t15 mapped onto addresses 0-288, with the candidate 128-byte, 64-byte and 32-byte segments marked.]

Memory coalescing
We will be looking at compute capability 1.X examples.
Simple access pattern: one 64-byte transaction

Memory coalescing
Sequential but misaligned access: either one 128-byte transaction, or one 64-byte transaction plus one 32-byte transaction, depending on where the addresses fall

Memory coalescing
Strided access: one 128-byte transaction, but half of the bandwidth is wasted

Memory coalescing
Example: copy with a stride

__global__ void strideCopy(float *odata, float *idata, int stride)
{
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}
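A companion sketch (an assumption, not from the slides) for the sequential-but-misaligned case described two slides back: every thread's address is shifted by a fixed offset, so a warp's accesses straddle a segment boundary and extra transactions are issued.

__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}

Launched with, e.g., offsetCopy<<<nBlocks, 256>>>(d_out, d_in, 1), the accesses are no longer aligned to a 64- or 128-byte segment (the arrays must be allocated with room for the offset).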

Memory coalescing
2.X architecture
- Global memory is cached: accesses cached in both L1 and L2 use 128-byte transactions; accesses cached only in L2 use 32-byte transactions
- Transactions are issued for whole warps instead of half warps

Memory coalescing - SoA or AoS?
Array of structures

struct record {
    int key;
    int value;
    int flag;
};

record *d_records;
cudaMalloc((void**) &d_records, ...);

David Tarjan - NVIDIA

Memory coalescing - SoA or AoS?
Structure of arrays

struct SoA {
    int *keys;
    int *values;
    int *flags;
};

SoA d_SoA_data;
cudaMalloc((void**) &d_SoA_data.keys, ...);
cudaMalloc((void**) &d_SoA_data.values, ...);
cudaMalloc((void**) &d_SoA_data.flags, ...);

David Tarjan - NVIDIA

Memory coalescing - SoA or AoS?
cudaMalloc returns aligned memory, so accessing the SoA layout makes much more efficient use of bandwidth.

__global__ void bar(record *AoS_data, SoA SoA_data)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // AoS wastes bandwidth
    int key = AoS_data[i].key;
    // SoA: efficient use of bandwidth
    int key_better = SoA_data.keys[i];
}

David Tarjan - NVIDIA

Latency hiding
A warp is not scheduled until all of its threads have finished the previous instruction. These instructions can have high latency (e.g. a global memory access). Ideally one wants enough warps to keep the GPU busy during the waiting time.
[Figure (David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign): over time, the SM's multithreaded warp scheduler interleaves instructions from different warps (warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96), hiding latency.]

Latency hiding
How many independent instructions do we need to hide a latency of L clock cycles?
- 1.X: L/4 instructions - the SM issues one instruction per warp over four clock cycles
- 2.0: L instructions - the SM issues one instruction per warp over two clock cycles, for two warps at a time
- 2.1: 2L instructions - the SM issues a pair of instructions per warp over two clock cycles, for two warps at a time

Latency hiding
Doing the exercise
- An arithmetic instruction takes 22 clock cycles to complete:
  1.X: we need 6 available warps to hide the latency
  2.X: we need 22 available warps to hide the latency
- A fetch from global memory takes ~600 cycles! The number of warps needed depends on the number of arithmetic instructions per memory access (arithmetic intensity), e.g. if the ratio is 15:
  10 warps for 1.X
  40 warps for 2.X
(The arithmetic is spelled out below.)
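Spelling out the arithmetic behind those figures (assuming the issue rates of the previous slide and one independent instruction per warp between stalls):

\[
\text{1.X: } \left\lceil \tfrac{22}{4} \right\rceil = 6 \text{ warps}, \qquad \text{2.0: } 22 \text{ warps}
\]
\[
\text{1.X: } \frac{600/4}{15} = 10 \text{ warps}, \qquad \text{2.X: } \frac{600}{15} = 40 \text{ warps}
\]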

Occupancy
Occupancy is the ratio of resident warps to the maximum number of resident warps.
Occupancy is determined by the resource limits of the SM:
- Maximum number of blocks per SM
- Maximum number of threads
- Shared memory
- Registers

Occupancy
Pass --ptxas-options=-v to the compiler to report register and shared memory usage.
Use the CUDA Occupancy Calculator.
The idea is to find the number of threads per block that gives maximum occupancy.
Occupancy != performance, but low-occupancy codes have a hard time hiding latency (a sketch of querying occupancy programmatically follows below).
Demonstration
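A hedged sketch of querying occupancy programmatically (an assumption beyond this 2011 lecture: it relies on the cudaOccupancyMaxActiveBlocksPerMultiprocessor API from later CUDA toolkits, and the kernel name is illustrative):

#include <cstdio>

__global__ void myKernel(float *x) { /* ... */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;                       // candidate threads per block
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    float occupancy = (float)(blocksPerSM * blockSize) /
                      prop.maxThreadsPerMultiProcessor;
    printf("Occupancy: %.2f\n", occupancy);
    return 0;
}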

Measuring performance
The right measure depends on the program
- Compute bound: operations per second. Performance is given by the ratio of the number of operations to the kernel time. Peak performance depends on the architecture: Fermi ~1 TFLOP/s.
- Memory bound: bandwidth. Performance is given by the ratio of the number of bytes moved through global memory to the kernel time. Peak bandwidth for a GTX 280 (1.107 GHz memory clock, 512-bit wide bus, double data rate):
  1107 x 10^6 x (512/8) x 2 / 10^9 = 141.7 GB/s
A timing sketch using CUDA events follows below.
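A minimal, self-contained sketch (the copy kernel, array size and launch configuration are assumptions for illustration) of timing a kernel with CUDA events and converting the measured time into an effective bandwidth:

#include <cstdio>

__global__ void copyKernel(float *odata, const float *idata)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    odata[i] = idata[i];
}

int main()
{
    const int N = 1 << 22;                      // 4M floats
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<N / 256, 256>>>(d_out, d_in);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // milliseconds

    // N floats read + N floats written
    float gbps = 2.0f * N * sizeof(float) / (ms * 1.0e6f);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}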