CUDA Programming. Week 4. Shared memory and register

Similar documents

CUDA Basics. Murphy Stein New York University

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

HPC with Multicore and GPUs

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

GPU Parallel Computing Architecture and CUDA Programming Model

Introduction to GPU hardware and to CUDA

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

1. If we need to use each thread to calculate one output element of a vector addition, what would

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Hardware Performance. Fall 2015

Rootbeer: Seamlessly using GPUs from Java

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Black-Scholes option pricing. Victor Podlozhnyuk

Image Processing & Video Algorithms with CUDA

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

OpenACC 2.0 and the PGI Accelerator Compilers

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Introduction to CUDA C

OpenCL Programming for the CUDA Architecture. Version 2.3

CUDA Debugging. GPGPU Workshop, August Sandra Wienke Center for Computing and Communication, RWTH Aachen University

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

OpenACC Basics Directive-based GPGPU Programming

12 Tips for Maximum Performance with PGI Directives in C

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Next Generation GPU Architecture Code-named Fermi

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Lecture 1: an introduction to CUDA

Hands-on CUDA exercises

GPU Tools Sandra Wienke

Multi-Threading Performance on Commodity Multi-Core Processors

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Binary search tree with SIMD bandwidth optimization using SSE

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Advanced CUDA Webinar. Memory Optimizations

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

Learn CUDA in an Afternoon: Hands-on Practical Exercises

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

Parallel Programming Survey

Accelerate your Application with CUDA C

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

IMAGE PROCESSING WITH CUDA

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Programming GPUs with CUDA

Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Compute Spearman Correlation Coefficient with Matlab/CUDA

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Altera SDK for OpenCL

Lecture 3. Optimising OpenCL performance

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Linda Shapiro Winter 2015

What is Multi Core Architecture?

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

GPGPUs, CUDA and OpenCL

Optimizing matrix multiplication Amitabha Banerjee

Physical Data Organization

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

OpenMP and Performance

U N C L A S S I F I E D

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

Parallel and Distributed Computing Programming Assignment 1

Analysis of Binary Search algorithm and Selection Sort algorithm

Rethinking SIMD Vectorization for In-Memory Databases

GPU Programming Strategies and Trends in GPU Computing

GPU Accelerated Monte Carlo Simulations and Time Series Analysis

Chapter 11 I/O Management and Disk Scheduling

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

GPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer

Introduction to GPU Programming Languages

GPU Performance Analysis and Optimisation

Lecture 22: C Programming 4 Embedded Systems

Introduction to GPU Architecture

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

A quick tutorial on Intel's Xeon Phi Coprocessor

Module: Software Instruction Scheduling Part I

Sources: On the Web: Slides will be available on:

Using the Intel Inspector XE

7th Marathon of Parallel Programming WSCAD-SSC/SBAC-PAD-2012

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013

Database Management Systems

Efficient Sparse Matrix-Vector Multiplication on CUDA

Transcription:

CUDA Programming Week 4. Shared memory and register

Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework

SHARED MEMORY AND BANK CONFLICTION

Memory bank To achieve high memory bandwidth, shared memory is divided into equally-sized memory modules, called banks Memory read/write made of n addresses in n distinct banks can be serviced simultaneously There are 16 banks, which are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per two clock cycles.

Bank conflict If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.

No bank conflict shared float shared[32]; float data = shared[baseindex + tid]; Or float data = shared[baseindex + 3*tid];

Another example Random access

2 way bank conflict 2-way shared double shared[32]; double data = shared[baseindex + tid];

4 way bank conflict shared char shared[32]; char data = shared[baseindex + tid]; Solution: char data = shared[baseindex + 4 * tid];

8 way bank conflict For example, a structure of 4 floats.

Broadcast No bank conflict for broadcast Right: This access causes either no bank conflicts if the word from bank 5 is the broadcast during the first step or 2-way bank conflicts.

How to avoid bank conflict? The old fashion method: (don t use it) shared int shared_lo[32]; shared int shared_hi[32]; double datain; shared_lo[baseindex+tid]= double2loint(datain); shared_hi[baseindex+tid]= double2hiint(datain); double dataout = hiloint2double(shared_hi[baseindex+tid], shared_lo[baseindex+tid]); For array of structures, bank conflict can be reduced by changing it to structure of array. Memory padding

MEMORY PADDING

Example: C=A*B global void matmult(const float* a, size_t lda, const float* b, size_t ldb, float* c, size_t ldc, int n){ shared float mata[block_size][block_size]; shared float matb[block_size][block_size]; const int tidc = threadidx.x; const int tidr = threadidx.y; const int bidc = blockidx.x * BLOCK_SIZE; const int bidr = blockidx.y * BLOCK_SIZE;... for(j = 0; j < n; j += BLOCK_SIZE) { mata[tidr][tidc] = a[(tidr+bidr)*lda+tidc+j]; matb[tidr][tidc] = b[(tidr+j)*ldb+tidc+bidc]; syncthreads(); for(i = 0; i < BLOCK_SIZE; i++) result += mata[tidr][i] * matb[i][tidc];...

Memory access pattern Suppose block size = 16 The memory access of mata[tidr][i] The memory access of matb[i][tidc] Bank conflict

Memory padding When declaring the shared memory, pad extra space matb[block_size][block_size+1]; When access the memory, the addresses become 0, 17(=1 mod 16), 34(=2 mod 16)

REGISTER ALLOCATION

Register partition In G80, 8,192 registers in each SM. Each register is 4byte long. 32KB The automatic variables declared in a CUDA kernel are placed into registers Register file is the fastest and the largest on chip memory Keep as much data as possible in registers

Register allocation example Thread Contexts TB0 TB1 TB2 Thread Contexts Insufficient registers to allocate 3 blocks SP0 SP7 SP0 SP7 32KB Register File 16KB Shared Memory 32KB Register File 16KB Shared Memory X (a) Pre- optimization (b) Post- optimization

Example: Configuration 1 Assume that each block has 256 thread, and each thread uses 10 registers. 3 Blocks can run on each SM Requires 256*3*10=7,680 registers (<8,192) 768/32=24 warps. Suppose for every 4 instructions there is a memory load. Each takes 200 cycles. For normal instruction, a warp takes 4 cycle 4x4x24=384 cycles > 200 cycles

Example: Configuration 2 If a compiler can 11 registers to change the dependence pattern so that 8 independent instructions exist for each global memory load Each block needs 256*11=2,816 registers Only two blocks can run on each SM However, one only needs 200/(8*4) = 7 Warps to tolerate the memory latency Two Blocks have 16 Warps. The performance can be actually higher!

Vasily Volkov s matrix-matrix multiplication CASE STUDY

Some analysis Counter i in for( int i = 0; i < n; i++ ) consumes 2KB in registers for block size = 512 Similarly, i++ translates into 512 increments Use smaller block size If block size=64, 64 increments and 256 bytes But we need that many threads to hide communication

Smaller block size Strip mine longer vectors into shorter at the program level if necessary E.g. instead of using float a; and BS=512 use float a*8+; and BS=64 Instead of a += b;, BS=512 use a*0+ += b*0+; ; a*7+ += b*7+;, BS=64 Use more registers for a, but less space for i, as well as less computation for increasing i.

Matrix Matrix Multiply: C=C+A*B Keep A s and C s blocks in registers Keep B s block in a shared storage No other sharing is needed if C s height = BS. BS=64 is the best result from experiments Choose large enough width of C s block 16 is enough as 2/(1/64+1/16) = 26 way reuse Choose a convenient thickness for A s and B s blocks

Code global void sgemmnn( const float *A, int lda, const float *B, int ldb, float* C, int ldc, int k, float alpha, float beta ){ // Compute pointers to the data A += blockidx.x*64+threadidx.x +threadidx.y*16; B += threadidx.x+(blockidx.y*16+threadidx.y)*ldb; C += blockidx.x*64+threadidx.x+(threadidx.y+ blockidx.y * ldc ) * 16; // declare the shared memory shared float bs[16][17]; Declare on chip float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

The framework const float *Blast = B + k; do{ #pragma unroll for( int i = 0; i < 16; i += 4 ) bs[threadidx.x][threadidx.y+i] = B[i*ldb]; // Read next B s block B += 16; syncthreads();... // the computation part: next slide syncthreads(); } while( B < Blast ); //Store C s block to memory for( int i = 0; i < 16; i++, C += ldc ) C[0] = alpha*c[i] + beta*c[0]; }

#pragma unroll // The bottleneck: Read A s columns for( int i = 0; i < 16; i++, A += lda ){ c[0] += A[0]*bs[i][0]; c[1] += A[0]*bs[i][1]; c[2] += A[0]*bs[i][2]; c[3] += A[0]*bs[i][3]; c[4] += A[0]*bs[i][4]; c[5] += A[0]*bs[i][5]; c[6] += A[0]*bs[i][6]; c[7] += A[0]*bs[i][7]; A c[8] += A[0]*bs[i][8]; c[9] += A[0]*bs[i][9]; c[10] += A[0]*bs[i][10]; c[11] += A[0]*bs[i][11]; c[12] += A[0]*bs[i][12]; c[13] += A[0]*bs[i][13]; c[14] += A[0]*bs[i][14]; c[15] += A[0]*bs[i][15]; } B C Outer product

Some statistics The table from Vasily s talk This record keeps a while until Feb 2010

HOMEWORK

Prefix Sum EX: a = [3 1 7 0 4 1 6 3]. The prefix sum of a is [0 3 4 11 11 15 16 22]. One of the fundamental tool in many algorithms Radix sort, compression, etc The sequential code ps[0] = 0; For(int j=1, j<n; j++) ps[j] = ps[j-1] + a[j-1];

Parallel prefix sum for d = 1 to log 2 n do for all k in parallel do if k >=2 d then x[k] = x[k 2 d-1 ] + x[k]

Homework Play with Vasily s code SGEMM In the repository of the google group Read parallel prefix sum implementation on GPU http://http.developer.nvidia.com/gpugems3/gpu gems3_ch39.html Implement your own