Memory Hardware and Bank Conflict


GPU Memory II: Memory Hardware and Bank Conflict

CUDA Device Memory Space: Review

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read per-grid constant memory (read-only)
- Read per-grid texture memory (read-only)

The host can R/W the global, constant, and texture memories.

Parallel Memory Sharing

Local memory: per-thread. Private to each thread; holds automatic variables and register spills.

Shared memory: per-block. Shared by the threads of the same block; used for inter-thread communication.

Global memory: per-application. Shared by all threads; used for inter-grid communication between sequential grids in time.

Hardware Overview

The Streaming Processor Array is organized as Thread Processor Clusters (TPCs). Each TPC contains Streaming Multiprocessors (SMs) and a texture unit (TEX). Each SM contains an instruction L1 cache, a data L1 cache, an instruction fetch/dispatch unit, shared memory, eight Streaming Processors (SPs), and two Special Function Units (SFUs).

Register File

Each SM has a 32 KB register file (RF) that provides 4 operands per clock. The texture pipe can also read/write the RF (2 SMs share 1 TEX unit), and so can the load/store pipe.

Programmer View of Register File

There are 8192 registers in each SM in G80. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks: each thread in a block can only access the registers assigned to itself. Depending on per-thread usage, an SM may hold, for example, 4 blocks or only 3 blocks.

Matrix Multiplication Example

If each block has 16x16 threads and each thread uses 10 registers, how many blocks can run on each SM?
- Each block requires 10 * 256 = 2560 registers.
- 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned.

What if each thread increases its register use by 1?
- Each block now requires 11 * 256 = 2816 registers.
- 8192 < 3 * 2816, so only two blocks can run on an SM: a 1/3 reduction in parallelism!

More on Dynamic Partitioning

Dynamic partitioning gives more flexibility to compilers and programmers: one can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grained threading than traditional CPU threading models. The compiler can trade off between instruction-level parallelism and thread-level parallelism.

Constant Memory

Supports immediate address constants and indexed address constants. Constants are stored in DRAM and cached on chip (one constant L1 cache per SM). A constant value can be broadcast to all threads in a warp, making this an extremely efficient way of accessing a value that is common to all threads in a block.

Shared Memory

Each multiprocessor has 16 KB of shared memory, organized as 16 banks of 32-bit words (access patterns are discussed later). It is visible to all threads in a thread block, with read and write access.

Matrix Multiplication Example

Explore a tile-based implementation with shared memory. Questions: How is shared memory organized? What are the issues when accessing shared memory?

Tile-Based Multiplication

One block computes one square sub-matrix Psub of P of size BLOCK_SIZE x BLOCK_SIZE, and one thread computes one element of Psub. Assume that the dimensions of M and N are multiples of BLOCK_SIZE and that the matrices are square.

Tiled Matrix Multiplication Kernel -- Review

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
3.  int bx = blockIdx.x;  int by = blockIdx.y;
4.  int tx = threadIdx.x; int ty = threadIdx.y;
5.  int Row = by * TILE_WIDTH + ty;
6.  int Col = bx * TILE_WIDTH + tx;
7.  float Pvalue = 0;
8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
       // Collaborative loading of Md and Nd tiles into shared memory
9.     Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
10.    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
11.    __syncthreads();
12.    for (int k = 0; k < TILE_WIDTH; ++k)
13.      Pvalue += Mds[ty][k] * Nds[k][tx];
14.    __syncthreads();
15. }
16. Pd[Row*Width + Col] = Pvalue;
}

Shared Memory Organization

Parallel memory architecture: memory is divided into banks (Bank 0 through Bank 15), which is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to the same bank result in a bank conflict, and conflicting accesses are serialized.

Shared Memory Access Issues: No Bank Conflicts

Two conflict-free cases:
- Linear addressing with stride == 1: thread i accesses bank i.
- Random 1:1 permutation: each thread accesses a different bank.

Shared Memory Access Issues: Bank Conflicts

- 2-way bank conflicts: linear addressing with stride == 2 (pairs of threads share a bank).
- 8-way bank conflicts: linear addressing with stride == 8 (eight threads share a bank).

How Addresses Map to Banks in CUDA

Each bank has a bandwidth of 32 bits per clock cycle, and successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = (word address) % 16. This is the same as the size of a half-warp: there are no bank conflicts between different half-warps, only within a single half-warp.

Shared Memory Performance

Shared memory is as fast as registers if there are no bank conflicts.

The fast cases:
- If all threads of a half-warp access different banks, there is no bank conflict.
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).

The slow case:
- Bank conflict: multiple threads in the same half-warp access the same bank. The accesses must be serialized; the cost is the maximum number of simultaneous accesses to a single bank.

Linear Addressing (1D)

Given:

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks: with 16 banks on G80, s must be odd. The diagrams on the slide show s = 1 and s = 3, both conflict-free.

Data Types and Bank Conflicts

This has no conflicts if the type of shared is 32 bits:

foo = shared[baseIndex + threadIdx.x];

But not if the data type is smaller.

4-way bank conflicts:

__shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];

2-way bank conflicts:

__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];

Structs and Bank Conflicts

Struct assignments compile into as many memory accesses as there are struct members:

struct vector { float x, y, z; };
struct mytype { float f; int c; };
__shared__ struct vector vectors[64];
__shared__ struct mytype mytypes[64];

This has no bank conflicts for vector: the struct size is 3 words, so each thread makes 3 accesses to contiguous banks (3 shares no common factor with 16):

struct vector v = vectors[baseIndex + threadIdx.x];

This has 2-way bank conflicts for mytype (2 accesses per thread):

struct mytype m = mytypes[baseIndex + threadIdx.x];

Common Array Bank Conflict Patterns (1D)

Each thread loads 2 elements into shared memory. 2-way-interleaved loads result in 2-way bank conflicts:

int tid = threadIdx.x;
shared[2*tid]   = global[2*tid];
shared[2*tid+1] = global[2*tid+1];

This pattern makes sense for traditional CPU threads, where it improves locality in cache-line usage and reduces sharing traffic. It does not make sense for shared memory, where there are no cache-line effects but there are banking effects.

A Better Array Access Pattern

Each thread loads one element in every consecutive group of blockDim.x elements:

shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];

Each load has stride 1, so it is conflict-free.

Common Bank Conflict Patterns (2D)

Operating on a 2D array of floats in shared memory (e.g., image processing). Example: a 16x16 block where each thread processes a row, so the threads in a block access the elements of each column simultaneously. This causes 16-way bank conflicts: all rows start at bank 0.

Solution 1: pad the rows by adding one float to the end of each row. Without padding, every row of a 16-wide array begins at bank 0; with a row width of 17, row r begins at bank r, so the elements of a column fall in 16 different banks.

Solution 2: transpose before processing (but suffer the bank conflicts during the transpose).

Does Matrix Multiplication Incur Shared Memory Bank Conflicts?

12. for (int k = 0; k < TILE_WIDTH; ++k)
13.   Pvalue += Mds[ty][k] * Nds[k][tx];
14. __syncthreads();
15. }

Consider the two shared memory accesses in the inner loop, Mds[ty][k] and Nds[k][tx].

In terms of linear addresses, the two inner-loop accesses are:

Mds[ty*TILE_WIDTH + k]
Nds[k*TILE_WIDTH + tx]

Consider the Mds access, Mds[ty*TILE_WIDTH + k]. For TILE_WIDTH = 16, the whole half-warp accesses the same shared memory location. This would be a conflict, but the GPU supports broadcasting: no conflict.

For the Mds access with TILE_WIDTH = 8, a half-warp spans two rows of the block, so its first eight threads and its second eight threads access two different shared memory locations: an 8-way bank conflict.

For the Mds access with TILE_WIDTH = 4, a half-warp spans four rows and accesses four different shared memory locations: a 4-way bank conflict.

Now consider the Nds access, Nds[k*TILE_WIDTH + tx]. For TILE_WIDTH = 16, each thread in a half-warp accesses a different shared memory location: no conflict.

For the Nds access with TILE_WIDTH = 8: since the memory storage organization is row-major for a 2D array, the situation is the same as with TILE_WIDTH = 16. No conflict.