GPU Hardware Performance. Fall 2015




Atomic Operations
- perform read-modify-write operations on shared or global memory
- no interference with other threads
- for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0)
- on global memory from c. c. 1.1, also on shared memory from c. c. 1.2
- arithmetic (Add, Sub, Exch, Min, Max, Inc, Dec, CAS) and bitwise (And, Or, Xor) operations
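A minimal sketch of the idea: building a histogram with atomicAdd, so that concurrent increments of the same bin do not interfere (the kernel name and bin count are illustrative, not from the lecture):

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

// Each thread atomically increments one bin; concurrent updates of the
// same bin are serialized by the hardware, so no increment is lost.
__global__ void histogram(const unsigned int *data, unsigned int n,
                          unsigned int *bins)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}
```

With a plain `bins[data[i]]++` two threads could read the same old value and one increment would be lost; atomicAdd performs the whole read-modify-write as a single operation.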

Warp Voting
All threads in one warp evaluate the same condition and share the result of the comparison. Available from c. c. 1.2.
int __all(int predicate);
Result is non-zero iff the predicate is non-zero for all the threads in the warp.
int __any(int predicate);
Result is non-zero iff the predicate is non-zero for at least one thread in the warp.
unsigned int __ballot(int predicate);
Returns a bit mask containing the voting bit of each individual thread.
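A small sketch of how the vote functions are typically used (kernel and variable names are illustrative, not from the lecture):

```cuda
// A whole warp can skip a branch when __any() reports that no lane
// needs it; __ballot() + __popc() counts how many lanes passed.
__global__ void countPositives(const float *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (in[i] > 0.0f);

    if (__any(pred)) {                       // at least one lane passed
        unsigned int mask = __ballot(pred);  // bit k set iff lane k passed
        if (threadIdx.x % 32 == 0)
            atomicAdd(out, __popc(mask));    // one update per warp
    }
}
```

(Since CUDA 9 these legacy intrinsics are deprecated in favor of the __any_sync / __ballot_sync variants that take an explicit lane mask.)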

Shuffle Functions
Threads within a warp can efficiently communicate using warp shuffle functions (from c. c. 3.0).
float __shfl(float var, int srcLane, int width = warpSize);
Copy the value from srcLane.
float __shfl_up(float var, unsigned int delta, int width = warpSize);
Copy the value from the thread with lower ID relative to the caller. Analogically __shfl_down.
float __shfl_xor(float var, int laneMask, int width = warpSize);
Copy from a thread based on the bitwise XOR of the caller's own ID and laneMask.
The parameter width defines the number of participating threads. It must be a power of two; indexing starts at 0.
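The __shfl_xor variant enables a butterfly reduction entirely in registers; a sketch (the helper name is illustrative):

```cuda
// Warp-wide sum via the butterfly pattern: in each step every lane
// exchanges its value with the lane whose ID differs in one bit, so
// after log2(warpSize) steps every lane holds the full warp sum.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int mask = warpSize / 2; mask > 0; mask /= 2)
        val += __shfl_xor(val, mask);
    return val;
}
```

Unlike a classic shared-memory reduction, this needs no shared memory and no __syncthreads() within the warp.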

Synchronization of Memory Operations
- the compiler can optimize operations on shared/global memory (intermediate results may be kept in registers) and can reorder them
- if we need to ensure that the data are visible to others, we use __threadfence() or __threadfence_block()
- if a variable is declared as volatile, all load/store operations are implemented in shared/global memory
- very important if we assume implicit warp synchronization

Global Synchronization Using Atomic Operations
An alternative implementation of a vector reduction:
- each block sums the elements in its part of the vector
- barrier (weak global barrier)
- one block sums the results of all the blocks

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float *array, unsigned int N, volatile float *result)
{
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();
    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;
        }
    }
}

Global Memory
- the performance of global memory easily becomes a bottleneck
- global memory bandwidth is low relative to the arithmetic performance of the GPU (G200: 24 FLOPS/float, GF100: 30, GK110: 62, GM200: 73)
- 400-600 cycles of latency
- throughput can be significantly worse with a bad parallel access pattern
- the memory has to be accessed coalesced
- use of just a certain subset of memory regions should be avoided (partition camping)

Coalesced Memory Access (C. C. < 2.0)
- GPU memory needs to be accessed in larger blocks for efficiency
- global memory is split into 64 B segments
- two of these segments are aggregated into 128 B segments

Coalesced Memory Access (C. C. < 2.0)
- a half-warp can transfer the data using a single transaction (or one to two transactions when transferring a 128 B word)
- it is necessary to use large words: one memory transaction can transfer 32 B, 64 B, or 128 B words
- GPUs with c. c. < 1.2: the accessed block has to begin at an address divisible by 16× the data size; the k-th thread has to access the k-th block element; some threads needn't participate
- if these rules are not obeyed, each element is retrieved using a separate memory transaction

Coalesced Memory Access (C. C. < 2.0)
- GPUs with c. c. ≥ 1.2 are less restrictive
- each transfer is split into 32 B, 64 B, or 128 B transactions so as to serve all requests with the least number of transactions
- the order of threads can be arbitrarily permuted w.r.t. the transferred elements

Coalesced Memory Access (C. C. < 2.0)
Threads are aligned, the element block is contiguous, and the order is not permuted: coalesced access on all GPUs.

Unaligned Memory Access (C. C. < 2.0)
Threads are not aligned, contiguous elements are accessed, and the order is not permuted: one transaction on GPUs with c. c. ≥ 1.2.

Unaligned Memory Access (C. C. < 2.0)
A similar case may result in a need for two transactions.

Unaligned Memory Access Performance (C. C. < 2.0)
Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing performance to 1/8. Newer GPUs (c. c. ≥ 1.2) perform two transfers.

Interleaved Memory Access Performance (C. C. < 2.0)
The bigger the spaces between elements, the bigger the performance drop; on GPUs with c. c. < 1.2 the effect is rather dramatic.
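The difference between the access patterns above can be sketched with two copy kernels (illustrative, not from the lecture):

```cuda
// Coalesced: consecutive threads access consecutive elements, so a
// warp's requests fall into few 32/64/128 B segments.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Interleaved: consecutive threads are `stride` elements apart, so a
// warp touches many segments and effective bandwidth drops with stride.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```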

Global Memory Access with Fermi (C. C. = 2.x)
Fermi has L1 and L2 cache
- L1: 256 B per row, 16 KB or 48 KB per multiprocessor in total
- L2: 32 B per row, 768 KB on the GPU in total
What are the advantages?
- more efficient programs with unpredictable data locality
- more efficient when shared memory is not used for some reason
- unaligned access: in principle no slowdown
- interleaved access: data needs to be used before it is flushed from the cache, otherwise the same or bigger problem as with c. c. < 2.0 (L1 cache may be turned off to avoid overfetching)

Global Memory Access with Gaming Kepler (C. C. = 3.0)
- there is only L2 cache for read/write global memory access
- L2: 32 B per row, up to 1.5 MB per GPU
- L1: used for local memory only; 16 KB, 32 KB, or 48 KB in total

Global Memory Access with Full-featured Kepler and Maxwell (C. C. ≥ 3.5)
- read-only data cache shared with textures
- the compiler tries to use it; we can help with __restrict__ and __ldg()
- slower than Fermi's L1
- Maxwell does not have L1 cache for local memory: inefficient for programs heavily using local memory

Partition Camping
- relevant for c. c. 1.x (and AMD GPUs)
- processors based on G80 have 6 regions, those based on G200 have 8 regions of global memory
- the memory is split into the regions in 256 B chunks
- even access among the regions by the individual blocks is needed for maximum performance
- blocks are usually run in the order given by their position in the grid
- if only a part of the regions is used, the resulting condition is called partition camping
- generally not as critical as coalesced access; more tricky, problem-size dependent, and hidden from the fine-grained perspective

HW Organization of Shared Memory
- shared memory is organized into memory banks, which can be accessed in parallel
- c. c. 1.x: 16 banks, c. c. 2.0: 32 banks
- the memory space is mapped in an interleaved way with a 32 b shift, or a 64 b shift on c. c. 3.x
- to use the full memory performance, we have to access data in different banks
- a broadcast is implemented if all threads access the same data

Bank Conflict
A bank conflict occurs when some threads in a warp/half-warp access data in the same memory bank, with several exceptions:
- threads access exactly the same data
- threads access different half-words of a 64 b word (c. c. 3.x)
When a conflict occurs, memory access gets serialized; the performance drop is proportional to the number of parallel operations that the memory has to perform to serve the request.

Access without Conflicts

n-way Conflicts

Broadcast

Access Patterns
Alignment is not needed, no bank conflicts are generated:
int x = s[threadIdx.x + offset];
Interleaving does not create conflicts if c is odd; on c. c. 3.x there is also no conflict if c = 2 and 32 b values are accessed:
int x = s[threadIdx.x * c];
Access to the same variable never generates conflicts on c. c. 2.x, while on 1.x only if the count of threads accessing the variable is a multiple of 16:
int x = s[threadIdx.x % c];
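A classic remedy when a column-wise access pattern would land every thread in the same bank is to pad the shared array; a sketch (tile size and names are illustrative):

```cuda
#define TILE 32

// Tiled matrix transpose. Without the "+ 1" padding, the column read
// tile[threadIdx.x][threadIdx.y] would put all 32 lanes into the same
// bank (a 32-way conflict); the extra column shifts each row by one bank.
__global__ void transpose(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free read
}
```

(This assumes a square matrix whose width is a multiple of TILE and a grid that covers it exactly.)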

Other Memory Types
Transfers between the host and GPU memory
- need to be minimized (often at the cost of decreasing the efficiency of the computation on the GPU)
- may be accelerated using page-locked memory
- it is more efficient to transfer large blocks at once
- computations and memory transfers should be overlapped
Texture memory
- designed to reduce the number of transfers from the global memory
- works well for aligned access
- does not help if latency is the bottleneck
- may simplify addressing or add filtering
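The page-locked-memory and overlap points can be sketched as follows (buffer names, sizes, and the kernel are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    float *hostBuf, *devBuf;
    cudaStream_t stream;

    // Page-locked (pinned) host memory: faster transfers, and a
    // prerequisite for truly asynchronous cudaMemcpyAsync.
    cudaHostAlloc(&hostBuf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&devBuf, N * sizeof(float));
    cudaStreamCreate(&stream);

    // Copy and kernel are issued into one stream; with several streams,
    // one chunk's transfer can overlap another chunk's computation.
    cudaMemcpyAsync(devBuf, hostBuf, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(N + 255) / 256, 256, 0, stream>>>(devBuf, N);
    cudaMemcpyAsync(hostBuf, devBuf, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```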

Other Memory Types
Constant memory
- as fast as registers if the same value is read
- performance decreases linearly with the number of different values read
Registers
- read-after-write latency; hidden if at least 192 threads are running for c. c. 1.x, or at least 768 threads for c. c. 2.x
- bank conflicts are possible even in registers
- the compiler tries to avoid them; we can make its life easier by setting the block size to a multiple of 64