1 Evolution of the NVIDIA GPU Architecture
Jason Lowden
Advanced Computer Architecture
November 7, 2012

2 Agenda
- Introduction of the NVIDIA GPU
- Graphics Pipeline
- GPU Terminology
- Architecture of a GPU
- Computing Elements
- Memory Types
- Fermi Architecture
- Kepler Architecture
- GPUs as a Computational Device
- CUDA Programming
- Performance Comparison
- Relation to SMT, Vector Processors, and DSPs
- Summary

3 NVIDIA GPU History
- First GPU released in 1999
  - Used for graphics processing
  - GeForce and Quadro
- CUDA architecture released in 2006
  - Designed for use by industry and academia as a computing device
  - Move toward commodity parallel processing
- Tesla GPU series released in 2007
- Fermi architecture announced in 2009 (products shipped in 2010)
- Kepler architecture released in 2012

4 Graphics Pipeline

5 Terminology
- Thread: the smallest unit in the hierarchy of device computation
- Block: a group of threads
- Grid: a group of blocks
- Warp: a group of 32 threads that execute together on the device
- Kernel: a function that runs on the GPU; launching it creates the grid of threads that execute it
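The relationship between threads, warps, blocks, and grids comes down to a little integer arithmetic. The following is a minimal host-side C sketch of it, assuming one-dimensional blocks and grids; the function names are hypothetical and are not part of CUDA.

```c
#define WARP_SIZE 32  /* threads per warp, as stated on the slide */

/* Number of warps needed to hold a block of `threads_per_block` threads.
   A partially filled last warp still counts as a whole warp. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Index of the warp that a thread belongs to within its block. */
int warp_of_thread(int thread_in_block)
{
    return thread_in_block / WARP_SIZE;
}

/* Position (lane) of a thread inside its warp, in the range 0..31. */
int lane_of_thread(int thread_in_block)
{
    return thread_in_block % WARP_SIZE;
}

/* Total number of threads in a 1-D grid of 1-D blocks. */
int threads_in_grid(int blocks_per_grid, int threads_per_block)
{
    return blocks_per_grid * threads_per_block;
}
```

For example, a block of 100 threads occupies 4 warps, with the last warp only partly filled, which is one reason block sizes are usually chosen as multiples of 32.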

6 Architecture of a GPU
- Same components as a typical CPU; however:
  - More computing elements
  - More types of memory
- Original GPUs had vertex and pixel shaders
  - Specifically for graphics
- Modern GPUs are slightly different
  - CUDA: Compute Unified Device Architecture

7 Computational Elements of a GPU
- Streaming Processor (SP)
  - Core of the design
  - Where all of the computation takes place
- Streaming Multiprocessor (SM)
  - A group of streaming processors
  - In addition to the SPs, also contains the Special Function Units and Load/Store Units
- Instruction Schedulers
- Complex Control Logic

8 Streaming Multiprocessor Architecture

9 Types of GPU Memory
- Global
  - DRAM
  - Slowest performance
- Texture
  - Cached global memory
  - Bound at runtime
- Constant
  - Cached global memory
- Shared
  - Local to a block of threads

10 Architectural Memory Hierarchy

11 Fermi Architecture

12 Fermi Improvements
- Increased number of SPs per SM
- Unified request path for load/store instructions
- Implementation of a cache hierarchy
  - L1 cache per SM, configurable with shared memory
  - L2 cache shared globally
- Register spilling
  - Occurs when the register requirements of a thread exceed what is available on the device
  - Previous generation: spill to DRAM (global memory)
  - Fermi: spills use the L1 cache

13 Summary

14 Kepler SM Overview
- Goal: improve GPU performance and power efficiency
  - Up to 3 times the performance per watt of Fermi
- Increased to 192 SPs per SM
- 32 Special Function Units (SFUs)
- Improved warp scheduling

15 Kepler SM Design

16 Warp Scheduler
- 4 warp schedulers per SM
- Each scheduler selects a warp that is ready to issue and can issue up to 2 independent instructions from it

17 Kepler Memory Architecture
- Shared memory and L1 are still physically shared
  - New configuration: 32 KB L1, 32 KB shared
  - Shared memory bandwidth is doubled compared with Fermi
- Increased the size of L2
  - Double the size of Fermi's, at 1536 KB
- Introduction of the read-only data cache
  - In Fermi, this cache was used only as the texture cache
  - 48 KB of storage

18 Warp Shuffle Instructions
- In Fermi, data could only be exchanged between threads using shared memory
  - Resulted in additional synchronization time
- Kepler adds shuffle functions, which
  - Exchange data between threads without using shared memory
  - Handle the store and load operation as a single step
  - Can only share data within the same warp
- In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction
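A common use of warp shuffles is summing one value per thread across a warp. The sketch below emulates that pattern on the CPU in plain C, under the assumption that each lane reads the value held by the lane `delta` positions above it (the behavior CUDA exposes through its `__shfl_down` family of intrinsics). The function names here are hypothetical, and the arrays stand in for per-thread registers rather than shared memory.

```c
#define WARP_SIZE 32

/* Emulates one shuffle-down step: every lane reads the value held by the
   lane `delta` positions above it. A lane whose source would fall outside
   the warp keeps its own value. */
static void shuffle_down(const float in[WARP_SIZE], float out[WARP_SIZE],
                         int delta)
{
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        int src = lane + delta;
        out[lane] = (src < WARP_SIZE) ? in[src] : in[lane];
    }
}

/* Sums the 32 per-lane values in log2(32) = 5 exchange steps.
   The total ends up in lane 0; upper lanes hold partial results. */
float warp_reduce_sum(const float lane_values[WARP_SIZE])
{
    float cur[WARP_SIZE], recv[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        cur[lane] = lane_values[lane];

    for (int delta = WARP_SIZE / 2; delta > 0; delta /= 2) {
        shuffle_down(cur, recv, delta);
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            cur[lane] += recv[lane];
    }
    return cur[0];
}
```

Each step halves the number of lanes that still hold useful partial sums, so no shared buffer or block-level barrier appears anywhere in the exchange.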

19 Kepler Hardware Features
- Dynamic Parallelism
  - Any kernel can launch more kernels from within itself
  - Takes additional load off of the CPU
- Hyper-Q
  - 32 hardware-managed work queues
  - Fermi had 1 queue
- Grid Management Unit (GMU)
  - Needed to manage the number of grids that are executed
  - Handles all of the grids that can be active at one time
- NVIDIA GPUDirect™
  - CUDA-enabled GPUs can interact without the need for CPU intervention
  - The GPU can interact directly with the NIC

20 Comparison of Kepler and Fermi

21 Use for Computation
- Historically, GPUs were used for graphics to offload CPU work
- Current trend
  - Combine CPU and GPU on a single chip
- With their large number of processing cores, GPUs are well suited to massively parallel workloads
  - However, they are only ideal when there are few data dependencies
- Introduction of CUDA and the Tesla GPUs

22 CUDA Programming
- Extensions to the C language
  - With some C++ support
- Programming support
  - Windows: Visual Studio
  - Linux/Mac: Eclipse
- Programming paradigm where each computation takes place on a separate thread
- Requires an NVIDIA GPU for acceleration
  - Simulators are used for research purposes

23 Example: Vector Addition
C:
    for( int i = 0; i < SIZE; ++i )
    {
        c[ i ] = a[ i ] + b[ i ];
    }
CUDA:
    __global__ void addvectors( float* a, float* b, float* c )
    {
        // One thread per element; threadIdx.x is the index within a single block.
        int id = threadIdx.x;
        if( id < SIZE )
        {
            c[ id ] = a[ id ] + b[ id ];
        }
    }
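The kernel on this slide indexes with the thread index alone, so it covers a single block. A usual extension to multiple blocks combines the block index, block size, and thread index into one global index. The following C sketch emulates that scheme on the CPU by running the "threads" one after another; `add_vectors_thread` and `add_vectors_emulated` are hypothetical names, and the parameters `block_idx`, `block_dim`, and `thread_idx` stand in for CUDA's `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```c
/* Stand-in for the body of a vector-add kernel, executed by one "thread". */
static void add_vectors_thread(const float *a, const float *b, float *c, int n,
                               int block_idx, int block_dim, int thread_idx)
{
    int id = block_idx * block_dim + thread_idx;  /* global element index */
    if (id < n)                                   /* skip threads past the end */
        c[id] = a[id] + b[id];
}

/* Emulates launching ceil(n / block_dim) blocks of block_dim threads each.
   On a GPU these iterations would run in parallel rather than in a loop. */
void add_vectors_emulated(const float *a, const float *b, float *c,
                          int n, int block_dim)
{
    int grid_dim = (n + block_dim - 1) / block_dim;
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            add_vectors_thread(a, b, c, n, block, block_dim, thread);
}
```

With 10 elements and blocks of 4 threads, 3 blocks (12 threads) are launched, and the bounds check keeps the last 2 threads from writing past the end of the arrays.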

24 Programming Requirements
- Explicit memory operations to allocate and copy data from the CPU to the GPU
  - Some exceptions do apply
- All kernels execute asynchronously with respect to the CPU
  - Explicit synchronization barriers between the processors

25 Synchronization and Performance
- To meet data dependencies:
  - Synchronization primitives
    - __syncthreads(): synchronizes all threads in a block
  - Atomic operations
    - Depending on the compute capability and CUDA version, these are possible on global and shared memory
- Performance is dictated by memory operations and synchronization cost
  - Memory coalescing
  - Warp divergence
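Memory coalescing can be illustrated by counting how many memory segments a warp touches for a given access pattern. The C sketch below is a simplified model, not a description of any specific GPU: it assumes 4-byte elements, a 128-byte transaction size, a base address aligned to a segment boundary, and a non-negative stride. The function name is hypothetical.

```c
#define WARP_SIZE     32
#define SEGMENT_BYTES 128  /* assumed transaction size for this model */

/* Counts the distinct SEGMENT_BYTES-sized segments touched when lane i of
   a warp reads the 4-byte element at index i * stride (stride >= 0). */
int segments_touched(int stride)
{
    int count = 0;
    long last_segment = -1;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        long byte_addr = (long)lane * stride * 4;
        long segment = byte_addr / SEGMENT_BYTES;
        /* Addresses never decrease, so a new segment always differs
           from the previous one. */
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

In this model a unit stride lets the whole warp be served from one segment, while a stride of 32 elements puts every lane in its own segment, which is the access pattern coalescing is meant to avoid.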

26 Performance Comparison

27 Relation to Other Architectures
- SMT
  - Many smaller cores, with less functionality, to compute results
  - Each core has a hardware context for a thread that can be switched out
- Vector Processors
  - Results that a CPU would compute sequentially are computed in parallel
  - Ability to access large chunks of data from memory at a given time
  - Banks of shared memory could lead to bank conflicts
- Digital Signal Processors
  - Like DSP algorithms, many GPU applications rely on multiply-accumulate (MAC) operations; MAC elements are built into the GPU by design
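The bank conflicts mentioned under vector processors can be made concrete with a small model. The C sketch below assumes 32 banks that each serve one 4-byte word, with consecutive words mapped to consecutive banks, and a stride of at least 1; accesses by several lanes to the same word (a broadcast) are not modeled. The function name is hypothetical.

```c
#define WARP_SIZE 32
#define NUM_BANKS 32  /* assumed: 32 banks, one 4-byte word per bank */

/* Returns the largest number of lanes that map to the same bank when
   lane i accesses word index i * stride (stride >= 1).
   A result of 1 means the access is conflict-free. */
int bank_conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        int bank = (lane * stride) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions odd strides are conflict-free, a stride of 2 produces 2-way conflicts, and a stride of 32 sends all lanes to one bank, so padding a row from 32 to 33 words is a common way to remove the conflict.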

28 Conclusions
- GPUs are massively parallel devices that can be used for general-purpose computing in addition to graphics processing
- As the cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems
- In addition to compute capabilities, Kepler offers higher performance per watt, making for a more power-efficient design
- When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms

29 References
- S. L. Alarcon, CUDA Memories, unpublished.
- NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: amming_guide.pdf
- NVIDIA. (2009). NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi. [Online]. Available: cture_whitepaper.pdf
- NVIDIA. (2012). NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110. [Online]. Available: Kepler GK110 Architecture Whitepaper.pdf
- NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: GTX 680 Whitepaper FINAL.pdf
