Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.

Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr

GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering! 2

Outline Computer architecture crash course The simplest processor Exploiting instruction-level parallelism GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory 3

The free lunch era... was yesterday 1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements Exponential performance increase Software compatibility preserved Do not rewrite software, buy a new machine! Hennessy, Patterson. Computer Architecture, a quantitative approach. 4 th Ed. 2006 4

Computer architecture crash course How does a processor work? Or was working in the 1980s to 1990s: modern processors are much more complicated! An attempt to sum up 30 years of research in 15 minutes 5

Machine language: instruction set Registers Example State for computations Keeps variables and temporaries Instructions R0, R1, R2, R3... R31 Perform computations on registers, move data between register and memory, branch Instruction word Binary representation of an instruction 01100111 Assembly language Readable form of machine language ADD R1, R3 Examples Which instruction set does your laptop/desktop run? Your cell phone? 6

The Von Neumann processor PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus State machine Let's look at it step by step 7

Step by step: Fetch The processor maintains a Program Counter (PC) Fetch: read the instruction word pointed by PC in memory PC Fetch unit 01100111 Instruction word 01100111 Memory 8

Decode Split the instruction word to understand what it represents Which operation? ADD Which operands? R1, R3 PC Fetch unit 01100111 Instruction word Decoder Operands R1, R3 Operation Memory ADD 9

Read operands Get the value of registers R1, R3 from the register file PC Fetch unit Instruction word Decoder Operands R1, R3 Register file Memory 42, 17 10

Execute operation Compute the result: 42 + 17 PC Fetch unit Instruction word Decoder Operands Operation Memory Register file 42, 17 ADD Arithmetic and Logic Unit 59 11

Write back Write the result back to the register file PC Fetch unit Instruction word Decoder R1 Operands 59 Operation Memory Register file Arithmetic and Logic Unit Result bus 12

Increment PC PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Result bus 13

Load or store instruction Can read and write memory from a computed address PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Load/Store Unit Result bus 14

Branch instruction Instead of incrementing PC, set it to a computed value PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus 15

What about the state machine? The state machine controls everybody Sequences the successive steps Send signals to units depending on current state At every clock tick, switch to next state Clock is a periodic signal (as fast as possible) Fetchdecode Read operands Increment PC Read operands Writeback Execute 16

Recap We can build a real processor As it was in the early 1980's You are here How did processors become faster? 17

Reason 1: faster clock Progress in semiconductor technology allows higher frequencies Frequency scaling But this is not enough! 18

Going faster using ILP: pipeline Idea: we do not have to wait until instruction n has finished to start instruction n+1 Like a factory assembly line Or the bandeijão 20

Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 21

Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 2: mul 1: add Execute Writeback Independent instructions can follow each other Exploits ILP to hide instruction latency 22

Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 3: load 2: mul Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 23

Superscalar execution Multiple execution units in parallel Independent instructions can execute at the same time Fetch Decode Execute Writeback Decode Execute Writeback Decode Execute Writeback Exploits ILP to increase throughput 24

Locality Time to access main memory: ~200 clock cycles One memory access every few instructions Are we doomed? Fortunately: principle of locality ~90% of memory accesses on ~10% of data Accessed locations are often the same Temporal locality Access the same location at different times Spacial locality Access locations close to each other 26

Caches Large memories are slower than small memories The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache L1 cache L2 cache Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 27

Branch prediction What if we have a branch? We do not know the next PC to fetch from until the branch executes Solution 1: wait until the branch is resolved Problem: programs have 1 branch every 5 instructions on average We would spend most of our time waiting Solution 2: predict (guess) the most likely direction If correct, we have bought some time If wrong, just go back and start over Modern CPUs can correctly predict over 95% of branches World record holder: 1.691 mispredictions / 1000 instructions General concept: speculation P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-potage+ SC 28 predictor." JWAC-4: Championship Branch Prediction (2014).

Example CPU: Intel Core i7 Haswell Up to 192 instructions in flight May be 48 predicted branches ahead Up to 8 instructions/cycle executed out of order About 25 pipeline stages at ~4 GHz Quizz: how far does light travel during the 0.25 ns of a clock cycle? Too complex to explain in 1 slide, or even 1 lecture David Kanter, Intel's Haswell CPU architecture, RealWorldTech, 2012 http://www.realworldtech.com/haswell-cpu/ 29

Recap Many techniques to run sequential programs as fast as possible Discovers and exploits parallelism between instructions Speculates to remove dependencies Works on existing binary programs, without rewriting or re-compiling Upgrading hardware is cheaper than improving software Extremely complex machine 30

Technology evolution Memory wall Compute Performance Gap Memory Memory speed does not increase as fast as computing speed More and more difficult to hide memory latency Time Power wall Transistor density Power consumption of transistors does not decrease as fast as density increases Performance is now limited by power consumption Total power Transistor power Time ILP wall Law of diminishing returns on Instruction-Level Parallelism Cost Pollack rule: cost performance² Serial performance 32

Usage changes New applications demand parallel processing Computer games : 3D graphics Search engines, social networks big data processing New computing devices are power-constrained Laptops, cell phones, tablets Small, light, battery-powered Datacenters High power supply and cooling costs 33

Latency vs. throughput Latency: time to solution CPUs Minimize time, at the expense of power Throughput: quantity of tasks processed per unit of time GPUs Assumes unlimited parallelism Minimize energy per operation 34

Amdahl's law Bounds speedup attainable on a parallel machine 1 S= Time to run sequential portions 1 P P N Time to run parallel portions S P N Speedup Ratio of parallel portions Number of processors S (speedup) N (available processors) G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967. 35

Why heterogeneous architectures? Time to run sequential portions S= 1 P 1 P N Time to run parallel portions Latency-optimized multi-core (CPU) Low efficiency on parallel portions: spends too much resources Throughput-optimized multi-core (GPU) Low performance on sequential portions Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Allows aggressive optimization for latency or for throughput M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008. 36

Example: System on Chip for smartphone Small cores for background activity GPU Big cores for applications Lots of interfaces Special-purpose accelerators 37

The (simplest) graphics rendering pipeline Primitives (triangles ) Vertices Fragment shader Vertex shader Textures Z-Compare Blending Clipping, Rasterization Attribute interpolation Pixels Fragments Framebuffer Z-Buffer Programmable stage Parametrizable stage 39

How much performance do we need to run 3DMark 11 at 50 frames/second? Element Per frame Per second Vertices 12.0M 600M Primitives 12.6M 630M Fragments 180M 9.0G Instructions 14.4G 720G Intel Core i7 2700K: 56 Ginsn/s peak We need to go 13x faster Make a special-purpose accelerator Source: Damien Triolet, Hardware.fr 40

Beginnings of GPGPU Microsoft DirectX 7.x 8.0 8.1 9.0 a 9.0b 9.0c 10.0 10.1 11 Unified shaders NVIDIA NV10 FP 16 NV20 NV30 Programmable shaders Dynamic control flow G70 R200 G80-G90 CTM R300 R400 GT200 GF100 CUDA SIMT FP 24 ATI/AMD R100 FP 32 NV40 R500 FP 64 R600 CAL R700 Evergreen GPGPU traction 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 42

Today: what do we need GPUs for? 1. 3D graphics rendering for games Complex texture mapping, lighting computations 2. Computer Aided Design workstations Complex geometry 3. GPGPU Complex synchronization, data movements One chip to rule them all Find the common denominator 43

Little's law: data=throughput latency Throughput (GB/s) Intel Core i7 920 180 50 1,25 3 10 Latency (ns) 50 1500 NVIDIA GeForce GTX 580 L1 320 L2 30 210 J. Little. A proof for the queuing formula L= λ W. JSTOR 1961. 190 DRAM 350 ns45

Hiding memory latency with pipelining Memory throughput: 190 GB/s Throughput: 190B/cycle Memory latency: 350 ns Time Requests At 1 GHz: 190 Bytes/cycle, 350 cycles to wait Latency: 350 cycles Data in flight = 66 500 Bytes 1 cycle... In flight: 65 KB 46

Consequence: more parallelism GPU vs. CPU Requests 8 more parallelism to hide longer latency 64 more total parallelism 8 How to find this parallelism? 8 Time 8 more parallelism to feed more units (throughput) Space... 47

Sources of parallelism ILP: Instruction-Level Parallelism Between independent instructions in sequential program TLP: Thread-Level Parallelism Between independent execution contexts: threads DLP: Data-Level Parallelism Between elements of a vector: same operation on several elements add r3 r1, r2 Parallel mul r0 r0, r1 sub r1 r3, r0 Thread 1 add Thread 2 mul vadd r a,b Parallel a1 + b1 a2 + b2 a3 + b3 r1 r2 r3 48

Example: X a X In-place scalar-vector product: X a X Sequential (ILP) For i = 0 to n-1 do: X[i] a * X[i] Threads (TLP) Launch n threads: X[tid] a * X[tid] Vector (DLP) X a * X Or any combination of the above 49

Uses of parallelism Horizontal parallelism for throughput A B C D More units working in parallel Vertical parallelism for latency hiding A Pipelining: keep units busy when waiting for dependencies, memory B C D A B C A B latency throughput A cycle 1 cycle 2 cycle 3 cycle 4 50

How to extract parallelism? Horizontal Vertical ILP Superscalar Pipelined TLP Multi-core SMT Interleaved / switch-on-event multithreading DLP SIMD / SIMT Vector / temporal SIMT We have seen the first row: ILP We will now review techniques for the next rows: TLP, DLP 51

Sequential processor for i = 0 to n-1 X[i] a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute Memory Memory Sequential CPU Focuses on instruction-level parallelism Exploits ILP: vertically (pipelining) and horizontally (superscalar) 53

The incremental approach: multi-core Several processors on a single chip sharing one memory space Intel Sandy Bridge Area: benefits from Moore's law Power: extra cores consume little when not in use e.g. Intel Turbo Boost Source: Intel 54

Homogeneous multi-core Horizontal use of thread-level parallelism add i 50 F IF store X[49] D ID mul mul EX X LSU Mem Threads: T0 EX X Memory add i 18 F IF store X[17] D ID LSU Mem T1 Improves peak throughput 55

Example: Tilera Tile-GX Grid of (up to) 72 tiles Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz Tile (1,1) Tile (1,2) Tile (1,8) Tile (9,1) Tile (9,8) 56

Interleaved multi-threading Vertical use of thread-level parallelism Threads: T0 T1 T2 T3 mul Fetch mul Decode add i 73 add i 50 Execute load X[89] store X[72] load X[17] store X[49] Memory load-store unit Memory Hides latency thanks to explicit parallelism improves achieved throughput 57

Example: Oracle Sparc T5 16 cores / chip Core: out-of-order superscalar, 8 threads 15 pipeline stages, 3.6 GHz Thread 1 Thread 2 Thread 8 Core 1 Core 2 Core 16 58

Clustered multi-core For each individual unit, select between T0 T1 T2 Cluster 1 T3 Cluster 2 Horizontal replication Vertical time-multiplexing br Examples mul Sun UltraSparc T2, T3 AMD Bulldozer IBM Power 7 add i 73 Fetch store Decode mul add i 50 load X[89] store X[72] load X[17] store X[49] EX Memory L/S Unit Area-efficient tradeoff Blurs boundaries between cores 59

Implicit SIMD Factorization of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers (0-3) load F T1 (0-3) store D T2 T3 (0) mul (0) (1) mul (2) mul (1) (2) (3) mul X (3) Memory T0 Mem In NVIDIA-speak SIMT: Single Instruction, Multiple Threads Convoy of synchronized threads: warp Extracts DLP from multi-thread applications 60

Explicit SIMD Single Instruction Multiple Data Horizontal use of data level parallelism add i 20 F vstore X[16..19 D vmul X Memory loop: vload T X[i] vmul T a T vstore X[i] T add i i+4 branch i<n? loop Mem Machine code SIMD CPU Examples Intel MIC (16-wide) AMD GCN GPU (16-wide 4-deep) Most general purpose CPUs (4-wide to 8-wide) 61

Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 62

Hierarchical combination Both CPUs and GPUs combine these techniques Multiple cores Multiple threads/core SIMD units 66

Example CPU: Intel Core i7 Is a wide superscalar, but has also Multicore Multi-thread / core SIMD units Up to 117 operations/cycle from 8 threads 256-bit SIMD units: AVX 4 CPU cores Wide superscalar Simultaneous Multi-Threading: 2 threads 67

Example GPU: NVIDIA GeForce GTX 580 SIMT: warps of 32 threads 16 SMs / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 2 Warp 3 Warp 4 Core 32 Core 18 Time Core 17 Warp 47 Core 16 Core 2 Core 1 Warp 48 SM1 SM16 Up to 512 operations per cycle from 24576 threads in flight 68

Taxonomy of parallel architectures Horizontal Vertical ILP Superscalar / VLIW Pipelined TLP Multi-core SMT Interleaved / switch-onevent multithreading DLP SIMD / SIMT Vector / temporal SIMT 69

Classification: multi-core Intel Haswell Horizontal ILP 8 TLP 4 DLP 8 Fujitsu SPARC64 X Vertical 8 2 16 2 2 General-purpose multi-cores: balance ILP, TLP and DLP SIMD Cores Hyperthreading (AVX) IBM Power 8 Oracle Sparc T5 10 12 2 8 16 Sparc T: focus on TLP 8 4 Cores Threads 70

Classification: GPU and many small-core Intel MIC Horizontal ILP 2 TLP 60 DLP 16 SIMD Nvidia Kepler AMD GCN Vertical 2 Cores Tilera Tile-GX 4 16 4 32 32 Cores units SIMT 5 72 17 16 40 16 4 Multithreading Kalray MPPA-256 3 20 4 GPU: focus on DLP, TLP horizontal and vertical Many small-core: focus on horizontal TLP 71

Takeaway All processors use hardware mechanisms to turn parallelism into performance GPUs focus on Thread-level and Data-level parallelism 72

Computation cost vs. memory cost Power measurements on NVIDIA GT200 Energy/op (nj) Instruction control Multiply-add on a 32-wide warp Load 128B from DRAM Total power (W) 1.8 3.6 18 36 80 90 With the same amount of energy Load 1 word from external memory (DRAM) Compute 44 flops Must optimize memory accesses first! 74

External memory: discrete GPU Classical CPU-GPU model Split memory spaces Highest bandwidth from GPU memory Transfers to main memory are slower PCI Express CPU 26GB/s Main memory 8 GB 16GB/s GPU 290GB/s Graphics memory 3 GB Ex: Intel Core i7 4770, Nvidia GeForce GTX 780 75

External memory: embedded GPU Most GPUs today Same memory May support memory coherence GPU can read directly from CPU caches More contention on external memory CPU GPU Cache 26GB/s Main memory 8 GB 76

GPU: on-chip memory Conventional wisdom Cache area in CPU vs. GPU according to the NVIDIA CUDA Programming Guide: But... if we include registers: GPU NVIDIA GM204 GPU AMD Hawaii GPU Intel Core i7 CPU Register files + caches 8.3 MB 15.8 MB 9.3 MB GPU/accelerator internal memory exceeds desktop CPUs 77

Registers: CPU vs. GPU Registers keep the contents of local variables Typical values CPU GPU Registers/thread 32 32 Registers/core 256 65536 10R/5W 2R/1W Read / Write ports GPU: many more registers, but made of simpler memory 78

Internal memory: GPU Cache hierarchy Keep frequently-accessed data Core Reduce throughput demand on main memory L1 Core Core ~2 TB/s L1 L1 1 MB L2 6 MB Managed by hardware (L1, L2) or software (shared memory) Crossbar L2 L2 290 GB/s External memory 79

Caches: CPU vs. GPU Latency CPU GPU Caches, prefetching Multi-threading Throughput Caches On CPU, caches are designed to avoid memory latency Throughput reduction is a side effect On GPU, multi-threading deals with memory latency Caches are used to improve throughput (and energy) 80

GPU: thousands of cores? Computational resources NVIDIA GPUs Exec. units SM AMD GPUs Exec. Units SIMD-CU G80/G92 (2006) 128 16 GT200 (2008) 240 30 GF100 (2010) 512 16 R600 (2007) 320 4 R700 (2008) 800 10 Evergreen (2009) 1600 20 GK104 (2012) 1536 8 NI (2010) 1536 24 GK110 (2012) 2688 14 GM204 (2014) 2048 16 SI (2012) 2048 32 VI (2013) 2560 40 Number of clients in interconnection network (cores) stays limited 81

Takeaway Result of many tradeoffs Between locality and parallelism Between core complexity and interconnect complexity GPU optimized for throughput Exploits primarily DLP, TLP Energy-efficient on parallel applications with regular behavior CPU optimized for latency Exploits primarily ILP Can use TLP and DLP when available 82

Next time Next Tuesday, 1:00pm, room 2014 CUDA Execution model Programming model API Thursday 1:00pm, room 2011 Lab work: what is my GPU and when should I use it? There may be available seats even if you are not enrolled 83