Hybrid CPU-GPU cores for heterogeneous multi-core processors. Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr


From GPU to heterogeneous multi-core
Yesterday (2000-2010)
  Homogeneous multi-core
  Discrete components
Today (2011-...)
  Heterogeneous multi-core
  Physically unified: CPU + GPU on the same chip
  Logically separated: different programming models, compilers, instruction sets
Tomorrow
  Unified programming models?
  Single instruction set?
[Figure: heterogeneous multi-core chip combining latency-optimized cores (Central Processing Unit, CPU), throughput-optimized cores (Graphics Processing Unit, GPU) and hardware accelerators.]

Outline
Performance or efficiency?
  Latency-oriented architectures
  Throughput-oriented architectures
  Heterogeneous architectures
Dynamic SPMD vectorization
  Traditional dynamic vectorization
  More flexibility with state-free dynamic vectorization
New CPU-GPU hybrids
  DITVA: CPU with dynamic vectorization
  SBI: GPU with parallel path execution

The 1980s: pipelined processor
Example: scalar-vector multiplication, X ← a·X

Source code:
  for i = 0 to n-1
    X[i] ← a * X[i]

Machine code:
        move i ← 0
  loop: load t ← X[i]
        mul t ← a × t
        store X[i] ← t
        add i ← i+1
        branch i<n? loop

[Figure: sequential CPU pipeline (Fetch, Decode, Execute, L/S Unit) connected to memory, with "add i ← 18" in Fetch, "store X[17]" in Decode and "mul" in Execute.]

The 1990s: superscalar processor
Goal: improve performance of sequential applications
  Latency: time to get the result
Exploits Instruction-Level Parallelism (ILP)
Uses many tricks: branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation
Basis: speculation
  Take a bet on future events
  If right: time gain
  If wrong: roll back, energy loss

What makes speculation work: regularity
Application behavior is likely to follow regular patterns

Example code:
  for(i ...) {
    if(f(i)) { ... }
    j = g(i);
    x = a[j];
  }

Control regularity (outcome of the branch on f(i), for i=0, 1, 2, 3)
  Regular case: taken, taken, taken, taken
  Irregular case: not taken, taken, taken, not taken
Memory regularity (index j of the access a[j], over time)
  Regular case: j=17, j=18, j=19, j=20
  Irregular case: j=21, j=4, j=17, j=2
Speculation exploits patterns to guess accurately
Applications: caches, branch prediction, instruction prefetch, data prefetch, write combining
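To make the control-regularity example concrete, here is a minimal sketch that replays the two branch sequences from the slide through a 2-bit saturating-counter predictor. The predictor scheme, its initial state and the name `prediction_accuracy` are assumptions chosen for illustration; the slides do not say which predictor is meant.

```python
def prediction_accuracy(outcomes, counter=2):
    """Fraction of branches predicted correctly by one 2-bit saturating counter.

    Assumption: counter values 0-1 predict "not taken", 2-3 predict "taken",
    and the counter starts at 2 (weakly taken).
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Branch outcomes from the slide, for i = 0..3.
regular = [True, True, True, True]
irregular = [False, True, True, False]

print(prediction_accuracy(regular))
print(prediction_accuracy(irregular))
```

On the regular sequence every guess is right, while the irregular one defeats the counter most of the time; each wrong guess is a rollback, which is where the energy loss comes from.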

The 2000s: going multi-threaded
Memory wall: more and more difficult to hide memory latency
Power wall: performance is now limited by power consumption
ILP wall: law of diminishing returns on Instruction-Level Parallelism
Gradual transition from latency-oriented to throughput-oriented
  Homogeneous multi-core
  Simultaneous multi-threading
[Figures: compute and memory performance over time, with a widening gap; transistor density, transistor power and total power over time; serial performance versus cost.]

Homogeneous multi-core
Replication of the complete execution engine
Multi-threaded software

Machine code:
        move i ← slice_begin
  loop: load t ← X[i]
        mul t ← a × t
        store X[i] ← t
        add i ← i+1
        branch i<slice_end? loop

[Figure: two cores, each with its own IF, ID, EX and LSU stages, sharing memory. Thread T0 runs "add i ← 18", "store X[17]", "mul"; thread T1 runs "add i ← 50", "store X[49]", "mul".]
Improves throughput thanks to explicit parallelism
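The slicing used by the multi-threaded machine code can be sketched as follows: each thread receives its own `slice_begin` and `slice_end` and runs the same loop. The helper names and the equal-chunk partitioning are assumptions for illustration. In CPython the global interpreter lock keeps these threads from running truly in parallel, so this only shows the structure, not a speedup.

```python
from threading import Thread


def scale_slice(X, a, slice_begin, slice_end):
    """Loop body run by one thread: X[i] <- a * X[i] on its own slice."""
    for i in range(slice_begin, slice_end):
        X[i] = a * X[i]


def scale_parallel(X, a, num_threads=2):
    """Split X into contiguous slices and scale each slice in its own thread."""
    n = len(X)
    chunk = (n + num_threads - 1) // num_threads  # ceiling division
    threads = []
    for t in range(num_threads):
        begin = min(t * chunk, n)
        end = min(begin + chunk, n)
        threads.append(Thread(target=scale_slice, args=(X, a, begin, end)))
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return X


print(scale_parallel(list(range(10)), 3, num_threads=4))
```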

Simultaneous multi-threading (SMT)
Time-multiplexing of processing units
Same software view

Machine code:
        move i ← slice_begin
  loop: load t ← X[i]
        mul t ← a × t
        store X[i] ← t
        add i ← i+1
        branch i<slice_end? loop

[Figure: a single pipeline (Fetch, Decode, Execute, L/S Unit) connected to memory and shared by threads T0 to T3, holding a mix of their instructions: "mul", "mul", "add i ← 73", "add i ← 50", "load X[89]", "store X[72]", "load X[17]", "store X[49]".]
Hides latency thanks to explicit parallelism

Throughput-oriented architectures
Also known as GPUs, but do more than just graphics
Target: highly parallel sections of programs
Programming model: SPMD, one function run by many threads
  One code: X[tid] ← a * X[tid]
  Many threads: for n threads
Goal: maximize the computation / energy consumption ratio
Many-core approach: many independent, multi-threaded cores
Can we be more efficient? Exploit regularity
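As a rough illustration of the SPMD model, the sketch below writes the per-thread function once and runs it for n thread identifiers. The names `kernel` and `launch` are hypothetical, and the launch is emulated with a sequential loop, whereas a throughput-oriented core would run the threads concurrently.

```python
def kernel(tid, X, a):
    """One code, written from the point of view of a single thread."""
    X[tid] = a * X[tid]


def launch(kernel_fn, n, *args):
    """Run kernel_fn once per thread id (sequential emulation of n threads)."""
    for tid in range(n):
        kernel_fn(tid, *args)


X = [1.0, 2.0, 3.0, 4.0]
launch(kernel, len(X), X, 2.0)
print(X)
```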

Parallel regularity
Similarity in behavior between threads
Control regularity: switch(i) { case 2:... case 17:... case 21:... }
  Regular: threads 1 to 4 have i=17, i=17, i=17, i=17
  Irregular: threads 1 to 4 have i=21, i=4, i=17, i=2
Memory regularity: r=a[i]
  Regular: load A[8], load A[9], load A[10], load A[11]
  Irregular: load A[8], load A[0], load A[11], load A[3]
Data regularity: r=a*b
  Regular: a=32 and b=52 in all four threads
  Irregular: a=17, a=-5, a=11, a=42 and b=15, b=0, b=-2, b=52

Dynamic SPMD vectorization, aka SIMT
Run SPMD threads in lockstep
Share fetch/decode and load-store units between threads
  Fetch 1 instruction on behalf of several threads
  Read 1 memory location and broadcast to several registers
SIMT: Single Instruction, Multiple Threads
Wave of synchronized threads: warp
[Figure: threads T0 to T3 share one pipeline. A single "load" for threads (0-3) is in IF, a single "store" for threads (0-3) is in ID, four "mul" operations, one per thread (0) to (3), are in EX, and the LSU serves threads (0) to (3) from memory.]
Improves area/power-efficiency thanks to regularity

Example GPU: NVIDIA GeForce GTX 980
SIMT: warps of 32 threads
16 SMs / chip
4×32 cores / SM, 64 warps / SM
4612 Gflop/s
Up to 32768 threads in flight
[Figure: cores grouped into SM1 through SM16, each SM interleaving its warps over time.]
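The headline figures on this slide can be cross-checked with a few multiplications. The clock frequency and the two floating-point operations per core per cycle are assumptions not stated on the slide (a 1.126 GHz base clock and one fused multiply-add per cycle); with them, the numbers line up.

```python
# Figures taken from the slide.
SMS_PER_CHIP = 16
CORES_PER_SM = 4 * 32
WARPS_PER_SM = 64
WARP_SIZE = 32

# Assumptions, not stated on the slide.
CLOCK_GHZ = 1.126             # assumed base clock
FLOPS_PER_CORE_PER_CYCLE = 2  # assumed: one fused multiply-add counts as 2 flops

cores = SMS_PER_CHIP * CORES_PER_SM
threads_in_flight = SMS_PER_CHIP * WARPS_PER_SM * WARP_SIZE
peak_gflops = cores * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ

print(cores, threads_in_flight, round(peak_gflops))
```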

SIMT vs. multi-core + explicit SIMD
SIMT
  All parallelism expressed using threads
  Warp size implementation-defined
  Dynamic vectorization
Multi-core + explicit SIMD
  Combination of threads, vectors
  Vector length fixed at compile-time
  Static vectorization
SIMT benefits
  Easier programming
  Retain binary compatibility
[Figure: threads grouped into a warp (SIMT) versus threads that each operate on a vector (explicit SIMD).]

Heterogeneity: causes and consequences
Amdahl's law: $S = \dfrac{1}{(1 - P) + \dfrac{P}{N}}$
  $(1 - P)$: time to run sequential sections
  $\dfrac{P}{N}$: time to run parallel sections
Latency-optimized multi-core (CPU)
  Low efficiency on parallel sections: spends too many resources
Throughput-optimized multi-core (GPU)
  Low performance on sequential sections
Heterogeneous multi-core (CPU+GPU)
  Use the right tool for the right job
  Resources saved in parallel sections can be devoted to accelerating sequential sections
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
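A short numeric sketch of Amdahl's law, with P the parallel fraction of the run time and N the number of cores, shows why sequential sections end up dominating. The function name `amdahl_speedup` and the sample values are illustrative only.

```python
def amdahl_speedup(p, n):
    """Speedup S = 1 / ((1 - p) + p / n) for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


# With 90% of the time parallelizable, adding cores quickly stops paying off:
# the speedup can never exceed 1 / (1 - p) = 10.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The remaining sequential 10% caps the speedup, which is the argument for keeping a latency-optimized core next to the throughput-optimized ones.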

Single-ISA heterogeneous architectures
Proposed in academia, now in embedded systems on chip
Example: ARM big.LITTLE
  big cores: high-performance CPU cores (Cortex A57)
  LITTLE cores: low-power CPU cores (Cortex A53)
  Thread migration between big and LITTLE cores
Different cores, same instruction set
Migrate application threads dynamically according to demand

Single-ISA heterogeneous CPU-GPU
Enables dynamic task migration between CPU and GPU
Use the best core for the current job
Load-balance on available resources
[Figure: thread migration between CPU cores and GPU cores.]

Option 1: static vectorization
Extend CPU SIMD instruction sets to throughput-oriented cores
Scalar + wide vector ISA on both the latency core and the throughput cores, e.g. x86-64 + AVX-512
Issue: conflicting requirements
  Same suboptimal SIMD vector length for all cores
  Or binary compatibility loss, ISA fragmentation
Intel. Intel Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection. GNU Tools Cauldron 2014.

Our proposal: dynamic vectorization
Extend the SIMT execution model to general-purpose cores
Scalar ISA on both sides: latency core and throughput cores
Flexibility advantage: SIMD width optimized for each core type
Challenge: generalize dynamic vectorization to general-purpose instruction processing

Capturing instruction regularity
How to keep threads synchronized?
Challenge: conditional branches
Rules of the game
  One thread per SIMD lane
  Same instruction on all lanes
  Lanes can be individually disabled

Example code:
  x = 0;
  // Uniform condition
  if(tid > 17) {
    x = 1;
  }
  // Divergent conditions
  if(tid < 2) {
    if(tid == 0) {
      x = 2;
    }
    else {
      x = 3;
    }
  }

[Figure: 1 instruction broadcast to lanes 0 to 3, which run threads 0 to 3.]

Most common: mask stack
1 activity bit / thread; each mask lists tid=0 to tid=3 from left to right

  Code                      Mask stack
  x = 0;                    1111
  // Uniform condition
  if(tid > 17) {            skip: 1111
    x = 1;
  }
  // Divergent conditions
  if(tid < 2) {             push: 1111 1100
    if(tid == 0) {          push: 1111 1100 1000
      x = 2;
    }                       pop:  1111 1100
    else {                  push: 1111 1100 0100
      x = 3;
    }                       pop:  1111 1100
  }                         pop:  1111

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH 84, 1984.
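The push/pop sequence above can be replayed with a small simulation. This is a sketch under simplifying assumptions, not a model of any particular GPU: the else branch is modeled as a pop followed by a push of the negated condition, and all names (`run_mask_stack`, `push`, `pop`, `assign`) are hypothetical.

```python
def mask_str(mask):
    """Render an activity mask, tid=0 on the left, as a string such as '1100'."""
    return "".join("1" if bit else "0" for bit in mask)


def run_mask_stack(num_threads=4):
    """Run the example code with a mask stack; return final x and the stack trace."""
    x = [None] * num_threads
    stack = [[True] * num_threads]  # bottom entry: every thread active
    trace = []

    def snapshot():
        trace.append(" ".join(mask_str(m) for m in stack))

    def push(cond):
        top = stack[-1]
        stack.append([top[t] and cond(t) for t in range(num_threads)])
        snapshot()

    def pop():
        stack.pop()
        snapshot()

    def assign(value):
        # Only lanes whose activity bit is set commit the write.
        for t in range(num_threads):
            if stack[-1][t]:
                x[t] = value

    def any_active(cond):
        return any(stack[-1][t] and cond(t) for t in range(num_threads))

    snapshot()
    assign(0)                        # x = 0;
    if any_active(lambda t: t > 17):  # skipped when no active thread takes it
        push(lambda t: t > 17)
        assign(1)
        pop()
    push(lambda t: t < 2)            # if(tid < 2) {
    push(lambda t: t == 0)           #   if(tid == 0) {
    assign(2)
    pop()                            #   }
    push(lambda t: t != 0)           #   else {
    assign(3)
    pop()                            #   }
    pop()                            # }
    return x, trace


final_x, stack_trace = run_mask_stack()
print(final_x)
for entry in stack_trace:
    print(entry)
```

Every nested conditional adds one stack entry, so the sequencer needs storage that grows with the nesting depth.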

Traditional SIMT pipeline
Instruction sequencer (with mask stack) → PC, activity mask → Instruction fetch → instruction, activity mask → Broadcast → instruction, activity bit → Exec (one unit per lane)
Activity bit=0: discard instruction
Used in Nvidia GPUs

Goto considered harmful?
Control instructions in some CPU and GPU instruction sets
  MIPS: j jal jr syscall
  NVIDIA Tesla (2007): bar bra brk brkpt cal cont kil pbk pret ret ssy trap.s
  NVIDIA Fermi (2010): bar bpt bra brk brx cal cont exit jcal jmx kil pbk pret ret ssy.s
  Intel GMA Gen4 (2006): jmpi if iff else endif do while break cont halt msave mrest push pop
  Intel GMA SB (2011): jmpi if else endif case while break cont halt call return fork
  AMD R500 (2005): jump loop endloop rep endrep breakloop breakrep continue
  AMD R600 (2007): push push_else pop loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after
  AMD Cayman (2011): push push_else pop push_wqm pop_wqm else_wqm jump_any reactivate reactivate_wqm loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after
Why so many?
  Expose control flow structure to the instruction sequencer

SIMD is so last century
Maspar MP-1 (1990)
  1 instruction for 16 384 processing elements (PEs)
  PE: ~1 mm², 1.6 µm process
  SIMD programming model
NVIDIA Fermi (2010)
  1 instruction for 16 PEs
  PE: ~0.03 mm², 40 nm process
  Threaded programming model
From Maspar MP-1 to Fermi: 1000× fewer PEs, 50× bigger PEs, more divergence
From centralized control to flexible distributed control
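One way to read the two ratios on this slide is sketched below. The PE counts and areas come from the slide; the assumption, which is mine, is that "bigger" is meant relative to the process, so that area is first scaled by the square of the feature size.

```python
# Figures from the slide.
maspar_pes, fermi_pes = 16384, 16
maspar_pe_area_mm2, maspar_process_um = 1.0, 1.6
fermi_pe_area_mm2, fermi_process_um = 0.03, 0.040

# PEs driven by one instruction: roughly 1000 times fewer on Fermi.
pe_ratio = maspar_pes / fermi_pes

# Assumption: ideal scaling, area shrinks with the square of the feature size.
area_shrink = (maspar_process_um / fermi_process_um) ** 2
maspar_pe_area_at_40nm = maspar_pe_area_mm2 / area_shrink

# A Fermi PE compared with a Maspar PE shrunk to the same process.
relative_size = fermi_pe_area_mm2 / maspar_pe_area_at_40nm

print(pe_ratio, round(area_shrink), round(relative_size))
```

Under that assumption the ratios come out near 1000 and 50, matching the slide's orders of magnitude.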

Moving away from the vector model
Requirements for a single-ISA CPU+GPU
  Run general-purpose applications
  Switch freely back and forth between SIMT and MIMD modes
Conventional techniques do not meet these requirements
Solution: stateless dynamic vectorization
Key idea
  Maintain 1 Program Counter (PC) per thread
  Each cycle, elect one Master PC to fetch from
  Activate all threads that have the same PC

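1 PC / thread

Code:
  x = 0;
  if(tid > 17) {
    x = 1;
  }
  if(tid < 2) {
    if(tid == 0) {
      x = 2;        ← Master PC
    }
    else {
      x = 3;
    }
  }

[Figure: each thread tid = 0, 1, 2, 3 has its own Program Counter, PC 0 to PC 3. PC 0 matches the Master PC, so thread 0 is active (activity bits 1 0 0 0). PC 1, PC 2 and PC 3 do not match, so threads 1 to 3 are inactive.]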
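The per-thread PC scheme can be sketched on the same example code. The tiny instruction encoding, the instruction addresses and the function name `run_per_thread_pc` are hypothetical, and the sketch elects the lowest PC as master, which corresponds to the min(pc) ordering the slides give for conditionals and loops.

```python
# Hypothetical encoding of the example code, one instruction per address.
PROGRAM = [
    ("set", 0),                                   # 0: x = 0
    ("branch_unless", lambda tid: tid > 17, 3),   # 1: if(tid > 17) {
    ("set", 1),                                   # 2:   x = 1 }
    ("branch_unless", lambda tid: tid < 2, 8),    # 3: if(tid < 2) {
    ("branch_unless", lambda tid: tid == 0, 7),   # 4:   if(tid == 0) {
    ("set", 2),                                   # 5:     x = 2
    ("jump", 8),                                  # 6:   } skip the else block
    ("set", 3),                                   # 7:   else { x = 3 }
]
END = len(PROGRAM)                                # 8: end of program


def run_per_thread_pc(num_threads=4):
    """Execute PROGRAM with one PC per thread; return final x and a cycle trace."""
    pcs = [0] * num_threads
    x = [None] * num_threads
    trace = []
    while any(pc < END for pc in pcs):
        # Elect the master PC: here, the lowest PC among unfinished threads.
        master = min(pc for pc in pcs if pc < END)
        # Activate every thread whose own PC matches the master PC.
        active = [tid for tid in range(num_threads) if pcs[tid] == master]
        trace.append((master, active))
        op = PROGRAM[master]
        for tid in active:
            if op[0] == "set":
                x[tid] = op[1]
                pcs[tid] = master + 1
            elif op[0] == "branch_unless":
                pcs[tid] = master + 1 if op[1](tid) else op[2]
            else:  # "jump"
                pcs[tid] = op[1]
    return x, trace


values, cycles = run_per_thread_pc()
print(values)
for master_pc, active_tids in cycles:
    print(master_pc, active_tids)
```

No stack is kept anywhere: the only control state is the PC each thread already owns, and threads fall back into lockstep whenever their PCs coincide again.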

Our new SIMT pipeline
Vote over PC 0, PC 1, ..., PC n → MPC → Instruction fetch → Insn, MPC → Broadcast → Insn, MPC to every lane
In each lane i: MPC=PC i? → Insn → Exec → Update PC i
No match: discard instruction

Benefits of stateless dynamic vectorization
Before: stack, counters
  O(n), O(log n) memory (n = nesting depth)
  1 R/W port to memory
  Exceptions: stack overflow, underflow
  Vector semantics
  Structured control flow only
  Specific instruction sets
After: multiple PCs
  O(1) memory
  No shared state
  Allows thread suspension, restart, migration
  Multi-thread semantics
  Traditional languages, compilers
  Traditional instruction sets
  Can be mixed with MIMD

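Scheduling policy: min(sp:pc)
Which PC to choose as master PC?
Conditionals, loops
  Order of code addresses: min(pc)
Functions
  Favor max nesting depth: min(sp)
With compiler support
  Unstructured control flow too
  No code duplication
  Full backward and forward compatibility
[Figure: source, assembly and execution order for three examples.
  if(...) { ... } else { ... }: "p? br else", then block (order 1), "br endif", "else:" block (order 2), "endif:" (order 3).
  while(...) { ... }: "start:", loop body, "p? br start".
  f(); void f() { ... }: "call f", then "f:", function body, "ret".
  Order labels for the loop and call examples: 1 2 3 4 and 1 3 2.]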
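A minimal sketch of the min(sp:pc) election, assuming a downward-growing stack so that a lower SP means deeper call nesting. The `ThreadState` record and the sample SP and PC values are hypothetical; threads are activated by PC match, as in the pipeline described earlier.

```python
from collections import namedtuple

# Hypothetical per-thread state: thread id, stack pointer, program counter.
ThreadState = namedtuple("ThreadState", ["tid", "sp", "pc"])


def elect_master_pc(threads):
    """Pick the master PC: lowest SP first (deepest nesting), then lowest PC."""
    master = min(threads, key=lambda t: (t.sp, t.pc))
    return master.pc


def active_threads(threads):
    """Thread ids whose PC matches the elected master PC."""
    master_pc = elect_master_pc(threads)
    return [t.tid for t in threads if t.pc == master_pc]


# Threads 0 and 3 are inside a called function (lower SP, higher code address);
# threads 1 and 2 are still in the caller.
threads = [
    ThreadState(tid=0, sp=96, pc=40),
    ThreadState(tid=1, sp=100, pc=12),
    ThreadState(tid=2, sp=100, pc=8),
    ThreadState(tid=3, sp=96, pc=40),
]
print(elect_master_pc(threads), active_threads(threads))
```

With min(pc) alone, thread 2 at address 8 would be picked first; comparing SP before PC lets the threads inside the function run ahead so they can rejoin the others after returning.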

Potential of Min(SP:PC)
Comparison of fetch policies on SPMD benchmarks
  PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP, TBB
  Microarchitecture-independent model: ideal SIMD machine
[Figure: average number of active threads for each fetch policy.]
Min(SP:PC) achieves reconvergence at minimal cost
T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40.9:548-558. 2014.

DITVA: Dynamic Inter-Thread Vectorization Architecture
Add dynamic vectorization capability to an in-order SMT CPU
Runs existing parallel programs compiled for x86
Scheduling policy: alternate min(sp:pc) and round-robin
Baseline: 4-thread 4-issue in-order with explicit SIMD
DITVA: 4-warp 4-thread 4-issue
[Figure: execution units: 4 scalar units, 2 SIMD units, 4 SIMT units.]

DITVA performance
Speedup of 4-warp 2-thread DITVA and 4-warp 4-thread DITVA over the baseline 4-thread processor
+18% and +30% performance on SPMD workloads

Simultaneous Branch Interweaving
Co-issue instructions from divergent branches
Fill inactive units using parallelism from divergent paths
[Figure: control-flow graph with blocks 1 to 7, as executed by SIMT (baseline) and by SBI, which issues two instructions in the same cycle.]
N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA 2012.

Conclusion: the missing link
[Figure: a new design space between CPU today (CPU ISA, multi-core multi-thread) and GPU today (SIMT model), with DITVA and SBI as the points in between.]
New range of architecture options between multi-core and GPUs
Enables heterogeneous platforms with unified instruction set