Hybrid CPU-GPU cores for heterogeneous multi-core processors. Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr


From GPU to heterogeneous multi-core
Yesterday (2000-2010): homogeneous multi-core; CPU and GPU as discrete components.
Today (2011-...): heterogeneous multi-core; physically unified, CPU + GPU on the same chip, but logically separated: different programming models, compilers, instruction sets.
Tomorrow: unified programming models? A single instruction set?
Diagram: a heterogeneous multi-core chip combining latency-optimized cores (CPU), throughput-optimized cores (GPU), and hardware accelerators.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

The 1980s: the pipelined processor
Example: scalar-vector multiplication, X <- a*X.
Source code:
  for i = 0 to n-1
      X[i] = a * X[i]
Machine code:
  move i 0
  loop: load t X[i]
        mul t a t
        store X[i] t
        add i i+1
        branch i<n? loop
Diagram: a sequential CPU with a pipeline (Fetch, Decode, Execute, Load/Store Unit) in front of memory; e.g. "add i 18" sits in Fetch while "store X[17]" occupies Decode and "mul" executes.
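The source-level loop above can be sketched in Python (a minimal illustration of the computation the pipeline executes, not of the machine code itself):

```python
def scale_in_place(x, a):
    """Scalar-vector multiplication X <- a*X: the loop from the slide."""
    for i in range(len(x)):
        x[i] = a * x[i]
    return x

print(scale_in_place([1.0, 2.0, 3.0], 2.0))  # -> [2.0, 4.0, 6.0]
```

Each iteration maps onto the load / mul / store / add / branch sequence of the machine code, which the pipeline overlaps stage by stage.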

The 1990s: the superscalar processor
Goal: improve the performance of sequential applications. Latency: time to get the result.
Exploits Instruction-Level Parallelism (ILP) using many tricks: branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation.
The basis is speculation: take a bet on future events. If right: time gained. If wrong, roll back: energy lost.

What makes speculation work: regularity
Application behavior is likely to follow regular patterns.
Control regularity:
  for(i ...) { if(f(i)) { ... } }
  Regular case: taken, taken, taken, taken (i = 0, 1, 2, 3).
  Irregular case: not taken, taken, taken, not taken.
Memory regularity:
  j = g(i); x = a[j];
  Regular case: j = 17, 18, 19, 20, 21. Irregular case: j = 4, 17, 2, ...
Speculation exploits these patterns to guess accurately: caches, branch prediction, instruction prefetch, data prefetch, write combining.
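Branch prediction, the main speculation mechanism mentioned above, can be sketched with the textbook 2-bit saturating counter (a classic scheme used here for illustration, not necessarily the predictor of any specific CPU):

```python
def predict_and_train(outcomes):
    """2-bit saturating counter: predict taken when counter >= 2."""
    counter, correct = 2, 0              # start in 'weakly taken'
    for taken in outcomes:
        predicted = counter >= 2
        correct += (predicted == taken)
        # Train: move toward the actual outcome, saturating in [0, 3].
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# Regular case: a loop branch taken 99 times then falling through
# is predicted correctly 99% of the time.
print(predict_and_train([True] * 99 + [False]))  # -> 0.99
```

On the irregular case (an unpredictable alternation of outcomes) accuracy collapses, which is exactly when speculation wastes energy on rollbacks.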

The 2000s: going multi-threaded
Memory wall: it is more and more difficult to hide memory latency.
Power wall: performance is now limited by power consumption.
ILP wall: law of diminishing returns on Instruction-Level Parallelism.
Result: a gradual transition from latency-oriented to throughput-oriented designs, via homogeneous multi-core and simultaneous multi-threading.
Diagrams: the growing gap between compute and memory performance over time; transistor density still rising while total power becomes the limit; serial performance flattening against cost.

Homogeneous multi-core
Replication of the complete execution engine, driven by multi-threaded software.
Machine code (each thread works on its own slice):
  move i slice_begin
  loop: load t X[i]
        mul t a t
        store X[i] t
        add i i+1
        branch i<slice_end? loop
Diagram: two threads T0 and T1, each with its own pipeline (IF, ID, EX, LSU) in front of memory; e.g. T0 executes "add i 18" / "store X[17]" while T1 executes "add i 50" / "store X[49]".
Improves throughput thanks to explicit parallelism.

Simultaneous multi-threading (SMT)
Time-multiplexing of the processing units; same software view as multi-core, same per-slice machine code as before.
Diagram: four threads T0-T3 share one pipeline (Fetch, Decode, Execute, Load/Store Unit); instructions from different threads (e.g. "add i 73", "add i 50", "load X[89]", "store X[72]", "load X[17]", "store X[49]") are in flight simultaneously.
Hides latency thanks to explicit parallelism.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

Throughput-oriented architectures
Also known as GPUs, but they do more than just graphics.
Target: highly parallel sections of programs.
Programming model: SPMD, one function run by many threads. One code, many threads: for n threads, X[tid] = a * X[tid].
Goal: maximize the computation / energy consumption ratio.
Many-core approach: many independent, multi-threaded cores.
Can we be more efficient? Exploit regularity.
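The SPMD model above — one function, many threads indexed by tid — can be sketched in plain Python (hypothetical helper names; a real GPU runs the body in hardware threads rather than a loop):

```python
def spmd_launch(kernel, n_threads, *args):
    # Emulate SPMD: run the same kernel once per thread id.
    for tid in range(n_threads):
        kernel(tid, *args)

def scale_kernel(tid, x, a):
    # Each thread handles one element: X[tid] = a * X[tid]
    x[tid] = a * x[tid]

x = [1.0, 2.0, 3.0, 4.0]
spmd_launch(scale_kernel, len(x), x, 2.0)
print(x)  # -> [2.0, 4.0, 6.0, 8.0]
```

Note that the loop over i from the sequential version has disappeared: the parallelism is expressed entirely by the thread index.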

Parallel regularity
Similarity in behavior between threads.
Control regularity, e.g. switch(i) { case 2: ... case 17: ... case 21: ... }
  Regular: all four threads take i = 17. Irregular: the threads take i = 21, 4, 17, 2.
Memory regularity, e.g. r = a[i]
  Regular: the threads load A[8], A[9], A[10], A[11] (consecutive). Irregular: A[8], A[0], A[11], A[3] (scattered).
Data regularity, e.g. r = a*b
  Regular: all threads compute with a = 32, b = 52. Irregular: a = 17, -5, 11, 42 and b = 15, 0, -2, 52.

Dynamic SPMD vectorization, aka SIMT
Run SPMD threads in lockstep; mutualize the fetch/decode and load-store units.
Fetch 1 instruction on behalf of several threads; read 1 memory location and broadcast it to several registers.
Diagram: threads T0-T3 share Fetch and Decode; "load" is fetched once for threads 0-3, "mul" executes on all four lanes, and "store" goes through a shared load/store unit to memory.
SIMT: Single Instruction, Multiple Threads. A wave of synchronized threads is a warp.
Improves area- and power-efficiency thanks to regularity.

Example GPU: NVIDIA GeForce GTX 980
SIMT: warps of 32 threads.
16 SMs per chip; 4 × 32 cores per SM; 64 warps per SM.
4612 Gflop/s; up to 32768 threads in flight.
Diagram: each SM time-multiplexes its warps (Warp 1, Warp 2, ..., Warp 64) over its cores.

SIMT vs. multi-core + explicit SIMD
SIMT: all parallelism is expressed using threads; the warp size is implementation-defined; vectorization is dynamic.
Multi-core + explicit SIMD: a combination of threads and vectors; the vector length is fixed at compile time; vectorization is static.
SIMT benefits: easier programming, and binary compatibility is retained.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

Heterogeneity: causes and consequences
Amdahl's law: S = 1 / ((1 - P) + P/N), where P is the parallel fraction of the program and N the number of cores. The (1 - P) term is the time to run sequential sections; P/N is the time to run parallel sections.
Latency-optimized multi-core (CPU): low efficiency on parallel sections; spends too many resources.
Throughput-optimized multi-core (GPU): low performance on sequential sections.
Heterogeneous multi-core (CPU+GPU): use the right tool for the right job. Resources saved in parallel sections can be devoted to accelerating sequential sections.
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.
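Amdahl's law as stated on the slide can be checked numerically with a minimal sketch:

```python
def amdahl_speedup(p, n):
    """Speedup S = 1 / ((1 - P) + P/N) for parallel fraction P on N cores."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% parallel code, speedup saturates near 10x,
# which is why the sequential sections still matter.
print(round(amdahl_speedup(0.9, 16), 2))    # -> 6.4
print(round(amdahl_speedup(0.9, 1024), 2))  # -> 9.91
```

The saturation at 1/(1-P) is the quantitative argument for heterogeneous designs: extra throughput cores stop paying off, while a faster latency core shrinks the (1 - P) term itself.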

Single-ISA heterogeneous architectures
Proposed in academia, now in embedded systems-on-chip. Example: ARM big.LITTLE.
High-performance CPU cores (big, e.g. Cortex-A57) and low-power CPU cores (LITTLE, e.g. Cortex-A53): different cores, same instruction set.
Application threads migrate dynamically between big and LITTLE cores according to demand.

Single-ISA heterogeneous CPU-GPU
Enables dynamic thread migration between CPU cores and GPU cores: use the best core for the current job, and load-balance on the available resources.

Option 1: static vectorization
Extend CPU SIMD instruction sets to the throughput-oriented cores: a scalar + wide vector ISA, e.g. x86-64 + AVX-512, shared by the latency core and the throughput cores.
Issue: conflicting requirements. Either the same, suboptimal SIMD vector length for all cores, or loss of binary compatibility and ISA fragmentation.
Intel. Intel Advanced Vector Extensions 2015/2016: Support in GNU Compiler Collection. GNU Tools Cauldron 2014.

Our proposal: dynamic vectorization
Extend the SIMT execution model to general-purpose cores: a scalar ISA on both the latency core and the throughput cores.
Flexibility advantage: the SIMD width is optimized for each core type.
Challenge: generalize dynamic vectorization to general-purpose instruction processing.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

Capturing instruction regularity
How do we keep threads synchronized? The challenge is conditional branches.
Rules of the game: one thread per SIMD lane; the same instruction on all lanes; lanes can be individually disabled.
Example:
  x = 0;
  // Uniform condition
  if(tid > 17) { x = 1; }
  // Divergent conditions
  if(tid < 2) {
      if(tid == 0) { x = 2; }
      else         { x = 3; }
  }

Most common solution: the mask stack
One activity bit per thread; on a divergent branch, push the current mask and restrict it to the taken path; on reconvergence, pop.
For the example code with 4 threads (tid = 0..3), starting from mask 1111:
  if(tid > 17): uniform, all threads skip the body.
  if(tid < 2): push 1111, execute under mask 1100.
    if(tid == 0): push 1100, execute x = 2 under mask 1000.
    else: execute x = 3 under mask 0100, then pop back to 1100.
  pop back to 1111.
A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH 84, 1984.
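The mask-stack mechanism can be sketched as a tiny interpreter for one if/else (a hypothetical simplification of Levinthal and Porter's scheme, handling a single level of divergence):

```python
def run_divergent_if(tids, cond, then_fn, else_fn, state):
    """Execute if/else in lockstep over all lanes using an activity mask."""
    stack = []
    mask = [True] * len(tids)            # all threads active
    stack.append(mask[:])                # push before the divergent branch
    mask = [m and cond(t) for m, t in zip(mask, tids)]
    for t, m in zip(tids, mask):         # 'then' path: inactive lanes skip
        if m:
            then_fn(t, state)
    parent = stack[-1]
    mask = [p and not cond(t) for p, t in zip(parent, tids)]
    for t, m in zip(tids, mask):         # 'else' path: complementary mask
        if m:
            else_fn(t, state)
    return stack.pop()                   # pop: reconverge to the parent mask

# Simplified from the slide's example: thread 0 takes the 'then' path.
x = [0, 0, 0, 0]
run_divergent_if(
    [0, 1, 2, 3],
    lambda tid: tid == 0,
    lambda tid, s: s.__setitem__(tid, 2),   # x = 2 on thread 0
    lambda tid, s: s.__setitem__(tid, 3),   # x = 3 on threads 1-3
    x)
print(x)  # -> [2, 3, 3, 3]
```

Both paths are executed by the whole warp, one after the other; the mask only decides which lanes commit results — this serialization is the cost of divergence.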

Traditional SIMT pipeline
An instruction sequencer maintains the PC and the activity mask, backed by the mask stack. Instruction fetch reads one instruction, which is broadcast together with the activity mask to all execution lanes; a lane whose activity bit is 0 discards the instruction.
Used in NVIDIA GPUs.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

Goto considered harmful?
Control instructions in some CPU and GPU instruction sets. Why so many?
MIPS: j, jal, jr, syscall
NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmx, kil, pbk, pret, ret, ssy, .s
Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
GPU instruction sets expose the control-flow structure to the instruction sequencer.

SIMD is so last century
MasPar MP-1 (1990): 1 instruction for 16384 processing elements (PEs); each PE ~1 mm² in a 1.6 µm process; SIMD programming model.
NVIDIA Fermi (2010): 1 instruction for 16 PEs; each PE ~0.03 mm² in a 40 nm process; threaded programming model.
Roughly 1000× fewer PEs per instruction, each PE roughly 50× bigger, and more divergence to handle: a move from centralized control to flexible distributed control.

Moving away from the vector model
Requirements for a single-ISA CPU+GPU: run general-purpose applications, and switch freely back and forth between SIMT and MIMD modes.
Conventional techniques do not meet these requirements.
Solution: stateless dynamic vectorization. Key idea: maintain 1 Program Counter (PC) per thread; each cycle, elect one master PC to fetch from, and activate all threads that have the same PC.

1 PC per thread
For the example code, each thread tid = 0..3 keeps its own PC. The master PC is elected among them; a thread whose PC matches the master PC is active, the others are inactive.
Example: the master PC points at x = 2, where only thread 0's PC matches; threads 1-3, whose PCs point elsewhere, stay inactive for this fetch.

Our new SIMT pipeline
Each lane keeps its own PC (PC 0 ... PC n). A vote among the PCs elects the master PC (MPC); instruction fetch reads one instruction at the MPC and broadcasts it, together with the MPC, to all lanes. Each lane compares the MPC to its own PC: on a match it executes the instruction and updates its PC; on no match it discards the instruction.

Benefits of stateless dynamic vectorization
Before (stack, counters): O(n) or O(log n) memory, where n is the nesting depth; 1 R/W port to memory; stack overflow and underflow exceptions; vector semantics; structured control flow only; specific instruction sets.
After (multiple PCs): O(1) memory; no shared state, which allows thread suspension, restart and migration; multi-thread semantics; traditional languages and compilers; traditional instruction sets; can be mixed with MIMD.

Scheduling policy: min(SP:PC)
Which PC should be chosen as master PC?
For conditionals and loops: follow the order of code addresses, min(PC). With compiler support, if/else and while loops are laid out so that fetching the lowest address first leads all paths to the reconvergence point.
For functions: favor the maximum nesting depth, min(SP), so a called function completes before its caller resumes.
Handles unstructured control flow too, with no code duplication, and full backward and forward compatibility.
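The election step can be sketched as follows (a hypothetical model; plain min(PC) stands in for min(SP:PC) since the example has no function calls):

```python
def simt_step(pcs):
    """One fetch cycle: elect master PC = min(pc), activate matching lanes."""
    mpc = min(pcs)                       # earliest code address wins
    active = [pc == mpc for pc in pcs]   # lanes whose PC matches execute
    return mpc, active

# Threads 1-3 branched ahead to address 7; thread 0 is still at 5.
pcs = [5, 7, 7, 7]
mpc, active = simt_step(pcs)
print(mpc, active)  # -> 5 [True, False, False, False]

# Once thread 0 also reaches 7, all four lanes reconverge on one fetch.
pcs = [7, 7, 7, 7]
mpc, active = simt_step(pcs)
print(mpc, active)  # -> 7 [True, True, True, True]
```

Electing the minimum address is what lets the lagging thread catch up to the others, so reconvergence happens without any stack or explicit reconvergence instruction.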

Potential of min(SP:PC)
Comparison of fetch policies on SPMD benchmarks: PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP and TBB, on a microarchitecture-independent model (an ideal SIMD machine), measuring the average number of active threads.
min(SP:PC) achieves reconvergence at minimal cost.
T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40(9):548-558, 2014.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

DITVA: Dynamic Inter-Thread Vectorization Architecture
Adds dynamic vectorization capability to an in-order SMT CPU; runs existing parallel programs compiled for x86.
Scheduling policy: alternate min(SP:PC) and round-robin.
Baseline: a 4-thread, 4-issue in-order core with explicit SIMD (4 scalar units, 2 SIMD units). DITVA: 4 warps of 4 threads, 4-issue, with 4 SIMT units.

DITVA performance
Speedup of a 4-warp 2-thread DITVA and a 4-warp 4-thread DITVA over the baseline 4-thread processor: +18% and +30% performance, respectively, on SPMD workloads.

Outline
- Performance or efficiency?
  - Latency-oriented architectures
  - Throughput-oriented architectures
  - Heterogeneous architectures
- Dynamic SPMD vectorization
  - Traditional dynamic vectorization
  - More flexibility with state-free dynamic vectorization
- New CPU-GPU hybrids
  - DITVA: CPU with dynamic vectorization
  - SBI: GPU with parallel path execution

Simultaneous Branch Interweaving (SBI)
Co-issue instructions from divergent branches: fill inactive units using parallelism from the divergent paths.
Diagram: on a divergent control-flow graph, baseline SIMT serializes the two paths, while SBI issues two instructions, one from each path, in the same cycle.
N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA 2012.

Conclusion: the missing link
DITVA brings the SIMT model to today's multi-core, multi-thread CPUs; SBI brings parallel path execution, a CPU-like trait, to today's SIMT GPUs.
This opens a new design space: a new range of architecture options between multi-cores and GPUs, enabling heterogeneous platforms with a unified instruction set.