Hybrid CPU-GPU cores for heterogeneous multi-core processors. Sylvain Collange INRIA Rennes / IRISA [email protected]
|
|
|
- James Adams
- 9 years ago
- Views:
Transcription
1 Hybrid CPU-GPU cores for heterogeneous multi-core processors Sylvain Collange INRIA Rennes / IRISA [email protected]
2 From GPU to heterogeneous multi-core Yesterday ( ) Homogeneous multi-core Discrete components Today ( ) Heterogeneous multi-core Physically unified CPU + GPU on the same chip Logically separated Different programming models, compilers, instruction sets Tomorrow Unified programming models? Single instruction set? Central Processing Unit (CPU) Latencyoptimized cores Throughputoptimized cores Graphics Processing Unit (GPU) Heterogeneous multi-core chip Hardware accelerators 2
3 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 3
4 The 1980': pipelined processor Example: scalar-vector multiplication: X a X for i = 0 to n-1 X[i] a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute L/S Unit Sequential CPU Memory 4
5 The 1990': superscalar processor Goal: improve performance of sequential applications Latency: time to get the result Exploits Instruction-Level Parallelism (ILP) Uses many tricks Branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation Basis: speculation Take a bet on future events If right: time gain If wrong, roll back: energy loss 5
6 What makes speculation work: regularity Application behavior likely to follow regular patterns Control regularity for(i ) { if(f(i)) { } Regular case Time Irregular case i=0 i=1 i=2 i=3 i=0 i=1 i=2 i=3 taken taken taken taken not tk taken taken not tk Memory regularity } j = g(i); x = a[j]; j=17 j=18 j=19 j=20 j=21 j=4 j=17 j=2 Speculation exploits patterns to guess accurately Applications Caches Branch prediction Instruction prefetch, data prefetch, write combining 6
7 The 2000': going multi-threaded Memory wall More and more difficult to hide memory latency Power wall Performance is now limited by power consumption ILP wall Law of diminishing returns on Instruction-Level Parallelism Gradual transition from latencyoriented to throughput-oriented Homogeneous multi-core Simultaneous multi-threading Performance Cost Time Gap Serial performance Compute Memory Transistor density Transistor power Total power Time 7
8 Homogeneous multi-core Replication of the complete execution engine Multi-threaded software move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop Machine code add i 18 IF store X[17] mul IF ID EX LSU add i 50 IF store X[49] mul IF ID EX LSU Memory Threads: T0 T1 Improves throughput thanks to explicit parallelism 8
9 Simultaneous multi-threading (SMT) Time-multiplexing of processing units Same software view move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop Machine code mul mul add i 73 add i 50 load X[89] store X[72] load X[17] store X[49] Fetch Decode Execute L/S Unit Memory Threads: T0 T1 T2 T3 Hides latency thanks to explicit parallelism 9
10 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 10
11 Throughput-oriented architectures Also known as GPUs, but do more than just graphics Target: highly parallel sections of programs Programming model: SPMD one function run by many threads One code: For n threads: X[tid] a * X[tid] Many threads: Goal: maximize computation / energy consumption ratio Many-core approach: many independent, multi-threaded cores Can we be more efficient? Exploit regularity 11
12 Parallel regularity Similarity in behavior between threads Control regularity Time Regular Thread i=17 i=17 i=17 i=17 switch(i) { case 2:... case 17:... case 21:... } Irregular i=21 i=4 i=17 i=2 Memory regularity A load A[8] load A[9] Memory load A[10] load A[11] r=a[i] load A[8] load A[0] load A[11] load A[3] Data regularity a=32 a=32 a=32 a=32 b=52 b=52 b=52 b=52 r=a*b a=17 a=-5 a=11 a=42 b=15 b=0 b=-2 b=52 12
13 Dynamic SPMD vectorization aka SIMT Run SPMD threads in lockstep Mutualize fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers T0 (0-3) load IF T1 T2 T3 (0) mul (0-3) store ID (1) mul (2) mul (3) mul EX Memory (0) (1) (2) (3) LSU SIMT: Single Instruction, Multiple Threads Wave of synchronized threads: warp Improves Area/Power-efficiency thanks to regularity 13
14 Core 127 Core 93 Core 92 Core 91 Core 66 Core 65 Core 64 Core 34 Core 33 Core 32 Core 2 Core 1 Example GPU: NVIDIA GeForce GTX 980 SIMT: warps of 32 threads 16 SMs / chip 4 32 cores / SM, 64 warps / SM Warp 1 Warp 5 Warp 2 Warp 6 Warp 3 Warp 7 Warp 4 Warp 8 Warp 60 Warp 61 Warp 62 Warp 63 Time SM1 SM Gflop/s Up to threads in flight 14
15 SIMT vs. multi-core + explicit SIMD SIMT All parallelism expressed using threads Warp size implementationdefined Dynamic vectorization Threads Multi-core + explicit SIMD Combination of threads, vectors Vector length fixed at compiletime Static vectorization Threads Warp SIMT benefits Easier programming Retain binary compatibility Vector 15
16 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 16
17 Heterogeneity: causes and consequences Amdahl's law S= Time to run sequential sections 1 1 P P N Time to run parallel sections Latency-optimized multi-core (CPU) Low efficiency on parallel sections: spends too much resources Throughput-optimized multi-core (GPU) Low performance on sequential sections Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Resources saved in parallel sections can be devoted to accelerete sequential sections M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer,
18 Single-ISA heterogeneous architectures Proposed in the academia, now in embedded systems on chip Example: ARM big.little High-performance CPU cores Cortex A57 Low-power CPU cores Cortex A53 big cores Thread migration LITTLE cores Different cores, same instruction set Migrate application threads dynamically according to demand 18
19 Single-ISA heterogeneous CPU-GPU Enables dynamic task migration between CPU and GPU Use the best core for the current job Load-balance on available resources CPU cores GPU cores Thread migration 19
20 Option 1: static vectorization Extend CPU SIMD instruction sets to throughput-oriented cores Scalar + wide vector ISA e.g. x AVX-512 Latency core Throughput cores Issue: conflicting requirements Same suboptimal SIMD vector length for all cores Or binary compatibility loss, ISA fragmentation Intel. Intel Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection. GNU Tools Cauldron
21 Our proposal: Dynamic vectorization Extend SIMT execution model to general-purpose cores Scalar ISA on both sides Latency core Throughput cores Flexibility advantage: SIMD width optimized for each core type Challenge: generalize dynamic vectorization to general-purpose instruction processing 21
22 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 22
23 Capturing instruction regularity How to keep threads synchronized? Challenge: conditional branches Rules of the game One thread per SIMD lane Same instruction on all lanes Lanes can be individually disabled Thread 0 Thread 1 Thread 2 Thread 3 1 instruction Lane 0 Lane 1 Lane 2 Lane 3 x = 0; // Uniform condition if(tid > 17) { x = 1; } // Divergent conditions if(tid < 2) { if(tid == 0) { x = 2; } else { x = 3; } } 23
24 Most common: mask stack skip Code x = 0; // Uniform condition if(tid > 17) { } x = 1; // Divergent conditions if(tid < 2) { } push pop if(tid == 0) { } else { } push pop push pop x = 2; x = 3; Mask Stack 1 activity bit / thread tid= tid= tid=2 tid=3 A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH 84,
25 Traditional SIMT pipeline Instruction Activity bit Exec Instruction Sequencer PC, Activity mask Instruction Fetch Insn, Activity mask Broadcast Instruction, Activity bit Exec Activity bit=0: discard instruction Mask stack Instruction, Activity bit Exec Used in Nvidia GPUs 25
26 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 26
27 Goto considered harmful? MIPS j jal jr syscall NVIDIA Tesla (2007) bar bra brk brkpt cal cont kil pbk pret ret ssy trap.s NVIDIA Fermi (2010) bar bpt bra brk brx cal cont exit jcal jmx kil pbk pret ret ssy.s Intel GMA Gen4 (2006) jmpi if iff else endif do while break cont halt msave mrest push pop Intel GMA SB (2011) jmpi if else endif case while break cont halt call return fork Control instructions in some CPU and GPU instruction sets Why so many? AMD R500 (2005) jump loop endloop rep endrep breakloop breakrep continue AMD R600 (2007) push push_else pop loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after Expose control flow structure to the instruction sequencer AMD Cayman (2011) push push_else pop push_wqm pop_wqm else_wqm jump_any reactivate reactivate_wqm loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after 27
28 SIMD is so last century Maspar MP-1 (1990) 1 instruction for processing elements (PEs) PE : ~1 mm², 1.6 µm process SIMD programming model /1000 Fewer PEs 50 Bigger PEs More divergence NVIDIA Fermi (2010) 1 instruction for 16 PEs PE : ~0,03 mm², 40 nm process Threaded programming model From centralized control to flexible distributed control 28
29 Moving away from the vector model Requirements for a single-isa CPU+GPU Run general-purpose applications Switch freely back and forth between SIMT and MIMD modes Conventional techniques do no meet these requirements Solution: stateless dynamic vectorization Key idea Maintain 1 Program Counter (PC) per thread Each cycle, elect one Master PC to fetch from Activate all threads that have the same PC 29
30 1 PC / thread Code x = 0; if(tid > 17) { x = 1; } if(tid < 2) { if(tid == 0) { x = 2; Master PC } } else { x = 3; } Program Counters (PCs) tid= Match active PC PC 1 PC 2 PC 3 No match inactive 30
31 Our new SIMT pipeline PC 0 Insn, MPC MPC=PC 0? Insn Exec Update PC PC 0 PC 1 Vote MPC Instruction Fetch Insn, MPC Broadcast Insn, MPC MPC=PC 1? Insn Exec Update PC No match: discard instruction PC 1 PC n Insn, MPC MPC=PC n? Insn Exec Update PC PC n 31
32 Benefits of stateless dynamic vectorization Before: stack, counters O(n), O(log n) memory n = nesting depth 1 R/W port to memory Exceptions: stack overflow, underflow Vector semantics Structured control flow only Specific instruction sets After: multiple PCs O(1) memory No shared state Allows thread suspension, restart, migration Multi-thread semantics Traditional languages, compilers Traditional instruction sets Can be mixed with MIMD 32
33 Scheduling policy: min(sp:pc) Which PC to choose as master PC? Conditionals, loops Order of code addresses min(pc) Functions Favor max nesting depth min(sp) With compiler support Unstructured control flow too No code duplication Full backward and forward compatibility Source Assembly Order if( ) { p? br else } 1 else br endif { else: } 2 endif: 3 while( ) { } f(); void f() { } start: p? br start call f f: ret
34 Potential of Min(SP:PC) Comparison of fetch policies on SPMD benchmarks PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP, TBB Microarchitecture-independent model: ideal SIMD machine Average number of active threads Min(SP:PC) achieves reconvergence at minimal cost T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40.9:
35 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 35
36 DITVA Dynamic Inter-Thread Vectorization Architecture Add dynamic vectorization capability to an in-order SMT CPU Runs existing parallel programs compiled for x86 Scheduling policy: alternate min(sp:pc) and round-robin 4 scalar units 2 SIMD units 4 SIMT units Baseline: 4-thread 4-issue in-order with explicit SIMD DITVA: 4-warp 4-thread 4-issue 36
37 DITVA performance Speedup of 4-warp 2-thread DITVA and 4-warp 4-thread DITVA over baseline 4-thread processor +18% and +30% performance on SPMD workloads 37
38 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 38
39 Simultaneous Branch Interweaving Co-issue instructions from divergent branches Fill inactive units using parallelism from divergent paths Control-flow graph Same cycle, two instructions SIMT (baseline) SBI N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA
40 Conclusion: the missing link CPU today GPU today CPU ISA SIMT model Multi-core multi-thread DITVA SBI SIMT New design space New range of architecture options between multi-core and GPUs Enables heterogeneous platforms with unified instruction set 40
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.
Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique [email protected] GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU.
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
LSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
VLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Optimizing Code for Accelerators: The Long Road to High Performance
Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?
IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010
GPU Architecture An OpenCL Programmer s Introduction Lee Howes November 3, 2010 The aim of this webinar To provide a general background to modern GPU architectures To place the AMD GPU designs in context:
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek
Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano [email protected] Prof. Silvano, Politecnico di Milano
Energy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
Operating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
Chapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
Computer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
Pipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida [email protected] [email protected] October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing
Thread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
GPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
SOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2
GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,
OpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
OC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
Multicore Processor and GPU. Jia Rao Assistant Professor in CS http://cs.uccs.edu/~jrao/
Multicore Processor and GPU Jia Rao Assistant Professor in CS http://cs.uccs.edu/~jrao/ Moore s Law The number of transistors on integrated circuits doubles approximately every two years! CPU performance
Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
Instruction Set Design
Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution
Embedded Parallel Computing
Embedded Parallel Computing Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors Tomas Nordström Course webpage:: Course responsible and examiner: Tomas
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
GPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.
Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:
Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
ADVANCED COMPUTER ARCHITECTURE
ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: [email protected] Web: www.unipv.it/mferretti, eecs.unipv.it 1 Course syllabus and motivations This course covers the
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group
Shader Model 3.0 Ashu Rege NVIDIA Developer Technology Group Talk Outline Quick Intro GeForce 6 Series (NV4X family) New Vertex Shader Features Vertex Texture Fetch Longer Programs and Dynamic Flow Control
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes
basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static
PROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
Radeon HD 2900 and Geometry Generation. Michael Doggett
Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command
A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey
A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:
İSTANBUL AYDIN UNIVERSITY
İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER
WAR: Write After Read
WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads
Multithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
NVIDIA Tegra 4 Family CPU Architecture
Whitepaper NVIDIA Tegra 4 Family CPU Architecture 4-PLUS-1 Quad core 1 Table of Contents... 1 Introduction... 3 NVIDIA Tegra 4 Family of Mobile Processors... 3 Benchmarking CPU Performance... 4 Tegra 4
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling Veynu Narasiman Michael Shebanow Chang Joo Lee Rustam Miftakhutdinov Onur Mutlu Yale N Patt The University of Texas at Austin {narasima,
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
IA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
MICROPROCESSOR AND MICROCOMPUTER BASICS
Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit
Multi-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
