Lecture 11: Multi-Core and GPU
Zebo Peng, IDA, LiTH




Outline:
- Multi-core computers
- Multithreading
- GPUs
- General Purpose GPUs

Multi-core Computers

Multi-Core System
- Integration of multiple processor cores on a single chip:
  - to provide a cheap parallel-computer solution;
  - to increase the computation power of a PC platform.
- A multi-core processor is a special kind of multiprocessor:
  - All processors are on the same chip, so it is also called a chip multiprocessor.
  - MIMD: different cores execute different code (threads) operating on different data.
  - It is a shared-memory multiprocessor: all cores share the same memory via some cache system.

Why Multi-Core?
- Limitations of single-core architectures:
  - High power consumption due to high clock rates (roughly 2-3% power increase per 1% performance increase).
  - Heat generation (cooling is expensive).
  - Limited parallelism (instruction-level parallelism only).
  - Increased design time and complexity, due to the complex methods needed to increase ILP.
- Many new applications are multithreaded and thus well suited to multi-core, e.g., multimedia applications.
- General trend in computer architecture: a shift towards more parallelism.
- Cache-coherency circuits are much faster when kept within a single chip.
- Smaller physical size than an SMP built from separate chips.

A Simplified Multi-Core Model
[Figure: four cores, each with private L1 instruction and data caches, connected to main memory over an on-chip shared bus.]

Alternative Models
[Figure: alternative multi-core organization models.]

Intel Core i7
- Four (or six) identical x86 cores.
- Each core has its own split L1 caches and a unified L2 cache.
- A shared L3 cache speeds up inter-core communication.
- High-speed link between processor chips.
- Up to 4 GHz clock frequency.

Superscalar vs. Multi-Core
- A 6-way superscalar core: issue-logic complexity increases quadratically with issue width.
- A multi-core processor with 4 identical 2-way superscalar cores gives a total parallelism degree of 8 with much simpler issue logic.
- Multi-core processors can therefore run at a higher clock rate.

Single Core vs. Multi-Core
[Chart: normalized performance vs. the initial Intel Pentium 4 processor, 2000 to 2008+; multi-core reaches about 10X while single core reaches about 3X. Source: Intel.]

Intel Polaris
- 80 cores delivering teraflops (10^12 FLOPS) performance on a single chip (the first chip to do so).
- Mesh network-on-chip.
- Frequency target of 5 GHz.
- Workload-aware power management:
  - instructions to put any core to sleep or wake it up as applications demand;
  - chip voltage and frequency control.
- Peak of 1.01 teraflops at 62 watts.
- Peak power efficiency of 19.4 GFLOPS/watt (almost 10X better than supercomputers).
- Short design time, thanks to the regular tiled layout.
- Chip data: 65 nm technology, 1 poly, 8 metal (Cu) layers; 100 million transistors full-chip (1.2 million per tile); die area 275 mm^2 full-chip (3 mm^2 per tile, about 1.5 mm x 2.0 mm, on a 21.72 mm x 12.64 mm die); 8390 C4 I/O bumps.

Innovations in Intel's Polaris
- Rapid design: the tiled-design approach allows designers to use smaller cores that can easily be repeated across the chip.
- Network-on-chip: the cores are connected in a 2D mesh network that implements message passing, a much more scalable scheme.
- Fine-grained power management: the individual computing engines and data routers in each core can be activated or put to sleep based on the performance required by the applications.

Multithreading

Thread-Level Parallelism (TLP)
- Parallelism at a coarser level than instruction-level parallelism (ILP).
- The instruction stream is divided into smaller streams (threads) to be executed in parallel.
- A thread has its own instructions and data:
  - it may be part of a parallel program or an independent program;
  - each thread has all the state (instructions, data, PC, etc.) needed to execute.
- Multi-core architectures can exploit TLP efficiently (see the sketch below):
  - multiple instruction streams improve the throughput of computers that run several programs;
  - TLP is more cost-effective to exploit than ILP.
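To make TLP concrete, here is a minimal sketch (not from the slides; the function and data names are illustrative) in which two host threads run independent instruction streams on their own data. On a multi-core CPU the OS can schedule them on different cores:

```cuda
// Minimal TLP sketch: two independent threads, each with its own data.
// Host-only C++ code (compiles with nvcc or any C++11 compiler).
#include <functional>
#include <numeric>
#include <thread>
#include <vector>
#include <cstdio>

// Each thread runs its own instruction stream over its own data.
void sum_range(const std::vector<double>& data, double* result) {
    *result = std::accumulate(data.begin(), data.end(), 0.0);
}

int main() {
    std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0);
    double sum_a = 0.0, sum_b = 0.0;

    // Two independent threads: candidates for two different cores.
    std::thread t0(sum_range, std::cref(a), &sum_a);
    std::thread t1(sum_range, std::cref(b), &sum_b);
    t0.join();
    t1.join();

    std::printf("sum_a = %.1f, sum_b = %.1f\n", sum_a, sum_b);
    return 0;
}
```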

Multithreading Approaches
- Interleaved (fine-grained): the processor deals with several thread contexts at a time, switching thread at each clock cycle (hardware support needed). If a thread is blocked, it is skipped. This hides the latency of both short and long pipeline stalls.
- Blocked (coarse-grained): a thread executes until an event causes a delay (e.g., a cache miss). This relieves the need for very fast thread switching, and ready-to-go threads are not slowed down.
- Simultaneous (SMT): instructions are issued simultaneously from multiple threads to the execution units of a superscalar processor.
- Chip multiprocessing: each processor handles separate threads.

Scalar Processor Case
[Figure: a scalar pipeline stalling on a cache miss.]

Scalar Processor Approaches
- Single-threaded scalar: simple pipeline and no multithreading.
- Interleaved multithreaded scalar: the easiest multithreading to implement. Threads are switched at each clock cycle, keeping the pipeline stages close to fully occupied; the hardware must switch thread context between cycles.
- Blocked multithreaded scalar: a thread executes until a latency event stops the pipeline, at which point the processor switches to another thread.

Superscalar Case
[Figure: issue-slot diagrams for the superscalar multithreading approaches discussed next.]

Superscalar Approaches
- Classical superscalar: no multithreading.
- Interleaved multithreading superscalar: each cycle, as many instructions as possible are issued from a single thread. Delays due to thread switches are eliminated, but the number of instructions issued in a cycle is limited by dependencies.
- Blocked multithreaded superscalar: instructions from one thread at a time are executed, using blocked (coarse-grained) multithreading.

SMT and Chip Multiprocessing
- Simultaneous multithreading (SMT): multiple instructions are issued at a time.
  - One thread may fill all horizontal (issue) slots, or instructions from two or more threads may be issued to fill them.
  - With enough threads, the processor can issue the maximum number of instructions on each cycle: the best efficiency!

SMT and Chip Multiprocessing (cont.)
- Chip multiprocessing: multiple processors on one chip.
  - Each can itself be a superscalar processor with multiple instruction issues (e.g., 2-issue).
  - Each processor is assigned a thread (or a process): control is simplified!

Multithreading Paradigms
[Figure: occupancy of functional units FU1-FU4 over execution time for five threads, comparing a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and a chip multiprocessor (CMP).]

Programming for Multi-Core
- There must be many threads or processes:
  - multiple applications running on the same machine; multi-tasking is becoming very common;
  - OS software tends to run many threads as part of its normal operation;
  - an application may also have multiple threads, but in most cases it must be specifically written or compiled for that (see the sketch below).
- The OS scheduler should map threads to different cores:
  - to balance the workload; or
  - to avoid hot spots due to heat generation.
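A hedged sketch of how an application can be written specifically for multi-core: it asks the C++ runtime how many hardware threads the machine offers and partitions one loop across that many software threads, which the OS scheduler can then place on different cores. All names and sizes are illustrative, not from the slides:

```cuda
// Host-only C++ sketch: one software thread per hardware core.
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Query the number of hardware threads (cores); fall back if unknown.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;

    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([&, c] {
            // Each thread processes its own contiguous slice of the data.
            std::size_t begin = n * c / cores;
            std::size_t end   = n * (c + 1) / cores;
            for (std::size_t i = begin; i < end; ++i)
                y[i] += 2.0f * x[i];
        });
    }
    for (auto& t : workers) t.join();

    std::printf("used %u threads, y[0] = %.1f\n", cores, y[0]);
    return 0;
}
```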

GPUs

Graphics Processing Unit: Why?
- Graphics applications are:
  - rendering of 2D or 3D images with complex optical effects;
  - highly computation intensive;
  - massively parallel;
  - data-stream based.
- General-purpose processors (CPUs) are:
  - designed to handle huge data volumes;
  - serial, one operation at a time;
  - control-flow based;
  - highly flexible, but not well adapted to graphics.
- GPUs are the solution!

Graphics Processing Example
[Figure: a graphics pipeline with input from the CPU via the host interface, followed by vertex processing, triangle setup and pixel processing (the programmable parts, shown in yellow), and a memory interface with four 64-bit channels to memory.]

Graphics Processing Features
1) Process pixels in parallel (see the kernel sketch after this list):
- large degree of data parallelism: 2.1M pixels per frame (HDTV), so lots of work;
- all pixels are independent, so no synchronization is needed;
- lots of spatial locality, giving regular memory access;
- great speedups can be achieved, limited only by the amount of hardware.

2) Focus on throughput, not latency:
- each pixel can take a long time to process, as long as many of them are processed at the same time;
- this gives great scalability: lots of simple parallel processors at a low clock speed;
- CPU: latency-optimized (fast, serial); GPU: throughput-optimized (slow, parallel).
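As a sketch of the one-thread-per-pixel idea from feature 1, the CUDA kernel below brightens a grey-scale frame; every pixel is handled by its own thread and no synchronization is needed. The kernel name, image size and launch configuration are illustrative, not from the slides:

```cuda
// One thread per pixel: each thread reads and writes only its own element.
__global__ void brighten(unsigned char* pixels, int width, int height, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int v = pixels[idx] + delta;
        pixels[idx] = v > 255 ? 255 : v;   // clamp to the 8-bit range
    }
}

// Example launch for a 1920x1080 (HDTV) grey-scale frame already in GPU memory:
//   dim3 block(16, 16);
//   dim3 grid((1920 + 15) / 16, (1080 + 15) / 16);
//   brighten<<<grid, block>>>(d_pixels, 1920, 1080, 10);
```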

GPU Examples
[Die photos of the Nvidia G80 and AMD 5870: lots of small parallel SIMD processors, limited interconnect, limited memory, lots of memory controllers, very small caches, hardware thread schedulers, and fixed-function logic.]

CPU vs. GPU Philosophy
[Figure: a chip with 4 large CPU cores (branch predictor, L1, L2 caches) next to a grid of simple GPU cores (local memory, instruction cache). Source: David Black-Schaffer.]
- 4 massive CPU cores: big caches, branch predictors, out-of-order, multiple-issue, speculative execution; 2 IPC per core, 8 IPC total at 3 GHz.
- 8x8 simple GPU cores: no big caches, in-order, single-issue; 1 IPC per core, 64 IPC total at 1.5 GHz.

CPU vs. GPU Chips (Source: David Black-Schaffer)
- Intel Nehalem, 4 cores: 130 W, 263 mm^2; 106 GFLOPS (single precision); big caches (8 MB); out-of-order execution; 0.8 GFLOPS/W.
- AMD Radeon 5870: 188 W, 334 mm^2; 2720 GFLOPS (single precision); small caches (<1 MB); hardware thread scheduling; 14.5 GFLOPS/W.

CPU vs. GPU Architecture
- CPUs are latency-optimized: each thread runs as fast as possible, but there are only a few threads.
- GPUs are throughput-optimized: each thread may take a long time, but there are thousands of threads.
- CPUs have a few massive cores; GPUs have hundreds of simple cores.
- CPUs excel at irregular, control-intensive work: lots of hardware for control, few ALUs.
- GPUs excel at regular, computation-intensive work: lots of ALUs for computation, little hardware for control.
- The best is to combine both architectures in the same machine!

General Purpose GPUs

GPU: a Specialized Processor
- Very efficient for:
  - fast parallel floating-point processing;
  - SIMD (Single Instruction Multiple Data) operations;
  - high computation per memory access.
- Not as efficient for:
  - double-precision computations;
  - logical operations on integer data;
  - random-access, memory-intensive operations;
  - branching-intensive operations.

GPU Instruction Flow
- A GPU control unit fetches 1 instruction per cycle and shares it among 8 processor cores: SIMD execution (an AMD GPU is 4-way; Intel's Larrabee is 16-way).
- What if they don't all want the same instruction? Divergent execution, due to branches, for example.

Divergent Execution
[Figure: cycle-by-cycle instruction fetch for eight threads t0-t7 executing an if/else; once the branch diverges, the shared fetch unit must issue the "then" and "else" paths in separate cycles, so some threads stall every cycle. Source: David Black-Schaffer.]
Divergent execution can dramatically hurt performance!

Reduce Branch Divergence
- Software-based optimizations can be used (see the kernel sketch below).
- Iteration delaying targets a divergent branch enclosed by a loop:
  - execute the loop iterations that take the same branch direction;
  - delay those that take the other direction until later iterations;
  - execute the delayed operations in future iterations, where more threads take the same direction.
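Iteration delaying itself is a compiler/software transformation; the simpler sketch below (hypothetical kernels, not from the slides) only illustrates the underlying problem and one manual remedy: in the first kernel, neighbouring threads of a warp take different branch directions, so the hardware serializes both paths; in the second, the condition is uniform within each warp, so no divergence occurs:

```cuda
// Divergent: even and odd threads in the same warp take different paths,
// so the warp executes both paths one after the other.
__global__ void divergent(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }
}

// Warp-uniform: whole warps take the same path, so each warp executes
// only one of the two paths and no serialization is needed.
__global__ void warp_uniform(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / warpSize) % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }
}
```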

GPGPU
- Make the GPU more general by having more programmable components.
- Implement the GPU and a (multi-core) CPU in the same machine.
- Provide hardware support on the GPU for non-graphics applications.
- Include multiple parallelism schemes in the same architecture.

NVIDIA Tesla Architecture
[Figure: block diagram of the NVIDIA Tesla GPU architecture.]

GPU Architecture Features
- Massively parallel: thousands of processors (today).
- Cost efficiency: graphics drives demand up and supply up, leading to low prices.
- Power efficiency: fixed-function hardware is area- and power-efficient; no speculation means more processing and less leaky cache.
- SIMD computation.
- Computing power = frequency x transistors: (Moore's law) squared.
  - GPUs: 1.7X (pixels) to 2.3X (vertices) annual growth.
  - CPUs: 1.4X annual growth.

Performance Improvement
[Chart: peak GFLOPS from 2002 to 2009, NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) approaching 1400 GFLOPS, versus Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere), which grow far more slowly.]

CUDA Programming Language
- CUDA = Compute Unified Device Architecture.
- Minimal extensions to familiar C/C++.
- Heterogeneous serial/parallel programming model:
  - serial code executes on the CPU;
  - parallel kernels execute across a set of parallel threads on the GPU.
- It enables general-purpose GPU computing, and it also maps well to multi-core CPUs.
- It enables heterogeneous systems (i.e., CPU + GPU): the CPU is good at serial computation, the GPU at parallel.

General CUDA Steps
- Copy data from the CPU to the GPU.
- Perform SIMD computations (the kernel) on the GPU.
- Copy data back from the GPU to the CPU.
- Execution on the host does not wait for the kernel to finish on the GPU.
- General rules: minimize data transfer between CPU and GPU; maximize the number of threads on the GPU (see the sketch below).
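A minimal sketch of these steps for a SAXPY computation, using the standard CUDA runtime API (cudaMalloc, cudaMemcpy, kernel launch, cudaDeviceSynchronize); sizes and names are illustrative and error checking is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = new float[n], *h_y = new float[n];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                                 // allocate on the GPU
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);     // CPU -> GPU
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, d_x, d_y);      // SIMD kernel on GPU
    cudaDeviceSynchronize();                                 // host waits explicitly

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);     // GPU -> CPU
    std::printf("y[0] = %.1f\n", h_y[0]);                    // expect 5.0

    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```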

Summary
- All computers are now parallel computers!
- Multi-core processors represent an important new trend in computer architecture:
  - decreased power consumption and heat generation;
  - minimized wire lengths and interconnect latencies;
  - they enable true thread-level parallelism with great energy efficiency and scalability.
- Graphics requires a lot of computation and a huge amount of bandwidth, hence GPUs.
- GPUs are coming to general-purpose computing, since they deliver huge performance at low power.
- Massively parallel computing is becoming a commodity technology.