Thread Level Parallelism (TLP)

Transcription

Thread Level Parallelism (TLP)
Calcolatori Elettronici 2

TLP: SUN Microsystems vision (2004)
Roberto Giorgi, Universita di Siena, C208L15, Slide 2

Estimated Industry Trends
- Moore's Law allows for a rapid increase in transistors per core. TLP-optimised cores will start out much simpler, and may grow complex more slowly.
- The trend is for chips and CPU cores to get smaller, though TLP-optimised ones will start much smaller.
- Growth rates in maximum power for "fat" CPUs have levelled off a bit. For "thin" cores, the number of CPU cores per chip will probably increase rather than the power consumption per core.
- "Fat" cores need lots of cache to reduce memory latency. TLP-optimised designs are less latency sensitive, so less cache is needed.
- Better process technology helps both types to increase, though the simpler, slower-clocked "thin" cores will be slower on more traditional benchmarks.
- "Fat" cores will benefit from TLP techniques and general improvements, but not as much as "thin" cores.

Current 4-way SMP
[Figure: a 4-way system today.]
The only TLP comes from having multiple chips.

Toward NIAGARA chips
[Figure: a system with a heavily optimised TLP design.]

Niagara: A Torrent of Threads
[Figure: Niagara floorplan.]

First Niagara Chips: November 2005
- UltraSPARC T1
- Niagara systems deliver 14 times the performance of an UltraSPARC IIIi system
- Systems with the single-chip Niagara 2: 35 times
- Systems with Victoria Falls: 65 times

EMBEDDED SYSTEM TRENDS

Global Embedded Systems Revenue (by Region)
[Figure: global embedded systems revenue ($ billions) and AAGR (%) by region: Americas, Europe, Japan, Asia-Pacific.]
AAGR: average annual growth rate
Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005

Global Embedded Systems Revenue (by Application)
[Figure: world embedded systems revenue ($ billions) and AAGR (%) by application: Telecomm, Consumer, Automotive, Medical/Office, Industrial/Milit.]
Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005

Global Embedded HW Revenue
MPU: microprocessors
MCU: microcontrollers
[Figure: global embedded hardware revenue ($ billions) and AAGR (%) by category: MPU, MCU, DSP, Memory, ASIC/PLD, Analog.]
Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005

Projected Technology Progress
[Figure: transistor density of MPUs (including SRAM), in Mtransistors/cm², by year.]
Source: Process Integration, Devices and Structures, ITRS, 2005
Transistor number will continue to scale for some time.

Embedded Platforms Roadmap
[Figure: use of embedded processors in FPGAs (0% to 100%): hard FPGA processor, soft FPGA processor, no FPGA processor.]
Hardwired logic (ASIC-like) is being replaced by embedded processor devices.
Source: Survey of System Design Trends, Celoxica Inc., August 2005

Embedded Processors: Innovation driven by Technology + Architecture Advances
- Multi-processing: higher throughput with less speed
Source: The Era of Tera, Pat Gelsinger, Intel, 2005

Case Study: ITRS Mobile Handheld Roadmap
[Table, by year of production: process technology (nm), supply voltage (V), clock frequency (MHz), processing performance (GOPS), average power (W), standby power (mW); applications: real-time video codec, TV telephone.]
Source: System Drivers, ITRS, 2003
Performance and energy efficiency (GOPS/W) increase by 200x.

ITRS Low-Power SoC
Source: System Drivers, ITRS, 2005
- Many processing elements (PEs)
- Reusability and multi-standard requirements drive for programmable (processor-based) solutions
- (Heterogeneous) multi-processor systems-on-a-chip (SoC)

ITRS Low-Power SoC Processing/Performance Trends
Source: System Drivers, ITRS, 2005
More than 100 processing elements in 2011!

Future Embedded System Design Trends
- The mobile handset market is the driving commercial factor
- New applications and wireless transmission standards require high-performance, low-power embedded computing
- ITRS foresees a 3x magnitude improvement in performance and energy efficiency over the next 10 years
- (Heterogeneous) multi-processor system-on-chip platforms
- Compiler technologies for high-performance, low-power embedded computing will be needed
- Compiler and system-design tools for heterogeneous, massively parallel processing systems and networks

Network of Excellence HiPEAC
High-Performance Embedded Architectures and Compilers
IST Web Site

What We Have Now
- ACACES Extranet (program, practical info, ...)
- Participant management
- HiPEAC Conference Extranet (committees, call for papers, practical info, ...)
- Paper submission (Commence)

SARC: Scalable ARChitectures
WEB Site:

Paradigm shift
- Tiled architecture, built from fixed-size nodes
- The architecture scales up by adding nodes, NOT by growing the node size
- The node becomes the processor
- The processors become the functional units

Programming model features
Programming model will have tagged procedure calls
- Define local and global (shared) variables
  - Defines address range(s) to copy to local store
  - Automatic programming of DMA transfers
  - Defines address range(s) to watch for interference
- Set procedure properties
  - Has secondary effects (modifies global state)
  - Reads global space
  - Writes global space
  - Requires atomicity
    - Regarding local variables
    - Regarding global variables
- Processor functionality requirements
  - Supports a specific ISA extension (or a different ISA)
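As an illustration of what a "tagged procedure call" could look like, here is a minimal Python sketch. The slides do not define any syntax, so everything here is hypothetical: the `ProcTags` fields, the `tagged` decorator, and the helpers `can_run_on` and `dma_plan` are names made up for this example and only mirror the properties listed above.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Range = Tuple[int, int]  # (start address, length in bytes)


@dataclass(frozen=True)
class ProcTags:
    """Hypothetical set of tags attached to a procedure."""
    local_ranges: Tuple[Range, ...] = ()   # ranges to copy to the local store
    watch_ranges: Tuple[Range, ...] = ()   # ranges to watch for interference
    has_side_effects: bool = False         # modifies global state
    reads_global: bool = False
    writes_global: bool = False
    atomic: bool = False
    required_isa: FrozenSet[str] = frozenset()  # ISA extensions the body needs


def tagged(**kwargs) -> Callable:
    """Decorator that stores the tags on the function object."""
    tags = ProcTags(**kwargs)

    def wrap(fn: Callable) -> Callable:
        fn.tags = tags
        return fn

    return wrap


def can_run_on(fn: Callable, node_isa: FrozenSet[str]) -> bool:
    """A processor qualifies if it supports every required ISA extension."""
    return fn.tags.required_isa <= node_isa


def dma_plan(fn: Callable) -> List[Tuple[str, int, int]]:
    """Derive the DMA transfers from the tagged local ranges."""
    return [("dma_in", start, length) for start, length in fn.tags.local_ranges]


@tagged(local_ranges=((0x1000, 256),), reads_global=True,
        required_isa=frozenset({"vector"}))
def scale(buf, k):
    return [k * x for x in buf]
```

With these assumptions, a runtime could consult `scale.tags` to pick a suitable processor and to program the DMA transfers before the call, which is the kind of automation the slide describes.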

Intra-node memory hierarchy
- Architecture must be easy to program for
  - Shared memory
- Accelerators may have:
  - Local memory: private, non-coherent
  - DMA controller: bridge between global memory and local memory
- Accelerators must have:
  - Global memory access: directly, or through cache hierarchy
- Single load/store instruction
  - Address range differentiates local memory from global memory
[Figure: accelerators (ACC) with local memory, DMA and accelerator cache(s), connected through the local interconnect to the outer shared cache.]

Intra-node memory hierarchy (II)
- All caches inside a node must be coherent
- All outer caches (from each node) should also be coherent
- Caches work as shared distributed memory
- If threads do not share memory
  - There's no coherence traffic, nor overhead
  - There's no memory waste
- If threads share memory
  - Turning off coherence results in wrong execution
- What is the benefit of turning off coherence?
  - The hardware must be there anyway
  - Turn it off for power savings?
  - Lower memory access latency in non-shared mode?
  - Use the hardware for something else? (what else? additional storage?)
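The first intra-node memory hierarchy slide states that a single load/store instruction is used and that the address range differentiates local memory from global memory. A minimal Python sketch of that decode step follows. The window base `LOCAL_BASE`, its size `LOCAL_SIZE`, and the `NodeMemory` class are assumptions chosen for illustration, not values from the slides.

```python
LOCAL_BASE = 0xF000_0000   # hypothetical start of the local-memory window
LOCAL_SIZE = 64 * 1024     # hypothetical local-memory size in bytes


class NodeMemory:
    """Toy model: one load/store path, routed by address range."""

    def __init__(self):
        self.local = bytearray(LOCAL_SIZE)  # private, non-coherent store
        self.global_mem = {}                # sparse global memory: addr -> byte

    @staticmethod
    def is_local(addr: int) -> bool:
        return LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE

    def load(self, addr: int) -> int:
        if self.is_local(addr):
            return self.local[addr - LOCAL_BASE]
        return self.global_mem.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        if self.is_local(addr):
            self.local[addr - LOCAL_BASE] = value & 0xFF
        else:
            self.global_mem[addr] = value & 0xFF

    def dma_in(self, global_addr: int, local_addr: int, n: int) -> None:
        """DMA as the bridge: copy n bytes from global to local memory."""
        if n <= 0 or not self.is_local(local_addr) \
                or not self.is_local(local_addr + n - 1):
            raise ValueError("destination must lie inside local memory")
        for i in range(n):
            self.store(local_addr + i, self.load(global_addr + i))


mem = NodeMemory()
for i in range(4):
    mem.store(0x100 + i, 10 + i)        # lands in global memory
mem.dma_in(0x100, LOCAL_BASE, 4)        # bridge into the local store
```

The program issues the same `load` and `store` calls for both memories; only the address decides where the access goes.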

Examples for intra-node memory
[Figure: accelerator (ACC) variants combining local memory, DMA and accelerator cache(s), attached to the local interconnect and the outer shared cache.]

Determine the node size
- If node size is fixed, we must determine its size
- Split available area among
  - Shared cache
  - Local interconnect
  - General purpose processor
  - Accelerators
- Fixed or flexible distribution?
  - Fixed GPP, cache, interconnect
  - Reconfigurable accelerator area
- How many accelerators can a thread actually exploit?
  - Streaming computation
  - Parallel computation
  - Task offloading
[Figure: node with outer cache memory, local interconnect and GPP.]

Node examples
- Sea of simple cores
  - Niagara
  - Cell
- Few complex cores
  - Power5
- Single vector/media/bio accelerator
- Multiple accelerators
[Figure: node with outer cache memory, local interconnect and GPP.]

GPP Accelerator interface
For the processor to become the functional unit, task offloading must have minimum overhead
- Accelerator as ISA extension
  - Shares PC, Fetch & Decode with a general purpose CPU
  - Issue logic sends instructions to CPU or Accelerator Units
  - Implements an extension of the base ISA
- Accelerator as a new CPU
  - Has a separate PC, Fetch, Decode engine
  - May implement a completely different ISA: VLIW, SIMD, Stack, 16-bit
[Figure: the two options side by side: a shared Fetch & Dispatch stage feeding GPP and ACC, and an ACC with its own Fetch & Decode on the local interconnect.]

Memory Hierarchy
- Set of coherent (processor-shared?) L1 caches inside the nodes: C x Node
- Set of coherent node-shared L2 caches inside the chip (one from each node): 1 x Node, N x Chip
- Chip-shared L3 cache: 1 x Chip
- Off-chip DRAM (or other memory technology)
[Figure: DRAM, I/O, L3 cache and control blocks.]

Motivation
- Hard to further scale uniprocessors
  - Brought back focus to multiprocessors
- Different applications profit from different techniques/types of parallelism
  - ILP, TLP, DLP
- Motivates a customizable system with
  - complex cores
  - simple cores
  - domain-specific accelerators

Motivation (2)
Parallelism type exhibited by application and suitable architecture:
[Figure: TLP, DLP and ILP mapped to architectures; labels: SSC, CMP+ vector, SMT, vector, FCC.]

SARC?
[Figure: complex cores, simple cores, accelerators.]

ISA considerations
- Complex cores and simple cores have the same ISA (allows moving threads from one to another [for real-time performance, power, ...], simpler programming and compilation)
- ISA-agnostic approaches applicable to basically any ISA (ARM, PowerPC, ...)
- Accelerator ISAs: extensions of the GPP ISA
  - single instruction stream (co-processor instructions) or multiple instruction streams

How to realize customization?
- At design-time: the right mix of simple cores, complex cores and accelerators is determined at design-time
  - Pro: highest performance for specific application domains
  - Con: after fabrication, only for specific application domains
- At run-time: there will be many processing cores on a chip; for temperature reasons some will have to be powered down anyhow
  - Pro: allows good performance and low power on many applications
  - Con: performance not as high as at design-time

Levels of Abstraction
Levels of abstraction:
1. Architecture
2. Microarchitecture
3. Implementation
4. Realization
SARC WP1 focuses mainly on levels 1 and 2

SARC node architecture

Architectures of Domain Specific Accelerators
- SARC specifically targets (but is not limited to) application domains
  - scientific computing (supercomputing)
  - bioinformatics
  - multimedia
  - internet and transaction processing
- These contain code pieces responsible for a large fraction of execution time
- Performance and power-efficiency can be improved significantly by employing domain-specific accelerators

Scientific Computing Vector Accelerator Architecture
- For applications dominated by loops with vector operands
- What are the innovations:
  - Matrix by Matrix operations (at least 2D)
  - Dimensionality not encoded in the instructions (novel register file to support this)
  - Sparse and Dense matrices considered identically
  - Auto-indexing and sectioning addressing mechanisms (link to WP2)
  - (possible) on-chip distributed vector facility
- ISA, data formats, register file organization and memory addressing scheme under investigation

Scientific Computing Vector Accelerator Architecture (cont)
- ISA (check the document)
- Operand types: Vectors, Matrices (Sparse and Dense), Bit vectors and Scalars (in sparse mode, 1/2 of the available registers are used as index vectors)
- Data formats: 64-bit FP; 8-, 16-, 32- and 64-bit INT and BOOL
- Auto indexing for rectangular patterns (dense)

Scientific Computing Vector Accelerator Architecture (cont)
- Register file: the SARC vector register file is a parameterizable register file, which can be logically reorganized by the programmer to support multiple register dimensions and sizes simultaneously
- Scalar register file shared with GPP
  1) Vector registers can overlap (think about it)
  2) Scalar registers can be used for conditional branches on the GPP side
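To make "vector registers can overlap" concrete, the following Python sketch models one possible reading of a programmer-reorganizable register file: a flat pool of cells, over which named registers are defined as (base, rows, cols) windows. The slides do not describe the actual SARC mechanism, so the `VectorRegisterFile` class and its `define`, `read` and `write` methods are purely hypothetical.

```python
class VectorRegisterFile:
    """Toy model: a flat storage pool with programmer-defined register shapes."""

    def __init__(self, size: int):
        self.cells = [0] * size
        self.regs = {}  # name -> (base, rows, cols)

    def define(self, name: str, base: int, rows: int, cols: int = 1) -> None:
        """Declare a register as a rows x cols window starting at cell `base`."""
        if base < 0 or rows <= 0 or cols <= 0 \
                or base + rows * cols > len(self.cells):
            raise ValueError("register does not fit in the register file")
        self.regs[name] = (base, rows, cols)

    def write(self, name: str, values) -> None:
        base, rows, cols = self.regs[name]
        flat = [x for row in values for x in row]
        if len(flat) != rows * cols:
            raise ValueError("shape mismatch")
        self.cells[base:base + rows * cols] = flat

    def read(self, name: str):
        base, rows, cols = self.regs[name]
        return [self.cells[base + r * cols: base + (r + 1) * cols]
                for r in range(rows)]


rf = VectorRegisterFile(16)
rf.define("M0", base=0, rows=2, cols=4)   # a 2x4 matrix register
rf.define("V1", base=4, rows=1, cols=4)   # a vector aliasing row 1 of M0
rf.write("M0", [[1, 2, 3, 4], [5, 6, 7, 8]])
row = rf.read("V1")                       # same cells as the second row of M0
rf.write("V1", [[9, 9, 9, 9]])            # the update is visible through M0
```

Under this reading, overlap lets a row of a matrix register be addressed as a vector register without any copy, and registers of different dimensions coexist in the same file.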

Bioinformatics Accelerator
- Will have a scalar and a vector-SIMD part
- (Multiple) sequence alignment algorithms require:
  - support for efficient unaligned memory accesses
  - strided memory accesses
  - vector reduction operations, etc.
- In structure prediction, Monte Carlo or molecular dynamics simulations are common
  - can profit from earlier ASIC/FPGA work
- Docking profits from architectural features incorporated for structure prediction, but also from matrix rotations, transposes, ...

Multimedia accelerator
- Vector-SIMD architecture
- Architecture agnostic to physical vector length
- Avoid packing/unpacking, reorganization overhead
  - unpacking while loading
  - packing while storing
  - flexible access to register file
- Use more dimensions
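The multimedia slide's "unpacking while loading" and "packing while storing" can be illustrated with a small Python sketch. It assumes 8-bit unsigned pixels in memory that are widened into larger lanes by the load and narrowed back by the store; the function names are hypothetical, and saturation on the store is an assumption of this example (the slides do not say how narrowing is done).

```python
def load_unpack_u8(mem: bytearray, addr: int, n: int) -> list:
    """Load n unsigned 8-bit elements and widen them in the same step."""
    return [mem[addr + i] for i in range(n)]  # Python ints stand in for wide lanes


def store_pack_u8(mem: bytearray, addr: int, lanes: list) -> None:
    """Narrow wide lanes back to 8 bits while storing (saturating, by assumption)."""
    for i, x in enumerate(lanes):
        mem[addr + i] = max(0, min(255, x))


# Two rows of four pixels each, stored back to back.
pixels = bytearray([200, 100, 10, 0, 100, 100, 250, 0])
a = load_unpack_u8(pixels, 0, 4)
b = load_unpack_u8(pixels, 4, 4)
sums = [x + y for x, y in zip(a, b)]   # wide lanes hold 300 and 260 without overflow
store_pack_u8(pixels, 0, sums)         # packed back to 8 bits on the way out
```

Because widening and narrowing happen inside the memory operations, the arithmetic in between needs no separate pack or unpack instructions, which is the overhead the slide wants to avoid.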

Micro-architectural considerations
- Simple/complex GPP mixture
- Scalable cache coherence
- Support for (existing) sequential, single-threaded applications
  - Thread-level speculation
  - Kilo-instruction processors

I/O and Communication Subsystem
- Overheads of system call, context switch, interrupt, network protocol no longer justified
- With fewer threads than processing cores, there is no reason for switching execution context
- OS must not run on the same processor as user applications
  - requires extra-low communication latency

Interconnection Network
- LANs/SANs are so fast that switching and routing have to be provided in hardware
  - but reliability and congestion control, left to end-nodes, still need to be addressed
- Power considerations also
- Applies to multi-chip interconnection networks, but NoCs have to solve similar problems in a much more constrained environment

TRANSACTIONAL MEMORY
- The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining correct code fragments)
- Transactional memory is a concurrency control mechanism for controlling access to shared memory
- A transaction is a piece of code that executes a series of reads and writes to shared memory, which logically occur at a single instant in time and are typically implemented in a lock-free way
- Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log; the log is validated in the commit stage
- Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming
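The optimistic scheme described in the transactional memory slide (log every read and write, validate at commit) can be sketched in a few lines of Python. This is a toy software model, not the design of any particular system: the names `TMemory`, `Transaction` and `atomically` are invented for the example, versions are kept per address, and the commit step is serialized with an ordinary lock for simplicity, whereas the slide notes that real implementations are typically lock-free.

```python
import threading


class TMemory:
    """Shared memory in which every address carries a version counter."""

    def __init__(self):
        self._data = {}                       # addr -> (value, version)
        self._commit_lock = threading.Lock()  # simplification: serialized commit

    def begin(self) -> "Transaction":
        return Transaction(self)


class Transaction:
    def __init__(self, mem: TMemory):
        self.mem = mem
        self.read_log = {}    # addr -> version observed on first read
        self.write_log = {}   # addr -> value to publish at commit

    def read(self, addr):
        if addr in self.write_log:            # read our own pending write
            return self.write_log[addr]
        value, version = self.mem._data.get(addr, (0, 0))
        self.read_log.setdefault(addr, version)
        return value

    def write(self, addr, value) -> None:
        self.write_log[addr] = value          # buffered, invisible to others

    def commit(self) -> bool:
        with self.mem._commit_lock:
            # Validate: everything we read must still be at the version we saw.
            for addr, seen in self.read_log.items():
                if self.mem._data.get(addr, (0, 0))[1] != seen:
                    return False              # conflict: abort, publish nothing
            for addr, value in self.write_log.items():
                _, version = self.mem._data.get(addr, (0, 0))
                self.mem._data[addr] = (value, version + 1)
            return True


def atomically(mem: TMemory, body):
    """Run body(tx) repeatedly until its transaction commits."""
    while True:
        tx = mem.begin()
        result = body(tx)
        if tx.commit():
            return result
```

A conflicting interleaving then resolves as follows: if two transactions both read `x` and the second commits a new value first, the first one fails validation, and `atomically` simply re-executes it against the updated memory.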

Reflection
PROBLEM: THINKING IN PARALLEL IS HARD!
Perhaps: THINKING is hard! (YALE PATT - Sep. 2007)