This Unit: Multithreading (MT)
CIS 501 Computer Architecture, Unit 10: Hardware Multithreading

[Figure: system stack: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]

- Why multithreading (MT)? Utilization vs. performance
- Three implementations:
  - Coarse-grained MT
  - Fine-grained MT
  - Simultaneous MT (SMT)

Slides originally developed by Amir Roth with contributions by Milo Martin at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Readings
- Textbook (MA:FSTCM): Section 8.1
- Paper: Tullsen et al., "Exploiting Choice"

Performance and Utilization
- Even a moderate superscalar (e.g., 4-way) is not fully utilized
  - Average sustained IPC of 1.5-2 means less than 50% utilization
  - Utilization is (actual IPC / peak IPC); a quick calculation follows below
- Why so low? Many dead cycles, due to:
  - Mis-predicted branches
  - Cache misses, especially misses to off-chip memory
  - Data dependences
- Some workloads are worse than others, for example databases: big data, lots of instructions, hard-to-predict branches
- Having resources sit idle is wasteful
- How can we better utilize the core? Multithreading (MT)
  - Improve utilization by multiplexing multiple threads on a single core
  - If one thread cannot fully utilize the core, maybe 2 or 4 can
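To make the utilization numbers concrete, here is a minimal sketch in Python (the numbers are the slide's own 4-way example; the function name is just for illustration):

```python
# Utilization = actual IPC / peak IPC, where peak IPC equals the issue width.
def utilization(actual_ipc: float, issue_width: int) -> float:
    return actual_ipc / issue_width

# The slide's 4-way superscalar sustaining IPC 1.5-2:
for ipc in (1.5, 2.0):
    print(f"IPC {ipc} on a 4-way core: {utilization(ipc, 4):.0%} utilization")
# IPC 1.5 -> 38%, IPC 2.0 -> 50%: under half the issue slots do useful work
```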

Superscalar Under-utilization
[Figure: time evolution of the issue slots of a 4-issue processor; a cache miss leaves every slot empty until the miss returns]

Simple Multithreading
[Figure: the same 4-issue processor, with the dead cycles filled in with instructions from another thread]
- Fill in the dead cycles with instructions from another thread
- Where does it find a thread? Same problem as multi-core
  - Same shared-memory abstraction

Latency vs. Throughput
- MT trades (single-thread) latency for throughput
  - Sharing the processor degrades the latency of individual threads
  + But it improves the aggregate latency of both threads
  + And it improves utilization
- Example (worked through in the sketch after these slides)
  - Thread A: individual latency = 10s, latency with thread B = 15s
  - Thread B: individual latency = 20s, latency with thread A = 25s
  - Sequential latency (first A then B, or vice versa): 30s
  - Parallel latency (A and B simultaneously): 25s
  - MT slows each thread by 5s, but improves total latency by 5s
- Different workloads have different parallelism
  - SpecFP has lots of ILP (can use an 8-wide machine)
  - Server workloads have TLP (can use multiple threads)

MT Implementations: Similarities
- How do multiple threads share a single processor?
  - Different sharing mechanisms for different kinds of structures, depending on what kind of state a structure stores
- No state: ALUs
  - Dynamically shared
- Persistent hard state (aka "context"): PC, registers
  - Replicated
- Persistent soft state: caches, branch predictor
  - Dynamically shared (like on a multi-programmed uni-processor)
  - TLBs need thread IDs; caches/bpred tables don't
  - Exception: ordered soft state (BHR, RAS) is replicated
- Transient state: pipeline latches, ROB, instruction queue
  - Partitioned somehow
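A minimal Python sketch of the latency/throughput example above (all numbers come straight from the slide; it assumes co-scheduled threads finish when the slower one does):

```python
# Latency vs. throughput trade-off, in seconds, per the slide's example.
a_alone, b_alone = 10, 20            # each thread running by itself
a_shared, b_shared = 15, 25          # each thread when sharing the core

sequential = a_alone + b_alone       # run A to completion, then B: 30s
parallel = max(a_shared, b_shared)   # both run together; done at 25s

print(f"per-thread slowdown: A +{a_shared - a_alone}s, B +{b_shared - b_alone}s")
print(f"total latency: {sequential}s sequential vs. {parallel}s multithreaded")
```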

MT Implementations: Differences
- Key question: the thread scheduling policy
  - When to switch from one thread to another?
- Related question: pipeline partitioning
  - How exactly do threads share the pipeline itself?
- The choice depends on
  - What kind of latencies (specifically, what lengths) you want to tolerate
  - How much single-thread performance you are willing to sacrifice
- Three designs
  - Coarse-grain multithreading (CGMT)
  - Fine-grain multithreading (FGMT)
  - Simultaneous multithreading (SMT)

The Standard Multithreading Picture
[Figure: time evolution of issue slots, color = thread, comparing superscalar, CGMT, FGMT, and SMT]

Coarse-Grain Multithreading (CGMT)
+ Sacrifices very little single-thread performance (of one thread)
- Tolerates only long latencies (e.g., L2 misses)
- "Priority" thread scheduling policy (a sketch follows below)
  - Designate a preferred thread (e.g., thread A)
  - Switch to thread B on a thread A cache miss; switch back to A when A's cache miss returns
  - Can also give the threads equal priority
- Pipeline partitioning: none; flush the pipe on a thread switch
  - Can't tolerate short latencies (less than 2x the pipeline depth)
  - Best with short in-order pipelines
- Example: IBM Northstar/Pulsar
[Figure: CGMT pipeline that switches threads on an "L2 miss?" signal]
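Here is a minimal sketch of the CGMT priority policy in Python (the function, thread labels, and the miss-pending map are illustrative, not from the slides; a real design would also charge the pipeline-flush penalty on every switch):

```python
# CGMT "priority" scheduling: always run the preferred thread unless it is
# stalled on an L2 miss; switching threads implies flushing the pipeline.
def cgmt_next_thread(current, preferred, other, miss_pending):
    """Return (thread to run next cycle, whether the switch forces a flush)."""
    target = other if miss_pending[preferred] else preferred
    return target, target != current

# Example: preferred thread A takes an L2 miss, so we switch to B and flush.
print(cgmt_next_thread("A", "A", "B", {"A": True, "B": False}))  # ('B', True)
```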

(Extremely) Fine-Grain Multithreading (FGMT)
- Key idea: have so many threads that the core never stalls on cache misses
  - If the miss latency is N cycles, have N threads!
- Thread scheduling policy: switch threads every cycle (round-robin)
  - Needs a lot of threads, and many threads means many register files
- Sacrifices significant single-thread performance
  + Tolerates latencies (e.g., cache misses)
  + No bypassing needed!
- Pipeline partitioning: dynamic, no flushing
- Extreme examples: Denelcor HEP, Tera MTA
  - So many threads (100+) that they didn't even need caches
  - Failed commercially, but good for certain niche apps

(Adaptive) Fine-Grain Multithreading
- Instead of switching every cycle, just pick from the ready threads
- Thread scheduling policy: each cycle, select one ready thread to execute (see the sketch after these slides)
  + Fewer threads needed
  + Tolerates latencies (e.g., cache misses, branch mispredictions)
  + Sacrifices little single-thread performance
- Requires a modified pipeline
  - Must handle multiple threads in flight
- Pipeline partitioning: dynamic, no flushing
[Figure: FGMT pipeline with multiple threads in the pipeline at once, and (many) more threads]

Vertical and Horizontal Under-utilization
- FGMT and CGMT reduce vertical under-utilization
  - The loss of all slots in an issue cycle
- They do not help with horizontal under-utilization
  - The loss of some slots in an issue cycle (in a superscalar processor)
[Figure: issue-slot diagrams comparing CGMT, FGMT, and SMT]
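A minimal sketch of adaptive FGMT thread selection in Python (illustrative names; round-robin order that skips stalled threads is one reasonable reading of "pick from the ready threads"):

```python
# Each cycle, rotate round-robin over the threads, skipping any that are
# stalled (e.g., waiting on a cache miss or a branch misprediction).
def fgmt_select(num_threads, last, is_ready):
    """Return the next ready thread id after `last`, or None if all stalled."""
    for step in range(1, num_threads + 1):
        tid = (last + step) % num_threads
        if is_ready(tid):
            return tid
    return None

# Example: thread 0 just ran and thread 1 is stalled, so thread 2 goes next.
ready = {0: True, 1: False, 2: True}
print(fgmt_select(3, last=0, is_ready=lambda t: ready[t]))  # -> 2
```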

Simultaneous Multithreading (SMT)
- What can issue insns from multiple threads in one cycle?
  - The same thing that issues insns from multiple parts of the same program: out-of-order execution
- Simultaneous multithreading (SMT): OOO + FGMT
  - Aka "hyper-threading"
  - Observation: once insns are renamed to physical registers, the scheduler doesn't need to know which thread they came from
- Some examples
  - IBM Power5: 4-way issue, 2 threads
  - Intel Pentium4: 3-way issue, 2 threads
  - Intel Core i7: 4-way issue, 2 threads
  - Alpha 21464: 8-way issue, 4 threads (canceled)
  - Notice a pattern? #threads (T) * 2 = issue width (N)
[Figure: SMT pipeline: replicate the map table, share a (larger) physical register file]

SMT Resource Partitioning
- Physical registers and insn buffer entries are shared at fine grain
  - They are physically unordered, so fine-grain sharing is possible
- How are ordered structures (ROB/LSQ) shared?
  - Fine-grain sharing would entangle commit (and squash)
  - Allowing threads to commit independently is important
  - Thus, they are typically replicated or statically partitioned
  - The ROB is cheap (just a FIFO, so it can be big); the LSQ is more of an issue

Static & Dynamic Resource Partitioning
- Static partitioning: T equal-sized contiguous partitions
  ± No starvation, but sub-optimal utilization (fragmentation)
- Dynamic partitioning: more than T partitions, assigned on a need basis
  ± Better utilization, but possible starvation
  - ICOUNT: a fetch policy that prefers the thread with the fewest in-flight insns (sketched below)
- Couple both with larger ROBs/LSQs
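A minimal sketch of the ICOUNT fetch heuristic in Python (ICOUNT itself comes from the Tullsen et al. paper in the readings; the data model here is illustrative):

```python
# ICOUNT: each cycle, give fetch priority to the thread with the fewest
# instructions in flight, so no single thread clogs the shared queues.
def icount_pick(in_flight):
    """in_flight maps thread id -> count of in-flight insns; return the winner."""
    return min(in_flight, key=in_flight.get)

# Example: thread 1 has fewer insns in flight, so it fetches this cycle.
print(icount_pick({0: 12, 1: 5}))  # -> 1
```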

Multithreading Issues
- Shared soft state (caches, branch predictors, TLBs, etc.)
  - Key example: cache interference
  - A general concern for all multithreading variants: can the working sets of multiple threads fit in the caches?
  - Shared-memory threads help here (versus multiple programs)
    + Same insns, so they share the instruction cache
    + Shared data, so they share the data cache
  - To keep miss rates low, SMT might need more associative and larger caches (which is OK)
  - Out-of-order execution tolerates L1 misses
- Large physical register file (and map table); a sizing calculation follows after these slides
  - physical registers = (#threads * #arch-regs) + #in-flight insns
  - map table entries = #threads * #arch-regs

Notes About Sharing Soft State
- Caches can be shared naturally
  - Physically-tagged: address translation distinguishes the threads
  - But TLBs need explicit thread IDs to be shared; virtually-tagged entries of different threads are indistinguishable
  - Thread IDs are only a few bits: enough to identify the on-chip contexts
- Thread IDs make sense in the BTB (branch target buffer)
  - BTB entries are already large; a few extra bits per entry won't matter
  - A different thread's target prediction means an automatic mis-prediction
- But not in the BHT (branch history table)
  - BHT entries are small; a few extra bits per entry is huge overhead
  - A different thread's direction prediction does not automatically mis-predict
- Ordered soft state should be replicated
  - Examples: Branch History Register (BHR), Return Address Stack (RAS)
  - Otherwise it becomes meaningless; fortunately, it is typically small

Multithreading vs. Multicore
- If you wanted to run multiple threads, would you build
  - A multicore: multiple separate pipelines?
  - A multithreaded processor: a single larger pipeline?
- Both will get you throughput on multiple threads
  - Multicore cores will be simpler, with a possibly faster clock
  - SMT will get you better performance (IPC) on a single thread
  - SMT is basically an ILP engine that converts TLP into ILP; multicore is mainly a TLP (thread-level parallelism) engine
- Do both: Sun's Niagara (UltraSPARC T1)
  - 8 processors, each with 4 threads (non-SMT threading)
  - 1 GHz clock, in-order, short pipeline (6 stages or so)
  - Designed for power-efficient "throughput computing"

Research: Speculative Multithreading
- Use multiple threads/processors for single-thread performance
  - Speculatively parallelize sequential loops that might not be parallel
  - Processing elements (PEs) are arranged in a logical ring
  - The compiler or hardware assigns iterations to consecutive PEs
  - Hardware tracks the logical order to detect mis-parallelization
- There are techniques for doing this on non-loop code too
  - Detect reconvergence points (function calls, conditional code)
- Effectively chains the ROBs of different processors into one big ROB
  - The global commit head travels from one PE to the next
  - Mis-parallelization flushes one PE, but not all PEs
- Also known as "split-window" or Multiscalar
  - Not commercially available yet, but it is the biggest idea from academia not yet adopted
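A quick check of the slide's register-file sizing formulas in Python (the parameter values are illustrative assumptions: 2 threads, 32 architectural registers, 128 in-flight insns):

```python
# SMT sizing per the slide's formulas:
#   physical registers = (#threads * #arch-regs) + #in-flight insns
#   map table entries  = #threads * #arch-regs
threads, arch_regs, in_flight = 2, 32, 128      # assumed, illustrative values

physical_regs = threads * arch_regs + in_flight  # 192
map_table_entries = threads * arch_regs          # 64 (one map table per thread)

print(f"{physical_regs} physical registers, {map_table_entries} map-table entries")
```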

Multithreading Summary
- Latency vs. throughput
- Partitioning different processor resources
- Three multithreading variants
  - Coarse-grain: no single-thread degradation, but tolerates only long latencies
  - Fine-grain: the other end of the trade-off
  - Simultaneous: fine-grain combined with out-of-order execution
- Multithreading vs. chip multiprocessing