CIS 501: Computer Architecture
Unit 10: Hardware Multithreading

This Unit: Multithreading (MT)
- Why multithreading (MT)? Utilization vs. performance
- Three implementations
  - Coarse-grained MT
  - Fine-grained MT
  - Simultaneous MT (SMT)

Slides originally developed by Amir Roth with contributions by Milo Martin at University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

CIS 501 (Martin): Multithreading 1

Readings
- Textbook (MA:FSPTCM): Section 8.1
- Paper: Tullsen et al., "Exploiting Choice"

Performance And Utilization
- Even a moderate superscalar (e.g., 4-way) is not fully utilized
  - Average sustained IPC: 1.5-2 → less than 50% utilization
  - Utilization is (actual IPC / peak IPC)
- Why so low? Many dead cycles, due to:
  - Mis-predicted branches
  - Cache misses, especially misses to off-chip memory
  - Data dependences
- Some workloads are worse than others, for example, databases
  - Big data, lots of instructions, hard-to-predict branches
- Having resources sit idle is wasteful
- How can we better utilize the core?
  - Multi-threading (MT): improve utilization by multiplexing multiple threads on a single core
  - If one thread cannot fully utilize the core, maybe 2 or 4 can
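The utilization figure above is just the ratio of sustained to peak issue rate. A minimal sketch, using the slide's own numbers (a 4-issue core sustaining 1.5-2 IPC):

```python
# Utilization = actual IPC / peak IPC, as defined on the slide.
def utilization(actual_ipc: float, peak_ipc: float) -> float:
    """Fraction of issue slots doing useful work."""
    return actual_ipc / peak_ipc

# A 4-way superscalar sustaining 1.5-2 IPC:
for ipc in (1.5, 2.0):
    print(f"sustained IPC {ipc}: {utilization(ipc, 4):.0%} utilization")
```

Both cases land below 50%, which is the motivation for multithreading.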
Superscalar Under-utilization
- Time evolution of the issue slots on a 4-issue processor
  - A cache miss leaves all slots empty for many cycles
(Figure: issue slots over time on a superscalar, with long gaps at cache misses)

Simple Multithreading
- Fill in the dead cycles with instructions from another thread
- Where does it find a thread?
  - Same problem as multi-core
  - Same shared-memory abstraction

Latency vs. Throughput
- MT trades (single-thread) latency for throughput
  - Sharing the processor degrades the latency of individual threads
  + But improves the aggregate latency of both threads
  + Improves utilization
- Example
  - Thread A: individual latency = 10s, latency with thread B = 15s
  - Thread B: individual latency = 20s, latency with thread A = 25s
  - Sequential latency (first A then B, or vice versa): 30s
  - Parallel latency (A and B simultaneously): 25s
  - MT slows each thread by 5s
  + But improves total latency by 5s
- Different workloads have different parallelism
  - SpecFP has lots of ILP (can use an 8-wide machine)
  - Server workloads have TLP (can use multiple threads)

MT Implementations: Similarities
- How do multiple threads share a single processor?
  - Different sharing mechanisms for different kinds of structures
  - Depends on what kind of state the structure stores
- No state: ALUs
  - Dynamically shared
- Persistent hard state (aka "context"): PC, registers
  - Replicated
- Persistent soft state: caches, branch predictor
  - Dynamically shared (like on a multi-programmed uni-processor)
  - TLBs need thread IDs; caches/bpred tables don't
  - Exception: ordered soft state (BHR, RAS) is replicated
- Transient state: pipeline latches, ROB, instruction queue
  - Partitioned somehow
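The latency-vs-throughput example above is worth working through explicitly. A small sketch with the slide's numbers:

```python
# The slide's latency-vs-throughput example: each thread's latency
# running alone vs. sharing a multithreaded core.
a_alone, a_shared = 10, 15   # thread A, seconds
b_alone, b_shared = 20, 25   # thread B, seconds

sequential = a_alone + b_alone        # run A then B on a dedicated core
parallel = max(a_shared, b_shared)    # run A and B together on an MT core

print(f"sequential: {sequential}s, parallel: {parallel}s")
print(f"per-thread slowdown: A +{a_shared - a_alone}s, B +{b_shared - b_alone}s")
print(f"aggregate improvement: {sequential - parallel}s")
```

Each thread individually gets slower by 5s, yet the total time to finish both drops from 30s to 25s: that is exactly the latency-for-throughput trade.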
MT Implementations: Differences
- Key question: thread scheduling policy
  - When to switch from one thread to another?
- Related question: pipeline partitioning
  - How exactly do threads share the pipeline itself?
- The choice depends on
  - What kind of latencies (specifically, what length) you want to tolerate
  - How much single-thread performance you are willing to sacrifice
- Three designs
  - Coarse-grain multithreading (CGMT)
  - Fine-grain multithreading (FGMT)
  - Simultaneous multithreading (SMT)

The Standard Multithreading Picture
- Time evolution of issue slots; color = thread
(Figure: issue slots over time for Superscalar, CGMT, FGMT, and SMT)

Coarse-Grain Multithreading (CGMT)
+ Sacrifices very little single-thread performance (of one thread)
- Tolerates only long latencies (e.g., L2 misses)
- Priority thread scheduling policy
  - Designate a preferred thread (e.g., thread A)
  - Switch to thread B on a thread A cache miss
  - Switch back to A when A's cache miss returns
  - Can also give threads equal priority
- Pipeline partitioning
  - None; flush the pipe on a thread switch
  - Can't tolerate short latencies (< 2x pipeline depth)
  - Best with short in-order pipelines
- Example: IBM Northstar/Pulsar
(Figure: CGMT pipeline, with thread switch triggered by an L2 miss)
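The CGMT priority policy above can be sketched as a toy event-driven model. This is a hypothetical simplification (no real pipeline state, and the flush cost is ignored), just to make the switch/switch-back behavior concrete:

```python
# Toy sketch of the CGMT "priority thread" policy: run the preferred
# thread until it takes a long-latency miss, switch to the other
# thread, and switch back when the miss returns. (Hypothetical model,
# not any real machine; flush cost is not modeled.)
def cgmt_schedule(events, preferred="A", other="B"):
    """events: per-cycle events seen by the preferred thread, each
    ("cycle",) or ("miss", return_cycle). Returns which thread holds
    the pipeline at each cycle."""
    running, miss_returns = preferred, None
    trace = []
    for cycle, ev in enumerate(events):
        if miss_returns is not None and cycle >= miss_returns:
            running, miss_returns = preferred, None   # A's miss came back
        if ev[0] == "miss" and running == preferred:
            running, miss_returns = other, ev[1]      # flush, switch to B
        trace.append(running)
    return trace

# A misses at cycle 1; the miss data returns at cycle 4.
trace = cgmt_schedule([("cycle",), ("miss", 4), ("cycle",), ("cycle",), ("cycle",)])
print(trace)  # ['A', 'B', 'B', 'B', 'A']
```

Note how B only gets the pipeline while A's miss is outstanding, which is why CGMT barely hurts the preferred thread's performance.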
(Extremely) Fine-Grain Multithreading (FGMT)
- Key idea: have so many threads that you never stall on cache misses
  - If miss latency is N cycles, have N threads!
- Thread scheduling policy
  - Switch threads every cycle (round-robin)
  - Needs a lot of threads
  - Many threads → many register files
- Sacrifices significant single-thread performance
  + Tolerates latencies (e.g., cache misses)
  + No bypassing needed!
- Pipeline partitioning
  - Dynamic, no flushing
- Extreme examples: Denelcor HEP, Tera MTA
  - So many threads (100+) that they didn't even need caches
  - Failed commercially, but good for certain niche apps

(Adaptive) Fine-Grain Multithreading (FGMT)
- Instead of switching each cycle, just pick from the ready threads
- Thread scheduling policy
  - Each cycle, select one ready thread to execute from
  + Fewer threads needed
  + Tolerates latencies (e.g., cache misses, branch mispredictions)
  + Sacrifices little single-thread performance
- Requires a modified pipeline
  - Must handle multiple threads in flight at once
- Pipeline partitioning
  - Dynamic, no flushing
(Figure: FGMT pipeline with multiple threads in the pipeline at once; many more register files)

Vertical and Horizontal Under-Utilization
- FGMT and CGMT reduce vertical under-utilization
  - Loss of all slots in an issue cycle
- They do not help with horizontal under-utilization
  - Loss of some slots in an issue cycle (in a superscalar processor)
(Figure: issue slots over time for CGMT, FGMT, and SMT)
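The adaptive FGMT selection above amounts to round-robin over the ready threads only. A minimal sketch (hypothetical model; real hardware makes this choice per fetch/issue cycle in parallel logic):

```python
# Adaptive FGMT thread selection: rotate round-robin over thread IDs,
# skipping any thread that is stalled (e.g., waiting on a cache miss).
def pick_thread(last, ready):
    """ready: one boolean per thread. Returns the next ready thread
    ID after `last` in round-robin order, or None if all stalled."""
    n = len(ready)
    for step in range(1, n + 1):
        tid = (last + step) % n
        if ready[tid]:
            return tid
    return None  # every thread is stalled this cycle

# Thread 1 is stalled on a miss; selection alternates between 0 and 2.
last, picks = 0, []
for _ in range(4):
    last = pick_thread(last, [True, False, True])
    picks.append(last)
print(picks)  # [2, 0, 2, 0]
```

The stalled thread simply drops out of the rotation, which is how this policy tolerates miss latency without needing N threads for an N-cycle miss.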
Simultaneous Multithreading (SMT)
- What can issue insns from multiple threads in one cycle?
  - The same thing that issues insns from multiple parts of the same program: out-of-order execution
- Simultaneous multithreading (SMT): OOO + FGMT
  - Aka "hyper-threading"
  - Observation: once insns are renamed to physical registers, the scheduler doesn't need to know which thread they came from
- Some examples
  - IBM Power5: 4-way issue, 2 threads
  - Intel Pentium4: 3-way issue, 2 threads
  - Intel Core i7: 4-way issue, 2 threads
  - Alpha 21464: 8-way issue, 4 threads (canceled)
  - Notice a pattern? #threads (T) * 2 = issue width (N)
- Implementation: replicate the map table, share a (larger) physical register file
(Figure: SMT pipeline with replicated map tables and a shared physical register file)

SMT Resource Partitioning
- Physical registers and insn buffer entries are shared at fine grain
  - Physically unordered, so fine-grain sharing is possible
- How are ordered structures (ROB/LSQ) shared?
  - Fine-grain sharing would entangle commit (and squash)
  - Allowing threads to commit independently is important
  - Thus, typically replicated or statically partitioned
  - The ROB is cheap (just a FIFO, so it can be big); the LSQ is more of an issue

Static & Dynamic Resource Partitioning
- Static partitioning
  - T equal-sized contiguous partitions
  ± No starvation, but sub-optimal utilization (fragmentation)
- Dynamic partitioning
  - More than T partitions; available partitions assigned on a need basis
  ± Better utilization, but possible starvation
  - ICOUNT: fetch policy prefers the thread with the fewest in-flight insns
- Couple both with larger ROBs/LSQs
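The ICOUNT heuristic above is a one-line selection rule. A minimal sketch (the in-flight counts are made up for illustration):

```python
# ICOUNT fetch policy (Tullsen et al.): each cycle, fetch from the
# thread with the fewest instructions currently in flight, so no one
# thread clogs the shared issue queue.
def icount_pick(in_flight):
    """in_flight: dict mapping thread ID -> # of in-flight insns."""
    return min(in_flight, key=in_flight.get)

in_flight = {"T0": 24, "T1": 9, "T2": 17}
print(icount_pick(in_flight))  # T1: fewest insns in flight
```

The intuition: a thread with few in-flight insns is making fast progress through the machine, so feeding it more insns uses the shared resources efficiently, while also preventing a stalled thread from hoarding queue entries.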
Multithreading Issues
- Shared soft state (caches, branch predictors, TLBs, etc.)
  - Key example: cache interference
    - A general concern for all multithreading variants
    - Can the working sets of multiple threads fit in the caches?
    - Shared-memory threads help here (versus multiple programs)
      + Same insns → shared instruction cache
      + Shared data → shared data cache
    - To keep miss rates low, SMT might need more-associative and larger caches (which is OK)
      - Out-of-order execution tolerates L1 misses
- Large physical register file (and map table)
  - physical registers = (#threads * #arch-regs) + #in-flight insns
  - map table entries = (#threads * #arch-regs)

Notes About Sharing Soft State
- Caches can be shared naturally
  - Physically-tagged: address translation distinguishes different threads
    - but TLBs need explicit thread IDs to be shared
  - Virtually-tagged: entries of different threads are indistinguishable
  - Thread IDs are only a few bits: enough to identify the on-chip contexts
- Thread IDs make sense in the BTB (branch target buffer)
  - BTB entries are already large; a few extra bits per entry won't matter
  - Another thread's target prediction → automatic mis-prediction
- but not in the BHT (branch history table)
  - BHT entries are small; a few extra bits per entry is huge overhead
  - Another thread's direction prediction → mis-prediction not automatic
- Ordered soft state should be replicated
  - Examples: Branch History Register (BHR), Return Address Stack (RAS)
  - Otherwise it becomes meaningless
  - Fortunately, it is typically small

Multithreading vs. Multicore
- If you wanted to run multiple threads, would you build:
  - A multicore: multiple separate pipelines?
  - A multithreaded processor: a single larger pipeline?
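The register-file sizing formulas above are simple enough to evaluate directly. The parameter values below are illustrative (2 threads, 32 architectural registers, 128 in-flight insns), not any specific machine:

```python
# The slide's sizing formulas for SMT rename state.
def phys_regs(threads, arch_regs, in_flight):
    # One physical register per committed arch reg per thread,
    # plus one per in-flight (renamed but uncommitted) insn.
    return threads * arch_regs + in_flight

def map_table_entries(threads, arch_regs):
    # Each thread needs its own full architectural-to-physical map.
    return threads * arch_regs

print(phys_regs(2, 32, 128))      # 192 physical registers
print(map_table_entries(2, 32))   # 64 map-table entries
```

Doubling the thread count adds #arch-regs physical registers per extra thread but does not change the in-flight term, so the register file grows more slowly than a naive "double everything" estimate.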
- Both will get you throughput on multiple threads
  - A multicore's cores will be simpler, with a possibly faster clock
  - SMT will get you better performance (IPC) on a single thread
    - SMT is basically an ILP engine that converts TLP to ILP
    - Multicore is mainly a TLP (thread-level parallelism) engine
- Do both
  - Sun's Niagara (UltraSPARC T1)
  - 8 processors, each with 4 threads (non-SMT threading)
  - 1 GHz clock, in-order, short pipeline (6 stages or so)
  - Designed for power-efficient "throughput computing"

Research: Speculative Multithreading
- Speculative multithreading
  - Use multiple threads/processors for single-thread performance
  - Speculatively parallelize sequential loops that might not be parallel
  - Processing elements (called PEs) arranged in a logical ring
  - Compiler or hardware assigns iterations to consecutive PEs
  - Hardware tracks logical order to detect mis-parallelization
- Techniques exist for doing this on non-loop code too
  - Detect re-convergence points (function calls, conditional code)
- Effectively chains the ROBs of different processors into one big ROB
  - The global commit head travels from one PE to the next
  - Mis-parallelization flushes one PE, but not all PEs
- Also known as "split-window" or "Multiscalar"
- Not commercially available yet
  - But it is the biggest idea from academia not yet adopted
Multithreading Summary
- Latency vs. throughput
- Partitioning different processor resources
- Three multithreading variants
  - Coarse-grain: no single-thread degradation, but tolerates long latencies only
  - Fine-grain: the other end of the trade-off
  - Simultaneous: fine-grain with out-of-order
- Multithreading vs. chip multiprocessing