Thread Level Parallelism II: Multithreading. Readings: H&P Chapter 3.5; paper: "Niagara: A 32-Way Multithreaded SPARC Processor". Thread Level Parallelism II: Multithreading 1
This Unit: Multithreading (MT). [Sidebar figure: the computing stack: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors.] Why multithreading (MT)? Utilization vs. performance. Three implementations: coarse-grained MT, fine-grained MT, simultaneous MT (SMT). Example: the Sun UltraSparc T1. Research topics. Thread Level Parallelism II: Multithreading 2
Performance And Utilization. Performance (IPC) is important; utilization (actual IPC / peak IPC) is important too. Even moderate superscalars (e.g., 4-way) are not fully utilized: an average sustained IPC of 1.5-2 means < 50% utilization. Causes: mis-predicted branches; cache misses, especially L2; data dependences. Multithreading (MT): improve utilization by multiplexing multiple threads on a single CPU. One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can. It is a generalization of a multi-programmed OS: the OS context-switches on I/O using a stack of software; MT thread-switches on CPU stalls with zero software. Thread Level Parallelism II: Multithreading 3
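To make the utilization arithmetic concrete, here is a minimal sketch (the 4-way issue width and 1.5 sustained IPC are the figures quoted above; the program itself is purely illustrative):

#include <stdio.h>

int main(void) {
    double peak_ipc      = 4.0;  /* 4-way superscalar issue width */
    double sustained_ipc = 1.5;  /* low end of the sustained IPC above */
    /* utilization = actual IPC / peak IPC */
    printf("utilization = %.1f%%\n", 100.0 * sustained_ipc / peak_ipc);
    return 0;  /* prints 37.5%, i.e., well under 50% utilization */
}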
Superscalar Under-utilization. [Figure: time evolution of the issue slots of a 4-issue superscalar processor; a cache miss leaves whole issue cycles empty.] Thread Level Parallelism II: Multithreading 4
Simple Multithreading. [Figure: time evolution of the issue slots of a 4-issue processor, superscalar vs. multithreading; slots emptied by a cache miss are filled in with instructions from another thread.] Where does it find another thread? Same problem as multi-core, same shared-memory abstraction. Thread Level Parallelism II: Multithreading 5
Latency vs Throughput. MT trades (single-thread) latency for throughput: – sharing the processor degrades the latency of individual threads, + but improves the aggregate latency of both threads, + and improves utilization. Example: thread A: individual latency = 10s, latency with thread B = 15s; thread B: individual latency = 20s, latency with thread A = 25s. Sequential latency (first A then B, or vice versa): 30s; parallel latency (A and B simultaneously): 25s. – MT slows each thread by 5s, + but improves total latency by 5s. Different workloads have different parallelism: SPECfp has lots of ILP (can use an 8-wide machine); server workloads have TLP (can use multiple threads). Thread Level Parallelism II: Multithreading 6
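The example arithmetic, written out as a check (the numbers are exactly the ones in the slide):

#include <stdio.h>

int main(void) {
    /* individual latencies and latencies when co-scheduled (seconds) */
    int a_alone = 10, b_alone = 20;
    int a_shared = 15, b_shared = 25;
    printf("sequential: %d s\n", a_alone + b_alone);    /* 30 s */
    printf("parallel:   %d s\n",                        /* 25 s */
           a_shared > b_shared ? a_shared : b_shared);
    return 0;  /* each thread is 5 s slower, total is 5 s faster */
}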
MT Implementations: Similarities. How do multiple threads share a single processor? Different sharing mechanisms for different kinds of structures, depending on what kind of state a structure stores. No state (ALUs): dynamically shared. Persistent hard state, aka context (PC, registers): replicated. Persistent soft state (caches, bpred): dynamically partitioned (as on a multi-programmed uni-processor); TLBs need thread IDs, caches/bpred tables don't (assuming physical tags; virtually-tagged structures would, see the later slide on sharing soft state). Exception: ordered soft state (BHR, RAS) is replicated. Transient state (pipeline latches, ROB, RS): partitioned somehow. Thread Level Parallelism II: Multithreading 7
MT Implementations: Differences. Main question: thread scheduling policy. When to switch from one thread to another? Related question: pipeline partitioning. How exactly do threads share the pipeline itself? The choice depends on what kind of latencies (specifically, what lengths) you want to tolerate, and how much single-thread performance you are willing to sacrifice. Three designs: coarse-grain multithreading (CGMT), fine-grain multithreading (FGMT), simultaneous multithreading (SMT). Thread Level Parallelism II: Multithreading 8
The Standard Multithreading Picture. [Figure: time evolution of issue slots, color = thread, for superscalar, CGMT, FGMT, and SMT.] Thread Level Parallelism II: Multithreading 9
Coarse-Grain Multithreading (CGMT). + Sacrifices very little single-thread performance. – Tolerates only long latencies (e.g., L2 misses). Thread scheduling policy: designate a preferred thread (e.g., thread A); switch to thread B on a thread A L2 miss; switch back to A when A's L2 miss returns. Pipeline partitioning: none, flush on switch. – Can't tolerate latencies shorter than twice the pipeline depth; needs a short in-order pipeline for good performance. Example: IBM Northstar/Pulsar (RS/6000). Thread Level Parallelism II: Multithreading 10
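A minimal sketch of the CGMT scheduling policy described above; the names (thread_t, cgmt_pick, the stalled_on_l2 flag) are hypothetical, not from any real design, and the caller is assumed to flush the pipeline whenever the returned thread differs from the running one:

typedef struct {
    int id;             /* hardware thread context id */
    int stalled_on_l2;  /* set while this thread waits on an L2 miss */
} thread_t;

/* Return the id of the thread that should run next cycle. */
int cgmt_pick(const thread_t *preferred, const thread_t *other, int running) {
    if (running == preferred->id && preferred->stalled_on_l2)
        return other->id;      /* A misses in L2: flush, run B */
    if (running == other->id && !preferred->stalled_on_l2)
        return preferred->id;  /* A's miss returned: switch back to A */
    return running;            /* otherwise keep the current thread */
}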
CGMT. [Figure: baseline pipeline (I$, branch predictor, regfile, D$) vs. CGMT pipeline, which adds a thread scheduler, replicated regfiles, and an "L2 miss?" signal that triggers the thread switch.] Thread Level Parallelism II: Multithreading 11
Fine-Grain Multithreading (FGMT). – Sacrifices significant single-thread performance. + Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.). Thread scheduling policy: switch threads every cycle (round-robin), L2 miss or no. Pipeline partitioning: dynamic, no flushing; the length of the pipeline doesn't matter so much. – Needs a lot of threads. Extreme example: Denelcor HEP ('81-'85): so many threads (100+) that it didn't even need caches; failed commercially. Not popular today: many threads means many register files. Thread Level Parallelism II: Multithreading 12
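A sketch of the every-cycle round-robin selection, with the small refinement that a stalled thread is skipped; all names are illustrative:

#define NTHREADS 4

/* Pick the next thread in strict round-robin order, skipping
 * threads whose next instruction is not ready to issue. */
int fgmt_pick(int last, const int ready[NTHREADS]) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (ready[t])
            return t;   /* this thread gets the cycle */
    }
    return -1;          /* no thread ready: pipeline bubble */
}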
Fine-Grain Multithreading. [Figure: FGMT pipeline: multiple threads in the pipeline at once; a thread scheduler selects among (many) more replicated regfiles feeding the shared I$, branch predictor, and D$.] Thread Level Parallelism II: Multithreading 13
Vertical and Horizontal Under-Utilization. FGMT and CGMT reduce vertical under-utilization: the loss of all slots in an issue cycle. They do not help with horizontal under-utilization: the loss of some slots in an issue cycle (in a superscalar processor). [Figure: issue-slot diagrams for CGMT, FGMT, and SMT.] Thread Level Parallelism II: Multithreading 14
SMT Thread Level Parallelism II: Multithreading 15
Simultaneous Multithreading (SMT). What can issue insns from multiple threads in one cycle? The same thing that issues insns from multiple parts of the same program: out-of-order execution. How are the different PCs managed? Simultaneous multithreading (SMT) = OOO + FGMT, aka hyper-threading. Observation: once insns are renamed, the scheduler doesn't care which thread they come from (well, for non-loads at least). Some examples: IBM Power5: 4-way issue, 2 threads; Intel Pentium4: 3-way issue, 2 threads; Intel Nehalem: 4-way issue, 2 threads; Alpha 21464: 8-way issue, 4 threads (canceled). Notice a pattern? #threads (T) × 2 = #issue width (N). Thread Level Parallelism II: Multithreading 16
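A sketch of the key SMT observation: per-thread (replicated) map tables feed one shared physical register file, and after renaming, instructions carry only physical tags, so the scheduler no longer cares about thread ids. All names and sizes here are illustrative:

#define NTHREADS   2
#define ARCH_REGS 32
#define PHYS_REGS 128

int map_table[NTHREADS][ARCH_REGS];  /* replicated: one map per thread */
int free_list[PHYS_REGS];            /* shared pool of physical regs   */
int free_top;                        /* assume pre-filled at reset     */

/* Rename one destination register for thread tid. */
int rename_dest(int tid, int arch_reg) {
    int preg = free_list[--free_top];  /* allocate from the shared pool */
    map_table[tid][arch_reg] = preg;   /* update this thread's mapping  */
    return preg;   /* downstream, the scheduler sees only the phys tag */
}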
Simultaneous Multithreading (SMT). Replicate the map table, share a (larger) physical register file. [Figure: baseline OOO pipeline (I$, branch predictor, map table, regfile, D$) vs. SMT pipeline with a thread scheduler, per-thread map tables, and a shared regfile.] Thread Level Parallelism II: Multithreading 17
SMT Resource Partitioning. The physical regfile and insn buffer entries are shared at fine grain: they are physically unordered, so fine-grain sharing is possible. How are physically ordered structures (ROB/LSQ) shared? Fine-grain sharing would entangle retire (and flush), and allowing threads to commit independently is important. [Figure: SMT pipeline with thread scheduler, per-thread map tables, and shared regfile.] Thread Level Parallelism II: Multithreading 18
Static & Dynamic Resource Partitioning. Static partitioning: T equal-sized contiguous partitions. ± No starvation, but sub-optimal utilization (fragmentation). Dynamic partitioning: P > T partitions, with available partitions assigned on a need basis. ± Better utilization, but possible starvation. ICOUNT: a fetch policy that prefers the thread with the fewest in-flight insns. Couple both with larger ROBs/LSQs. Thread Level Parallelism II: Multithreading 19
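A minimal sketch of the ICOUNT heuristic (fetch from the thread with the fewest in-flight instructions); the function and array names are illustrative:

#define NTHREADS 4

/* Each cycle, give the fetch slot to the thread with the fewest
 * instructions currently in flight. */
int icount_pick(const int in_flight[NTHREADS]) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;
    return best;
}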
Multithreading Issues. Shared soft state (caches, branch predictors, TLBs, etc.): key example is cache interference, a general concern for all MT variants. Can the working sets of multiple threads fit in the caches? Shared-memory SPMD threads help here: + the same insns share the I$; + shared data means less D$ contention. MT is good for workloads with shared insns/data. To keep miss rates low, SMT might need a larger L2 (which is OK; out-of-order tolerates L1 misses). Large physical register file (and map table): #physical registers = (#threads × #arch-regs) + #in-flight insns; #map table entries = #threads × #arch-regs. Thread Level Parallelism II: Multithreading 20
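Plugging example numbers into the two sizing formulas above (2 threads, 32 architectural registers, and 128 in-flight instructions are assumed values for illustration, not from the slide):

#include <stdio.h>

int main(void) {
    int threads = 2, arch_regs = 32, in_flight = 128;  /* example values */
    int phys_regs   = threads * arch_regs + in_flight; /* = 192 */
    int map_entries = threads * arch_regs;             /* =  64 */
    printf("physical regs = %d, map table entries = %d\n",
           phys_regs, map_entries);
    return 0;
}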
Notes About Sharing Soft State. Caches are shared naturally. Physically-tagged: address translation distinguishes different threads, but TLBs need explicit thread IDs to be shared. Virtually-tagged: entries of different threads are indistinguishable. Thread IDs are only a few bits: enough to identify all on-chip contexts. Thread IDs make sense in the BTB (branch target buffer): BTB entries are already large, so a few extra bits per entry won't matter, and using a different thread's target prediction is an automatic mis-prediction. But not in a BHT (branch history table): BHT entries are small, so a few extra bits per entry is a huge overhead, and using a different thread's direction prediction is not automatically a mis-prediction. Ordered soft state should be replicated (examples: the branch history register (BHR) and return address stack (RAS)); otherwise it becomes meaningless. Fortunately, it is typically small. Thread Level Parallelism II: Multithreading 21
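A sketch of why a thread id in the tag is cheap for a BTB: the entry already stores a full target address, so a few id bits are marginal. Field names and sizes are illustrative:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* partial PC tag (already a large field)  */
    uint32_t target;  /* predicted branch target (large field)   */
    uint8_t  tid;     /* a few extra bits to separate threads    */
    bool     valid;
} btb_entry_t;

/* Hit only if both the PC tag and the thread id match, so one
 * thread never fetches another thread's predicted target. */
bool btb_hit(const btb_entry_t *e, uint32_t tag, uint8_t tid) {
    return e->valid && e->tag == tag && e->tid == tid;
}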
The real cost of SMT Thread Level Parallelism II: Multithreading 22
Not always a good thing Thread Level Parallelism II: Multithreading 23
Multithreading vs. Multicore. If you wanted to run multiple threads, would you build a multicore (multiple separate pipelines) or a multithreaded processor (a single larger pipeline)? Both will get you throughput on multiple threads. A multicore core will be simpler, with a possibly faster clock. SMT will get you better performance (IPC) on a single thread: SMT is basically an ILP engine that converts TLP to ILP, while multicore is mainly a TLP (thread-level parallelism) engine. Or do both: Sun's Niagara (UltraSPARC T1), a chip multiprocessor: 8 processors, each with 4 threads (non-SMT threading); 1GHz clock, in-order, short pipeline (6 stages or so); designed for power-efficient throughput computing. Thread Level Parallelism II: Multithreading 24
Multithreading Summary Latency vs. throughput Partitioning different processor resources Three multithreading variants Coarse-grain: no single-thread degradation, but long latencies only Fine-grain: other end of the trade-off Simultaneous: fine-grain with out-of-order Multithreading vs. chip multiprocessing Thread Level Parallelism II: Multithreading 25
EXAMPLE: THE SUN ULTRASPARC T1 MULTIPROCESSOR Thread Level Parallelism II: Multithreading 26
The SUN UltraSparc T1 Thread Level Parallelism II: Multithreading 27
Floor plan Thread Level Parallelism II: Multithreading 28
Enclosure (SunFire T2000) Thread Level Parallelism II: Multithreading 29
Power Breakdown Thread Level Parallelism II: Multithreading 30
The SUN UltraSparc T1 Pipeline Thread Level Parallelism II: Multithreading 31
Cost Thread Level Parallelism II: Multithreading 32
Performance Thread Level Parallelism II: Multithreading 33
Performance (Pentium D normalized) Thread Level Parallelism II: Multithreading 34
Sun UltraSparc T2. Second iteration of Niagara: 8 pipelined cores, 8-way SMT each = 64 hardware threads per chip; 4MB L2; 8 fully pipelined FPUs (1 per core); dual 10GbE and PCIe integrated; a security processor per core (DES, 3DES, AES, etc.); 65nm, 1.4GHz, <95W (chip), 60+ GB/s memory BW (4 FB-DIMM controllers). Sun UltraSparc T3 (a.k.a. Niagara Falls): 16 cores, 16-way SMT, 4-chip servers. Thread Level Parallelism II: Multithreading 35
RESEARCH Thread Level Parallelism II: Multithreading 36
Research: Speculative Multithreading. Speculative multithreading: use multiple threads/processors for single-thread performance. Speculatively parallelize sequential loops that might not be parallel. Processing elements (PEs) are arranged in a logical ring; the compiler or hardware assigns iterations to consecutive PEs, and hardware tracks the logical order to detect mis-parallelization. There are techniques for doing this on non-loop code too: detect reconvergence points (function calls, conditional code). Effectively chains the ROBs of different processors into one big ROB: a global commit head travels from one PE to the next, and a mis-parallelization flushes one PE, not all PEs. Also known as split-window or Multiscalar. Not commercially available yet (eventually will be??), but it is the biggest idea from academia not yet adopted. Thread Level Parallelism II: Multithreading 37
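A sketch of the ring of PEs with a traveling global commit head, as described above; this is only an illustration of the ordering idea under assumed names, not the Multiscalar design itself:

#define NPES 4

int commit_head = 0;   /* PE holding the oldest, non-speculative work */

/* Called when PE 'pe' has finished its assigned iteration. Only the
 * PE at the commit head may commit; the head then travels to the
 * next PE around the logical ring. */
int try_commit(int pe) {
    if (pe != commit_head)
        return 0;                           /* still speculative: wait */
    commit_head = (commit_head + 1) % NPES; /* pass the head along     */
    return 1;                               /* committed               */
}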
Research: Multithreading for Reliability. Can multithreading help with reliability? Design bugs / manufacturing defects? No. Gradual defects, e.g., thermal wear? No. Transient errors? Yes. Staggered redundant multithreading (SRT): run two copies of the program at a slight stagger and compare results; on a difference, flush both copies and restart. – Significant performance overhead. Thread Level Parallelism II: Multithreading 38
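A sketch of the SRT compare step: the trailing copy's result for an instruction is checked against the leading copy's; a mismatch signals a transient error and triggers the flush-and-restart described above. Types and names are illustrative:

#include <stdbool.h>

typedef struct {
    unsigned pc;      /* instruction address          */
    unsigned value;   /* result produced by the insn  */
} result_t;

/* Compare corresponding results from the leading and (staggered)
 * trailing copies; false means flush both copies and restart. */
bool srt_check(result_t lead, result_t trail) {
    return lead.pc == trail.pc && lead.value == trail.value;
}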
Research: QoS, CMPs & Server Consolidation. Definition: server consolidation is an approach to the efficient usage of computer server resources in order to reduce the total number of servers or server locations that an organization requires. [Figure: several servers to be consolidated: www server, database servers #1 and #2, middleware servers.] Thread Level Parallelism II: Multithreading 39
Motivation: CMPs & Server Consolidation. CMP-based systems are pervasive and an ideal hardware platform (in terms of cost) on which to deploy server consolidation. [Figure: the same servers consolidated onto a 64-core CMP, each core with private L1s and a shared L2 cache.] Thread Level Parallelism II: Multithreading 40
Virtualization. Virtualization is the most suitable tool to achieve server consolidation: it hides all the nasty details from the user and eases system administration. Performance? Most virtualization layers provide a huge set of parameters to regulate access to shared resources: disk, memory, network, and CPU time-sharing. Thread Level Parallelism II: Multithreading 41
Performance Isolation & CMPs. CMPs introduce a new dimension in shared resources: a large portion of the memory hierarchy will be shared. Two major potential problems: on-chip cache capacity and/or bandwidth, and off-chip bandwidth. Questions that arise: Is this too fine-grained to be managed effectively at the software level? Does the CMP have to be aware of the virtualization layer? Does the hardware have to provide mechanisms for applying virtualization policies? Thread Level Parallelism II: Multithreading 42
Is it a problem? Let's analyze the problem with state-of-the-art infrastructure. Consolidated servers: a significant workload to be consolidated (SPECweb2005 with Apache Tomcat), plus a misbehaving workload that stresses the CMP (on-chip capacity & off-chip bandwidth). Software stack: virtualization layers (Xen & Solaris Zones) with resource controls, and realistic operating systems (GNU/Linux Debian 5, Solaris 10). Thread Level Parallelism II: Multithreading 43
Hardware Setup
                          Sun Fire T1000                                 HP ProLiant DL160 G5
Processor model           Sun UltraSparc T1                              Intel Xeon X5472
Threads/cores/dies/chips  32, 8, 1, 1                                    8, 8, 4, 2
L1I                       Private, 16KB, 4-way set-assoc., 32-byte       Private, 32KB, 8-way set-assoc., 64-byte
                          lines, 1 per core                              lines, 1 per core
L1D                       Private, 8KB, 4-way set-assoc., 16-byte        Private, 32KB, 8-way set-assoc., 64-byte
                          lines, write-through, 1 per core               lines, 1 per core
L2                        3MB shared per chip, 12-way set-assoc.,        6MB shared per die, 24-way set-assoc.,
                          64-byte lines                                  64-byte lines
Memory                    8GB, 533MHz DDR2                               16GB, 667MHz FB-DIMM
Hard disk                 1 SATA                                         2 SATA, RAID 0
NIC                       Dual Gigabit Ethernet (bonded)                 Dual Gigabit Ethernet (bonded)
Thread Level Parallelism II: Multithreading 44
Misbehaving application on T1000. [Figure: two bar charts: normalized off-chip bandwidth available to the primary VPS, and normalized performance of the primary VPS, each plotted for primary VPS / secondary VPS combinations.] Thread Level Parallelism II: Multithreading 45
Performance isolation on T1000. [Figure: bar chart of normalized performance for the Bank, E-Commerce, Support, and Average benchmarks, with the secondary VPS idle, harming on-chip capacity, or harming off-chip bandwidth.] Thread Level Parallelism II: Multithreading 46
Performance Isolation on Xeon. [Figure: bar chart of normalized performance for the Bank, E-Commerce, Support, and Average benchmarks, with VPSs placed per die (1 core per VPS), per chip (2 cores per VPS), or chip-interleaved (2 cores per VPS).] Thread Level Parallelism II: Multithreading 47
What to do? Current virtualization + hardware is not enough to provide the expected level of performance. Could a provider guarantee an SLA? It could potentially reduce its server-consolidation degree, or potentially lose its customers. With the advent of many-core CMPs, the problem will only get worse. We need some sort of QoS provision: hardware-level mechanisms to control access to common resources, with software specifying the policies for those mechanisms. Thread Level Parallelism II: Multithreading 48
Acknowledgments. Slides developed by Amir Roth of the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. Slides enhanced by Milo Martin and Mark Hill, with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood. Slides re-adapted by V. Puente of the University of Cantabria. Thread Level Parallelism II: Multithreading 49