Circuit-Switched Coherence

Similar documents
Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

CHIP MULTIPROCESSOR COHERENCE AND INTERCONNECT SYSTEM DESIGN

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Lecture 18: Interconnection Networks. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Thread level parallelism

Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols

A Dynamic Link Allocation Router

- Nishad Nerurkar. - Aniket Mhatre

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

Asynchronous Bypass Channels

HTM and NOC - Interdependent Network Traffic Solutions

Question: 3 When using Application Intelligence, Server Time may be defined as.

On-chip Interconnection Network for Accelerator-Rich Architectures

Synthetic Traffic Models that Capture Cache Coherent Behaviour. Mario Badr

3D On-chip Data Center Networks Using Circuit Switches and Packet Switches

Lecture 23: Multiprocessors

Interconnection Network Design

Interconnection Networks

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Precise and Accurate Processor Simulation

Why the Network Matters

Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

On-Chip Interconnection Networks Low-Power Interconnect

WAN Data Link Protocols

MPLS. Packet switching vs. circuit switching Virtual circuits

ECE 358: Computer Networks. Solutions to Homework #4. Chapter 4 - The Network Layer

A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:

Fiber Channel Over Ethernet (FCoE)

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Optical interconnection networks with time slot routing

ARCHITECTING EFFICIENT INTERCONNECTS FOR LARGE CACHES

Low-Overhead Hard Real-time Aware Interconnect Network Router

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Photonic Networks for Data Centres and High Performance Computing

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Study Plan Masters of Science in Computer Engineering and Networks (Thesis Track)

TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy

Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)

CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING

An Active Packet can be classified as

Leveraging Torus Topology with Deadlock Recovery for Cost-Efficient On-Chip Network

Computer Networks and the Internet

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

Design of a Dynamic Priority-Based Fast Path Architecture for On-Chip Interconnects*

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

OpenSPARC T1 Processor

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

Enterprise Applications

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

Voice over Internet Protocol (VoIP) systems can be built up in numerous forms and these systems include mobile units, conferencing units and

Interconnection Networks

FPGA-based Multithreading for In-Memory Hash Joins

A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator

PART III. OPS-based wide area networks

Behavior Analysis of TCP Traffic in Mobile Ad Hoc Network using Reactive Routing Protocols

CCNA R&S: Introduction to Networks. Chapter 5: Ethernet

Introduction. Abusayeed Saifullah. CS 5600 Computer Networks. These slides are adapted from Kurose and Ross

A Low-Radix and Low-Diameter 3D Interconnection Network Design

Symmetric Multiprocessing

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi

Optimizing Configuration and Application Mapping for MPSoC Architectures

Highly Available Mobile Services Infrastructure Using Oracle Berkeley DB

- Hubs vs. Switches vs. Routers -

What is this Course All About

Parallel Firewalls on General-Purpose Graphics Processing Units

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload

Data Center and Cloud Computing Market Landscape and Challenges

Switch Fabric Implementation Using Shared Memory

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Chapter 3. Enterprise Campus Network Design

MULTISTAGE INTERCONNECTION NETWORKS: A TRANSITION TO OPTICAL

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

TE in action. Some problems that TE tries to solve. Concept of Traffic Engineering (TE)

SUPPORT FOR HIGH-PRIORITY TRAFFIC IN VLSI COMMUNICATION SWITCHES

Introduction, Rate and Latency

Putting it all together: Intel Nehalem.

Scalability and Classifications

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

The Internet has made database management

SWEL: Hardware Cache Coherence Protocols to Map Shared Data onto Shared Caches

OC By Arsene Fansi T. POLIMI

Interconnection Networks

Router Architectures

InfiniBand Clustering

SOC architecture and design

Data Center Network Structure using Hybrid Optoelectronic Routers

Dynamic Directories: A Mechanism for Reducing On-Chip Interconnect Power in Multicores

Introduction to Metropolitan Area Networks and Wide Area Networks

Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors

Transcription:

Circuit-Switched Coherence Natalie Enright Jerger*, Li-Shiuan Peh +, Mikko Lipasti* *University of Wisconsin - Madison + Princeton University 2 nd IEEE International Symposium on Networks-on-Chip

Motivation Network on Chip for general purpose multi-core Replacing dedicated global wires Efficient/scalable communication on-chip Router latency overhead can be significant Exploit application characteristics to lower latency Co-design coherence protocol to match network functionality 2

Executive Summary Hybrid Network Interleaves circuit-switched and packetswitched flits Optimize setup latency Improve throughput over traditional circuitswitching Reduce interconnect delay by up to 22% Co-design cache coherence protocol Improves performance by up to 17% 3

Switching Techniques Packet Switching Efficient bandwidth utilization Router latency overhead Circuit Switching Poor bandwidth utilization Efficient Stalled bandwidth requests utilization due to unavailable + low latency resources Low latency Best of both worlds? Avoids router overhead after circuit is established 4

Circuit-Switched Coherence Two key observations Commercial workloads are very sensitive to communication Construct latency fast pair-wise circuits? Significant pair-wise sharing Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace 5

Traditional Circuit Switching Traditional circuit-switching hurts performance by up to ~7% *Data collected for 16 in-order core chip multiprocessor 6

Circuit Switching Redesigned Latency is critical Utilize Circuit Switching for lower latency A circuit connects resources across multiple hops to avoid router overhead Traditional circuit-switching performs poorly My contributions Novel setup mechanism Bandwidth stealing 7

Outline Motivation Router Design Setup Mechanism Bandwidth Stealing Coherence Protocol Co-design Pair-wise sharing 3-hop optimization Region prediction Results Conclusions 8

Traditional Circuit Switching Path Setup (with Acknowledgement) 0 Configuration Probe 5 Data Circuit Acknowledgement Significant latency overhead prior to data transfer Other requests forced to wait for resources 9

Novel Circuit Setup Policy 0 A Configuration Packet 5 Data Circuit Overlap circuit setup with 1 st data transfer Reconfigure existing circuits if no unused links available Allows piggy-backed request to always achieve low latency Multiple circuit planes prevent frequent reconfiguration 4/11/2008 Natalie Enright Jerger - University of Wisconsin 10

Setup Network Light-weight setup network Narrow Circuit plane identifier (2 bits) + Destination (4 bits) Low Load No virtual channels small area footprint Stores circuit configuration information Multiple narrow circuit planes prevent frequent reconfiguration Reconfiguration Buffered, traverses packet-switched pipeline 11

Packet-Switched Bandwidth Stealing Remember: problem with traditional Circuit-Switching is poor bandwidth Need to overcome this limitation Hybrid Circuit-Switched Solution: Packetswitched messages snoop incoming links When there are no circuit-switched messages on the link A waiting packet-switched message can steal idle bandwidth 12

Hybrid Circuit-Switched Router Design Allocators Inj N S E T T T T Ej N S E W W T Crossbar 13

HCS Pipeline Circuit-switched messages: 1 stage Switch Traversal Link Traversal Router Link Packet-switched messages: 3 stages Aggressive Speculation reduces stages Buffer Write Virtual Channel/ Switch Allocation Switch Traversal Link Traversal Router Link 14

Outline Motivation Router Design Setup Mechanism Bandwidth Stealing Coherence Protocol Co-design Pair-wise sharing 3-hop optimization Region prediction Results Conclusions 15

Sharing Characterization Temporal sharing relationship: 67-76% of misses are serviced by 2 most recently shared with cores Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace 16

Directory Coherence 1 Read A 3 Data Response A 1 2 Directory Address State Sharers A Exclusive Shared 1,2 2 B Shared 1,2 2 Forward Read A 17

Coherence Protocol Co-Design Goal: Better exploit circuits through coherence protocol Modifications: Allow a cache to send a request directly to another cache Notify the directory in parallel Prediction mechanism for pair-wise sharers Directory is sole ordering point 18

Circuit-Switched Coherence Optimization 1 Update A 2 Data Response A 1 2 1 Read A 3 Ack A Directory Address State Sharers A Exclusive Shared 1,2 2 B Shared 1,2 19

Region Prediction Region Table A -- 2 B 3 1 Miss A[0] 4 Region A Update 3 Data Response A[0] 1 2 5 Read A[1] Directory Address State Sharers A[0] Shared 1,2 2 A[1] Shared 2 Each memory region spans 1KB Forward Read A[0] Takes advantage of spatial and temporal sharing 20 2

Simulation Methodology PHARMSim Full-system multi-core simulator Detailed network level model Cycle accurate router model Flit-level contention modeled More results in paper 21

Simulation Workloads SPECjbb SPECweb TPC-W TPC-H Barnes-Hut Ocean Radiosity Raytrace Uniform Random Permutation Traffic Commercial Java server workload 24 warehouse, 200 requests Web server, 300 requests Web e-commerce, 40 transactions Decision support system Scientific 8k particles, full run 514x514, parallel phase Parallel phase Car input, parallel phase Synthetic Destination select with uniform random distribution Each node communicates with one other node (pair-wise) 22

Cores Simulation Configuration L1 I/D Caches Private L2 caches Shared L3 Cache Main Memory Latency Packet-switched baseline Processors Memory System 16 in-order general purpose 32 KB 2-way set associative 1 cycle 512 KB 4-way set associative 6 cycles 64 Byte lines 16 MB (1MB bank/tile) 4-way set associative 12 cycles 100 cycles Interconnect: 4x4 2-D Mesh Optimized 1-3 router stages 4 Virtual channels with 4 Buffers each Table with config parameters Hybrid Circuit Switching 1 router stage 2 or 4 Circuit planes 23

Network Results Communication latency is key: shave off precious cycles in network latency 24

Flit breakdown 4/11/2008 Reduce interconnect latency for a significant fraction of messages Natalie Enright Jerger - University of Wisconsin 25

HCS + Protocol Optimization Improvement of HCS + Protocol optimization is greater than the sum of HCS or Protocol Optimization alone. Protocol Optimization drives up circuit reuse, better utilizing HCS 26

Uniform Random Traffic HCS successfully overcomes bandwidth limitations associated with Circuit Switching 27

Related Work Router optimizations Express Virtual Channels [Kumar, ISCA 2007] Single-cycle router [Mullins, ISCA 2004] Many more Hybrid Circuit-Switching Wave-switching [Duato, ICPP 1996] SoCBus [Wiklund, IPDPS 2003] Coherence Protocols Significant research in removing overhead of indirection 28

Circuit-Switched Coherence Summary Replace packet-switched mesh with hybrid circuit-switched mesh Interleave circuit and packet switched flits Reconfigurable circuits Dedicated bandwidth for frequent pair-wise sharers Low Latency and low power Avoid switching/routing Devise novel coherence mechanisms to take advantage of benefits of circuit switching 29

Thank you www.ece.wisc.edu/~pharm enrightn@cae.wisc.edu 30

Circuit Setup Novel Setup Policy Overlap circuit setup with first data transfer Store circuit information at each router Reconfigure existing circuits if no unused links available Allows piggy-backed request to always achieve low latency Multiple narrow circuit planes prevent frequent reconfiguration Reconfiguration Buffered, traverses packet-switched pipeline 31