Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors (NoCArc'09)

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors. NoCArc'09, December 12, 2009.
Jesús Camacho Villanueva, José Flich, José Duato (Universidad Politécnica de Valencia)
Hans Eberle, Nils Gura, Wladek Olesinski (Sun Microsystems)

Index
- Introduction
- Network simulator
- Simulation model
- Performance analysis
- Conclusions
- Future work

Introduction
- Networks-on-chip (NoCs) are a critical component of a chip multiprocessor (CMP) as the number of cores increases
- CMPs with 32 cores are already on the drawing board; 48 cores were recently announced by Intel
- There is a need for a full-system simulator with an accurate network simulation model
- Not considering the network component and full-system simulation may lead to incorrect conclusions

Introduction: topology considerations for NoCs (in CMPs)
- Crossbars simplify the design, but they have limited scalability [Micro07]
- 2D-meshes scale better than crossbars and simplify the design of a tiled organization
- Rings have a simpler design than 2D-meshes, but the average distance between nodes is higher (see the sketch below)
- The network capacity is also a critical parameter in the design of NoCs

[Micro07] Hoskote Y., Vangal S., Singh A., Borkar N., Borkar S.: "A 5-GHz mesh interconnect for a teraflops processor", IEEE Micro, 2007, 27(5), pp. 51-61
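To make the higher-average-distance remark concrete, here is a minimal sketch (not part of the original talk) that brute-forces the average hop count between distinct node pairs for a 32-node bidirectional ring and for the 4x8 mesh with X-Y routing described later in the presentation; the node numbering and the comparison method are assumptions for illustration.

    # Average hop distance: 32-node bidirectional ring vs. 4x8 mesh (illustrative sketch).
    from itertools import product

    def ring_distance(a, b, n=32):
        # Shortest path on a bidirectional ring of n nodes.
        d = abs(a - b)
        return min(d, n - d)

    def mesh_distance(a, b, cols=8):
        # Manhattan distance, i.e. the hop count of X-Y dimension-order routing.
        ax, ay = a % cols, a // cols
        bx, by = b % cols, b // cols
        return abs(ax - bx) + abs(ay - by)

    def average_distance(dist, n=32):
        pairs = [(s, d) for s, d in product(range(n), repeat=2) if s != d]
        return sum(dist(s, d) for s, d in pairs) / len(pairs)

    print("ring:", round(average_distance(ring_distance), 2))  # ~8.26 hops
    print("mesh:", round(average_distance(mesh_distance), 2))  # ~4.00 hops

Under these assumptions the ring averages roughly 8.3 hops between nodes versus about 4 for the mesh, which is the gap the bullet above alludes to.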

Introduction: goals
- Develop an accurate simulation tool for the on-chip network that takes the target machine into account: coherence protocol, OS, and application
- At the network level, the simulation tool needs to support:
  - Collective communication
  - Different topologies
  - Different architectures: switch architecture, switching mechanisms (wormhole, virtual cut-through), flit size, flow control

Network simulator: SIMICS + GEMS + GAPNET
- SIMICS: full-system simulator
- GEMS: a set of modules for SIMICS that enables detailed simulation of chip multiprocessors (CMPs); provides a detailed memory-system simulator and implements the cache coherence protocol
- GAPNET: event-driven network simulator providing collective communication

Network simulator: GapNet and the network interface (figure)

Network simulator: GapNet simulator events (figure)
- GEMS hands messages (source/destination) to GAPNET through the network interface
- Interface events: Wakeup, Enqueue
- GAPNET events: Send, Route, Cross, Transmit, Receive
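As a rough illustration of how such an event chain maps onto an event-driven simulator, the sketch below strings the stage names from the slide (Wakeup, Enqueue, Send, Route, Cross, Transmit, Receive) through a time-ordered event queue. The per-stage delays and the overall structure are placeholder assumptions for illustration, not GAPNET's actual implementation.

    # Illustrative event-driven skeleton (assumed structure, not GAPNET source code).
    import heapq

    STAGES = ["Wakeup", "Enqueue", "Send", "Route", "Cross", "Transmit", "Receive"]
    DELAY = {s: 1 for s in STAGES}  # placeholder: 1 cycle per stage

    def simulate(messages):
        # messages: list of (injection_cycle, message_id) pairs handed over by GEMS
        events = [(t, "Wakeup", m) for t, m in messages]
        heapq.heapify(events)
        while events:
            cycle, stage, msg = heapq.heappop(events)
            if stage == "Receive":
                print(f"cycle {cycle}: message {msg} delivered")
                continue
            nxt = STAGES[STAGES.index(stage) + 1]  # advance to the next stage
            heapq.heappush(events, (cycle + DELAY[stage], nxt, msg))

    simulate([(0, "A"), (3, "B")])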

Simulation model
- Sarek machine (Sun Fire server) running Solaris 10; 32 cores with SPARC CPUs, a private L1 cache per core, and an L2 cache shared among all processors

                 L1 cache    L2 cache
  Size           128 KB      8 MB
  Associativity  8-way       16-way
  Line size      64 B        64 B
  Hit latency    3 cycles    6 cycles

- The cache coherence protocol is a directory protocol with non-inclusive, blocking caches
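For reference, the simulated memory hierarchy can be written down as a plain configuration structure; the field names below are illustrative and are not actual GEMS parameter names.

    # Simulated cache hierarchy from the slide (illustrative field names).
    CACHE_CONFIG = {
        "L1": {"size_kb": 128,  "assoc": 8,  "line_bytes": 64, "hit_latency_cycles": 3},
        "L2": {"size_kb": 8192, "assoc": 16, "line_bytes": 64, "hit_latency_cycles": 6},
    }

    # Derived quantity: number of sets = size / (associativity * line size).
    for name, c in CACHE_CONFIG.items():
        sets = c["size_kb"] * 1024 // (c["assoc"] * c["line_bytes"])
        print(name, "sets:", sets)  # L1: 256 sets, L2: 8192 sets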

Simulation model: interconnects
- Four interconnect types: fixed-delay (ideal) interconnect, crossbar, 2D-mesh, and bidirectional ring
- The 2D-mesh is organized as a 4x8 array and uses X-Y dimension-order routing; the bidirectional ring chooses the shortest path

                         Ideal     Crossbar   2D-Mesh   Ring
  Link latency [cycles]  -         5          1         1
  Switch delay [cycles]  1..128    2          1         1

- "Fixed-delay interconnect" means constant latency and infinite bandwidth
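The sketch below shows how the per-hop parameters in the table could translate into zero-load header latency, together with the X-Y dimension-order next-hop rule on the 4x8 mesh. The latency model (hops times switch delay plus link latency) is an assumption for illustration, not necessarily the simulated routers' exact pipeline, and the ideal network is omitted because its delay is a swept constant.

    # Zero-load latency sketch using the slide's per-hop parameters (assumed model).
    PARAMS = {              # (link_latency, switch_delay) in cycles
        "crossbar": (5, 2),
        "mesh":     (1, 1),
        "ring":     (1, 1),
    }

    def zero_load_latency(topology, hops):
        link, switch = PARAMS[topology]
        return hops * (link + switch)

    def xy_next_hop(cur, dst, cols=8):
        # X-Y dimension-order routing on the 4x8 mesh: correct X first, then Y.
        cx, cy = cur % cols, cur // cols
        dx, dy = dst % cols, dst // cols
        if cx != dx:
            return cur + (1 if dx > cx else -1)        # one hop along X
        if cy != dy:
            return cur + (cols if dy > cy else -cols)  # one hop along Y
        return cur                                     # already at destination

    # Corner-to-corner on the mesh: count hops from node 0 to node 31, then convert.
    cur, hops = 0, 0
    while cur != 31:
        cur = xy_next_hop(cur, 31)
        hops += 1
    print(hops, "hops ->", zero_load_latency("mesh", hops), "cycles")   # 10 hops -> 20 cycles
    print("crossbar, 1 hop ->", zero_load_latency("crossbar", 1), "cycles")  # 7 cycles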

Simulation model: interconnects (figures: crossbar and 2D-mesh topologies)

Simulation model: interconnects (figure: ring topology)
- Ideal network: fixed delay, free of contention, unlimited bandwidth

Simulation model: tile-based design (figure: tile-based 4x4 2D-mesh)

Simulation model: network capacity (changing the flit size)
- We change the capacity of the network by modifying the flit size
- The flit is the minimum unit of data that can be flow-controlled across a link
- The flit size is an important parameter at two levels:
  - Architectural level: assuming wormhole switching, different flit sizes lead to different contention levels
  - Design level: larger flits lead to more expensive router designs that consume more area and power
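The architectural effect of the flit size is easy to see with a short calculation: fewer, wider flits shorten the serialization tail of each message under wormhole switching. The 64-byte payload matches the cache line size from the simulation model; the 8-byte header is an illustrative assumption, since the slides do not specify the packet format.

    # Flits per message and serialization cost vs. flit size (illustrative assumption).
    import math

    def flits_per_message(payload_bytes, flit_bytes, header_bytes=8):
        # header_bytes is assumed; the presentation does not give the packet format.
        return math.ceil((payload_bytes + header_bytes) / flit_bytes)

    for flit in (16, 8, 4):
        n = flits_per_message(64, flit)
        # With one flit per link per cycle, serialization adds (n - 1) cycles beyond
        # the header latency and keeps channels busy longer (more contention).
        print(f"{flit:2d} B flits: {n:2d} flits/message, +{n - 1} serialization cycles")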

Performance analysis
- Ideal network: normalized execution time (cycles) across the delay spectrum for each benchmark (figure)
- The system (for most applications) is very sensitive to network latency: e.g., a 41% increase for FFT, 171% for Raytrace, and 32% for Radix when going from a 1-cycle to an 8-cycle delay

Performance analysis
- Ideal network: normalized number of L1 misses across the delay spectrum for each benchmark (figure)

Performance analysis
- Ideal network: normalized number of messages across the delay spectrum for each benchmark (figure)

Performance analysis
- The 2D-mesh achieves the best performance. Average savings for narrow flits: 19% compared with the ring and 26% compared with the crossbar
- With wide flits, the crossbar performs better than the ring in FMM, LU, FFT, and Barnes, and similarly in the others
- As we shrink the flit size, the behavior changes and the crossbar becomes worse
- A ring with wide flits achieves performance similar to a 2D-mesh with narrow flits
- Narrow flits tend to increase execution time regardless of the topology, but the 2D-mesh is less affected
- For this CMP configuration, a good trade-off is a 2D-mesh with moderate flit sizes (for example 8 B)

Performance analysis: comparison between 2D-mesh, ring, and crossbar (figures: FMM, LU, FFT, Radix, Raytrace)

Performance analysis: comparison between 2D-mesh, ring, and crossbar (figures: FFT, Barnes, LU, Radiosity)

  L1 miss types in Radiosity (by flit size)
               16 B        8 B          4 B
  User         537,136     541,517      539,187
  Supervisor   198,764     480,737      201,876
  Total        735,901     1,022,255    741,063

Performance analysis: L1 miss rates (%) at low network load

                     Mesh   Ring   Xbar
  Radix       16 B   0.35   0.33   0.33
               8 B   0.35   0.37   0.36
               4 B   0.38   0.30   0.29
  Radiosity   16 B   0.09   0.09   0.09
               8 B   0.07   0.09   0.09
               4 B   0.09   0.09   0.09
  FFT         16 B   0.36   0.29   0.32
               8 B   0.36   0.28   0.29
               4 B   0.31   0.26   0.22
  Barnes      16 B   0.15   0.13   0.14
               8 B   0.15   0.13   0.13
               4 B   0.14   0.13   0.13
  Raytrace    16 B   0.84   0.52   0.38
               8 B   0.82   0.53   0.32
               4 B   0.68   0.38   0.20

- Congestion is not an issue (in this CMP configuration)

Conclusions
- Developed a detailed on-chip network simulator and interfaced it to GEMS/SIMICS
- Analyzed the impact of topology and flit size on real applications' execution time
- Results:
  - Applications are very sensitive to network latency
  - Application + system behavior may change because of the network (unpredicted behavior captured by our simulation tool)
  - 2D-meshes always outperform rings and crossbars
- For this CMP configuration, a 2D-mesh with moderate flit sizes is the best option

Future work: the tool will enable
- Evaluation of other cache coherence protocols (token and hammer) with strong requirements for collective communication
- Studying the impact of multicast traffic on applications' execution time
- Studying the impact of memory controllers on applications' execution time
- Evaluation of commercial workloads

Thank you!
Jesús Camacho Villanueva, e-mail: jecavil@gap.upv.es
December 12, 2009