Synthetic Traffic Models that Capture Cache Coherent Behaviour. Mario Badr


Synthetic Traffic Models that Capture Cache Coherent Behaviour

by Mario Badr

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto

© Copyright 2014 by Mario Badr

Abstract

Synthetic Traffic Models that Capture Cache Coherent Behaviour
Mario Badr
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014

Modern and future many-core systems represent large and complex architectures. The communication fabrics in these large systems play an important role in their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased complexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that trade off some accuracy but maintain important workload trends for faster simulation times are highly beneficial at early stages of architectural exploration. We propose a synthetic traffic generation methodology that captures both application behaviour and cache coherence traffic to rapidly evaluate NoCs. This allows designers to quickly run detailed performance simulations without the cost of long-running full system simulation, while still capturing a full range of application and coherence behaviour. Our methodology has an average (geometric) error of 10.9% relative to full system simulation, and provides a 50× speedup on average over full system simulation.

Contents

1 Introduction
  1.1 Exploring the NoC Design Space
  1.2 A High-Level Approach
  1.3 Thesis Organization and Contributions
2 Background
  2.1 NoC Primer
    2.1.1 Topology
  2.2 Routing and Flow Control
    2.2.1 Performance
3 Simulation Methodologies
  3.1 Synthetic Traffic Patterns
  3.2 Trace Simulation
  3.3 Traces with Dependencies
  3.4 Other Methodologies and Related Work
    3.4.1 Simulation Acceleration
    3.4.2 Workload Modelling and Synthetic Benchmarks
4 Time-Varying Application Behaviour
  4.1 Phase Behaviour in Applications
  4.2 Feature Vector Design
    4.2.1 Total Injection
    4.2.2 Coherence Composition
    4.2.3 Node Injection
    4.2.4 Row-Column Flow
    4.2.5 Per-Node Flows
    4.2.6 Summary
  4.3 The Injection Process
    4.3.1 Macro Scale
    4.3.2 Micro Scale
    4.3.3 Phase Transitions
  4.4 Summary
5 Synthesizing Traffic
  5.1 Overview
  5.2 Initiating Packets
  5.3 Reactive Packets
    5.3.1 Forwarding versus Off-Chip
    5.3.2 Invalidates
  5.4 Summary
6 Evaluation and Results
  6.1 System Configuration
  6.2 Model Exploration
    6.2.1 Macro Phases
    6.2.2 Congestion at the Micro Level
    6.2.3 Time Interval Size
    6.2.4 Parameter Recommendations
  6.3 NoC Performance Evaluation
  6.4 Exploiting Markov Chains for Speedup
7 Conclusion and Future Work
Bibliography

Chapter 1
Introduction

As uniprocessors are now limited by power and heat constraints, the architecture community has been investing increasing research and development effort into multi- and many-core processors. As a result, the design space has grown larger, and more complex trade-offs are associated with processor designs. To accurately evaluate candidate architectures, we can model each component on- and off-chip and perform full-system simulation. These full-system simulations can provide high-fidelity performance metrics before synthesis and prototyping of the actual design. An important part of this larger design space is the communication fabric used to connect the cores on a single die. In particular, Networks-on-Chip (NoCs) have been proposed as a modular and scalable fabric that can facilitate communication in multi- and many-core chips [10]. Applications targeted for the system will impose different bandwidth requirements on the interconnect, and NoCs need to be provisioned for performance whilst meeting power and area cost constraints. But NoCs themselves have their own large design space, and modelling each component via full-system simulation is time-consuming [24]. Simulation efforts are being strained in part because of the large number of system components that must be modelled. Detailed models of processors (possibly multiple heterogeneous ones), caches, DRAM, and networks are critical for accurate performance results, not to mention the need to run many large applications and fully model OS behaviour. However, full-system simulation is still appealing because of its fidelity. Designers who are willing to sacrifice fidelity, whether due to time-to-market constraints or for design space exploration, have other software simulation methodologies available to them. Trace simulations are prevalent in a number of domains, where relevant information is recorded during a single full-system run and is then replayed on a specific component.
For NoCs, synthetic traffic patterns can quickly reveal bottlenecks in the design. Software simulation methods that look to maintain fidelity while accelerating long simulations also exist, for example sampling [1, 8]. However, outside of full-system simulation, several methodologies fail to accurately capture OS or cache coherence traffic, which can have significant effects on the performance of the system [23]. A simulation methodology that captures this behaviour while quickly evaluating NoC designs is needed.

1.1 Exploring the NoC Design Space

There are several parameters that can affect the traffic a communication fabric needs to support (the application, cache hierarchy, and coherence protocol, to name a few). Software simulation that models all these components takes time. A methodology that allows for NoC design space exploration needs to abstract away these parameters so that researchers can focus on an NoC's infrastructure and communication paradigms. To better understand this abstraction, we present a generic design flow in Figure 1.1. Before any design can begin, the question of "What are the target applications for the system?" needs to be answered. This is done during the Application Modelling phase (top left of the figure), and the design space of applications can be vast [5], especially for general purpose processors. Once the application space has been determined, the corresponding threads are mapped and scheduled onto different processor cores (bottom left). The microarchitecture of these cores is another rich design space with several parameters. Finally, these cores along with their caches are connected to the communication fabric via routers, and the routers are in turn connected to other routers via links and buffers according to some topology (right side of the figure) to make up the NoC.

Figure 1.1: A generic design flow, with emphasis on NoCs.

In order to abstract away the other design spaces, our methodology looks only at the traffic injected into the routers. This means that interactions between components (such as between a core and its cache) are not modelled, allowing those designs to be varied independently. Another advantage is simplicity; we seek to synthetically inject packets into the network without modelling the complexities of each individual component, as full-system simulation would.
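The router-boundary abstraction described above can be sketched as a minimal interface: a traffic source emits only (cycle, source, destination, message type) tuples, with no core or cache model behind it. All names here are hypothetical illustrations, not part of the thesis' actual tooling.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Packet:
    cycle: int      # injection time
    src: int        # injecting node
    dst: int        # destination node
    msg_type: str   # e.g. a coherence message class such as "GETS"

class TrafficSource:
    """Abstract source of injected packets; hides cores, caches, directories."""
    def packets(self) -> Iterator[Packet]:
        raise NotImplementedError

class SyntheticSource(TrafficSource):
    """Replays a precomputed schedule instead of simulating components."""
    def __init__(self, schedule: List[Tuple[int, int, int, str]]):
        self.schedule = schedule

    def packets(self) -> Iterator[Packet]:
        for cycle, src, dst, t in self.schedule:
            yield Packet(cycle, src, dst, t)
```

An NoC simulator driven through such an interface never sees whether the packets came from a full-system run or a synthetic model, which is what lets the other design spaces vary independently.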
1.2 A High-Level Approach

As technology scales and machines evolve, new applications and workloads that take advantage of the hardware follow. As a result, older benchmark suites become less relevant, and new benchmark suites

emerge to explore new algorithms and workloads [5, 12]. Because applications and the systems they run on are constantly changing, our work focuses on creating a generic, flexible, and fast methodology that can capture the communication behaviour of any application on any system.

Figure 1.2: A high level overview of our methodology for design space exploration. The right of the dashed line shows examples of inputs or outputs of the different components.

Figure 1.2 shows the high-level approach of our methodology. We begin by performing a full-system simulation on an ideal network (i.e. the network does not model congestion: packets arrive at their destination in uniform, single-cycle time). By using an ideal network, we ensure that our traffic modelling does not capture any aspects of a specific NoC configuration. Ideal networks also make it easier to understand traffic behaviour because each packet has a single-cycle latency. Once we have the traffic generated by full-system simulation on an ideal network, our traffic modelling can extract the spatial and temporal characteristics of the traffic such that there is sufficient information to recreate it synthetically; we use the term synthetic because the traffic is artificial (that is, it is not produced by a full system simulation). The models we create need to provide enough information to recreate the traffic. For example, a model that looks only at the average hop count for its spatial characteristic will know how far to send a packet, but not which destination.
Finally, the model is applied to a traffic generator that recreates the traffic in a fashion similar to full-system simulation, but without the complex modelling of specific components (e.g. caches, off-chip memory, cores). This improves simulation time, and setting up the traffic generator to drive an NoC simulator is easier than setting up a full system simulator. There are many different ways to model and generate traffic. In this dissertation, we analyze several models and parameters that explore the spatial and temporal characteristics of real application traffic

and how they affect the performance of different NoCs.

1.3 Thesis Organization and Contributions

The focus of this dissertation is to propose a simulation methodology that can assess various NoC designs without the need for full-system simulation. We provide a brief primer on NoCs in Chapter 2. In Chapter 3, we introduce simulation methodologies already available to designers. In this thesis we pay special attention to NoC performance, which is greatly affected by the traffic generated from an application run. Therefore, an important first step is to understand the behaviour of application traffic and develop models that capture its spatial and temporal characteristics. In Chapter 4, we discuss a methodology that captures the time-varying behaviour of the application. We apply a two-level hierarchical divide-and-conquer approach that splits application traffic into intervals. Chapter 5 then looks at how to reproduce intervals at the lowest level of our hierarchy to resemble traffic generated by a real application on a shared memory architecture. In Chapter 6, we explore how the parameters of our model can be changed to improve the fidelity of our methodology. We then apply our recommended parameters to a variety of applications from the PARSEC [5] and SPLASH-2 [41] benchmark suites, and compare NoC performance and simulation time to full-system simulation. The main contribution of this dissertation is a novel methodology that:

1. Generates bursty traffic similar to what would be seen by applications run on a full-system simulator. Burstiness can have a major impact on NoC performance because the network must accommodate a large number of packets in a short amount of time.

2. Produces traffic that resembles a real cache coherence protocol. Cache coherence is a crucial component of current and future designs [23], and can impact traffic patterns as we will see in Chapter 3.

3. Can be applied to multiple NoC designs without loss of fidelity. Designers often need to explore several different designs to optimize the NoC for their applications.

4. Provides significant speedup over full-system simulation. Time-to-market constraints and the aforementioned need for design space exploration mean that simulations must be fast as well as accurate.

Chapter 2
Background

Efficient on-chip communication is essential for the performance of CMPs. NoCs have become the de facto interconnect of the future [22]. In this chapter, we present a brief primer regarding the basics of an NoC and how to evaluate its performance.

2.1 NoC Primer

NoCs provide a scalable, high bandwidth alternative to bus-based and point-to-point interconnects. This is, in large part, thanks to the modular approach of connecting nodes to routers, depicted in Figure 2.1. A network node can contain several components, such as a processor core or a cache, and sends messages to other nodes via a router. In this chapter, we discuss:

1. Topology: how the routers are connected to each other
2. Routing & Flow Control: how messages are transported in the NoC
3. Performance: how NoC performance is measured

Figure 2.1: Nodes (squares) connected to routers (circles). Nodes can contain several components, such as cores (C), caches ($), and directories (D).

2.1.1 Topology

An NoC's topology can have a significant impact on its performance because it defines the paths available between nodes through routers [11]. Figure 2.2 shows one example of a 9-node ring topology, where the circles are considered to be nodes with routers and the lines connecting them are channels. In order for a message to travel from Node 0 to Node 3, it must travel through the routers at Nodes 1 and 2 as well. This means messages going from Node 0 to Node 3 take three hops to arrive at their destination. Conversely, the mesh network shown in Figure 2.3 allows messages travelling from Node 0 to Node 3 to take only one hop. Assuming perfect routing and flow control, the topology gives designers an upper bound for the NoC's performance.

Figure 2.2: A 9-node ring network topology.
Figure 2.3: A 9-node mesh network topology.
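The hop counts above follow directly from the topology. A minimal sketch, assuming a 9-node bidirectional ring and a 3×3 mesh with minimal routing (node numbering as in Figures 2.2 and 2.3, row-major for the mesh):

```python
def ring_hops(src, dst, n=9):
    """Minimal hops on an n-node bidirectional ring."""
    d = abs(dst - src)
    return min(d, n - d)

def mesh_hops(src, dst, width=3):
    """Minimal (Manhattan-distance) hops on a width x width mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

print(ring_hops(0, 3))  # 3 hops: through the routers at Nodes 1 and 2
print(mesh_hops(0, 3))  # 1 hop: Node 3 sits directly below Node 0
```

The same two functions also expose the topologies' worst cases, which is the sense in which the topology bounds achievable performance before routing and flow control are even considered.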
Messages (or Packets) sent by nodes into the NoC can hold varying amounts of data, which are further discretized into flits. Flits help serialize a packet so that they can traverse through the network according to the bandwidth allowed by the channel. Flits are routed through different network resources, such as router buffers and virtual channels (VCs). Each channel in the network has multiple virtual channels associated with a router port. Multiple VCs helps improve link utilization because, if one packet stalls on its way to its destination (due to contention for a network resource), other packets can continue to route through the same path using a different VC. A high number of VCs can increase the bandwidth capabilities of an NoC, however they are also expensive in terms of area and power [11].

Figure 2.4: A packet routed from Node 0 to Node 7 using Dimension-Order Routing. Note the congestion between Nodes 3, 6, and 7.
Figure 2.5: A packet routed from Node 0 to Node 7 using adaptive routing, avoiding the congestion between Nodes 3, 6, and 7.

2.2.1 Performance

The main performance metric associated with NoC design is packet latency. This is the time it takes for a packet to arrive at its destination from its source node. The latency of a packet can vary when we consider contention in the network; as the NoC becomes congested, its resources fill up and packets must wait before they can continue to traverse the network. Waiting (or stalling) for network resources can dramatically increase the latency of a packet and severely hinder network performance. The average packet latency is a common metric used to get a quick evaluation of NoC designs; however, it is also informative to consider the packet latency distribution. Similar average packet latencies can be achieved with very different distributions. For example, a Gaussian distribution and a bi-modal distribution can be constructed to give the same average. However, the bi-modal distribution has a large number of packets with very high and very low latencies. As a result, NoC designers would provision the network differently in order to improve network performance, taking into consideration the high bandwidth requirements for a given application. Conversely, a Gaussian distribution implies that the traffic is more easily manageable and not as bursty (i.e. a large number of packets injected into the network over a short time). A common pitfall in measuring packet latency is disregarding the source queue. When a packet is to be sent into the network, it is possible that the input port to the network is busy. As a result, the source must delay injection of the packet until the input is ready [11].
This situation is common during bursty injection, and can have a significant impact on packet latency when the network is congested. In this dissertation, packet latency includes the source queue.
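The earlier claim that a Gaussian and a bi-modal latency distribution can share an average is easy to verify with a short sketch (the particular means and spreads are illustrative assumptions):

```python
import random

random.seed(0)
n = 100_000

# Gaussian latencies centred at 27 cycles.
gauss = [random.gauss(27, 3) for _ in range(n)]

# Bi-modal: half the packets around 10 cycles, half around 44 cycles.
# The mixture mean is 0.5*10 + 0.5*44 = 27, matching the Gaussian.
bimodal = [random.gauss(10, 2) if random.random() < 0.5 else random.gauss(44, 2)
           for _ in range(n)]

print(sum(gauss) / n)    # close to 27
print(sum(bimodal) / n)  # also close to 27, despite a very different shape
```

Any summary that reports only the first line of output would treat these two networks as equivalent, even though the bi-modal one must be provisioned for a large population of very slow packets.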

Chapter 3
Simulation Methodologies

There are several simulation methodologies available to researchers when evaluating NoCs. In this chapter, we explore two software simulation methodologies (synthetic traffic patterns and traces) and discuss recent advancements in the area of software simulation and other related work.

3.1 Synthetic Traffic Patterns

Synthetic traffic patterns such as uniform random, permutation, tornado, etc. are widely used in NoC research. Many of these traditional synthetic traffic patterns are based on the communication pattern of specific applications. For example, transpose traffic is based on a matrix transpose application, and the shuffle permutation is derived from Fast Fourier Transforms (FFTs) [2, 11]. However, these synthetic traffic patterns are not representative of the wide range of applications that run on current and future CMPs. Even if these traffic patterns were representative, the configuration of a cache-coherent system can mask or destroy the inherent communication pattern of the original algorithm due to the presence of indirections and control messages. Synthetic traffic patterns are typically applied to an NoC with a fixed injection rate. In this section we demonstrate two methods for selecting an appropriate synthetic traffic pattern to explore an NoC design. The first method uses the injection rate from a real application, Facesim, and sweeps several traffic patterns. The second method uses a synthetic traffic pattern, shuffle, that is historically based on a real application (FFT). In both cases, we will see that current synthetic traffic patterns are not characteristic of the modern application landscape. The Facesim benchmark computes a realistic animation of a face through physics simulation. When run on an ideal network, it has an average injection rate of 0.16 packets per cycle. We apply this injection rate using a variety of synthetic traffic patterns that are used in the evaluation of NoCs [11].
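These classic patterns are simple bit-level functions of the source address. A sketch for a 16-node network (4 address bits), following the standard definitions in the NoC literature:

```python
B = 4  # address bits; 2**B = 16 nodes

def bit_complement(s):
    """Destination is the bitwise complement of the source address."""
    return s ^ ((1 << B) - 1)

def transpose(s):
    """Swap the high and low halves of the address (matrix-transpose pattern)."""
    half = B // 2
    lo = s & ((1 << half) - 1)
    hi = s >> half
    return (lo << half) | hi

def shuffle(s):
    """Rotate the address bits left by one: d_i = s_((i-1) mod B)."""
    return ((s << 1) | (s >> (B - 1))) & ((1 << B) - 1)

print(bit_complement(0b0110))  # node 6 always sends to node 9
print(shuffle(0b1000))         # node 8 always sends to node 1
```

Note that each source maps to exactly one destination, which is why heat maps of such patterns (as in Figure 3.3a later) show isolated source-destination pairs rather than the diffuse hot spots of real coherence traffic.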
Figure 3.1 shows the average packet latency, yielding an average error of 23.64% (our system configuration and methodology are explained in Chapter 6). Looking just at average packet latency, one could come to the conclusion that the Bit Complement traffic pattern best approximates the Facesim benchmark. However, average packet latency tells us little about the congestion in the network. Bit Complement is a subset of permutation traffic that is typically used to stress NoC configurations [11]. Specifically, given a 4-bit source address S = {s3, s2, s1, s0}, the destination is computed as the complement of each bit: D = {~s3, ~s2, ~s1, ~s0}. Figure 3.2 shows the packet latency

Figure 3.1: Average packet latency of different traffic patterns on an NoC. The dashed line is the average packet latency of the Facesim benchmark.

distributions for Facesim and Bit Complement (as a percentage of packets injected). Bit Complement has three distinct packet latencies (17, 27, and 37 cycles) whereas Facesim's packet latencies are more evenly distributed due to time-varying behaviour in the simulation. Bit Complement may produce the same average behaviour as Facesim (with regards to packet latency), but it does not provide bursts of traffic as one would expect from a real application.

Figure 3.2: Packet latency distributions for Facesim and Bit Complement.

Synthetic traffic pattern simulations can be made more sophisticated using varying injection rates for nodes, modulated Markov processes for injection, etc. [11]. However, they are not representative of real application traffic due to the shared memory architectures that are typically modelled in full system simulation. The arrangement of cores, caches, directories, and memory controllers directly influences the flow of communication when running an application. To illustrate this point, we compare a synthetic shuffle pattern with the FFT benchmark from SPLASH-2 [41]. The shuffle pattern is a bit permutation where the destination is calculated via the function d_i = s_((i-1) mod b), where b is the number of bits required to represent the nodes of the network [11]. FFT is run in full-system simulation while shuffle is run in network-only simulation. Figure 3.3 shows the number

Chapter 3. Simulation Methodologies 10 Destination 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 600 500 400 300 200 100 0 Destination 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 200000 150000 100000 50000 0 1 2 3 4 5 6 7 8 9 101112131415 Source (a) Shuffle Traffic Pattern 0 1 2 3 4 5 6 7 8 9 101112131415 Source (b) FFT Application Figure 3.3: A comparison of the spatial behaviour between synthetic and real application traffic of packets sent from a source to a destination 1. In Figure 3.3b, we see notable destination hot spots at nodes 0, 2, and 5, as well as source hot spots at nodes 0 and 5. However, Figure 3.3a shows hot spots only for specific source-destination pairs. The best NoC design for the traffic in Figure 3.3a is unlikely to be the best NoC for the traffic in Figure 3.3b. The sharp contrast in Figure 3.3 is due to coherence transactions needing to visit several nodes in a shared memory architecture before completing. For example, a write request first visits a directory to receive ownership of a cache line. The directory forwards requests to the core caching the data, and also invalidates caches who are sharing the data. Invalidated caches must send acknowledgements this domino effect is the typical behaviour in a shared memory architecture, and can significantly change an application s spatial behaviour and should be correctly modelled for realistic traffic generation. Synthetic traffic patterns are useful in revealing bottlenecks in the network while sweeping injection rates to find an NoC s saturation point. However, for researchers looking to design NoCs that are correctly provisioned for the application space and infrastructure of a real system, synthetic traffic patterns cannot be used to make conclusive decisions. 3.2 Trace Simulation Trace simulation first records the injected traffic from a full-system simulation and then replays the traces in an NoC simulator. 
This maintains the time-varying behaviour of the application, and the traces can also include cache coherence information about packets, alleviating the issues faced by synthetic traffic patterns. The main problem with this approach is that it ignores the dependencies that packets exhibit between each other. An over-provisioned NoC will likely deliver packets faster than an under-provisioned one, and the two could produce very different traces for the same application. Traffic depends on how quickly messages are delivered, and is better suited to a control system that can react to ejected packets [21]. To demonstrate this simulation methodology's shortcomings, we take traces from full-system simulations that use an ideal network. The traces are then applied to a mesh network with two virtual channels and adaptive routing. Control packets are 8 bytes while data packets are 72 bytes, and the flit size is 4

¹The number of packets in each figure is unimportant as we focus on source-destination traffic pairs.

bytes (see Chapter 6 for full details regarding our system configuration). Figure 3.4 shows the average packet latency across multiple PARSEC benchmarks. Ideally, trace simulation would perform similarly to full-system simulation; however, this methodology can yield an error as high as 666% (fluidanimate). On average, the average packet latency is off by 234.5%, and the error could be higher for less provisioned networks. Traces can lead us to make invalid assumptions about the performance of our network, and should not be used for design space exploration.

Figure 3.4: Comparison between full-system and trace simulation for average packet latency across a network configuration.

3.3 Traces with Dependencies

Inferring packet dependencies in trace files can improve the fidelity of trace-based simulation, and there has been active research in the area [14, 27]. This provides similar benefits to regular trace simulation, with the added benefit of throttling packets that should not be injected until a previous packet has arrived at its destination. In Netrace, dependencies between architectural components (such as a cache and memory controller) and the cache coherence protocol are inferred to create dependency traces [14]. Dependencies due to program behaviour are not tracked; however, the methodology uses simple in-order cores to alleviate this problem (i.e., memory requests from a core are serial, so no dependency tracking is necessary). Nitta et al. take a different approach, capturing multiple full-system traces on a variety of NoC configurations. By comparing these traces for causality, a packet dependency graph (PDG) can be constructed for a particular application [27]. This approach can track program behaviour dependencies, but it is not always accurate: traces on other NoC configurations can introduce more dependencies to the graph.
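To make the throttling idea concrete, the sketch below (our own illustration, not the actual Netrace or PDG implementation) derives a valid injection order from a packet dependency graph using Kahn's topological sort: a packet becomes eligible for injection only once every packet it depends on has been delivered.

```python
from collections import deque

def replay_order(deps):
    """deps maps a packet id to the ids it waits on (its parents in the PDG).
    Returns a valid injection order; packets with no unmet dependencies are
    eligible first, mimicking a replay engine that throttles dependent packets."""
    waiting = {p: set(parents) for p, parents in deps.items()}
    ready = deque(sorted(p for p, parents in waiting.items() if not parents))
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for q, parents in waiting.items():
            if p in parents:
                parents.discard(p)
                if not parents:  # last dependency satisfied: q may now inject
                    ready.append(q)
    return order  # shorter than len(deps) if the graph contains a cycle
```

For example, with `deps = {1: [], 2: [1], 3: [1], 4: [2, 3]}`, packet 4 cannot inject until both packets 2 and 3 have been delivered, regardless of how fast or slow the simulated network is.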
To evaluate traces with dependencies we employ an approach similar to Netrace, except we use out-of-order cores to keep comparisons throughout this dissertation fair. In addition, in-order cores do not aggressively stress the network because they stall on memory requests; when the processor stalls, no new messages are injected into the network. We compare the fidelity and speed of trace simulations with dependencies in our evaluation (Chapter 6).

3.4 Other Methodologies and Related Work

There are several simulation methodologies available to designers. In this section we present methodologies that improve simulation time and/or allow for many-core (i.e. hundreds or thousands of cores) design space exploration, as well as work that characterizes application behaviour either to better understand the application or to create synthetic benchmarks.

3.4.1 Simulation Acceleration

Simulating small but representative parts of an application run has been widely explored, and two main methodologies exist: SimPoint [34] and SMARTS [42]. In SimPoint, Sherwood et al. capture the time-varying behaviour of programs using Basic Block Vectors (BBVs), which are then clustered (grouped) into phases. Multiple simulation points (hence, SimPoint) can then be inferred from these phases to represent the full execution of the simulation. SMARTS also simulates only parts of an application, but uses statistical sampling to determine which parts to simulate. Both SimPoint and SMARTS are targeted at single-threaded workloads. However, recent work shows that, with some changes, the methodologies can be applied to parallel workloads as well [25, 43]. Sampling on multi-threaded applications has received renewed interest recently [1, 8]. These sampling methodologies have mostly been applied to micro-architectural simulation, so their efficacy for NoC evaluation is currently unknown. User-level simulators exist as an alternative to full-system simulation for exploring thousands of cores [7, 24]. ZSim exploits parallel simulation with out-of-order core models to simulate 300 Million Instructions Per Second (MIPS), compared to roughly 200 KIPS for full-system simulation [32]. However, user-level simulators are designed to work without an operating system (current operating systems do not support thousands of cores), simulating only the user portion of an application.
In addition, several components are not modelled, such as peripheral devices. As a result, many applications cannot be run in the simulation environment. Still, user-level simulators are a strong tool for futuristic design space exploration of thousand-core architectures. FPGA-based acceleration has also been proposed [9, 37]. FIST implements an FPGA-based network simulator that can simulate mesh networks with significant speedup over software simulation [28]. DrNoC is an FPGA framework for design space exploration of NoCs that does not require resynthesis of the NoC when its configuration changes [20]. DrNoC relies on the partial reconfigurability of Xilinx FPGAs, thereby requiring a specialized design flow [39]. The main drawback to FPGA-based simulation is that, while it is fast and accurate, it can be difficult to use [32]. In this dissertation we focus on speeding up simulation time for NoC design space exploration, and some network simulators exist that speed up simulation time. For example, Hornet [29] focuses on parallelizing an NoC simulation and can achieve a 12× speedup. Our work is orthogonal to Hornet because our synthetic traffic generation can be used to drive the network simulation. The key benefit here is that detailed modelling of cores, caches, and other components is not necessary, removing the need for full-system simulation.

3.4.2 Workload Modelling and Synthetic Benchmarks

Cloning can mimic workload behaviour by creating a reduced representation of the code [3, 17]. Much of this work focuses on cloning cache behaviour; our work can be viewed as creating clones of cache coherence behaviour to stimulate the network. Creation of synthetic benchmarks for multi-threaded applications

has been explored [13]; this work generates instruction streams that execute in simulation or on real hardware. Our work differs as we reproduce communication patterns and coherence behaviour while abstracting away the processor and instruction execution. MinneSPEC [19] provides reduced input sets that effectively match the reference input for SPEC2000.

Chapter 4

Time-Varying Application Behaviour

If traffic were monotonous and predictable, NoC design would be simple. Unfortunately, real applications exhibit time-varying behaviour that significantly impacts the packets injected into a network. This chapter introduces previous research on how applications vary with time and demonstrates how to capture this behaviour for both modelling and generating traffic.

4.1 Phase Behaviour in Applications

Previous research shows that applications go through phases [33]. These phases can have a significant impact on the instructions per cycle (IPC), miss rates, and prediction rates of various microarchitectural components. Researchers can, in turn, exploit this phase behaviour to their advantage. One example is SimPoint [34], a methodology that proposes to simulate only small but representative parts of an application to reduce simulation time. Phase behaviour continues to be apparent in parallel applications, and can have significant effects on the time-varying behaviour of traffic generated by an application [43, 16]. This is important, as our methodology will need to capture this phase behaviour if it intends to realistically generate synthetic traffic. The time-varying behaviour of real application traffic makes it difficult to model as a whole. Instead, we propose looking at the data at both a macro- (millions or billions of cycles) and micro- (thousands or hundreds of thousands of cycles) level granularity. At each level, we divide the traffic into fixed-size time intervals (see Definition 1) in a divide-and-conquer approach that makes modelling the data more manageable. Once divided, we can group similar intervals that occur at different times in the application.

Definition 1. An interval (I) is a span of cycles (C) ranging from C_i to C_(i+Δ), such that Δ > 0 and therefore C_i < C_(i+Δ). Any two intervals cannot overlap and there is no gap between subsequent intervals.
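As a minimal sketch (the function names and trace format are our own assumptions), intervals under Definition 1 can be realized by integer-dividing each packet's injection cycle by the interval length Δ, which guarantees fixed-size, non-overlapping, gap-free intervals:

```python
def packets_per_interval(injection_cycles, delta):
    """Count packets injected in each fixed-size interval of delta cycles.
    Interval k covers cycles [k * delta, (k + 1) * delta): intervals cannot
    overlap and leave no gaps, matching Definition 1."""
    counts = {}
    for cycle in injection_cycles:
        k = cycle // delta  # interval id
        counts[k] = counts.get(k, 0) + 1
    return counts
```

With delta = 500000, injection cycles 0, 10, and 499999 all fall in interval 0, while cycle 500000 starts interval 1.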
Figure 4.1 highlights two macro-intervals that would, from a visual standpoint, appear to have similar behaviour. We can analyze each interval by characterizing its traffic behaviour with a Feature Vector (a

vector of elements, or features; an example of a feature could be the injection rate). The same features are used for each interval, but their magnitudes can (and likely will) differ. In the figure we see two feature vectors, V1 and V2, representing the two intervals.

Figure 4.1: A high level view of clustering at a macro-level granularity. Packets injected per time bin (500,000 cycles), with feature vectors V1 and V2 marking two similar macro-intervals.

Once feature vectors have been constructed for each interval, we can mathematically determine those that are similar to each other. One method is to calculate the distance between them. There are many distance measures available; this dissertation focuses on Euclidean distance. By calculating the distance between each pair of feature vectors, we can compile a distance matrix to compare all intervals to each other. From Figure 4.1, if our feature is simply the total number of packets injected (i.e. the y-axis), then the distance between V1 and V2 is small. Now that intervals can be compared using a quantitative metric, we can group them into phases (Definition 2) using a method called clustering. This form of statistical analysis provides a methodology for detecting traffic phases in an application. The efficacy of this methodology relies on the clustering algorithm and feature vector used. There are several clustering algorithms available; however, the feature vector can have the greatest impact on which intervals constitute which phases. Section 4.2 explores feature vectors in more detail, and Section 4.3 discusses how to reproduce these intervals using relevant clustering approaches and Markov chains.

Definition 2. A traffic phase is a group (or cluster) of intervals that behave in a similar manner. Traffic phases are typically recurring, but not necessarily periodic.
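The pipeline just described (feature vectors, pairwise distances, then grouping into phases) can be illustrated with a deliberately simple greedy scheme; the clustering approaches we actually use are discussed in Section 4.3, so treat the threshold rule below as an illustrative placeholder:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

def greedy_phases(vectors, eps):
    """Assign each interval's feature vector to the first phase whose founding
    vector lies within eps; otherwise start a new phase. Returns one phase id
    per interval."""
    founders, labels = [], []
    for v in vectors:
        for i, f in enumerate(founders):
            if euclidean(v, f) <= eps:
                labels.append(i)
                break
        else:  # no existing phase is close enough: this interval founds one
            founders.append(v)
            labels.append(len(founders) - 1)
    return labels
```

With one-dimensional total-injection vectors [(100,), (105,), (30,), (102,)] and eps = 10, the first, second, and fourth intervals form one recurring phase while the third forms its own.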

4.2 Feature Vector Design

The real key to an effective clustering technique comes from defining similar behaviour. What does it mean for one interval to behave similarly to another? The answer to this question will affect the design of the feature vector used. With respect to network traffic, the elements of a feature vector describe communication behaviour. In this section, we introduce and discuss five different feature vectors, and in Chapter 6 we evaluate four of the five:

1. Total Injection - packets injected by all nodes
2. Coherence Composition (not evaluated) - packets divided by coherence type for all nodes
3. Node Injection - packets injected for each source
4. Row-Column Flow - packets injected for a group of source nodes to a group of destination nodes
5. Per-Node Flow - packets injected for each source-destination pair

An important note when designing a feature vector, since we use Euclidean distance, is to ensure all elements are on the same scale (or have the same unit of measurement). The reason for this is evident when calculating the distance between vectors: one element that is on a larger scale than another can eclipse it completely. For example, consider the injection rate of an interval, in the range [0, 1], as one element, and the total number of write requests in an interval, hypothetically in the range [0, 40], as another. Let's observe what happens between two intervals that use both these elements in a feature vector:

Vector = (injection rate, write requests)
A = (0.001, 3)
B = (0.016, 15)

EuclideanDistance = sqrt( sum_{i=1}^{n} (q_i - p_i)^2 )
                  = sqrt( (0.016 - 0.001)^2 + (15 - 3)^2 )
                  = sqrt( 0.015^2 + 12^2 )
                  ≈ 12.00

From the above, we see that even though there is a 16× difference between the injection rates (versus 5× for write requests), the injection rate barely affects the Euclidean distance (the result is similar for Manhattan distance).
Therefore, features of the vector should not greatly differ in their empirically observed magnitudes. In addition to ensuring elements have the same unit of measure, it is sometimes tempting to use large feature vectors with many elements. However, such an approach is not always feasible. For one, there is the so-called curse of dimensionality, where the data required to populate the vectors is insufficient for the size of the vector [4]; moreover, simply parsing the data to construct these feature vectors can take a long time. Feature vectors with a high dimensionality should be reserved for large data sets.
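The eclipse effect worked through above can be reproduced, and one possible remedy applied, in a few lines. The min-max rescaling shown is our own illustrative choice (using the hypothetical feature ranges from the example), not the normalization this dissertation necessarily adopts:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

# The two intervals from the worked example: (injection rate, write requests).
A = (0.001, 3)
B = (0.016, 15)
print(round(euclidean(A, B), 2))  # 12.0 -- the write-request term dominates

# Rescale each feature to [0, 1] using its assumed range before measuring.
RANGES = ((0.0, 1.0), (0.0, 40.0))  # hypothetical ranges from the text
def rescaled(v):
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(v, RANGES))

print(round(euclidean(rescaled(A), rescaled(B)), 3))  # 0.3 -- comparable terms
```

After rescaling, both features contribute on the same order of magnitude, so neither eclipses the other in the distance computation.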

4.2.1 Total Injection

A simple, one-dimensional feature vector that can characterize communication behaviour is the total number of packets injected by all nodes in the network. This feature vector allows us to differentiate between intervals experiencing high, low, or in-between levels of communication. The benefit of such a feature vector is that it is easy to create; calculating the total number of packets in an interval is a simple subtraction operation. In addition, because it is one-dimensional, calculating the distance between vectors and running the clustering algorithm is also fast. The disadvantages of the Total Injection feature vector are rooted in its simplicity. The total number of packets tells us nothing about the spatiality of the traffic behaviour. That is, even though two vectors may have similar magnitudes, their respective intervals could exhibit different spatial behaviour, such as hot spots. In addition, the vector does not tell us what types of messages are being injected. For example, one interval could be issuing several read requests to retrieve data, while another is issuing write requests that invalidate sharers. A one-dimensional vector is better suited to large datasets where constructing complex feature vectors takes too much time and has too much overhead. Because it does not capture several characteristics of communication behaviour, it can only give us a rough view of which intervals may be similar.

4.2.2 Coherence Composition

During full-system simulation of a coherent, shared memory architecture, packets are injected according to a cache coherence protocol. Therefore each packet is associated with a message type (Definition 3) of the protocol. Using this information we can discern between, among other things, read and write phases of an application.

Definition 3. A message type is used to describe the category a packet belongs to.
The category (or type) reveals the reason a packet has been sent to a destination, allowing the cache coherence state machine to react accordingly. Consider intervals at the micro-scale (hundreds of cycles). It is likely that a write request interval is accompanied by several invalidate messages to notify sharers that the cache line has changed. Similarly, a read request interval would not be associated with invalidate messages because the cache line is not updated. Intuitively, we would group write request and read request intervals into two separate phases, and a coherence composition feature vector does this well. Figure 4.2 shows the composition of three different message types (reads, writes, and invalidates) across 25 consecutive intervals (data taken from the Swaptions application). It is easy to see the distinction between a read phase (intervals 2-11) and a write phase (intervals 14-25). During the early part of the write phase (intervals 14-18), several invalidate messages are sent to update sharers on the status of the cache line. The dimensionality of a coherence composition vector can be small. In a cache coherence protocol, there are several message types that rely on others in a one-to-one mapping. For example, invalidate packets are typically followed by acknowledgement packets to complete a handshake. We can therefore exclude acknowledgement packets from the feature vector because they are already encompassed by the

invalidate feature.

Figure 4.2: Number of read, write, and invalidate packets over a 25-interval portion of the Swaptions benchmark.

This example can apply to several message types, and therefore only the message types most representative of cache coherence behaviour need to be used. The advantage of this feature vector is that its dimensionality can be kept low, and also constant according to the protocol being used (i.e. it does not scale with any parameter, such as the number of nodes in the network). It allows us to intuitively distinguish between intervals that exhibit different behaviour according to the cache coherence protocol. However, the feature vector does not include information about the spatial behaviour of the application. It therefore ignores potential hot spot information, which can have a significant impact on performance during simulation. Coherence Composition is most useful at the micro-level to differentiate between short phases that may occur due to short loops and control flow in the code. Coherence Composition can also be useful at the macro level; however, depending on the size and number of intervals it can introduce a lot of overhead, because each packet in the interval needs to be analyzed and sorted by message type.

4.2.3 Node Injection

If we look at the injection distribution across N nodes in the system, we can construct a feature vector that includes the spatial characteristics of application traffic. This feature vector is similar to Total Injection in how the number of packets is counted, but it now scales with N because a dimension exists for each node (Figure 4.3). The spatial injection distribution helps identify injecting hot spots; that is, nodes that send a lot of packets.
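A Node Injection vector of this kind is straightforward to build from an interval's packet records (the (source, destination) trace format here is our own assumption):

```python
def node_injection_vector(packets, num_nodes):
    """Node Injection feature vector sketch: one element per node, counting
    the packets that node injected during the interval (scales with N)."""
    v = [0] * num_nodes
    for src, _dst in packets:  # the destination is deliberately ignored
        v[src] += 1
    return v
```

Note that each packet's destination plays no part in the count, which is exactly why this vector cannot tell us which nodes are communicating with each other, only which nodes communicate more than others.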
But hot spots can also exist at a destination; that is, nodes that receive a lot of packets. Intuitively, a node that sends a lot of packets has likely received a lot of packets as well (following the request-response mantra common in cache coherence protocols, which we will see in Chapter 5). However, the relationship between sent and received messages is lost: a Node Injection feature vector cannot tell us which nodes are communicating with each other; it only tells us which nodes are communicating more than others. At the micro-scale, Node Injection is not ideal because during low-communication phases of an application several nodes may inject zero packets. This would skew distance measures and classify several intervals as similar, even if they had different injecting hot spots. Take, for example, a 3-node

architecture. In one interval (I1), all packets are injected from Node 1 (30 packets). In a second interval (I2), all packets are injected from Node 3 (25 packets). And in a third interval (I3), all packets are injected from Node 2 (28 packets). From the Euclidean distance matrix (Table 4.1) we can see that despite different nodes being responsible for injection, the difference between distances is small (approximately ±2, with similar deltas using Manhattan distance). That is, intervals are considered similar if they have at least a one-node hot spot (in this example).

Figure 4.3: How nodes map to the Node Injection feature vector: <N1, N2, ..., N9>.

        I1      I2      I3
I1       0   39.05   41.04
I2   39.05       0   37.54
I3   41.04   37.54       0

Table 4.1: A distance matrix for three example vectors.

We can improve the scaling of the Node Injection feature vector by looking at rows and columns instead of individual nodes [16]. That is, we observe the total number of packets injected by each row of nodes, and each column of nodes, to construct a feature vector that scales with 2√N. Because we look at both the rows and the columns, there is overlap in the elements of the vector, which can tell us more specifically where the hot spot is (although not as specifically as a per-node vector). Node Injection is better suited to macro-scale clustering, where more packets ensure a more populated feature vector. This way all elements have a magnitude that can influence the distance measure, allowing the clustering algorithm to more accurately group similar intervals.

4.2.4 Row-Column Flow

The Row-Column Flow feature vector captures the spatial behaviour of traffic, as well as the relationship between sent and received messages across rows and columns. Each element of the vector corresponds to the number of packets sent by one group of nodes (a row) to another group of nodes (a column), shown

in Figure 4.4. We use the words row and column to make the vector easier to understand; the actual mapping of nodes onto the network does not have to be grid-like.

Figure 4.4: A visual demonstration of how Row-Column Flow combinations map to a feature vector. In this example, Row 1 and Column 3 make up the third source-destination pair.

For an N-node system, the vector scales at a rate of N, just like the Node Injection vector. However, because Row-Column Flow vectors contain both source and destination information (albeit aggregated into rows and columns), they can more accurately detect hot spots that occur during simulation.

4.2.5 Per-Node Flows

Capturing the relationship between each and every node creates a large feature vector that scales at a rate of N². This method counts the number of packets being sent from each source to each destination, which we define as Per-Node Flows (Definition 4).

Definition 4. A flow is a source-destination pair [16]. In an N-node network, there are N² flows.

The Per-Node Flow feature vector reveals communication behaviour at the finest granularity, and can help identify the exact location of hot spots by not aggregating information across several nodes. Because of its size, the vector should only be used when sufficient data is available to populate each element (otherwise a situation similar to Table 4.1 will occur).

4.2.6 Summary

We have introduced five feature vectors with different advantages and disadvantages depending on the number of nodes and the packet information available (a summary can be found in Table 4.2). The vectors are used to characterize the communication behaviour of an interval (recall Definition 1). In Section 4.3, we discuss a methodology for recreating these intervals during software simulation. In our evaluation (Chapter 6), we will compare the Total, Node, Row-Column Flow, and Per-Node Flow vectors.
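As a companion sketch for the two flow-based vectors (assuming a square √N × √N node layout, a simple (source, destination) trace format, and a row-major pairing of source rows with destination columns — all our own illustrative assumptions):

```python
import math

def per_node_flow_vector(packets, num_nodes):
    """Per-Node Flow (Definition 4): one element per source-destination pair,
    N^2 elements in total."""
    v = [0] * (num_nodes * num_nodes)
    for src, dst in packets:
        v[src * num_nodes + dst] += 1
    return v

def row_column_flow_vector(packets, num_nodes):
    """Row-Column Flow: aggregate flows by (source row, destination column),
    giving N elements for a square layout of sqrt(N) rows and columns."""
    k = math.isqrt(num_nodes)  # nodes per row (and per column)
    v = [0] * (k * k)
    for src, dst in packets:
        v[(src // k) * k + (dst % k)] += 1
    return v
```

For a 16-node network, Per-Node Flow yields 256 elements while Row-Column Flow yields 16, trading hot-spot precision for a vector that is easier to populate with limited data.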
We omit Coherence Composition because, in our experiments, it did not accurately