Synthetic Traffic Models that Capture Cache Coherent Behaviour. Mario Badr


Synthetic Traffic Models that Capture Cache Coherent Behaviour

by

Mario Badr

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2014 by Mario Badr

Abstract

Synthetic Traffic Models that Capture Cache Coherent Behaviour
Mario Badr
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014

Modern and future many-core systems represent large and complex architectures. The communication fabrics in these large systems play an important role in their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased complexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that trade off some accuracy but maintain important workload trends for faster simulation times are highly beneficial at early stages of architectural exploration. We propose a synthetic traffic generation methodology that captures both application behaviour and cache coherence traffic to rapidly evaluate NoCs. This allows designers to quickly run detailed performance simulations without the cost of long-running full-system simulation, while still capturing a full range of application and coherence behaviour. Our methodology has an average (geometric mean) error of 10.9% relative to full-system simulation, and provides a 50× speedup on average over full-system simulation.

Contents

1 Introduction
   Exploring the NoC Design Space
   A High-Level Approach
   Thesis Organization and Contributions
2 Background
   NoC Primer
   Topology
   Routing and Flow Control
   Performance
3 Simulation Methodologies
   Synthetic Traffic Patterns
   Trace Simulation
   Traces with Dependencies
   Other Methodologies and Related Work
      Simulation Acceleration
      Workload Modelling and Synthetic Benchmarks
4 Time-Varying Application Behaviour
   Phase Behaviour in Applications
   Feature Vector Design
      Total Injection
      Coherence Composition
      Node Injection
      Row-Column Flow
      Per-Node Flows
      Summary
   The Injection Process
      Macro Scale
      Micro Scale
      Phase Transitions
      Summary
5 Synthesizing Traffic
   Overview
   Initiating Packets
   Reactive Packets
   Forwarding versus Off-Chip Invalidates
   Summary
6 Evaluation and Results
   System Configuration
   Model Exploration
      Macro Phases
      Congestion at the Micro Level
      Time Interval Size
      Parameter Recommendations
   NoC Performance Evaluation
   Exploiting Markov Chains for Speedup
7 Conclusion and Future Work
Bibliography

Chapter 1

Introduction

As uniprocessors are now limited by power and heat constraints, the architecture community has been investing increasing research and development effort into multi- and many-core processors. As a result, the design space has grown larger, and more complex trade-offs are associated with processor designs. To accurately evaluate candidate architectures, we can model each component on- and off-chip and perform full-system simulation. These full-system simulations can provide high-fidelity performance metrics before synthesis and prototyping of the actual design.

An important part of this larger design space is the communication fabric used to connect the cores on a single die. In particular, Networks-on-Chip (NoCs) have been proposed as a modular and scalable fabric that can facilitate communication in multi- and many-core chips [10]. Applications targeted for the system will impose different bandwidth requirements on the interconnect, and NoCs need to be provisioned for performance whilst meeting power and area cost constraints. But NoCs themselves have their own large design space, and modelling each component via full-system simulation is time-consuming [24].

Simulation efforts are being strained in part because of the large number of system components that must be modelled. Detailed models of processors (possibly multiple heterogeneous ones), caches, DRAM, and networks are critical for accurate performance results, not to mention the need to run many large applications and fully model OS behaviour. However, full-system simulation is still appealing because of its fidelity. Designers who are willing to sacrifice some fidelity, whether for time-to-market reasons or for design space exploration, have other software simulation methodologies available to them. Trace simulations are prevalent in a number of domains: relevant information is recorded during a single full-system run and is then replayed on a specific component.
For NoCs, synthetic traffic patterns can quickly reveal bottlenecks in the design. Software simulation methods that aim to maintain fidelity while accelerating long simulations also exist, for example sampling [1, 8]. However, outside of full-system simulation, several methodologies fail to accurately capture OS or cache coherence traffic, which can have significant effects on the performance of the system [23]. A simulation methodology that captures this behaviour while quickly evaluating NoC designs is needed.

1.1 Exploring the NoC Design Space

There are several parameters that can affect the traffic a communication fabric needs to support (the application, cache hierarchy, and coherence protocol, to name a few). Software simulation that models all of these components takes time. A methodology that allows for NoC design space exploration needs to abstract away these parameters so that researchers can focus on an NoC's infrastructure and communication paradigms.

To better understand this abstraction, we present a generic design flow in Figure 1.1. Before any design can begin, the question "What are the target applications for the system?" needs to be answered. This is done during the Application Modelling phase (top left of the figure), and the design space of applications can be vast [5], especially for general purpose processors. Once the application space has been determined, the corresponding threads are mapped and scheduled onto different processor cores (bottom left). The microarchitecture of these cores is another rich design space with several parameters. Finally, these cores along with their caches are connected to the communication fabric via routers, and the routers are in turn connected to other routers via links and buffers according to some topology (right side of the figure) to make up the NoC.

Figure 1.1: A generic design flow, with emphasis on NoCs.

In order to abstract away the other design spaces, our methodology looks only at the traffic injected into the routers. This means that interactions between components (such as between a core and its cache) are not modelled, allowing those designs to be varied independently. Another advantage is simplicity: we seek to synthetically inject packets into the network without modelling the complexities of each individual component, as full-system simulation would.
1.2 A High-Level Approach

As technology scales and machines evolve, new applications and workloads that take advantage of the hardware follow. As a result, older benchmark suites become less relevant, and new benchmark suites

emerge to explore new algorithms and workloads [5, 12]. Because applications and the systems they run on are constantly changing, our work focuses on creating a generic, flexible, and fast methodology that can capture the communication behaviour of any application on any system.

Figure 1.2: A high-level overview of our methodology for design space exploration. The right of the dashed line shows examples of inputs or outputs of the different components.

Figure 1.2 shows the high-level approach of our methodology. We begin by performing a full-system simulation on an ideal network (i.e. the network does not model congestion; packets arrive at their destination in uniform, single-cycle time). By using an ideal network, we ensure that our traffic modelling does not capture any aspects of a specific NoC configuration. Ideal networks also make it easier to understand traffic behaviour because each packet has a single-cycle latency. Once we have the traffic generated by full-system simulation on an ideal network, our traffic modelling can extract the spatial and temporal characteristics of the traffic such that there is sufficient information to recreate the traffic synthetically; we use the term synthetic because the traffic is artificial (that is, it is not produced by a full-system simulation). The models we create need to provide enough information to recreate the traffic synthetically. For example, a model that looks only at the average hop count for its spatial characteristic will know how far to send a packet, but not which destination.
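To make the extraction step concrete, the sketch below records each packet observed on the ideal network as a (cycle, source, destination, type) tuple and summarizes its spatial and temporal characteristics. The schema and function names are illustrative, not the actual tooling used in this thesis.

```python
from collections import namedtuple

# One record per packet observed on the ideal network (schema is illustrative).
Packet = namedtuple("Packet", ["cycle", "src", "dst", "msg_type"])

def extract_features(trace):
    """Summarize the spatial and temporal characteristics of a trace:
    total packets, packets per (src, dst) pair, and packets per cycle."""
    per_pair, per_cycle = {}, {}
    for p in trace:
        per_pair[(p.src, p.dst)] = per_pair.get((p.src, p.dst), 0) + 1
        per_cycle[p.cycle] = per_cycle.get(p.cycle, 0) + 1
    return len(trace), per_pair, per_cycle

# A tiny hand-made trace: two requests in cycle 0, one reply in cycle 2.
trace = [Packet(0, 0, 3, "ReadReq"), Packet(0, 5, 3, "ReadReq"), Packet(2, 3, 0, "Data")]
total, per_pair, per_cycle = extract_features(trace)
```

A model built on richer versions of these summaries (rather than a single average such as hop count) retains enough information to pick concrete destinations when regenerating traffic.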
Finally, the model is applied to a traffic generator that recreates the traffic in a fashion similar to full-system simulation, but without the complex modelling of specific components (e.g. caches, off-chip memory, cores). This improves simulation time, and setting up the traffic generator to drive an NoC simulator is easier than setting up a full system simulator. There are many different ways to model and generate traffic. In this dissertation, we analyze several models and parameters that explore the spatial and temporal characteristics of real application traffic

and how they affect the performance of different NoCs.

1.3 Thesis Organization and Contributions

The focus of this dissertation is to propose a simulation methodology that can assess various NoC designs without the need for full-system simulation. We provide a brief primer on NoCs in Chapter 2. In Chapter 3, we introduce simulation methodologies already available to designers. In this thesis we pay special attention to NoC performance, which is greatly affected by the traffic generated from an application run. Therefore, an important first step is to understand the behaviour of application traffic and develop models that capture its spatial and temporal characteristics. In Chapter 4, we discuss a methodology that captures the time-varying behaviour of the application. We apply a two-level hierarchical divide-and-conquer approach that splits application traffic into intervals. Chapter 5 then looks at how to reproduce intervals at the lowest level of our hierarchy to resemble traffic generated by a real application on a shared memory architecture. In Chapter 6, we explore how the parameters of our model can be changed to improve the fidelity of our methodology. We then apply our recommended parameters to a variety of applications from the PARSEC [5] and SPLASH-2 [41] benchmark suites, and compare NoC performance and simulation time to full-system simulation. The main contribution of this dissertation is a novel methodology that:

1. Generates bursty traffic similar to what would be seen from applications run on a full-system simulator. Burstiness can have a major impact on NoC performance because the network must accommodate a large number of packets in a short amount of time.

2. Produces traffic that resembles a real cache coherence protocol.
Cache coherence is a crucial component of current and future designs [23], and can impact traffic patterns, as we will see in Chapter 3.

3. Can be applied to multiple NoC designs without loss of fidelity. Designers often need to explore several different designs to optimize the NoC for their applications.

4. Provides significant speedup over full-system simulation. Time-to-market constraints and the aforementioned need for design space exploration mean that simulations must be fast as well as accurate.

Chapter 2

Background

Efficient on-chip communication is essential for the performance of CMPs, and NoCs have become the de facto interconnect for future many-core chips [22]. In this chapter, we present a brief primer on the basics of an NoC and how to evaluate its performance.

2.1 NoC Primer

NoCs provide a scalable, high-bandwidth alternative to bus-based and point-to-point interconnects. This is, in large part, thanks to the modular approach of connecting nodes to routers, depicted in Figure 2.1. A network node can contain several components, such as a processor core or a cache, and sends messages to other nodes via a router. In this chapter, we discuss:

1. Topology: how the routers are connected to each other
2. Routing & Flow Control: how messages are transported in the NoC
3. Performance: how NoC performance is measured

Figure 2.1: Nodes (squares) connected to routers (circles). Nodes can contain several components, such as cores (C), caches ($), and directories (D).

Topology

An NoC's topology can have a significant impact on its performance because it defines the paths available between nodes through routers [11]. Figure 2.2 shows one example of a 9-node ring topology, where the circles are nodes with routers and the lines connecting them are channels. In order for a message to travel from Node 0 to Node 3, it must travel through the routers at Nodes 1 and 2 as well. This means messages going from Node 0 to Node 3 take three hops to arrive at their destination. Conversely, the mesh network shown in Figure 2.3 allows messages travelling from Node 0 to Node 3 to take only one hop. Assuming perfect routing and flow control, the topology gives designers an upper bound on the NoC's performance.

Figure 2.2: A 9-node ring network topology.

Figure 2.3: A 9-node mesh network topology.

2.2 Routing and Flow Control

Messages move through the network according to a routing algorithm, which defines the path that can be taken by a message to its destination. Dimension-Order Routing (DOR) is a deterministic routing algorithm that defines a minimal route (i.e. shortest path) between a source and its destination. The advantage of deterministic routing algorithms is that they are easy to implement; however, these paths do not consider congestion in the NoC (Figure 2.4). Adaptive routing algorithms allow messages to take alternate routes if the current path is too congested. This is analogous to driving downtown in a busy city: drivers are more likely to look for alternate roads to their destination rather than wait in traffic. Figure 2.5 shows how a packet using adaptive routing can completely avoid congested channels and take an alternate path, arriving at its destination earlier. From an NoC perspective, adaptive routing algorithms can improve performance by increasing path diversity through the network.
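The hop counts quoted above can be reproduced with two small helper functions; the row-major numbering of the 3x3 mesh is an assumption on our part, chosen to match the example.

```python
def mesh_hops(src, dst, width=3):
    """Minimal hop count in a width x width mesh, assuming row-major node
    numbering (an assumption; the figures may number nodes differently)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

def ring_hops(src, dst, n=9):
    """Minimal hop count in an n-node bidirectional ring."""
    d = abs(src - dst)
    return min(d, n - d)

# Node 0 to Node 3: three hops on the 9-node ring, one hop on the 3x3 mesh.
ring_example = ring_hops(0, 3)   # 3
mesh_example = mesh_hops(0, 3)   # 1
```

These minimal distances are exactly the "upper bound" the topology provides: real latency can only be worse once routing and flow control introduce contention.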
Messages (or packets) sent by nodes into the NoC can hold varying amounts of data, which are further discretized into flits. Flits serialize a packet so that it can traverse the network according to the bandwidth allowed by the channel. Flits are routed through different network resources, such as router buffers and virtual channels (VCs). Each channel in the network has multiple virtual channels associated with a router port. Multiple VCs help improve link utilization because, if one packet stalls on its way to its destination (due to contention for a network resource), other packets can continue to route through the same path using a different VC. A high number of VCs can increase the bandwidth capabilities of an NoC; however, VCs are also expensive in terms of area and power [11].
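Serializing a packet into flits is a simple ceiling division. As an example, with the 4-byte flits and 8-byte control / 72-byte data packets used later in this dissertation:

```python
import math

def num_flits(packet_bytes, flit_bytes=4):
    """Number of flits needed to serialize a packet over a channel that
    carries flit_bytes per flit."""
    return math.ceil(packet_bytes / flit_bytes)

control_flits = num_flits(8)   # 8-byte control packet -> 2 flits
data_flits = num_flits(72)     # 72-byte data packet   -> 18 flits
```

The 9:1 ratio between data and control flits is one reason coherence data replies load the network far more than the requests that trigger them.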

Figure 2.4: A packet routed from Node 0 to Node 7 using Dimension-Order Routing. Note the congestion between nodes 3, 6, and 7.

Figure 2.5: A packet routed from Node 0 to Node 7 using adaptive routing, avoiding the congestion between nodes 3, 6, and 7.

2.3 Performance

The main performance metric associated with NoC design is packet latency: the time it takes for a packet to arrive at its destination from its source node. The latency of a packet can vary when we consider contention in the network; as the NoC becomes congested, its resources fill up and packets must wait before they can continue to traverse the network. Waiting (or stalling) for network resources can dramatically increase the latency of a packet and severely hinder network performance. The average packet latency is a common metric used for a quick evaluation of NoC designs; however, it is also informative to consider the packet latency distribution. Similar average packet latencies can be achieved with very different distributions. For example, a Gaussian distribution and a bi-modal distribution can be constructed to give the same average, yet the bi-modal distribution has a large number of packets with very high and very low latencies. As a result, NoC designers would provision the network differently in order to improve network performance, taking into consideration the high bandwidth requirements for a given application. Conversely, a Gaussian distribution implies that the traffic is more easily manageable and not as bursty (i.e. without a large number of packets injected into the network over a short time). A common pitfall in measuring packet latency is disregarding the source queue. When a packet is to be sent into the network, it is possible that the input port to the network is busy. As a result, the source must delay injection of the packet until the input is ready [11].
This situation is common during bursty injection, and can have a significant impact on packet latency when the network is congested. In this dissertation, packet latency includes the source queue.
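The point that averages hide distribution shape is easy to see numerically; the latency values below are purely illustrative, not measured data.

```python
# Two latency samples with identical means but very different shapes.
gaussian_like = [28, 29, 30, 31, 32]   # tightly clustered around the mean
bimodal = [10, 10, 30, 50, 50]         # many very low and very high latencies

def mean(xs):
    return sum(xs) / len(xs)

same_average = mean(gaussian_like) == mean(bimodal) == 30.0  # True
```

Both samples average 30 cycles, but only the bimodal one has the heavy tail that would force a designer to provision for bursts.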

Chapter 3

Simulation Methodologies

There are several simulation methodologies available to researchers when evaluating NoCs. In this chapter, we explore two software simulation methodologies (synthetic traffic patterns and traces) and discuss recent advancements in the area of software simulation and other related work.

3.1 Synthetic Traffic Patterns

Synthetic traffic patterns such as uniform random, permutation, tornado, etc. are widely used in NoC research. Many of these traditional synthetic traffic patterns are based on the communication pattern of specific applications. For example, transpose traffic is based on a matrix transpose application, and the shuffle permutation is derived from Fast Fourier Transforms (FFTs) [2, 11]. However, these synthetic traffic patterns are not representative of the wide range of applications that run on current and future CMPs. Even if these traffic patterns were representative, the configuration of a cache-coherent system can mask or destroy the inherent communication pattern of the original algorithm due to the presence of indirections and control messages. Synthetic traffic patterns are typically applied to an NoC with a fixed injection rate. In this section we demonstrate two methods for selecting an appropriate synthetic traffic pattern to explore an NoC design. The first method uses the injection rate from a real application, Facesim, and sweeps several traffic patterns. The second method uses a synthetic traffic pattern, Shuffle, that is historically based on a real application (FFT). In both cases, we will see that current synthetic traffic patterns are not characteristic of the modern application landscape. The Facesim benchmark computes a realistic animation of a face through physics simulation. When run on an ideal network, it has an average injection rate of 0.16 packets per cycle. We apply this injection rate using a variety of synthetic traffic patterns that are used in the evaluation of NoCs [11].
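As a sketch of how such experiments are driven, the snippet below pairs a fixed-rate (Bernoulli) injection process, using Facesim's 0.16 packets-per-cycle average, with the shuffle permutation (a one-bit left rotation of the source address). The function names are ours, not those of any particular simulator.

```python
import random

def shuffle_dest(src, bits=4):
    """Shuffle permutation: rotate the source address left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) & mask) | ((src >> (bits - 1)) & 1)

def inject(rate, cycles, seed=0):
    """Fixed-rate (Bernoulli) injection: each cycle, inject a packet with
    probability `rate`. Returns the cycles at which packets were injected."""
    rng = random.Random(seed)
    return [c for c in range(cycles) if rng.random() < rate]

injections = inject(0.16, 100_000)         # Facesim's average injection rate
observed_rate = len(injections) / 100_000  # converges to 0.16 for long runs
```

Note what this open-loop process cannot express: bursts, phase changes, and the request-reply dependencies of a coherence protocol, which is precisely the gap the following examples expose.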
Figure 3.1 shows the average packet latency, yielding an average error of 23.64% (our system configuration and methodology are explained in Chapter 6). Looking just at average packet latency, one could conclude that the Bit Complement traffic pattern best approximates the Facesim benchmark. However, average packet latency tells us little about the congestion in the network. Bit Complement is a subset of permutation traffic that is typically used to stress NoC configurations [11]. Specifically, given a 4-bit source address S_x = {s_3, s_2, s_1, s_0}, the destination is computed as the complement of each bit: D_x = {¬s_3, ¬s_2, ¬s_1, ¬s_0}. Figure 3.2 shows the packet latency

distributions for Facesim and Bit Complement (as a percentage of packets injected). Bit Complement has three distinct packet latencies (17, 27, and 37 cycles), whereas Facesim's packet latencies are more evenly distributed due to time-varying behaviour in the simulation. Bit Complement may produce the same average behaviour as Facesim (with regard to packet latency), but it does not provide bursts of traffic as one would expect from a real application.

Figure 3.1: Average packet latency of different traffic patterns on an NoC. The dashed line is the average packet latency of the Facesim benchmark.

Figure 3.2: Packet latency distributions for Facesim and Bit Complement.

Synthetic traffic pattern simulations can be made more complicated using varying injection rates per node, modulated Markov processes for injection, etc. [11]. However, they are not representative of real application traffic due to the shared memory architectures that are typically modelled in full-system simulation. The arrangement of cores, caches, directories, and memory controllers directly influences the flow of communication when running an application. To illustrate this point, we compare a synthetic shuffle pattern with the FFT benchmark from SPLASH-2 [41]. The shuffle pattern is a bit permutation where the destination is calculated via the function d_i = s_((i-1) mod b), where b is the number of bits required to represent the nodes of the network [11]. FFT is run in full-system simulation while shuffle is run in network-only simulation. Figure 3.3 shows the number

of packets sent from a source to a destination¹. In Figure 3.3b, we see notable destination hot spots at nodes 0, 2, and 5, as well as source hot spots at nodes 0 and 5. However, Figure 3.3a shows hot spots only for specific source-destination pairs. The best NoC design for the traffic in Figure 3.3a is unlikely to be the best NoC for the traffic in Figure 3.3b.

Figure 3.3: A comparison of the spatial behaviour between synthetic and real application traffic: (a) shuffle traffic pattern; (b) FFT application.

The sharp contrast in Figure 3.3 is due to coherence transactions needing to visit several nodes in a shared memory architecture before completing. For example, a write request first visits a directory to receive ownership of a cache line. The directory forwards the request to the core caching the data, and also invalidates caches that are sharing the data. Invalidated caches must send acknowledgements. This domino effect is the typical behaviour in a shared memory architecture; it can significantly change an application's spatial behaviour and should be correctly modelled for realistic traffic generation.

Synthetic traffic patterns are useful for revealing bottlenecks in the network while sweeping injection rates to find an NoC's saturation point. However, for researchers looking to design NoCs that are correctly provisioned for the application space and infrastructure of a real system, synthetic traffic patterns cannot be used to make conclusive decisions.

3.2 Trace Simulation

Trace simulation first records the injected traffic from a full-system simulation and then replays the traces in an NoC simulator. This maintains the time-varying behaviour of the application, and traces can also include cache coherence information about packets, alleviating the issues faced by synthetic traffic patterns.
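The domino effect described above can be sketched as a message-expansion function. The node roles and message names below illustrate a generic directory protocol, not the exact protocol simulated in this thesis.

```python
def write_miss_messages(requester, directory, owner, sharers):
    """Expand one write request into the chain of coherence messages it
    triggers in a generic directory-based protocol (illustrative names)."""
    msgs = [(requester, directory, "WriteReq")]
    if owner is not None:
        msgs.append((directory, owner, "FwdReq"))    # forward to the owner
        msgs.append((owner, requester, "Data"))      # owner supplies the data
    for s in sharers:
        msgs.append((directory, s, "Inv"))           # invalidate each sharer
        msgs.append((s, requester, "InvAck"))        # sharer acknowledges
    return msgs

# One logical write touching a directory, an owner, and two sharers
# expands into seven network messages across five nodes.
msgs = write_miss_messages(requester=0, directory=4, owner=7, sharers=[2, 5])
```

A single write thus spreads traffic across many source-destination pairs the original algorithm never communicates between, which is why the FFT heat map looks nothing like the shuffle permutation.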
The main problem with this approach is that it ignores the dependencies packets exhibit between each other. An over-provisioned NoC will likely deliver packets faster than an under-provisioned one, and the two could produce very different traces for the same application. Traffic depends on how quickly messages are delivered, and is better suited to a closed-loop system that can react to ejected packets [21]. To demonstrate this simulation methodology's shortcomings, we take traces from full-system simulations that use an ideal network. The traces are then applied to a mesh network with two virtual channels and adaptive routing. Control packets are 8 bytes while data packets are 72 bytes, and the flit size is 4 bytes (see Chapter 6 for full details regarding our system configuration).

¹The number of packets in each figure is unimportant, as we focus on source-destination traffic pairs.

Figure 3.4 shows the average packet latency across multiple PARSEC benchmarks. Ideally, the trace simulation would have similar performance to the full-system simulation; however, this methodology can yield an error as high as 666% (fluidanimate). On average, the average packet latency is off by 234.5%, and the error could be higher for less provisioned networks. Traces can lead us to make invalid assumptions about the performance of our network, and should not be used for design space exploration.

Figure 3.4: Comparison between full-system and trace simulation for average packet latency across a network configuration.

3.3 Traces with Dependencies

Inferring packet dependencies in trace files can improve the fidelity of trace-based simulation, and there has been active research in the area [14, 27]. This provides similar benefits to regular trace simulations, with the added benefit of throttling packets that should not be injected until a previous packet has arrived at its destination. In Netrace, dependencies between architectural components (such as a cache and memory controller) and the cache coherence protocol are inferred to create dependency traces [14]. Dependencies due to program behaviour are not tracked; however, the methodology uses simple in-order cores to alleviate this problem (i.e. memory requests from a core are serial, therefore no dependency tracking is necessary). Nitta et al. use a different approach, taking multiple full-system traces on a variety of NoC configurations. By comparing these traces for causality, a packet dependency graph (PDG) can be constructed for a particular application [27]. This approach can track program behaviour dependencies; however, it is not always accurate. Traces on other NoC configurations can introduce more dependencies to the graph.
To evaluate traces with dependencies we employ an approach similar to Netrace, except we use out-of-order cores to keep comparisons throughout this dissertation fair. In addition, in-order cores do not aggressively stress the network due to stalls for memory requests: when the processor stalls, no new messages are injected into the network. We compare the fidelity and speed of trace simulations with dependencies in our evaluation (Chapter 6).
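A minimal sketch of dependency-throttled replay, assuming a trace that maps each packet id to the packets it depends on (the schema is ours, not Netrace's):

```python
def replay_order(trace):
    """Dependency-throttled replay order: a packet becomes eligible for
    injection only once every packet it depends on has been delivered.
    `trace` maps a packet id to the list of packet ids it depends on."""
    delivered, order = set(), []
    pending = dict(trace)
    while pending:
        ready = sorted(p for p, deps in pending.items()
                       if all(d in delivered for d in deps))
        if not ready:
            raise ValueError("cyclic dependencies in trace")
        for p in ready:
            order.append(p)
            delivered.add(p)
            del pending[p]
    return order

# A write (0) whose invalidation (1) must be acknowledged (2)
# before the final data reply (3) may be injected.
order = replay_order({0: [], 1: [0], 2: [1], 3: [2]})
```

A real replayer would additionally wait for each parent's actual ejection time in the simulated NoC, so a slow network naturally delays its dependants; that feedback is exactly what plain traces lack.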

3.4 Other Methodologies and Related Work

There are several simulation methodologies available to designers. In this section we present methodologies that improve simulation time and/or allow for many-core (i.e. hundreds or thousands of cores) design space exploration, as well as work that characterizes application behaviour either to better understand the application or to create synthetic benchmarks.

3.4.1 Simulation Acceleration

Simulating small but representative parts of an application run has been widely explored, and two main methodologies exist: SimPoint [34] and SMARTS [42]. In SimPoint, Sherwood et al. capture the time-varying behaviour of programs using Basic Block Vectors (BBVs), which are then clustered (grouped) into phases. Multiple simulation points (hence, SimPoint) can then be inferred from these phases to represent the full execution of the simulation. SMARTS also simulates only parts of an application, but uses statistical sampling to determine which parts to simulate. Both SimPoint and SMARTS are targeted at single-threaded workloads. However, recent work shows that, with some changes, the methodologies can be applied to parallel workloads as well [25, 43]. Sampling of multi-threaded applications has received renewed interest recently [1, 8]. These sampling methodologies have mostly been applied to micro-architectural simulation, so their efficacy for NoC evaluation is currently unknown. User-level simulators exist as an alternative to full-system simulation for exploring thousands of cores [7, 24]. ZSim exploits parallel simulation with out-of-order core models to simulate 300 Million Instructions Per Second (MIPS) (compared to roughly 200 KIPS for full-system simulation) [32]. However, user-level simulators are designed to work without an operating system (current operating systems do not support thousands of cores), simulating only the user portion of an application.
In addition, several components are not modelled, such as peripheral devices. As a result, many applications cannot be run in the simulation environment. Still, user-level simulators are a strong tool for futuristic design space exploration of thousand-core architectures. FPGA-based acceleration has also been proposed [9, 37]. FIST implements an FPGA-based network simulator that can simulate mesh networks with significant speedup over software simulation [28]. DrNoC is an FPGA framework for design space exploration of NoCs that does not require resynthesis of the NoC when its configuration changes [20]. DrNoC relies on the partial reconfigurability of Xilinx FPGAs, thereby requiring a specialized design flow [39]. The main drawback of FPGA-based simulation is that, while fast and accurate, it can be difficult to use [32]. In this dissertation we focus on speeding up simulation time for NoC design space exploration, and some network simulators exist that speed up simulation time. For example, Hornet [29] focuses on parallelizing an NoC simulation and can achieve a 12× speedup. Our work is orthogonal to Hornet because we can use our synthetic traffic generation to drive the network simulation. The key benefit here is that detailed modelling of cores, caches, and other components is not necessary, removing the need for full-system simulation.

3.4.2 Workload Modelling and Synthetic Benchmarks

Cloning can mimic workload behaviour by creating a reduced representation of the code [3, 17]. Much of this work focuses on cloning cache behaviour; our work can be viewed as creating clones of cache coherence behaviour to stimulate the network. Creation of synthetic benchmarks for multi-threaded applications

has been explored [13]; this work generates instruction streams that execute in simulation or on real hardware. Our work differs in that we reproduce communication patterns and coherence behaviour while abstracting away the processor and instruction execution. MinneSPEC [19] provides reduced input sets that effectively match the reference inputs for SPEC2000.

Chapter 4

Time-Varying Application Behaviour

If traffic were monotonous and predictable, NoC design would be simple. Unfortunately, real applications exhibit time-varying behaviour that significantly impacts the packets injected into a network. This chapter introduces previous research on how applications vary with time and demonstrates how to capture this behaviour for both modelling and generating traffic.

4.1 Phase Behaviour in Applications

Previous research shows that applications go through phases [33]. These phases can have a significant impact on the instructions per cycle (IPC), miss rates, and prediction rates of various microarchitectural components. Researchers can, in turn, exploit this phase behaviour to their advantage. One example is SimPoint [34], a methodology that simulates only small but representative parts of an application to reduce simulation time. Phase behaviour continues to be apparent in parallel applications, and can have significant effects on the time-varying behaviour of traffic generated by an application [43, 16]. This is important, as our methodology must capture this phase behaviour if it is to generate realistic synthetic traffic.

The time-varying behaviour of real application traffic makes it difficult to model as a whole. Instead, we propose looking at the data at both a macro-level (millions or billions of cycles) and a micro-level (thousands or hundreds of thousands of cycles) granularity. At each level, we divide the traffic into fixed-size time intervals (see Definition 1) in a divide-and-conquer approach that makes modelling the data more manageable. Once divided, we can group similar intervals that occur at different times in the application.

Definition 1. An interval (I) is a span of cycles ranging from C_i to C_{i+Δ}, such that Δ > 0 and therefore C_i < C_{i+Δ}. Any two intervals cannot overlap, and there is no gap between subsequent intervals.
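Definition 1 can be sketched directly in code. The trace record format used here, (cycle, source, destination), is an assumption for illustration rather than the data format used in our experiments:

```python
from collections import defaultdict

def split_into_intervals(trace, delta):
    """Divide an injection trace into fixed-size, non-overlapping intervals.

    trace: iterable of (cycle, src, dst) packet records
    delta: interval length in cycles (Delta in Definition 1)

    Interval i covers cycles [i * delta, (i + 1) * delta), so intervals
    never overlap and leave no gaps between one another.
    """
    intervals = defaultdict(list)
    for cycle, src, dst in trace:
        intervals[cycle // delta].append((cycle, src, dst))
    return dict(intervals)

# A toy trace: packets at cycles 10 and 480 fall in interval 0,
# packets at cycles 510 and 999 fall in interval 1.
trace = [(10, 0, 1), (480, 2, 3), (510, 0, 3), (999, 1, 0)]
by_interval = split_into_intervals(trace, delta=500)
```

Each interval's packet list can then be summarized by a feature vector, as described in the remainder of this chapter.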
Figure 4.1 highlights two macro-intervals that would, from a visual standpoint, appear to have similar behaviour. We can analyze each interval by characterizing its traffic behaviour with a Feature Vector (a vector of elements, or features; an example of a feature could be the injection rate). The same features are used for each interval, but their magnitudes can (and likely will) differ. In the figure we see two feature vectors, V_1 and V_2, representing the two intervals.

Figure 4.1: A high level view of clustering at a macro-level granularity (y-axis: packets injected; x-axis: time bins of 500,000 cycles).

Once feature vectors have been constructed for each interval, we can mathematically determine those that are similar to each other. One method is to calculate the distance between them. There are many distance measures available, and this dissertation focuses on Euclidean distance. By calculating the distance between each pair of feature vectors, we can compile a distance matrix to compare all intervals to each other. From Figure 4.1, if our feature is simply the total number of packets injected (i.e. the y-axis), then the distance between V_1 and V_2 is small.

Now that intervals can be compared using a quantitative metric, we can group them into phases (Definition 2) using clustering. This form of statistical analysis provides a methodology for detecting traffic phases in an application. The efficacy of this methodology depends on both the clustering algorithm and the feature vector used. There are several clustering algorithms available; however, the feature vector can have the greatest impact on which intervals constitute which phases. Section 4.2 explores feature vectors in more detail, and Section 4.3 discusses how to reproduce these intervals using relevant clustering approaches and Markov chains.

Definition 2. A traffic phase is a group (or cluster) of intervals that behave in a similar manner. Traffic phases are typically recurring, but not necessarily periodic.
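The pairwise comparison described above can be sketched in a few lines. The one-element Total Injection feature and the sample values below are illustrative assumptions:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def distance_matrix(vectors):
    """Pairwise distances between every interval's feature vector."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

# Four intervals characterized by a one-element (total injection) vector:
features = [(1200,), (1180,), (300,), (310,)]
dm = distance_matrix(features)
# Intervals 0 and 1 are close (distance 20), as are 2 and 3 (distance 10),
# so a clustering algorithm would group them into two traffic phases.
```

A clustering algorithm operating on this matrix would place intervals 0 and 1 in one phase and intervals 2 and 3 in another.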

4.2 Feature Vector Design

The real key to an effective clustering technique comes from defining similar behaviour. What does it mean for one interval to behave similarly to another? The answer to this question affects the design of the feature vector used. With respect to network traffic, the elements of a feature vector describe communication behaviour. In this section, we introduce and discuss five different feature vectors, and in Chapter 6 we evaluate four of the five:

1. Total Injection - packets injected by all nodes
2. Coherence Composition (not evaluated) - packets divided by coherence type for all nodes
3. Node Injection - packets injected for each source
4. Row-Column Flow - packets injected for a group of source nodes to a group of destination nodes
5. Per-Node Flow - packets injected for each source-destination pair

An important note when designing a feature vector, since we use Euclidean distance, is to ensure all elements are on the same scale (or have the same unit of measurement). The reason for this is evident when calculating the distance between vectors: one element that is on a larger scale than another can eclipse it completely. For example, consider the injection rate of an interval, in the range [0, 1], as one element, and the total number of write requests in an interval, hypothetically in the range [0, 40], as another. Let's observe what happens between two intervals, A and B, that use both these elements in a feature vector:

Vector = (injection rate, write requests)
A = (0.001, 3)
B = (0.016, 15)

Euclidean distance = sqrt( sum_{i=1}^{n} (q_i - p_i)^2 )
                   = sqrt( (0.016 - 0.001)^2 + (15 - 3)^2 )
                   = sqrt( 0.000225 + 144 )
                   ≈ 12.00

From the above, we see that even though there is a 16× difference between injection rates (versus 5× for write requests), it has almost no effect on the Euclidean distance (the result is similar for Manhattan distance). Therefore, features of the vector should not greatly differ in their empirically observed magnitudes.
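The arithmetic above can be verified directly. The min-max rescaling at the end is one common remedy for mismatched scales, not something the text prescribes, and the [0, 40] range is the hypothetical one from the example:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

A = (0.001, 3)   # (injection rate, write requests)
B = (0.016, 15)

d = euclidean(A, B)
# The write-request element eclipses the injection rate: d is about 12.00,
# and dropping the injection-rate element changes it by less than 1e-5.

# Min-max scaling both features to [0, 1] puts them on the same scale
# before measuring distance:
ranges = [(0.0, 1.0), (0.0, 40.0)]  # hypothetical observed ranges
scale = lambda v: tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(v, ranges))
d_scaled = euclidean(scale(A), scale(B))
# After scaling, both elements can influence the distance.
```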
In addition to ensuring elements have the same unit of measure, it is sometimes tempting to use large feature vectors with many elements. However, such an approach is not always feasible. For one, there is the so-called curse of dimensionality, where the data available to populate the vectors is insufficient for the size of the vector [4]; moreover, simply parsing the data to construct these feature vectors can take a long time. Feature vectors with a high dimensionality should be reserved for large data sets.

4.2.1 Total Injection

A simple, one-dimensional feature vector that can characterize communication behaviour is the total number of packets injected by all nodes in the network. This feature vector allows us to differentiate between intervals that are experiencing high, low, or in-between levels of communication. The benefit of such a feature vector is that it is easy to create; calculating the total number of packets in an interval is a simple subtraction operation. In addition, because it is one-dimensional, calculating the distance between vectors and running through the clustering algorithm is also fast.

The disadvantages of the Total Injection feature vector are rooted in its simplicity. The total number of packets tells us nothing about the spatiality of the traffic behaviour. That is, even though two vectors may have similar magnitudes, their respective intervals could exhibit different spatial behaviour, such as hot spots. In addition, the vector does not tell us what types of messages are being injected. For example, one interval could be issuing several read requests to retrieve data, while another is issuing write requests that invalidate sharers. A one-dimensional vector is better suited to large datasets where constructing complex feature vectors takes too much time and has too much overhead. Because it does not capture several characteristics of communication behaviour, it can only give us a rough view of which intervals may be similar.

4.2.2 Coherence Composition

During full-system simulation of a coherent, shared memory architecture, packets are injected according to a cache coherence protocol. Therefore each packet is associated with a message type (Definition 3) of the protocol. Using this information we can discern between read and write phases of an application, among other things.

Definition 3. A message type is used to describe the category a packet belongs to.
The category (or type) reveals the reason a packet has been sent to a destination, allowing the cache coherence state machine to react accordingly. Consider intervals at the micro-scale (hundreds of cycles). It is likely that a write request interval is accompanied by several invalidate messages to notify sharers that the cache line has changed. Similarly, a read request interval would not be associated with invalidate messages because the cache line is not updated. Intuitively, we would group write request and read request intervals into two separate phases, and a coherence composition feature vector does this well. Figure 4.2 shows the composition of three different message types (reads, writes, and invalidates) across 25 consecutive intervals (data taken from the Swaptions application). It is easy to see the distinction between a read phase (intervals 2-11) and a write phase (intervals 14-25). During the early part of the write phase (intervals 14-18), several invalidate messages are sent to update sharers on the status of the cache line. The dimensionality of a coherence composition vector can be small. In a cache coherence protocol, there are several message types that rely on others in a one-to-one mapping. For example, invalidate packets are typically followed by acknowledgement packets to complete a handshake. We can therefore exclude acknowledgement packets from the feature vector because they are already encompassed by the

invalidate feature. This example can apply to several message types, and therefore only the message types most representative of cache coherence behaviour need to be used.

Figure 4.2: Number of read, write, and invalidate packets over a 25-interval portion of the Swaptions benchmark.

The advantage of this feature vector is that its dimensionality can be kept low and constant (i.e. it does not scale with any parameter, such as the number of nodes in the network) according to the protocol being used. It allows us to intuitively distinguish between intervals that are exhibiting different behaviour according to the cache coherence protocol. However, the feature vector does not include information about the spatial behaviour of the application. Therefore, it ignores potential hot spot information, which can have a significant impact on performance during simulation. Coherence Composition is most useful at the micro level, to differentiate between short phases that may occur due to short loops and control flows in the code. Coherence Composition can also be useful at the macro level; however, depending on the size and number of intervals, it can introduce a lot of overhead because each packet in the interval must be analyzed and sorted by message type.

4.2.3 Node Injection

If we look at the injection distribution across N nodes in the system, we can construct a feature vector that includes the spatial characteristics of application traffic. This feature vector is similar to Total Injection in how the number of packets is counted, but it now scales with N because a dimension exists for each node (Figure 4.3). The spatial injection distribution helps identify injecting hot spots, that is, nodes that send a lot of packets. But hot spots can also exist at a destination, that is, nodes that receive a lot of packets.
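A Node Injection vector is simply a per-source histogram of an interval's packets; this sketch assumes packets are given as (source, destination) pairs:

```python
def node_injection_vector(packets, num_nodes):
    """One element per node: packets injected by that source in the interval."""
    vec = [0] * num_nodes
    for src, dst in packets:
        vec[src] += 1
    return vec

# An interval where node 0 is an injecting hot spot on a 9-node system:
packets = [(0, 3), (0, 5), (0, 7), (2, 0)]
vec = node_injection_vector(packets, num_nodes=9)
# vec == [3, 0, 1, 0, 0, 0, 0, 0, 0]
```

Note that the destination of each packet is discarded, which is exactly the limitation discussed next.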
Intuitively, a node that sends a lot of packets has likely received a lot of packets as well (following the request-response mantra common in cache coherence protocols, which we will see in Chapter 5). However, the relationship between sent and received messages is lost. That is, a Node Injection feature vector cannot tell us which nodes are communicating with each other; it only tells us which nodes are communicating more than others. At the micro scale, Node Injection is not ideal because, during low-communication phases of an application, several nodes may inject zero packets. This would skew distance measures and classify several intervals as similar, even if they had different injecting hot spots. Take, for example, a 3-node

architecture. In one interval (I_1), all packets are injected from Node 1 (30 packets). In a second interval (I_2), all packets are injected from Node 3 (25 packets). And in a third interval (I_3), all packets are injected from Node 2 (28 packets). From the Euclidean distance matrix (Table 4.1) we can see that, despite different nodes being responsible for injection, the difference between distances is small (approximately ±2, with similar deltas using Manhattan distance). That is, intervals are considered similar if they have at least a one-node hot spot (in this example).

Figure 4.3: How nodes map to the Node Injection feature vector <N_1, N_2, ..., N_9>.

        I_1     I_2     I_3
I_1    0.00   39.05   41.04
I_2   39.05    0.00   37.54
I_3   41.04   37.54    0.00

Table 4.1: A distance matrix for the three example vectors.

We can improve the scaling of the Node Injection feature vector by looking at rows and columns instead of individual nodes [16]. That is, we observe the total number of packets injected by each row of nodes, and each column of nodes, to construct a feature vector that scales with 2√N. Because we look at both the rows and the columns, there is overlap in the elements of the vector, which can tell us more specifically where the hot spot is (although not as specifically as using a per-node vector). Node Injection is better suited for macro-scale clustering, where more packets ensure a more populated feature vector. This way all elements have a magnitude that can influence the distance measure, allowing the clustering algorithm to more accurately group similar intervals.

4.2.4 Row-Column Flow

The Row-Column Flow feature vector captures the spatial behaviour of traffic, as well as the relationship between sent and received messages across rows and columns. Each element of the vector corresponds to the number of packets sent by one group of nodes (a row) to another group of nodes (a column), shown

in Figure 4.4. We use the words row and column to make the vector easier to understand; the actual mapping of nodes onto the network does not have to be grid-like.

Figure 4.4: A visual demonstration of how Row-Column Flow combinations map to a feature vector. In this example, Row 1 and Column 3 make up the third source-destination pair.

For an N-node system, the vector scales at a rate of N, just like the Node Injection vector. However, because Row-Column Flow vectors contain both source and destination information (albeit aggregated into rows and columns), they can more accurately detect hot spots that occur during simulation.

4.2.5 Per-Node Flows

Capturing the relationship between each and every node creates a large feature vector that scales at a rate of N². This method counts the number of packets being sent from each source to each destination, which we define as Per-Node Flows (Definition 4).

Definition 4. A flow is a source-destination pair [16]. In an N-node network, there are N² flows.

The Per-Node Flow feature vector reveals communication behaviour at the finest granularity, and can help identify the exact location of hot spots by not aggregating information across several nodes. Because of its size, the vector should only be used when sufficient data is available to populate each element (otherwise a situation similar to Table 4.1 will occur).

4.2.6 Summary

We have introduced five feature vectors with different advantages and disadvantages depending on the number of nodes or the packet information available (a summary can be found in Table 4.2). The vectors are used to characterize the communication behaviour of an interval (recall Definition 1). In Section 4.3, we discuss a methodology for recreating these intervals during software simulation. In our evaluation (Chapter 6), we will compare the Total, Node, Row-Column Flow, and Per-Node Flow vectors.
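The two flow-based feature vectors can be sketched together to make the scaling difference concrete. The √N × √N grid mapping of rows and columns below is an assumption for illustration (as noted above, the actual mapping need not be grid-like):

```python
import math

def per_node_flow_vector(packets, num_nodes):
    """One element per (source, destination) pair: N**2 flows (Definition 4)."""
    vec = [0] * (num_nodes * num_nodes)
    for src, dst in packets:
        vec[src * num_nodes + dst] += 1
    return vec

def row_column_flow_vector(packets, num_nodes):
    """One element per (source row, destination column) pair, assuming the
    N nodes are laid out on a sqrt(N) x sqrt(N) grid: N elements in total."""
    side = math.isqrt(num_nodes)
    vec = [0] * (side * side)
    for src, dst in packets:
        vec[(src // side) * side + (dst % side)] += 1
    return vec

packets = [(0, 8), (1, 8), (3, 4)]
pn = per_node_flow_vector(packets, 9)    # 81 elements (9**2 flows)
rc = row_column_flow_vector(packets, 9)  # 9 elements (3 rows x 3 columns)
```

On the same 9-node interval, the per-node vector has 81 elements while the row-column vector has only 9, which is why the latter needs far less data to populate.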
We omit Coherence Composition because, in our experiments, it did not accurately


More information

Components: Interconnect Page 1 of 18

Components: Interconnect Page 1 of 18 Components: Interconnect Page 1 of 18 PE to PE interconnect: The most expensive supercomputer component Possible implementations: FULL INTERCONNECTION: The ideal Usually not attainable Each PE has a direct

More information

Network-Wide Class of Service (CoS) Management with Route Analytics. Integrated Traffic and Routing Visibility for Effective CoS Delivery

Network-Wide Class of Service (CoS) Management with Route Analytics. Integrated Traffic and Routing Visibility for Effective CoS Delivery Network-Wide Class of Service (CoS) Management with Route Analytics Integrated Traffic and Routing Visibility for Effective CoS Delivery E x e c u t i v e S u m m a r y Enterprise IT and service providers

More information

Agenda. Distributed System Structures. Why Distributed Systems? Motivation

Agenda. Distributed System Structures. Why Distributed Systems? Motivation Agenda Distributed System Structures CSCI 444/544 Operating Systems Fall 2008 Motivation Network structure Fundamental network services Sockets and ports Client/server model Remote Procedure Call (RPC)

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE CHAPTER 5 71 FINITE STATE MACHINE FOR LOOKUP ENGINE 5.1 INTRODUCTION Finite State Machines (FSMs) are important components of digital systems. Therefore, techniques for area efficiency and fast implementation

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM. 2012-13 CALIENT Technologies www.calient.

The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM. 2012-13 CALIENT Technologies www.calient. The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM 2012-13 CALIENT Technologies www.calient.net 1 INTRODUCTION In datacenter networks, video, mobile data, and big data

More information

HPAM: Hybrid Protocol for Application Level Multicast. Yeo Chai Kiat

HPAM: Hybrid Protocol for Application Level Multicast. Yeo Chai Kiat HPAM: Hybrid Protocol for Application Level Multicast Yeo Chai Kiat Scope 1. Introduction 2. Hybrid Protocol for Application Level Multicast (HPAM) 3. Features of HPAM 4. Conclusion 1. Introduction Video

More information

Cluster Analysis for Evaluating Trading Strategies 1

Cluster Analysis for Evaluating Trading Strategies 1 CONTRIBUTORS Jeff Bacidore Managing Director, Head of Algorithmic Trading, ITG, Inc. Jeff.Bacidore@itg.com +1.212.588.4327 Kathryn Berkow Quantitative Analyst, Algorithmic Trading, ITG, Inc. Kathryn.Berkow@itg.com

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Application Performance Testing Basics

Application Performance Testing Basics Application Performance Testing Basics ABSTRACT Todays the web is playing a critical role in all the business domains such as entertainment, finance, healthcare etc. It is much important to ensure hassle-free

More information

CONTINUOUS scaling of CMOS technology makes it possible

CONTINUOUS scaling of CMOS technology makes it possible IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006 693 It s a Small World After All : NoC Performance Optimization Via Long-Range Link Insertion Umit Y. Ogras,

More information

Performance of networks containing both MaxNet and SumNet links

Performance of networks containing both MaxNet and SumNet links Performance of networks containing both MaxNet and SumNet links Lachlan L. H. Andrew and Bartek P. Wydrowski Abstract Both MaxNet and SumNet are distributed congestion control architectures suitable for

More information

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?

Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS? 18-345: Introduction to Telecommunication Networks Lectures 20: Quality of Service Peter Steenkiste Spring 2015 www.cs.cmu.edu/~prs/nets-ece Overview What is QoS? Queuing discipline and scheduling Traffic

More information

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data White Paper Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data What You Will Learn Financial market technology is advancing at a rapid pace. The integration of

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.

More information

Photonic Networks for Data Centres and High Performance Computing

Photonic Networks for Data Centres and High Performance Computing Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter

More information

Module 7. Routing and Congestion Control. Version 2 CSE IIT, Kharagpur

Module 7. Routing and Congestion Control. Version 2 CSE IIT, Kharagpur Module 7 Routing and Congestion Control Lesson 4 Border Gateway Protocol (BGP) Specific Instructional Objectives On completion of this lesson, the students will be able to: Explain the operation of the

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy

TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy John Jose, K.V. Mahathi, J. Shiva Shankar and Madhu Mutyam PACE Laboratory, Department of Computer Science and Engineering

More information

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK

AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Abstract AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Mrs. Amandeep Kaur, Assistant Professor, Department of Computer Application, Apeejay Institute of Management, Ramamandi, Jalandhar-144001, Punjab,

More information

Performance Analysis of Storage Area Network Switches

Performance Analysis of Storage Area Network Switches Performance Analysis of Storage Area Network Switches Andrea Bianco, Paolo Giaccone, Enrico Maria Giraudo, Fabio Neri, Enrico Schiattarella Dipartimento di Elettronica - Politecnico di Torino - Italy e-mail:

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks

Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Benjamin Schiller and Thorsten Strufe P2P Networks - TU Darmstadt [schiller, strufe][at]cs.tu-darmstadt.de

More information

How To Build A Cloud Computer

How To Build A Cloud Computer Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Joint ITU-T/IEEE Workshop on Carrier-class Ethernet

Joint ITU-T/IEEE Workshop on Carrier-class Ethernet Joint ITU-T/IEEE Workshop on Carrier-class Ethernet Quality of Service for unbounded data streams Reactive Congestion Management (proposals considered in IEE802.1Qau) Hugh Barrass (Cisco) 1 IEEE 802.1Qau

More information

ΤΕΙ Κρήτης, Παράρτηµα Χανίων

ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΠΣΕ, Τµήµα Τηλεπικοινωνιών & ικτύων Η/Υ Εργαστήριο ιαδίκτυα & Ενδοδίκτυα Η/Υ Modeling Wide Area Networks (WANs) ρ Θεοδώρου Παύλος Χανιά 2003 8. Modeling Wide Area Networks

More information

Lustre Networking BY PETER J. BRAAM

Lustre Networking BY PETER J. BRAAM Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information

More information

Load Distribution in Large Scale Network Monitoring Infrastructures

Load Distribution in Large Scale Network Monitoring Infrastructures Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu

More information

CloudAnalyst: A CloudSim-based Tool for Modelling and Analysis of Large Scale Cloud Computing Environments

CloudAnalyst: A CloudSim-based Tool for Modelling and Analysis of Large Scale Cloud Computing Environments 433-659 DISTRIBUTED COMPUTING PROJECT, CSSE DEPT., UNIVERSITY OF MELBOURNE CloudAnalyst: A CloudSim-based Tool for Modelling and Analysis of Large Scale Cloud Computing Environments MEDC Project Report

More information

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications

Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev (skounev@ito.tu-darmstadt.de) Information Technology Transfer Office Abstract Modern e-commerce

More information

Network Architecture and Topology

Network Architecture and Topology 1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches and routers 6. End systems 7. End-to-end

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

SDN and FTTH Software defined networking for fiber networks

SDN and FTTH Software defined networking for fiber networks SDN and FTTH Software defined networking for fiber networks A new method to simplify management of FTTH networks What is SDN Software Defined Networking (SDN) revolutionizes service deployment and service

More information

Detecting Network Anomalies. Anant Shah

Detecting Network Anomalies. Anant Shah Detecting Network Anomalies using Traffic Modeling Anant Shah Anomaly Detection Anomalies are deviations from established behavior In most cases anomalies are indications of problems The science of extracting

More information

ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers

ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers Christina Delimitrou 1, Sriram Sankar 2, Aman Kansal 3, Christos Kozyrakis 1 1 Stanford University 2 Microsoft 3

More information

- Nishad Nerurkar. - Aniket Mhatre

- Nishad Nerurkar. - Aniket Mhatre - Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,

More information

Monitoring Large Flows in Network

Monitoring Large Flows in Network Monitoring Large Flows in Network Jing Li, Chengchen Hu, Bin Liu Department of Computer Science and Technology, Tsinghua University Beijing, P. R. China, 100084 { l-j02, hucc03 }@mails.tsinghua.edu.cn,

More information

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one

More information

Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols

Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols Universitat Politècnica de València Master Thesis Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols Author: Mario Lodde Advisor: Prof. José Flich Cardo A thesis

More information

D1.1 Service Discovery system: Load balancing mechanisms

D1.1 Service Discovery system: Load balancing mechanisms D1.1 Service Discovery system: Load balancing mechanisms VERSION 1.0 DATE 2011 EDITORIAL MANAGER Eddy Caron AUTHORS STAFF Eddy Caron, Cédric Tedeschi Copyright ANR SPADES. 08-ANR-SEGI-025. Contents Introduction

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.

More information

A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing

A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing N.F. Huysamen and A.E. Krzesinski Department of Mathematical Sciences University of Stellenbosch 7600 Stellenbosch, South

More information