Synthetic Traffic Models that Capture Cache Coherent Behaviour. Mario Badr
Synthetic Traffic Models that Capture Cache Coherent Behaviour

by

Mario Badr

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto.

© Copyright 2014 by Mario Badr
Abstract

Synthetic Traffic Models that Capture Cache Coherent Behaviour
Mario Badr
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014

Modern and future many-core systems represent large and complex architectures. The communication fabrics in these large systems play an important role in their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased complexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that trade off some accuracy but maintain important workload trends in exchange for faster simulation times are highly beneficial at early stages of architectural exploration. We propose a synthetic traffic generation methodology that captures both application behaviour and cache coherence traffic to rapidly evaluate NoCs. This allows designers to run detailed performance simulations without the cost of long-running full-system simulation while still capturing a full range of application and coherence behaviour. Our methodology has an average (geometric mean) error of 10.9% relative to full-system simulation, and provides a 50× speedup on average over full-system simulation.
Contents

1 Introduction
    Exploring the NoC Design Space
    A High-Level Approach
    Thesis Organization and Contributions
2 Background
    NoC Primer
    Topology
    Routing and Flow Control
    Performance
3 Simulation Methodologies
    Synthetic Traffic Patterns
    Trace Simulation
    Traces with Dependencies
    Other Methodologies and Related Work
    Simulation Acceleration
    Workload Modelling and Synthetic Benchmarks
4 Time-Varying Application Behaviour
    Phase Behaviour in Applications
    Feature Vector Design
    Total Injection
    Coherence Composition
    Node Injection
    Row-Column Flow
    Per-Node Flows
    Summary
    The Injection Process
    Macro Scale
    Micro Scale
    Phase Transitions
    Summary
5 Synthesizing Traffic
    Overview
    Initiating Packets
    Reactive Packets
    Forwarding versus Off-Chip Invalidates
    Summary
6 Evaluation and Results
    System Configuration
    Model Exploration
    Macro Phases
    Congestion at the Micro Level
    Time Interval Size
    Parameter Recommendations
    NoC Performance Evaluation
    Exploiting Markov Chains for Speedup
7 Conclusion and Future Work

Bibliography
Chapter 1

Introduction

As uniprocessors are now limited by power and heat constraints, the architecture community has been investing increasing research and development effort into multi- and many-core processors. As a result, the design space has grown larger, and more complex trade-offs are associated with processor designs. To accurately evaluate candidate architectures, we can model each component on- and off-chip and perform full-system simulation. These full-system simulations can provide high-fidelity performance metrics before synthesis and prototyping of the actual design.

An important part of this larger design space is the communication fabric used to connect the cores on a single die. In particular, Networks-on-Chip (NoCs) have been proposed as a modular and scalable fabric that can facilitate communication in multi- and many-core chips [10]. Applications targeted for the system will impose different bandwidth requirements on the interconnect, and NoCs need to be provisioned for performance whilst meeting power and area cost constraints. But NoCs themselves have their own large design space, and modelling each component via full-system simulation is time-consuming [24].

Simulation efforts are being strained in part because of the large number of system components that must be modelled. Detailed models of processors (possibly multiple heterogeneous ones), caches, DRAM, and networks are critical for accurate performance results, not to mention the need to run many large applications and fully model OS behaviour. However, full-system simulation is still appealing because of its fidelity. Designers who are willing to sacrifice fidelity due to time-to-market constraints or design space exploration have other software simulation methodologies available to them. Trace simulations are prevalent in a number of domains, where relevant information is recorded during a single full-system run and is then replayed on a specific component.
For NoCs, synthetic traffic patterns can quickly reveal bottlenecks in the design. Software simulation methods that look to maintain fidelity while accelerating long simulations also exist - for example, sampling [1, 8]. However, outside of full-system simulation, several methodologies fail to accurately capture OS or cache coherence traffic, which can have significant effects on the performance of the system [23]. A simulation methodology that captures this behaviour while quickly evaluating NoC designs is needed.
1.1 Exploring the NoC Design Space

There are several parameters that can affect the traffic a communication fabric needs to support (the application, cache hierarchy, and coherence protocol, to name a few). Software simulation that models all of these components takes time. A methodology that allows for NoC design space exploration needs to abstract away these parameters so that researchers can focus on an NoC's infrastructure and communication paradigms.

To better understand this abstraction, we present a generic design flow in Figure 1.1. Before any design can begin, the question of "What are the target applications for the system?" needs to be answered. This is done during the Application Modelling phase (top left of the figure), and the design space of applications can be vast [5], especially for general purpose processors. Once the application space has been determined, their corresponding threads are mapped and scheduled onto different processor cores (bottom left). The microarchitecture of these cores is another rich design space with several parameters. Finally, these cores along with their caches are connected to the communication fabric via routers, and the routers are in turn connected to other routers via links and buffers according to some topology (right side of the figure) to make up the NoC.

Figure 1.1: A generic design flow, with emphasis on NoCs.

In order to abstract away the other design spaces, our methodology looks only at the traffic injected into the routers. This means that interactions between components (such as between a core and its cache) are not modelled, allowing those designs to be varied independently. Another advantage is simplicity; we seek to synthetically inject packets into the network without modelling the complexities of each individual component, as full-system simulation would.
1.2 A High-Level Approach

As technology scales and machines evolve, new applications and workloads that take advantage of the hardware follow. As a result, older benchmark suites become less relevant, and new benchmark suites
emerge to explore new algorithms and workloads [5, 12]. Because applications and the systems they run on are constantly changing, our work focuses on creating a generic, flexible, and fast methodology that can capture the communication behaviour of any application on any system.

Figure 1.2: A high level overview of our methodology for design space exploration. The right of the dashed line shows examples of inputs or outputs of the different components.

Figure 1.2 shows the high level approach to our methodology. We begin by performing a full-system simulation on an ideal network (i.e. the network does not model congestion - packets arrive at their destination in uniform, single-cycle time). By using an ideal network, we ensure that our traffic modelling does not capture any aspects of a specific NoC configuration. Ideal networks also make it easier to understand traffic behaviour because each packet has a single-cycle latency. Once we have the traffic generated by full-system simulation on an ideal network, our traffic modelling can extract the spatial and temporal characteristics of the traffic such that there is sufficient information to recreate the traffic synthetically - we use the term synthetic because the traffic is artificial (that is, it is not produced by a full-system simulation). The models we create need to provide enough information to recreate the traffic synthetically. For example, a model that looks only at the average hop count for its spatial characteristic will know how far to send a packet, but not which destination.
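To make this concrete, the kind of characterization described above can be sketched over an ideal-network packet trace. This is an illustrative sketch only; the trace format (cycle, source, destination) and the interval size are our assumptions, not the exact representation used by the methodology.

```python
from collections import Counter

def characterize(trace, num_nodes, interval_cycles=10_000):
    """Summarize an ideal-network packet trace.

    `trace` is an iterable of (cycle, src, dst) tuples. Returns
    per-interval injection counts (temporal behaviour) and a
    source-destination matrix (spatial behaviour) -- enough, unlike
    an average hop count alone, to know *which* destination to target.
    """
    injections = Counter()  # interval index -> packets injected
    sd_matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for cycle, src, dst in trace:
        injections[cycle // interval_cycles] += 1
        sd_matrix[src][dst] += 1
    return injections, sd_matrix

# Toy trace: node 0 bursts early; node 1 sends one packet much later.
trace = [(10, 0, 3), (12, 0, 3), (15, 0, 5), (20_000, 1, 2)]
inj, sd = characterize(trace, num_nodes=9)
print(dict(inj))  # {0: 3, 2: 1}
print(sd[0][3])   # 2
```

A traffic generator can then replay these statistics without modelling cores or caches at all.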
Finally, the model is applied to a traffic generator that recreates the traffic in a fashion similar to full-system simulation, but without the complex modelling of specific components (e.g. caches, off-chip memory, cores). This improves simulation time, and setting up the traffic generator to drive an NoC simulator is easier than setting up a full system simulator. There are many different ways to model and generate traffic. In this dissertation, we analyze several models and parameters that explore the spatial and temporal characteristics of real application traffic
and how they affect the performance of different NoCs.

1.3 Thesis Organization and Contributions

The focus of this dissertation is to propose a simulation methodology that can assess various NoC designs without the need for full-system simulation. We provide a brief primer on NoCs in Chapter 2. In Chapter 3, we introduce simulation methodologies already available to designers. In this thesis we pay special attention to NoC performance, which is greatly affected by the traffic generated from an application run. Therefore, an important first step is to understand the behaviour of application traffic and develop models that capture its spatial and temporal characteristics. In Chapter 4, we discuss a methodology that captures the time-varying behaviour of the application. We apply a two-level hierarchical divide-and-conquer approach that splits application traffic into intervals. Chapter 5 then looks at how to reproduce intervals at the lowest level of our hierarchy to resemble traffic generated by a real application on a shared memory architecture. In Chapter 6, we explore how the parameters of our model can be changed to improve the fidelity of our methodology. We then apply our recommended parameters to a variety of applications from the PARSEC [5] and SPLASH-2 [41] benchmark suites, and compare NoC performance and simulation time to full-system simulation.

The main contribution of this dissertation is a novel methodology that:

1. Generates bursty traffic similar to what would be seen by applications run on a full-system simulator. Burstiness can have a major impact on NoC performance because the network must accommodate a large number of packets in a short amount of time.

2. Produces traffic that resembles a real cache coherence protocol.
Cache coherence is a crucial component of current and future designs [23] and, as later chapters show, can significantly impact traffic patterns.

3. Can be applied to multiple NoC designs without loss of fidelity. Designers often need to explore several different designs to optimize the NoC for their applications.

4. Provides significant speedup over full-system simulation. Time-to-market constraints and the aforementioned need for design space exploration mean that simulations must be fast as well as accurate.
Chapter 2

Background

Efficient on-chip communication is essential for the performance of CMPs, and NoCs have become the de facto interconnect of the future [22]. In this chapter, we present a brief primer regarding the basics of an NoC and how to evaluate its performance.

2.1 NoC Primer

NoCs provide a scalable, high bandwidth alternative to bus-based and point-to-point interconnects. This is, in large part, thanks to the modular approach of connecting nodes to routers, depicted in Figure 2.1. A network node can contain several components, such as a processor core or a cache, and sends messages to other nodes via a router. In this chapter, we discuss:

1. Topology: how the routers are connected to each other
2. Routing & Flow Control: how messages are transported in the NoC
3. Performance: how NoC performance is measured

Figure 2.1: Nodes (squares) connected to Routers (circles). Nodes can contain several components, such as cores (C), caches ($), and directories (D).
Topology

An NoC's topology can have a significant impact on its performance because it defines the paths available between nodes through routers [11]. Figure 2.2 shows one example of a 9-node ring topology, where the circles are considered to be nodes with routers and the lines connecting them are channels. In order for a message to travel from Node 0 to Node 3, it must travel through the routers at Nodes 1 and 2 as well. This means messages going from Node 0 to Node 3 take three hops to arrive at their destination. Conversely, the mesh network shown in Figure 2.3 allows messages travelling from Node 0 to Node 3 to take only one hop. Assuming perfect routing and flow control, the topology will give designers an upper bound for the NoC's performance.

Figure 2.2: A 9-node ring network topology.

Figure 2.3: A 9-node mesh network topology.

2.2 Routing and Flow Control

Messages move through the network according to a routing algorithm, which defines the path that can be taken by a message to its destination. Dimension-Order Routing (DOR) is a deterministic routing algorithm that defines a minimal route (i.e. shortest path) between a source and its destination. The advantage of deterministic routing algorithms is that they are easy to implement; however, these paths do not consider congestion in the NoC (Figure 2.4). Adaptive routing algorithms exist that allow messages to take alternate routes if the current path is too congested. This is analogous to driving downtown in a busy city; drivers are more likely to look for alternate roads to their destination rather than wait in traffic. Figure 2.5 shows how adaptive routing can completely avoid congested channels and take an alternate path so that a packet arrives at its destination earlier. From an NoC perspective, adaptive routing algorithms can improve performance by increasing path diversity through the network.
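As an illustration, a dimension-order (XY) route in a mesh can be computed directly from node coordinates. This is a minimal sketch; the row-major node numbering and square mesh are our assumptions, not a convention fixed by the text.

```python
def xy_route(src, dst, width):
    """Dimension-order (XY) route in a width x width mesh with
    row-major node numbering (an assumption for illustration).
    Returns the list of nodes visited, X dimension first, then Y."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [src]
    x, y = sx, sy
    while x != dx:                      # traverse the X dimension first
        x += 1 if dx > x else -1
        path.append(y * width + x)
    while y != dy:                      # then the Y dimension
        y += 1 if dy > y else -1
        path.append(y * width + x)
    return path

print(xy_route(0, 7, 3))            # [0, 1, 4, 7]
print(len(xy_route(0, 3, 3)) - 1)   # 1 hop from Node 0 to Node 3 in a 3x3 mesh
```

The one-hop result for Node 0 to Node 3 matches the mesh example above; the route is fixed regardless of congestion, which is exactly the limitation adaptive routing addresses.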
Messages (or packets) sent by nodes into the NoC can hold varying amounts of data, which are further discretized into flits. Flits serialize a packet so that it can traverse the network according to the bandwidth allowed by the channel. Flits are routed through different network resources, such as router buffers and virtual channels (VCs). Each channel in the network has multiple virtual channels associated with a router port. Multiple VCs help improve link utilization because, if one packet stalls on its way to its destination (due to contention for a network resource), other packets can continue to route through the same path using a different VC. A high number of VCs can increase the bandwidth capabilities of an NoC; however, they are also expensive in terms of area and power [11].
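The serialization of a packet into flits is a simple calculation. Using the packet and flit sizes from the configuration in Chapter 6 (8-byte control packets, 72-byte data packets, 4-byte flits):

```python
import math

def flits_per_packet(packet_bytes, flit_bytes):
    """A packet is serialized into ceil(packet_bytes / flit_bytes) flits."""
    return math.ceil(packet_bytes / flit_bytes)

print(flits_per_packet(8, 4))    # 2 flits per control packet
print(flits_per_packet(72, 4))   # 18 flits per data packet
```

The 9:1 ratio in flit count is one reason data-carrying coherence messages load the network so much more heavily than control messages.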
Figure 2.4: A packet routed from Node 0 to Node 7 using Dimension-Order Routing. Note the congestion between Nodes 3, 6, and 7.

Figure 2.5: A packet routed from Node 0 to Node 7 using adaptive routing, avoiding the congestion between Nodes 3, 6, and 7.

2.3 Performance

The main performance metric associated with NoC design is packet latency: the time it takes for a packet to arrive at its destination from its source node. The latency of a packet can vary when we consider contention in the network; as the NoC becomes congested, its resources fill up and packets must wait before they can continue to traverse the network. Waiting (or stalling) for network resources can dramatically increase the latency of a packet and severely hinder network performance.

The average packet latency is a common metric used to get a quick evaluation of NoC designs; however, it is also informative to consider the packet latency distribution. Similar average packet latencies can be achieved with very different distributions. For example, a Gaussian distribution and a bi-modal distribution can be constructed to give the same average, but the bi-modal distribution has a large number of packets with very high and very low latencies. As a result, NoC designers would provision the network differently in order to improve network performance, taking into consideration the high bandwidth requirements for a given application. Conversely, a Gaussian distribution implies that the traffic is more easily manageable and not as bursty (i.e. a large number of packets injected into the network over a short time).

A common pitfall in measuring packet latency is disregarding the source queue. When a packet is to be sent into the network, it is possible that the input port to the network is busy. As a result, the source must delay injection of the packet until the input is ready [11].
This situation is common during bursty injection, and can have a significant impact on packet latency when the network is congested. In this dissertation, packet latency includes time spent in the source queue.
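To illustrate why the distribution matters and not just the mean, two small hypothetical latency samples can share an average while implying very different networks:

```python
from statistics import mean

# Two hypothetical latency samples (in cycles) with the same average:
# a tight unimodal one, and a bimodal one with many very fast and
# many very slow packets.
unimodal = [26, 27, 27, 28, 27, 27]
bimodal  = [17, 17, 17, 37, 37, 37]

assert mean(unimodal) == mean(bimodal) == 27
# The averages match, but the bimodal workload forces a designer to
# provision for its slow tail (37 cycles), not for the 27-cycle mean.
print(mean(unimodal), mean(bimodal), max(unimodal), max(bimodal))
```

The bimodal sample here deliberately mirrors the Bit Complement latencies (17 and 37 cycles) discussed in Chapter 3.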
Chapter 3

Simulation Methodologies

There are several simulation methodologies available to researchers when evaluating NoCs. In this chapter, we explore two software simulation methodologies (synthetic traffic patterns and traces) and discuss recent advancements in the area of software simulation and other related work.

3.1 Synthetic Traffic Patterns

Synthetic traffic patterns such as uniform random, permutation, tornado, etc. are widely used in NoC research. Many of these traditional synthetic traffic patterns are based on the communication pattern of specific applications. For example, transpose traffic is based on a matrix transpose application, and the shuffle permutation is derived from Fast Fourier Transforms (FFTs) [2, 11]. However, these synthetic traffic patterns are not representative of the wide range of applications that run on current and future CMPs. Even if these traffic patterns were representative, the configuration of a cache-coherent system can mask or destroy the inherent communication pattern of the original algorithm due to the presence of indirections and control messages.

Synthetic traffic patterns are typically applied to an NoC with a fixed injection rate. In this section we demonstrate two methods for selecting an appropriate synthetic traffic pattern to explore an NoC design. The first method uses the injection rate from a real application, Facesim, and sweeps several traffic patterns. The second method uses a synthetic traffic pattern, Shuffle, that is historically based on a real application (FFT). In both cases, we will see that current synthetic traffic patterns are not characteristic of the modern application landscape.

The Facesim benchmark computes a realistic animation of a face through physics simulation. When run on an ideal network, it has an average injection rate of 0.16 packets per cycle. We apply this injection rate using a variety of synthetic traffic patterns that are used in the evaluation of NoCs [11].
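For reference, classic synthetic patterns compute each destination as a fixed function of the source address [11]. The sketch below implements two of them (bit complement, and the perfect shuffle examined later in this section) under our own assumption of a power-of-two number of nodes:

```python
def bit_complement(src, num_nodes):
    """Destination is the bitwise complement of the source address."""
    return ~src & (num_nodes - 1)

def shuffle(src, num_nodes):
    """Perfect shuffle: destination bit i is source bit (i-1) mod b,
    i.e. rotate the b source bits left by one position."""
    b = num_nodes.bit_length() - 1        # address width in bits
    msb = (src >> (b - 1)) & 1            # bit that wraps around
    return ((src << 1) | msb) & (num_nodes - 1)

# 16-node examples, source address 0b0011 (node 3):
print(bit_complement(3, 16))  # 0b1100 = node 12
print(shuffle(3, 16))         # 0b0110 = node 6
```

Because the destination is a pure function of the source, every packet from a given node goes to the same place, which is precisely why such patterns cannot reproduce the diffuse hot spots of coherence traffic shown later in Figure 3.3.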
Figure 3.1 shows the average packet latency, yielding an average error of 23.64% (our system configuration and methodology are explained in Chapter 6).

Figure 3.1: Average Packet Latency of different traffic patterns on an NoC. The dashed line is the average packet latency of the Facesim benchmark.

Looking just at average packet latency, one could conclude that the Bit Complement traffic pattern best approximates the Facesim benchmark. However, average packet latency tells us little about the congestion in the network. Bit Complement is a subset of permutation traffic that is typically used to stress NoC configurations [11]. Specifically, given a 4-bit source address S_x = {s_3, s_2, s_1, s_0}, the destination is computed as the complement of each bit: D_x = {¬s_3, ¬s_2, ¬s_1, ¬s_0}. Figure 3.2 shows the packet latency distributions for Facesim and Bit Complement (as a percentage of packets injected). Bit Complement has three definitive packet latencies (17, 27, and 37 cycles) whereas Facesim's packet latencies are more evenly distributed due to time-varying behaviour in the simulation. Bit Complement may produce the same average behaviour as Facesim (with regard to packet latency), but it does not provide bursts of traffic as one would expect from a real application.

Figure 3.2: Packet latency distributions for Facesim and Bit Complement.

Synthetic traffic pattern simulations can be made more complicated using varying injection rates for nodes, modulated Markov processes for injection, etc. [11]. However, they are not representative of real application traffic due to the shared memory architectures that are typically modelled in full-system simulation. The arrangement of cores, caches, directories, and memory controllers directly influences the flow of communication when running an application. To illustrate this point, we compare a synthetic shuffle pattern with the FFT benchmark from SPLASH-2 [41]. The shuffle pattern is a bit permutation where the destination is calculated via the function d_i = s_((i-1) mod b), where b is the number of bits required to represent the nodes of the network [11]. FFT is run in full-system simulation while shuffle is run in network-only simulation. Figure 3.3 shows the number
of packets sent from a source to a destination.¹

Figure 3.3: A comparison of the spatial behaviour between synthetic and real application traffic: (a) Shuffle Traffic Pattern, (b) FFT Application.

In Figure 3.3b, we see notable destination hot spots at nodes 0, 2, and 5, as well as source hot spots at nodes 0 and 5. However, Figure 3.3a shows hot spots only for specific source-destination pairs. The best NoC design for the traffic in Figure 3.3a is unlikely to be the best NoC for the traffic in Figure 3.3b.

The sharp contrast in Figure 3.3 is due to coherence transactions needing to visit several nodes in a shared memory architecture before completing. For example, a write request first visits a directory to receive ownership of a cache line. The directory forwards requests to the core caching the data, and also invalidates caches that are sharing the data. Invalidated caches must send acknowledgements. This domino effect is the typical behaviour in a shared memory architecture; it can significantly change an application's spatial behaviour and should be correctly modelled for realistic traffic generation.

Synthetic traffic patterns are useful in revealing bottlenecks in the network while sweeping injection rates to find an NoC's saturation point. However, for researchers looking to design NoCs that are correctly provisioned for the application space and infrastructure of a real system, synthetic traffic patterns cannot be used to make conclusive decisions.

3.2 Trace Simulation

Trace simulation first records the injected traffic from a full-system simulation and then replays the traces in an NoC simulator. This maintains the time-varying behaviour of the application, and traces can also include cache coherence information about packets, alleviating the issues faced by synthetic traffic patterns.
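The coherence "domino effect" described in Section 3.1 can be made concrete by enumerating the messages a single write miss might trigger in a directory protocol. This is a hypothetical, loosely MSI-like flow for illustration only; real protocols and message names differ.

```python
def write_miss_packets(requester, directory, owner, sharers):
    """Enumerate the messages one write miss can trigger: a request to
    the directory, a forward to the current owner, invalidates to each
    sharer, and the corresponding data/acknowledgement replies.
    Message names (GetM, Fwd-GetM, Inv, ...) are illustrative."""
    msgs = [(requester, directory, "GetM")]
    if owner is not None:
        msgs.append((directory, owner, "Fwd-GetM"))   # forward to owner
        msgs.append((owner, requester, "Data"))       # owner supplies data
    for s in sharers:
        msgs.append((directory, s, "Inv"))            # invalidate sharer
        msgs.append((s, requester, "Inv-Ack"))        # sharer acknowledges
    return msgs

# One write by node 0: line owned by node 5, shared by nodes 2 and 7.
for src, dst, kind in write_miss_packets(0, directory=3, owner=5, sharers=[2, 7]):
    print(f"{src} -> {dst}: {kind}")
```

A single store thus fans out into seven packets across five nodes, which is why the spatial pattern of a coherent system looks nothing like the original algorithm's communication pattern.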
The main problem with this approach is that it ignores the dependencies that packets exhibit between each other. An over-provisioned NoC will likely deliver packets faster than one that is under-provisioned, and these could result in two very different traces for the same application. Traffic depends on how quickly messages are delivered, and is better suited to a control system that can react to ejected packets [21]. To demonstrate this simulation methodology's shortcomings, we take traces from full-system simulations that use an ideal network. The traces are then applied to a mesh network with two virtual channels and adaptive routing. Control packets are 8 bytes while data packets are 72 bytes, and the flit size is 4 bytes (see Chapter 6 for full details regarding our system configuration).

¹The number of packets in each figure is unimportant as we focus on source-destination traffic pairs.

Figure 3.4 shows the average packet latency across multiple PARSEC benchmarks. Ideally, the trace simulation would have similar performance to the full-system simulation; however, this methodology can yield an error as high as 666% (fluidanimate). On average, the average packet latency is off by 234.5%, and the error could be higher for less provisioned networks. Traces can lead us to make invalid assumptions about the performance of our network, and should not be used for design space exploration.

Figure 3.4: Comparison between full system and trace simulation for average packet latency across a network configuration.

3.3 Traces with Dependencies

Inferring packet dependencies in trace files can improve the fidelity of trace-based simulation, and there has been active research in the area [14, 27]. This provides similar benefits to regular trace simulations, but with the added benefit of throttling packets that should not be injected until a previous packet has arrived at its destination. In Netrace, dependencies between architectural components (such as a cache and memory controller) and the cache coherence protocol are inferred to create dependency traces [14]. Dependencies due to program behaviour are not tracked; however, the methodology uses simple in-order cores to alleviate this problem (i.e. memory requests from a core are serial, therefore no dependency tracking is necessary). Nitta et al. use a different approach by taking multiple full-system traces on a variety of NoC configurations. Comparing these traces for causality, a packet dependency graph (PDG) can be constructed for a particular application [27]. This approach can track program behaviour dependencies; however, it is not always accurate. Traces on other NoC configurations can introduce more dependencies to the graph.
To evaluate traces with dependencies, we employ an approach similar to Netrace, except that we use out-of-order cores to keep comparisons throughout this dissertation fair. In addition, in-order cores do not aggressively stress the network because they stall on memory requests; when the processor stalls, no new messages are injected into the network. We compare the fidelity and speed of trace simulations with dependencies in our evaluation (Chapter 6).
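The throttling idea behind dependency traces can be sketched as follows: a packet's injection must wait for its parent's ejection. This is our own simplified illustration of the concept, not Netrace's actual trace format or algorithm; the fixed per-packet latency stands in for a real NoC simulator.

```python
def replay(packets, network_latency):
    """Replay a dependency trace. `packets` maps a packet id to
    (record_cycle, parent_id or None). A packet becomes ready at its
    recorded cycle, or at its parent's ejection cycle if that is later.
    Returns each packet's ejection cycle."""
    ejected = {}
    def eject_cycle(pid):
        if pid in ejected:
            return ejected[pid]
        record, parent = packets[pid]
        ready = record if parent is None else max(record, eject_cycle(parent))
        ejected[pid] = ready + network_latency
        return ejected[pid]
    for pid in packets:
        eject_cycle(pid)
    return ejected

# Packet 2 depends on packet 1, so it cannot inject at its recorded
# cycle 15 when the network delivers packet 1 later than that.
print(replay({1: (10, None), 2: (15, 1)}, network_latency=20))  # {1: 30, 2: 50}
```

On a slower network the whole dependency chain stretches out, which is exactly the feedback that plain trace replay ignores.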
3.4 Other Methodologies and Related Work

There are several simulation methodologies available to designers. In this section we present methodologies that improve simulation time and/or allow for many-core (i.e. hundreds or thousands of cores) design space exploration, as well as work that characterizes application behaviour either to better understand the application or to create synthetic benchmarks.

3.4.1 Simulation Acceleration

Simulating small but representative parts of an application run has been widely explored, and two main methodologies exist: SimPoint [34] and SMARTS [42]. In SimPoint, Sherwood et al. capture the time-varying behaviour of programs using Basic Block Vectors (BBVs), which are then clustered (grouped) into phases. Multiple simulation points (hence, SimPoint) can then be inferred from these phases that represent the full execution of the simulation. SMARTS also simulates only parts of an application, but uses statistical sampling to determine which parts to simulate. Both SimPoint and SMARTS are targeted at single-threaded workloads. However, recent work shows that, with some changes, the methodologies can be applied to parallel workloads as well [25, 43]. Sampling on multi-threaded applications has received renewed interest recently [1, 8]. These sampling methodologies have mostly been applied to micro-architectural simulation, so their efficacy for NoC evaluation is currently unknown.

User-level simulators exist as an alternative to full-system simulation for exploring thousands of cores [7, 24]. ZSim exploits parallel simulation with out-of-order core models to simulate 300 Million Instructions Per Second (MIPS), compared to the roughly 200 KIPS of full-system simulation [32]. However, user-level simulators are designed to work without an operating system (current operating systems do not support thousands of cores), simulating only the user portion of an application.
In addition, several components are not modelled, such as peripheral devices. As a result, many applications cannot be run in the simulation environment. Still, user-level simulators are a strong tool for futuristic design space exploration of thousand-core architectures.

FPGA-based acceleration has also been proposed [9, 37]. FIST implements an FPGA-based network simulator that can simulate mesh networks with significant speedup over software simulation [28]. DrNoC is an FPGA framework for design space exploration of NoCs that does not require resynthesis of the NoC when its configuration changes [20]. DrNoC relies on the partial reconfigurability of Xilinx FPGAs, thereby requiring a specialized design flow [39]. The main drawback of FPGA-based simulation is that, while fast and accurate, it can be difficult to use [32].

In this dissertation we focus on speeding up simulation time for NoC design space exploration, and some network simulators exist that speed up simulation time. For example, Hornet [29] focuses on parallelizing an NoC simulation and can achieve a 12× speedup. Our work is orthogonal to Hornet because our synthetic traffic generation can be used to drive the network simulation. The key benefit here is that detailed modelling of cores, caches, and other components is not necessary, removing the need for full-system simulation.

3.4.2 Workload Modelling and Synthetic Benchmarks

Cloning can mimic workload behaviour by creating a reduced representation of the code [3, 17]. Much of this work focuses on cloning cache behaviour; our work can be viewed as creating clones of cache coherence behaviour to stimulate the network. Creation of synthetic benchmarks for multi-threaded applications
has been explored [13]; this work generates instruction streams that execute in simulation or on real hardware. Our work differs as we reproduce communication patterns and coherence behaviour while abstracting away the processor and instruction execution. MinneSPEC [19] provides reduced input sets that effectively match the reference input for SPEC2000.
Chapter 4

Time-Varying Application Behaviour

If traffic were monotonous and predictable, NoC design would be simple. Unfortunately, real applications exhibit time-varying behaviour that significantly impacts the packets injected into a network. This chapter introduces previous research on how applications vary with time and demonstrates how to capture this behaviour for both modelling and generating traffic.

4.1 Phase Behaviour in Applications

Previous research shows that applications go through phases [33]. These phases can have a significant impact on the instructions per cycle (IPC), miss rates, and prediction rates of various microarchitectural components. Researchers can, in turn, exploit this phase behaviour to their advantage. One example is SimPoint [34], a methodology that proposes to simulate only small but representative parts of an application to reduce simulation time. Phase behaviour continues to be apparent in parallel applications, and can have significant effects on the time-varying behaviour of traffic generated by an application [43, 16]. This is important, as our methodology will need to capture this phase behaviour if it intends to realistically generate synthetic traffic. The time-varying behaviour of real application traffic makes it difficult to model on the whole. Instead, we propose looking at the data at both a macro- (millions or billions of cycles) and micro- (thousands or hundreds of thousands of cycles) level granularity. At each level, we divide the traffic into fixed-size time intervals (see Definition 1) in a divide-and-conquer approach that makes modelling the data more manageable. Once divided, we can group similar intervals that occur at different times in the application.

Definition 1. An interval (I) is a span of cycles (C) ranging from C_i to C_(i+Δ), such that Δ > 0 and therefore C_i < C_(i+Δ). Any two intervals cannot overlap and there is no gap between subsequent intervals.
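As a concrete sketch of Definition 1 (illustrative only; the function name and data are ours, not part of the thesis toolchain), packets can be bucketed by injection cycle into fixed-size intervals that neither overlap nor leave gaps:

```python
from collections import Counter

def split_into_intervals(packet_cycles, delta):
    """Assign each packet (identified by its injection cycle) to a
    fixed-size interval [i*delta, (i+1)*delta). Intervals never
    overlap and have no gaps, per Definition 1."""
    counts = Counter(cycle // delta for cycle in packet_cycles)
    n_intervals = max(counts) + 1 if counts else 0
    # Return packets per interval, keeping empty intervals explicit.
    return [counts.get(i, 0) for i in range(n_intervals)]

# Packets injected at these cycles, with delta = 100 cycles:
per_interval = split_into_intervals([5, 40, 150, 160, 170, 420], delta=100)
print(per_interval)  # [2, 3, 0, 0, 1]
```

The per-interval counts produced here are exactly the simplest (one-dimensional) feature vector discussed in Section 4.2.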
Figure 4.1 highlights two macro-intervals that would, from a visual standpoint, appear to have similar behaviour. We can analyze each interval by characterizing its traffic behaviour with a Feature Vector (a
vector of elements, or features. An example of a feature could be the injection rate). The same features are used for each interval, but their magnitudes can (and likely will) differ. In the figure we see two feature vectors, V1 and V2, representing the two intervals.

[Figure 4.1: A high level view of clustering at a macro-level granularity. The y-axis shows packets injected per time bin of 500,000 cycles.]

Once feature vectors have been constructed for each interval, we can mathematically determine those that are similar to each other. One method is to calculate the distance between them. There are many distance measures available, and this dissertation focuses on Euclidean distance. By calculating the distance between each pair of feature vectors, we can compile a distance matrix to compare all intervals to each other. From Figure 4.1, if our feature is simply the total number of packets injected (i.e. the y-axis), then the distance between V1 and V2 is small. Now that intervals can be compared using a quantitative metric, we can group them together into phases (Definition 2) using a method called clustering. This form of statistical analysis provides a methodology for detecting traffic phases in an application. The efficacy of this methodology relies on the clustering algorithm and feature vector used. There are several clustering algorithms available; however, the feature vector can have the greatest impact on which intervals constitute which phases. Section 4.2 explores feature vectors in more detail, and Section 4.3 discusses how to reproduce these intervals using relevant clustering approaches and Markov chains.

Definition 2. A traffic phase is a group (or cluster) of intervals that behave in a similar manner. Traffic phases are typically recurring, but not necessarily periodic.
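The distance-matrix step can be sketched as follows (a hypothetical illustration; the feature values are invented, and the thesis's actual tooling is not shown). Each interval is a feature vector, and the matrix of pairwise Euclidean distances is what a clustering algorithm would consume:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def distance_matrix(vectors):
    """Pairwise distances between interval feature vectors; intervals
    at small distance are candidates for the same traffic phase."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

# One-feature vectors: total packets injected per interval (invented values).
V = [(5200,), (5150,), (900,)]
D = distance_matrix(V)
print(D[0][1])  # 50.0   -> V1 and V2 would likely cluster into one phase
print(D[0][2])  # 4300.0 -> V3 behaves very differently
```

In practice the matrix (or the vectors directly) would be handed to a clustering algorithm such as k-means; this sketch only shows the comparison step.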
4.2 Feature Vector Design

The real key to an effective clustering technique comes from defining similar behaviour. What does it mean that one interval behaves similarly to another? The answer to this question will affect the design of the feature vector used. With respect to network traffic, the elements of a feature vector would describe communication behaviour. In this section, we introduce and discuss five different feature vectors, and in Chapter 6 we evaluate four of the five:

1. Total Injection - packets injected by all nodes
2. Coherence Composition (not evaluated) - packets divided by coherence type for all nodes
3. Node Injection - packets injected for each source
4. Row-Column Flow - packets injected for a group of source nodes to a group of destination nodes
5. Per-Node Flow - packets injected for each source-destination pair

An important note when designing a feature vector, since we use Euclidean distance, is to ensure all elements are on the same scale (or have the same unit of measurement). The reason for this is evident when calculating the distance between vectors. One element that is on a larger scale than another can eclipse it completely. For example, consider the injection rate of an interval to be in the range of [0, 1] as one element, and the total number of write requests in an interval to be in the range of [0, 40] (hypothetically) as another. Let's observe what happens between two intervals that use both these elements in a feature vector:

Vector = (injection rate, write requests)
A = (0.001, 3)
B = (0.016, 15)

Euclidean Distance = sqrt( Σ_{i=1}^{n} (q_i − p_i)² )
                   = sqrt( (0.016 − 0.001)² + (15 − 3)² )
                   = sqrt( 0.000225 + 144 )
                   ≈ 12.00

From above, we see that even though there is a 16× difference between injection rates (versus 5× for write requests), it has almost no effect on the Euclidean distance (the result is similar for Manhattan distance). Therefore, features of the vector should not greatly differ in their empirically observed magnitudes.
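The arithmetic above can be checked directly (same example values; the rounding is ours):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

A = (0.001, 3)   # (injection rate, write requests)
B = (0.016, 15)

# The full distance is dominated entirely by the write-request term.
print(round(euclidean(A, B), 4))          # 12.0
# Dropping the injection-rate feature barely changes the result:
print(round(euclidean(A[1:], B[1:]), 4))  # 12.0
```

The injection-rate term contributes 0.000225 to the sum versus 144 from the write-request term, which is the eclipsing effect the text describes.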
In addition to ensuring elements have the same unit of measure, it is sometimes tempting to use large feature vectors with many elements. However, such an approach is not always feasible. For one, there is the so-called curse of dimensionality, where the data required to populate the vectors is insufficient for the size of the vector [4]; furthermore, simply parsing the data to construct these feature vectors can take a long time. Feature vectors with a high dimensionality should be reserved for large data sets.
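One common remedy for the scaling problem (our suggestion for illustration, not a technique prescribed by the thesis) is to rescale each feature to a common range, such as [0, 1], before computing distances:

```python
def min_max_normalize(vectors):
    """Rescale each feature (column) to [0, 1] across all intervals so
    that no single feature dominates the Euclidean distance."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(vec, lo, hi))
            for vec in vectors]

# Raw vectors: injection rate in [0, 1], write requests in [0, 40].
raw = [(0.001, 3), (0.016, 15), (0.010, 40)]
norm = min_max_normalize(raw)
print(norm[0])  # (0.0, 0.0) -- both features now contribute comparably
```

After rescaling, differences in injection rate and in write requests carry similar weight in the distance measure, which is the property the text asks feature designers to preserve.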
4.2.1 Total Injection

A simple, one-dimensional feature vector that can characterize communication behaviour is the total number of packets injected by all nodes in the network. This feature vector allows us to differentiate between intervals that are experiencing high, low, or in-between levels of communication. The benefit of such a feature vector is that it is easy to create; calculating the total number of packets in an interval is a simple subtraction operation. In addition, because it is one-dimensional, calculating the distance between vectors and running through the clustering algorithm is also fast.

The disadvantages of the Total Injection feature vector are rooted in its simplicity. The total number of packets tells us nothing about the spatiality of the traffic behaviour. That is, even though two vectors may have similar magnitudes, their respective intervals could exhibit different spatial behaviour, such as hot spots. In addition, the vector does not tell us what types of messages are being injected. For example, one interval could be issuing several read requests to retrieve data, while another is issuing write requests that invalidate sharers. A one-dimensional vector is more suited to large datasets where constructing complex feature vectors takes too much time and has too much overhead. Because it does not capture several characteristics of communication behaviour, it can only give us a rough view of which intervals may be similar.

4.2.2 Coherence Composition

During full-system simulation of a coherent, shared memory architecture, packets are injected according to a cache coherence protocol. Therefore each packet is associated with a message type (Definition 3) of the protocol. Using this information we can discern between read and write phases of an application, among other things.

Definition 3. A message type is used to describe the category a packet belongs to.
The category (or type) reveals the reason a packet has been sent to a destination, allowing the cache coherence state machine to react accordingly. Consider intervals at the micro-scale (hundreds of cycles). It is likely that a write request interval is accompanied by several invalidate messages to notify sharers that the cache line has changed. Similarly, a read request interval would not be associated with invalidate messages because the cache line is not updated. Intuitively, we would group write request and read request intervals into two separate phases, and a coherence composition feature vector does this well. Figure 4.2 shows the composition of three different message types (reads, writes, and invalidates) across 25 consecutive intervals (data taken from the Swaptions application). It is easy to see the distinction between a read phase (intervals 2-11) and a write phase (intervals 14-25). During the early part of the write phase (intervals 14-18), several invalidate messages are sent to update sharers on the status of the cache line. The dimensionality of a coherence composition vector can be small. In a cache coherence protocol, there are several message types that rely on others in a one-to-one mapping. For example, invalidate packets are typically followed by acknowledgement packets to complete a handshake. We can therefore exclude acknowledgement packets from the feature vector because they are already encompassed by the
invalidate feature. This example can apply to several message types, and therefore only the message types most representative of cache coherence behaviour need to be used. The advantage of this feature vector is that its dimensionality can be kept low, and also constant (i.e. it does not scale with any parameter, such as the number of nodes in the network) according to the protocol being used. It allows us to intuitively distinguish between intervals that are exhibiting different behaviour according to the cache coherence protocol. However, the feature vector does not include information about the spatial behaviour of the application. Therefore, it ignores potential hot spot information, which can have a significant impact on performance during simulation. Coherence Composition is most useful at the micro level to differentiate between short phases that may occur due to short loops and control flows in the code. Coherence Composition can also be useful at the macro level; however, depending on the size and number of intervals it can introduce a lot of overhead because each packet in the interval needs to be analyzed and sorted by message type.

[Figure 4.2: Number of read, write, and invalidate packets over a 25-interval portion of the Swaptions benchmark.]

4.2.3 Node Injection

If we look at the injection distribution across N nodes in the system, we can construct a feature vector that includes the spatial characteristics of application traffic. This feature vector is similar to Total Injection in how the number of packets is counted; however, it now scales with N because a dimension exists for each node (Figure 4.3). The spatial injection distribution helps identify injecting hot spots, that is, nodes that send a lot of packets. But hot spots can also exist at a destination, that is, nodes that receive a lot of packets.
Intuitively, a node that sends a lot of packets has likely received a lot of packets as well (following the request-response mantra common in cache coherence protocols, which we will see in Chapter 5). However, the relationship between sent and received messages is lost. That is, a Node Injection feature vector cannot tell us which nodes are communicating with each other; the feature vector only tells us which nodes are communicating more than others. At the micro-scale, Node Injection is not ideal because during low-communication phases of an application several nodes may be injecting zero packets. This would skew distance measures and classify several intervals as similar, even if they had different injecting hot spots. Take, for example, a 3-node
architecture. In one interval (I1), all packets are injected from Node 1 (30 packets). In a second interval (I2), all packets are injected from Node 3 (25 packets). And in a third interval (I3), all packets are injected from Node 2 (28 packets). From the Euclidean distance matrix (Table 4.1) we can see that despite different nodes being responsible for injection, the difference between distances is small (approximately ±2, with similar deltas using Manhattan distance). That is, intervals are considered similar if they have at least a one-node hot spot (in this example).

[Figure 4.3: How nodes map to the Node Injection feature vector: <N1, N2, ..., N9>.]

       I1      I2      I3
I1     0.00    39.05   41.04
I2     39.05   0.00    37.54
I3     41.04   37.54   0.00

Table 4.1: A distance matrix for three example vectors.

We can improve the scaling of the Node Injection feature vector by looking at rows and columns instead of individual nodes [16]. That is, we observe the total number of packets injected by each row of nodes, and each column of nodes, to construct a feature vector that scales with 2√N. Because we look at both the rows and the columns, there is overlap in the elements of the vector, which can tell us more specifically where the hot spot is (although not as specifically as using a per-node vector). Node Injection is better suited for macro-scale clustering where more packets ensure a more populated feature vector. This way all elements have a magnitude that can influence the distance measure, allowing the clustering algorithm to more accurately group similar intervals.

4.2.4 Row-Column Flow

The Row-Column Flow feature vector captures the spatial behaviour of traffic, as well as the relationship between sent and received messages across rows and columns. Each element of the vector corresponds to the number of packets sent by one group of nodes (a row) to another group of nodes (a column), shown
in Figure 4.4. We use the words row and column to make the vector easier to understand; the actual mapping of nodes onto the network does not have to be grid-like.

[Figure 4.4: A visual demonstration of how Row-Column Flow combinations map to a feature vector. In this example, Row 1 and Column 3 make up the third source-destination pair.]

For an N-node system, the vector scales at a rate of N, just like the Node Injection vector. However, because Row-Column Flow vectors contain both source and destination information (albeit aggregated into rows and columns), they can more accurately detect hot spots that occur during simulation.

4.2.5 Per-Node Flows

Capturing the relationship between each and every node creates a large feature vector that scales at a rate of N². This method counts the number of packets being sent from each source to each destination, which we define as Per-Node Flows (Definition 4).

Definition 4. A flow is a source-destination pair [16]. In an N-node network, there are N² flows.

The Per-Node Flow feature vector reveals communication behaviour at the finest granularity, and can help identify the exact location of hot spots by not aggregating information across several nodes. Because of its size, the vector should only be used when sufficient data is available to populate each element (otherwise a situation similar to Table 4.1 will occur).

4.2.6 Summary

We have introduced five feature vectors with different advantages and disadvantages depending on the number of nodes or packet information available (a summary can be found in Table 4.2). The vectors are used to characterize the communication behaviour of an interval (recall Definition 1). In Section 4.3, we discuss a methodology for recreating these intervals during software simulation. In our evaluation (Chapter 6), we will compare the Total Injection, Node Injection, Row-Column Flow, and Per-Node Flow vectors.
We omit Coherence Composition because, in our experiments, it did not accurately
More informationApplication Performance Testing Basics
Application Performance Testing Basics ABSTRACT Todays the web is playing a critical role in all the business domains such as entertainment, finance, healthcare etc. It is much important to ensure hassle-free
More informationCONTINUOUS scaling of CMOS technology makes it possible
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006 693 It s a Small World After All : NoC Performance Optimization Via Long-Range Link Insertion Umit Y. Ogras,
More informationPerformance of networks containing both MaxNet and SumNet links
Performance of networks containing both MaxNet and SumNet links Lachlan L. H. Andrew and Bartek P. Wydrowski Abstract Both MaxNet and SumNet are distributed congestion control architectures suitable for
More informationQuality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?
18-345: Introduction to Telecommunication Networks Lectures 20: Quality of Service Peter Steenkiste Spring 2015 www.cs.cmu.edu/~prs/nets-ece Overview What is QoS? Queuing discipline and scheduling Traffic
More informationEnhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data
White Paper Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data What You Will Learn Financial market technology is advancing at a rapid pace. The integration of
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationComputer Network. Interconnected collection of autonomous computers that are able to exchange information
Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.
More informationPhotonic Networks for Data Centres and High Performance Computing
Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationCHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL
CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter
More informationModule 7. Routing and Congestion Control. Version 2 CSE IIT, Kharagpur
Module 7 Routing and Congestion Control Lesson 4 Border Gateway Protocol (BGP) Specific Instructional Objectives On completion of this lesson, the students will be able to: Explain the operation of the
More informationIn-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
More informationTRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy
TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy John Jose, K.V. Mahathi, J. Shiva Shankar and Madhu Mutyam PACE Laboratory, Department of Computer Science and Engineering
More informationAN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK
Abstract AN OVERVIEW OF QUALITY OF SERVICE COMPUTER NETWORK Mrs. Amandeep Kaur, Assistant Professor, Department of Computer Application, Apeejay Institute of Management, Ramamandi, Jalandhar-144001, Punjab,
More informationPerformance Analysis of Storage Area Network Switches
Performance Analysis of Storage Area Network Switches Andrea Bianco, Paolo Giaccone, Enrico Maria Giraudo, Fabio Neri, Enrico Schiattarella Dipartimento di Elettronica - Politecnico di Torino - Italy e-mail:
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationDynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks
Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Benjamin Schiller and Thorsten Strufe P2P Networks - TU Darmstadt [schiller, strufe][at]cs.tu-darmstadt.de
More informationHow To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
More informationPerformance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
More informationJoint ITU-T/IEEE Workshop on Carrier-class Ethernet
Joint ITU-T/IEEE Workshop on Carrier-class Ethernet Quality of Service for unbounded data streams Reactive Congestion Management (proposals considered in IEE802.1Qau) Hugh Barrass (Cisco) 1 IEEE 802.1Qau
More informationΤΕΙ Κρήτης, Παράρτηµα Χανίων
ΤΕΙ Κρήτης, Παράρτηµα Χανίων ΠΣΕ, Τµήµα Τηλεπικοινωνιών & ικτύων Η/Υ Εργαστήριο ιαδίκτυα & Ενδοδίκτυα Η/Υ Modeling Wide Area Networks (WANs) ρ Θεοδώρου Παύλος Χανιά 2003 8. Modeling Wide Area Networks
More informationLustre Networking BY PETER J. BRAAM
Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information
More informationLoad Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
More informationCloudAnalyst: A CloudSim-based Tool for Modelling and Analysis of Large Scale Cloud Computing Environments
433-659 DISTRIBUTED COMPUTING PROJECT, CSSE DEPT., UNIVERSITY OF MELBOURNE CloudAnalyst: A CloudSim-based Tool for Modelling and Analysis of Large Scale Cloud Computing Environments MEDC Project Report
More informationPerformance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications
Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev (skounev@ito.tu-darmstadt.de) Information Technology Transfer Office Abstract Modern e-commerce
More informationNetwork Architecture and Topology
1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches and routers 6. End systems 7. End-to-end
More informationVMWARE WHITE PAPER 1
1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the
More informationSDN and FTTH Software defined networking for fiber networks
SDN and FTTH Software defined networking for fiber networks A new method to simplify management of FTTH networks What is SDN Software Defined Networking (SDN) revolutionizes service deployment and service
More informationDetecting Network Anomalies. Anant Shah
Detecting Network Anomalies using Traffic Modeling Anant Shah Anomaly Detection Anomalies are deviations from established behavior In most cases anomalies are indications of problems The science of extracting
More informationECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers
ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers Christina Delimitrou 1, Sriram Sankar 2, Aman Kansal 3, Christos Kozyrakis 1 1 Stanford University 2 Microsoft 3
More information- Nishad Nerurkar. - Aniket Mhatre
- Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,
More informationMonitoring Large Flows in Network
Monitoring Large Flows in Network Jing Li, Chengchen Hu, Bin Liu Department of Computer Science and Technology, Tsinghua University Beijing, P. R. China, 100084 { l-j02, hucc03 }@mails.tsinghua.edu.cn,
More informationSolving I/O Bottlenecks to Enable Superior Cloud Efficiency
WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one
More informationEfficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols
Universitat Politècnica de València Master Thesis Efficient Built-In NoC Support for Gather Operations in Invalidation-Based Coherence Protocols Author: Mario Lodde Advisor: Prof. José Flich Cardo A thesis
More informationD1.1 Service Discovery system: Load balancing mechanisms
D1.1 Service Discovery system: Load balancing mechanisms VERSION 1.0 DATE 2011 EDITORIAL MANAGER Eddy Caron AUTHORS STAFF Eddy Caron, Cédric Tedeschi Copyright ANR SPADES. 08-ANR-SEGI-025. Contents Introduction
More informationCHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.
More informationA Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing
A Scalable Network Monitoring and Bandwidth Throttling System for Cloud Computing N.F. Huysamen and A.E. Krzesinski Department of Mathematical Sciences University of Stellenbosch 7600 Stellenbosch, South
More information