COMMUNICATION PERFORMANCE EVALUATION AND ANALYSIS OF A MESH SYSTEM AREA NETWORK FOR HIGH PERFORMANCE COMPUTERS



PLAMENKA BOROVSKA, OGNIAN NAKOV, DESISLAVA IVANOVA, KAMEN IVANOV, GEORGI GEORGIEV
Computer Systems Department
Technical University of Sofia
8 Kliment Ohridski Boul., 1756 Sofia
BULGARIA
pborovska@tu-sofia.bg, nakov@tu-sofia.bg, d_ivanova@tu-sofia.bg, kamenveselinov@gmail.com, george.rusiichev@gmail.com
http://cs-tusofia.eu/

Abstract: - The design of interconnection networks and of switch architectures is significantly influenced by contemporary supercomputer technology. As technology evolves, its impact on interconnection networks needs to be reevaluated and elaborated. In this paper we take a step in this direction and present a performance analysis of a high-speed 4x4 switch design and of an interconnection network in mesh topology, based on computer simulations using OMNeT++. The suggested models have been verified on the basis of program implementations on an IBM HS21 blade center.

Key-Words: High Speed Design, Interconnect Network, 2DMesh network architecture, Simulation, Traffic pattern, Performance Analysis

1 Introduction

The constantly increasing demand for powerful computing resources such as supercomputers and clusters (collections of computers that are tightly interconnected via a high-speed network) motivated our research in the area of high-speed networks. The goal of this article is to evaluate the communication performance of the switch design, and of the mesh network built on the basis of this switch, via simulations using the discrete event simulator OMNeT++. The simulation experiments model a high-performance switch and system area network design for supercomputers, which connects nodes in a 2D mesh network architecture, implements dimension-order routing (DOR), and uses wormhole flow control.
This topology is also preferred in state-of-the-art chips such as the tiled network-on-chip (NoC) architecture implemented in Tilera's product line. The simulation is built on the OMNeT++ platform because of its efficiency in modeling queuing networks, its popularity in academia, its plentiful online documentation, and its extensibility (due to its open source model).

2 Network Architecture and Simulation Methodology

The case study under investigation is the mesh network architecture. Three kinds of meshes are distinguished: one-dimensional meshes (also called chains), two-dimensional meshes (2-D meshes, grids), and three-dimensional meshes (3-D meshes). The 2-D mesh (Fig. 1) is the one used for the network model.

Fig. 1: Simulation Methodology (numbered stages; recoverable labels: 2 High Speed Interconnect Network, 3 OMNeT++ Simulation, 4 Performance Analysis)

2.1 Motivation for the grid architecture (2Dmesh)

The decision for a 2D mesh is based on the following key characteristics of this architecture, most of which are advantages over other network topologies.

ISSN: 1790-2769 217 ISBN: 978-960-474-188-5

2.1.1 Main Advantages:

- Very good scalability: four bidirectional links handle all communications of a 2-D mesh node, and the number of links per node does not change if additional nodes are added to the mesh.
- Simple and cost-effective implementation: a mesh network has fewer links per node than most other architectures.
- Simple routing: for instance, the packet header carries the destination information as x and y (in the two-dimensional case), representing the distance to the destination node in the x direction (horizontally) and the y direction (vertically), respectively. Packets may then be forwarded in the x direction first. The sign of x determines whether the positive or the negative direction must be chosen. Each intermediate node decrements or increments x. When x = 0 is reached, the packet is forwarded in the y direction in the same manner; y = 0 means that the destination is reached.

2.1.2 Disadvantages

The major shortcoming, and the main disadvantage of 2D meshes, is the blocking behavior of the network. Messages usually pass through several nodes and links before they reach their destination, so blocking occurs when several packet paths demand the same link. Blocking can be reduced if communication is mainly local, e.g., if the tasks that communicate most intensely are placed on nearest-neighbor nodes, so that messages are exchanged only between nodes located close together.

2.2 OMNeT++ Tool and Simulation techniques

OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, with an Eclipse-based IDE and a graphical runtime environment. An OMNeT++ model consists of modules that communicate with each other by message passing. The active modules are termed simple modules; they are written in C++ using the simulation class library.
There are also extensions for real-time simulation, network emulation, alternative programming languages (Java, C#), database integration, and SystemC integration. Simple modules can be grouped into compound modules, and so forth; the number of hierarchy levels is not limited. Messages can be sent either via connections that span between modules or directly to their destination modules.

OMNeT++ also provides support for parallel simulation execution. Very large simulations may benefit from the parallel distributed simulation (PDES) feature, either through speedup or through distributing memory requirements. If the simulation requires several gigabytes of memory, distributing it over a cluster may be the only way to run it. To obtain a speedup (rather than a slowdown, which is also easily possible), the parallel hardware of the cluster should have low latency and the model should have inherent parallelism.

2.3 Parallel hardware platform

The experimental framework is based on an IBM BladeCenter, which consists of three HS21 blade servers (Xeon Quad Core E5405, 80 W, 2.00 GHz/1333 MHz/12 MB L2) and an IBM System Storage DS3400 disk subsystem.

3 Simulation Models and Performance Analysis

The switch architecture shown in Fig. 2 consists of four equivalent modules, one for each direction. Each has an input multiplexer, which selects either the input from the neighboring switch or the input from the attached host (the latter with lower priority). The input register stores one flit and extracts its routing information. This architecture uses DOR with X-first as the deadlock-avoidance routing algorithm:

- Defined in the header flit:
  dx = destinationx - sourcex
  dy = destinationy - sourcey
- Routing decision at each hop:
  dx = 0 and dy = 0 : deliver to Host
  dy > 0 : dy = dy - 1 and move to South
  dy < 0 : dy = dy + 1 and move to North
  dx > 0 : dx = dx - 1 and move to East
  dx < 0 : dx = dx + 1 and move to West

The next element is DMUX_Host, which forwards the flit to the host if dx = 0 and dy = 0 after it has passed the routing function.
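The routing rules listed above can be illustrated with a minimal Python sketch. It is not the authors' OMNeT++ code. The function names `dor_x_first` and `route` are hypothetical, and the sketch assumes that the X offset is resolved before the Y offset, as the "X-first" wording implies; the order in which the rules are printed in the listing is not taken as a priority order.

```python
from typing import List, Tuple


def dor_x_first(dx: int, dy: int) -> Tuple[str, int, int]:
    """Perform one routing step and return (output, new_dx, new_dy).

    Assumption: the X offset is consumed before the Y offset (X-first).
    Direction names follow the paper: dx > 0 East, dx < 0 West,
    dy > 0 South, dy < 0 North.
    """
    if dx == 0 and dy == 0:
        return "Host", 0, 0
    if dx > 0:
        return "East", dx - 1, dy
    if dx < 0:
        return "West", dx + 1, dy
    if dy > 0:
        return "South", dx, dy - 1
    return "North", dx, dy + 1


def route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """List the output chosen at every hop from src to dst."""
    # Offsets as defined in the header flit.
    dx = dst[0] - src[0]
    dy = dst[1] - src[1]
    hops = []
    while True:
        port, dx, dy = dor_x_first(dx, dy)
        hops.append(port)
        if port == "Host":
            return hops


print(route((0, 0), (2, 1)))
```

Because every packet finishes its X movement before it turns into the Y dimension, no packet ever turns from Y back into X, which is the property that dimension-order routing relies on for deadlock avoidance in a mesh.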
If the flit is not addressed to the local host, it is delivered to one of the output queues (through the next demultiplexer), again depending on the routing decision described above. Sorting the input flit stream into four separate queues eliminates head-of-queue blocking. The queue outputs are connected to the output ports through a non-blocking crossbar, implemented with four multiplexers (one for each direction). The output ports are the common resource that the arbitration logic has to consider when removing a flit from the queues. A port should be marked as busy as soon as the header flit enters it and as available when the corresponding tail flit exits. The header, payload and tail flits that form one packet establish a virtual path across the ports and the queues. This path cannot be cut by other such virtual paths, which stay blocked until the common resources are free.

Fig. 2: 4x4 Architectural Design

Fig. 3: 4x4 simulation model view

The switch's components could be described as separate simple modules, but a better modeling strategy is to simplify the model by reducing their number and the number of connections between them. There are input register, queue, output port and auxiliary host arbiter modules; connecting instances of them yields the compound module shown in Fig. 3. Channels consist of data path and control links, which the editor (NED Editor) shows as a single line.

The compound module (the switch) has a global clock, which is invoked directly (without using the message mechanism) via the Module::Clock() virtual function that each module has. This makes it easy to define the sequence of the three-stage pipelined clock for input registers, queues and output ports. The next step is to write the modules' behavior in C++. Each module defines its own functionality, but the common strategy is that a module stores its input data/control state when it receives a message and forwards it to the next module on the system clock (when its Clock() method is invoked).

The Host, which is a separate simple module, is the traffic generator and provides traffic synchronization in our simulation model. It maintains its own clock, which is configured in each simulation experiment, giving different values of the offered bandwidth.

The two-dimensional mesh network, which connects hosts and switches, is shown in Fig. 4. The size of the mesh is a configurable parameter. Every host and switch is given coordinates, which are used as addresses in the routing algorithm. Connections are implemented as a separate object in the OMNeT++ model, the built-in datarate channel type. Besides the data rate itself, the datarate channel allows configuring a channel delay and a bit error rate (BER).
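To make the role of the three channel parameters concrete, the following Python sketch computes how long a flit occupies a link and how likely it is to be corrupted. It is an illustration under stated assumptions, not the OMNeT++ implementation: the function names are hypothetical, bit errors are assumed to be independent, and the 128-bit flit size, 1 Gbit/s data rate and 10 ns delay in the usage line are made-up example values that do not come from the paper.

```python
def arrival_time(bits: int, datarate_bps: float, delay_s: float) -> float:
    """Time from the start of transmission until the last bit arrives.

    Transmission duration is bits / datarate; the propagation delay
    of the channel is added on top.
    """
    return bits / datarate_bps + delay_s


def packet_error_probability(bits: int, ber: float) -> float:
    """Probability that at least one of `bits` bits is corrupted.

    Assumption: bit errors are independent, each with probability `ber`.
    """
    return 1.0 - (1.0 - ber) ** bits


# Hypothetical example values: 128-bit flit, 1 Gbit/s link, 10 ns delay.
print(arrival_time(128, 1e9, 10e-9))
print(packet_error_probability(128, 1e-9))
```

The sketch shows why longer packets are more sensitive to the BER setting: the error probability grows with the number of bits, while the delay contributes a constant term to the arrival time.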
The datarate channel thus makes it possible to simulate the real characteristics of a physical channel.

Traffic modeling is defined by three parameters: packet spatial distribution (traffic profile), packet injection rate, and packet size [2]. Traffic profiles for the design and analysis of the switch can be categorized into realistic and synthetic groups. Realistic traffic loads have been used to analyse the power and delay of different network architectures. Examples include GSM voice CODEC [6], SPLASH-2 [7], MediaBench [8], and SPEC [9] traffic profiles. It should be noted that the traffic patterns generated by different modules in a network strongly depend on the application for which the network is designed.

Fig. 4: 2D mesh network

Since the communication performance of the network is a function of the traffic profile, the most accurate way to assess the characteristics of the network would be to use the traffic profiles corresponding to the application. In many cases, the system is designed for multiple applications, and the traffic profiles corresponding to all applications should then be used during network design and analysis. This can be time consuming even if all the applications are known beforehand. As another option, synthetic traffic profiles that represent a class of applications may be used. This suggests that the use of both realistic and synthetic traffic profiles forms a complete set for the evaluation of the techniques proposed for a particular system.

Different synthetic traffic patterns have been used for evaluating interconnection networks. Uniform, Transpose, Bit-Complement, Bit-Reversal, Hotspot [7], and Self-similar [10] are the most widely used traffic models for the analysis of power and delay in interconnection networks. To describe the synthetic patterns, let each node (x, y) in the network design be labeled with an address resulting from the concatenation of the x and y indexes of the node. The m-bit binary representation of xy is $n_1 n_2 \ldots n_{m-1} n_m$.

Uniform Traffic: Each node sends messages to other nodes with an equal probability (i.e., destination nodes are chosen randomly using a uniform probability distribution function).

Hot-spot Traffic (Fig. 5): Each node sends messages to other nodes with an equal probability, except for a specific node (called Hotspot) which receives messages with a greater probability. The percentage of additional messages that a Hotspot node receives compared to the other nodes is indicated after the Hotspot name (e.g., Hotspot 10%).

Fig. 5: Hotspot traffic pattern

Transpose Traffic (Fig. 6): Each node sends messages only to the destination whose address has the upper and lower halves of the sender's address transposed, i.e., the destination whose address is given by $(n_{m/2+1} \ldots n_m\, n_1 n_2 \ldots n_{m/2})$.

Fig. 6: Transpose traffic pattern

Complement Traffic (Fig. 7): Each node sends messages only to the one's complement of its own address, i.e., the destination whose address is given by $\{b_3, b_2, b_1, b_0\} \rightarrow \{\bar{b}_3, \bar{b}_2, \bar{b}_1, \bar{b}_0\}$.

Fig. 7: Complement traffic pattern
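The deterministic patterns (transpose, complement, and the bit-reversal pattern named in the list of synthetic models) are simple bit manipulations on the node address. The Python sketch below shows one possible way to compute them. All function names are hypothetical, the address is assumed to be the x index in the upper bits and the y index in the lower bits, and m is assumed to be even for the transpose.

```python
def node_address(x: int, y: int, bits_per_dim: int) -> int:
    """Concatenate the x and y indexes into one address (x in the upper bits)."""
    return (x << bits_per_dim) | y


def transpose(addr: int, m: int) -> int:
    """Swap the upper and lower halves of an m-bit address (m assumed even)."""
    half = m // 2
    upper = addr >> half
    lower = addr & ((1 << half) - 1)
    return (lower << half) | upper


def complement(addr: int, m: int) -> int:
    """One's complement of an m-bit address."""
    return ~addr & ((1 << m) - 1)


def bit_reversal(addr: int, m: int) -> int:
    """Reverse the order of the m address bits."""
    out = 0
    for _ in range(m):
        out = (out << 1) | (addr & 1)
        addr >>= 1
    return out


# Node (x=0, y=1) in a 4x4 mesh: 2 bits per dimension, m = 4.
src = node_address(0, 1, 2)
print(transpose(src, 4), complement(src, 4), bit_reversal(src, 4))
```

With this address layout the transpose pattern sends node (x, y) to node (y, x), so nodes on the mesh diagonal would address themselves; how such self-addressed traffic is handled is not stated in the text.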

Bit reversal Traffic (Fig. 8): Each node sends only to the address that is the bit reversal of the sender's address, i.e., the destination with address $(n_m n_{m-1} n_{m-2} \ldots n_3 n_2 n_1)$.

Fig. 8: Bit reversal traffic pattern

One of the most frequently used traffic distribution patterns (for generating destinations) is the uniform one. In addition, three other very popular patterns have been used in the simulation tests.

Chaos Normal Form (CNF) graphs display accepted traffic on one graph and network latency on a second graph. In both graphs, the X-axis corresponds to the normalized applied load. The Y-axis shows latency in Fig. 9 and accepted traffic (throughput) in Fig. 10. Latency increases as the offered traffic grows, because the contention for output ports, and hence the probability that a packet is blocked, increases. For the same reason, accepted traffic reaches a saturation point. In this experiment, the time between packets is configurable and exponentially distributed.

Fig. 10: Delivered traffic (throughput) vs. offered load

It should be noted that during a single simulation run, two phases can be distinguished. In the first phase, called the initial transient phase, the system model oscillates until a steady state (if it exists) is reached. The steady state represents the second phase. Some investigations aim to determine measures $E(Y \mid t = t_0)$ at a particular time $t_0$ (terminating simulation), while others are interested only in the steady state $E(Y \mid t \to \infty)$ (steady-state simulation). The latter is the case for the simulation model described above. In this case, values from the initial transient phase distort the results, particularly the confidence level. Therefore, determining the end of the initial transient phase and starting the observation of results in steady state improves the results, and it is the main task in steady-state simulation.
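The effect of the initial transient phase on a measured mean can be illustrated with a short Python sketch that simply discards all observations recorded before a chosen warm-up time. The function name `steady_state_mean`, the `warmup_time` parameter and the sample values are hypothetical. The text does not say how the end of the transient phase was determined in the experiments, so the threshold here is assumed to be supplied by the analyst.

```python
from statistics import mean
from typing import Iterable, Tuple


def steady_state_mean(samples: Iterable[Tuple[float, float]],
                      warmup_time: float) -> float:
    """Mean of the observed values, ignoring the initial transient phase.

    samples: (simulation_time, value) pairs, e.g. packet latencies.
    warmup_time: assumed end of the transient phase (chosen by the analyst).
    """
    kept = [value for time, value in samples if time >= warmup_time]
    if not kept:
        raise ValueError("no observations after the warm-up period")
    return mean(kept)


# Made-up latency samples: the first two belong to the transient phase.
latencies = [(0.0, 50.0), (1.0, 30.0), (2.0, 10.0), (3.0, 12.0), (4.0, 11.0)]
print(steady_state_mean(latencies, warmup_time=2.0))  # transient discarded
print(steady_state_mean(latencies, warmup_time=0.0))  # transient included
```

In this made-up data set, including the two transient samples roughly doubles the estimated mean, which is the kind of distortion the paragraph above warns about.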
The information presented above (the preconditions, the preparation and configuration of the simulation, and the simulation results) can be used to achieve maximum efficiency of the whole system, i.e., the complete switch, when it is used with particular applications in the selected domain.

Fig. 9: Latency vs. offered load

4 Conclusions and Future Work

In this paper we have presented the evaluation of the communication performance parameters (latency and bandwidth) of a system area network with 2-D mesh topology built upon the developed and specified switch architecture. The communication performance parameters are estimated on the basis of simulation models in the OMNeT++ network simulator environment, which have been run on an IBM HS21 blade center for case studies of several of the most popular communication patterns. OMNeT++ is a framework that offers a fast way to compare different designs using an equivalent, unified measurement model.

In future work, the performance metrics of extended architectures of the same type as described in this article, or of other architectures, can be evaluated, and the experiment scenarios can be diversified by applying workloads taken from real-world applications. Thanks to its open model, OMNeT++ can be connected to other external modules (for example traffic generators), and sinks can be implemented as an interface to real-world programs or to another simulation environment. The second main goal is the development of a reusable library of standard components (such as queues, multiplexers, demultiplexers, traffic generators and so forth) that supports different handshake methods. This will be very helpful for the fast and unified development of different configurations for different switches and topologies in the field of high-speed switch design.

ACKNOWLEDGEMENTS

The results reported in this paper are part of the research project DO02-115/2008, supported by the National Science Fund, Bulgarian Ministry of Education and Science.

References:
[1] Borovska, P. (2009) Computer systems. Sofia, Bulgaria: Ciela, ISBN 954-649-633-2 (in Bulgarian)
[2] Duato, J., Yalamanchili, S., Ni, L.M. (2002) Interconnection networks: an engineering approach. Morgan Kaufmann Publishers, ISBN 1-55860-852-4
[3] Varga, A., OMNeT++ version 4.0 User Manual, http://omnetpp.org/doc/omnetpp40/manual/usman.html/
[4] Tutsch, D. (1998) Performance Analysis of Network Architectures, Library of Congress Control Number: 2006929315, ISBN-10 3-540-34308-3, ISBN-13 978-3-540-34308-0, Springer Berlin Heidelberg New York
[5] D. Wu et al., Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection, Proceedings of Asia and South Pacific Conference on Design Automation (2006), pp. 36-41.
[6] S.C. Woo et al., The Splash-2 Programs: Characterization and Methodological Considerations, Proceedings of International Symposium on Computer Architecture (1995), pp. 24-36.
[7] C. Lee et al., Mediabench: a tool for evaluating and synthesizing multimedia and communications systems, Proceedings of the International Symposium on Microarchitecture (1997), pp. 330-335.
[8] The Standard Performance Evaluation Corporation. Available [online]: http://www.spec.org/.