COMMUNICATION PERFORMANCE EVALUATION AND ANALYSIS OF A MESH SYSTEM AREA NETWORK FOR HIGH PERFORMANCE COMPUTERS

PLAMENKA BOROVSKA, OGNIAN NAKOV, DESISLAVA IVANOVA, KAMEN IVANOV, GEORGI GEORGIEV
Computer Systems Department
Technical University of Sofia
8 Kliment Ohridski Boul., 1756 Sofia
BULGARIA
pborovska@tu-sofia.bg, nakov@tu-sofia.bg, d_ivanova@tu-sofia.bg, kamenveselinov@gmail.com, george.rusiichev@gmail.com
http://cs-tusofia.eu/

Abstract: - The design of an interconnection network and the architectural design of its switches are significantly influenced by contemporary supercomputer technology. As technology evolves, its impact on interconnection networks needs to be re-evaluated. In this paper we follow this direction and present a performance analysis of a high-speed 4x4 switch design and of an interconnection network in mesh topology, based on computer simulations using OMNeT++. The suggested models have been verified on the basis of program implementations on an IBM HS21 blade center.

Key-Words: High Speed Design, Interconnect Network, 2DMesh network architecture, Simulation, Traffic pattern, Performance Analysis

1 Introduction

The constantly increasing need for powerful computing resources such as supercomputers and clusters, which are collections of computers interconnected via a high-speed network, led us to start research in the area of high-speed networks. The goal of this article is to evaluate the communication performance of the switch design, and of the mesh network built from this switch, via simulations using the discrete event simulator OMNeT++. The simulation experiments model a high-performance switch and system area network design for supercomputers, which connects nodes in a 2D mesh network architecture, implements dimension-order routing (DOR), and uses wormhole flow control.
This topology is also preferred in the newest state-of-the-art chips, such as the tiled network-on-chip (NoC) architecture implemented in Tilera's product line. The simulation is built on the OMNeT++ platform because of its efficiency in simulating queuing networks, its popularity in academia, its plentiful online documentation and its extensibility (due to its open-source model).

2 Network Architecture and Simulation Methodology

The case study under investigation is the mesh network architecture. Three kinds of meshes are distinguished: one-dimensional meshes (also called chains), two-dimensional meshes (2-D meshes, grids), and three-dimensional meshes (3-D meshes). The 2-D mesh (Fig. 1) is the one used for the network model.

Fig. 1 Simulation Methodology (stages shown include: High Speed Interconnect Network, OMNeT++ Simulation, Performance Analysis)

2.1 Motivation for the grid architecture (2D mesh)

The decision for a 2D mesh is based on the following key characteristics of these architectures, which are mostly advantages over the other network topologies.

ISSN: 1790-2769 217 ISBN: 978-960-474-188-5

2.1.1 Main Advantages:

- Very good scalability: for instance, four bidirectional links handle all communications of a 2-D mesh node. The number of links per node does not change if additional nodes are added to the mesh.
- Simple and cost-effective implementation, because a mesh network has fewer links per node than most other architectures.
- Simple routing: for instance, the packet header includes the destination information as x and y (in the two-dimensional case), representing the distance to the destination node in the x direction (horizontally) and the y direction (vertically), respectively. Packets may then be forwarded in the x direction first. The sign of x determines whether the positive or the negative direction must be chosen. Each intermediate node decrements or increments x. When x = 0 is reached, the packet is forwarded in the y direction in the same manner. y = 0 means that the destination is reached.

2.1.2 Disadvantages

The major shortcoming is the blocking behavior of the network. Usually, messages pass through several nodes and links before they reach their destination. Blocking occurs when the paths of multiple packets require the same link. Blocking can be reduced if communication is mainly local, e.g., if the tasks that communicate most intensely are placed on nearest-neighbor nodes. Messages are then exchanged only between nodes located close together.

2.2 OMNeT++ Tool and Simulation techniques

OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, with an Eclipse-based IDE and a graphical runtime environment. An OMNeT++ model consists of modules that communicate with each other by message passing. The active modules are termed simple modules. They are written in C++ using the simulation class library.
There are also extensions for real-time simulation, network emulation, alternative programming languages (Java, C#), database integration and SystemC integration. Simple modules can be grouped into compound modules, and so forth; the number of hierarchy levels is not limited. Messages can be sent either via connections that span between modules or directly to their destination modules.

OMNeT++ also provides support for parallel simulation execution. Very large simulations may benefit from the parallel distributed simulation (PDES) feature, either by getting speedup or by distributing memory requirements. If the simulation requires several gigabytes of memory, distributing it over a cluster may turn out to be the only way to run it. To obtain a speedup (and not a slowdown, which is also easily possible), the parallel hardware of the cluster should have low latency and the model should have inherent parallelism.

2.3 Parallel hardware platform

The experimental framework is based on an IBM BladeCenter, which consists of three HS21 blade servers (Xeon Quad Core E5405, 80 W, 2.00 GHz/1333 MHz/12 MB L2) and an IBM System Storage DS3400 disk subsystem.

3 Simulation Models and Performance Analysis

The switch architecture shown in Fig. 2 consists of 4 equivalent modules, one for each direction. Each has an input multiplexer, which selects either the link from the neighbor switch or the link from the attached host (the latter with lower priority). The input register stores one flit and extracts its routing information. In this architecture, DOR with X-first ordering is used as the deadlock-avoidance routing algorithm:

- Define in the header flit:
    dx = destinationx - sourcex
    dy = destinationy - sourcey
- Routing decision at each hop (the x offset is resolved first):
    dx = 0 and dy = 0 : deliver to Host
    dx > 0 : dx = dx - 1 and move to East
    dx < 0 : dx = dx + 1 and move to West
    dx = 0 and dy > 0 : dy = dy - 1 and move to South
    dx = 0 and dy < 0 : dy = dy + 1 and move to North

The next element is DMUX_Host, which forwards the flit to the host if dx = 0 and dy = 0 after the routing function has been applied.
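The X-first routing rule described above can be illustrated with a short Python sketch. It is not the authors' OMNeT++ implementation: the function names, the port labels and the coordinate convention (y grows toward the south, as implied by the rule "dy > 0: move to South") are assumptions made for illustration only.

```python
def header_offsets(src, dst):
    """Offsets (dx, dy) written into the header flit at the source node."""
    return dst[0] - src[0], dst[1] - src[1]


def route_step(dx, dy):
    """Apply one X-first routing decision.

    Returns (output_port, new_dx, new_dy). Port names are illustrative.
    """
    if dx > 0:
        return "EAST", dx - 1, dy
    if dx < 0:
        return "WEST", dx + 1, dy
    if dy > 0:
        return "SOUTH", dx, dy - 1
    if dy < 0:
        return "NORTH", dx, dy + 1
    return "HOST", dx, dy  # dx == 0 and dy == 0: deliver to the local host


def trace_route(src, dst):
    """List the output ports taken hop by hop from src to dst."""
    dx, dy = header_offsets(src, dst)
    ports = []
    while True:
        port, dx, dy = route_step(dx, dy)
        ports.append(port)
        if port == "HOST":
            return ports
```

For example, `trace_route((0, 0), (2, 1))` takes two hops east, then one hop south, and finally delivers to the host. Because every packet finishes its x movement before it turns into the y dimension, no packet ever turns from y back to x, which is the property that dimension-order routing relies on to avoid deadlock in a mesh.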
If the flit is not addressed to the local host (dx and dy are not both zero), it is delivered to one of the output queues (through the next demultiplexer), depending again on the routing decision described above. Sorting the input flit stream into four separate queues eliminates head-of-queue blocking. The queues' outputs are connected to the output ports through a non-blocking crossbar, implemented with four multiplexers (one for each direction). The output ports are the shared resource that the arbitration logic has to consider when removing a flit from the queues. An output port is marked as busy as soon as the header flit enters it and as available when the corresponding tail flit exits. The header, payload and tail flits that form one packet thus establish a virtual path across the ports and the queues. This path cannot be cut by other such virtual paths, which stay blocked until the common resources are free.

Fig. 2 4x4 Architectural Design

Fig. 3: 4x4 simulation model view

The module's components can be described as separate simple modules, but a better modeling strategy is to simplify the model by reducing their number and the number of connections between them. The module contains an input register, a queue, an output port and an auxiliary host arbiter; connecting instances of them gives the compound module shown in Fig. 3. Channels consist of data-path and control links, which the NED editor shows as a single line. The compound module (the switch) has a global clock, which is invoked directly (without using the message mechanism) via the Module::Clock () virtual function that each module has. This makes it easy to define the sequence of the three clocked pipeline stages: input registers, queues and output ports.

The next step is to write the modules' behavior in C++. Each module defines its own functionality, but the common strategy is that a module stores its input data/control state when it receives a message and forwards it to the next module on the system clock (when its Clock() method is invoked). The host, which is a separate simple module, is the traffic generator and traffic synchronization in our simulation model. It maintains its own clock, which is configured in each simulation experiment, giving different values of the offered bandwidth.

The two-dimensional mesh network, which connects hosts and switches, is shown in Fig. 4. The size of the mesh is a configurable parameter. Every host and switch is given coordinates, which are used as addresses in the routing algorithm. Connections are implemented as a separate object in the OMNeT++ model, the built-in datarate channel type. Besides the data rate itself, a datarate channel allows a channel delay and a bit error rate (BER) to be configured.
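As a rough illustration of what the three channel parameters mean for a single flit or packet, the following Python sketch computes the time a frame needs to cross a link and the probability that it is corrupted. It is a simplified model written for this explanation, assuming independent bit errors; the function names and the numeric values in the usage note are hypothetical and are not taken from the paper's configuration.

```python
def transfer_time(bits, datarate_bps, delay_s):
    """Time until the last bit arrives: propagation delay + serialization time."""
    return delay_s + bits / datarate_bps


def frame_error_probability(bits, ber):
    """Probability that at least one of `bits` bits is flipped.

    Assumes bit errors are independent with probability `ber` each.
    """
    return 1.0 - (1.0 - ber) ** bits
```

With these assumptions, a 1000-bit frame on a 1 Gbit/s link with 1 microsecond of delay needs about 2 microseconds in total, and with a BER of 1e-6 roughly one frame in a thousand is hit by an error.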
The datarate channel makes it possible to simulate real characteristics of a physical channel.

Traffic modeling is defined by 3 parameters: packet spatial distribution (traffic profile), packet injection rate and packet size [2]. Traffic profiles for the design and analysis of the switch can be categorized into realistic and synthetic groups. Realistic traffic loads have been used to analyse the power and delay of different network architectures. Examples include GSM voice codec [6], SPLASH-2 [7], MediaBench [8], and SPEC [9] traffic profiles. It should be noted that the traffic patterns generated by different modules in a network strongly depend on the application for which the network is designed.

Fig. 4: 2D mesh network

Since the communication performance of the network is a function of the traffic profile, the most accurate way to assess the characteristics of the network would be to use the traffic profiles corresponding to the application. In many cases, the system is designed for multiple applications. In these cases, the traffic profiles corresponding to all applications should be used during the network design and analysis. This can be time consuming even if all the applications are known beforehand. As another option, synthetic traffic profiles, which can represent a class of applications, may be used. This suggests that the use of both realistic and synthetic traffic profiles forms a complete set for the evaluation of the techniques proposed for a particular system.

Different synthetic traffic patterns have been used for evaluating interconnection networks. Uniform, Transpose, Bit-Complement, Bit-Reversal, Hotspot [7], and Self-similar [10] are the most widely used traffic models for the analysis of power and delay in interconnection networks. To describe the synthetic patterns, let each node (x, y) in the network design be labeled with an address resulting from the concatenation of the x and y indexes of the node. The m-bit binary representation of xy is $n_1 n_2 \dots n_{m-1} n_m$.

Uniform Traffic: Each node sends messages to other nodes with an equal probability (i.e., destination nodes are chosen randomly using a uniform probability distribution function).

Hot-spot Traffic (Fig. 5): Each node sends messages to other nodes with an equal probability, except for a specific node (called the Hotspot), which receives messages with a greater probability. The percentage of additional messages that a Hotspot node receives compared to the other nodes is indicated after the Hotspot name (e.g., Hotspot 10%).

Fig. 5 Hotspot traffic pattern

Transpose Traffic (Fig. 6): Each node sends messages only to a destination with the upper and lower halves of its own address transposed, i.e., the destination whose address is given by $(n_{m/2}\, n_{(m/2)+1} \dots n_m\, n_1\, n_2 \dots n_{(m/2)-1})$.

Fig. 6 Transpose traffic pattern

Complement Traffic (Fig. 7): Each node sends messages only to the one's complement of its own address, i.e., the destination whose address is given by $\{b_3, b_2, b_1, b_0\} \rightarrow \{\bar{b}_3, \bar{b}_2, \bar{b}_1, \bar{b}_0\}$.

Fig. 7 Complement traffic pattern
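The destination functions of these synthetic patterns can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the traffic generator used in the paper: addresses are m-bit integers with an even m, the transpose function follows the prose description (upper and lower halves swapped), the hot-spot weight `1 + extra` is one possible reading of "Hotspot 10%", and all function names are hypothetical. The bit-reversal pattern, which the text defines next, is included for completeness.

```python
import random


def node_address(x, y, bits_per_dim):
    """Concatenate the x and y indexes into one address (x in the upper half)."""
    return (x << bits_per_dim) | y


def complement(addr, m):
    """One's complement of an m-bit address."""
    return ~addr & ((1 << m) - 1)


def transpose(addr, m):
    """Swap the upper and lower halves of an m-bit address (m assumed even)."""
    half = m // 2
    low = addr & ((1 << half) - 1)
    high = addr >> half
    return (low << half) | high


def bit_reverse(addr, m):
    """Reverse the order of the m address bits."""
    result = 0
    for _ in range(m):
        result = (result << 1) | (addr & 1)
        addr >>= 1
    return result


def uniform_destination(src, n_nodes, rng):
    """Pick any node other than the source with equal probability."""
    return rng.choice([n for n in range(n_nodes) if n != src])


def hotspot_destination(src, n_nodes, hotspot, extra, rng):
    """Like uniform, but the hotspot node gets weight 1 + extra (e.g. 0.10)."""
    candidates = [n for n in range(n_nodes) if n != src]
    weights = [1.0 + extra if n == hotspot else 1.0 for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

In a 4x4 mesh (m = 4, two bits per dimension), `transpose` maps node (1, 3) to node (3, 1), and `complement` maps address 0011 to 1100. Passing a seeded `random.Random` instance as `rng` keeps the random patterns reproducible between simulation runs.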

Bit reversal Traffic (Fig. 8): Each node sends messages only to the address that is the bit reversal of the sender's address, i.e., the destination with address $(n_m\, n_{m-1}\, n_{m-2} \dots n_3\, n_2\, n_1)$.

Fig. 8 Bit reversal traffic pattern

One of the most frequently used traffic distribution patterns (for generating destinations) is the uniform pattern. In addition, three other very popular patterns have been used in the simulation tests.

Chaos Normal Form (CNF) graphs display accepted traffic on one graph and network latency on a second graph. In both graphs, the X-axis corresponds to the normalized applied load. The Y-axis shows latency in Fig. 9 and accepted traffic (throughput) in Fig. 10. Latency increases as the offered traffic grows, because the contention for output ports, and with it the probability that a packet is blocked, increases. For the same reason, accepted traffic reaches a saturation point. In this experiment, the time between packets is configurable and exponentially distributed.

Fig. 10: Delivered traffic (throughput) vs. offered load

It should be noted that during a single simulation run, two phases can be distinguished. In the first phase, called the initial transient phase, the system model oscillates transiently until a steady state (if it exists) is reached. The steady state represents the second phase. Some investigations aim to determine measures $E(Y \mid t = t_0)$ at a particular time $t_0$ (called terminating simulation), while others are interested only in the steady state $E(Y \mid t \to \infty)$ (called steady-state simulation). The latter is the case for the simulation model described above. In this case, values from the initial transient phase distort the results, particularly the confidence level. Therefore, determining the initial transient phase and starting the observation of results only in the steady state improves the results, and it is the main task in steady-state simulation.
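One common way to act on this observation is to discard the samples collected during a warm-up period and to estimate the steady-state mean from the remaining samples, for instance with the method of batch means. The Python sketch below illustrates the idea; it is not the procedure used by the authors. The warm-up length is assumed to be chosen by the experimenter (for example by inspecting a plot of the samples), the normal quantile z = 1.96 is an approximation that is optimistic for a small number of batches, and all names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def batch_means(samples, n_batches):
    """Split samples into n_batches equal batches and return each batch mean."""
    size = len(samples) // n_batches
    if size == 0:
        raise ValueError("not enough samples for the requested number of batches")
    return [mean(samples[i * size:(i + 1) * size]) for i in range(n_batches)]


def steady_state_estimate(samples, warmup, n_batches=5, z=1.96):
    """Drop the first `warmup` samples, then return (mean, half_width).

    half_width is an approximate confidence-interval half width computed
    from the spread of the batch means.
    """
    steady = samples[warmup:]  # observations after the initial transient
    batches = batch_means(steady, n_batches)
    centre = mean(batches)
    half_width = z * stdev(batches) / sqrt(n_batches)
    return centre, half_width
```

If the first three latency samples of a run are 50, 30 and 20 while all later samples alternate between 10 and 12, including the first three values pulls the plain average well above 11, whereas `steady_state_estimate(samples, warmup=3)` recovers 11.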
The information presented above (the preconditions, the preparation and configuration of the simulation, and the simulation results) can be used to achieve maximum efficiency of the whole system, i.e. the complete switch, when it is used with particular applications in the selected domain.

Fig. 9: Latency vs. offered load

4 Conclusions and Future Work

In this paper we have presented an evaluation of the communication performance parameters (latency and bandwidth) of a system area network with a 2-D mesh topology built upon the developed and specified switch architecture. The communication performance parameters are estimated on the basis of simulation models in the OMNeT++ network simulator environment, which were run on an IBM HS21 blade center for several of the most popular communication patterns. OMNeT++ is a framework that offers a fast way to compare different designs using an equivalent, unified measurement model.

In future work, performance metrics of extended architectures of the same type as described in this article, or of other architectures, can be evaluated, and the experiment scenarios can be diversified by applying workloads taken from real-world applications. Thanks to its open model, OMNeT++ can be connected to other external modules; for example, traffic generators and sinks can be implemented as an interface to real-world programs or to another simulation environment. The second main goal is the development of a reusable library of standard components (such as queues, multiplexers, demultiplexers, traffic generators and so forth) that supports different handshake methods. This will be very helpful for the fast and unified development of different configurations for different switches and topologies in the field of high-speed switch design.

ACKNOWLEDGEMENTS

The results reported in this paper are part of the research project DO02-115/2008, supported by the National Science Fund, Bulgarian Ministry of Education and Science.

References:
[1] Borovska, P., (2009) Computer systems. Sofia, Bulgaria: Ciela, ISBN 954-649-633-2 (in Bulgarian)
[2] Duato, J., Yalamanchili, S., Ni, L. M., (2002) Interconnection networks: an engineering approach. Morgan Kaufmann Publishers, ISBN 1-55860-852-4
[3] Varga, A., OMNeT++ version 4.0 User Manual, http://omnetpp.org/doc/omnetpp40/manual/usman.html/
[4] Tutsch, D., (1998) Performance Analysis of Network Architectures. Springer Berlin Heidelberg New York, Library of Congress Control Number: 2006929315, ISBN-10 3-540-34308-3, ISBN-13 978-3-540-34308-0
[5] D. Wu et al., Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection, Proceedings of Asia and South Pacific Conference on Design Automation (2006), pp. 36-41.
[6] S.C. Woo et al., The Splash-2 Programs: Characterization and Methodological Considerations, Proceedings of International Symposium on Computer Architecture (1995), pp. 24-36.
[7] C. Lee et al., Mediabench: a tool for evaluating and synthesizing multimedia and communications systems, Proceedings of the International Symposium on Microarchitecture (1997), pp. 330-335.
[8] The Standard Performance Evaluation Corporation. Available [online]: http://www.spec.org/.