Simulation Modelling Practice and Theory


Simulation Modelling Practice and Theory 7 (9) 5 5

Adaptive routing in wormhole-switched necklace-cubes: Analytical modelling and performance comparison

Sina Meraji, Hamid Sarbazi-Azad *

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
School of Computer Science, Institute for Research in Fundamental Sciences, Tehran, Iran

Article history: Available online June 9

Keywords: Interconnecting network; Hypercube; Necklace hypercube; Adaptive routing; Wormhole switching; Performance evaluation; Analytical modelling

Abstract

The necklace hypercube has recently been introduced as an attractive alternative to the well-known hypercube. Previous research on this network topology has mainly focused on its topological properties, VLSI and algorithmic aspects. Several analytical models have been proposed in the literature for different interconnection networks, as the most cost-effective tools to evaluate the performance merits of such systems. This paper proposes an analytical performance model to predict message latency in wormhole-switched necklace hypercube interconnection networks with fully adaptive routing. The analysis focuses on a fully adaptive routing algorithm which has been shown to be the most effective for necklace hypercube networks. The results obtained from simulation experiments confirm that the proposed model exhibits good accuracy under different operating conditions.

© 9 Elsevier B.V. All rights reserved.

1. Introduction

A large number of interconnection networks have been proposed and studied for highly parallel distributed-memory multicomputers [6]. Among them, the hypercube is one of the best known and has many desirable properties, such as logarithmic diameter and fault tolerance. It is not, however, scalable from a hardware cost point of view, i.e.
when adding a few nodes to it, we have to double the network size to reach the next permitted network size. Another drawback of hypercubes was reported by Patel et al. [6] when considering VLSI layout. They showed that the minimum number of tracks for the VLSI layout of an n-cube (n-dimensional hypercube) using a one-dimensional implementation is of the order of the network size. Their investigation revealed that for an example network of k processors (a -cube), at least 687 tracks are required for VLSI implementation.

The necklace hypercube, introduced in [9], is a new interconnection network based on the hypercube network. While preserving most of the properties of the hypercube, it has other desirable properties, such as hardware scalability and efficient VLSI layout, that make it more attractive than an equivalent hypercube network [9].

There are three main approaches to the performance evaluation of interconnection networks. The first is monitoring the behaviour of the actual system; it can capture the effects of low-level design choices, but restricts experimentation with different router policies, since changing these features can be prohibitively expensive and time-consuming. Simulation is the second approach. Different routing algorithms, switching methods, and interconnection topologies can be implemented in simulation environments, but simulation is time-consuming, especially when large networks are studied. The third approach is to use mathematical models for performance analysis. Mathematical models are cost-effective and versatile tools for evaluating system performance under different design alternatives. The significant advantage of analytical models over simulation is that they can be used to obtain performance results for large systems, and for network configurations and working conditions that may not be feasible to study using simulation on conventional computers because of the excessive computation and memory demands.

* Corresponding author. E-mail address: azad@mail.ipm.ir (H. Sarbazi-Azad).
doi:.6/j.simpat.9.6.8

Several researchers have recently proposed analytical models of popular interconnection networks, e.g. k-ary n-cubes, tori, hypercubes, and meshes [ 5,7,8,,7]. The most difficult part of developing any analytical model of adaptive routing is computing the probability of message blocking at a given router, because of the number of combinations that have to be considered when enumerating the paths that a message may have used to reach its current position in the network.

Almost all studies on necklace hypercube interconnection networks focus on topological properties and algorithmic issues. There has been hardly any study on the performance evaluation of such networks, and no analytical model has been proposed for necklace hypercubes. In this paper, we discuss performance issues of necklace hypercube graphs by introducing a reasonably accurate mathematical model to predict the average message latency in wormhole-switched necklace hypercubes using a high-performance routing algorithm proposed in [,]. An early version of this paper appeared in [].

The rest of this paper is organized as follows. In Section 2, the structure of the necklace hypercube is described. In Section 3, adaptive wormhole routing in the necklace hypercube is discussed. Section 4 proposes a mathematical performance model for adaptive routing in wormhole-switched necklace hypercubes. The proposed performance model is validated in Section 5 using results obtained from simulation experiments.
In Section 6, the proposed analytical model is used to compare the performance of necklace hypercubes with that of equivalent hypercubes under different implementation constraints. Finally, Section 7 concludes the paper.

2. The necklace hypercube graph

The necklace hypercube is an undirected graph based on a hypercube, obtained by appending a necklace of processors (or nodes, interchangeably) to each edge. That is, besides connecting adjacent nodes (according to the base hypercube topology), we connect i-th dimension neighbors by an array of nodes, as a necklace. The necklace length may be fixed or variable for different edge necklaces. In the former case, there is a fixed number of nodes between every two adjacent neighbors on a necklace. The network is then called a regular necklace hypercube and is denoted $(n,k)$-RNH, where n is the number of dimensions and k is the necklace size. With fixed-length necklaces, however, scaling up the network is limited to the network sizes determined by n and k as $2^{n-1}(nk+2)$. In the latter form, named the irregular necklace hypercube, the necklace associated with channel i in the base hypercube contains $k_i$ nodes. Hence, the network can be defined by a vector of $n2^{n-1}$ elements, i.e. $k = (k_1, k_2, \ldots, k_{n2^{n-1}})$. Such a network, denoted $(k_1, k_2, \ldots, k_{n2^{n-1}})$-INH, has excellent scalability, with theoretically no limitation on scale. The network size of a $(k_1, k_2, \ldots, k_{n2^{n-1}})$-INH is $2^n + \sum_{i=1}^{n2^{n-1}} k_i$. The second term of the network size expression (the summation) implies the scalability property, as we can have any desired value for $k_i$, $1 \le i \le n2^{n-1}$, in the network. Fig. 1a and b show some examples of regular and irregular necklace hypercubes. Dark nodes are the nodes in the base hypercube and grey nodes are the necklace nodes. The index of each necklace node in its necklace is shown inside the node.
The edge number of each edge in the base hypercube is shown inside a grey rectangle; it is based on the i-th dimension edge of the smaller base vertex. Let us now give a more formal definition of the necklace hypercube.

Fig. 1. Examples of necklace hypercubes: (a) A regular necklace hypercube with n = and k =. (b) An irregular necklace hypercube.

Definition. A necklace hypercube is an undirected graph $G = (V, E)$, where each $u \in V$ is defined as $u = (b, d, i)$: b (the base vertex) is the address of the hypercubic node (the one with the smaller address in its dimension), d ($0 \le d \le n$) is the dimension number of the necklace containing the node, and i is the index of the node in its necklace. Note that d for the hypercube vertices (or base vertices) is zero. For simplicity, we index the nodes of a necklace from zero (zero for the base vertex) up to the number of nodes on the necklace.

An $(n,k)$-RNH has $2^n + (n2^{n-1})k$ nodes ($2^n$ nodes of degree $2n$ and $n2^{n-1}k$ nodes of degree 2), $n2^{n-1} + n2^{n-1}(k+1)$ edges, and a diameter of $n + k$. The bisection width of the necklace hypercube is twice that of its base hypercube. As the bisection width of an n-dimensional hypercube is $2^{n-1}$, the bisection width of the necklace hypercube based on an n-dimensional hypercube is $2^n$ [9].

One of the best features of necklace networks is their efficient VLSI layout. According to the result given in [9], the number of wiring tracks sufficient and necessary for the single-row wiring layout of the n-dimensional hypercube (or n-cube), $Q_n$, when the numeric node order is used is given by [9]:

$$t_{num}(Q_n) = (2/3)\,2^n + \lfloor n/2 \rfloor$$

For the $(n,k)$-RNH, in general, we can extend this expression. Note that each necklace can be placed in just two extra tracks. So, the number of wiring tracks sufficient and necessary for the -dimensional wiring layout of the necklace hypercube $(n,k)$-RNH [9], when the numeric node order is used, is

$$t_{num}((n,k)\text{-RNH}) = (2/3)\,2^n + \lfloor n/2 \rfloor$$

which is independent of k [9]. Hence, the VLSI layout for a (,)-RNH requires six tracks only. Note that the (,)-RNH contains nodes and the corresponding hypercube (-cube as the nearest size) needs tracks, i.e. twice the number of tracks needed for its equivalent necklace network.
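The node, edge and degree counts quoted above for a regular necklace hypercube can be checked by building the graph explicitly. The following is a minimal sketch, not the authors' code: the function names `build_rnh` and `rnh_counts` and the tuple encoding of necklace nodes are assumptions made for illustration, and it assumes k >= 1.

```python
def build_rnh(n, k):
    """Return adjacency sets of an (n, k)-RNH (illustrative encoding).

    Base vertices are the integers 0 .. 2**n - 1. A necklace node is a
    tuple (b, d, i): b is the smaller-address endpoint of the base edge,
    d the dimension (1-based) and i the position on the necklace (1 .. k).
    """
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for b in range(2 ** n):
        for d in range(1, n + 1):
            other = b ^ (1 << (d - 1))
            if b > other:
                continue  # handle each base edge once, from its smaller endpoint
            link(b, other)  # the base hypercube edge itself
            # the necklace: a chain of k extra nodes between the two endpoints
            chain = [b] + [(b, d, i) for i in range(1, k + 1)] + [other]
            for x, y in zip(chain, chain[1:]):
                link(x, y)
    return adj


def rnh_counts(n, k):
    """Closed-form node and edge counts of an (n, k)-RNH."""
    base_edges = n * 2 ** (n - 1)
    return {
        "nodes": 2 ** n + base_edges * k,
        "edges": base_edges + base_edges * (k + 1),
    }


if __name__ == "__main__":
    g = build_rnh(3, 2)
    print(len(g), sum(len(v) for v in g.values()) // 2, rnh_counts(3, 2))
```

Comparing the brute-force graph with the closed forms for a few small (n, k) pairs is a quick way to catch an off-by-one in either.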
3. Adaptive wormhole routing in necklace hypercubes

A deterministic routing algorithm for the necklace hypercube was proposed in []. It can be used with both packet switching and wormhole switching. To define a fully adaptive routing algorithm, we must first define a deadlock-free routing algorithm (with any level of adaptivity). We can then use the deadlock-free routing algorithm, as described in [], to construct a deadlock-free fully adaptive routing algorithm.

In order to have a deadlock-free routing algorithm in the necklace hypercube, we use two classes of virtual channels: class A and class B. Virtual channels of class A are used while the message is in the source necklace; we use virtual channels of this class until we reach the first base vertex of the source necklace. Virtual channels of class B are used in the destination necklace; that is, once we enter the destination necklace, we use this class of virtual channels until we reach the destination node. Each of these classes can be used for a deadlock-free routing algorithm in the base hypercube, e.g. e-cube routing. The minimum number of virtual channels in each class is one. Thus, we need at least two virtual channels per physical channel to implement a deadlock-free routing algorithm in necklace hypercubes. According to Duato's methodology, since the base deadlock-free routing algorithm requires two virtual channels, we can have a fully adaptive deadlock-free routing algorithm in the necklace hypercube using at least three virtual channels, two of which are used by the base routing algorithm, while the remaining one is used in any possible way that brings the message closer to its destination. We call the virtual channels used for the base deadlock-free routing base virtual channels and the remaining virtual channels adaptive virtual channels.
When there are more than three virtual channels per physical channel, the network performance is maximized when the extra virtual channels are used as adaptive virtual channels. Thus, with V virtual channels per physical channel, the best performance is achieved with V-2 adaptive virtual channels and two base virtual channels. This is of course for routing in necklaces; in the base hypercube we can use Duato's fully adaptive routing algorithm [6], which uses V-1 adaptive virtual channels and one virtual channel for e-cube deterministic routing. Fig. 2 shows the routing algorithm between nodes $u = (b_1, d_1, i_1)$ and $v = (b_2, d_2, i_2)$ in pseudo code. Note that the algorithm contains three main steps:

1. Move towards the nearest base neighbor using any of the V-2 adaptive virtual channels. If all V-2 adaptive virtual channels are busy, use the virtual channel of class A from the base virtual channels to move towards the nearest base neighbor in the source necklace.
2. Move towards the nearest neighbor of the destination using the fully adaptive routing algorithm in the hypercube with V-1 virtual channels []. If all V-1 virtual channels are busy, use the remaining virtual channel with e-cube routing in the base hypercube.
3. Now the current node is one of the base vertices of the destination necklace. Move towards the destination node using any of the V-2 adaptive virtual channels. If all V-2 virtual channels are busy, use the virtual channel of class B from the base virtual channels to move towards the destination node along the destination necklace.

4. The analytical model

In this section, we derive an analytical performance model for wormhole adaptive routing in a necklace hypercube. Our analysis focuses on the routing algorithm introduced in the previous section, but the modelling approach used here can equally be applied to other routing schemes.

Routing(u, v):  // u = (b_1, d_1, i_1) and v = (b_2, d_2, i_2) are the source and destination.
    u_1 = b_1, u_2 = b_1 ⊕ Location(d_1);  /* Location is a function that calculates the difference bit for the other extreme base vertex of a necklace. */
    v_1 = b_2, v_2 = b_2 ⊕ Location(d_2);
    CC = min(difference(i_1, u_1), difference(i_1, u_2));
    If there is any free virtual channel among the V-2 adaptive virtual channels
        then set vc to that channel
        else set vc to the class A base virtual channel;
    Send message through physical channel CC and virtual channel vc;
    Set the current node C;
    HD_1 = H(C, v_1), HD_2 = H(C, v_2);  // H(x, y) gives the Hamming distance between x and y
    If (HD_1 < HD_2)
        then Goto v_1 using Duato fully adaptive routing algorithm
        else Goto v_2 using Duato fully adaptive routing algorithm;
    If there is any free virtual channel among the V-2 adaptive virtual channels
        then set vc to that channel
        else set vc to the class B virtual channel;
    Send message through physical channel i and virtual channel vc;
End.

Fig. 2. Fully adaptive routing algorithm in the necklace hypercube.

The measure of interest in our model is the average message latency as a representative of network performance. The following assumptions are made when developing the proposed performance model. These assumptions have been widely used in similar modelling studies [ 5,7,8,,7,8].

(a) Messages are broken into packets of a fixed length of M flits. The flit transfer time between any two neighbouring nodes is assumed to be one cycle.
(b) Message destinations are uniformly distributed across the network nodes.
(c) Nodes generate traffic independently of each other, following a Poisson process with a mean rate of $\lambda_g$ messages/cycle.
(d) Messages are transferred to the local processor through the ejection channel once they arrive at their destination.
(e) V virtual channels per physical channel are used.
These virtual channels are used according to the routing algorithm described in the previous section.

According to the proposed fully adaptive routing algorithm, a message must cross three sub-networks from the source node to reach the destination node. First, it moves from the source node to the nearest base node; it then moves through the base hypercube to the nearest base node of the destination node; and finally it crosses the destination necklace to reach the destination node. Therefore, we have three message latencies: the mean message latency in the source necklace, $\mathrm{Latency}_N$, the mean message latency in the base hypercube, $\mathrm{Latency}_H$, and finally the mean message latency in the destination necklace, again $\mathrm{Latency}_N$. Hence, the total mean message latency can be predicted as

$$\mathrm{Latency} = \mathrm{Latency}_H + 2\,\mathrm{Latency}_N \qquad (1)$$

In order to compute the mean message latency in a sub-network, we must consider three parameters: the mean network latency, $S$, that is the time to cross the network; the mean waiting time seen by a message in the source node before it is injected into the network, $W_s$; and, to model the effect of virtual channel multiplexing, a scaling factor $\bar{V}$ representing the average degree of virtual channel multiplexing that takes place at a given physical channel []. Therefore, the mean message latency in each sub-network can be approximated as

$$\mathrm{Latency} = (S + W_s)\,\bar{V} \qquad (2)$$

The average number of hops that a message makes across the necklace sub-network, $d_N$, is given by

$$d_N = \frac{2}{k} \sum_{i=1}^{\lfloor k/2 \rfloor} i \qquad (3)$$

A value of $\lceil k/2 \rceil / k$ must be added to $d_N$ if the necklace length is odd. The average number of hops that a message makes across the base cube network, $d_H$, is given by []

$$d_H = \frac{n\,2^{n-1}}{2^n - 1} \qquad (4)$$
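Both average hop counts can be sanity-checked numerically. The sketch below compares each closed form with a brute-force average; it assumes, as an interpretation of the text, that the necklace distance is measured from a necklace node to its nearest base vertex and that the hypercube average is taken over all destinations other than the source. All function names are illustrative.

```python
def avg_necklace_hops_formula(k):
    """d_N from the closed form, with the odd-length correction term."""
    d = 2 * sum(range(1, k // 2 + 1)) / k
    if k % 2:
        d += ((k + 1) // 2) / k  # ceil(k/2) / k for the middle node
    return d


def avg_necklace_hops_bruteforce(k):
    """Average distance from necklace node i (1..k) to its nearest base vertex."""
    return sum(min(i, k + 1 - i) for i in range(1, k + 1)) / k


def avg_hypercube_hops_formula(n):
    """d_H = n * 2^(n-1) / (2^n - 1)."""
    return n * 2 ** (n - 1) / (2 ** n - 1)


def avg_hypercube_hops_bruteforce(n):
    """Mean Hamming distance from node 0 to every other node of the n-cube."""
    return sum(bin(x).count("1") for x in range(1, 2 ** n)) / (2 ** n - 1)


if __name__ == "__main__":
    for k in (4, 5, 8):
        print(k, avg_necklace_hops_formula(k), avg_necklace_hops_bruteforce(k))
    for n in (3, 7):
        print(n, avg_hypercube_hops_formula(n), avg_hypercube_hops_bruteforce(n))
```

By symmetry of the hypercube, averaging from node 0 alone is enough for the brute-force check.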

Fig. 3 shows a necklace of a (,5)-RNH. In this figure, all messages generated in node cross edge (,) except those whose destination is node, or. If we consider the messages that are generated in node and node 5 and whose destinations are nodes, and 5 (most crossing edge (,)), the exact message traffic rate over this edge can be expressed as

k g N N þ k g N þ k g N ¼ k g (5)

We can compute the traffic over edge (,) as follows. All messages generated in node cross edge (,) except those whose destination is node, or 5. Also, all messages generated in node, for destinations other than node, or 5, cross this edge. Considering messages generated in node for node, we can write the average traffic rate over this channel as

N k g N þ k N g N þ k g N ffi k g (6)

Using a similar approach, the traffic rates over the channels of a necklace with k nodes can be written as $\lambda_g, 2\lambda_g, 3\lambda_g, \ldots, (k/2)\lambda_g$. We have similar results for the edges in the other direction []. Therefore, we can compute the average traffic rate received by each necklace channel, $\lambda_{cN}$, as

$$\lambda_{cN} = \frac{\lambda_g \sum_{i=1}^{k/2} i}{k/2} = \frac{k/2 + 1}{2}\,\lambda_g \qquad (7)$$

The traffic rate over the edge that connects the last necklace node to the base node is $(k/2)\lambda_g$. As each base node has n necklaces and the traffic rate generated in a base node itself is $\lambda_g$, we can compute the traffic rate injected into the base hypercube by a base node as

$$\lambda_b = \lambda_g + n\,\frac{k}{2}\,\lambda_g \qquad (8)$$

Fully adaptive routing in the base hypercube allows a message to use any available channel that brings it closer to its destination node, resulting in an almost evenly distributed traffic rate over hypercubic channels. A router in the base hypercube has n output channels.
Since each message travels, on average, $d_H$ hops to cross the network, the rate of messages received by each hypercubic channel, $\lambda_{cH}$, can be expressed as []

$$\lambda_{cH} = \frac{\lambda_b\, d_H}{n} = \frac{\lambda_b\, 2^{n-1}}{2^n - 1} \qquad (9)$$

Let us follow a typical message which makes d hops to reach its destination. The average network latency, S, seen by the message crossing from the source to the destination node consists of two parts: the time of actual message transmission, and the blocking time in the network. Therefore, S can be expressed as

$$S = M + d + d\,T_b \qquad (10)$$

where M is the message length and $T_b$ is the average blocking time seen by the message at each hop. The term $T_b$ is given by

$$T_b = P_{block}\, w \qquad (11)$$

with $P_{block}$ being the probability that a message is blocked at the current channel and w the mean waiting time to acquire a channel in the event of blocking.

A message is blocked at a given channel in the necklace sub-network when all adaptive and deterministic virtual channels of the current physical channel are busy. Let $P_a$ be the probability that all adaptive virtual channels of a physical channel are busy and $P_{a\&d}$ denote the probability that all adaptive and deterministic virtual channels of a physical channel are busy. In necklace sub-networks, there is only one path for the messages, so no adaptivity exists for messages moving across a necklace. In order to compute $P_{a\&dN}$ (probability $P_{a\&d}$ for necklaces), we must consider two cases: (a) the probability that all V virtual channels of a physical channel are busy, $P_V$, and (b) the probability that V-1 of the V virtual channels associated with a physical channel are busy. In the latter case, only one combination of V-1 virtual channels out of the total V virtual channels can result in blocking.

Fig. 3. A (,5)-RNH network.

So, the probability that all adaptive and deterministic virtual channels of a physical channel are busy can be expressed as

$$P_{a\&dN} = P_V + \frac{P_{V-1}}{\binom{V}{V-1}} \qquad (12)$$

In order to compute the probability of blocking, $P_{block}$, we average over all the probabilities that a message may be blocked when crossing d hops in the network:

$$P_{blockN} = \frac{1}{d_N} \sum_{i=1}^{d_N} P_{a\&dN} \qquad (13)$$

A message is blocked at a given channel in the hypercube network when all the adaptive virtual channels of the remaining dimensions to be visited and also the deterministic virtual channel of the current dimension are busy. We can then write $P_{aH}$, $P_{a\&dH}$ and $P_{blockH}$ as []:

$$P_{aH} = P_V + \frac{P_{V-1}}{\binom{V}{V-1}} \qquad (14)$$

$$P_{a\&dH} = P_V \qquad (15)$$

$$P_{blockH} = \frac{1}{d_H} \sum_{i=1}^{d_H} P_{aH}^{\,d_H - i}\, P_{a\&dH} \qquad (16)$$

where $d_H$ is the average number of hops that a message might take in the hypercube sub-network.

To determine the mean waiting time, w, to acquire a virtual channel when a message is blocked, a physical channel is treated as an M/G/1 queue with a mean waiting time of [9]

$$w = \frac{\rho\, S\,(1 + C_S^2)}{2(1 - \rho)} \qquad (17)$$

$$\rho = \lambda_c S \qquad (18)$$

$$C_S^2 = \frac{\sigma_S^2}{S^2} \qquad (19)$$

where $\lambda_c$ is the traffic rate on the channel (given by Eqs. (7) or (9) for necklace or hypercubic channels), $S$ is its service time calculated by Eq. (10), and $\sigma_S^2$ is the variance of the service time distribution. Since the minimum service time at a channel is equal to the message length, M, following a suggestion given in [5], the variance of the service time distribution can be approximated as $(S - M)^2$. Hence, the mean waiting time becomes

$$w = \frac{\lambda_c S^2 \left(1 + (1 - M/S)^2\right)}{2(1 - \lambda_c S)} \qquad (20)$$

Similarly, modelling the local queue in the source node as an M/G/1 queue, with mean arrival rate $\lambda_g / V$ and service time $S$ with an approximated variance $(S - M)^2$, yields the mean waiting time seen by a message at the source node as []

$$W_s = \frac{\frac{\lambda_g}{V} S^2 \left(1 + (1 - M/S)^2\right)}{2\left(1 - \frac{\lambda_g}{V} S\right)} \qquad (21)$$

The probability, $P_v$, that v virtual channels are busy at a physical channel can be determined using a Markovian model.
Let state $\pi_v$ ($0 \le v \le V$) correspond to v virtual channels being busy. The transition rate out of state $\pi_v$ to state $\pi_{v+1}$ is the traffic rate $\lambda_c$ (given by Eqs. (7) and (9)), while the rate out of state $\pi_v$ to state $\pi_{v-1}$ is $1/S$ ($S$ is given by Eq. (10)). The transition rate out of state $\pi_V$ is reduced by $\lambda_c$ to account for the arrival of messages while a channel is in this state. The Markovian model results in the following steady-state probability [], in which the service time of a channel has been approximated as the network latency of that channel:

$$P_v = \begin{cases} (1 - \lambda_c S)(\lambda_c S)^v, & 0 \le v < V \\ (\lambda_c S)^v, & v = V \end{cases} \qquad (22)$$

When multiple virtual channels are used per physical channel, they share the physical bandwidth in a time-multiplexed manner. The average degree of multiplexing of virtual channels that takes place at a given physical channel can then be estimated by []

$$\bar{V} = \frac{\sum_{v=1}^{V} v^2 P_v}{\sum_{v=1}^{V} v P_v} \qquad (23)$$
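Because the network latency S depends on the waiting time w and w in turn depends on S, these quantities are naturally solved together by fixed-point iteration. The following is a minimal sketch for a single necklace sub-network only, not the authors' solver: it assumes the blocking probability takes the form $P_V + P_{V-1}/\binom{V}{V-1}$, that the channel rate `lam_c` is supplied by the caller, and that V >= 2; all function names, the tolerance `eps` and the sample parameter values are illustrative.

```python
from math import comb


def vc_busy_probs(lam_c, S, V):
    """Steady-state probabilities P_0 .. P_V that v virtual channels are busy."""
    rho = lam_c * S
    return [(1 - rho) * rho ** v for v in range(V)] + [rho ** V]


def multiplexing_degree(P):
    """Average multiplexing degree: sum(v^2 P_v) / sum(v P_v) over v = 1..V."""
    num = sum(v * v * p for v, p in enumerate(P))
    den = sum(v * p for v, p in enumerate(P))
    return num / den if den > 0 else 1.0  # idle channel: no multiplexing


def mg1_wait(lam, S, M):
    """M/G/1 mean waiting time with service variance approximated as (S - M)^2."""
    return lam * S * S * (1 + (1 - M / S) ** 2) / (2 * (1 - lam * S))


def necklace_latency(lam_g, lam_c, M, d, V, eps=1e-9, max_iter=10_000):
    """Iterate S = M + d + d * P_block * w to a fixed point, then scale."""
    S = float(M)  # start from the no-blocking service time
    for _ in range(max_iter):
        if lam_c * S >= 1:
            raise ValueError("channel utilisation reached 1 (saturation)")
        P = vc_busy_probs(lam_c, S, V)
        p_block = P[V] + P[V - 1] / comb(V, V - 1)
        w = mg1_wait(lam_c, S, M)
        S_new = M + d + d * p_block * w
        converged = abs(S_new - S) < eps
        S = S_new
        if converged:
            break
    else:
        raise RuntimeError("fixed-point iteration did not converge")
    W_s = mg1_wait(lam_g / V, S, M)
    V_bar = multiplexing_degree(vc_busy_probs(lam_c, S, V))
    return (S + W_s) * V_bar


if __name__ == "__main__":
    # Hypothetical operating point, chosen only to stay well below saturation.
    print(necklace_latency(lam_g=0.001, lam_c=0.002, M=32, d=4, V=4))
```

Raising an error at utilisation 1 mirrors the idea of sweeping the generation rate "until saturation"; a full model would run one such loop per sub-network and combine the results.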

The above equations reveal that there are several inter-dependencies between the different variables of the model. For instance, Eqs. (10) and (11) reveal that S is a function of w, while Eq. (17) shows that w is a function of S. Given that closed-form solutions to such inter-dependencies are very difficult to determine, the different variables of the model are computed using an iterative technique. Fig. 4 shows the steps of the iterative algorithm.

Iterative algorithm for solving the proposed model
Procedure: for λ_g = . until saturation point, do:
    Step ) Let S be initialized to M.
    Step ) Compute λ_cN, λ_cH, and w using equations (7), (9), and ().
    Step ) Compute P_v using equation () and V̄ using equation ().
    Step 5) Compute new S using equations () through (). If it differs from the old value by more than ε (a predefined error limit) then go to step.
    Step 6) Compute W_s, V̄, S and finally Latency using equations (), (), (), and ().
    Step 7) λ_g = λ_g + Step_Rate
End of procedure.

Fig. 4. The pseudo code of the iterative approach to solving the model parameters.

5. Validation of the model

The proposed analytical model has been validated through a discrete-event simulator (Xmulator [5]) that mimics the behaviour of the described routing algorithms in the necklace hypercube at the flit level. The simulator uses the same assumptions as the analysis, and some of these assumptions are detailed here with a view to making the network operation clearer. The network cycle time is defined as the transmission time of a single flit from one router to the next. Messages are generated at each node according to a Poisson process with a mean rate of λ_g messages/cycle. Message length is fixed at M flits. Destination nodes are determined using a uniform random number generator. The mean message latency is defined as the mean amount of time from the generation of a message until the last data flit reaches the local processor at the destination node. The other measures include the mean network latency, i.e. the time taken to cross the network, and the mean queuing time at the source node, i.e. the time spent in the local queue before entering the first network channel.

Numerous validation experiments have been performed for several combinations of network sizes, message lengths, and numbers of virtual channels to validate the model. Figs. 5-8 depict latency results predicted by the model explained in the previous section plotted against those provided by the simulator for different scenarios. The horizontal axis in each figure shows the traffic generation rate at each node, while the vertical axis shows the mean message latency. In Fig. 5, we consider two very large networks with more than,. We consider virtual channels per physical channel, and two different message lengths of M = and 6 flits (a larger message size, e.g. 8 flits, took the network to saturation for small values of the traffic generation rate). In Fig. 6, we consider two large networks (more than nodes) with V = 8 virtual channels per physical channel, and three different message lengths of M =, 6 and 8 flits. Fig. 7 presents the same results for medium-sized networks (about 56 nodes); here, we use V = 6 virtual channels per physical channel and message lengths M =, 6 and 8 flits. Finally, Fig. 8 compares the results given by the model with those gathered from the simulation experiments for two small networks (about nodes); here the number of virtual channels is V = per physical channel and the message length is M =, 6 and 8 flits.

The figures reveal that in all cases the analytical model can predict the mean message latency with a good degree of accuracy in the steady-state regions. Moreover, model predictions are still good even when the network operates in the heavy traffic region, and when it starts to approach the saturation region. However, some discrepancies around the saturation point are apparent. These can be accounted for by the approximations made to ease the derivation of different variables of the model, e.g. the approximation made to estimate the variance of the service time distribution at a channel. Such an approximation greatly simplifies the model, as it allows us to avoid computing the exact distribution of the message service time at a given channel, which is not a straightforward task because of the inter-dependencies between service times at successive channels, as wormhole routing relies on a blocking mechanism for flow control.

6. Analytical performance comparison of hypercubes and necklace hypercubes

In this section, we compare the performance of the necklace hypercube with its equivalent hypercube network. We use Boura's model [] to predict the hypercube's behaviour and the model proposed here to predict the performance of equivalent

necklace hypercubes. We first compare the networks with an equal number of nodes without considering any technological or implementation constraints. To do so, we consider a -dimensional hypercube and a (7, )-RNH, both with nodes. Fig. 9a shows the average message latency of these networks when the message size is fixed at M = 6 flits and the number of virtual channels per physical channel is V = 8. As can be seen in the figure, the hypercube performs better than the equivalent necklace network. One reason is the larger number of channels in the hypercube, which enables the network to tolerate higher traffic rates. Moreover, the average distance for the $(n,k)$-RNH is s + t, where

8 ðk ÞðkþÞðkþÞ ; if k isodd n ( s ¼ v þ n nþ k ð þ nkþ ; v ¼ 8 ðk þ k þ Þþ ; if n isodd >< ; t ¼ ðþ n 8 ðk ðk ÞðkþÞ þ k þ k mod Þ; if n iseven ; if k iseven n >:

Fig. 5. The average message latency predicted by the model against simulation results when virtual channels are used per physical channel and message length is M =, 6 flits, in (,6)-RNH (left) and (,)-RNH (right) networks.

Fig. 6. The average message latency predicted by the model against simulation results when eight virtual channels are used per physical channel and message length is M =, 6, 8 flits, in (7,8)-RNH (left) and (7,)-RNH (right) networks.

Fig. 7. The average message latency predicted by the model against simulation results when six virtual channels are used per physical channel and message length is M =, 6, 8 flits, in (,8)-RNH (left) and (,)-RNH (right) networks. Vertical axis: average total message latency.

Fig. 8. The average message latency predicted by the model against simulation results when four virtual channels are used per physical channel and message length is M =, 6, 8 flits, in (,8)-RNH (left) and (,)-RNH (right) networks.

Comparing it to the average distance in a binary n-cube (i.e. n/2) reveals a big difference, especially when the necklace size, k, is large.

When an interconnection network is implemented on a chip or on a board, the bisection bandwidth has the most important effect on performance. Thus, many studies, e.g. [,8], consider a constant bisection bandwidth when comparing two equivalent networks (with the same number of nodes). According to [], the bisection bandwidth of a network is defined as the product of its bisection width (the number of edges that must be cut in order to split the network into two equal halves) and its channel bandwidth. The bisection width of the n-dimensional hypercube is $2^{n-1}$, and the bisection width of the m-dimensional necklace hypercube is $2^m$ [9]. Assuming two equivalent networks, a hypercube with bisection width $BW_H$ and channel bandwidth $CW_H$, and a necklace hypercube with bisection width $BW_{NH}$ and channel bandwidth $CW_{NH}$, and considering a fixed network bisection bandwidth, we can write

$$BW_H \cdot CW_H = BW_{NH} \cdot CW_{NH} \qquad (25)$$

Thus, the flit size of the hypercube equals $(BW_{NH} / BW_H)\, CW_{NH}$.
So, considering that the phit size of the necklace hypercube equals its flit size, the transmission time for a flit in the equivalent hypercube will be BW H BW NH times of the flit transmission time in the necklace hypercube. ð5þ
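As a rough illustration of the constant bisection bandwidth argument, the following Python sketch solves Eq. (5) for the hypercube channel width and derives the resulting flit transmission time ratio. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not values from the paper. The sketch also assumes, as the text does implicitly, that flit transmission time is inversely proportional to channel width.

```python
from fractions import Fraction


def hypercube_channel_width(bw_h: int, bw_nh: int, cw_nh: int) -> Fraction:
    """Solve BW_H * CW_H = BW_NH * CW_NH (Eq. 5) for CW_H."""
    if min(bw_h, bw_nh, cw_nh) <= 0:
        raise ValueError("bisection widths and channel width must be positive")
    return Fraction(bw_nh * cw_nh, bw_h)


def flit_time_ratio_bisection(bw_h: int, bw_nh: int) -> Fraction:
    """Return T_H / T_NH under a constant bisection bandwidth.

    Assumption: flit time is inversely proportional to channel width,
    so T_H / T_NH = CW_NH / CW_H = BW_H / BW_NH.
    """
    if min(bw_h, bw_nh) <= 0:
        raise ValueError("bisection widths must be positive")
    return Fraction(bw_h, bw_nh)


# Hypothetical inputs, for illustration only (not taken from the paper).
bw_h, bw_nh, cw_nh = 512, 64, 32
print("CW_H       =", hypercube_channel_width(bw_h, bw_nh, cw_nh))
print("T_H / T_NH =", flit_time_ratio_bisection(bw_h, bw_nh))
```

Exact fractions are used so that the ratio stays exact when the bisection widths do not divide evenly.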

Fig. 9. The average message latency predicted by the model for a -dimensional hypercube (HC) and (7,)-RNH (NH) when eight virtual channels are used per physical channel and message length is M = 6, when (a) no constraint is considered, (b) under bisection bandwidth constraint, and (c) under pin-out constraint.

Fig. 9b compares the performance of the -dimensional hypercube and (7, )-RNH under the constant bisection bandwidth constraint. The message length is fixed at 6 flits and the number of virtual channels per physical channel is V = 8. As can be seen in this figure, the necklace hypercube outperforms the equivalent hypercube under the bisection bandwidth constraint.

The pin-out constraint also has an important impact on performance when an interconnection network is implemented by wiring between chips, boards, and cabinets []. The constant pin-out constraint has already been used by other researchers [,8] for comparing different networks. The pin-out of a node is defined as the node degree (the number of channels in a node) multiplied by the channel width. Assuming two equivalent networks, a hypercube with node degree $P_H$ and channel width $CW_H$, and a necklace hypercube with node degree $P_{NH}$ and channel width $CW_{NH}$, and considering the pin-out constraint, we can write

$$P_H \cdot CW_H = P_{NH} \cdot CW_{NH} \qquad (6)$$

Using Eq. (6), we can calculate the ratio between the flit transmission times of the hypercube and the necklace hypercube. The node degree in an n-dimensional hypercube is n, but a problem arises when we want to determine the pin-out of the necklace hypercube. Here, we have two types of nodes: necklace nodes with degree of and the hypercube nodes with degree of n. As the number of nodes with degree is much larger than those with degree n, we can use the approximate value of for node degree. Fig. 9c shows the performance comparison of the -dimensional hypercube and (7, )-RNH under the pin-out constraint. As depicted in the figure, the necklace hypercube performs better than the hypercube under the pin-out constraint.
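The pin-out argument can be sketched in the same spirit. The Python snippet below derives the flit transmission time ratio from Eq. (6) and compares the result obtained with a single approximate node degree against the one obtained with the exact mean degree over two node types. All function names, node counts and degrees are hypothetical placeholders, not the paper's values, and the sketch again assumes that flit transmission time is inversely proportional to channel width.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def mean_node_degree(groups: Iterable[Tuple[int, int]]) -> Fraction:
    """Exact average degree over (node_count, degree) groups."""
    groups = list(groups)
    total_nodes = sum(count for count, _ in groups)
    if total_nodes <= 0:
        raise ValueError("at least one node is required")
    total_degree = sum(count * degree for count, degree in groups)
    return Fraction(total_degree, total_nodes)


def flit_time_ratio_pinout(p_h, p_nh) -> Fraction:
    """Return T_H / T_NH under P_H * CW_H = P_NH * CW_NH (Eq. 6).

    Assumption: flit time is inversely proportional to channel width,
    so T_H / T_NH = CW_NH / CW_H = P_H / P_NH.
    """
    p_h, p_nh = Fraction(p_h), Fraction(p_nh)
    if p_h <= 0 or p_nh <= 0:
        raise ValueError("node degrees must be positive")
    return p_h / p_nh


# Hypothetical network: many low-degree nodes, few high-degree nodes.
groups = [(900, 2), (100, 8)]
exact_degree = mean_node_degree(groups)
approx_degree = 2  # degree of the dominant node type

print("exact mean degree      =", exact_degree)
print("ratio (approx. degree) =", flit_time_ratio_pinout(10, approx_degree))
print("ratio (exact degree)   =", flit_time_ratio_pinout(10, exact_degree))
```

When one node type dominates, the two ratios stay close, which is the intuition behind approximating the node degree of the necklace hypercube by that of its most numerous node type.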

Conclusion and future work

The necklace hypercube network has recently been introduced as an attractive alternative to the well-known hypercube. In this paper, we introduced a mathematical performance model of adaptive wormhole routing in necklace hypercubes and validated it through simulation experiments. The proposed model achieves a good degree of accuracy while remaining simple, making it a practical evaluation tool that researchers in the field can use to gain insight into the performance behaviour of fully adaptive routing in wormhole-switched necklace hypercubes.

We then used the proposed analytical model to compare the performance of same-sized necklace hypercubes and hypercubes with and without implementation constraints. The implementation constraints considered were the bisection bandwidth and the pin-out. When no implementation constraints were taken into account, the hypercube was shown to behave better than its equivalent necklace hypercube. However, under both the bisection bandwidth and the pin-out constraints, the necklace hypercube outperformed its equivalent hypercube.

Our next objective is to analyse the performance of the necklace hypercube under important non-uniform traffic patterns. Examining other types of necklace networks, with different base networks, is another line of research for future work. Analysing irregular necklace networks is a further interesting research topic. Another network that is based on the hypercube and is very similar to the necklace hypercube is the stretched hypercube [20]. Conducting similar analytical studies for this network is another direction for future research.

References

[1] A. Agarwal, Limits on interconnection network performance, IEEE Transactions on Parallel and Distributed Systems (99) 98.
[2] Y. Boura, C.R. Das, T.M. Jacob, A performance model for adaptive routing in hypercubes, in: Proceedings of the International Workshop on Parallel Processing, 99, pp. 6.
[3] B. Ciciani, M. Colajanni, C. Paolucci, An accurate model for the performance analysis of deterministic wormhole routing, in: Proceedings of the th International Parallel Processing Symposium, 997, pp
[4] W.J. Dally, Virtual channel flow control, IEEE Transactions on Parallel and Distributed Systems (99) 9 5.
[5] J.T. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, Journal of Parallel and Distributed Computing (99).
[6] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 5.
[7] R. Greenberg, L. Guan, Modelling and comparison of wormhole routed mesh and torus networks, in: Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Systems, 997, pp
[8] J. Kim, C.R. Das, Hypercube communication delay with wormhole routing, IEEE Transactions on Computers (99)
[9] L. Kleinrock, Queueing Systems, vol., John Wiley, 975.
[10] S. Meraji, H. Sarbazi-Azad, Performance evaluation of deterministic routing in wormhole necklace cubes, in: Proceedings of the ACS/IEEE Conference on Computers and Applications, Amman, Jordan, 7.
[11] S. Meraji, A. Nayebi, H. Sarbazi-Azad, Performance evaluation of adaptive wormhole routing in necklace hypercubes, in: International Conference on Computing: Theory and Applications, IEEE Press, Kolkata, India, March 5 7, 7.
[12] S. Meraji, Performance evaluation of necklace hypercube interconnection networks, M.Sc. Thesis, Sharif University of Technology, October 6.
[13] S. Meraji, H. Sarbazi-Azad, A. Patooghy, Performance modeling of necklace hypercubes, in: 6th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS 7), held in conjunction with IEEE IPDPS7, Long Beach, CA, USA, 6 March 7.
[14] H. Hashemi-Najafabadi, H. Sarbazi-Azad, P. Rajabzadeh, An accurate performance model of fully adaptive routing in wormhole-switched D-mesh multicomputers, Microprocessors and Microsystems (7)
[15] A. Nayebi, S. Meraji, A. Shamaei, H. Sarbazi-Azad, Xmulator: a listener-based integrated simulation platform for interconnection networks, in: Proceedings of Asia Modelling Symposium, IEEE Press, Thailand, 7.
[16] A. Patel, A. Kusalik, C. McCrosky, Area-efficient VLSI layout for binary hypercubes, IEEE Transactions on Computers 9 ()
[17] H. Sarbazi-Azad, M. Ould-Khaoua, L.M. Mackenzie, An accurate analytical model of adaptive wormhole routing in k-ary n-cube interconnection networks, Performance Evaluation ()
[18] H. Sarbazi-Azad, Constraint-based performance comparison of multi-dimensional interconnection networks with deterministic and adaptive routing strategies, Journal of Computers and Electrical Engineering ()
[19] P. Shareghi, H. Sarbazi-Azad, Topological properties of necklace networks, in: IEEE International Symposium on Parallel Architectures, Algorithms and Networks (IEEE I-SPAN), IEEE Computer Society Press, 5, pp. 5.
[20] P. Shareghi, H. Sarbazi-Azad, The stretched network: properties, routing, and performance, Journal of Information Science and Engineering (8) 6 78.