Multistage Interconnect Networks

Shaunak Chatterjee (3CS13), Arpit Jain (3CS39), Mayank Jain (3CS31), Udit Sajjanhar (3CS311)

Guided By: Prof. Arobinda Gupta
Introduction to Interconnect Networks

Interconnection networks are used for two primary purposes: to connect processors to shared memory, and to connect processors to each other. The type of interconnection network implemented in a distributed system greatly affects the algorithms used in the system to perform standard and general tasks. The main reason is that the interconnection pattern determines how data is routed, and this is often the complexity-determining factor in several algorithms in this paradigm. With this perspective, there are four important parameters which characterize an interconnection network:

1. The diameter
2. The bisection width
3. The edges per node
4. The edge length

Each of these parameters requires further explanation.

DIAMETER: The largest distance between two switch nodes in the network. A low diameter is better, because the worst-case time for data to be routed from one node to another is a function of the diameter. A large diameter automatically increases the data-routing overhead. Thus, the diameter often puts a lower bound on the complexity of parallel algorithms which require communication between arbitrary pairs of nodes.

BISECTION WIDTH: The minimum number of edges between switch nodes that must be removed in order to divide the network into two almost equal halves. It is a measure of a system's capacity (or bottleneck) for handling large volumes of data transfer. A high bisection width is good, as it means that large volumes of data to be transferred can be divided between a larger number of connections. The size of the data set divided by the bisection width puts a lower bound on the complexity of algorithms which require large amounts of data movement.
EDGES PER NODE: It is desirable that the number of edges per node is a constant independent of the network size, as this allows for more scalability. The routing algorithms can then also be more general, and not constrained by the number of nodes in the network.

EDGE LENGTH: For scalability, it is desirable that the maximum edge length is a constant independent of the network size. This enables the network to be planned and laid out in an organized manner.

Some other properties of interconnection networks:

Rearrangeable: A network is rearrangeable if and only if it is possible to establish routes for any pre-assigned source-destination pairs, provided that no two sources have the same destination.

Non-blocking: A network is non-blocking if and only if it is possible to route from any source to any destination in the presence of other established source-destination pairs, provided no two sources have the same destination. Thus, all non-blocking networks are rearrangeable, but not the other way round.

Need for multistage interconnection networks: The crossbar is the limiting case of a single-stage interconnect network. Though it has some positive points (diameter = 1, non-blocking), it is very costly (cost = O(n^2)). A detailed mathematical example has been shown in the slides to elaborate on how multistage interconnection networks can cut down on the cost, though the diameter might increase, and there might be additional trade-offs. The remainder of this report looks at different interconnection networks and evaluates them based on the parameters enumerated above (namely diameter, bisection width, edge length, edges per node, rearrangeability, and non-blocking character).
We discuss three such networks:

CLOS
BUTTERFLY
SHUFFLE EXCHANGE

In the final section, we present an application of an interconnect network for sorting. Batcher's sorting network is a very good example of how the time complexity of sorting can be reduced by making use of the parallelism inherent in the problem itself.
CLOS Networks

Introduction: The Clos networks are a class of multistage switching network topologies that provide alternate paths between inputs and outputs, making it possible to minimize or eliminate the blocking that can otherwise occur in such networks. In his seminal paper in the Bell System Technical Journal in 1953, Charles Clos [1] showed that the class of switching networks that now bears his name was immune to the phenomenon of blocking that was the key performance limitation of the electromechanical telephone switching systems of that era. This was the first class of networks with sub-quadratic complexity to exhibit nonblocking performance. Clos' seminal paper sparked the development of the theory of interconnection networks, and Clos networks still maintain a central role in the design of practical switching systems in applications ranging from telephone switching, to digital cross connects, to video production switchers, to IP routers.

Building a CLOS network

General Architecture: A symmetrical 3-stage Clos network is shown in Figure 1. There are three key parameters for this network: the number of switch modules in the first and third stages, the number of switch modules in the middle stage, and the number of inputs (outputs) to the first (third) stage switch modules. These parameters are commonly denoted by r, m and n respectively and completely characterize the network. We use the notation C(n,m,r) to denote such a network and we let N = nr be the number of network inputs (and outputs).

Figure 1: Symmetrical 3-stage Clos network
Given the value of N (= nr), how do we decide upon the values of n, m and r to build an efficient Clos network? The value of m depends upon our requirements. Let us look at the analytical derivation of the value of m for different cases.

Strictly non-blocking networks: A network is said to be strictly nonblocking if there is no configuration of connections that can prevent the addition of a new connection between an idle input and an idle output. If m >= 2n - 1, then C(n,m,r) is strictly nonblocking. The reason is that any first-stage switch with an idle input has at most n - 1 busy links connecting it to the middle stage, so from any idle input there are at most n - 1 unreachable middle-stage modules. When m > 2(n - 1), this is fewer than half the number of middle-stage switch modules. Similarly, fewer than half of the middle-stage switch modules are unreachable from any idle output, meaning that there must be some middle-stage switch module that can be reached from both sides. Hence, for any configuration of present network connections, there is always a path to connect an idle input and an idle output.

Rearrangeably nonblocking networks: A network is said to be rearrangeably nonblocking if it is always possible to add a new connection linking an idle input to an idle output, possibly by rearranging existing connections. Clos networks are rearrangeably nonblocking so long as m >= n. This can be shown most simply by reformulating the problem of routing a set of connections through a Clos network as a graph edge-coloring problem, as illustrated in Figure 2.

Figure 2: Simultaneous connection routing in C(n,n,r) using graph edge coloring
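To make the two thresholds concrete, here is a minimal Python sketch (an illustration of ours, not part of the original report) that returns the middle-stage switch counts required by the two conditions above.

```python
def clos_middle_stage_requirements(n):
    """Minimum number m of middle-stage switches in C(n, m, r):
    m >= 2n - 1 for strict nonblocking, m >= n for rearrangeable nonblocking."""
    return {"strictly_nonblocking": 2 * n - 1, "rearrangeably_nonblocking": n}

# Example: with n = 4 inputs per first-stage switch, strict nonblocking
# needs m >= 7, while rearrangeable nonblocking needs only m >= 4.
print(clos_middle_stage_requirements(4))
```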
The graph used to determine the set of routes has one vertex for every first-stage switch in the network and one vertex for every third-stage switch in the network. An edge is added between a vertex in the first set and a vertex in the second set for every connection that needs to be routed from an input of the switch corresponding to the first vertex to an output of the switch corresponding to the second vertex. This is illustrated in Figure 2, where the (input, output) pairs at the top are used to construct the graph shown at left. Given this graph, the next step is to assign colors to the edges in such a way that no two edges incident to the same vertex are assigned the same color. The colors assigned to the edges correspond to the middle-stage switches used to carry the connections, and the constraint on the colors corresponds to the constraint that no two connections in the same first (or third) stage switch can pass through the same middle-stage switch. In Figure 2, the letters R, G and B denote the colors assigned to the edges, and the diagram at right shows the corresponding set of routes. By a classical result from graph theory, a bipartite graph of maximum degree n can be edge-colored with no more than n colors, implying that the given set of connections can be routed using at most n middle-stage switches.

Thus the value of m can be decided depending upon the network requirements; we have to decide upon the trade-off between connection efficiency and the cost incurred while choosing the value of m. To choose the values of n and r, we derive an expression that minimizes the number of cross-points (and hence the switching cost) required to build the network. The derivation goes as follows:

number of cross-points (cp) = cross-points in the first stage + cross-points in the middle stage + cross-points in the third stage
= nmr + r^2 m + nmr = 2nmr + r^2 m

But we have N = nr; substituting r = N/n,

cp = 2Nm + (N/n)^2 m    (i)

Taking m = 2(n - 1) (the limiting condition for the strictly non-blocking case),

cp = 4N(n - 1) + 2N^2/n - 2(N/n)^2

Differentiating with respect to n, we have

d(cp)/dn = 4N - 2(N/n)^2 + 4N^2/n^3

Setting d(cp)/dn = 0, the condition for the minimum number of cross-points, and neglecting the lower-order term 4N^2/n^3 for large N, we have

n ≈ sqrt(N/2)    (ii)

Substituting the value of n from (ii) into (i), we find that cp grows as N^(3/2) (to leading order, cp ≈ 4√2 N^(3/2)).
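The optimization above can also be checked numerically. The following Python sketch (our own illustration; the function name and the choice N = 1024 are arbitrary) evaluates cp(n) = 2Nm + (N/n)^2 m with m = 2(n - 1), finds the minimizing n by brute force, and compares it with the analytic estimates sqrt(N/2) and 4*sqrt(2)*N^(3/2).

```python
import math

def crosspoints(N, n):
    """Crosspoints of a 3-stage Clos network with N = n*r ports and
    m = 2*(n - 1) middle switches (the strictly nonblocking limit).
    r = N/n is treated as a real number (continuous relaxation)."""
    m = 2 * (n - 1)
    r = N / n
    return 2 * N * m + r * r * m          # 2*n*m*r + m*r^2 with r = N/n

N = 1024
best_n = min(range(2, N), key=lambda n: crosspoints(N, n))
print("brute-force optimum n:", best_n)                        # close to sqrt(N/2)
print("sqrt(N/2)            :", round(math.sqrt(N / 2), 1))
print("cp at optimum        :", round(crosspoints(N, best_n)))
print("4*sqrt(2)*N^1.5      :", round(4 * math.sqrt(2) * N ** 1.5))
```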
Thus the number of cross-points in the resulting Clos network is sub-quadratic in the number of inputs. This is what Clos showed in his 1953 paper, and it is a significant improvement over the crossbar, in which the number of cross-points grows as the square of the number of inputs.

Routing in Clos Networks

Routing in strictly non-blocking Clos networks: This is done using an adaptive routing algorithm, which is a hybrid of the single-central-controller approach and the self-routing approach. First let us look at the two basic approaches.

Single central controller approach: A single central controller stores the global state of connections, and when requested it can easily identify a free middle-stage switch connecting the idle input with the idle output. The major disadvantage of this approach is that every connection setup requires a request to the central controller, which produces unacceptable serializing delays.

Figure 3: Clos network with central controller

Self-routing approach: In this approach an intermediate-stage switch is chosen in the hope that its link to the desired output-stage switch is available. Speculative self-routing, if successful, yields low latency, as the control delay is simply that of setting a cross-point in each of the switching stages, plus the logic delay to select an intermediate-stage switch. Under
conditions of moderate uniform traffic, speculative self-routing attempts succeed with substantial probability, and retries (along some other path) are not too frequent. However, problems may arise if the traffic is heavily non-uniform. In such a case, it may take multiple tries to complete a connection to a heavily loaded output-stage switch. Each unsuccessful attempt may in turn block others, with cascading effects on the average connection latency.

Adaptive routing approach:

Step 1: Self-routing. Upon receipt of a connection request, the input-stage switch controller chooses (at random) some intermediate-stage switch to which it has an available unused link. If the desired output port is busy, the operation is repeated up to T times. Once a connection is established, data transmission can begin.

Step 2: Reservations. After T attempts at self-routing, the connection request is forwarded to the central controller, which chooses an appropriate switch to the required output stage.

This algorithm adapts to the traffic characteristics in the switch, so that self-routing is generally successful at low traffic loads, but as the network becomes congested, global reservations are increasingly utilized. Simulation studies have shown that this approach can significantly alleviate the congestion delays, commonly associated with hot spots, which occur in Clos switches that use only self-routing.

Routing in rearrangeably non-blocking networks: As explained above, routing the connections through such a network can be reduced to a problem of edge coloring in bipartite graphs. Such an edge coloring may be required every time a connection request comes in. It is done using Euler sets. The procedure is as follows:

Make the graph regular: modify the graph so that every vertex has the same degree D [by combining vertices and adding edges; this takes O(E) time]. For D = 2^i, perform i Euler splits and 1-color each resulting graph. This is log D operations, each of O(E).
The Euler splits are formed using the following algorithm. We hop from our start vertex to a vertex in the opposite part, the two vertices being connected via an edge. We remove the edge from the graph and put it in the partition list. We then jump from this new vertex back over to the other part of the vertex set, again via an edge, removing the edge and putting it in the partition list. Which edge is picked is irrelevant. This process continues; we only stop when we are forced to, that is, when our travels bring us to a vertex that has no more edges.

Figure 4: Recursive Euler splits on a graph of degree 2^2

Those familiar with graph theory, and with Euler's Konigsberg bridge problem in particular, will recognize that this finishing vertex must be the vertex we started with, since this graph is regular with even degree; i.e., each Euler partition defines an Euler cycle. After this partition has been finished, we pick another start vertex (which vertex does not matter, so long as it has some edges left connected to it) and begin the process again. We stop forming Euler partitions when there are no more edges left in our original graph. To form the Euler split, we simply iterate through all the partitions we formed. The partitions are little more than lists of edges: we iterate through these lists, and place alternating edges in each of the graphs G1 and G2.
Suppose that the degree of our original regular bipartite multigraph is a power of two: that is, we have a bipartite multigraph where every vertex has degree 2^n. If we apply an Euler split to this graph, the result is two regular bipartite multigraphs of degree 2^(n-1). If we then apply an Euler split to each of these two graphs, the result is four regular bipartite multigraphs of degree 2^(n-2). Since we started with a graph whose degree was a power of 2, we can recursively apply these splits, and the only time we reach a graph with odd degree, when we have to stop, is when the degree is 1. At this point we have 2^n different graphs, each of degree one. Suppose, for each of these graphs, we color each edge a color unique to that graph. Since each of the colors is unique to its split graph, and each split graph has degree 1, no vertex can have two incident edges with the same color. If we assign the colors to the corresponding edges in the original graph, we have a proper edge coloring of the degree-2^n graph with 2^n colors. This was demonstrated to take time O(E log D), where D is the degree of the graph. The following image shows the implementation of such an algorithm:

Figure 5: Implementation of routing in rearrangeably non-blocking networks

Evaluating a CLOS network:

Diameter: It is defined as the longest shortest distance. The distance between any two hosts in a Clos network is the same and is equal to the number of stages + 1. Thus the diameter of a Clos network is (number of stages) + 1.

Number of paths between any two hosts: Any two hosts in a Clos network are connected via the middle-layer switches. Thus, the more middle-stage switches there are, the more paths there are between any two hosts; in fact, the number of paths between any two hosts is equal to the number of middle-stage switches.

Bisection width: N/2.

A Clos network always allows a new connection to be formed between an idle input and an idle output, either directly or by rearranging, thus ensuring that a full bijection (any one-to-one assignment) of inputs onto outputs can be realized.
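The Euler-split procedure above can be written down compactly. The sketch below is an illustrative Python rendering under the stated assumptions (a regular bipartite multigraph given as a list of (left, right) pairs, with degree a power of two): it walks Euler cycles, alternates their edges between two halves, and recurses until every subgraph has degree 1; each degree-1 subgraph then becomes one color class.

```python
from collections import defaultdict

def euler_split(edges):
    """Split a regular bipartite multigraph of even degree into two halves
    by walking Euler cycles and alternating their edges between the halves."""
    inc = defaultdict(list)                    # vertex -> incident edge ids
    for i, (u, v) in enumerate(edges):
        inc[('L', u)].append(i)
        inc[('R', v)].append(i)
    used = [False] * len(edges)

    def next_free(x):                          # next unused edge at vertex x
        while inc[x] and used[inc[x][-1]]:
            inc[x].pop()
        return inc[x][-1] if inc[x] else None

    g1, g2 = [], []
    for start in list(inc):
        while next_free(start) is not None:    # one closed Euler walk per iteration
            x, k = start, 0
            e = next_free(x)
            while e is not None:
                used[e] = True
                (g1 if k % 2 == 0 else g2).append(edges[e])   # alternate the edges
                k += 1
                u, v = edges[e]
                x = ('R', v) if x == ('L', u) else ('L', u)   # hop across the edge
                e = next_free(x)
    return g1, g2

def edge_color(edges, degree):
    """Return `degree` color classes; `degree` must be a power of two."""
    if degree == 1:
        return [edges]
    g1, g2 = euler_split(edges)
    return edge_color(g1, degree // 2) + edge_color(g2, degree // 2)

# Example: the complete bipartite graph K_{4,4} is 4-regular; each of the
# 4 color classes obtained below is a perfect matching.
K44 = [(u, v) for u in range(4) for v in range(4)]
for c in edge_color(K44, 4):
    assert len({u for u, _ in c}) == len({v for _, v in c}) == len(c) == 4
    print(sorted(c))
```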
Shuffle Exchange Networks

Definition (shuffle-exchange graph): The d-dimensional shuffle-exchange graph SE(d) is defined as an undirected graph with node set V = [2]^d and edge set E = E1 ∪ E2, with

E1 = { {(a_(d-1), ..., a_1, a_0), (a_(d-1), ..., a_1, ā_0)} | (a_(d-1), ..., a_0) ∈ [2]^d }, where ā_0 = 1 - a_0, and
E2 = { {(a_(d-1), ..., a_0), (a_0, a_(d-1), ..., a_1)} | (a_(d-1), ..., a_0) ∈ [2]^d }.

It is based on an N-node hypercube, where N = 2^d.

Figure 1: 8-node shuffle-exchange graph

Shuffle Exchange Graph Connections

Each node is labeled with a unique log N-bit string. The edges are then of two kinds, shuffle edges and exchange edges, which are illustrated below.

Shuffle Edge: A node labeled a = a_(log N - 1), ..., a_0 is linked to a node labeled b = b_(log N - 1), ..., b_0 by a shuffle edge if rotating a one position to the left or right yields b, i.e., if either b = a_0, a_(log N - 1), a_(log N - 2), ..., a_1 or b = a_(log N - 2), a_(log N - 3), ..., a_0, a_(log N - 1). Mathematically we can represent this edge set as
E2 = { {(a_(d-1), ..., a_0), (a_0, a_(d-1), ..., a_1)} | (a_(d-1), ..., a_0) ∈ [2]^d }.

Exchange Edge: Two nodes labeled a and b are linked by an exchange edge if a and b differ in only the least significant (rightmost) bit, i.e., b = a_(log N - 1), a_(log N - 2), ..., a_1, ā_0. In Figure 1, shuffle edges are the blue edges, and exchange edges are the horizontal red edges. Mathematically we can represent this edge set as:

E1 = { {(a_(d-1), ..., a_0), (a_(d-1), ..., a_1, ā_0)} | (a_(d-1), ..., a_0) ∈ [2]^d }, where ā_0 = 1 - a_0.

Figure 2: Shuffle-exchange graph: another view

Edge Representation: If a node is a d-bit binary number, exchange edges are between <b_(d-1), b_(d-2), ..., b_1, 0> and <b_(d-1), b_(d-2), ..., b_1, 1>. Shuffle edges are from <b_(d-1), b_(d-2), ..., b_1, b_0> to <b_(d-2), ..., b_0, b_(d-1)>.

Algorithm for Connections on a Shuffle Exchange Graph

The number of nodes is a power of 2, and nodes have addresses 0, 1, ..., 2^d - 1. There are two outgoing links from node i: a shuffle link to node RotateLeftBinary(i), and an exchange link to node j, where j = i + 1 if i is even (and j = i - 1 if i is odd).
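As a concrete illustration of the two edge sets, here is a small Python sketch (our own, with arbitrary function names) that builds the shuffle and exchange edges of SE(d) from the rotate-left and flip-LSB rules above.

```python
def shuffle_exchange_edges(d):
    """Edges of the d-dimensional shuffle-exchange graph on nodes 0 .. 2^d - 1.
    Shuffle edge:  i -- rotate-left(i) by one bit position (self-loops skipped).
    Exchange edge: i -- i with its least significant bit flipped."""
    n = 1 << d
    rotl = lambda i: ((i << 1) | (i >> (d - 1))) & (n - 1)   # left rotation of d bits
    shuffle = {tuple(sorted((i, rotl(i)))) for i in range(n) if i != rotl(i)}
    exchange = {(i, i ^ 1) for i in range(n) if i % 2 == 0}  # one edge per even/odd pair
    return shuffle, exchange

# The 8-node graph of Figure 1:
s, e = shuffle_exchange_edges(3)
print(sorted(s))   # e.g. (1, 2) appears, since 001 rotated left is 010
print(sorted(e))   # (0, 1), (2, 3), (4, 5), (6, 7)
```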
Properties of Shuffle Exchange Graphs

Degree = 3 (in the directed view, out-degree = 2 and in-degree = 2).

Number of edges ≈ 2N (counting directed edges), because the sum of the in-degrees equals the number of edges.

Diameter ≤ 2 log N. This corresponds to log N shuffle edges and log N exchange edges.

Bisection width = O(N / log N).

It can be seen that every property of the butterfly is also a property of a shuffle-exchange graph.

Operations on Shuffle Exchange Graphs

Perfect Out-Shuffle: The connection pattern in Figure 3 is called a perfect shuffle because it is like shuffling two halves of a card deck so that the cards from the two halves interleave perfectly. We notice that the top and bottom cards, which correspond to the bit patterns 000 and 111, do not change positions, whereas the other cards go to a position which is double their value, corresponding to a left rotation by one bit [1]. For example, 001 goes to 010, which is double of it. Hence

Shuffle(X) = (2X + ⌊2X/N⌋) mod N

Figure 3: Perfect out-shuffle illustrated
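The closed-form expression for the out-shuffle can be checked directly against the rotate-left description. A quick illustrative Python check for N = 8:

```python
N, d = 8, 3
rotl = lambda x: ((x << 1) | (x >> (d - 1))) & (N - 1)      # one-bit left rotation
for x in range(N):
    assert (2 * x + (2 * x) // N) % N == rotl(x)             # Shuffle(X) formula
print("Shuffle(X) = (2X + floor(2X/N)) mod N matches the left rotation for all X")
```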
Perfect In-Shuffle: In a perfect in-shuffle, as shown in Figure 4, the top and the bottom cards are shuffled "in" to the deck rather than "out", which means that they do not retain their positions. A perfect in-shuffle can be viewed as a perfect out-shuffle followed by an exchange, which is illustrated in Figure 5.

Figure 4: Perfect in-shuffle illustrated

Figure 5: Perfect in-shuffle = perfect out-shuffle + exchange
Routing in Shuffle Exchange Graph

Consider the 8-node shuffle-exchange graph. For any given source-destination pair, we can perform a certain number of perfect shuffles (in or out) in order to route from the source to the required destination.

Example: Suppose the source is node 7 (last position, 111) and the required destination is node 2 (third position, 010). The claim is that 3 perfect shuffles are sufficient to place the last card in the third position. It can be seen from Figure 6 that starting with an in-shuffle, followed by an out-shuffle, and lastly performing an in-shuffle again routes the card at 111 to position 010.

Figure 6: Required path in the shuffle-exchange graph (shown in green)

Let us trace through the sequence of operations illustrated by the green edges in Figure 6. The first in-shuffle actually corresponds to a shuffle followed by an exchange: from 111 the shuffle brings us back to 111, and the exchange edge then takes us to 110. The second operation, an out-shuffle, corresponds to just a shuffle, which takes us from 110 to 101. The last operation, an in-shuffle, takes us from 101 to 011 along the shuffle edge and then from 011 to the required position 010 along the exchange edge.

Algorithm to find the Routing Sequence

Take the XOR of the source and destination bit sequences. Processing the bits of the result from the most significant to the least significant, perform an in-shuffle for each 1 in the resultant bit string and an out-shuffle for each 0 in the resultant bit string.
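The routing rule can be made precise with a short sketch (illustrative Python; the bit order is most significant bit first, matching the example). It computes the in/out shuffle sequence from the XOR and simulates it on the graph to confirm that the destination is reached.

```python
def routing_sequence(src, dst, d):
    """In/out shuffle sequence routing src to dst in SE(d), using the XOR rule:
    a 1 in the XOR (MSB first) means in-shuffle (shuffle + exchange),
    a 0 means out-shuffle (shuffle only)."""
    mask = (1 << d) - 1
    rotl = lambda x: ((x << 1) | (x >> (d - 1))) & mask
    diff, ops, node = src ^ dst, [], src
    for bit in range(d - 1, -1, -1):          # the MSB of the XOR is fixed first
        node = rotl(node)                     # every step begins with a shuffle
        if (diff >> bit) & 1:
            node ^= 1                         # exchange edge flips the new LSB
            ops.append("in-shuffle")
        else:
            ops.append("out-shuffle")
    assert node == dst
    return ops

# Node 7 (111) to node 2 (010): in-shuffle, out-shuffle, in-shuffle, as in Figure 6.
print(routing_sequence(7, 2, 3))
```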
Multi-Stage Shuffle Exchange Network

The multi-stage shuffle exchange network connects N processing elements to N memory modules at a bandwidth and cost somewhere between a single bus and a crossbar network. An eight-element network is demonstrated below.

Cost of the network: O(N log N)

Algorithm for Routing in a Multi-Stage Shuffle Exchange

Let P and M, respectively, be the binary codes of the processing element and the memory module it is communicating with. The state of a switch in the first column is set to straight if the most significant bits of P and M are the same; otherwise it is set to exchange. In the second column, the next most significant bits are used, and so on.
Routing in Multi-Stage Shuffle Exchange: An Example

Figure 8: Routing: an illustration

Suppose the source is 010 (processor 2) and the destination is 111 (memory module 7). Now we apply the above algorithm. First, as the most significant bits differ, the first-stage switch is set to exchange. Second, we check the next significant bit, which is the same in both (1), so the second-stage switch is set straight. Similarly, as the third bit differs, we set the third-stage switch to exchange, and with these switch states set we route. The above example is illustrated in Figure 8.
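The switch-setting rule is easy to express in code. A minimal Python sketch (our own illustration) that reproduces the example above:

```python
def switch_settings(P, M, d):
    """Settings of the log N switch columns for source P and destination M:
    compare the bits of P and M from the most significant bit down;
    equal bits give 'straight', differing bits give 'exchange'."""
    return ["straight" if ((P >> b) & 1) == ((M >> b) & 1) else "exchange"
            for b in range(d - 1, -1, -1)]     # first column uses the MSB

# Processor 2 (010) talking to memory module 7 (111), as in Figure 8:
print(switch_settings(0b010, 0b111, 3))        # ['exchange', 'straight', 'exchange']
```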
Shuffle Exchange - A Blocking Network

It is a blocking network: access to a memory module by one processing element may prevent (block) access to a memory module by another processing element, i.e. contention may occur in the network. An example where two communications require the same path is shown in Figure 8: the two source-destination pairs 2 (010) to 7 (111) and 6 (110) to 4 (100) require the same link to communicate.

Diameter in Multi-Stage Shuffle Exchange

It can be observed from Figure 8 that the diameter of the multi-stage shuffle exchange network is

Diameter = log N (straight/exchange hops inside the switches) + log N (shuffle paths) = 2 log N.
Butterfly Networks

The butterfly network is another multistage network which evolved as a result of the demand for multistage interconnect networks. Like other multistage networks, it reduces the cost of the interconnect by following a specific topology of connections. Like a crossbar, a butterfly switch network is self-routing and serial in nature. But unlike a crossbar switch, whose cost grows as n^2, a butterfly network attaches n processors to n memories at a cost that grows only as n log n. The butterfly network is a bounded-degree network topology. This topology was used in ATM switches. The butterfly network is also known as a Banyan network, and with some modification it becomes a Benes network.

Butterfly networks have an indirect topology, i.e. the ratio of the number of switches to the number of processors is greater than 1:1; some switches simply connect to other switches. The number of switches in a butterfly network is directly related to the number of processors: if the number of processors is N, then the number of switches is N(log2 N + 1). The network consists of log2 N + 1 rows of switches, and each row contains N switches, making a total of N(log2 N + 1) switches. Rows in a butterfly network are better known as ranks. Butterfly networks are recursive in nature: if we choose any number of levels/ranks, we can find a sub-butterfly of dimension equal to the chosen number.

An example of an n-input butterfly (n = 8) with depth log2 n (log2 N = 3) is shown in Fig. 1 on the next page. An n-input butterfly has log n + 1 levels, each with n nodes. In the network, n processor nodes are connected by n(log n + 1) switching nodes. The bottom switches wrap back around to the processors. The edges of the butterfly are directed from the node in the smaller-numbered level/rank to the node in the larger-numbered level/rank, so while routing, a packet travels from the smaller-numbered rank to the larger-numbered rank. The nodes in this graph represent switches (2 x 2 switches in this case), and the edges represent communication links. Each node in a butterfly has a label <i,j>, where i is the rank of the node and j is a log n-bit binary number that denotes the column of the node.
Fig. 1: A 2^3 = 8 processor butterfly network with 8 x (3+1) = 32 switching nodes, arranged in ranks 0 to 3 with nodes <i,0> through <i,7> in each rank i.

Connections

The building of connections starts from the bottom-most row, i.e. the row with the highest rank. Each node with label Node(i,j), for i > 0, is connected to two nodes in level i-1, namely Node(i-1,j) and Node(i-1,m), where m is the integer found by inverting the i-th most significant bit in the binary log2 n-bit representation of j.

For example, for the network shown in Fig. 1, suppose i = 2 and j = 3. As Node(i,j) is connected to Node(i-1,j), Node(2,3) will be connected to Node(1,3). Now, for the other connection of Node(2,3), we look at the binary 3-bit representation of j = 3, which is 011. The i-th, i.e. 2nd, most significant bit of 011 is 1. Flipping that bit, we get 001 (= 1), so Node(2,3) will also be connected to Node(1,1). In this way, progressing from the highest rank and making connections for all ranks > 0, we get the network shown in Fig. 1 (a construction sketch follows below).
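Here is the promised construction sketch (illustrative Python, not from the original report): it generates both down-edges of every node using the bit-flip rule, and reproduces the Node(2,3) example.

```python
def butterfly_edges(d):
    """Edges of a 2^d-processor butterfly with ranks 0 .. d.
    Node(i, j), for i > 0, connects to Node(i-1, j) and to Node(i-1, m),
    where m is j with its i-th most significant bit (of d bits) inverted."""
    n = 1 << d
    edges = []
    for i in range(1, d + 1):
        for j in range(n):
            m = j ^ (1 << (d - i))                 # flip the i-th most significant bit
            edges.append(((i, j), (i - 1, j)))     # straight edge
            edges.append(((i, j), (i - 1, m)))     # cross edge
    return edges

# Reproduces the example in the text: Node(2, 3) connects to Node(1, 3) and Node(1, 1).
print([e for e in butterfly_edges(3) if e[0] == (2, 3)])
```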
Fig. 2: A butterfly in the butterfly network (the cycle through nodes <1,1>, <1,3>, <2,1> and <2,3>).

Why Butterfly?

As shown in Fig. 2, walk cycles such as Node(i,j), Node(i-1,j), Node(i,m), Node(i-1,m), Node(i,j), where m is determined by flipping the i-th most significant bit in the binary representation of j, constitute a cycle. This cycle looks like a butterfly, hence the name.

Routing in Butterfly Network

As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm. The number of bits in the destination address is the length (in edges) of the path from source to destination. At every node, the protocol to obey is:

0 means ship left.
1 means ship right.

For example, suppose we have to send a packet from processor 2 to processor 5. We start from the switching node directly connected to processor 2. Routing here is based on the binary representation of the destination's address, in this case 101. The message to be sent towards the destination is therefore tagged 101|Message. At level 0, we pluck off the leftmost bit of the tag and forward the message according to
the protocol mentioned above. In this case the leftmost bit is 1, so we remove it and ship 01|Message towards the right, as shown in Fig. 3.

Fig. 3: Routing in a butterfly network (the tag 101 is consumed one bit per level on the way from processor 2 to processor 5).

Now, at level 1, we pluck off the leftmost bit, which is 0, and send 1|Message towards the left (as 0 means ship left). At level 2, we remove the leftmost bit, 1, and send Message towards the right. Thus, after 3 steps the message reaches the switch directly connected to processor 5, and hence processor 5 itself.

Parameters of Butterfly Network

The maximum length of the path required for routing from one processor to another is log2 N, as shown above, where N is the number of processors. Therefore, the diameter of the butterfly network is log2 N. Each interior node in the network is connected to 4 other nodes (two in the rank above and two in the rank below), so the number of edges per node is a small constant independent of the network size. Edge length is not constant in the network: as is clearly visible, as the rank decreases there is an exponential growth in the length of the edges.
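The bit-peeling routing of the previous subsection can be sketched as follows (illustrative Python; instead of literal left/right moves, which depend on how the network is drawn, the sketch takes the cross edge at rank i exactly when the current column disagrees with the destination's i-th most significant bit, which amounts to the same thing).

```python
def butterfly_route(src, dst, d):
    """Route from processor src to processor dst in a 2^d-processor butterfly.
    Start at Node(0, src); at rank i, consume the i-th most significant bit of
    dst and take the cross edge if the current column disagrees with it.
    Returns the sequence of switch nodes visited."""
    path, col = [(0, src)], src
    for i in range(1, d + 1):
        bit = (dst >> (d - i)) & 1                 # peel destination bits, MSB first
        if bit != (col >> (d - i)) & 1:
            col ^= 1 << (d - i)                    # cross edge flips this bit
        path.append((i, col))
    assert col == dst                              # after d ranks we reach dst's column
    return path

# Processor 2 (010) to processor 5 (101), consuming the tag 101 one bit per rank:
print(butterfly_route(2, 5, 3))
```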
BATCHER'S SORTING NETWORK

Batcher's sorting network uses bitonic sort. Bitonic sort is one of the fastest sorting networks. A sorting network is a special kind of sorting algorithm where the sequence of comparisons is not data-dependent. This makes sorting networks suitable for implementation in hardware or in parallel processor arrays. Bitonic sort consists of O(n log^2 n) comparisons in O(log^2 n) stages. Although a sorting network with only O(n log n) comparisons is known, due to its large constant it is slower than bitonic sort for all practical problem sizes. [Refer to link 1]

In the following, bitonic sort is developed on the basis of the 0-1 principle. The 0-1 principle states that a comparator network that sorts every sequence of 0's and 1's is a sorting network, i.e. it sorts every sequence of arbitrary values.

Basic concepts

Definition: A sequence a = a_0, ..., a_(n-1) with a_i ∈ {0, 1}, i = 0, ..., n-1, is called a 0-1 sequence. A 0-1 sequence is called bitonic if it contains at most two changes between 0 and 1, i.e. if there exist subsequence lengths k, m ∈ {1, ..., n} such that

a_0, ..., a_(k-1) = 0...0,  a_k, ..., a_(m-1) = 1...1,  a_m, ..., a_(n-1) = 0...0, or
a_0, ..., a_(k-1) = 1...1,  a_k, ..., a_(m-1) = 0...0,  a_m, ..., a_(n-1) = 1...1.

In the following figure, different examples of bitonic 0-1 sequences are outlined, where 0's are drawn white and 1's gray.

Figure 1: Some examples of bitonic 0-1 sequences
Definition: Let n ∈ N, n even. The comparator network B_n is defined as follows:

B_n = [0 : n/2] [1 : n/2+1] ... [n/2-1 : n-1]   (see the example in Figure 2)

Example:

Figure 2: Comparator network B_8

Theorem: Let a = a_0, ..., a_(n-1) be a bitonic 0-1 sequence, n even. Application of the comparator network B_n to a yields

B_n(a) = b_0, ..., b_(n/2-1)  c_0, ..., c_(n/2-1)

where all b_i are less than or equal to all c_j, i.e. b_i <= c_j for all i, j ∈ {0, ..., n/2-1}, and furthermore b_0, ..., b_(n/2-1) is bitonic and c_0, ..., c_(n/2-1) is bitonic.

Proof: Let a = a_0, ..., a_(n-1) be a bitonic 0-1 sequence. Written in two rows, the sequence looks like shown in the following figure. Either the sequence starts with 0's, continues with 1's, and ends with 0's (Figure 3a), or it starts with 1's, continues with 0's, and ends with 1's (Figure 3b). The regions of 1's may or may not overlap.

Figure 3: Bitonic 0-1 sequences (arranged in two rows), cases (a) and (b)

Several other variations are possible (see Figure 4 below). Application of the comparator network B_n corresponds to a comparison between the upper and the lower row. In each case, the result stated in the theorem is achieved: all b_i are less than or equal to all c_j, and b is bitonic and c is bitonic (Figure 4):
Figure 4: Application of comparator network B_n to bitonic 0-1 sequences

Bitonic sorting network

The building blocks of the sorting network BitonicSort are the comparator networks B_k for different k, where k is a power of 2. By using the divide-and-conquer strategy, the networks BitonicMerge and BitonicSort are formed. First, a comparator network BitonicMerge is built that sorts a bitonic sequence. By the theorem, B_n produces two bitonic subsequences, where all elements of the first are less than or equal to those of the second. Therefore, BitonicMerge can be built recursively as shown in Figure 5. The bitonic sequence required as input for BitonicMerge is composed of two sorted subsequences, the first in ascending and the other in descending order. The subsequences themselves are sorted by recursive application of BitonicSort (Figure 6).

Figure 5: BitonicMerge(n)
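The theorem about B_n stated above is easy to verify exhaustively for small n. The following Python check (our own illustration) applies B_8 to every bitonic 0-1 sequence of length 8 and confirms both claims: every element of the first half is at most every element of the second half, and both halves are again bitonic.

```python
from itertools import product

def B(a):
    """Comparator network B_n: compare-exchange positions i and i + n/2."""
    a = list(a)
    half = len(a) // 2
    for i in range(half):
        if a[i] > a[i + half]:
            a[i], a[i + half] = a[i + half], a[i]
    return a

def is_bitonic01(a):
    """A 0-1 sequence is bitonic iff it contains at most two 0/1 changes."""
    return sum(a[i] != a[i + 1] for i in range(len(a) - 1)) <= 2

for a in product([0, 1], repeat=8):
    if is_bitonic01(a):
        out = B(a)
        b, c = out[:4], out[4:]
        assert max(b) <= min(c)                     # every b_i <= every c_j
        assert is_bitonic01(b) and is_bitonic01(c)  # both halves remain bitonic
print("theorem verified for all bitonic 0-1 sequences of length 8")
```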
Figure 6: BitonicSort(n)

In Figure 7 below, the sorting network BitonicSort(8) is given as an example. The bitonic sequence e in the middle is sorted by recursive application of B_k. The sequence e is bitonic, since it is composed of two halves d and d' which are sorted in opposite directions. These in turn are formed from bitonic sequences a and a', and so on. Given an arbitrary 0-1 sequence as input to the comparator network, the assertions stated in the figure will hold.

Figure 7: Sorting network BitonicSort for n = 8
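For reference, the recursive structure of Figures 5-7 can be written as a short program. This is a Python sketch of BitonicSort/BitonicMerge for inputs whose length is a power of two (a software rendering of the network, not the hardware comparator layout itself).

```python
def bitonic_merge(a, ascending=True):
    """Sort a bitonic sequence: apply the B_n stage, then recurse on both halves."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):                          # the B_n comparator stage
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending) +
            bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    """Sort the halves in opposite directions, then merge the bitonic result."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    first = bitonic_sort(a[:half], True)           # ascending half (like d)
    second = bitonic_sort(a[half:], False)         # descending half (like d')
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))      # [1, 2, 3, 4, 5, 6, 7, 8]
```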
At the output of the comparator network, the 0-1 sequence is sorted. Now the 0-1 principle can be applied to the comparator network as a whole: since the network sorts every arbitrary 0-1 sequence, it also sorts every sequence of arbitrary values; hence it is a sorting network.

Analysis

In order to form a sorted sequence of length n from two sorted sequences of length n/2, log(n) comparator stages are required (e.g. the 3 = log(8) comparator stages that form sequence i from d and d'). The number of comparator stages T(n) of the entire sorting network is given by

T(n) = log(n) + T(n/2).

The solution of this recurrence equation is

T(n) = log(n) + log(n)-1 + log(n)-2 + ... + 1 = log(n) (log(n)+1) / 2.

Each stage of the sorting network consists of n/2 comparators. In total, these are O(n log^2(n)) comparators.
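A quick check of the two formulas (illustrative Python): the recurrence T(n) = log n + T(n/2) indeed sums to log n (log n + 1)/2, and multiplying by the n/2 comparators per stage gives the total comparator count.

```python
import math

def stages(n):
    """T(n) = log n + T(n/2), with T(1) = 0."""
    return 0 if n == 1 else int(math.log2(n)) + stages(n // 2)

for n in (8, 16, 1024):
    k = int(math.log2(n))
    assert stages(n) == k * (k + 1) // 2           # closed form for T(n)
    print(n, "inputs:", stages(n), "stages,", (n // 2) * stages(n), "comparators")
```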
References:

[1] C. Clos, "A Study of Non-Blocking Switching Networks," The Bell System Technical Journal, vol. 32, no. 2, March 1953, pp. 406-424.

[2] R. Melen and J.S. Turner, "Nonblocking Multirate Clos Networks," IEEE Communications Magazine, October 2003.

[3] P.A. Franaszek, C.J. Georgiou and Chung-Sheng Li, "Adaptive Routing in Clos Networks," Proceedings of the International Conference on Computer Design, 1995.

[4] Thomas W. Finley, "Efficient Myrinet Routing," 2002.

[5] Nick McKeown, "Scaling Crossbar Switches," Electrical Engineering and Computer Science, Stanford University.

[6] Comparison of Cole's parallel sorting algorithm to Batcher's odd-even merge sorting algorithm, and how Batcher's, despite having O(log^2 n) steps, outperforms the former. http://www.idi.ntnu.no/~lasse/publics/sc9.pdf

[7] Chung J. Kuo and Zhi W. Huang, "Modified Odd-Even Merge Sort Network for Arbitrary Number of Inputs," IEEE Xplore.

[8] http://www.cs.hmc.edu/~keller/courses/cs156/s98/slides/index.html