Fault Tolerance in the Block-Shift Network




IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 1, MARCH 2001

Yi Pan, Member, IEEE

Abstract: The Block Shift Network (BSN) is a new topology for interconnection networks in multiprocessor systems. BSN is a class of networks defined by several parameters, and has a constant number of links/node for some given parameters. Many popular networks, such as the hypercube, the shuffle-exchange, and the complete networks, are instances of the BSN for different parameters. Performance of BSN has been evaluated through analysis, simulation, and design of typical parallel algorithms on it. The results indicate that BSN surpasses the hypercube in several respects while retaining most of the hypercube advantages, especially when the traffic has the locality property. As the size & complexity of a system increase, however, the reliability aspects become equally important and should be included in the system-performance study.

This paper discusses the reliability issue of BSN. Several reliability measures, including network connectivity, network diagnosability, and 2-terminal reliability, are obtained through analysis. This paper shows that the BSN not only surpasses the hypercube in performance as confirmed before, but also has comparable reliability to the hypercube under similar conditions. BSN is also very flexible in balancing its cost and performance. One can increase two parameters to enhance the performance and reliability of the BSN, while it is impossible to do so in the hypercube once its size is fixed. The BSN can be an effective interconnection network for future parallel computer systems.

Future research includes more accurate reliability analysis for BSN, development of more efficient fault-tolerant routing algorithms, design and analysis of fault-tolerant broadcast and multicast algorithms, and comparisons with various augmented or modified hypercubes in terms of reliability and fault tolerance.
Index Terms: Connectivity, diagnosability, hypercube, interconnection network, routing.

ACRONYMS (1)
BSN: block shift network
PE: processor element

(1) The singular & plural of an acronym are always spelled the same.

Manuscript received January 17, 1998; revised February 25, 1999 and September 22, 1999. Yi Pan's research was supported in part by the US National Science Foundation under grants CCR-9211621 and CCR-9503882, the US Air Force Office of Scientific Research under grant F49620-93-C-0063, the US Air Force Avionics Laboratory, Wright Laboratory under grant F33615-C-2218, an Ohio Board of Regents Investment Fund Competition grant, and an Ohio Board of Regents Research Challenge grant. The author is with the Computer Science Dep't, Georgia State University, Atlanta, GA 30303 USA (e-mail: pan@cs.gsu.edu). Publisher Item Identifier S 0018-9529(01)06815-4.

I. INTRODUCTION

WHEN SEVERAL processors are required to work cooperatively on a single task, one anticipates frequent exchange of data among the several subtasks that comprise the main task. The important factors affecting this intercommunication are:
1) amount of data,
2) frequency with which data are transmitted,
3) speed of data transmission,
4) interconnection between processors,
5) route that data take.

Factors 1 & 2 depend on the algorithm itself and how well it has been partitioned. Ideally, in a nonshared memory computer, if one processor wants to communicate with another, then it should do so over a link that directly connects the two. A link between every pair of processors would yield a system that is most versatile. Given a sufficient number of processors, there would be few or no problems with scheduling. Such a system is undoubtedly the most desirable, but is also prohibitively expensive. A link between every pair of processors would require $N(N-1)/2$ links for $N$ processors. For any large $N$ the total cost of these links would swamp all other costs. Therefore, cost must be traded off against speed and versatility.
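To make the cost argument concrete, the following minimal sketch compares the number of links in a fully connected system with the number in a hypercube of the same size. The function names are hypothetical, and the hypercube count (each of the 2^n nodes has n links, each link shared by two nodes) is the standard formula rather than something stated in this passage.

```python
def complete_network_links(num_nodes: int) -> int:
    """Links needed when every pair of processors is directly connected."""
    return num_nodes * (num_nodes - 1) // 2


def hypercube_links(dimension: int) -> int:
    """Links in a hypercube of 2**dimension nodes (dimension links per node)."""
    return dimension * (1 << dimension) // 2


if __name__ == "__main__":
    # Columns: nodes, links in the complete network, links in the hypercube.
    for dimension in (4, 8, 12):
        num_nodes = 1 << dimension
        print(num_nodes,
              complete_network_links(num_nodes),
              hypercube_links(dimension))
```

For 4096 processors the complete network needs several million links while the hypercube needs a few tens of thousands, which is the gap the trade-off above refers to.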
The trade-off that has been made involves routing data from one processor to another via intermediate processors when there is no direct link between the two processors. This has four effects on the system performance.
1) There is now an extra delay added in data transmission because of intermediate stages.
2) Perhaps more important than #1 is the added capability that must be built into each processor to allow it to perform this routing intelligently.
3) There could be a larger queuing delay in each processor's output queue, because there is more traffic load in the system due to message relaying.
4) The system reliability becomes a problem.

Therefore, designing an interconnection network that minimizes the message delay time, keeps the cost low, and maximizes the reliability is an important task facing computer scientists.

The hypercube network is very popular for parallel computing systems [16]. Variations of the hypercube have also been proposed in [2], [6], [8], [14], [17], [18], [20] by changing or adding some extra links to reduce the communication time or to increase reliability in the hypercube. Combining several networks has been used to construct new networks. Recent examples are Hyperbanyan [9], Hyper-deBruijn [10], and Banyan-hypercube [29]. Other researchers use a recursive definition to construct hierarchical networks [1], [5], [11], [13], [15], [19]. The basic idea is to synthesize a network from simple building blocks in an incremental fashion.

More recently, the BSN was proposed as an efficient interconnection topology for parallel computers [21], [22], [24]. The BSN has a constant number of links per node for some given parameters. Many popular networks such as the hypercube, the

shuffle-exchange, and the complete networks are instances of the BSN for different parameters [21], [22], [24]. The BSN has been designed to eliminate the drawbacks of the hypercube network while retaining its advantages. Because many application algorithms use the links on lower dimensions extensively, connections on lower dimensions are provided in BSN. In BSN, the neighboring nodes are tightly coupled, and remote nodes are loosely coupled, making the topology suitable for localized traffic patterns which are very common in many applications. Unlike the shuffle-exchange, BSN allows the level of fault-tolerance to be set by a pair of parameters. Preliminary results show that BSN performs much better than existing networks in many aspects and is very promising for massively parallel computers [21], [24]. However, the reliability aspects become equally important as the size & complexity of a system increase.

Definitions:
survive: each node can reach all other nodes in a network.
$k$-node-fault tolerant: a network can survive the failure of any set of $k$ arbitrary node faults.
$k$-link-fault tolerant: a network can survive the failure of any set of $k$ arbitrary link faults.

The survivability criterion can be interpreted as connectivity of the network or of a given part thereof. Hence, in the design of a new network, message pathways should be redundant to provide robustness in the event of component failure. Hypercube reliability has been studied extensively in the literature [17], [27].

This paper examines the reliability and fault tolerance aspects of BSN. Network connectivity, network diagnosability, and terminal reliability of the BSN are derived through rigorous analysis. Comparison with the hypercube is included. Section II describes BSN and related terms used in its definition. Section III discusses the network connectivity of the BSN; link & node connectivities are covered.
Section IV describes the fault diameter of a network. Section V analyzes the diagnosability of the network. Section VI analyzes the 2-terminal reliability of the network.

Notation (2)
BSN($a$, $b$): a BSN which has $N = 2^n$ nodes and in which, in each step, only $a$ bits can be changed within the section of the rightmost $b$ bits in an address
PE with a given index
binary and other representations of an index
link operation probability
event of having at least 1 path operational, out of the paths of a given length
2-terminal reliability of a BSN
pair of source & destination nodes in a network

(2) Other, standard, notation is given in "Information for Readers & Authors" at the rear of each issue.

II. BLOCK SHIFT NETWORK

In the hypercube with $2^n$ nodes, a PE is connected to those PE whose index differs from its own in exactly one bit $k$, for $0 \le k \le n-1$. The connection between two PE whose addresses differ only in bit $k$ is the dimension-$k$ connection. For example, the link between PE 000 and 010 is the dimension-1 connection. The connection concept is now generalized. The connections between two processors are called the connections on dimensions $i$ to $j$ if the two PE differ only in bits from $i$ to $j$ and are connected by the links on these dimensions. There are several variations for the connections on dimensions $i$ to $j$; this paper considers the following connection methods.

1) If two processors whose addresses differ only in positions $i$ to $j$ are directly connected, then the connection is a concurrent-connection method. That is, in a concurrent-connection method, bits $i$ to $j$ in one address can be changed in 1 step to reach another address. When $i = 0$ and $j = n-1$ (covering all dimensions), the connection scheme corresponds to the fully connected topology.

2) If two processors whose addresses differ only in positions $i$ to $j$ can reach each other by changing bits $i$ to $j$ of their addresses one by one, then the connection is a sequential-connection method. Thus, for 2 processors with addresses differing in all bits from $i$ to $j$, one needs $j - i + 1$ unit routes to send a message from one to the other. If only 1 bit is different in bits $i$ to $j$, then 1 step is needed. When $i = 0$ and $j = n-1$, the connection scheme corresponds to a hypercube topology.

3) Connection methods #1 & #2 are the extreme cases. Basically, method #1 can change the whole section (bits $i$ to $j$) of the address in 1 step, and method #2 can change the section only 1 bit at a time. Between these two extremes, other methods can be defined. For example, define a connection method which can change the section (bits $i$ to $j$) 2 bits at a time, 3 bits at a time, and so on. Assume that the section (bits $i$ to $j$) has $b$ bits and can be divided into $b/a$ subsections of $a$ bits each. A connection method which can change a whole subsection at a time is a partial-connection method; i.e., a partial-connection can change a whole subsection into any pattern in 1 step by modifying 1 bit, 2 bits, ..., or $a$ bits in the subsection.

Basically, BSN consists of 3 groups of edges:

Group #1 connects nodes to their counterparts with addresses shifted cyclically $b$ positions left in 1 step; i.e., it connects the processor at address $(x_{n-1}, x_{n-2}, \ldots, x_0)$ to the processor at $(x_{n-b-1}, \ldots, x_0, x_{n-1}, \ldots, x_{n-b})$. These connections are called L-Shift links, and the data transfers over these links are L-Shift operations.

Group #2 similarly connects nodes to those with addresses shifted cyclically $b$ positions right in 1 step; they are called R-Shift links, and the data transfers over these links are R-Shift operations.

Group #3 contains the connections over the rightmost $b$ dimensions. One of the 3 connection methods above is used to define the connection over the rightmost $b$ dimensions. The links in group #3 are called

R-change links; and the data transfers over these links are R-change operations.

The remainder of this section assumes that the BSN has $N = 2^n$ nodes and in each step only $a$ bits can be changed within the section of the rightmost $b$ bits, and is labeled BSN($a$, $b$). When:
$a = b$, the network has a concurrent connection method over the last $b$ dimensions;
$a = 1$, the network has a sequential connection method;
$1 < a < b$, the network has a partial connection method.

Define the nodes which are connected by the links in group #3 (R-change links) as a block. Conceptually, in a BSN($b$, $b$) there are $2^{n-b}$ blocks each with $2^b$ nodes. Within each block is a complete graph (thin lines), and blocks are connected by either R-Shift or L-Shift links (thick lines), as shown in Fig. 1. If the sequential connection method is used, as in the BSN(1, $b$), then each block is a hypercube with $2^b$ nodes instead of a complete graph, as shown in Fig. 2. The loops in Figs. 1 and 2 are R-Shift or L-Shift links which happen to connect the processors to themselves.

Fig. 1. A BSN(2, 2) with N = 16.

Fig. 2. A BSN(1, 2) with N = 16.

The BSN is a hierarchical structure with nodes connected tightly within blocks and with blocks connected loosely. This property matches the communication requirements of most parallel application algorithms [5], [11], [15]. The design of the BSN is motivated by the fact that although the hypercube is useful for many parallel algorithms, it lacks flexibility and costs too much when it is large. On the other hand, BSN is flexible because its parameters $a$ and $b$ can be changed to meet the performance & cost requirements. The BSN is scalable in the sense that changing the size of the network does not require changing the hardware within the nodes.

Many existing networks are special cases of the BSN. For example, BSN(1, 1) is the shuffle-exchange network; BSN(1, $n$) is the $n$-dimensional hypercube; BSN($n$, $n$) is the complete network. Thus, the BSN study is useful for comparing the performance of these networks.
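The three link groups described in this section can be sketched as a neighbour function. This is a minimal illustration, not the author's code: it assumes the labelling BSN(a, b) on 2^n nodes, assumes a divides b so that the rightmost b bits split into b/a subsections, and uses the hypothetical names `rotate_left` and `bsn_neighbors`. Self-loops (the loops in Figs. 1 and 2) are dropped.

```python
def rotate_left(x: int, k: int, n: int) -> int:
    """Cyclically shift the n-bit address x left by k positions."""
    k %= n
    return ((x << k) | (x >> (n - k))) & ((1 << n) - 1)


def bsn_neighbors(x: int, a: int, b: int, n: int) -> set:
    """Neighbours of node x in a BSN(a, b) with 2**n nodes (assumed model)."""
    if not (1 <= a <= b <= n) or b % a:
        raise ValueError("need 1 <= a <= b <= n with a dividing b")
    # Groups #1 and #2: L-Shift and R-Shift by b positions.
    nbrs = {rotate_left(x, b, n), rotate_left(x, n - b, n)}
    # Group #3: R-change links, one a-bit subsection of the rightmost
    # b bits may be rewritten to any pattern in a single step.
    sub_mask = (1 << a) - 1
    for s in range(b // a):
        offset = s * a
        cleared = x & ~(sub_mask << offset)
        for pattern in range(1 << a):
            nbrs.add(cleared | (pattern << offset))
    nbrs.discard(x)  # drop self-loops and the unchanged pattern
    return nbrs


if __name__ == "__main__":
    # The two 16-node networks of Figs. 1 and 2.
    print(sorted(bsn_neighbors(0b0110, 2, 2, 4)))  # concurrent blocks
    print(sorted(bsn_neighbors(0b0110, 1, 2, 4)))  # sequential blocks
```

Under these assumptions the special cases behave as expected: with a = 1 and b = n the shift is the identity and only hypercube links remain, and with a = b = n every other node is a neighbour.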
Preliminary performance evaluation of BSN indicates that BSN is superior to many existing interconnection networks, including the hypercube [21], [24]. For example, the network degree (the maximum number of ports/node) in the BSN of nodes is. This cost measure compares the BSN favorably with the hypercube because BSN has a constant degree once $a$ and $b$ are fixed, while the degree of the hypercube increases with its number of nodes. The diameter of BSN with nodes and degree is. As increases, the diameter of a BSN decreases, and its number of links increases. Consider which has the same link complexity as a hypercube; the diameter of is smaller than that of the hypercube when [21], [24]. Other performance measures, such as average distance, average delay time, and message termination probability, all indicate that BSN has a better performance than the hypercube of the same size. For more details on the performance, routing algorithms, and parallel algorithms on BSN, see [21], [23], [24].

Section III discusses the reliability of the BSN. Several reliability measures are derived for BSN. BSN is not only superior to the hypercube in performance, but also a more reliable network than the hypercube under similar conditions.

III. NETWORK CONNECTIVITY

Network nodes and communication links do fail and must be removed from service for repair. When components fail, the network should continue to function with reduced capacity; thus communication paths that avoid inactive nodes and links are desirable. This is particularly important for large multiprocessor

systems because the probability of all network components operating correctly decreases as the size of the network increases. This section analyzes BSN with 2 examples:
1) BSN with sequential connection method, and
2) BSN with concurrent connection method.
BSN with partial connection method can be analyzed similarly.

The parameter network-connectivity measures the resiliency of a network and its ability to continue operation despite disabled components. Informally, connectivity is the minimum number of nodes or links that must fail for the network to be partitioned into 2 or more disjoint subnetworks. Formally, the node-connectivity between two nodes is the minimum number of nodes that must be removed from the network to disconnect the two nodes [28]. The node-connectivity of a network is the minimum node-connectivity value over all pairs of nodes. For example, the node-connectivity of a ring network is 2, because the failure of 2 non-adjacent nodes prevents some pair of nodes from communicating, while a single failure does not. Similarly, the link-connectivity between two nodes is the minimum number of links that must be removed from the network to disconnect the two nodes. The link-connectivity of a network is the minimum link-connectivity over all pairs of nodes. The link-connectivity of a ring is also 2. The remainder of this section simply uses node-connectivity or link-connectivity to refer to the node-connectivity or link-connectivity of a network.

An important component of a parallel system is the fault-tolerance and fault-diagnosis capabilities, which are directly related to the node- and link-connectivity. As demonstrated by [28], a network is $k$-connected (node-connectivity $k$) if there exist at least $k$ disjoint paths between every pair of nodes in the network. In other words, a network is $k$-connected if there is a path between any pair of nodes after the removal of no more than $k-1$ nodes. With this as a basis, determine the BSN connectivity.
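The definition of node-connectivity can be checked directly on small networks by removing every candidate set of nodes and testing whether the survivors can still reach each other. The sketch below is a brute-force illustration only (exponential in the network size); the helper names `ring`, `hypercube`, `is_connected` and `node_connectivity` are hypothetical.

```python
from itertools import combinations


def ring(size: int) -> dict:
    """Adjacency sets of a ring network with the given number of nodes."""
    return {i: {(i - 1) % size, (i + 1) % size} for i in range(size)}


def hypercube(dimension: int) -> dict:
    """Adjacency sets of a hypercube with 2**dimension nodes."""
    return {x: {x ^ (1 << k) for k in range(dimension)}
            for x in range(1 << dimension)}


def is_connected(adj: dict, removed: set) -> bool:
    """True if the nodes outside `removed` can all reach one another."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alive)


def node_connectivity(adj: dict) -> int:
    """Smallest number of node removals that disconnects the network."""
    nodes = list(adj)
    for k in range(len(nodes) - 1):
        for cut in combinations(nodes, k):
            if not is_connected(adj, set(cut)):
                return k
    return len(nodes) - 1  # usual convention for a complete network


if __name__ == "__main__":
    print(node_connectivity(ring(6)))
    print(node_connectivity(hypercube(3)))
```

A ring survives any single node failure but not every pair of failures, and a 3-dimensional hypercube survives any 2 failures, which matches the values quoted for these networks.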
Consider the BSN as a 2-level structure. All the nodes whose addresses differ in only the least significant $b$ bits are grouped in a block; i.e., all the nodes in the same block have the same bit pattern in the most significant $n-b$ bits of their addresses.

Lemma 1: There exists a set of $b$ disjoint paths between any two nodes within a block in a BSN(1, $b$).

Proof: Because each block is a hypercube of size $2^b$, which has been proved to have $b$ disjoint paths [3], the lemma is proved.

Lemma 2: There exists a set of $2^b - 1$ disjoint paths between any two nodes within a block in a BSN($b$, $b$), which are 1 link apart.

Proof: The fact that each block is a complete network of size $2^b$ proves the lemma.

To derive the BSN node-connectivity, transform the BSN $G$ into a new network $G'$ by considering a block in $G$ as a node in $G'$. Thus, the address of a node in $G'$ (a block in $G$) is the address of any node in the corresponding block in $G$ with the least significant $b$ bits discarded. If there is a connection between any 2 nodes in two different blocks of $G$, then connect the two corresponding nodes in $G'$. Hence, according to the definition of BSN and this transformation of $G$ to $G'$, it is easy to generate the new network $G'$ as follows. Let $(x_{m-1}, \ldots, x_0)$ be the radix-$2^b$ representation of $X$, and let $(y_{m-1}, \ldots, y_0)$ be the radix-$2^b$ representation of $Y$, where $m = (n-b)/b$ is the number of digits. Nodes $X$ and $Y$ are connected if

$y_i = x_{i+1}$, $0 \le i \le m-2$   (1)

or

$y_{i+1} = x_i$, $0 \le i \le m-2$.   (2)

Equation (1) corresponds to the R-Shift connections defined in a BSN; (2) corresponds to the L-Shift connections defined in a BSN. Next consider the node-connectivity of the new graph $G'$.

Lemma 3: The node-connectivity of $G'$ is at least.

Proof: Consider the paths between nodes $X$ and $Y$ in Fig. 3, where nodes are represented in radix-$2^b$. These paths are node-disjoint because an intermediate node in path has as the least significant digit.

Lemma 3 implies that removal of any nodes from $G'$ does not disconnect $G'$.

Theorem 1: The node-connectivity of a BSN(1, $b$) is $b$.

Proof: Let there be $b-1$ faulty nodes. All the nodes within the same block are connected because there are $b$ disjoint paths between them, according to lemma 1.
On the other hand, the faulty nodes cannot disconnect the condensed network $G'$, because of the node-connectivity of $G'$ established in lemma 3. Hence, all nodes in the BSN(1, $b$) remain connected even if the faulty nodes are removed from the network. Thus, the theorem is proved.

Theorem 2: The node-connectivity of a BSN($b$, $b$) is $2^b - 1$.

Proof: Similar to the proof of theorem 1, this theorem is a direct consequence of lemmas 2 & 3.

In any network, the node-connectivity is smaller than or equal to the minimum degree, since removing all neighbors of a node whose degree is equal to the minimum degree results in a disconnected network. Also, the node-connectivity must always be smaller than or equal to the link-connectivity [4], because removing a node effectively removes all links connected to that node. Thus, a node failure is more damaging to network-connectivity than a link failure, and fewer node failures could be necessary to disconnect the network. Thus, corollaries 1 & 2 follow.

Corollary 1: The link-connectivity of a BSN(1, $b$) is $b$.

Corollary 2: The link-connectivity of a BSN($b$, $b$) is $2^b - 1$.

A network with node-connectivity $k$ provides $k$ disjoint paths between every pair of nodes in the network and it can tolerate $k-1$ failures. Hence, higher node- or link-connectivity increases the resiliency of the network to failure, provided appropriate mechanisms exist to recover. New routing paths can be established based on the proofs of lemmas 1, 2, and 3. Of course, longer paths have to be used to avoid faulty nodes & links, as shown in Fig. 3.

In addition to being a measure of network reliability, connectivity is also a measure of performance. Greater connectivity reduces the number of links that must be crossed to reach a destination node. Since technology barriers, such as pinout constraints, limit the number of connections per node to a small constant, designing a network with higher connectivity and a constant number of connections per node is especially important.
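The remark that longer paths are needed to avoid faulty nodes can be illustrated on the 16-node network of Fig. 1. The sketch below is not the routing scheme derived from the lemma proofs; it substitutes a plain breadth-first search that skips faulty nodes. The construction assumes the BSN(a, b) model on 2^n nodes with a dividing b, and the names `rotate_left`, `bsn_graph` and `shortest_route` are hypothetical.

```python
from collections import deque


def rotate_left(x: int, k: int, n: int) -> int:
    """Cyclically shift the n-bit address x left by k positions."""
    k %= n
    return ((x << k) | (x >> (n - k))) & ((1 << n) - 1)


def bsn_graph(a: int, b: int, n: int) -> dict:
    """Adjacency sets of a BSN(a, b) with 2**n nodes (assumed model)."""
    if not (1 <= a <= b <= n) or b % a:
        raise ValueError("need 1 <= a <= b <= n with a dividing b")
    adj = {}
    for x in range(1 << n):
        nbrs = {rotate_left(x, b, n), rotate_left(x, n - b, n)}  # shifts
        for s in range(b // a):  # R-change links, one subsection at a time
            cleared = x & ~(((1 << a) - 1) << (s * a))
            nbrs.update(cleared | (p << (s * a)) for p in range(1 << a))
        nbrs.discard(x)
        adj[x] = nbrs
    return adj


def shortest_route(adj: dict, src: int, dst: int, faulty=frozenset()):
    """Breadth-first route from src to dst avoiding faulty nodes, or None."""
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in sorted(adj[v]):
            if w not in faulty and w not in parent:
                parent[w] = v
                queue.append(w)
    return None


if __name__ == "__main__":
    net = bsn_graph(2, 2, 4)
    print(shortest_route(net, 0b0000, 0b1111))
    print(shortest_route(net, 0b0000, 0b1111, faulty={0b0011}))
```

In this example the fault-free route from 0000 to 1111 takes 3 links, a single fault at 0011 forces a 5-link detour, and failing all three neighbours of node 0000 leaves no route at all, consistent with a node-connectivity of 3 for this network.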

Fig. 3. 2 paths in network G.

Theorems 1 & 2 and corollaries 1 & 2 show that both BSN with sequential connection method and BSN with concurrent connection method have optimal connectivity in the sense that the minimum degrees of the two networks are equal to the connectivity: $b$ and $2^b - 1$, respectively. To see this, consider a special node such as node 0. In a BSN with $a = 1$, node 0 has $b$ neighbors, since L-Shift or R-Shift result in a loop to the node itself. Similarly, in a BSN with $a = b$, node 0 has $2^b - 1$ neighbors. Hence, if both the BSN and the hypercube have the same network degree, they have the same network-connectivity. However, higher network-connectivity can be achieved by adjusting $a$ and $b$ in the BSN to satisfy different reliability requirements, while it is impossible in the hypercube once its size is fixed.

IV. FAULT DIAMETER

A fault diameter of a graph is defined as the diameter of the new graph generated after the faulty nodes & links are removed from the old graph. The $k$-fault diameter of a graph is defined to be the maximum of distances over all possible graphs that can occur with at most $k$ faults [20]. In a regular graph of degree $d$, the $d$-fault diameter is infinite since, if all neighbors of any node fail, the graph becomes disconnected. Hence, the $(d-1)$-fault diameter is of particular interest.

Now, calculate the fault diameter for BSN. Given a source address and a destination address, in a reliable BSN shift the source address $b$ positions right using the shifting link, then change these $b$ positions so that they match the least significant $b$ bits in the destination address. At most such shifts and at most bit updates are needed. Hence the total number of steps involved is at most. In a faulty with faults, shifts are needed instead of as shown in Fig. 3. Similarly, at most bit updates are needed because traversing in a block in requires at most steps; a block in a BSN(1, $b$) is a hypercube and its -fault diameter is [20]. Therefore, theorem 3 follows.

Theorem 3: The -fault diameter of a BSN(1, $b$) is.

Since the connectivity of the BSN($b$, $b$) is $2^b - 1$, its -fault diameter is important. For a BSN($b$, $b$), a block is a complete network with $2^b$ nodes, and its -fault diameter is 2. Using a similar analysis, bit updates and shifts are needed. Therefore, theorem 4 follows.

Theorem 4: The -fault diameter of a BSN($b$, $b$) is.

V. NETWORK DIAGNOSABILITY

Diagnosability is an important characteristic for fault-tolerance in a multiprocessor system. A system of $N$ processors is $t$-diagnosable (1-step) if all faulty processors can be identified without replacement, provided that the number of faults present does not exceed $t$. The results in Section III for connectivity can also be used to determine the 1-step diagnosability of the BSN.

[25] proved that the following 2 conditions are necessary for a system of $N$ processors to be $t$-diagnosable (1-step):
1) $N \ge 2t + 1$,
2) all processors are tested by at least $t$ other processors.

[12] proved that the following 2 conditions are sufficient to ensure that a system of $N$ processors is $t$-fault diagnosable:
1) $N \ge 2t + 1$,
2) the node-connectivity of the network is greater than or equal to $t$.

In BSN, each processor is presumed to be tested by each of its neighbors. Section III showed that the node-connectivity of a BSN(1, $b$) is $b$. On the other hand, since for all, then, and since for, then, theorem 5 is proved.

Theorem 5: For any, a BSN(1, $b$) of size is -diagnosable (1-step).

Section III also showed that the node-connectivity of a BSN($b$, $b$) is $2^b - 1$. In addition, when, then. Thus, theorem 6 is true, based on the results in [12].

Theorem 6: For any, a BSN($b$, $b$) of size is -diagnosable (1-step).

Once more, BSN is more flexible than the hypercube in terms of diagnosability, because diagnosability relates directly to the network degree. The diagnosability of BSN is higher than that of the hypercube when the BSN has a higher network degree than the hypercube.

VI. 2-TERMINAL RELIABILITY

2-Terminal reliability (also referred to as path reliability) is an important criterion to evaluate the reliability of an interconnection network.
Let be a pair of source and destination nodes in a network. The 2-terminal reliability between and is defined as the probability of finding a path entirely composed of operational links between and. This section determines this factor for both the hypercube and the BSN. Unfortunately, for these 2 networks, the number of paths between 2 nodes can be quite large, e.g., there are source-destination paths for a hypercube of size. Thus, the numbers of links and paths for the hypercube grow exponentially with the number of nodes in it [27]. Moreover, many paths can have one or more links in common, making the reliability analysis intractable. Therefore, this section derives a lower bound on 2-terminal reliability by considering a subset of all available paths between two nodes in

the two networks. A lower bound on 2-terminal reliability offers an important insight into the value of 2-terminal reliability of a network. As long as the lower bound is quite tight, it can be used to estimate the 2-terminal reliability of a network. A lower bound on 2-terminal reliability is simply called reliability in this section, because the lower bound used in the calculation is quite tight [20], [27]. The analysis assumes that link failures are s-independent and occur randomly in time.

TABLE I. RELIABILITY COMPARISONS OF BSN AND HYPERCUBE

Consider any two arbitrary nodes in the BSN. Let each address have sections in the. As in the proof of lemma 3, network $G'$ has paths of length. Based on theorem 1, the number of disjoint paths is in a. The length of these paths can be calculated as follows. Since each block in a BSN($b$, $b$) is a complete graph with $2^b$ nodes, a message has to traverse links from a source to a destination using these disjoint paths, where links are links in the network $G'$ as shown in Fig. 3, and links are internal links within the blocks of the BSN. With link operation probability $p$, a path of length $l$ consists of $l$ links and has reliability $p^l$. Note that the probability of failure of all $m$ disjoint paths of length $l$ is $(1 - p^l)^m$. Thus, the probability of having at least 1 of these paths operational is $1 - (1 - p^l)^m$. Hence, the 2-terminal reliability of a BSN($b$, $b$) can be derived:

In the hypercube, only of the shortest paths are disjoint [26]. Using a similar argument, 2-terminal reliability of a hypercube of size is:

Now, calculate the reliability of BSN(1, $b$). Each block in the BSN(1, $b$) is a hypercube with $2^b$ nodes. Thus, in the worst case, a node needs to traverse $b$ links within a block to reach another node in the same block. As shown in theorem 1, the BSN(1, $b$) contains $b$ disjoint paths. Since the graph $G'$ contains disjoint paths of length as shown in Fig. 3, the length of the disjoint paths in BSN(1, $b$) is at most, where links are external in the network as shown in Fig. 3, and links are internal within the blocks of BSN(1, $b$).
Hence, the 2-terminal reliability of a is:

The number of links in a hypercube of $N$ nodes is $(N \log_2 N)/2$. The number of links in depends on both and and is [21]. Table I shows the values for the 2-terminal reliability of the hypercube and BSN of moderate size for various. The link operational rate ranges from 0.90 to 0.98. The reliability of the depends on. Table I shows that when, has a lower reliability than the hypercube because it uses far fewer links than the hypercube. The reliability of is almost the same as that of the hypercube; these two networks have almost the same link complexity. The reliability of is much better than that of the hypercube for all link operational rates. This improvement comes at a cost, since uses more links than the hypercube. has the lowest reliability because it uses the fewest links.

REFERENCES

[1] S. Abraham and E. S. Davidson, "A communication model for optimizing hierarchical multiprocessor systems," in Proc. Int'l Conf. Parallel Processing, Aug 1986, pp. 467–474.
[2] S. Abraham and K. Padmanabhan, "An analysis of the twisted cube topology," in Proc. Int'l Conf. Parallel Processing, vol. 1, Aug 1989, pp. 116–120.
[3] J. R. Armstrong and F. G. Gray, "Fault diagnosis in a Boolean n-cube array of microprocessors," IEEE Trans. Computers, vol. C-30, pp. 587–590, Aug 1981.
[4] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications: North Holland, 1979.
[5] C. Chen, D. P. Agrawal, and J. R. Burke, "A class of hierarchical networks for VLSI/WSI based multicomputers," in Proc. VLSI Design, Jan 1991, pp. 267–272.
[6] H. L. Chen and N. F. Tzeng, "Enhanced incomplete hypercubes," in Int'l Conf. Parallel Processing, vol. 1, Aug 1989, pp. 270–277.
[7] C. R. Das, W. Lin, and T. Feng, "Reliability evaluation of hypercube multicomputers," IEEE Trans. Reliability, vol. 38, pp. 121–129, Apr 1989.
[8] K. Efe, "A variation of the hypercube with lower diameter," IEEE Trans. Computers, vol. 40, pp. 1312–1316, Nov 1991.
[9] C. S. Ferner and K. Y. Lee, "Hyperbanyan networks: A new class of networks for distributed-memory multiprocessors," in Fourth Symp. Frontiers of Massively Parallel Computation, Oct 1992, pp. 254–261.
[10] E. Ganesan and D. K. Pradhan, "The hyper-deBruijn networks: Scalable versatile architecture," IEEE Trans. Parallel and Distributed Systems, vol. 4, pp. 962–978, Sep 1993.
[11] K. Ghose and K. R. Desai, "The HCN: A versatile interconnection network based on cubes," Supercomputing, pp. 426–435, Nov 1989.
[12] S. L. Hakimi and A. T. Amin, "Characterization of the connection assignment of diagnosable systems," IEEE Trans. Computers, pp. 86–88, Jan 1974.
[13] M. Hamdi, "A class of recursive interconnection networks: Architectural characteristics and hardware cost," IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 41, pp. 805–816, Dec 1994.
[14] W. T. Hsu, P. C. Yew, and C. Q. Zhu, "An enhancement scheme for hypercube interconnection networks," in Int'l Conf. Parallel Processing, Aug 1987, pp. 820–823.
[15] K. Hwang and J. Ghosh, "Hypernet: A communication-efficient architecture for constructing massively parallel computers," IEEE Trans. Computers, vol. C-36, pp. 1450–1467, Dec 1987.
[16] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability: McGraw Hill, 1993.
[17] H. P. Katseff, "Incomplete hypercube," IEEE Trans. Computers, vol. 37, pp. 604–608, May 1988.
[18] K. H. Kwon, S. Latifi, and S. Q. Zheng, "Average data communication performance of twisted hypercubes," in Proc. Int'l Conf. Finite Fields, Coding Theory, and Advances in Communication and Computing, Dec 1991, pp. 329–340.
[19] S. Lakshmivarahan and S. K. Dhall, "A new hierarchy of hypercube interconnection schemes for parallel computers," J. Supercomputing, pp. 81–107, Sep 1988.
[20] S. Latifi and A. El-Amawy, "Properties and performance of folded hypercube," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 1, pp. 31–42, Jan 1991.
[21] Y. Pan, "The block shift network: Interconnection strategies for large parallel systems," Ph.D., Dep't Computer Science, University of Pittsburgh, 1991.
[22] Y. Pan and H. Y. H. Chuang, "The block shift network: A new interconnection network for efficient parallel computation," in Proc. Int'l Conf. Parallel Processing, vol. 1, Aug 1991, pp. 688–689.
[23] ---, "Computations for some matrix and graph problems on the block shift network," in Proc. High Performance Computing, Apr 1994, pp. 314–319.
[24] ---, "Properties and performance of the block shift network," IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 44, pp. 93–102, Feb 1997.
[25] F. Preparata, G. Metze, and R. T. Chien, "On the connection assignment problem of diagnosable systems," IEEE Trans. Computers, vol. C-16, pp. 848–854, Dec 1967.
[26] Y. Saad and M. H. Schultz, "Topological properties of hypercubes," IEEE Trans. Computers, vol. 37, pp. 867–872, Jul 1988.
[27] S. T. Soh, S. Rai, and J. L. Trahan, "Improved lower bounds on the reliability of hypercube architectures," in Proc. Fifth Int'l Conf. Parallel and Distributed Computing and Systems, Oct 1992, pp. 182–187.
[28] R. S. Wilkov, "Analysis and design of reliable computer networks," IEEE Trans. Communications, vol. 20, pp. 660–678, Jun 1972.
[29] A. Youssef and B. Narahari, "Banyan-hypercube networks," IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 160–169, Apr 1990.

Yi Pan received his B.Eng. (1982) in Computer Engineering from Tsinghua University, PR China, and his Ph.D. (1991) in Computer Science from the University of Pittsburgh. He is an Associate Professor in the Dep't of Computer Science at Georgia State University. He was a faculty member in the Dep't of Computer Science at the University of Dayton. He has published more than 100 research papers, co-edited 6 books (including conference proceedings), and contributed several book chapters.
He has received many awards, including the Outstanding Scholarship Award of the College of Arts and Sciences at the University of Dayton (1999), the Japanese Society for the Promotion of Science Fellowship (1998), the AFOSR Summer Faculty Fellowship (1997), the NSF Research Opportunity Award (1994, 1996), the Andrew Mellon Fellowship from the Mellon Foundation (1990), the best paper award from PDPTA 1996 (1996), and the Summer Research Fellowship from the Research Council of the University of Dayton (1993). Dr. Pan is on the editorial boards of four international journals. He has served as conference chair, program chair, vice program chair, publicity chair, session chair, steering committee member, and program committee member for numerous international conferences.