Design, Development, and Simulation/Experimental Validation of a Crossbar Interconnection Network for a Single-Chip Shared Memory Multiprocessor Architecture

Master's Project Report
June, 2002

Venugopal Duvvuri
Department of Electrical and Computer Engineering
University of Kentucky

Under the Guidance of
Dr. J. Robert Heath
Associate Professor
Department of Electrical and Computer Engineering
University of Kentucky

Table of Contents

Topic                                                                            Page Number
ABSTRACT                                                                         3
Chapter 1: Introduction, Background, and Positioning of Research                 4
Chapter 2: Types of Interconnect Systems                                         8
Chapter 3: Multistage Interconnection Systems Complexity                         16
Chapter 4: Design of the Crossbar Interconnect Network                           28
Chapter 5: VHDL Design Capture, Simulation, Synthesis and Implementation Flow    35
Chapter 6: Design Validation via Post-Implementation Simulation Testing          39
Chapter 7: Experimental Prototype Development, Testing, and Validation Results   61
Chapter 8: Conclusions                                                           65
References                                                                       66
Appendix A: Interconnect Network and Memory VHDL Code (Version 1)                67
Appendix B: Interconnect Network VHDL Code (Version 2)                           76

ABSTRACT

This project involves modeling, design, Hardware Description Language (HDL) design capture, synthesis, HDL simulation testing, and experimental validation of an interconnect network for a Hybrid Data/Command Driven Computer Architecture (HDCA) system, which is a single-chip shared memory multiprocessor architecture system. Various interconnect topologies that may meet the requirements of the HDCA system are studied and evaluated for use within the HDCA system. It is determined that the crossbar topology best meets the HDCA system requirements, and it is therefore used as the interconnect network of the HDCA system. Design capture, synthesis, simulation, and implementation are done in VHDL using Xilinx Foundation CAD software. A small, reduced-scale prototype design is implemented in a PROM-based Spartan XL Field Programmable Gate Array (FPGA) chip, which is successfully tested experimentally to validate the design and functionality of the described crossbar interconnect network.

Chapter 1: Introduction, Background, and Positioning of Research

This project is, first, a study of the different kinds of interconnect networks that may meet the requirements of a Hybrid Data/Command Driven Architecture (HDCA) multiprocessor system [2,5,6] shown in Figure 1.1. The project then involves VHSIC Hardware Description Language (VHDL) [10] description, synthesis, simulation testing, and experimental prototype testing of an interconnect network which acts as a Computing Element (CE) to data memory circuit switch for the HDCA system. The HDCA system is a multiprocessor shared memory architecture. The shared memory is organized as a number of individual memory blocks, as shown in Figures 1.1, 3.4a, and 4.1, and is explained in detail in later chapters. This kind of memory organization is required by this architecture. If two or more processors want to communicate with memory locations within the same memory block, lower-priority processors have to wait until the highest-priority processor completes its transaction. Only the highest-priority processor receives a request grant; the requests from the other, lower-priority processors are queued and are processed only after the completion of the highest-priority transaction. The interconnect network to be designed should be able to connect requesting processors on one side of the interconnect network to the memory blocks on the other side. The efficiency of the interconnect network increases as the possible number of parallel connections between the processors and the memory blocks increases. Interconnection networks play a central role in determining the overall performance of a multiprocessor system. If the network cannot provide adequate performance for a particular application, nodes (or CE processors in this case) will frequently be forced to wait for data to arrive. In this project, different types of interconnect networks which may be applicable to a HDCA system are addressed, and the advantages and disadvantages of these interconnects are discussed. Different types of interconnects, their routing mechanisms, and the complexity factor in designing the interconnects are also described in detail. This project includes design, VHDL description, and synthesis of an interconnect network based on the crossbar topology, the topology which best meets HDCA system requirements.

Figure 1.1: Single-Chip Reconfigurable HDCA System (Large File Memory May Be Off-Chip); Q => Multifunctional Queue

The crossbar topology is a very popular interconnect network in industry today. Interconnects are applied to different kinds of systems, each having its own requirements. In some systems, such as distributed memory systems, there must be a way for the processors to communicate with each other. A crossbar topology (single-sided topology) [1] can be designed to meet the requirement of inter-processor communication and is unique to distributed memory systems, because in distributed memory systems processors do not share a common memory. All the processors in the system are directly connected to their own memory and caches. No processor can directly access another processor's memory. All communication between the processors is made possible through the interconnection network. Hence, there is a need for inter-processor communication in distributed memory architectures. The crossbar topology suitable for these architectures is the single-sided crossbar network: all the processors are connected to an interconnection network, and communication between any two processors is possible. The HDCA system does not need an interconnect that supports inter-processor communication, as it is a shared memory architecture. For shared memory architectures, a double-sided crossbar network can be used as the interconnect network. This design needs some kind of priority logic, which prioritizes conflicting requests for memory accesses by the processors. It also requires a memory organization which is shared by all processors. The HDCA system requires the memory to be divided into memory blocks, each block containing memory locations with different address ranges. The actual interconnect design is a combination of a crossbar interconnect (double-sided topology) [1], priority logic, and a shared memory organization. Another interconnect architecture has been implemented as the interconnect for the CE to data memory circuit switch in an initial prototype of the HDCA system [2]. The initial HDCA prototype assumes no processor conflicts in accessing a particular memory block; such conflicts can be handled in the design presented here by the priority logic block. The input queue depth of the individual CE processors of a HDCA system is used by the priority logic block of the proposed interconnect network in granting requests to the processor having the deepest queue depth. The presented design is specific to the CE to data memory circuit switch for a HDCA system. The detailed crossbar interconnect network design is described in Chapter 4.

VHDL design capture, synthesis, and implementation procedures are discussed in Chapter 5. Chapter 6 includes the VHDL simulation testing setup and results. A test case is described in Chapter 6 which was tested during the pre-synthesis, post-synthesis, and post-implementation HDL simulation processes. In Chapter 7 an experimental prototype of the crossbar interconnect network is developed and tested to validate the presented interconnect architecture, design, and functionality.

Chapter 2: Types of Interconnect Systems

Interconnect networks can be classified as static or dynamic [11]. In the case of a static interconnection network, all connections are fixed, i.e., the processors are wired directly, whereas in the dynamic case there are routing switches in between. The decision whether to use a static or dynamic interconnection network depends on the type of problem to be solved by the computer system utilizing the interconnect system. Generally, static topologies are suitable for problems whose communication patterns can be predicted a priori reasonably well, whereas dynamic topologies (switching networks), though more expensive, are suitable for a wider class of problems. Static networks are mainly used in message-passing systems for inter-processor communication.

Types of Static Networks:

1. Star connected network:

Figure 2.1: Star Connected Network

In a star topology there is one central node computer, to which all other node computers are connected; each node has one connection, except the center node, which has N-1 connections. Routing in stars is trivial. If one of the communicating nodes is the center node, then the path is just the edge connecting them. If not, the message is routed from the source node to the center node, and from there to the destination node. Star networks are not suitable for large systems, since the center node becomes a bottleneck with an increasing number of processors. A typical star connected network is shown in Figure 2.1.

2. Meshes:

Figures 2.2 and 2.3 show a typical 1-Dimensional (1-D) mesh and a 2-D mesh, respectively. The simplest and cheapest way to connect the nodes of a parallel computer is to use a one-dimensional mesh. Each node has two connections, and boundary nodes have one. If the boundary nodes are connected to each other, we have a ring, and all nodes have two connections. The one-dimensional mesh can be generalized to a k-dimensional mesh, where each node (except boundary nodes) has 2k connections. In meshes, the dimension-order routing technique is used [12]. That is, routing is performed in one dimension at a time. In a three-dimensional mesh, for example, a message's path from node (a,b,c) to node (x,y,z) would move along the first dimension to node (x,b,c), then along the second dimension to node (x,y,c), and finally along the third dimension to the destination node (x,y,z). This type of topology is not suitable for building large-scale computers, since there is a wide range of latencies (the latency between neighboring processors is much lower than between non-neighbors), and the maximum latency grows with the number of processors.

Figure 2.2: 1-D Mesh
Figure 2.3: 2-D Mesh

3. Hypercubes:

The hypercube topology is one of the most popular and is used in many large-scale systems. A k-dimensional hypercube has 2^k nodes, each with k connections. Figure 2.4 shows a 4-D hypercube.

Figure 2.4: 4-D Hypercube

Hypercubes scale very well; the maximum latency in a k-dimensional (or "k-ary") hypercube is log_2(N), with N = 2^k. An important property of hypercube interconnects is the relationship between node number and which nodes are connected together. The rule is that any two nodes in the hypercube whose binary representations differ in exactly one bit are connected together. For example, in a four-dimensional hypercube, node 0 (0000) is connected to node 1 (0001), node 2 (0010), node 4 (0100), and node 8 (1000). This numbering scheme is called the Gray code scheme. A hypercube connected in this fashion is shown in Figure 2.5. A k-dimensional hypercube is nothing more than a k-dimensional mesh with only two nodes in each dimension, and thus the routing algorithm is the same as for meshes, apart from one difference. The path from node A to node B is calculated by simply computing the exclusive-OR, X = A XOR B, of the binary representations of nodes A and B. If the i-th bit in X is '1', the message is moved to the neighboring node in the i-th dimension. If the i-th bit is '0', the message is not moved. This means that it takes at most log_2(N) steps for a message to reach its destination (where N is the number of nodes in the hypercube).
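To make the XOR-based dimension-order routing concrete, the following is a minimal, hypothetical VHDL sketch of the next-hop computation for a single 4-D hypercube node. The entity name, ports, and fixed 4-bit node labels are illustrative assumptions only; no such module appears in this project, whose interconnect is a crossbar.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity hypercube_route is
  port (
    a        : in  std_logic_vector(3 downto 0);  -- current node label
    b        : in  std_logic_vector(3 downto 0);  -- destination node label
    next_hop : out std_logic_vector(3 downto 0);  -- neighbor to forward the message to
    done     : out std_logic                      -- '1' when a = b (message has arrived)
  );
end entity hypercube_route;

architecture behav of hypercube_route is
begin
  process (a, b)
    variable x : std_logic_vector(3 downto 0);
    variable n : std_logic_vector(3 downto 0);
  begin
    x := a xor b;            -- bits in which current and destination labels differ
    n := a;
    done <= '1';
    for i in 0 to 3 loop     -- dimension-order routing: lowest differing dimension first
      if x(i) = '1' then
        n(i) := not n(i);    -- move along dimension i to the neighboring node
        done <= '0';
        exit;
      end if;
    end loop;
    next_hop <= n;
  end process;
end architecture behav;
```

For example, routing from node 2 (0010) to node 8 (1000) gives X = 1010, so the message first moves along dimension 1 to node 0 (0000) and then along dimension 3 to node 8.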

Figure 2.5: Gray Code Scheme in Hypercube

Types of Dynamic Networks:

1. Bus-based networks: These are the simplest networks and are an efficient solution when cost is a concern and only a moderate number of processors is involved. Their main drawbacks are that the bus becomes a bottleneck to the memory when the number of processors grows large, and that a single point of failure can hang the system. To overcome these problems to some extent, several parallel buses can be incorporated. Figure 2.6 shows a bus-based network with a single bus.

Figure 2.6: Bus-based Network

2. Crossbar switching networks: Figure 2.7 shows a double-sided crossbar network having n processors (P_i) and m memory blocks (M_i). All the processors in a crossbar network have dedicated buses directly connected to all memory blocks. This is a non-blocking network, as a connection of a processor to a memory block does not block the connection of any other processor to any other memory block. In spite of their high speed, their use is normally limited to systems containing 32 or fewer processors, due to non-linear (n x m) complexity and cost. They are applied mostly in multiprocessor vector computers and in multiprocessor systems with multilevel interconnections.

Figure 2.7: Crossbar Network

3. Multistage interconnection networks: Multistage networks are designed with several small-dimension crossbar networks. Input/output connection establishment is done in two or more stages. Figures 2.8 and 2.9, shown below, are Benes and Clos multistage networks, respectively.

These are non-blocking networks and are suitable for very large systems, as their complexity is much less than that of crossbar networks. The main disadvantage of these networks is latency, which increases with the size of the network.

Figure 2.8: Benes Network (B(N,d))

Figure 2.9: Clos Network

Some of the multistage networks are compared with crossbar networks, primarily from the standpoint of complexity, in the next chapter. Table 2.1 shows some general properties of bus, crossbar, and multistage interconnection networks.

Property      Bus    Crossbar   Multistage
Speed         Low    High       High
Cost          Low    High       Moderate
Reliability   Low    High       High
Complexity    Low    High       Moderate

Table 2.1: Properties of Various Interconnection Network Topologies

Chapter 3: Multistage Interconnection Network Complexity

For the HDCA system, the desired interconnect should be able to establish non-blocking, high-speed connections from the requesting Computing Elements (CEs) to the memory blocks. The interconnect should be able to sense conflicts, such as two or more processors requesting connection to the same memory block, and give only the highest-priority processor the ability to connect. The multistage Benes network and Clos network, their complexity comparison with a crossbar network, and the advantages and disadvantages of these candidate networks are discussed in this chapter.

Crossbar Topology: A crossbar network is a non-blocking, very reliable, very high-speed network. Figure 2.7 is a typical single-stage crossbar network with N inputs/processors and M outputs/memory blocks. It is denoted by X(N,M). The complexity (crosspoint count) of a crossbar network is given by N x M. Complexity increases with an increase in the number of inputs or the number of outputs. This is the main disadvantage of the crossbar network; hence there is little scope for scalability of crossbar networks. The crossbar topology implemented for shared memory architectures is referred to as a double-sided crossbar network.

Benes Network: A Benes network is a multistage, non-blocking network. For any value of N, d should be chosen so that log_d(N) is an integer. The number of stages for an N x N Benes network is given by (2 log_d(N) - 1), and it has (N/d) crossbar switches in each stage. Hence B(N,d) is implemented with (N/d) x (2 log_d(N) - 1) crossbar switches. The general architecture of a Benes network (B(N,d)) is shown in Figure 2.8. In the figure:

N: Number of inputs or outputs,
d: Dimension of each crossbar switch (X(d,d)),
I: First-stage switch = X(d,d),
II: Middle-stage switch = B(N/d,d),
III: Last-stage switch = X(d,d).

The complexity (crosspoint count) of the network is given by (N/d) x (2 log_d(N) - 1) x d^2. Network latency is a factor of (2 log_d(N) - 1), because there are (2 log_d(N) - 1) stages between the input stage and the output stage. There are different possible routings from any input to any output. It is a limited-scalability architecture: for a B(N,d) implementation, N has to be a power of d. For all other configurations, a higher-order Benes network can be used, but at the cost of some hardware wastage. The main disadvantages of this network are the network latency and limited scalability. For very large networks, a Benes network implementation is very cost effective.

Clos Network: Figure 2.9 shows a typical N x M Clos network, represented by C(N,M). Blocks I and III are always crossbar switches, and II is a crossbar switch for a 3-stage Clos network. In implementations of higher-order Clos networks, II is a lower-order Clos network; for example, for a 5-stage Clos implementation, II is a three-stage Clos network. In Figure 2.9:

N: Number of processors,
M: Number of memory blocks,
K: Number of second-stage switches,
C1: Number of first-stage switches,
C2: Number of third-stage switches.

For a three-stage Clos network, I = X(N/C1, K), II = X(C1, C2), III = X(K, M/C2), and the condition for a non-blocking Clos implementation is K = N/C1 + M/C2 - 1. A three-stage Clos implementation for N = 16, M = 32, C1 = 4, C2 = 8 has K = 16/4 + 32/8 - 1 = 7. Each first-stage switch becomes a 4 x 7 crossbar switch, each second-stage switch becomes a 4 x 8 crossbar switch, and each third-stage switch becomes a crossbar switch of size 7 x 4 (I = X(4,7), II = X(4,8), III = X(7,4)).

The complexity of a Clos network is given by C_clos = K(N + M) + K(C1 x C2). Using the non-blocking condition, K = N/C1 + M/C2 - 1. For N = M and C1 = C2, K = 2N/C1 - 1, and hence C_clos = (2N/C1 - 1)(2N + C1^2). For an optimum crosspoint count for non-blocking Clos networks, N/C1 = (N/2)^(1/2), which gives C1^2 = 2N and hence C_clos = ((2N)^(1/2) - 1) x 4N (approximately). The main advantage of a Clos network implementation is its scalability: a Clos network can be implemented for any non-prime value of N. The disadvantages of this implementation are the network latency and its inefficiency for small systems. The network latency is a factor of the number of intermediate stages between the input stage and the output stage. From the complexity comparison shown in Table 3.1 and the charts shown in Figures 3.2 and 3.3, it can be seen that the crossbar topology for small systems and the Benes network for large systems match the requirements of the interconnect network for a HDCA system. For simplicity of comparison, the number of processors on the input side and the number of memory blocks on the output side are both assumed to be N. The comparison also holds for rectangular-size implementations of these topologies, which are not possible in the Benes network. The complexity comparison table for the three topologies studied so far is given in Table 3.1. In the table, column I is the complexity and column II is the corresponding network implementation for each value of N for the respective topologies. Chart 1, shown in Figure 3.2, is a graph of the complexity of the three topologies versus N, the number of processors or memory blocks, for lower values of N (N <= 16). Chart 2, shown in Figure 3.3, is the same graph for higher values of N (N >= 16).
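Collecting the Clos relations above, the worked evaluation below restates the N = 16, M = 32, C1 = 4, C2 = 8 example in one place; the value of K and the switch sizes are those given in the text, while the total crosspoint count is computed here from the stated complexity formula rather than quoted from the report.

```latex
\begin{align*}
K &= \frac{N}{C1} + \frac{M}{C2} - 1 = \frac{16}{4} + \frac{32}{8} - 1 = 7,\\
\text{stage I: } X\!\left(\frac{N}{C1}, K\right) = X(4,7), \quad
\text{stage II: } & X(C1, C2) = X(4,8), \quad
\text{stage III: } X\!\left(K, \frac{M}{C2}\right) = X(7,4),\\
C_{clos} &= K(N+M) + K(C1 \cdot C2) = 7(16+32) + 7(4 \cdot 8) = 336 + 224 = 560 \text{ crosspoints}.
\end{align*}
```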

Table 3.1: Complexity Comparison Table

  N     Crossbar             Benes                Clos
        I       II           I       II           I       II
  2     4       X(2,2)       4       B(2,2)       4       C(2,2)
  3     9       X(3,3)       9       B(3,3)       9       C(3,3)
  4     16      X(4,4)       24      B(4,2)       36      C(4,2)
  5     25      X(5,5)       25      B(5,5)       25      C(5,5)
  6     36      X(6,6)       80      B(8,2)       63      C(6,3)
  7     49      X(7,7)       80      B(8,2)       96      C(8,4)
  8     64      X(8,8)       80      B(8,2)       96      C(8,4)
  9     81      X(9,9)       81      B(9,3)       135     C(9,3)
  10    100     X(10,10)     224     B(16,2)      135     C(10,5)
  11    121     X(11,11)     224     B(16,2)      180     C(12,6)
  12    144     X(12,12)     224     B(16,2)      180     C(12,6)
  13    169     X(13,13)     224     B(16,2)      189     C(14,7)
  14    196     X(14,14)     224     B(16,2)      189     C(14,7)
  15    225     X(15,15)     224     B(16,2)      275     C(15,5)
  16    256     X(16,16)     224     B(16,2)      278     C(16,8)
  32    1024    X(32,32)     576     B(32,2)      896     C(32,8)
  64    4096    X(64,64)     1408    B(64,2)      2668    C(64,16)
  81    6561    X(81,81)     1701    B(81,3)      4131    C(81,9)
  128   16384   X(128,128)   3328    B(128,2)     7680    C(128,16)

Figure 3.2: Complexity Chart for N <= 16 (complexity versus N for the crossbar, Benes, and Clos topologies)

Figure 3.3: Complexity Chart for N >= 16 (complexity versus N for the crossbar, Benes, and Clos topologies)

The following equations are used to calculate the complexity of all three topologies for the different configurations given in Table 3.1:

C_clos = (2N/C1 - 1)(2N + C1^2), with N/C1 = (N/2)^(1/2) taken to the closest integer value,
C_benes = (N/d) x (2 log_d(N) - 1) x d^2,
C_crossbar = N^2.

From Table 3.1 and the charts in Figures 3.2 and 3.3, the crossbar topology has the lowest complexity for values of N <= 16. Hence the crossbar network is the best interconnect implementation for systems that have no more than 16 processors/memory blocks: the hardware required for the implementation is less than for all other possible implementations, it is faster than any other network, and it is a non-blocking network, as every input has connection capability to every output in the system. For systems with configurations larger than 16 x 16 but smaller than 64 x 64, the designer has to trade off between speed and complexity, because for multistage networks such as the Benes network the complexity is less than that of the crossbar network but at the cost of speed; the speed of multistage networks is much lower than that of the crossbar network. For systems having more than 64 x 64 configurations, the Benes network proves to be the best implementation. The HDCA system normally requires an interconnect with a complexity less than 256. A crossbar-implemented interconnect best suits the system: it has minimum complexity for the sizes of interconnect needed by the HDCA system, it is non-blocking since no processor has to share any bus with any other processor, and it is a very high-speed implementation since it has only one intermediate stage between processors and memory blocks.
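As a spot check of a few Table 3.1 entries against the crossbar and Benes equations above (the arithmetic below is worked out here and is not quoted from the report):

```latex
\begin{align*}
C_{crossbar}(N=16) &= N^2 = 16^2 = 256,\\
C_{benes}\big(B(16,2)\big) &= \frac{N}{d}\,(2\log_d N - 1)\,d^2 = \frac{16}{2}\,(2 \cdot 4 - 1)\,2^2 = 8 \cdot 7 \cdot 4 = 224,\\
C_{benes}\big(B(64,2)\big) &= \frac{64}{2}\,(2 \cdot 6 - 1)\,2^2 = 32 \cdot 11 \cdot 4 = 1408,\\
C_{benes}\big(B(81,3)\big) &= \frac{81}{3}\,(2 \cdot 4 - 1)\,3^2 = 27 \cdot 7 \cdot 9 = 1701.
\end{align*}
```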

Multiprocessor Shared Memory Organization:

Figure 3.4a: Multi-processor, Interconnect, and Shared Memory Organization

Figure 3.4a shows the organization of a multiprocessor shared memory system. Figure 3.4b shows the organization of the shared memory used in the HDCA system. By making 2^c = M, the shared memory architecture in Figure 3.4b can be used as the shared memory for the HDCA system. Figure 3.4c shows the organization of each memory block. In each memory block there are 2^b addressable locations, each a bits wide.
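A short worked summary of the addressing implied by this organization; the symbolic relations follow directly from the definitions above, and the specific values a = 4, b = 2, c = 2 are those of the reduced-scale prototype described in Chapter 6.

```latex
\begin{align*}
\text{total locations} &= M \cdot 2^{b} = 2^{c} \cdot 2^{b} = 2^{b+c} \quad \text{(each } a \text{ bits wide)},\\
\text{processor address width} &= b + c \text{ bits: the } c \text{ MSBs select the memory block, the } b \text{ LSBs the location},\\
\text{prototype } (a,b,c) = (4,2,2): &\;\; 4 \text{ blocks} \times 4 \text{ locations} = 16 \text{ locations of 4 bits, addressed by a 4-bit } \mathrm{ADDR[i]}.
\end{align*}
```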

Figure 3.4b: Shared Memory Organization (memory blocks MB[0] through MB[2^c - 1])

Figure 3.4c: Organization of Each Memory Block

Related Research in Crossbar Networks: The crossbar switches (networks) of today see wide use in a variety of applications, including network switching, parallel computing, and various telecommunications applications. By using Field Programmable Gate Arrays (FPGAs) or Complex Programmable Logic Devices (CPLDs) to implement the crossbar switches, design engineers have the flexibility to customize the switch to suit their specific design goals, as well as to obtain switch configurations not available with off-the-shelf parts. In addition, the use of in-system programmable devices allows the switch to be reconfigured if design modifications become necessary.

There are two types of implementations possible based on the crossbar topology: one is the single-sided crossbar implementation, shown in Figure 3.5, and the other is the double-sided crossbar implementation, shown earlier in Figure 2.7.

Figure 3.5: Single-sided Crossbar Network

The single-sided crossbar network is usually implemented and utilized in distributed memory systems, where all the nodes (or processors) connected to the interconnect need to communicate with each other. In the double-sided crossbar networks, which are usually utilized as the interconnect between processors and memory blocks in a multiprocessor shared memory architecture as shown in Figures 3.4a, 3.4b, and 3.4c, processors need to communicate with memory blocks, but processors do not communicate with processors, nor memory blocks with memory blocks. A crossbar network is implemented with serial buses or parallel buses. In a serial-bus crossbar network implementation, addresses and data are sent by the processor through a single-bit bus in a serial fashion and are fetched by the memory blocks at the rate of 1 bit on every active edge of the system clock. Some conventional crossbar switches use this protocol for the crossbar interconnect network. The other implementation is the parallel-bus crossbar network implementation, which is much faster than the serial-bus implementation. All the memory blocks fetch addresses on one clock and data on the following clocks.

This implementation consumes more hardware than the serial-bus crossbar network implementation, but it is a much faster network implementation and is hence used in some high-performance multiprocessor systems. The main issue in implementing a crossbar network is arbitration of processor requests for memory accesses. Processor request arbitration comes into the picture when two or more processors request memory access within the same memory block. There are different protocols that may be followed in designing the interconnect. One of them is a round-robin protocol: in the case of conflict among processor requests for memory accesses within the same memory block, requests are granted to the processors in a round-robin fashion. A fixed-priority protocol assigns fixed priorities to the processors; in case of conflict, the processor having the highest priority ends up having its request granted all the time. For a variable-priority protocol, as will be used in an HDCA system, the priorities assigned to processors dynamically vary over time based on some parameter (metric) of the system. In the HDCA system of Figure 1.1, all processors are dynamically assigned a priority depending upon their input Queue (Q) depth at any time. In the case of conflict, the processor having the highest (deepest) queue depth at that point in time gets the request grant. A design engineer has to choose among the above-mentioned protocols and the various kinds of implementations, depending on the system requirements, in designing a crossbar network. The HDCA system will need some kind of arbitration in case of processor conflicts. The interconnect design presented in this project is closely related to the design of a crossbar interconnect for distributed memory systems presented in [3]. Both designs address the possibility of conflicts between processor requests for memory accesses within the same memory block. Both designs use parallel address and data buses between every processor and every memory block. The design presented in this project differs from the interconnect network of [3] in two ways. First, the crossbar interconnect network presented in this project is suitable for a shared memory architecture: the crossbar topology used in this project is a double-sided topology, whereas a single-sided topology is used in the design of the crossbar interconnect network of [3]. Second, the priority arbitration scheme proposed in the interconnect network of [3] uses a fixed-priority scheme based on the physical distance between the processors, giving the closest processor the highest priority and the farthest processor the lowest priority.

The priority arbitration scheme presented in this design uses the input queue depth of the processors in determining the priorities. The HDCA system requires a double-sided, parallel-bus crossbar network, with variable priority depending upon the input queue depth of the processors. In this project, a double-sided, parallel-bus crossbar network using a variable-priority protocol is designed, implemented, and tested as the interconnect for the HDCA system. The detailed design of the implementation is described in the next chapter.

Chapter 4: Design of the Crossbar Interconnect Network

This chapter presents the detailed design of a crossbar interconnect which meets the requirements of the HDCA system of Figure 1.1. The organization of processors, interconnect, and memory blocks is shown in Figure 3.4a.

Shared Memory Organization: The shared memory organization used in this project is shown in Figures 3.4b and 3.4c. From Figure 3.4b, there are 2^c = M memory blocks, and the organization of each memory block is shown in Figure 3.4c. In each memory block there are 2^b addressable locations of a bits width. Hence the main memory, which includes all the memory blocks, has 2^(b+c) addressable locations of a bits width, the address bus of each processor is (b + c) bits wide, and the data bus of each processor is a bits wide.

Signal Description: The schematic equivalent to the behavioral VHDL description of the interconnect is shown in Figure 4.1. In general, a processor i of Figure 3.4a has CTRL[i], RW[i], ADDR[i], and QD[i] as inputs to the interconnect. CTRL[i] of processor i goes high when it wants a connection to a memory block. RW[i] goes high when it wants to read and goes low when it wants to write. ADDR[i] is the (b + c)-bit address of the memory location, and ADDR_BL[i] is the c-bit address of the memory block with which the processor wants to communicate; the memory block is indicated by the c MSBs of ADDR[i]. QD[i] is the queue depth of the processor. FLAG[i], an input to the processor, goes high to grant the processor's request; FLAG[i] is decided by the priority logic of the interconnect network. PI[i], DM[i][j], and IM[j] of Figure 4.1 are the different types of buses used in the interconnection network. The structure of these buses is shown in Figures 4.2 and 4.3. At any time, processors, represented by i, can request access to memory blocks, represented by MB[j]; that is, CTRL[i] of those processors goes high and their memory block address is ADDR_BL[i] = j.

Figure 4.1: Block Diagram of the Crossbar Interconnect Network (processors P[0] to P[N-1], decoders DEC[0] to DEC[N-1], priority logic blocks PRL[0] to PRL[M-1], and memory blocks MB[0] to MB[M-1])

Figure 4.2: PI[i] and DM[i][j] Bus Structures (the PI[i] bus and DM[i][j] bus have the same set of signal lines: CTRL[i], RW[i], ADDR[i] (b + c bits), DATA[i] (a bits), QDEP[i], and FLAG[i])

Figure 4.3: IM[j] Bus Structure (CTRL[i], RW[i], ADDR[i] (b + c bits), DATA[i] (a bits), and FLAG[i])

Hence, in Figure 4.2, the bus PI[i] of the requesting processors gets connected to the bus DM[i][j] through the decode logic DEC[i] of Figure 4.1, shown again in Figure 4.4. As shown in Figure 4.4, ADDR_BL[i] of the requesting processor is decoded by decoder DEC[i], which connects PI[i] to the DM[i][j] output bus of DEC[i]. Every memory block has a priority logic block, PRL[j], as shown in Figure 4.5. The function of this logic block is to grant a request to the processor having the deepest queue depth among the processors requesting memory access to the same memory block. As shown in Figure 4.5, once processor i gets a grant from the priority logic PRL[j], via the FLAG[i] signal of the DM[i][j] and PI[i] buses shown in Figures 4.1 and 4.2, the DM[i][j] bus is connected to the IM[j] bus by MUX[j] of Figure 4.5. Thus a connection is established via PI[i], DM[i][j], and IM[j] between processor i and memory block j.

This connection remains active as long as the processor holds the deepest queue depth and CTRL[i] of the processor is active. A priority logic block gives a grant only to the highest-priority processor.

Figure 4.4: Decode Logic (DEC[i])

Figure 4.5: Priority Logic (PRL[j]) (multiplexer MUX[j] selects among DM[0][j] through DM[N-1][j] onto IM[j], under control of PROC_SEL[j] generated by PR_LOGIC[j])

Figure 4.6: Priority Control Flow Chart for PR_LOGIC[j] in Figure 4.5

The queue depth of the processors is used in determining the priority. In cases where processors have the same queue depth, the processor having the highest processor identification number gets the highest priority. A processor can access a particular memory block as long as it has the highest priority to that memory block. If some other processor gains the highest priority for that particular memory block, the processor which is currently accessing the memory block has its connection disconnected; it will have to wait until it gets the highest priority for accessing that block again. The flow chart showing the algorithmic operation of the j-th priority logic block of Figure 4.5 is shown in Figure 4.6 above. To fully follow the flow chart of Figure 4.6, we must reconcile the signal names used in Figure 4.2 and Figure 4.6. The c MSBs of ADDR[i] of Figure 4.2 correspond to mbaddr[x] of Figure 4.6, where x has an integer value ranging from 0 to (2^c - 1). QDEP[i] of Figure 4.2 is the same as qd[i] of Figure 4.6. The number of processors in the figure is assumed to be 4, but the algorithm holds true for any number of processors. The PR_LOGIC[j] block of the priority logic of Figure 4.5 compares the current maximum queue depth with the queue depth of every processor, starting from the 0th processor. This comparison is done only for those processors whose CTRL is in the logic '1' state and which are requesting memory access to the memory block where the priority logic operation is performed. The integer value i shown in Figure 4.6 is the identification number of the processor having the deepest queue depth at that time. After completion of processor prioritizing, the processor i which has the deepest queue depth gets its request granted (FLAG[i] goes high) to access that memory block. This logic operation is structurally equivalent to the schematic shown in Figure 4.5, in which PROC_SEL[j] (= i in the flowchart shown in Figure 4.6) acts as the select input to the multiplexer MUX[j]. This condition is achieved in the VHDL code (Appendix A and Appendix B) by giving memory access to a processor (its FLAG[i] is set equal to logic '1') only if its CTRL[i] is in the logic '1' state and it has the deepest queue depth among the processors requesting access within the same memory block. The interconnect gives all the processors the flexibility of simultaneous reads or writes for those processors that are granted requests by the priority logic. In the best case all processors will have their requests granted; this is the case when CTRL of all processors is '1' and no two processors have the same ADDR_BL. In this case the binary value of FLAG is "1111" after the completion of all iterations of the priority logic.

The VHDL description of the crossbar interconnect network implementation has a single function which describes the processor prioritization done for all the memory blocks (i.e., the corresponding priority logic blocks). The described function works for any number of processors or memory blocks. Figures 4.1 through 4.6 show the block-level design of the interconnect network. In the best case, when all processors access different memory blocks, all the processors receive request grants (FLAGs in the logic '1' state) and all get their connections to different memory blocks. In this case the interconnect is used to the fullest of its capacity, with different processors communicating with different memory blocks simultaneously.
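To illustrate the priority selection just described, the following is a hypothetical VHDL sketch of one priority logic block, PRL[j], written directly from the flow chart of Figure 4.6 and the prose above for the prototype sizes (4 processors, c = 2, 4-bit queue depths). It is not the code of Appendix A or B; the packing of the block-address and queue-depth signals into single vectors is an assumption made for the sketch, and the '>=' comparison makes ties go to the higher-numbered processor, as required.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity prl_j is
  generic ( J : integer := 0 );                  -- index of the memory block this arbiter serves
  port (
    ctrl   : in  std_logic_vector(3 downto 0);   -- CTRL[i], request line of each processor
    mbaddr : in  std_logic_vector(7 downto 0);   -- 4 x 2-bit block addresses, processor i in bits 2i+1..2i
    qd     : in  std_logic_vector(15 downto 0);  -- 4 x 4-bit queue depths QDEP[i]
    sel    : out integer range 0 to 3;           -- PROC_SEL[j], drives MUX[j]
    flag   : out std_logic_vector(3 downto 0)    -- FLAG[i] grant bits for this block
  );
end entity prl_j;

architecture behav of prl_j is
begin
  process (ctrl, mbaddr, qd)
    variable max_qd : unsigned(3 downto 0);
    variable winner : integer range 0 to 3;
    variable found  : boolean;
  begin
    max_qd := (others => '0');
    winner := 0;
    found  := false;
    for i in 0 to 3 loop
      -- processor i competes only if it is requesting (CTRL[i] = '1') and its
      -- block address equals this block's index J; '>=' lets a later (higher-
      -- numbered) processor with an equal queue depth displace the current winner
      if ctrl(i) = '1'
         and to_integer(unsigned(mbaddr(2*i+1 downto 2*i))) = J
         and unsigned(qd(4*i+3 downto 4*i)) >= max_qd then
        max_qd := unsigned(qd(4*i+3 downto 4*i));
        winner := i;
        found  := true;
      end if;
    end loop;

    flag <= (others => '0');
    sel  <= winner;
    if found then
      flag(winner) <= '1';   -- grant the deepest-queue requester for block J
    end if;
  end process;
end architecture behav;
```

The process is purely combinational, so the grant is recomputed whenever CTRL, the block addresses, or the queue depths change; a processor therefore loses its connection as soon as another requester reaches a deeper queue depth, matching the behaviour described above.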

Chapter 5: VHDL Design Capture, Simulation, Synthesis, and Implementation Flow

VHDL, the VHSIC (Very High Speed Integrated Circuit) Hardware Description Language, became a key tool for design capture and development of personal computers, cellular telephones, and high-speed data communications devices during the 1990s. VHDL is a product of the Very High Speed Integrated Circuits (VHSIC) program funded by the Department of Defense in the 1970s and 80s. VHDL provides both low-level and high-level language constructs that enable designers to describe small and large circuits and systems. It provides portability of code between simulation and synthesis tools, as well as device-independent design. It also facilitates converting a design from a programmable logic implementation to an Application Specific Integrated Circuit (ASIC) implementation. VHDL is an industry standard for the description, simulation, modeling, and synthesis of digital circuits and systems. The main reason behind the growth in the use of VHDL can be attributed to synthesis, the reduction of a design description to a lower-level circuit representation. A design can be created and captured using VHDL without having to choose a device for implementation; hence it provides a means for device-independent design. Device-independent design and portability allow benchmarking a design using different device architectures and different synthesis tools.

Electronic Design Automation (EDA) Design Tool Flow: The EDA design tool flow is as follows:
Design description (capture) in VHDL
Pre-synthesis simulation for design verification/validation
Synthesis
Post-synthesis simulation for design verification/validation
Implementation (Map, Place and Route)
Post-implementation simulation for design verification/validation
Design optimization
Final implementation to FPGA, CPLD, or ASIC technology

The inputs to the synthesis EDA software tool are the VHDL code, synthesis directives, and the device technology selection. Synthesis directives include different kinds of external as well as internal directives that influence the device implementation process. The required device selection is done during this process.

Field Programmable Gate Array (FPGA): The FPGA architecture is an array of logic cells (blocks) that communicate with one another and with I/O via wires within routing channels. Like a semi-custom gate array, which consists of an array of transistors, an FPGA consists of an array of logic cells [8,10]. An FPGA chip consists of an array of logic blocks and routing channels, as shown in Figure 5.1. Each circuit or system must be mapped into the smallest square FPGA that can accommodate it.

Figure 5.1: FPGA Architecture

Each logic block consists of a number of RAM-based Look-Up Tables (LUTs), used for logic function implementation, and D-type flip-flops, in addition to several multiplexers used for signal routing or logic function implementation. FPGA routing can be segmented and/or un-segmented.

Un-segmented routing is when each wiring segment spans only one logic block before it terminates in a switch box. By turning on some of the programmable switches within a switch box, longer paths can be constructed. The design for this project is implemented (prototyped) on a Spartan XL FPGA chip, which is a Xilinx product [8]. It is a PROM-based FPGA; the one used for this project is an XCS10PC84 from the XL family. The FPGA is implemented with a regular, flexible, programmable architecture of Configurable Logic Blocks (CLBs) and routing channels, surrounded by I/O blocks. The FPGA is provided with a clock rate of 50 MHz, and there are two more configurable clocks on the chip. The Spartan XCS10PC84XL is an 84-pin device with 466 logic cells and is approximately equivalent to 10,000 gates. Typically, the gate range for the XL family chips runs from a few thousand up to roughly 40,000 gates. The XCS10 has a 14x14 CLB matrix with 196 total CLBs. There are 616 flip-flops in the chip, and the maximum number of available I/Os on the chip is 112.

Digilab XL Prototype Board: Digilab XL prototype boards [8] feature a Xilinx Spartan FPGA (either 3.3V or 5V) and all the I/O devices needed to implement a wide variety of circuits. The FPGA on the board can be programmed directly from a PC using an included JTAG cable, or from an on-board PROM. A view of the board is shown in the figure below. The board has one internally generated clock and two configurable clocks.

Figure 5.2: Digilab Spartan XL Prototyping Board

The Digilab prototype board contains 8 LEDs and a seven-segment display, which can be used for monitoring prototype input/output signals of interest when testing the prototype system programmed into the Spartan XL chip on the board.

VHDL Design Capture: Behavioral VHDL description was used in design capture and coding of the crossbar interconnect network design and logic. The structural equivalents of the two behavioral VHDL descriptions are shown in Figures 6.1 and 6.2 of the next chapter. The scenario of a processor trying to access a particular memory block and having its request granted (FLAG = '1') or rejected (FLAG = '0') can be generalized to all processors and all memory blocks. Hence, the main VHDL code has a function flg, an entity main, and a process P1, and it is possible to increase the number of processors, the number of memory blocks (and the address bus), the number of memory locations in each memory block (and the address bus), the width of the data bus, and the width of the processor queue depth bus. Appendix A contains the VHDL code, structured as shown in Figure 6.1, which describes the crossbar interconnect network assuming the number of processors, the number of memory blocks, and the number of addressable locations in each memory block to be 4; the input queue depth of each processor is 4 bits wide. This VHDL code is described considering the crossbar interconnect network and the shared memory as a single functional unit, as shown in Figure 6.1. This code is tested for correct design capture and interconnect network functionality via the pre-synthesis, post-synthesis, and post-implementation VHDL simulation stages and is downloaded onto the Xilinx-based Spartan XL FPGA [7,8] for prototype testing and evaluation. Appendix B contains the VHDL code describing only the crossbar interconnection network as a single functional unit, as depicted in Figure 6.2; this code has more I/O pins than the previous code and was tested via pre-synthesis and post-synthesis VHDL simulation. With the exceptions of the I/O pins and shared memory, the functionality of both VHDL interconnect network descriptions is the same and is identical to the description of the crossbar interconnect network organization and architecture design described in Chapter 4.
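For reference, a hypothetical sketch of what the interface of the Appendix A entity 'main' might look like, based on the bus widths stated above and the port names shown in Figures 6.1 and 6.3; the actual declarations, generics, and port ordering in Appendix A may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity main is
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    data_in  : in  std_logic_vector(15 downto 0); -- 4 x 4-bit write data, processor i in bits 4i+3..4i
    addr_bus : in  std_logic_vector(15 downto 0); -- 4 x (b + c) = 4-bit processor addresses
    qdep     : in  std_logic_vector(15 downto 0); -- 4 x 4-bit input queue depths
    ctrl     : in  std_logic_vector(3 downto 0);  -- request line, one per processor
    rw       : in  std_logic_vector(3 downto 0);  -- read/write control, one bit per processor
    flag     : out std_logic_vector(3 downto 0);  -- request grants from the priority logic
    data_out : out std_logic_vector(15 downto 0)  -- 4 x 4-bit read data
  );
end entity main;
```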

Chapter 6: Design Validation via Post-Implementation Simulation Testing

There are two VHDL code descriptions of the interconnect network, one in Appendix A and the other in Appendix B. Both have the same interconnect network functionality but different modular structures. The VHDL code in Appendix A is described considering both the crossbar interconnect network and the shared memory as a single block, as shown in Figure 6.1; it has only a processor-side interface to the interconnect.

Figure 6.1: Block Diagram of the VHDL Code Described in Appendix A (module 'main' with inputs data_in, addr_bus, qdep, ctrl, rw, clk, and rst, outputs flag and data_out, and internal memory blocks MB0 through MB3)

Figure 6.2: Block Diagram of the VHDL Code Described in Appendix B (module 'main_ic' with processor-side signals addr_prc, data_in_prc, data_out_prc, qdep, ctrl, rw, and flag, memory-side signals addr_mem, data_in_mem, data_out_mem, and rw_mem, plus clk and rst)

The VHDL code in Appendix B describes the crossbar interconnect network as a single block, as shown in Figure 6.2, and has two interfaces: the processor-to-interconnect interface and the interconnect-to-memory-blocks interface. In both cases (Appendix A and Appendix B) the VHDL code describes a crossbar interconnect network interfaced to four processors and four memory blocks, with each memory block having 4 addressable locations. The data bus of each processor is taken to be 4 bits wide. Entity 'main' corresponds to the interconnect module described in Appendix A and entity 'main_ic' corresponds to the interconnect described in Appendix B. The VHDL descriptions of the interconnect network in Appendices A and B are written in a generic, parameterized manner such that additional processors and memory modules may be interfaced to the interconnect and the size of the memory modules may be increased. The size of the prototyped interconnect was kept small so that it would fit into the Xilinx Spartan XL chip on the prototype board shown in Figure 5.2; no functionality of the interconnect network was compromised by keeping the prototype to a small size. The VHDL description in Appendix A has a 16-bit data_in input port to the interconnect in which the 4-bit data buses of all 4 processors are included, as shown in Figure 6.3. The same is the case for addr_bus, qdep, ctrl, rw, data_out, and flag. Various scenarios are tested, such as:
1. all the processors writing to different memory locations in different memory blocks,
2. all the processors reading from different memory locations in different memory blocks,
3. two or more processors requesting access to different memory locations within the same memory block, with only the highest-priority processor getting the grant to access the memory block,
4. two or more processors requesting access to the same memory location, with only the highest-priority processor getting the grant to access the memory block,
5. two or more processors requesting access to the same memory block, with some of the processors having the same queue depth and only the highest-priority processor getting the grant to access the memory block,

6. one or more processors in an idle state.

Figure 6.3: Input Data Bus Format (the 16-bit DATA_IN[15:0] bus, composed of the 4-bit DATA fields of the four processors)

These scenarios are tested in the module 'main' during the pre-synthesis, post-synthesis, and post-implementation simulations. This module is also downloaded onto a Xilinx-based Spartan XL FPGA chip and is tested under the various scenarios described above. Figures 6.4, 6.5, and 6.6 show the post-implementation simulation tracers, which show the behaviour of the interconnect network and shared memory, described in the VHDL code in Appendix A, under the different scenarios. The same coding style is followed in the VHDL code for main_ic described in Appendix B. The above-mentioned scenarios of processor requests are also tested on the module 'main_ic' during the pre-synthesis and post-synthesis simulations. Figures 6.7 and 6.8 show the post-synthesis simulation tracers, which show the behaviour of the interconnect network under the different scenarios.

The simulation tracers in Figures 6.4 through 6.6 show the behaviour of the interconnect module output data_out and the shared memory in the different scenarios, which are explained below. A testcase 'top' is developed to generate input stimulus to the module 'main' and to display the control signals, address, and data of all processors on LEDs of the Spartan XL FPGA chip. The scnr, pid, addr, and data signals observed on the simulation tracers are used in developing the testcase 'top', which is described in detail in Chapter 7. In this chapter, the input stimulus and the output (data_out and the data in memory locations) observed on the simulation tracers are discussed. (All data values mentioned in the different scenarios and shown on the simulation tracers are represented in the hexadecimal system.)

Scenario 0:
Input stimulus:
data_in <= x"4321" ;
addr_bus <= x"fb73" ;
qdep <= x"1234" ;
ctrl <= x"f" ;
rw <= x"f" ;

In this case, processor '0' is requesting memory access within memory block '0' (location '3'), processor '1' within memory block '1' (location '3'), processor '2' within memory block '2' (location '3'), and processor '3' within memory block '3' (location '3'). Hence there is no conflict between the processors for memory accesses. Processors '0', '1', '2', and '3' (from the 16-bit 'data_in' bus) are writing '1', '2', '3', and '4' to the corresponding memory locations. All the processors get memory access, and hence the data '1', '2', '3', and '4' is written to memory location '3' in each of the four memory blocks. This data can be observed in the corresponding memory locations on simulation tracer 1 in Figure 6.4, for scnr = 0.

Scenario 1:

44 addr_bus <= x"37bf" ; qdep <= x"1234" ; ctrl <= x"f" ; rw <= x"0" ; In this case, processor '0' is requesting memory access within memory block '3' (location '3'), processor '1' within memory block '2' (location '3'), processor '2' within memory block '1' (location '3') and processor '3' within memory block '0' (location '3'). Hence there is no conflict between the processors for memory accesses. Processors '0', '1', '2' and '3' are reading '4', '3', '2' and '1' from the corresponding memory locations. As all the processors get the memory access and hence the data '4', '3', '2' and '1' is read from to the memory locations 'F', 'B', '7' and '3'. This data can be observed on the data_out bus, on the simulation tracer 1 in Figure 6.4, for scnr = 1. Scenarios '0' and '1' test the case of data exchange between processors. In the scenario '0', processor '0' writes '1' to memory location '3' in memory block '0' and processor '3' writes '4' to memory location '3' in memory block '3'. In scenario '1' processor '0' reads '4' (Data written by processor '3' in scenario '0') from memory location '3' in memory block '3' and processor '3' reads '1' (Data written by processor '0' in scenario '0') from memory location '3' in memory block '0'. Similarly data exchange between processors '1' and '2' is also tested. Scenario 2: Input stimulus: data_in <= x"aaaa" ; addr_bus <= x"cd56" ; qdep <= x"efff" ; ctrl <= x"f" ; rw <= x"5" ; 44


More information

On-Chip Interconnection Networks Low-Power Interconnect

On-Chip Interconnection Networks Low-Power Interconnect On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007 Outline Demand for On-Chip Networks

More information

9/14/2011 14.9.2011 8:38

9/14/2011 14.9.2011 8:38 Algorithms and Implementation Platforms for Wireless Communications TLT-9706/ TKT-9636 (Seminar Course) BASICS OF FIELD PROGRAMMABLE GATE ARRAYS Waqar Hussain firstname.lastname@tut.fi Department of Computer

More information

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source)

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source) FPGA IMPLEMENTATION OF 4D-PARITY BASED DATA CODING TECHNIQUE Vijay Tawar 1, Rajani Gupta 2 1 Student, KNPCST, Hoshangabad Road, Misrod, Bhopal, Pin no.462047 2 Head of Department (EC), KNPCST, Hoshangabad

More information

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors 2011 International Symposium on Computer Networks and Distributed Systems (CNDS), February 23-24, 2011 Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors Atefeh Khosravi,

More information

Parallel Programming

Parallel Programming Parallel Programming Parallel Architectures Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Parallel Architectures Acknowledgements Prof. Felix

More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

Lecture N -1- PHYS 3330. Microcontrollers

Lecture N -1- PHYS 3330. Microcontrollers Lecture N -1- PHYS 3330 Microcontrollers If you need more than a handful of logic gates to accomplish the task at hand, you likely should use a microcontroller instead of discrete logic gates 1. Microcontrollers

More information

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP SWAPNA S 2013 EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP A

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Switch Fabric Implementation Using Shared Memory

Switch Fabric Implementation Using Shared Memory Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today

More information

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and

More information

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE CHAPTER 5 71 FINITE STATE MACHINE FOR LOOKUP ENGINE 5.1 INTRODUCTION Finite State Machines (FSMs) are important components of digital systems. Therefore, techniques for area efficiency and fast implementation

More information

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms Introduction to Parallel Computing George Karypis Parallel Programming Platforms Elements of a Parallel Computer Hardware Multiple Processors Multiple Memories Interconnection Network System Software Parallel

More information

LogiCORE IP AXI Performance Monitor v2.00.a

LogiCORE IP AXI Performance Monitor v2.00.a LogiCORE IP AXI Performance Monitor v2.00.a Product Guide Table of Contents IP Facts Chapter 1: Overview Target Technology................................................................. 9 Applications......................................................................

More information

International Journal of Advancements in Research & Technology, Volume 2, Issue3, March -2013 1 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 2, Issue3, March -2013 1 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 2, Issue3, March -2013 1 FPGA IMPLEMENTATION OF HARDWARE TASK MANAGEMENT STRATEGIES Assistant professor Sharan Kumar Electronics Department

More information

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) International Journal of Electronics and Communication Engineering & Technology (IJECET), ISSN 0976 ISSN 0976 6464(Print)

More information

Chapter 4 Multi-Stage Interconnection Networks The general concept of the multi-stage interconnection network, together with its routing properties, have been used in the preceding chapter to describe

More information

What is a System on a Chip?

What is a System on a Chip? What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex

More information

Why the Network Matters

Why the Network Matters Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing

More information

Topological Properties

Topological Properties Advanced Computer Architecture Topological Properties Routing Distance: Number of links on route Node degree: Number of channels per node Network diameter: Longest minimum routing distance between any

More information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Computer Network. Interconnected collection of autonomous computers that are able to exchange information Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.

More information

DESIGN AND VERIFICATION OF LSR OF THE MPLS NETWORK USING VHDL

DESIGN AND VERIFICATION OF LSR OF THE MPLS NETWORK USING VHDL IJVD: 3(1), 2012, pp. 15-20 DESIGN AND VERIFICATION OF LSR OF THE MPLS NETWORK USING VHDL Suvarna A. Jadhav 1 and U.L. Bombale 2 1,2 Department of Technology Shivaji university, Kolhapur, 1 E-mail: suvarna_jadhav@rediffmail.com

More information

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102

More information

Design and Verification of Nine port Network Router

Design and Verification of Nine port Network Router Design and Verification of Nine port Network Router G. Sri Lakshmi 1, A Ganga Mani 2 1 Assistant Professor, Department of Electronics and Communication Engineering, Pragathi Engineering College, Andhra

More information

INTRODUCTION TO DIGITAL SYSTEMS. IMPLEMENTATION: MODULES (ICs) AND NETWORKS IMPLEMENTATION OF ALGORITHMS IN HARDWARE

INTRODUCTION TO DIGITAL SYSTEMS. IMPLEMENTATION: MODULES (ICs) AND NETWORKS IMPLEMENTATION OF ALGORITHMS IN HARDWARE INTRODUCTION TO DIGITAL SYSTEMS 1 DESCRIPTION AND DESIGN OF DIGITAL SYSTEMS FORMAL BASIS: SWITCHING ALGEBRA IMPLEMENTATION: MODULES (ICs) AND NETWORKS IMPLEMENTATION OF ALGORITHMS IN HARDWARE COURSE EMPHASIS:

More information

Implementation Details

Implementation Details LEON3-FT Processor System Scan-I/F FT FT Add-on Add-on 2 2 kbyte kbyte I- I- Cache Cache Scan Scan Test Test UART UART 0 0 UART UART 1 1 Serial 0 Serial 1 EJTAG LEON_3FT LEON_3FT Core Core 8 Reg. Windows

More information

- Nishad Nerurkar. - Aniket Mhatre

- Nishad Nerurkar. - Aniket Mhatre - Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,

More information

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin BUS ARCHITECTURES Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin Keywords: Bus standards, PCI bus, ISA bus, Bus protocols, Serial Buses, USB, IEEE 1394

More information

Read this before starting!

Read this before starting! Points missed: Student's Name: Total score: /100 points East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 2 for Fall Semester, 2006 Section

More information

Introduction to Digital System Design

Introduction to Digital System Design Introduction to Digital System Design Chapter 1 1 Outline 1. Why Digital? 2. Device Technologies 3. System Representation 4. Abstraction 5. Development Tasks 6. Development Flow Chapter 1 2 1. Why Digital

More information

Implementation of Web-Server Using Altera DE2-70 FPGA Development Kit

Implementation of Web-Server Using Altera DE2-70 FPGA Development Kit 1 Implementation of Web-Server Using Altera DE2-70 FPGA Development Kit A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT OF FOR THE DEGREE IN Bachelor of Technology In Electronics and Communication

More information

Aims and Objectives. E 3.05 Digital System Design. Course Syllabus. Course Syllabus (1) Programmable Logic

Aims and Objectives. E 3.05 Digital System Design. Course Syllabus. Course Syllabus (1) Programmable Logic Aims and Objectives E 3.05 Digital System Design Peter Cheung Department of Electrical & Electronic Engineering Imperial College London URL: www.ee.ic.ac.uk/pcheung/ E-mail: p.cheung@ic.ac.uk How to go

More information

Serial Communications

Serial Communications Serial Communications 1 Serial Communication Introduction Serial communication buses Asynchronous and synchronous communication UART block diagram UART clock requirements Programming the UARTs Operation

More information

7a. System-on-chip design and prototyping platforms

7a. System-on-chip design and prototyping platforms 7a. System-on-chip design and prototyping platforms Labros Bisdounis, Ph.D. Department of Computer and Communication Engineering 1 What is System-on-Chip (SoC)? System-on-chip is an integrated circuit

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

MICROPROCESSOR AND MICROCOMPUTER BASICS

MICROPROCESSOR AND MICROCOMPUTER BASICS Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit

More information

Development of a Research-oriented Wireless System for Human Performance Monitoring

Development of a Research-oriented Wireless System for Human Performance Monitoring Development of a Research-oriented Wireless System for Human Performance Monitoring by Jonathan Hill ECE Dept., Univ. of Hartford jmhill@hartford.edu Majdi Atallah ECE Dept., Univ. of Hartford atallah@hartford.edu

More information

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks Chapter 12: Multiprocessor Architectures Lesson 04: Interconnect Networks Objective To understand different interconnect networks To learn crossbar switch, hypercube, multistage and combining networks

More information

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001 Agenda Introduzione Il mercato Dal circuito integrato al System on a Chip (SoC) La progettazione di un SoC La tecnologia Una fabbrica di circuiti integrati 28 How to handle complexity G The engineering

More information

Chapter 7 Memory and Programmable Logic

Chapter 7 Memory and Programmable Logic NCNU_2013_DD_7_1 Chapter 7 Memory and Programmable Logic 71I 7.1 Introduction ti 7.2 Random Access Memory 7.3 Memory Decoding 7.5 Read Only Memory 7.6 Programmable Logic Array 77P 7.7 Programmable Array

More information

Fondamenti su strumenti di sviluppo per microcontrollori PIC

Fondamenti su strumenti di sviluppo per microcontrollori PIC Fondamenti su strumenti di sviluppo per microcontrollori PIC MPSIM ICE 2000 ICD 2 REAL ICE PICSTART Ad uso interno del corso Elettronica e Telecomunicazioni 1 2 MPLAB SIM /1 MPLAB SIM is a discrete-event

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

White Paper FPGA Performance Benchmarking Methodology

White Paper FPGA Performance Benchmarking Methodology White Paper Introduction This paper presents a rigorous methodology for benchmarking the capabilities of an FPGA family. The goal of benchmarking is to compare the results for one FPGA family versus another

More information

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations Microelectronic System Design Research Group University Kaiserslautern www.eit.uni-kl.de/wehn Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations Norbert

More information

NORTHEASTERN UNIVERSITY Graduate School of Engineering

NORTHEASTERN UNIVERSITY Graduate School of Engineering NORTHEASTERN UNIVERSITY Graduate School of Engineering Thesis Title: Enabling Communications Between an FPGA s Embedded Processor and its Reconfigurable Resources Author: Joshua Noseworthy Department:

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Sistemas Digitais I LESI - 2º ano

Sistemas Digitais I LESI - 2º ano Sistemas Digitais I LESI - 2º ano Lesson 6 - Combinational Design Practices Prof. João Miguel Fernandes (miguel@di.uminho.pt) Dept. Informática UNIVERSIDADE DO MINHO ESCOLA DE ENGENHARIA - PLDs (1) - The

More information

Chapter 13: Verification

Chapter 13: Verification Chapter 13: Verification Prof. Ming-Bo Lin Department of Electronic Engineering National Taiwan University of Science and Technology Digital System Designs and Practices Using Verilog HDL and FPGAs @ 2008-2010,

More information

Quartus II Software Design Series : Foundation. Digitale Signalverarbeitung mit FPGA. Digitale Signalverarbeitung mit FPGA (DSF) Quartus II 1

Quartus II Software Design Series : Foundation. Digitale Signalverarbeitung mit FPGA. Digitale Signalverarbeitung mit FPGA (DSF) Quartus II 1 (DSF) Quartus II Stand: Mai 2007 Jens Onno Krah Cologne University of Applied Sciences www.fh-koeln.de jens_onno.krah@fh-koeln.de Quartus II 1 Quartus II Software Design Series : Foundation 2007 Altera

More information

International Workshop on Field Programmable Logic and Applications, FPL '99

International Workshop on Field Programmable Logic and Applications, FPL '99 International Workshop on Field Programmable Logic and Applications, FPL '99 DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconægurable Systems? Kiran Bondalapati and

More information

Life Cycle of a Memory Request. Ring Example: 2 requests for lock 17

Life Cycle of a Memory Request. Ring Example: 2 requests for lock 17 Life Cycle of a Memory Request (1) Use AQR or AQW to place address in AQ (2) If A[31]==0, check for hit in DCache Ring (3) Read Hit: place cache word in RQ; Write Hit: replace cache word with WQ RDDest/RDreturn

More information

Hardware and Software

Hardware and Software Hardware and Software 1 Hardware and Software: A complete design Hardware and software support each other Sometimes it is necessary to shift functions from software to hardware or the other way around

More information

SYSTEM-ON-PROGRAMMABLE-CHIP DESIGN USING A UNIFIED DEVELOPMENT ENVIRONMENT. Nicholas Wieder

SYSTEM-ON-PROGRAMMABLE-CHIP DESIGN USING A UNIFIED DEVELOPMENT ENVIRONMENT. Nicholas Wieder SYSTEM-ON-PROGRAMMABLE-CHIP DESIGN USING A UNIFIED DEVELOPMENT ENVIRONMENT by Nicholas Wieder A thesis submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of

More information

Rapid System Prototyping with FPGAs

Rapid System Prototyping with FPGAs Rapid System Prototyping with FPGAs By R.C. Coferand Benjamin F. Harding AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Newnes is an imprint of

More information

PowerPC Microprocessor Clock Modes

PowerPC Microprocessor Clock Modes nc. Freescale Semiconductor AN1269 (Freescale Order Number) 1/96 Application Note PowerPC Microprocessor Clock Modes The PowerPC microprocessors offer customers numerous clocking options. An internal phase-lock

More information

Digital Systems Design! Lecture 1 - Introduction!!

Digital Systems Design! Lecture 1 - Introduction!! ECE 3401! Digital Systems Design! Lecture 1 - Introduction!! Course Basics Classes: Tu/Th 11-12:15, ITE 127 Instructor Mohammad Tehranipoor Office hours: T 1-2pm, or upon appointments @ ITE 441 Email:

More information

Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages

Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Computer

More information

Programmable Logic IP Cores in SoC Design: Opportunities and Challenges

Programmable Logic IP Cores in SoC Design: Opportunities and Challenges Programmable Logic IP Cores in SoC Design: Opportunities and Challenges Steven J.E. Wilton and Resve Saleh Department of Electrical and Computer Engineering University of British Columbia Vancouver, B.C.,

More information

OpenSPARC T1 Processor

OpenSPARC T1 Processor OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware

More information

CSE2102 Digital Design II - Topics CSE2102 - Digital Design II

CSE2102 Digital Design II - Topics CSE2102 - Digital Design II CSE2102 Digital Design II - Topics CSE2102 - Digital Design II 6 - Microprocessor Interfacing - Memory and Peripheral Dr. Tim Ferguson, Monash University. AUSTRALIA. Tel: +61-3-99053227 FAX: +61-3-99053574

More information

Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware

Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware Shaomeng Li, Jim Tørresen, Oddvar Søråsen Department of Informatics University of Oslo N-0316 Oslo, Norway {shaomenl, jimtoer,

More information

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner Hardware Implementations of RSA Using Fast Montgomery Multiplications ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner Overview Introduction Functional Specifications Implemented Design and Optimizations

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping 3.1.1 Constants, variables and data types Understand what is mean by terms data and information Be able to describe the difference

More information

Redundancy in enterprise storage networks using dual-domain SAS configurations

Redundancy in enterprise storage networks using dual-domain SAS configurations Redundancy in enterprise storage networks using dual-domain SAS configurations technology brief Abstract... 2 Introduction... 2 Why dual-domain SAS is important... 2 Single SAS domain... 3 Dual-domain

More information

Lab #5: Design Example: Keypad Scanner and Encoder - Part 1 (120 pts)

Lab #5: Design Example: Keypad Scanner and Encoder - Part 1 (120 pts) Dr. Greg Tumbush, gtumbush@uccs.edu Lab #5: Design Example: Keypad Scanner and Encoder - Part 1 (120 pts) Objective The objective of lab assignments 5 through 9 are to systematically design and implement

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level System: User s View System Components: High Level View Input Output 1 System: Motherboard Level 2 Components: Interconnection I/O MEMORY 3 4 Organization Registers ALU CU 5 6 1 Input/Output I/O MEMORY

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models System Development Models and Methods Dipl.-Inf. Mirko Caspar Version: 10.02.L.r-1.0-100929 Contents HW/SW Codesign Process Design Abstraction and Views Synthesis Control/Data-Flow Models System Synthesis

More information

AC 2007-2485: PRACTICAL DESIGN PROJECTS UTILIZING COMPLEX PROGRAMMABLE LOGIC DEVICES (CPLD)

AC 2007-2485: PRACTICAL DESIGN PROJECTS UTILIZING COMPLEX PROGRAMMABLE LOGIC DEVICES (CPLD) AC 2007-2485: PRACTICAL DESIGN PROJECTS UTILIZING COMPLEX PROGRAMMABLE LOGIC DEVICES (CPLD) Samuel Lakeou, University of the District of Columbia Samuel Lakeou received a BSEE (1974) and a MSEE (1976)

More information

Verification of Triple Modular Redundancy (TMR) Insertion for Reliable and Trusted Systems

Verification of Triple Modular Redundancy (TMR) Insertion for Reliable and Trusted Systems Verification of Triple Modular Redundancy (TMR) Insertion for Reliable and Trusted Systems Melanie Berg 1, Kenneth LaBel 2 1.AS&D in support of NASA/GSFC Melanie.D.Berg@NASA.gov 2. NASA/GSFC Kenneth.A.LaBel@NASA.gov

More information

LatticeECP3 High-Speed I/O Interface

LatticeECP3 High-Speed I/O Interface April 2013 Introduction Technical Note TN1180 LatticeECP3 devices support high-speed I/O interfaces, including Double Data Rate (DDR) and Single Data Rate (SDR) interfaces, using the logic built into the

More information

IMPLEMENTATION OF FPGA CARD IN CONTENT FILTERING SOLUTIONS FOR SECURING COMPUTER NETWORKS. Received May 2010; accepted July 2010

IMPLEMENTATION OF FPGA CARD IN CONTENT FILTERING SOLUTIONS FOR SECURING COMPUTER NETWORKS. Received May 2010; accepted July 2010 ICIC Express Letters Part B: Applications ICIC International c 2010 ISSN 2185-2766 Volume 1, Number 1, September 2010 pp. 71 76 IMPLEMENTATION OF FPGA CARD IN CONTENT FILTERING SOLUTIONS FOR SECURING COMPUTER

More information