Multicore Reconfiguration Platform A Research and Evaluation FPGA Framework for Runtime Reconfigurable Systems


Dipl.-Inf. Dominik Meyer
18 March 2015


DISSERTATION approved by the Faculty of Electrical Engineering of the Helmut-Schmidt-Universität / Universität der Bundeswehr Hamburg for the award of the academic degree of Doktor-Ingenieur, submitted by Diplom-Informatiker Dominik Meyer from Rendsburg. Hamburg 2015

Reviewers: Prof. Dr. Bernd Klauer, Prof. Dr. Udo Zölzer
Chair of the examination committee: Prof. Dr. Gerd Scholl
Date of the oral examination:
Printed with the kind support of the HSU-Universität der Bundeswehr Hamburg.

Curriculum Vitae

Personal information
Surname / First name: Meyer, Dominik
Email: dmeyer@hsu-hh.de
Nationality: German
Date of birth: June 17, 1976

Education
Title of qualification awarded: Abitur
Name and type of organisation providing education and training: Helene Lange Gymnasium, Rendsburg, Germany

Title of qualification awarded: Diplom in Computer Science
Name and type of organisation providing education and training: Christian-Albrechts-Universität zu Kiel

Work experience
Occupation or position held: technical advisor/manager
Main activities and responsibilities: Buildup and management of the server infrastructure of an internet service provider and webhoster.
Name and address of employer: PcW KG

Occupation or position held: technical manager
Main activities and responsibilities: Buildup and management of the server infrastructure of a webhoster. Development of firewall solutions.
Name and address of employer: die Netzwerkstatt

Dates: now
Occupation or position held: research assistant
Main activities and responsibilities: research in runtime reconfigurable systems
Name and address of employer: Computer Engineering / Helmut Schmidt University Hamburg

Publications

[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp,
[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress,
[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform an alternative to rampsoc. SIGARCH Comput. Archit. News, 39(4): , December

Acknowledgments

This thesis is the result of my work at the Institute of Computer Engineering at the Helmut Schmidt University / University of the Federal Armed Forces Hamburg. I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity to work on this thesis. I also want to thank the remaining members of my dissertation committee, Prof. Dr. Scholl and Prof. Dr. Zölzer. The discussions of my research results with my current and former colleagues at the Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert, Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase. Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and support during the last years.


Acronyms

AES  Advanced Encryption Standard.
ALU  Arithmetical Logical Unit.
AMBA  Advanced Microcontroller Bus Architecture.
API  Application Programming Interface.
BRAM  Block RAM.
CAN  Controller Area Network.
CDC  Clock Domain Crossing.
CEB  Configurable Entity Block.
CLB  Configurable Logic Block.
CMT  Clock Management Tiles.
CPLD  Complex Programmable Logic Device.
CPU  Central Processing Unit.
CSMA/CD  Carrier Sense - Multiple Access / Collision Detection.
CSN  Circuit Switched Network.
DDR  Double Data Rate.
DIP  Dual Inline Package.
DNF  Disjunctive Normal Form.
DSP  Digital Signal Processor.
FF  FlipFlop.
FFT  Fast Fourier Transformation.
FIFO  First In First Out.
FPGA  Field Programmable Gate Array.
FSM  Finite State Machine.
GPIO  General Purpose Input Output.
GPU  Graphical Processing Unit.
HDL  Hardware Description Language.
HSTL  High-Speed Transceiver Logic.
HTTP  Hypertext Transfer Protocol.
I2C  Inter-Integrated Circuit.
IC  Integrated Circuit.
ICAP  Internal Configuration Access Port.
ILP  Instruction Level Parallelism.
IOB  Input/Output Block.
IP  Intellectual Property.
ISA  Instruction Set Architecture.
ISO  International Organization for Standardization.
ITU  International Telecommunication Union.
LAN  Local Area Network.
LED  Light Emitting Diode.
LUT  LookUpTable.
LVDS  Low-Voltage Differential Signaling.
LVTTL  Low-Voltage Transistor Transistor Logic.
MAC  Media Access Control.
MPSoC  Multi-Processor System-on-Chip.
MPU  Multiplier Unit.
MRP  Multicore Reconfiguration Platform.
NOC  Network On Chip.
OCSN  On Chip Switching Network.
OS  Operating System.
OSI  Open Systems Interconnection Model.
PAL  Programmable Array Logic.
PCI  Peripheral Component Interconnect.
PCIe  Peripheral Component Interconnect Express.
PE  Processing Element.
PLA  Programmable Logic Array.
POP3  Post Office Protocol Version 3.
PR  Partial Reconfiguration.
PRHS  Partial Reconfiguration Heterogenous System.
RAM  Random Access Memory.
RampSoC  Runtime adaptive multiprocessor system-on-chip.
RC  Reconfigurable Computing.
RM  Reconfigurable Module.
RO  Ring Oscillator.
RS  Reconfigurable System.
RTL  Register Transfer Level.
SATA  Serial Advanced Technology Attachment.
SCI  Scalable Coherent Interface.
SoC  System on Chip.
SPI  Serial Peripheral Interface.
SRAM  Static Random Access Memory.
TCP  Transmission Control Protocol.
UART  Universal asynchronous receiver/transmitter.
UDP  User Datagram Protocol.
USB  Universal Serial Bus.
VA  Virtual Architecture.
VHDL  Very High Speed Integrated Circuits HDL.
VR  Virtual Region.
WAN  Wide Area Network.
XDL  Xilinx Description Language.
XML  Extensible Markup Language.


List of Figures

History of the ic processing size[1]
partitioning of an FPGA for the Xilinx PR design flow[2]
and/or Matrix
Halfadder implemented in an and/or Matrix
4 to 1 Multiplexer
Cascaded 4 to 1 Multiplexer
Simple structure of an FPGA without interconnects
Structure of two Virtex5 CLBs[3]
simple PR example[2]
example RAMPSoC Configuration[4]
PRHS System Overview[5]
Overview of the Convey HC1 architecture[6]
Structure of an Intel Stellarton Processor, combined with an Altera FPGA
Structure of the Xilinx Zynq architecture[7]
COPACOBANA and RIVYERA interconnection overview
Example mobile phone SystemOnChip (SoC)
graphical representation of the ISO/OSI Model
direct and indirect interconnection networks
Example Ring network with eight nodes
Example bus with 4 nodes
Example grid networks with 16 nodes
Example tree networks
Example 4 x 4 crossbar networks
Example granularity problem
Example grouping solution configuration
Example granularity solution configuration
Area requirements of the different usage patterns
Example MRP System Overview
OCSN frame description
OCSN network structure overview
OCSN address structure
Example support platform
Example reconfiguration platform
CEB Signal Interface
CSN group
full MRP design flow
reduced MRP design flow
Clock Domain Crossing (CDC) component interface
Dual Port Block RAM interface
SimpleFiFo interface
Reception of one OCSN Frame
OCSN physical transmission component
OCSN physical reception component
Flowchart of OCSN identification protocol
Flowchart of OCSN flow control protocol
OCSN IF signal interface
OCSN IF implementation schematic
Graph of the OCSN IF FSM
signal interface of an OCSN Switch
signal interface of the addr compare component
OCSN switch implementation schematic
OCSN application component basic schematic
OCSN Ethernet Bridge FSMs
OCSN Ethernet Discovery Protocol
Crossbar Interconnection Schema
CSN Crossbar Switch Signal Interface
CSN Crossbar Switch Implementation Schematic
CSN2OCSN Bridge Signal Interface
MRP Measurement Configuration for Setup
Floorplan of the reconfiguration platform
Floorplan with interconnects of the reconfiguration platform
MRP CPU Configuration

List of Tables

Configuration speed and -time for a Xilinx xc5vlx330 FPGA
Configuration speed and -time for a Xilinx xc5vlx330 FPGA with 0,25MB Data
Truth table of a Halfadder
different Boolean functions implemented with a 4 to 1 multiplexer
Example LUT implementing, and
Classification of a bidirectional ring
Classification of a bus
Classification of an open grid (mesh) with 4 x 4 nodes
Classification of a closed grid (illiac) with 4 x 4 nodes
Classification of a tree
Classification of a crossbar network with n nodes
variable speed of the OCSN
Address to register mapping
Area usage of the MRP
Maximum clock rates within each switch
Propagation Delay Matrix for all CEBs in ns
used OCSN frame types (Table A.1)


Contents

List of Figures
List of Tables

1 Introduction
  Reconfigurable Hardware
  Runtime Reconfiguration
  Hybrid Hardware Approaches
  Datapath Accelerators
  Bus Accelerators
  Multicore Reconfiguration
  Thesis Objectives
  Thesis Structure

2 Reconfiguration Fundamentals
  Matrix Approach
  Multiplexer Approach
  Look Up Table Approach
  Field Programmable Gate Arrays
  Input/Output Blocks
  Configurable Logic Blocks
  Block RAM
  Special IO Components
  Interconnection Network
  Partial Reconfiguration

3 Example Reconfigurable Systems
  Research Systems
  RampSoC
  PRHS
  Dreams
  Commercial Systems
  Convey HC
  Intel Stellarton
  Xilinx Zynq Architecture
  COPACOBANA and RIVYERA

4 Interconnection Networks
  Open Systems Interconnection Model
  Application Layer
  Presentation Layer
  Session Layer
  Transport Layer
  Network Layer
  Data Link Layer
  Physical Layer
  Topology
  Interconnection Type
  Grade and Regularity
  Diameter
  Bisection Width
  Symmetry
  Scalability
  Interface Structure
  Direct Networks
  Indirect Networks
  Operating Mode
  Synchronous Connection Establishment
  Synchronous Data Transmission
  Asynchronous Connection Establishment
  Asynchronous Data Transmission
  Mixed Mode
  Communication Flexibility
  Broadcast
  Unicast
  Multicast
  Mixed
  Control Strategy
  Centralised Control
  Decentralised Control
  Transfer Mode and Data Transport
  Conflict Resolution

5 Example Network On Chip Architectures
  Ring
  Bus
  Bus-Arbitration
  Data Transmission Protocol
  Classification
  Grid
  Tree
  Crossbar

6 Granularity Problem of Runtime Reconfigurable Design Flow
  Solutions
  Grouping Solution
  Granularity Solution
  Granularity Problem and Hybrid Hardware

7 Multicore Reconfiguration Platform Description
  On Chip Switching Network
  Physical Layer
  Data-link Layer
  Network Layer
  Transport Layer
  Session Layer
  Presentation Layer
  Application Layer
  Support Platform
  GPIO
  BRAM
  DDR3 RAM
  UART Bridge
  Ethernet Bridge
  Soft-core SoC
  Reconfiguration Platform
  ICAP
  CEB
  CSN
  IOB
  Operating System Support
  Design Flow

8 Implementation of the Multicore Reconfiguration Platform
  General Components
  Clock Domain Crossing
  Dual Port Block RAM
  FiFo Queue Component
  OCSN
  OCSN Physical Interface Components
  OCSN Data-Link Interface Component
  OCSN Network Component
  OCSN Application Components
  CSN
  Physical Layer Implementation
  Network Layer Components
  Application Layer Components

9 Operating System Support Implementation
  OCSN Network Driver
  OCSN Network Device Driver

10 Evaluation
  Area Usage
  Maximum CSN Propagation Delay Measurement
  RO-Component
  ReRouter-Component
  Measuring Setup
  Measurement Results
  Example Microcontroller Implementation for MRP

11 Conclusion
  Outlook

Appendix
A OCSN Frame Types
Bibliography

1 Introduction

Gordon E. Moore[8] stated in 1965, in the context of the growing Integrated Circuit (IC) market: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." The main conclusion of his paper is that the density of transistors on an IC doubles periodically. This prediction still holds after 48 years, according to Intel employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9]. ICs, such as general-purpose processors, are now produced in a 14 nm technology process. Figure 1.1 displays the history of processing sizes for ICs over the last decades.

Figure 1.1: History of the ic processing size[1] (size in nm plotted over the year)

With every doubling of the transistor density, more logic components can be placed onto one IC. Processor designers use this newly available space to add more and more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to continue for a while, equipping general-purpose processors with more parallel computing power.

Systems on Chip (SoCs) are another product of the available space on ICs. They feature single and multicore processors combined with a GPU and additional accelerator hardware. This accelerator hardware improves the computing power with Digital Signal Processors (DSPs) or other mathematical functions implemented in hardware. Beyond filling the available space with more and more static hardware, the space can also be used for adding reconfigurable hardware.

1.1 Reconfigurable Hardware

Reconfigurable hardware can change its function after chip assembly and allows the configuration of any digital circuit, such as Advanced Encryption Standard (AES) or Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions and even some specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like Arithmetical Logical Units (ALUs) and Multiplier Units (MPUs), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.

One important limitation of FPGAs was that they had to be reconfigured completely, even for small system changes. Every computation taking place in hardware had to be stopped, and a programming file representing the changed functionality was loaded into the FPGA. Even if only half of the reconfigurable area was computing and the other half had no function, the whole area had to be replaced. This was and still is a very time-intensive task. The reconfiguration process takes many milliseconds to complete, depending on the size of the file and the configuration channel. It also erases the internal states of all configured hardware components.

Table 1.1: Configuration speed and -time for a Xilinx xc5vlx330 FPGA. Columns: File Size (MB), Interface, Bit-width, Clk (MHz), Speed (Mb/s), Time (ms). All rows use a 9,6 MB file and the SelectMap interface.
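As a rough illustration of how file size and configuration channel determine the configuration time, the following Python sketch computes a lower bound from the columns of Table 1.1. The function name and the example bus widths and clock rate are assumptions made for this sketch and are not taken from the thesis; protocol overhead is ignored.

```python
def config_time_ms(file_size_mb: float, bit_width: int, clk_mhz: float) -> float:
    """Lower bound on the configuration time in milliseconds.

    Assumptions of this sketch: one word of `bit_width` bits is transferred
    per clock cycle, 1 MB equals 10**6 bytes, and no protocol overhead occurs.
    """
    total_bits = file_size_mb * 8e6           # file size in bits
    bits_per_second = bit_width * clk_mhz * 1e6  # raw channel throughput
    return total_bits / bits_per_second * 1e3


if __name__ == "__main__":
    # Hypothetical interface settings (bus widths and 100 MHz are assumptions).
    for width in (8, 16, 32):
        t = config_time_ms(9.6, width, 100.0)
        print(f"{width:2d} bit @ 100 MHz: {t:.1f} ms")
```

Under these assumptions the time scales linearly with the file size and inversely with bus width and clock rate, which is why a smaller configuration stream directly shortens the reconfiguration.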
Table 1.1 presents the calculated minimal configuration times for a Xilinx FPGA and a 9,6 MB configuration file using the fastest available configuration interface.

1.1.1 Runtime Reconfiguration

Because of the configuration time limitation, and to enable replacing one part of a design while other parts are still computing, hardware vendors introduced the concept of runtime reconfiguration. Runtime reconfiguration is also often referred to as dynamic reconfiguration or partial runtime reconfiguration. A runtime reconfigurable project is developed by dividing the FPGA into several Reconfigurable Modules (RMs) during the design phase. Figure 2.7 shows an example partitioning of an FPGA for use with the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial reconfiguration for Xilinx FPGAs. Two differently sized RMs are available, each connected to some special static control hardware.

Figure 1.2: partitioning of an FPGA for the Xilinx PR design flow[2] (static logic with two reconfigurable modules, RM0 and RM1, and the partial bitstreams RM00.bit to RM03.bit and RM10.bit to RM13.bit)

This feature does not speed up the configuration process itself, but partitioning the reconfigurable area shrinks the size of the individual configuration stream, which reduces the reconfiguration time of one RM. For example, if the size of the configuration stream for one RM can be reduced to 0,25 MB, the configuration times of Table 1.2 are achieved. This is an enormous speed-up, but it can only be achieved if the design can be partitioned and the RMs can be reconfigured individually rather than all at once. The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].

Table 1.2: Configuration speed and -time for a Xilinx xc5vlx330 FPGA with 0,25MB Data. Columns: File Size (MB), Interface, Bit-width, Clk (MHz), Speed (Mb/s), Time (ms). All rows use a 0,25 MB file and the SelectMap interface.

1.2 Hybrid Hardware Approaches

Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems. The industry has already produced some hybrid systems, such as the Xilinx Zynq architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6]. The first combines an ARM Cortex A9 processor core with a Xilinx FPGA on the same chip, but not on the same die. The next combines an Intel Atom processor with an Altera FPGA in the same manner. The last interconnects one Intel Xeon processor with four Xilinx FPGAs through the Intel co-processor interface. Still missing are hybrid hardware systems combined on a single die. Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have already been evaluated.

1.2.1 Datapath Accelerators

Hallmannseder[15], Dales[16], Hauser et al.[17] and Razdan[18] added reconfiguration directly into processor cores by adding reconfigurable accelerator units to the datapath of the processor. These units are small and cannot be merged to form larger ones. They improve the processor performance by exploiting Instruction Level Parallelism (ILP) through additional computational datapath units, or by extending the Instruction Set Architecture (ISA) with special instructions. Examples of these special instructions are cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators improve the performance the most if they are tightly integrated into the processor core without long interconnects.

1.2.2 Bus Accelerators

Bus accelerators are small to medium-sized reconfigurable components and can be configured with specialised hardware to improve the runtime of a specific part of a program. They are connected to the processor through a bus or a network. These accelerators have to work independently on some part of the data because of the high bus/network latency. This can relieve the static core(s) of some portion of the data that can be processed in parallel. Because of the independent nature of these accelerators, they have an internal state and sometimes a connection to the main memory of the system.
Bus accelerators are a very simple way of extending the performance of processor cores because existing buses, like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be used, but more tightly coupled interconnects are also possible.

1.2.3 Multicore Reconfiguration

The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer et al.[4, 19] evaluates the multicore reconfiguration approach. With multicore reconfiguration, multiple processor cores can be configured at system runtime. The system can adjust itself to the nature of the problem currently being solved. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core. These processor cores are called softcores because they are not statically implemented. The size of the largest one defines the size of the smallest RM, if every processor core shall fit into every RM. An alternative is to define some differently sized RMs for differently sized processor cores, but this reduces the number of usable processor cores of the same size.

1.3 Thesis Objectives

Most of the research about hybrid hardware systems focuses on one combining class only, always uses a fixed number of statically sized cores or units, and considers only high performance computing applications. This is also true for industrial products. These restrictions limit the number of application scenarios for each architecture. To deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example applications benefiting from hybrid hardware in general-purpose computing are image processing applications, simulation of electromagnetic fields, solid state physics and computer games.

Image processing applications could use hybrid hardware to accelerate certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. The simulation of electromagnetic fields and solid state physics can accelerate their computations by offloading certain calculations to the reconfigurable hardware. Both fields already use modern graphic cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphic cards to accelerate physical calculations for their simulated world. Hence, with reconfigurable hardware, each computer game could bring its own hardware for doing such calculations. All of this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each computer application can increase the processing power or reduce the power consumption of the whole system.
Often, applications in a general-purpose environment run concurrently, which induces the requirement of a variable number and a variable size of reconfigurable modules. This all-purpose computing capability requires more flexible design rules than systems supporting just one combination class.

Computer systems are divisible into single-purpose computers, multipurpose computers and general-purpose computers. Single-purpose computers are designed for a specific calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes. This is already very common. Multipurpose computers are specialised for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs, reconfigurable accelerator units are available. They enable developers to extend the functionality or integrate new algorithms. The last computation class, the general-purpose computers, lacks support for reconfigurable hardware at the moment. This thesis aims to change this situation.

As mentioned earlier, the FPGA has to be partitioned into multiple modules to support runtime reconfiguration. This partitioning is fixed after the initial system design stage. This early-stage floorplanning leads to the granularity problem of runtime reconfigurable design flow because differently sized components shall be runtime reconfigurable with maximum flexibility and a good area usage ratio. During floorplanning, the largest component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem, and the solution proposed in this thesis, are described in more detail in Chapter 6.

Deploying hybrid hardware into general-purpose computing leads to another problem. At the moment it is relatively easy to write platform-independent programs by using a higher level programming language like C. Languages like Java are not considered here because their programs run in a virtual machine, not on the bare hardware[20]. Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures with the same base assumptions, differing only in the ISA. Writing platform-independent programs for hybrid hardware is much more complicated because these programs consist of software and hardware parts. The description of the reconfigurable hardware in such a system is called configware. While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not yet agreed upon an architecture for the hardware part. It cannot be expected that all these companies will decide on the same reconfiguration approach for their hybrid hardware systems. This complicates the development of the configware because developers have to describe hardware for different reconfiguration approaches.

Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi FPGA framework called Multicore Reconfiguration Platform (MRP).
This framework uses a new floorplanning technique for partitioning the FPGAs, and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of differently sized reconfigurable components, limited only by the FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. This framework also simplifies the development of platform-independent software and configware because the framework can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) for every hybrid hardware developer.

The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet switched network, called On Chip Switching Network (OCSN). It allows intra FPGA communication for configuring the RMs and programming the CSN, and also inter FPGA communication, to combine multiple FPGAs to form a larger hybrid hardware system. This feature is also a novelty, like the solution to the granularity problem and the platform independence of configware.
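The granularity argument above can be made concrete with a small Python sketch. It compares a floorplan whose RMs are all sized for the largest component with a floorplan of small RMs over which larger components are split. All names, the abstract area units and the example numbers are assumptions of this sketch; in particular, it ignores the hardware overhead of the CSN that the thesis mentions.

```python
import math


def uniform_rms(fpga_area, components):
    """Every RM is as large as the largest component; one component per RM.

    Returns (number of resident components, utilisation of the occupied RMs).
    """
    rm_size = max(components)
    n_rms = fpga_area // rm_size
    placed = components[:n_rms]          # first come, first served
    occupied = len(placed) * rm_size
    utilisation = sum(placed) / occupied if occupied else 0.0
    return len(placed), utilisation


def split_over_small_rms(fpga_area, components, rm_size):
    """Small RMs; a component of area A occupies ceil(A / rm_size) RMs."""
    free = fpga_area // rm_size
    occupied = placed_area = count = 0
    for area in components:
        need = math.ceil(area / rm_size)
        if need <= free:
            free -= need
            occupied += need
            placed_area += area
            count += 1
    utilisation = placed_area / (occupied * rm_size) if occupied else 0.0
    return count, utilisation


if __name__ == "__main__":
    # Hypothetical component areas in abstract units (illustrative only).
    comps = [400, 100, 100, 50, 50]
    print(uniform_rms(1000, comps))
    print(split_over_small_rms(1000, comps, rm_size=50))
```

With these made-up numbers the uniform floorplan offers only two modules, while the fine-grained floorplan can hold all five components at once. A real design would have to subtract the area spent on the interconnect.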

1.4 Thesis Structure

The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration are introduced in Chapter 2, and some example Reconfigurable Systems (RSs) related to the MRP are presented in Chapter 3. The MRP uses two different kinds of Networks On Chip (NOCs), the CSN and the OCSN. Chapter 4 introduces the principles of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a network classification based on work done by Schwederski et al.[21] and Feng[22]. Some important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains the granularity problem of runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7. It introduces the CSN, the OCSN and the design of the RMs. Chapter 8 describes the implementation of the MRP in more detail. Because the MRP is designed as a hybrid system, it needs support from the Operating System (OS). The required device drivers are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple differently sized computing elements, is presented in Chapter 10. It evaluates the MRP according to area usage, maximum clock speed and example implementations. The conclusion of the thesis results and an outlook on future work are given in Chapter 11.


2 Reconfiguration Fundamentals

Reconfigurable hardware describes some kind of electronic circuit whose Boolean function can be changed or reconfigured after production of the circuit. Such hardware supports the creation of variable and specialised components the moment they are required. Different approaches exist to build basic elements of reconfigurable hardware. These basic elements can be combined to form larger systems and are produced as ICs, such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs) and Programmable Array Logics (PALs). The most important difference between these systems is their basic reconfigurable component. FPGAs are built out of LookUpTables (LUTs), while PLAs, PALs and CPLDs use arrays of and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses multiplexers. All the reconfigurable ICs can be used to build RSs or hybrid hardware systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is called Reconfigurable Computing (RC). The following sections give a short introduction to reconfigurable hardware. Compton et al.[23] provide a more detailed overview of reconfigurable hardware and related software.

2.1 Matrix Approach

The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example matrix.

Figure 2.1: and/or Matrix (inputs a, b, c, d with their negations and the constants 1 and 0; three and-gates; outputs y0 and y1)

On the left side, the and-matrix prepares the connection of the input signals, the negated input signals, a zero and a one signal to some and-gates. None of the vertical signals are connected to the horizontal ones at the moment. The intersections of these signals are connected to a programmable switch, such as an electronic fuse or a Static Random Access Memory (SRAM) cell. An electronic fuse makes the matrix one-time programmable, while SRAM or other memory types make it programmable multiple times. On the right side, the or-matrix prepares the connection of the and-gates to some or-gates. The intersections of the signals are used in the same way as the intersections of the and-matrix. To configure a Boolean function of type $f : B^n \to B$ into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A DNF is the normalisation of a logical function, written as a disjunction of conjunctive clauses. Every logical function without quantifiers can be converted to DNF[24].

Table 2.1: Truth table of a Halfadder

a  b  S  Cout
0  0  0  0
0  1  1  0
1  0  1  0
1  1  0  1

Figure 2.2: Halfadder implemented in an and/or Matrix (inputs a, b, c, d with their negations and the constants 1 and 0; three and-gates; outputs Cout and S)

Figure 2.2 displays an example implementation of a half adder with the truth table given in Table 2.1. The formulas for S and Cout can be read out of the truth table:

$S = (a \land \lnot b) \lor (\lnot a \land b), \quad C_{out} = a \land b$
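The half adder formulas for S and Cout can be mapped onto an and/or matrix as in the following minimal Python sketch. The data layout (a list of product terms for the and-matrix and a dictionary of term indices for the or-matrix) is an assumption chosen for illustration; it only models which intersections are connected, not a concrete device.

```python
def and_or_matrix(and_plane, or_plane, inputs):
    """Evaluate a programmed and/or matrix.

    and_plane: list of product terms; each term is a list of
               (signal name, polarity) connections, polarity False = negated.
    or_plane:  maps each output name to the indices of the product terms
               connected to its or-gate.
    inputs:    maps signal names to bool values.
    """
    products = [
        all(inputs[name] == polarity for name, polarity in term)
        for term in and_plane
    ]
    return {out: any(products[i] for i in terms)
            for out, terms in or_plane.items()}


# Programming for the half adder in DNF (three and-gates, two or-gates).
AND_PLANE = [
    [("a", True), ("b", False)],   # a AND (NOT b)
    [("a", False), ("b", True)],   # (NOT a) AND b
    [("a", True), ("b", True)],    # a AND b
]
OR_PLANE = {"S": [0, 1], "Cout": [2]}


def half_adder(a, b):
    out = and_or_matrix(AND_PLANE, OR_PLANE, {"a": bool(a), "b": bool(b)})
    return int(out["S"]), int(out["Cout"])


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, half_adder(a, b))
```

Reprogramming the matrix in this model means replacing `AND_PLANE` and `OR_PLANE`, which corresponds to changing the connections at the intersection points.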

Both are in DNF and can be directly implemented in an and/or matrix. The nodes in Figure 2.2 represent connections at the intersection points of the signals. Three variants of the matrix approach exist:

- Both the and-matrix and the or-matrix are programmable.
- Only the and-matrix is programmable; the or-matrix has a fixed programming.
- Only the or-matrix is programmable; the and-matrix has a fixed programming.

Different ICs use different variants of the matrix approach.

2.2 Multiplexer Approach

A multiplexer is a small digital selector device. It routes one of $n$ input signals to its output. The number of input signals depends on the number of selection signals: with $x$ selection signals, the multiplexer can process $2^x$ input signals. Figure 2.3 shows a 4 to 1 multiplexer with data inputs $e_0 \ldots e_3$ and selection inputs $s_0$ and $s_1$.

Figure 2.3: 4 to 1 Multiplexer

Simple Boolean functions $f : B \times B \to B$ can be built out of this multiplexer by using $s_0$ and $s_1$ as the input variables and assigning the results of the function to the data inputs. Table 2.2 shows how to implement different two-input logic functions $f(s_0, s_1)$ with a multiplexer.

Table 2.2: Different Boolean functions implemented with a 4 to 1 multiplexer (columns: $e_0$, $e_1$, $e_2$, $e_3$, function)

To make this approach reconfigurable to different Boolean functions, FlipFlops (FFs) can be connected to $e_0, \ldots, e_3$. By saving new values into these FFs, different functions can be configured.

Figure 2.4: Cascaded 4 to 1 Multiplexer

This pattern can be extended to implement functions of type $f : B^n \to B$ by cascading multiplexers. An example is given in Figure 2.4, which makes two additional input variables available: $s_2$ and $s_3$. However, this pattern does not scale because for every two additional input variables the required number of multiplexers roughly quadruples (1, 5, 21, ...). Another method to increase the number of input variables is to increase the number of selection signals of a single multiplexer, but this does not scale either due to signal fanning: for $x$ selection signals, $2^x$ input signals are required. Functions of type $f : B^n \to B^m$ have to be split into $m$ functions of type $f : B^n \to B$ to be implementable with the multiplexer pattern.

2.3 Look Up Table Approach

A better solution for implementing reconfigurable functions of type $f : B^n \to B$ is to use a small RAM or LUT. The address signals of the RAM are used as the input parameters and the data words hold the results of the function. Table 2.3 displays the implementation of simple Boolean functions in a LUT with an address width of three and a data width of eight. Because only two operands are required for these operations, $a_1$ and $a_2$ are selected as the input variables. The results are encoded in the data word, starting with the leftmost bit for the first function. The LUT approach supports the calculation of multiple functions of type $f : B^n \to B$ concurrently by using different bits of the data word as the results. This approach is better suited to the calculation of $f : B^n \to B^m$ functions than the other presented approaches because it only requires one LUT, as long as $m$ is less than or equal to the size of one data word. For functions with $m$ greater than the size of one data word, LUTs can easily be chained together.
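The multiplexer and LUT approaches can be compared in a short Python sketch. It is a minimal model under stated assumptions, not the thesis implementation: AND, OR and XOR are picked here purely for illustration, the selection code is read as $s_0 s_1$ with $s_0$ as the high bit, $a_0$ is taken as the high address bit and ignored by the functions, and all names (`mux4`, `MUX_CONFIGS`, `build_lut`, `lut_read`) are hypothetical.

```python
def mux4(e, s0, s1):
    """4 to 1 multiplexer: the selection code s0 s1 picks one data input."""
    return e[(s0 << 1) | s1]


# Values stored in the flip-flops at e0..e3 for three illustrative functions.
MUX_CONFIGS = {
    "and": (0, 0, 0, 1),
    "or": (0, 1, 1, 1),
    "xor": (0, 1, 1, 0),
}


def build_lut(functions, addr_bits=3, data_bits=8):
    """Fill a small RAM with one data word per address.

    Bit k, counted from the left of the data word, holds the result of
    functions[k] applied to the address bits a1 and a2 (a0 is ignored).
    """
    lut = []
    for addr in range(1 << addr_bits):
        a1, a2 = (addr >> 1) & 1, addr & 1
        word = 0
        for k, fn in enumerate(functions):
            word |= (fn(a1, a2) & 1) << (data_bits - 1 - k)
        lut.append(word)
    return lut


def lut_read(lut, a0, a1, a2, k, data_bits=8):
    """Read the result of function k for the address a0 a1 a2."""
    word = lut[(a0 << 2) | (a1 << 1) | a2]
    return (word >> (data_bits - 1 - k)) & 1


FUNCTIONS = [lambda x, y: x & y, lambda x, y: x | y, lambda x, y: x ^ y]
LUT = build_lut(FUNCTIONS)
```

One multiplexer configuration yields a single function, whereas a single LUT read returns the results of all stored functions at once, one per data bit.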

Table 2.3: Example LUT implementing three Boolean functions (columns: $a_0$, $a_1$, $a_2$, Dataword (8 bit))

2.4 Field Programmable Gate Arrays

To extend Boolean functions, as explained in the previous sections, to Finite State Machines (FSMs) or even more complex circuits, memory and interconnects are necessary. Many ICs provide the required resources to configure digital circuits, such as FPGAs, PLAs, CPLDs and PALs. This section describes the general structure of FPGAs because they are used for the prototype system in this thesis. Many books provide this information; this section is based on the book by Urbanski et al. [25].

In contrast to the name, an FPGA is not an array of gates, but an array of configurable basic elements, namely Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM (BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the basic FPGA structure with CLBs and IOBs, and without interconnects. They are organised in an array structure to simplify the interconnection of the blocks.

Figure 2.5: Simple structure of an FPGA without interconnects

All components of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs. The following information is taken from the Xilinx Virtex5 User Guide[3].

Input/Output Blocks

IOBs are the interface from the configured hardware to the input and output pins of the FPGA. They are also configurable by the developer to support different voltage levels and input/output signal standards, such as Low-Voltage Transistor-Transistor Logic (LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic (HSTL).

Configurable Logic Blocks

CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays the structure of two CLBs.

Figure 2.6: Structure of two Virtex5 CLBs[3]

The switch matrix is already part of the FPGA's interconnection network. One CLB consists of two slices. These slices are tightly interconnected through carry lines to increase the operand size of Boolean functions. In addition, two CLBs at a time are connected through a shift line to form large shift registers.

Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs, four storage elements, wide-function multiplexers, and carry logic[3]. The LUTs have six independent inputs and two independent outputs. This structure supports the configuration of one Boolean function of type $f : B^6 \to B$ or two Boolean functions of type $f : B^5 \to B$ if the two functions share the same input parameters. Three multiplexers are connected to the four LUTs in one slice; they allow LUTs to be combined to increase the number of possible inputs to seven or eight. Functions with more inputs are implemented by combining slices.

D-type FFs provide storage functionality within each slice. Their inputs can be driven directly from a LUT. Some special slices provide more storage capacity by merging LUTs into a small RAM. Different merging strategies are supported.
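The two operating modes of such a six-input LUT can be sketched in Python. This is a simplified model and not taken from the user guide: the 64-bit table layout, with the first five-input function in the lower 32 bits and the second in the upper 32 bits, is an assumption made for illustration, and the names `truth_table`, `Lut6`, `and5` and `xor5` are hypothetical.

```python
def truth_table(fn, n_inputs):
    """Pack fn into an integer: bit `addr` holds fn applied to the bits of addr."""
    table = 0
    for addr in range(1 << n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]
        if fn(*bits):
            table |= 1 << addr
    return table


class Lut6:
    """Six-input LUT modelled as a 64-bit truth table (layout is an assumption)."""

    def __init__(self, init):
        self.init = init & ((1 << 64) - 1)

    def read(self, bits):
        """Single-function mode: six input bits select one of 64 stored bits."""
        addr = sum(bit << i for i, bit in enumerate(bits))
        return (self.init >> addr) & 1

    def read_dual(self, bits5):
        """Dual mode: five shared inputs address the lower and the upper half."""
        bits5 = list(bits5)
        return self.read(bits5 + [0]), self.read(bits5 + [1])


def and5(*bits):
    return int(all(bits))


def xor5(*bits):
    return sum(bits) & 1


# Two five-input functions sharing their inputs, stored in one LUT.
DUAL = Lut6(truth_table(and5, 5) | (truth_table(xor5, 5) << 32))
# One six-input function (parity) occupying the whole LUT.
PARITY6 = Lut6(truth_table(lambda *b: sum(b) & 1, 6))
```

The sketch shows why the two five-input functions must share their inputs: both results are read from the same table using the same five address bits.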

Block RAM

FPGAs provide BRAM to supply reconfigurable hardware with fast and area-inexpensive RAM. On Xilinx Virtex5 FPGAs BRAM is provided in 36 Kbit blocks, which are placed in columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5 devices the available BRAM ranges from 144 kbytes up to 2321 kbytes. BRAM can be used as single port RAM, dual port RAM, or as First In First Out (FIFO) queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues, reducing the space requirements of the reconfigurable hardware. Access times for BRAM are very short compared to off-chip Double Data Rate (DDR) RAM. A data word is available one clock tick after issuing the address to the RAM, making BRAM a good choice for fast buffers or caches.

Special IO Components

Reconfigurable hardware often requires special I/O components, such as Ethernet, Serial Advanced Technology Attachment (SATA), PCI, etc. Implementing these I/O components in reconfigurable hardware is possible, but requires much FPGA space. Therefore, FPGAs provide some special non-reconfigurable I/O hardware. This hardware implements common parts of I/O devices, which can be used to create the required components. The Virtex5 FPGA family supports Ethernet MACs and RocketIO GTP Transceivers. Ethernet MACs reduce the area usage of Ethernet devices because they implement the Media Access Control (MAC) layer of the Ethernet protocol. RocketIO GTP Transceivers provide general components for high speed serial I/O, such as 8b/10b encoders/decoders and fast serialisers and deserialisers. These transceivers can be used to implement the physical layer of the PCI Express or SATA bus. The correct working mode can be set through special instructions in the Hardware Description Language (HDL).

Interconnection Network

The interconnection network and the CLBs are the most important parts of the FPGA. Without the interconnection network the CLBs cannot be combined and larger components cannot exchange data. FPGAs distinguish three different signal types, which have to be routed through the interconnection network with different priorities and signal latencies.

clock signals: Clock signals require a fast distribution throughout the FPGA because they synchronise all the components to their rising or falling edge.

reset signals: Reset signals are similar to clock signals. Through reset signals components are initialised at the same moment. This also requires a fast distribution throughout the FPGA.

I/O signals: For I/O signals a fast distribution is also important; in addition, the maximum clock rate a design can work at is calculated from the I/O signal line latencies.

Another important requirement for I/O signals is their number. A normal design only has around one to three different clock signals and as many reset signals, but the number of I/O signals is very large. Therefore, FPGAs provide two different interconnection networks: one for clock and reset signals and one for all the I/O signals required to exchange data between components.

2.5 Partial Reconfiguration

PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs[2]. It extends the normal configuration possibilities of FPGAs with the ability to modify parts of a running configuration without interrupting the computation. The design is divided into a static and a reconfigurable part during development. In the static part, special entities called reconfiguration modules are defined, which hold the reconfigurable components. This definition includes a signal interface declaration for communicating with the static part. There can be different reconfiguration modules in one design, each with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component that should be configurable into one module.

Figure 2.7: Simple PR example[2]

The synthesis process creates several FPGA configuration files. The main file includes the static design and a component for each instance of a reconfiguration module. For every component and every instance an additional partial configuration file is created. These files can be loaded into the FPGA after the main file to reconfigure certain reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable system. It features two reconfiguration module instances and four partial configuration files per module. Components can only be configured into the RMs for which they have been synthesised, placed, and routed.
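The restriction that a partial configuration file only fits the RM it was built for can be modelled in a few lines of Python. This is an illustrative sketch only: the file names follow the labels in Figure 2.7, while the classes `PartialBitstream` and `Fpga`, the component names and the `reconfigure` method are hypothetical and do not correspond to any Xilinx tool or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartialBitstream:
    filename: str    # e.g. "RM02.bit", as labelled in Figure 2.7
    target_rm: str   # RM instance the file was synthesised, placed, routed for
    component: str   # hypothetical name of the component it contains


@dataclass
class Fpga:
    rm_instances: tuple
    loaded: dict = field(default_factory=dict)

    def reconfigure(self, rm, bitstream):
        """Load a partial bitstream into one RM while the rest keeps running."""
        if rm not in self.rm_instances:
            raise KeyError(f"unknown reconfiguration module {rm}")
        if bitstream.target_rm != rm:
            raise ValueError(f"{bitstream.filename} was not built for {rm}")
        self.loaded[rm] = bitstream.component


# Two RM instances with four partial configuration files each.
BITSTREAMS = {
    f"RM{rm}{c}.bit": PartialBitstream(f"RM{rm}{c}.bit", f"RM{rm}", f"component{c}")
    for rm in (0, 1)
    for c in range(4)
}

fpga = Fpga(rm_instances=("RM0", "RM1"))
fpga.reconfigure("RM0", BITSTREAMS["RM02.bit"])
```

In this model the same component needs one file per RM instance, which mirrors the "every component and every instance" rule of the design flow.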

3 Example Reconfigurable Systems

3.1 Research Systems

RampSoC

A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during runtime by exploiting dynamically and partially reconfigurable hardware[4]. A special design flow is used, which combines the top-down and the bottom-up approach. The bottom-up approach is used at design time to set up the basic conditions of a RampSoC according to the problem space it should be used in. In the top-down approach the software is optimised for this initial setup. Parts of this initial setup can be reconfigured to meet arising needs of applications during runtime, such as a different processor core or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at some point in time. Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.

Figure 3.1: Example RAMPSoC configuration[4]

The implementation of a RampSoC is done using the early access PR concept of Xilinx. This design flow is no longer supported by the Xilinx toolchain. The early access PR design flow requires that reconfigurable modules are defined before synthesis of the project. To reconfigure different cores, accelerators and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core gets its own reconfiguration module or whether the biggest core size is selected as the size of the reconfiguration unit, balancing space exploitation against flexibility. The RampSoC approach uses proprietary processor cores, such as Pico- and Microblaze cores from Xilinx. Accelerator units are connected to these cores and can change their hardware function while the processor is executing a program.

The RampSoC approach is a very flexible improvement compared to normal multicore processors or MPSoCs. Its heterogeneous structure allows the optimal execution of applications with different hardware requirements and can adapt to application needs during runtime very easily. Processor cores can even be replaced by special FSMs supporting calculations in special hardware components.

PRHS

The Partial Reconfiguration Heterogenous System (PRHS) developed by Eckert[5] also uses reconfiguration to exploit the newly available space on ICs. The PRHS is a softcore SoC configured onto an FPGA. It features one RM of the Xilinx PR design flow, into which different hardware components can be configured. The RM can accelerate computations on the SoC, but its main purpose is virtualisation. Virtualisation in this case means the instantiation of a full SoC running under the supervision of the static core. The virtualised SoC also runs Linux as its OS. Figure 3.2 displays this scenario. The static system on the right is running Linux as its OS. It has full access to memory and to memory mapped IO hardware components like Universal Asynchronous Receiver/Transmitters (UARTs) or timers. On the left an RM is available and connected to the static system. The SoC configured into this RM at runtime has only partial access to the memory. The accessible memory space is configured from the static system before the virtualised system is started. A memory mapped IO component interconnects the RM and the static system. It supports starting and stopping the virtualised system, but not suspending it. Providing a virtualised hard disk to the reconfigurable system is another feature of the static system.

The PRHS is an interesting way of using tightly coupled reconfigurable hardware from a static processor core. The virtualised processor cores can feature different ISAs and run without performance losses compared to the static processor core.

Figure 3.2: PRHS System Overview[5]

Dreams

Dreams is not itself an RS, but a tool to build runtime reconfigurable systems. It processes Xilinx Description Language (XDL) files created by the Xilinx tools and provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow forces the developer to run the synthesis, place, and route process for every RM and every implementation of a module, the Dreams design flow does not. It supports easy relocation of modules that have been synthesised, placed and routed only once. XDL is a human readable language for describing netlists. It is compatible with the ncd netlist file format and Xilinx provides programs for easy conversion. Dreams is developed by Otera et al.[26]. It tries to improve the Xilinx design flow in four different ways:

1. Module relocation in any compatible region in the device
2. Independent design of modules and the static system
3. Hiding low level details from the designer
4. Enhanced module portability among different reconfigurable devices

Its design flow targets reconfigurable architectures built out of disjoint rectangular regions. The system architecture enforced by the Dreams tool is divided into Virtual Regions (VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as an RM or static module. The VA describes the full system, including the static and reconfigurable parts and how they are interconnected using the FPGA's interconnect. The developer provides the VR and VA descriptions as Extensible Markup Language (XML) files.

Dreams is a very interesting tool. Very large reconfigurable systems suffer from very long placement and routing times in the Xilinx PR design flow. Dreams could significantly reduce these times and thereby shorten the development time of such systems.

3.2 Commercial Systems

Convey HC1

One commercially available RS is the Convey HC1[6]. It combines four Xilinx Virtex5 FPGAs with an Intel Xeon processor through the X86 co-processor interface. Figure 3.3 gives an overview of this architecture. The system contains two memories, one connected to the processor cores and another one connected to the four FPGAs. Both are accessible from the processor and the FPGA side. Hardware ensures cache coherency between them. The memory on the FPGA side is specially partitioned to support concurrent access to different memory banks from different FPGAs, which increases the overall memory access speed.

Figure 3.3: Overview of the Convey HC1 architecture[6]

Communication with the FPGAs is implemented using the coprocessor interface of Intel processors. Software running on the Xeon processor can trigger hardware operations running on one of the FPGAs by issuing special coprocessor instructions and writing the data required for the operation to special memory regions. Programs can change configurations in idle times of the FPGA. The Xilinx PR design flow is available in principle, but is not yet supported by Convey, which leads to long reconfiguration latencies and very fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high performance computing. In high performance computing the accelerator hardware seldom changes and one important factor is memory access. Memory access is very fast on the HC1 because of its special memory layout.

Intel Stellarton

Another commercial RS is the Intel Stellarton processor and FPGA SoC[14]. It combines a standard Intel Atom processor core with an Altera FPGA on the same chip, but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC contains all the standard components of the Intel Atom processor, like the DDR interface, graphics adaptor/accelerator, audio component and Peripheral Component Interconnect Express (PCIe) bus interface. The Altera FPGA[27] is connected to the processor by this PCIe bus. Through this bus the FPGA is configurable and application data can be exchanged between FPGA and processor. The main purpose of this RS was to improve the performance of host programs by accelerator hardware. The production of the system has been discontinued, but a new approach by Intel seems to be on its way, according to Diane Bryant[28]. According to her, Intel is working on combining their Xeon server processors with FPGAs to improve the performance of internet cloud services, such as Ebay, Amazon, etc.

Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA

Xilinx Zynq Architecture

Zynq[7] is a very new hybrid hardware system produced by Xilinx. It features a dual ARM Cortex A9 processor core connected to many peripherals and an FPGA through an Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall system structure. Like the Intel Stellarton processor, processor core and FPGA share the same chip, but not the same die. Zynq provides many static hardware components to connect to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller, a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN) controller. The AMBA bus, which connects the FPGA to the processor, is very common in embedded devices. It supports general-purpose ports and high performance ports from the processor to the FPGA. The FPGA has access to high speed serial I/O transceivers going off-chip and to the AMBA bus. All other features of a Virtex7 FPGA are also supported, including PR.

The Zynq architecture is an interesting system for embedded hardware developers. A standard embedded OS can run on the ARM processor cores, and the FPGA can improve the calculation performance for special applications, like audio and video editing, radio transmissions, and cryptographic algorithms.

Figure 3.5: Structure of the Xilinx Zynq architecture[7]

3.3 COPACOBANA and RIVYERA

Figure 3.6: COPACOBANA and RIVYERA interconnection overview

The COPACOBANA and RIVYERA systems developed by SciEngines are hybrid hardware systems optimised for cryptanalysis and scientific computing. Both systems consist of many interconnected FPGAs working together to solve a problem. The host system is connected through 10Gbit Ethernet cards, 4Gb Fibre Channel cards, or InfiniBand. The COPACOBANA can try the complete 56-bit DES key space within 12.8 days. The RIVYERA is the successor of the COPACOBANA.

4 Interconnection Networks

Modern hardware design often requires the development of several interconnected components. Different interconnection network schemes are available today. If more tightly coupled systems are required, these components are combined on a single chip. Such a tightly connected system is called a SoC. Figure 4.1 displays an example mobile phone system with three different interconnection schemes. This system can be developed as a multi-chip system or as a SoC. The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a radio transceiver. These components interact in different ways to get the mobile phone running. The interactions can be implemented using different kinds of interconnection networks.

Figure 4.1: Example mobile phone SystemOnChip (SoC): a) bus connection, b) P2P connection, c) NOC connection

Figure 4.1 shows three possible topologies. In a) all components are connected to a bus with the typical bus communication restrictions, such as exclusive bus access for a single component and poor scalability. In b) every component is directly connected to all components it interacts with. This network topology supports very flexible communication, but requires many interconnection links. The last displayed topology is a packet switched network built out of the components and switches. Networks of this kind are called NOCs. NOCs are very similar to the communication infrastructure of inter-computer networks, such as Local Area Networks (LANs) or Wide Area Networks (WANs).

Many more network architectures exist. To distinguish these networks and to easily highlight their differences and performance properties, a classification is necessary. This work uses part of the classification by Schwederski et al. [21], which is based on research by Feng[22].

The base for a classification is usually a mathematical representation of the entity of interest. In this case finite graphs are a good representation of interconnection networks. The edges of the graph model the interconnection links and the nodes are the Processing Elements (PEs) connected to the network. A PE is a component doing calculations and using the network for communication purposes, such as a processor core, a DSP, or some other kind of device controller.

This chapter is organised as follows: Section 4.1 describes the OSI model. It is an industry-standard model for different communication protocols, simplifying their development. The distinguishing characteristics of NOCs are explained and described in the sections from Section 4.2 onwards.

4.1 Open Systems Interconnection Model

Communication systems mostly consist of more than just two communication partners. These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach its destination and the underlying infrastructure can differ from node to node because of different responsibilities. The transmitted data can be divided into a header, which contains source and destination addresses, payload size and quality of service information, and the actual payload. The position of the header data and the payload has to be defined to help every developer and manufacturer to produce compatible hardware. Later in this work, protocols will be described using the terminology of the OSI model.

The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO)[29] developed the OSI model to simplify the definition of communication protocols. Seven functionally distinct layers divide the communication process. Figure 4.2 gives a graphical representation of these layers and the expected protocol flow. The flow starts at either side of the network stack. If some data shall be transmitted to another communication partner, the communication usually starts at the application layer. Every layer processes the data and passes it down to the next layer until the physical layer is reached. Each layer adds header information or transforms the data according to the network requirements. Sometimes control messages are created, passed down the layers and sent to the corresponding layer at the next communication partner to create a virtual connection between them. The physical layer transmits the data through some kind of medium (wire, air, fibre optic, ...) to the next node. After the transmission, the data is passed up the layers. If the node is just an intermediate one, the data moves up to the network layer, where it gets formatted for the transmission to the next node. If the data has arrived at its destination, it gets passed up to the application layer. In the following sections each of the seven layers is briefly described. More information about the OSI model can be found in [29] or [30].
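The protocol flow described above can be illustrated with a small Python sketch. It is a deliberately simplified model and not a real protocol stack: every layer only prepends a textual tag as its header, the physical layer is reduced to handing over the resulting bytes, and the names `LAYERS`, `header`, `send`, `strip_header`, `forward` and `receive` are hypothetical.

```python
# Layers above the physical layer, in the order the data passes them on sending.
LAYERS = ("application", "presentation", "session",
          "transport", "network", "data link")


def header(layer):
    """Placeholder header of one layer (a real header carries addresses etc.)."""
    return f"<{layer}>".encode()


def send(payload):
    """Pass the payload down the stack; every layer prepends its header."""
    data = payload
    for layer in LAYERS:
        data = header(layer) + data
    return data  # handed to the physical layer for transmission


def strip_header(data, layer):
    """Remove the header of one layer, checking that it is really present."""
    expected = header(layer)
    if not data.startswith(expected):
        raise ValueError(f"missing {layer} header")
    return data[len(expected):]


def forward(frame):
    """Intermediate node: go up to the network layer, then down again."""
    packet = strip_header(frame, "data link")
    if not packet.startswith(header("network")):
        raise ValueError("missing network header")
    return header("data link") + packet


def receive(frame):
    """Destination node: pass the data up to the application layer."""
    data = frame
    for layer in reversed(LAYERS):
        data = strip_header(data, layer)
    return data
```

A message sent with `send`, relayed by `forward` and unpacked by `receive` arrives unchanged, and the intermediate node never looks above the network header.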

Figure 4.2: Graphical representation of the ISO/OSI Model

Application Layer

The application layer is the interface between a program or application running on a PE and the communication infrastructure. It defines the interaction between two or more communication partners, such as how to request some data or how to send data to the partner. For this interaction the application does not require any information about the underlying network; the destination address is enough. Very common application layer protocols used in the Internet are the Hypertext Transfer Protocol (HTTP) and the Post Office Protocol Version 3 (POP3).

Presentation Layer

Data can be represented in multiple forms. For example, some processor cores use big endian and others little endian byte encoding for working with structures bigger than one byte. A higher level form is the language encoding with ISO codes or UTF-8. To allow the application layer to just use the passed data, the presentation layer converts and transforms the data to the required representation. The presentation layer can also be used to implement point to point encryption.

Session Layer

A communication session consists of the connection establishment, the transmission and reception of multiple data items, and the detachment of the connection.

Not every communication requires the establishment of a session. For example, in a network where all information is broadcast to every network member, it is not possible to establish a session. Sessions are always necessary if multiple requests belonging to the same context have to be transmitted. The session layer is responsible for establishing the connection before the data of a session is transmitted and for tearing down the connection when the session is finished.

Transport Layer

The transport layer defines at least one protocol or method for transmitting data to another node in the network. This protocol can be connectionless or connection oriented. For a connection oriented protocol the connection establishment, the data transmission and the connection tear down have to be described. In this case the data transmission ensures the reception of the data at the communication endpoint. For a connectionless protocol only the data transmission is required, without acknowledgement of receipt. Well known transport layer protocols are the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP).

Network Layer

Networks can be built with different topologies. How data is transmitted from a start node to a destination node depends on this topology because it specifies whether nodes are directly connected or how many intermediate nodes exist between them. The network layer is responsible for defining routing and path finding algorithms for transmitting data between the network nodes. If necessary, it creates an abstraction layer over all network nodes with its own distinct address range. In this logical view the nodes seem to be directly connected. Common network layer protocols are IPv4 and IPv6.

Data Link Layer

The data-link layer is responsible for ensuring that the entities forming the network can communicate reliably with each other. If the underlying physical connection is not very robust, the data-link layer ensures error detection through some kind of checksum and, if possible, error correction. This is achieved by requesting a retransmission of the data from the data-link layer on the other communication side or by recalculating lost data. If the physical transmission has a maximum number of bits it can transmit at one time, the data-link layer arranges the framing of the data.

Physical Layer

The physical layer of the OSI model transmits data from one network entity to another one. The structure of the data is not important at this layer because just bits are transferred. The physical layer describes the electrical and physical specification for transmitting one bit. It determines the modulation of the data and which transfer medium is used. It offers the data-link layer an interface to transmit $x$ bits of data.
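The framing and error detection tasks of the data-link layer can be sketched in Python. This is a minimal illustration under stated assumptions: CRC-32 from the standard `zlib` module stands in for "some kind of checksum", the frame layout (payload followed by a four byte checksum) is invented for this example, and the `retransmit` callback is a hypothetical stand-in for asking the sending side to repeat a frame.

```python
import zlib


def make_frames(data, max_payload):
    """Split data into frames of at most max_payload bytes plus a CRC-32."""
    frames = []
    for offset in range(0, len(data), max_payload):
        chunk = data[offset:offset + max_payload]
        frames.append(chunk + zlib.crc32(chunk).to_bytes(4, "big"))
    return frames


def check_frame(frame):
    """Return the payload if the checksum matches, otherwise None."""
    chunk, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return chunk if zlib.crc32(chunk) == received_crc else None


def receive(frames, retransmit):
    """Reassemble the data, requesting one retransmission per damaged frame."""
    data = bytearray()
    for index, frame in enumerate(frames):
        chunk = check_frame(frame)
        if chunk is None:
            # Error detected: ask the other side to send the frame again.
            chunk = check_frame(retransmit(index))
            if chunk is None:
                raise IOError(f"frame {index} is still corrupt")
        data += chunk
    return bytes(data)
```

The sketch only covers error detection with retransmission; recalculating lost data would require an error-correcting code instead of a plain checksum.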

4.2 Topology

The physical layer of the OSI model describes how bits are transferred between network entities. These entities are organised in a specific structure, such as a star, ring or cube. This structure, represented by a finite graph, is called the network topology. Because the topology is a distinctive feature of a network and influences its performance significantly, the following classification properties are very important. For all properties we assume that the network $N$ has $n$ interconnected PEs numbered $pe_0, \dots, pe_{n-1}$.

Interconnection Type

The network entities can be interconnected in different ways when forming a network. The following values describe the interconnection type in this classification:

static: If entities are statically linked, the links cannot be changed during runtime of the network. The network has to be recreated to change them. Such a network is called a static network. An example of a static network is a ring.

dynamic: A dynamically linked network is called a dynamic network. It allows the alteration of connection links between two components during runtime of the network. A good example of a dynamic network is a bus. The address signals of a bus allow the selection of different communication partners.

direct: In a directly connected network (direct network) each network entity or PE is connected to at least one other network entity through fixed links. No other component is required to communicate with other entities. If data needs to be transferred through intermediate nodes to its destination, the network entities have to provide this functionality on their own. Figure 4.3 a) shows a direct network of five PEs.

indirect: The opposite of a directly connected network is an indirectly coupled one (indirect network). In this type of network the entities or PEs are connected through some kind of network infrastructure, for example a network switch or hub, which is responsible for data routing. The individual entities only possess uni- or bidirectional links to one network infrastructure component. Such a network is displayed in Figure 4.3 b).

[Figure 4.3: direct and indirect interconnection networks; a) direct net, b) indirect net]

combination: The properties mentioned above are mutually exclusive in pairs. A static network cannot be a dynamic network at the same time, and the same holds for direct and indirect networks. There could be special cases in which this is not true, but these will not be considered in this work. Combinations across the pairs are possible. For example, a static and indirect network is a very common case when looking at the interconnection of computer systems. Another example is a bus, which can be implemented as a dynamic and direct network.

Grade and Regularity

It is always important to know how much data can be transferred between PEs in parallel and whether this value is the same for all network entities. These values differ between network topologies. The grade $\Gamma$ of a PE is defined as:

$\Gamma(pe_i) = \text{number of connections of } pe_i \quad \text{for } i \in \{0, \dots, n-1\}$

The grade measures the density of interconnection links in a network. We define:

$\delta(N) = \min_{i \in \{0, \dots, n-1\}} \Gamma(pe_i)$ and $\Delta(N) = \max_{i \in \{0, \dots, n-1\}} \Gamma(pe_i)$

The term regularity describes whether the structure of the interconnection links is the same at all PEs of the network:

$N$ is $r$-regular if $\delta(N) = \Delta(N) = r$

This implies:

$\Gamma(pe_i) = r \quad \forall i \in \{0, \dots, n-1\}$

This characteristic is only important for direct networks because the PEs of an indirect network usually have just one bidirectional connection to an infrastructure element.
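The grade and regularity definitions above can be checked on small example networks. The following is a minimal sketch, assuming a network is encoded as a dictionary that maps each PE index to the set of PEs it is directly linked to; this encoding and all function names are illustrative and not part of the thesis.

```python
from typing import Dict, Optional, Set

# Hypothetical encoding: PE index -> set of directly linked PE indices.
Network = Dict[int, Set[int]]


def grade(net: Network, pe: int) -> int:
    """Gamma(pe_i): number of connections of one PE."""
    return len(net[pe])


def min_grade(net: Network) -> int:
    """delta(N): smallest grade of all PEs."""
    return min(grade(net, pe) for pe in net)


def max_grade(net: Network) -> int:
    """Delta(N): largest grade of all PEs."""
    return max(grade(net, pe) for pe in net)


def regularity(net: Network) -> Optional[int]:
    """Return r if the network is r-regular, otherwise None."""
    lo, hi = min_grade(net), max_grade(net)
    return lo if lo == hi else None


def ring(n: int) -> Network:
    """Bidirectional ring with n >= 3 PEs."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def chain(n: int) -> Network:
    """Open chain: the two end PEs have only one neighbour."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
```

Calling `regularity(ring(8))` yields 2, while `regularity(chain(8))` yields `None` because the end PEs have grade one and the inner PEs grade two.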

Diameter

The network diameter quantifies the maximum distance between network nodes. The classification by Schwederski et al. [21] defines the diameter for direct networks only. Because the diameter is such an important characteristic, this work extends it to indirect networks as well.

direct networks: Let $N$ be a direct network with $n$ nodes numbered $0, \dots, n-1$. Let $d_{a,b}$ be the minimum number of steps (connection links) between the nodes $a$ and $b$. The diameter is defined as:

$\Phi(N) = \max(d_{a,b}), \quad a, b \in N, \; 0 \le a < n, \; 0 \le b < n$

indirect networks: An indirectly coupled network consists of at least one level of coupling elements. These coupling elements take over the routing functions that the nodes perform in a direct network. Every node or PE in an indirect network has one connection to a coupling element. Let $N$ be an indirect network with $s$ levels of coupling elements and $n$ nodes numbered $0, \dots, n-1$. Let $a, b \in N$, with $a$ connected to coupling element $X$ and $b$ connected to coupling element $Y$. Let $d^C_{X,Y}$ be the minimum number of steps (connection links) between $X$ and $Y$. Now let $d_{a,b} = d^C_{X,Y} + 2$ be the minimum number of steps between the nodes $a$ and $b$. The diameter is again defined as:

$\Phi(N) = \max(d_{a,b}), \quad a, b \in N, \; 0 \le a < n, \; 0 \le b < n$

Dimension of the diameter: Sometimes it is not possible to calculate an exact number for the diameter. Still, it is important to know the dimension the diameter can take on. For this case we define:

$\Phi(N) = \Theta(f(n))$

for a function $f$ and a parameter $n$. This means that the diameter of the network grows with the function $f$ of the parameter $n$.

Bisection Width

We still have our network $N$ with $n$ PEs. To determine the bisection width, the network is partitioned into two halves and the minimum number of interconnection links between these halves is measured. The segmentation into $M_1$ and $M_2$ is done according to:

$|M_1| = \lceil n/2 \rceil$ PEs and $|M_2| = \lfloor n/2 \rfloor$ PEs.

The bisection width $W_k(M_1, M_2)$ of a single segmentation is given by:

$W_k(M_1, M_2) = \text{minimum number of interconnection links between } M_1 \text{ and } M_2$

The bisection width of the whole network $N$ is given by:

$W(N) = \min(W_k(M_1, M_2)) \quad \forall \text{ segmentations } M_1, M_2$

The bisection width is an important metric for the performance of networks because many algorithms require that the nodes of one half of the network communicate with corresponding nodes in the other half.

Symmetry

The symmetry of a network simplifies the writing of distributed algorithms. A network can be asymmetric, node-symmetric or link-symmetric. In a node-symmetric network, the network structure looks the same from every PE. This symmetry allows the deployment of the same algorithm to all PEs in the network. In a link-symmetric network the network looks identical from every link. This may simplify the scaling of the network. If the network is asymmetric, every PE has to be considered individually.

Scalability

After deployment of a network, whether it connects some small hardware components or complete computer systems, scalability is always very important. If a SoC is extended for a new revision, new components are added to the system and have to be integrated into the NOC. If the NOC is not scalable, integrating the component will be a very big problem, possibly leading to a complete redesign of the system. A network is scalable if:

1. the topology mostly stays the same if a new component is integrated. In the best case all existing connections and nodes stay fixed and only the new connections for the PE have to be appended.
2. the communication performance does not suffer when the number of nodes increases.
3. the increase of the network complexity is limited.

4.3 Interface Structure

The interface is the bridge between one PE and the network. Its structure determines the communication between PEs. The requirements for such an interface differ between direct and indirect networks, but the implementation varies within each network type too.
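Returning briefly to the topology metrics of Section 4.2: for small direct networks the diameter and the bisection width can be determined by brute force, which is a convenient way to check hand-derived values. The sketch below assumes the same illustrative adjacency-set encoding as before (node index mapped to the set of directly linked nodes); the function names are hypothetical. The bisection search enumerates all segmentations and is therefore only practical for a handful of nodes.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Network = Dict[int, Set[int]]  # hypothetical encoding: node -> directly linked nodes


def distances_from(net: Network, start: int) -> Dict[int, int]:
    """Breadth-first search: minimum number of links from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in net[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(net: Network) -> int:
    """Phi(N): maximum of d_{a,b} over all node pairs of a connected direct network."""
    return max(max(distances_from(net, a).values()) for a in net)


def bisection_width(net: Network) -> int:
    """W(N): fewest links cut over all splits into ceil(n/2) and floor(n/2) nodes."""
    nodes = sorted(net)
    half = (len(nodes) + 1) // 2
    best = None
    for m1 in combinations(nodes, half):
        m1_set = set(m1)
        # Count every link that leaves M1; each crossing link is seen once.
        cut = sum(1 for a in m1_set for b in net[a] if b not in m1_set)
        best = cut if best is None else min(best, cut)
    return best


def ring(n: int) -> Network:
    """Bidirectional ring with n >= 3 nodes."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
```

For a bidirectional ring with eight nodes this gives a diameter of 4 and a bisection width of 2.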

Direct Networks

The requirements for interfaces in direct networks are manifold because the PEs are directly responsible for the network access. The interfaces in a direct network have to implement the wire selection, path finding and data forwarding algorithms. These tasks require a lot of hardware, such as multiplexers for selecting the correct path or buffers to store data before forwarding it.

Indirect Networks

Interfaces in indirect networks are normally very simple because one PE has only one bidirectional connection to the network. The interface does not require any complex multiplexer or router functionality. The hardware just transmits data to and receives data from a network infrastructure component. At most a small buffer is necessary.

4.4 Operating Mode

The operating mode of a network refers to the connection establishment and the data transmission of PEs. Both tasks can be executed synchronously or asynchronously.

Synchronous Connection Establishment

In this operating mode all PEs establish their network connection or communication link at the same time. The exact point in time is synchronised by a global clock signal.

Synchronous Data Transmission

Data designated for transmission can be divided into individual bits or groups of bits, such as one byte. These groups are transmitted on a tick of a global clock, so every network interface transmits its own group of bits at the same time.

Asynchronous Connection Establishment

The PEs need not wait for a specific global clock signal or a number of clock ticks to be allowed to establish communication. It can happen at any clock tick.

Asynchronous Data Transmission

As with synchronous data transmission, the data can be divided into groups of bits. In this case, however, handshake protocols are used to ensure the transmission of the data. For example, the sender is only allowed to put the next group of bits onto the transmission line if the receiver has acknowledged the reception of the current group.
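The acknowledgement rule described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Line` class with its `valid` and `ack` flags, the step functions and the `receiver_every` parameter are hypothetical names chosen for this example and do not describe a concrete hardware protocol from the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical transmission line: one data slot and two handshake flags."""
    data: Optional[int] = None
    valid: bool = False  # set by the sender: a new group is on the line
    ack: bool = False    # set by the receiver: the current group was taken


def sender_step(line: Line, pending: List[int]) -> None:
    """Put the next group on the line only after the previous one was acknowledged."""
    if line.valid and line.ack:
        # The receiver has taken the current group: release the line.
        line.valid = False
        line.ack = False
    elif not line.valid and pending:
        line.data = pending.pop(0)
        line.valid = True


def receiver_step(line: Line, received: List[int]) -> None:
    """Take a group from the line and acknowledge it."""
    if line.valid and not line.ack:
        received.append(line.data)
        line.ack = True


def transfer(groups: List[int], receiver_every: int = 3) -> List[int]:
    """Run both sides; the receiver only acts every `receiver_every` steps."""
    line, pending, received = Line(), list(groups), []
    step = 0
    while pending or line.valid:
        sender_step(line, pending)
        if step % receiver_every == 0:
            receiver_step(line, received)
        step += 1
    return received
```

Because the sender waits for the acknowledgement, no group is lost or duplicated even when the receiver is slower than the sender; the price is that the slower side determines the transfer rate.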

Mixed Mode

All these operating modes can be mixed. A very common mixture is the combination of asynchronous connection establishment with synchronous data transmission. This combination allows very simple transmission hardware, because it is controlled by a central clock signal, and a flexible communication pattern, because PEs can start a communication at any time.

4.5 Communication Flexibility

Communication within a network can follow different strategies or patterns. A network can support all of them or just one. The level of communication flexibility depends on how many and which of the strategies the network supports.

Broadcast

The simplest communication strategy in a network is a broadcast. If a PE wants to transmit data to another PE, it sends the data to all other PEs. The receiving PE recognises that the data is addressed to it and can use it. All other PEs just drop the data. This is not very flexible or efficient, but it does not require a complex routing algorithm.

Unicast

The unicast communication strategy is the opposite of a broadcast. A PE addresses exactly one other PE and the data is only transmitted to this one. No other element in the network receives the data.

Multicast

A broadcast is often too expensive because the data is transmitted to all PEs in the network. To improve the flexibility and the cost of the communication pattern, the multicast strategy was developed. It allows the addressing of a subset of all the PEs in the network. This improves the flexibility considerably because the network can be divided into different groups, which can be addressed individually.

Mixed

All the strategies mentioned above can be combined within a network. In TCP/IP networks, for example, all of them can be found. It is also very common to combine the unicast and multicast strategies. This combination increases the flexibility of a network considerably because both individual PEs and groups of PEs can be addressed.
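The difference between the strategies can be made concrete by counting which PEs a transmission reaches and which PEs keep the data. The following sketch assumes PEs are identified by integer indices; the function `transmit` and its `broadcast_only` parameter are hypothetical and only serve this illustration.

```python
from typing import Set, Tuple


def transmit(n: int, sender: int, targets: Set[int],
             broadcast_only: bool) -> Tuple[Set[int], Set[int]]:
    """Return (PEs the data is transmitted to, PEs that keep the data).

    broadcast_only=True models a network that can only broadcast: every other
    PE receives the data and drops it unless it is one of the targets.
    Otherwise the network addresses exactly the target set: one target is a
    unicast, several targets are a multicast.
    """
    others = set(range(n)) - {sender}
    wanted = targets & others
    reached = others if broadcast_only else wanted
    kept = reached & wanted
    return reached, kept
```

With eight PEs, sending from PE 0 to PE 3 over a broadcast-only network reaches seven PEs, of which six drop the data; with unicast support only PE 3 is reached.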

4.6 Control Strategy

As mentioned earlier in this chapter, networks can be divided into static and dynamic ones. If a network is dynamic, the control over the dynamic links can be organised in different ways. This property is inapplicable to static networks because their links are fixed.

Centralised Control

In a centrally controlled dynamic network a single control unit is responsible for the selection of the source and destination of the interconnecting links. This often requires a lot of hardware because the central control unit needs to control all components in the network that can switch the connection links. The configuration of all the links requires a very complex algorithm too. This strategy is therefore best used in an environment with very few changes. In return, all connected resources in such a network can be configured at once and in cooperation with all the others to achieve the best possible interconnection pattern for the current work.

Decentralised Control

The opposite of a centrally controlled network is a decentrally controlled network. In this kind of network many network components exist, each of which organises the connection links for a small part of the network. These networks are also called self-routing networks because, when data is transmitted through the network, the decentralised components need to decide how to switch the connection links and route the data without a view of the complete network. This leads to a network without the optimal interconnection pattern, but one that is very flexible and adaptable to different communication requirements on the fly.

4.7 Transfer Mode and Data Transport

Two network transfer modes are common today. In a circuit switched network a complete link is established between two communicating PEs through every intermediate PE. This can be done in a centralised or decentralised manner, as explained earlier in this chapter. In a packet switched network, data is grouped into packets. These packets contain the source and destination address in a header section. The PEs in a direct network, or an infrastructure component in an indirect network, forward these packets according to an algorithm until they are received by their destination.

Detached from the actual hardware implementation, communication within a network can be connection oriented or connectionless. In a connection oriented communication the source always establishes a connection with the destination first, which stays active for the whole communication. In packet switched networks this is always done using some kind of virtual connection, where the destination is told when a connection starts and when it ends. In a circuit switched network a real connection can be established between both communication partners. In a connectionless communication the source just sends data packets into the network. These packets travel along the cheapest interconnection links. No preferred communication path exists. Connectionless communication is only possible in a packet switched network.

Depending on the underlying hardware and the connection type, different routing algorithms have to be used to get the data to its destination.

Store and Forward Routing: This kind of routing is used in packet switched networks to forward packets between network entities as a whole. The packet is transmitted completely and is saved in a buffer at the next component. When the link to the following component is ready, the packet is forwarded again. This routing mechanism is very simple, but consumes a lot of hardware because much buffer space is required at each network component.

Wormhole Routing: Wormhole routing combines the advantages of packet and circuit switched networks in environments where the data transport is done over intermediate nodes. The data packets are divided into smaller pieces, called flits. The first flit contains the connection information. Each level in the network builds up its part of the connection link when it receives the first flit. After this connection establishment there is a complete link between source and destination, and all flits of the packet are somewhere in between. The last flit tears down the link. The advantage of this strategy is a reduced latency between transmission and reception of a message. The disadvantage is the possibility of deadlocks because one transfer locks multiple network components at a time.

Virtual Cut Through Routing: This routing scheme is related to wormhole routing. It is used in packet switched networks. In each level of the network there is enough buffer space available for saving the complete data packet. Packets are transferred into the network and each level forwards them to the next level. If the way to the next level is blocked, the packet is held back. If the way is free, the forwarding of the packet is started immediately, without waiting for the reception of the full packet. As in wormhole routing, a packet may be spread across multiple levels of the network. A long blocking of the network is prevented by buffering packets if the way is blocked.

4.8 Conflict Resolution

Networks can differ in the way they resolve conflicts. The two main network conflicts are output conflicts and internal conflicts.

output conflict: These conflicts occur if messages are transferred from multiple sources to one destination, but only one connection can be established between source and destination. This conflict cannot be resolved by changing the network topology because the destination can only support one connection.

internal conflict: Even if all messages are addressed to different destinations, an internal conflict can occur. In networks consisting of consecutively interconnected links, a message can partly travel the same way as another message. This leads to a conflict because only one message can pass a link at a time. This conflict is traffic induced and can be resolved by changing the network topology, for example by creating redundant links to bridge the part of the network with the bottleneck.

To resolve these conflicts, without changing the topology if possible, three resolution methods are available.

Block Method: If a message cannot be routed to the destination or the next network level, the message has to wait at the source. This requires the source component to have enough buffer space for at least one message.

Drop Method: In this case, a non-routable message is discarded. No additional attempt to deliver the message is made, so the data is lost.

Modified Drop Method: A small change can reduce the impact of the drop method. In this mode packets are only dropped if the buffer space is exhausted or the network has been blocked for a certain duration.
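The effect of the three resolution methods can be compared with a very small simulation of an output conflict. The sketch below is an assumption-laden model, not the author's implementation: a single output accepts one message per time step, `arrivals[t]` is the number of messages that want to use it at step `t`, and the modified drop method is reduced to its buffer criterion (the blocking-duration criterion is not modelled). The function name and parameters are hypothetical.

```python
from typing import List, Tuple


def resolve(arrivals: List[int], method: str, buffer_size: int = 2) -> Tuple[int, int]:
    """Simulate one output that accepts one message per time step.

    Returns (delivered, lost) for the methods "block", "drop" and
    "modified_drop".
    """
    if method not in ("block", "drop", "modified_drop"):
        raise ValueError(f"unknown method: {method}")
    waiting = 0
    delivered = lost = 0
    for count in arrivals:
        waiting += count
        if waiting:  # the output forwards one message per step
            waiting -= 1
            delivered += 1
        if method == "drop":
            lost += waiting  # every message that could not be routed is discarded
            waiting = 0
        elif method == "modified_drop" and waiting > buffer_size:
            lost += waiting - buffer_size  # drop only when the buffer is exhausted
            waiting = buffer_size
        # "block": the remaining messages keep waiting at their sources
    delivered += waiting  # drain what is still waiting after the last arrival
    return delivered, lost
```

For the arrival pattern `[3, 0, 0, 3]` the block method delivers all six messages, the drop method loses four of them, and the modified drop method with a buffer for one message loses two.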


5 Example Network On Chip Architectures

Many NOCs exist today. This chapter introduces some simple NOCs, which will later be compared with the NOCs developed in this work. For information about more complex NOCs the reader can consult Schwederski et al. [21] or Bjerregaard et al. [31]. The latter give a very interesting survey of research in NOC architectures.

5.1 Ring

Ring networks are among the simplest networks available. Their communication can be unidirectional or bidirectional. Figure 5.1 shows an example bidirectional ring with eight communication elements. Every one of these can transmit a message at the same moment. A bidirectional ring can transmit data in both directions, a unidirectional ring in just one. The structure of the ring allows very fast local communication between two neighbouring nodes, but only slow global communication.

[Figure 5.1: Example Ring network with eight nodes]

Table 5.1 presents some classification properties for a bidirectional ring with $N$ nodes.

Property          Value
Type              direct-static
Grade             $\Gamma = 2$
Regularity        2-regular
Diameter          $\Phi_{RING} = N/2$
Symmetry          node & link
Scalability
Bisection width   $W_{RING} = 2$

Table 5.1: Classification of a bidirectional ring

A ring is a static network because the communication partners are always fixed. In this case, the communication infrastructure is located in the PEs, which makes the ring a direct network. By moving the communication infrastructure outside the PEs, it can become an indirect one. The grade and the regularity state that the nodes in the network have a maximum of two communication links and that all of them have the same number. The diameter is $N/2$ in a bidirectional ring and $N - 1$ in a unidirectional ring. The following are examples of specific implementations of the ring architecture:

- Token-Ring [32]
- Register Insertion Rings [33]
- Scalable Coherent Interface (SCI) Ring [34]

5.2 Bus

A bus is a very simple and flexible network architecture. It is mostly used for accessing components in a memory-like manner. The interconnection links are divided into data, address and control signals and are shared by all network nodes. Figure 5.2 shows an example bus with four interconnected components.

[Figure 5.2: Example bus with 4 nodes; 8 bit data signal, 4 bit address signal, 2 bit control signal; Node 0 to Node 3 at addresses 0000, 0001, 0010 and 0011]

Because the network uses a shared medium for data transfer, the maximum number of components is limited. The access to the medium is implemented in a time-multiplexed way. The data transmission between network nodes is more complicated than in a ring. First the access to the interconnection links, the bus arbitration, has to be organised. This can be implemented in a centralised or decentralised style. The actual data transmission can be synchronous or asynchronous. The destination of a transmission is selected by the value of the address signals. This explicit address selection allows direct communication between two components. One of the components, the initiator, controls the communication and the other, the responder, answers the request.

Bus Arbitration

The bus arbitration decides which component is allowed access to the interconnecting links. This is necessary because a bus uses a shared medium and only one active component is allowed on the bus at a time. The access decision can be made by a central control unit. Each network component has a bus-request and a bus-grant line to this central control unit. The unit selects the bus component with the highest priority out of all components requesting bus access.

If no central control unit is available, or if one is not practical, the access decision can be made in a decentralised way. An example of a decentralised decision pattern is daisy chaining the network components. With daisy chaining the bus-request signals are combined in pairs with the AND operation. The resulting request line is combined with the next bus component in the same way. The physical order of the network nodes thus determines the access priority.

Another decentralised access method is Carrier Sense Multiple Access / Collision Detection (CSMA/CD). This method requires the network nodes to listen on the interconnection lines all the time. If the lines are not in use, a node can start a transmission of its own. If multiple components try to access the bus at the same time, the nodes can recognise this by comparing the data on the bus with the data they transmit. If such a collision is detected, the components stop transmitting and wait for a random time before trying again.

These arbitration methods are not restricted to buses. They can be used for any other decentralised network too.

Data Transmission Protocol

While the bus arbitration is responsible for allowing access to the bus, protocols organise the data transfer between two bus nodes. Two different kinds of protocols are common.

synchronous protocol: The synchronous protocol requires the data transmission to be aligned with a global clock signal. The clock rate determines the transmission speed for all network components. Because of the synchronicity with a global clock signal, this transmission scheme is very fast and very simple. The communication partners save the applied signal values at the rising edge of a clock tick.

asynchronous protocol: The asynchronous transmission protocol is more complex than the synchronous one. The transmission is not controlled by a central clock signal, but by four additional handshake signals. These signals work in pairs assigned to the communication partners. Each pair consists of a request-start signal applied by the sender of a message and a request-done signal applied by the receiver of the message. The data signal can only be updated once the request-done signal has been applied. This handshaking allows components to have different transmission speeds, but reduces the overall transfer speed.
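The central arbitration described under Bus Arbitration above can be sketched in a few lines. This is a minimal model that assumes a fixed priority order in which a lower index means a higher priority; the thesis does not prescribe how priorities are assigned, and the function names are hypothetical.

```python
from typing import List, Optional


def arbitrate(requests: List[bool]) -> Optional[int]:
    """Central arbiter: pick the requesting component with the highest priority.

    Assumption for this sketch: index 0 has the highest priority.
    Returns None if no component requests the bus.
    """
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None


def grant_lines(requests: List[bool]) -> List[bool]:
    """Drive the bus-grant lines: at most one of them is active."""
    winner = arbitrate(requests)
    return [i == winner for i in range(len(requests))]
```

If components 1 and 2 request the bus at the same time, `grant_lines([False, True, True, False])` activates only the grant line of component 1. Adding a further component means extending the request and grant vectors of the arbiter, which matches the scalability concern discussed for the bus.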

Classification

Table 5.2 displays the classification of the described simple bus.

Property          Value
Type              direct-dynamic
Grade             $\Gamma_{BUS} = 1$
Regularity        1-regular
Diameter          $\Phi_{BUS} = 1$
Symmetry          node & link
Scalability       no
Bisection width   $W_{BUS} = 1$

Table 5.2: Classification of a bus

The interconnection type is direct-dynamic because the bus participants are responsible for the data transmission and the bus arbitration, and because the connections between two components can be changed through the address signals. All network nodes have only one connection to the bus and, once connected, the transmission is done without any intermediate nodes. The grade of the bus is therefore one, the bus is 1-regular and its diameter is one. The bus is not scalable because the medium access gets more and more difficult the more components want to share it. If another component is added to an existing bus, the central arbiter has to be extended or the priority order in a decentrally controlled network has to be changed.

5.3 Grid

Grid networks arrange their nodes in an array of two or more dimensions. Every node is connected to its neighbours and supports direct communication with them. Figure 5.3 displays two different kinds of grid networks. The difference between both types is that the mesh network is irregular because the corner and border nodes have a different grade than the other nodes. The Illiac network is based on the famous Illiac computer [35].

[Figure 5.3: Example grid networks with 16 nodes; (a) open grid (mesh), (b) Illiac network]

The simplest versions of grid networks are 2-dimensional. The nodes are arranged in rows and columns with the same number of nodes, as displayed in Figure 5.3. In the more general case the number of nodes per row or column can be different and the dimension can be more than two. The transmission of messages between nodes is much more complex than in a ring or bus. Multiple shortest paths exist between the source and the destination of a message. The selection of the path is a hard decision, but will not be part of this introduction. Closed grids often have the ability to reconfigure the interconnection of their border and corner nodes to adapt to required communication patterns. The disadvantage of grid networks is their large diameter. This disadvantage can be reduced by adding more dimensions to the network, at the cost of a more complex path finding algorithm.

Table 5.3 and Table 5.4 show the classification for the grid networks presented in Figure 5.3. The interconnection type of both networks is direct-static because the nodes are responsible for all the communication, including path finding, and there is no possibility of reconfiguring the interconnection network.

Property          Value
Type              direct-static
Grade             $\Gamma_{MESH}$ = undef
Regularity        irregular
Diameter          $\Phi_{MESH} = 6$
Symmetry          asymmetric
Scalability       no
Bisection width   $W_{MESH} = 2$

Table 5.3: Classification of an open grid (mesh) with 4 × 4 nodes

The mesh network is irregular, as mentioned earlier, because of the different number of interconnection links at the border nodes. The longest path between two nodes takes six intermediate transfers. Because of the irregularity, the network is asymmetric. In contrast to the mesh network, the Illiac network is 4-regular. Every node has connections to exactly four neighbours. This reduces the network diameter to three.

Property          Value
Type              direct-static
Grade             $\Gamma_{ILLIAC} = 4$
Regularity        4-regular
Diameter          $\Phi_{ILLIAC} = 3$
Symmetry          node-symmetric
Scalability       no
Bisection width   $W_{ILLIAC} = 4$

Table 5.4: Classification of a closed grid (Illiac) with 4 × 4 nodes

5.4 Tree

A tree is an undirected, connected, acyclic graph. It has exactly one root node spreading into multiple child nodes. A node without any children is a leaf node. The depth $T$ of a tree is the maximum number of edges from a leaf node to the root. Many distributed algorithms, such as Divide and Conquer algorithms [36], prefer this topology because the structure of the algorithm can easily be mapped onto the nodes in a tree network. Trees can also be classified by the number of children per node. When naming a tree, the maximum number of children per node is given as a prefix. For example, a 2-tree is a binary tree with a maximum of two children per node and a 4-tree is a quadruple tree with a maximum of four children per node. Figure 5.4 shows exactly these two tree networks. A tree is called complete if all nodes except the leaves have all their edges assigned. Table 5.5 shows the classification of a simple tree.

Property          Value
Type              direct-static
Grade             $\Gamma_{TREE}$ = undef
Regularity        irregular
Diameter          $\Phi_{TREE} = 2T$
Symmetry          asymmetric
Scalability       yes
Bisection width   $W_{TREE} = 1$

Table 5.5: Classification of a tree

A tree is a direct-static network because the communication infrastructure is located within each node and the communication partners cannot be changed. The number of connections at the leaf nodes differs from that of all the other nodes, leading to an irregular and asymmetric network. The diameter is the maximum path length between nodes in the network. The longest path in a tree runs from a leaf on the left side of the root node to a leaf on the right side, leading to a diameter of $2T$. The bisection width is determined by the path through the root node.

5.5 Crossbar

Crossbar networks are indirect networks built out of network nodes and one network infrastructure component, the crossbar. The crossbar interconnects all output signals of the nodes with all their input signals. Through the crossbar configuration the nodes can be interconnected with each other, supporting all possible permutations.

[Figure 5.4: Example tree networks; (a) binary tree, (b) quadruple tree of depth 2]

Figure 5.5 displays an example crossbar with four nodes. The boxes within the crossbar are configuration elements. By turning them on, a connection between the horizontal and the vertical signal lines can be established. Only one active element per vertical signal line is allowed; more than one results in a conflict. By activating multiple elements per horizontal signal line, broadcast and multicast communication can be implemented.

[Figure 5.5: Example 4 × 4 crossbar network]

Table 5.6 shows the classification of an n-node crossbar.

Property          Value
Type              indirect-static
Grade             $\Gamma_{CROSSBAR} = 1$
Regularity        1-regular
Diameter          $\Phi_{CROSSBAR} = 2$
Symmetry          node-symmetric
Scalability       no
Bisection width   $W_{CROSSBAR} = n$

Table 5.6: Classification of a crossbar network with n nodes

A crossbar is an indirect-static network because the nodes are not responsible for the routing of data and the nodes are always connected to the crossbar. Each node has only one bidirectional connection to the crossbar, resulting in a 1-regular system. The diameter of the network is calculated according to the definition of the diameter for indirect networks in Section 4.2. Because the crossbar network has only one level of interconnection infrastructure, the diameter is two. A crossbar is a very flexible and fast interconnection method, but requires many hardware resources to implement: $n \times n$ configuration elements are required to build the crossbar. These configuration elements are often realised as multiplexers. A 4 × 4 crossbar requires four 4-to-1 multiplexers. This does not scale for larger crossbars. Even adding another node is not simple because all n-to-1 multiplexers have to be replaced with (n+1)-to-1 multiplexers.
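The rule for a valid crossbar configuration (at most one active element per vertical signal line, any number per horizontal line) can be expressed as a short check. The sketch assumes that the configuration is stored as a matrix of booleans in which the row selects the horizontal line and the column the vertical line; this layout and the function names are illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical layout: config[row][col] is the configuration element that
# connects horizontal line `row` with vertical line `col`.
Config = List[List[bool]]


def is_valid(config: Config) -> bool:
    """At most one active element per vertical signal line."""
    columns = len(config[0]) if config else 0
    return all(sum(row[col] for row in config) <= 1 for col in range(columns))


def routes(config: Config) -> Dict[int, List[int]]:
    """Map each driving horizontal line to the vertical lines it is connected to."""
    if not is_valid(config):
        raise ValueError("conflict: more than one active element on a vertical line")
    return {r: [c for c, on in enumerate(row) if on]
            for r, row in enumerate(config) if any(row)}


def element_count(n: int) -> int:
    """An n-node crossbar needs n x n configuration elements."""
    return n * n
```

A row with several active elements models a multicast or broadcast, while two active elements in the same column are rejected as a conflict.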

6 Granularity Problem of Runtime Reconfigurable Design Flow

Dynamic or runtime reconfiguration is becoming more and more important in FPGA design. It enables the designer to fit more hardware onto the chip than is physically available by swapping components in and out as required by the system. Another possible use is the optimisation of the configured hardware to runtime requirements. For example, the communication stack within a network switch can be optimised for the negotiated speed (10/100 Mbit, 1/10 Gbit), or CPU cores can be improved by configuring special accelerator units. Section 2.5 gives a more detailed introduction to the Xilinx PR design flow, which is used in this thesis. The general steps to create a partially runtime reconfigurable system with multiple reconfigurable components are:

1. decide on the number of reconfigurable modules
2. decide the size of each reconfigurable module
3. decide where to place each reconfigurable module
4. decide which interconnection network to use
5. describe the static system and the interconnection network in a HDL
6. describe in a HDL every reconfigurable system that is to be placed into the reconfigurable modules
7. synthesise, place and route the static system
8. synthesise, place and route each reconfigurable system for every reconfigurable module

Because the size, number and placement of the RMs are fixed during the first three steps of the design flow, repositioning or resizing them during runtime is impossible. In many designs this fixed decision is not a problem. For example, in a design with one or two RMs and reconfigurable components of nearly the same size it is rarely necessary to resize or reposition the RMs during runtime. But in designs with more RMs and many components of different sizes the fixed decision limits the flexibility and creates much slack space in the RMs.

The granularity problem describes the difficulty of choosing the right size and number of RMs in such a system. If differently sized components shall fit into all available RMs, most developers will choose the maximum component size as the RM size. This reduces the number of smaller components that can be configured at the same time, but allows any component to be configured into any RM. Figure 6.1 displays an example of the granularity problem. The FPGA is divided into four equally sized RMs. ARM and MIPS processor cores, PIC and ATmega microcontrollers, FSMs and Boolean functions are available as components to configure into these modules. The displayed system tries to solve a problem by using one ARM/MIPS processor core, one PIC/ATmega microcontroller and one FSM. The components easily fit onto the FPGA, but only the ARM/MIPS core exploits all the available space in its RM. The unused space in the other RMs is wasted because it is tied to the modules and cannot be configured independently. The space on the FPGA could be exploited much more efficiently if the placement of the components were more flexible and the RM boundaries did not exist. This would possibly allow more than one system to do computations on the FPGA.

Figure 6.1: Example granularity problem (labels: FPGA, four reconf modules, Small CPU (PIC/ATmega), CPU (ARM, MIPS), FSM)
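The slack space in such a design can be estimated with a short calculation. The following Python sketch sizes every RM for the largest component, as described above, and reports the unused area. The component sizes are hypothetical values in arbitrary area units; the thesis gives no numbers at this point.

```python
# Hypothetical component sizes in arbitrary area units (not from the thesis).
COMPONENT_AREA = {"ARM/MIPS core": 100, "PIC/ATmega": 40, "FSM": 10}


def slack_per_module(component_area, rm_area):
    """Return the unused area of the RM that holds each component."""
    for name, area in component_area.items():
        if area > rm_area:
            raise ValueError(f"{name} does not fit into one RM")
    return {name: rm_area - area for name, area in component_area.items()}


# Every RM gets the size of the largest component.
rm_area = max(COMPONENT_AREA.values())
slack = slack_per_module(COMPONENT_AREA, rm_area)

# Four equally sized RMs as in Figure 6.1, one of them left empty.
num_rms = 4
total_area = num_rms * rm_area
used_area = sum(COMPONENT_AREA.values())
wasted_fraction = (total_area - used_area) / total_area
```

With these assumed sizes only the processor core fills its RM, and more than half of the reconfigurable area stays unused even though three of the four RMs are occupied.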

6.1 Solutions

The following sections describe two different solutions that reduce the effects of the granularity problem on runtime reconfigurable system design. They use different floorplanning strategies to achieve this goal.

Grouping Solution

A very simple solution that reduces the consequences of the granularity problem is to have groups of differently sized RMs on the FPGA. Figure 6.2 presents an example system using the grouping solution. The FPGA is partitioned into three regions, each holding RMs of a different size. In this case the sizes are chosen to fit two CPU-sized components, four medium-sized FSMs and 12 small Boolean function components onto the FPGA. The RMs of each group feature the same signal interface and are interconnected statically.

Figure 6.2: Example grouping solution configuration (labels: CPU Core, FSM, f(b))

Advantages

Because of the same signal interface and interconnection network within each group of RMs, converting a design from the standard PR design flow to the grouping solution is very easy. Every reconfigurable component can be reused without adaptations. The static system requires some small changes to the interconnection and management part to operate the groups concurrently. In comparison to the standard flow, the overhead is very small.

The computable outline of the design is another advantage of this solution. An algorithm with the parameters number of groups, size of the RMs in each group and number of RMs per group can compute the outline of the RM groups very fast. This greatly speeds up and simplifies the whole development process.

Disadvantages

Despite the advantages of this system, the design process still requires a decision about the size, number and position of the RMs, which leads to the granularity problem at some size of the overall system. A change in these parameters requires a full re-synthesis of the whole system. When the FPGA is configured with the new partitioning, all running computations are stopped and their current state is lost. Within the regions the design is still bounded by the maximum number of RMs in each region. The structure of each RM is regular, but the full system is not. The groups of RMs enforce their own signalling interface. This prevents components from being configured into RMs outside their RM group. It even prevents the development of components that fit into all RMs.

Granularity Solution

The granularity solution partitions the FPGA into many equally sized RMs. These RMs have the same signal interface to the interconnection network. They can be combined to form larger components by interconnecting them through the interconnection network. The size of one RM is the only parameter required at design time. During runtime, the configuration files belonging to a reconfigurable component can be placed into any RMs on the FPGA. These RMs are not required to be positioned next to each other. Figure 6.3 presents an example partitioning. The FPGA is divided into 7 × 6 RMs. The example design currently contains two differently sized CPU cores, an FSM and two differently sized Boolean functions. Still, there is more space available for additional components.
Advantages

The placement of the reconfigurable components in this solution is very flexible and does not create as much slack space as the standard PR design flow. The number of RMs is only bounded by the size of the FPGA. At design time the number of reconfigurable components fitting onto the FPGA is unknown. All the RMs can be used for one or two CPUs or for many small Boolean functions. Any component that can be divided into multiple smaller sub-components is possible. The regular structure of the whole system enables each entity that is configurable into an RM to see the system in the same way from any RM. This simplifies the development of components, as does the common interface of all RMs.
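The flexible placement described above can be sketched as a simple allocation over a grid of equally sized RMs. The following Python sketch is only an illustration: the `allocate` function, the first-free selection strategy and the number of RMs per component are assumptions, not part of the thesis.

```python
def allocate(free_rms, rms_needed):
    """Take any rms_needed free RMs; they do not have to be adjacent.

    Returns the chosen RM coordinates and removes them from free_rms,
    or returns None if not enough RMs are free.
    """
    if rms_needed > len(free_rms):
        return None
    chosen = sorted(free_rms)[:rms_needed]
    free_rms.difference_update(chosen)
    return chosen


# 7 x 6 grid of RMs as in the example partitioning of Figure 6.3.
free = {(row, col) for row in range(6) for col in range(7)}

# Hypothetical RM counts per component (not taken from the thesis).
cpu1 = allocate(free, 12)
fsm = allocate(free, 3)
f1 = allocate(free, 1)
```

Because every RM is interchangeable, the allocator never has to compare component sizes with RM sizes. It only has to count free RMs, and the interconnection network is then responsible for joining the chosen RMs into one component.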

Figure 6.3: Example granularity solution configuration (labels: CPU1 Core, CPU2 Core, FSM, f1(B), f2(B); all remaining cells are marked RM)

Disadvantages

The disadvantages of the granularity solution start with the decomposition of the reconfigurable components into smaller components that fit into one RM. The decomposition and the different signal interface prevent the reuse of the reconfigurable components of a standard PR design. The decomposition is also not a simple task, and it is not guaranteed that all components can be divided into smaller parts. Another disadvantage is the interconnection network. It has to span the whole FPGA and connect all RMs, which requires additional FPGA space. The number of RMs and the space used for interconnection and management have to be balanced to get a good design. The path delay of the interconnection lines between the RMs can be another problem. The lines might not be fast enough to support the connection speeds required within reconfigurable components.

6.2 Granularity Problem and Hybrid Hardware

The granularity problem occurs on any runtime RS where multiple differently sized reconfigurable components shall be used. This is also the case in the scenario of coupling processor cores and reconfigurable hardware, introduced in Section 1.2. The standard methods to couple processors with reconfigurable hardware are datapath accelerators, bus accelerators and multicore reconfiguration. Datapath accelerators commonly use a very small area, bus accelerators are medium sized, and multicore reconfiguration requires much space on an FPGA. Figure 6.4 gives a graphical overview of these space requirements.

Figure 6.4: Area requirements of the different usage patterns: (a) Datapath Accelerator, (b) Bus Accelerator, (c) multicore reconfiguration

Each pattern has its unique type of use. Datapath accelerators are used to increase the instruction flexibility. They allow different instructions to be appended to the processor's ISA. Bus accelerators are the most common usage pattern at the moment. They allow different kinds of accelerators to be configured into the reconfigurable area and connect them to the processor through a bus. With the multicore reconfiguration pattern, the reconfigurable area is used to instantiate multiple processor cores. These cores can run on their own or form a multicore system. In this work, all these connection methods shall be combined into one system, which leads to the granularity problem.

7 Multicore Reconfiguration Platform Description

After introducing the basics of reconfiguration and NOCs and describing the granularity problem of runtime reconfigurable design flows, this chapter presents the main part of this thesis, the Multicore Reconfiguration Platform (MRP). The MRP is a hybrid hardware system. In contrast to the existing research and commercially available systems, the MRP uses the Xilinx PR design flow to implement its reconfigurability. The use of dynamic or runtime reconfiguration helps to solve the granularity problem by using the granularity solution presented in Section 6.1.2. This granularity solution enables the MRP to support multiple differently sized reconfigurable components without taking component sizes into account at the initial floorplanning stage. Inter-FPGA connections are another new feature of the MRP. A packet switched network, called OCSN, can interconnect multiple FPGAs. Figure 7.1 displays an overview of an example MRP system, consisting of three FPGAs.

Figure 7.1: Example MRP System Overview (labels: host system, OCSN, support platform with softcore, two reconfiguration platforms)

By adding more FPGAs to the OCSN, the reconfiguration area of the MRP is easily extensible. This extensibility helps if applications require more reconfiguration space during runtime. As Figure 7.1 shows, an MRP system is divided into support and reconfiguration platforms. Support platforms provide access to system resources such as BRAM, DDR RAM, General Purpose Input Output (GPIO), USB controllers and mass storage through the OCSN, while reconfiguration platforms provide many RMs. This setup allows a maximum of reconfigurable space, while still supporting additional hardware resources. The number of platforms is only limited by the addressing space of the OCSN. The platforms and the host system, such as a server or workstation, are also connected through the OCSN.
To support a high-speed connection between the MRP and its host system, the connection is implemented using 1 Gbit Ethernet as its physical layer. As an alternative to a full-featured host system, the support platform can provide a soft-core SoC connected to the OCSN. This SoC can control the MRP and distribute hardware applications.

Except for the Convey HC1, most of the other hybrid systems suffer from a lack of direct operating system support. The MRP is directly integrated into the Linux OS. The device drivers provide a network API to communicate with all OCSN components and to configure the RMs. The remainder of this chapter introduces the OCSN in Section 7.1, the support platform in Section 7.2 and the reconfiguration platform in Section 7.3. Furthermore, it describes the OS support in Section 7.4 and the design flow for working with the MRP.

7.1 On Chip Switching Network

The requirements for a NOC that interconnects the support and reconfiguration platforms are diverse. First, the NOC has to support the interconnection of multiple FPGAs with different physical connections and variable signal lengths. FPGA boards can be interconnected by Ethernet, CAN, simple wires using some kind of serial protocol like SPI or RS232, or other interconnection schemes. Scalability is another very important requirement. Adding another platform or component should not lead to the reconstruction of the whole NOC. The network should support broadcast and unicast connections because information has to be distributed through the network very fast and certain components require a lot of data transfer. Because many components participate in this network, the hardware requirements for connecting one component to the network should be as small as possible. Most networks cannot satisfy all these requirements. For example, a bus is not scalable and does not permit multiple components to communicate concurrently. But a static indirect packet switched network fulfils all the requirements. The OCSN is a static indirect packet switched network. It supports the interconnection of multiple FPGA boards by using bridges over different physical connections and different protocols. It is scalable to a limited extent, by adding components to network switches and by increasing the number of switches.
Broadcast and unicast packet transmission is supported by routing all broadcast packets to all outgoing connections of a network switch. The usage of network switches for most of the network organisation reduces the interface size in the network devices. The OCSN uses the OSI model to divide functionality into layers, which eases the adaptation to different hardware and software and standardises the interconnection points. Therefore, the OCSN description starts with the definition of the physical layer and walks up to the application layer. All these layers are implemented in hardware, without the usage of additional micro-controllers, to save configuration space on the FPGAs.

Clock      Bit-width   Speed
200 MHz                Gbit/s
200 MHz                Gbit/s
200 MHz                Gbit/s
100 MHz                Gbit/s
100 MHz                Gbit/s
100 MHz                Gbit/s

Table 7.1: Variable speed of the OCSN

Physical Layer

At the physical layer, exactly two network interfaces are connected to each other. Each interface transmits a full OCSN frame of 39 bytes in one transfer. Using such large frames in one transfer often leads to transmission errors. Here, however, the network mostly spans a single FPGA, which reduces the error probability to approximately zero. The simple approach of transmitting a full frame at once reduces the area usage of each network interface. In this case, the advantage of reduced area usage outweighs the disadvantage. The 39 bytes of each transfer are divided into a configurable number of bits that are transmitted concurrently at each clock tick. The allowed bit-widths are {x : 312 mod x = 0} bits because 39 bytes × 8 bits = 312 bits. Full duplex mode, using dedicated transmission and reception lines, is also supported. The typical clock rates at this layer are 100 MHz and 200 MHz, resulting in the maximum network speeds displayed in Table 7.1.

Data-link Layer

The data-link layer of the OCSN is responsible for detecting and identifying the remote device. To prevent an overflow of the receive buffer, it implements hardware flow control between the two directly coupled interfaces. If the receive buffer of one interface hits an upper bound, it signals the other interface to stop transmitting. If, after the transmission has stopped, a lower bound is reached, the interface requests the continuation of the transmission. The data-link layer of the OCSN does not provide any error detection or correction methods because the error probability, if configured onto an FPGA, is very small. But this feature can easily be added if required.

Network Layer

The network layer defines everything required for routing OCSN frames through the network to the correct destination. Figure 7.2 displays the structure of one OCSN frame.
It is built from source and destination addresses, additional source and destination port fields, a frame type field and the payload of the frame. For the network layer, the 16 bit source and destination addresses are of interest.
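The bit-width rule of the physical layer can be checked with a few lines of Python. The sketch below lists all divisors of the 312 bit frame and the number of clock ticks one frame needs at a given width. The raw link rate is computed as clock rate times bit-width, which is an assumption of this sketch that ignores flow control and gaps between frames; it is not meant to reproduce the values of Table 7.1.

```python
FRAME_BITS = 39 * 8  # one OCSN frame: 39 bytes = 312 bits


def allowed_bit_widths(frame_bits=FRAME_BITS):
    """All widths x with frame_bits mod x == 0."""
    return [w for w in range(1, frame_bits + 1) if frame_bits % w == 0]


def cycles_per_frame(width, frame_bits=FRAME_BITS):
    """Clock ticks needed to move one frame over a link of this width."""
    if frame_bits % width != 0:
        raise ValueError("width does not divide the frame size")
    return frame_bits // width


def raw_rate_mbit(clock_hz, width):
    """Assumed raw link rate in Mbit/s: one transfer of width bits per tick."""
    return clock_hz * width // 10**6


widths = allowed_bit_widths()
```

For example, an 8 bit wide link needs 39 ticks per frame, and under the stated assumption it moves 1600 Mbit/s at 200 MHz.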

Figure 7.2: OCSN frame description (fields: 16 bit SRC Address, 16 bit DST Address, SRC Port, DST Port, Frame Type, DATA, 31 byte DATA)

The network infrastructure components of the OCSN are OCSN switches. They are organised in a tree structure to reduce routing complexity. A grid network would be faster and more flexible because different routes between two components exist, but it would increase the routing overhead. A big disadvantage of a tree is its bisection width of one. Regardless of how a network organised in a tree structure is divided, the maximum number of connections between the two halves is always one. This leads to a big bottleneck if components from one side have to communicate intensely with components on the other side. This disadvantage can be reduced by interconnecting all switches of one level in a ring, but this is not applicable in this network because the tree spans multiple FPGAs. Furthermore, most of the components in this network will communicate with their direct neighbours. This communication will usually take place over one switch. All of these OSI layers have to be implemented in hardware, without the usage of additional micro-controllers. Because this hardware should have a very small area footprint, the advantages of simple routing outweigh the bandwidth disadvantages in this case. An example OCSN, consisting of OCSN switches only, is displayed in Figure 7.3. The example network is organised as a binary tree, but more outgoing edges per switch are also possible. Switches are only specialised network devices. This flexible design allows replacing switches by any other component and using switch ports for switches and devices without reconfiguring the system.

Figure 7.3: OCSN network structure overview (a root OCSN switch with further OCSN switches below it, arranged as a binary tree)

To get routing working in this tree network, the 16 bit network addresses have to correspond to the tree structure of the network. Therefore, the addresses are divided into the six parts shown in Figure 7.4. To support broadcast and unicast in the network, the first bit (r) of an address selects broadcast or unicast mode. The remaining bits are partitioned into five groups of three bits each. In the figure these groups correspond to the coloured characters a1 a2 a3 ... e1 e2 e3. If the value of r is one, the address identifies the root node of the tree. In Figure 7.3 the root node is the top switch. The switches generate the tree, while devices are leaves of the tree. Switches always own an address that starts with a zero in their group. The second part, consisting of the bits a1 a2 a3, addresses all tree components directly connected to the root switch. They are the second level components of the tree. The bits b1 b2 b3 identify all components directly connected to switches of the second level, as shown in Figure 7.3. This scheme continues up to group e1 e2 e3, which identifies all components connected to switches of the fifth level. The sixth level cannot hold any more switches because there are no addresses left. This limitation can easily be removed by extending the address space. This addressing scheme enables all switches in the network to identify their uplink and downlink ports by checking the addresses of all connected devices. One advantage of a tree is the existence of only one route from one component to another. This reduces the routing decision to identifying the uplink of a switch and calculating to which of the connected switches an address belongs. Frames with a broadcast destination are transmitted to all ports except the incoming one. Because all frames in the OCSN have the same size of 39 bytes, no framing or padding is required.

Transport Layer

To access the interconnected components, the network has to transport frames.
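The address layout of the network layer and the resulting routing decision can be sketched in Python as follows. Several details are assumptions made only for this sketch: r is taken as the most significant bit, the groups a to e follow from high to low bits, a group value of zero means that the address ends at the level above, and a switch is identified by the group values on the path from the root. The function names are hypothetical.

```python
GROUPS = 5          # groups a, b, c, d, e
GROUP_BITS = 3      # three bits per group


def decode(addr):
    """Split a 16 bit address into the r bit and the five 3 bit groups."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address must fit into 16 bits")
    r = addr >> 15
    groups = [(addr >> (12 - GROUP_BITS * i)) & 0b111 for i in range(GROUPS)]
    return r, groups


def encode(r, groups):
    """Inverse of decode (assumed bit order: r, a, b, c, d, e)."""
    addr = (r & 1) << 15
    for i, g in enumerate(groups):
        addr |= (g & 0b111) << (12 - GROUP_BITS * i)
    return addr


def next_hop(switch_prefix, dst_groups):
    """Routing decision of one switch for a unicast destination.

    switch_prefix holds the group values on the path from the root to
    this switch (empty for the root). Returns "uplink", "local" or the
    number of the downlink taken from the next group.
    """
    depth = len(switch_prefix)
    if dst_groups[:depth] != list(switch_prefix):
        return "uplink"      # destination lies in another subtree
    if depth == GROUPS or dst_groups[depth] == 0:
        return "local"       # the switch itself is addressed
    return dst_groups[depth]


addr = encode(1, [3, 2, 0, 0, 0])
r, groups = decode(addr)
```

Because a tree offers only one route between two components, the decision needs no routing table. A prefix comparison and one 3 bit field are enough, which fits the goal of a small hardware footprint.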
In this scenario, the network is required to transmit configuration data, request status information, or access some kind of RAM. Because of the small error probability and the fact that frames cannot be reordered while they are transmitted through the network, no connection-oriented transport protocol is required. Instead, a connectionless, UDP-like protocol is responsible for the data transport within the OCSN. The protocol features 8 bit source and destination ports (Figure 7.2) and an 8 bit frame-type field to identify the service at the destination. The maximum payload length is 31 bytes. The frames are routed from source to destination using the network layer. If a service is listening at the destination on the destination port, the payload is processed and an answer is transmitted.

Figure 7.4: OCSN address structure (bits r a1 a2 a3 b1 b2 b3 c1 c2 c3 d1 d2 d3 e1 e2 e3; r=0 broadcast address, r=1 unicast address)

Session Layer

The session layer starts and tears down connections in a connection-oriented protocol. Because the transport layer of the OCSN only specifies a connectionless protocol, the session layer is not required.

Presentation Layer

As in the TCP/IP suite, the presentation layer is merged into the application layer. The main purpose of the merged presentation layer is to ensure that all information in an OCSN frame is in big-endian byte order.

Application Layer

Accessing components in the OCSN requires different application layer protocols. The main distinction between these protocols is whether they require an answer frame or not. Usually it is enough to send one frame to a destination device to set registers or to request information. Still, the application layer defines the structure of the payload. For the communication with a RAM connected to the OCSN, the access mode (read, write), the access size (byte, word, double-word, ...) and the data for a write operation have to be encoded into the payload of an OCSN frame. In a frame sent to a BRAM connected to the OCSN, the first byte of the payload identifies the operation to perform. Bytes eight down to five encode the RAM address and bytes twelve down to nine encode the data word. In the answer frame from the BRAM, the first byte signals what kind of answer this frame holds and bytes eight down to five encode the first data word. If more data words are requested from the BRAM, they are encoded after the first word.

7.2 Support Platform

The support platform combines all system resources of one FPGA board, including off-board extensions, into one platform. Using a distinct FPGA board reduces the space requirements for the reconfigurable platforms because no additional hardware is required there. The reconfigurable platforms can concentrate on providing reconfigurability. Figure 7.5 presents an example support platform with all supported FPGA resources.
These resources are connected through an interface to the OCSN. At the moment the following components are supported:

- GPIO
- BRAM
- DDR RAM

Figure 7.5: Example support platform (labels: FPGA support platform, uplink Ethernet/UART, OCSN switch, GPIO, BRAM, DDR RAM, softcore SoC, downlink Ethernet/UART)

In addition, an uplink and a downlink device exist to connect a host system or other platforms to this FPGA. Two alternative devices are available: one UART-based and one Ethernet-based bridge.

GPIO

The GPIO component is very helpful for reading debug data out of the OCSN and inserting debug data into it. Outgoing GPIO signals can be set to certain values and drive, for example, Light Emitting Diodes (LEDs). By sending status request frames, the settings of a connected Dual Inline Package (DIP) switch can be checked using the polling approach. It would also be possible to implement interrupts by sending out an OCSN frame whenever a DIP switch changes its state.

BRAM

The FPGA used for the support platform has BRAM resources left after much of it has been used for buffers in the OCSN. These BRAMs can be combined to form a BRAM OCSN device. It allows access to the RAM from the OCSN with different access modes. The following access modes are supported at the moment:

- READ{length}: read a data word of length bytes
- WRITE{length}: write a data word of length bytes
- SWAP{length}: atomic swap of a data word of length bytes

The supported values for length are 4, 8, 16, 32, 64 and 128 bytes. For initialising the RAM, two commands are available:

- INIT ZERO: initialise a number of 4 byte words, beginning at a given start address, with zeros
- INIT ONE: initialise a number of 4 byte words, beginning at a given start address, with ones

The following commands are planned as future extensions to support concurrent access to the RAM from different OCSN devices:

- LOCK: lock the device for use by the source of this command only
- UNLOCK: unlock the device for use by everyone; to prevent a deadlock, this is only possible from the device that sent the lock command or from some master device
- LOCK RANGE: lock part of the address space for use by the source of this command only
- UNLOCK RANGE: unlock a previously locked address space
- LIST LOCKS: list all enforced locks

DDR3 RAM

This component uses the same interface and access model as the BRAM device. The difference is that a DDR RAM controller is used instead of a BRAM controller.

UART Bridge

The UART bridge is a very simple option for connecting additional off-board components and additional FPGA boards to a support or reconfiguration platform. It is built from one OCSN interface and a UART. The interface receives an OCSN frame and the UART transmits every byte of the frame through RS232 to the remote device. In the other direction, the UART receives exactly 39 bytes and transmits these bytes as a frame through the OCSN interface. The bridge sends end-of-frame synchronisation bytes to the remote bridge through the UART, using the parity bit to distinguish between data and control bytes. This interconnection method is very slow (max. 2 Mbit/s), but it is stable and requires only three wires.

Ethernet Bridge

For connecting the OCSN to the host system and other FPGA boards, a high-speed connection is essential. The Ethernet bridge encapsulates an OCSN frame into an Ethernet frame and transmits it over a 1 Gbit Ethernet network device. Crossover cables and switches between the Ethernet bridge and the remote station are supported. The maximum bandwidth of 1 Gbit Ethernet cannot be achieved because every Ethernet packet carries a single OCSN frame and is always 60 bytes long, far below the maximum Ethernet payload size of 1500 bytes. Still, a maximum throughput of 465 Mbit/s is possible.

Soft-core SoC

A soft-core SoC consists of at least one processor core and additional components for storing program code and for data input/output. Soft-core SoCs, provided by the support platform, can replace a full-featured host system, such as a server or workstation, for controlling the MRP. At the moment the MRP supports only the PRHS SoC, written by Eckert[5]. The integration into the OCSN has been done by Grebenjuk[37]. The PRHS runs Linux as its OS. Access to the OCSN is implemented through a communicator device and a network card device driver for Linux.

7.3 Reconfiguration Platform

The reconfiguration platform provides the reconfigurable resources for the MRP. The prototype currently uses Xilinx Virtex5 FPGAs and requires the availability of the Xilinx PR design flow. Figure 7.6 presents an example reconfiguration platform. It is divided into a reconfiguration module, supplying many equally sized RMs, and the infrastructure connecting host systems or additional FPGAs.
The reconfiguration module encapsulates all the structure required for runtime reconfiguration into one component. This encapsulation simplifies the instantiation of the runtime reconfiguration on different FPGAs because the FPGA specific requirements can be implemented without interfering with the runtime reconfigurable implementation. The connection infrastructure is basically the same as on the support platform. Bridges to and from the OCSN are used to provide the interconnection functionality. The reconfiguration module uses the granularity solution, presented in Section 6.1.2, to reduce the effects of the granularity problem while partitioning the FPGA into many RMs. These RMs are called Configurable Entity Blocks (CEBs) because they can be configured with entities of the Register Transfer Layer (RTL), not only of the logical layer. These CEBs are interconnected by a CSN for combining them into larger components.

Figure 7.6: Example reconfiguration platform (labels: FPGA reconfiguration platform, uplink Ethernet/UART, OCSN switches, ICAP, reconfiguration module with 16 CEBs, four SW blocks, IOBs, downlink Ethernet/UART)

The Internal Configuration Access Port (ICAP) of Xilinx Virtex{5,6,7} devices is used to configure the CEBs through the OCSN.

ICAP

Similar to the resources of the support platform, the reconfiguration platform has one important device, the ICAP. The ICAP configures the CEBs of the reconfiguration module during runtime of the system. It is connected to the OCSN and accepts up to seven 32 bit configuration words in one OCSN frame. These configuration words are currently written to the ICAP at 50 MHz, but the clock rate can be increased up to 100 MHz. The maximum configuration speed is 381 MB/s at 100 MHz, that is, one 32 bit word per clock cycle with 1 MB counted as 2^20 bytes.

CEB

The CEB is the main building block of the MRP. It is the one component providing the reconfigurability of the system. Different components can be configured into a CEB.

All the CEBs in the reconfiguration module have the same size and provide the same static signal interface to the interconnection network. Figure 7.7 describes this signal interface.

Figure 7.7: CEB Signal Interface (signals: odid (8), ocdebug, icenabled, icreset, idsingle (4), odsingle (4), idbus (128), odbus (128), ic25mhzclk, ic50mhzclk, ic100mhzclk, ic200mhzclk)

Every CEB has four different clock inputs, which reduces the hardware needed for additional clock dividers inside a CEB. A clock divider is only necessary if none of the provided clock rates (25, 50, 100 and 200 MHz) fits the design. The clock signals are generated on the FPGA for system wide usage. They are not distributed through the CSN, but use the dedicated clock lines of the FPGA. After the configuration of a component into a CEB, the state of the component is unknown. To set it to a known state, a reset signal (screset) exists. During the configuration process the values of the input/output signals can fluctuate. To prevent the flooding of the whole MRP with invalid data, the components have to be disabled during the configuration process. All components developed to fit into a CEB have to react to the active high scenable signal. It also starts a component at a specific moment in time. The MRP requires a way to determine which CEBs are already configured and what kind of component is using a CEB. This is achieved through the eight bit odid signal. If the CEB is empty, the signal is not driven by any component. The signal is configured at the FPGA level as a pull up, returning 0xFF in the empty state. Each possible component has been assigned a distinct id, which has to be put onto odid. A debugging signal (scdebug) is also available to connect one CEB to off-chip components, such as an LED or a logic analyser. For receiving and transmitting data from and into a CEB, two kinds of input/output signals exist. The first are simple single lines. In this example, idsingle provides four single input lines and odsingle four single output lines. The second kind of input/output signals are signal clusters. Signal clusters are useful for designing busses or register input/output. In this example the CEB supports four 32 bit signal clusters per direction (idbus, odbus). The number of signals is chosen small enough to be easily routable on the FPGA and large enough to support a wide range of components.
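How a controller might interpret the odid signal and the clustered bus of a CEB can be sketched in Python. Only the empty value 0xFF and the four 32 bit clusters come from the text above. The component ids, the function names and the ordering of the clusters inside the 128 bit bus are assumptions made for this sketch.

```python
EMPTY_CEB_ID = 0xFF  # pull-up value read from an unconfigured CEB

# Hypothetical id assignment; the thesis does not list concrete ids here.
COMPONENT_IDS = {0x01: "FSM", 0x02: "Boolean function"}


def describe_ceb(odid):
    """Classify a CEB by the eight bit value read from its odid signal."""
    if not 0 <= odid <= 0xFF:
        raise ValueError("odid is an eight bit signal")
    if odid == EMPTY_CEB_ID:
        return "empty"
    return COMPONENT_IDS.get(odid, "unknown component")


def split_clusters(bus_value, clusters=4, width=32):
    """Split a 128 bit bus value into four 32 bit signal clusters.

    Cluster 0 is assumed to occupy the least significant bits.
    """
    mask = (1 << width) - 1
    return [(bus_value >> (width * i)) & mask for i in range(clusters)]


states = [describe_ceb(value) for value in (0xFF, 0x01, 0x7A)]
clusters = split_clusters(0x00000004_00000003_00000002_00000001)
```

Reserving 0xFF for the empty state means that an unconfigured CEB can be recognised without any logic inside it, because the pull-up alone produces the value.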

84 7 Multicore Reconfiguration Platform Description CSN To interconnect CEBs to the reconfiguration module, different requirements exist. The signal interface requires at the moment four single signals and four clustered signals for each CEB, but this requirement can change in the future. Because of the possible requirement change, the interconnection network should be scalable in the number of signal lines it can support. Most larger components of the RTL synchronise each other by using a global clock signal. To support such larger components on the MRP, low latency signal lines are very important because the largest latency is responsible for the maximum achievable clock-rate. In this case the clock signals are using dedicated signal lines of the FPGA to connect to each CEB. Still, the data has to travel from one CEB to another. The latency of these transmissions selects the usable clock rates. The network may be divided into fast localised signals, tightly interconnecting a small group of CEBs and long distance signals interconnecting these groups. The last are allowed to have a slightly higher latency. To form larger components one CEB possibly has to connect to multiple different other CEBs or to connect to one other CEB multiple times. These connection schemes require the network to support multipath links and multiple routes from a source to a destination. These requirements suggest a dynamic indirect circuit switched network. Through the dynamic part, connections can easily be changed, rerouted and even shared among CEBs. The indirect aspect reduces the space requirements for the network interface hardware, like done with the OCSN. To use single signals and signals clusters as the main kind of communication a circuit switched network is best suited because the signal lines can just be routed to their destination. It is not necessary to sample the signals and transmit the results in a multibyte frame. This reduces the latency for all signals. 
The following sections describe this network in more detail, using the OSI model.

Physical Layer

The physical layer of the CSN uses the communication infrastructure of the underlying FPGA. The FPGA provides a low-latency network connecting all the CLBs. This network is best suited to work as the physical layer for the interconnection of the CEBs because it has the same base requirements. Additional parameters, enforced by the application in use, have to be implemented inside each CEB.

Data-link Layer

The data-link layer is not necessary in this network because no actual data is transmitted; only a direct connection is established. If an application uses the CSN to transmit data, it has to implement its own data-link layer.

Network Layer

The CSN is an indirect network built out of crossbar switches. A crossbar interconnects all its inputs to its outputs (see Section 5.5). Only one permutation of these connections is possible at any moment. In this network each input has a corresponding output, and two different kinds of inputs/outputs exist: single signals and clustered signals. The inputs/outputs are divided between the connected CEBs and extension devices. The extension device inputs/outputs are used to interconnect the switches. In Figure 7.6 four CEBs are connected to one switch and the switches are interconnected in a grid (see Section 5.3). Because the connections at the end of each row and column of the grid are open, this connection scheme is called a mesh. The number of inputs/outputs of a switch can easily be increased to support more CEBs, more extension devices or more inputs/outputs for each of them, at the cost of a higher area usage on the FPGA.

Figure 7.8 gives a more detailed view of the connection interface of one switch in the example network. The inputs/outputs are numbered from 31 downto 0. Signals 31 downto 28, 27 downto 24, 23 downto 20 and 19 downto 16 are always reserved for connecting CEBs.

Figure 7.8: CSN group (one CSN switch with four connected CEBs)

All switches are programmable through the OCSN by sending them configuration frames for single or clustered signals. Through status requests the MRP controller can read the current crossbar configuration and what kind of components are configured into a CEB. Through the programming interface the MRP controlling device can select which input is connected to which output. By programming different switches, all CEBs connected to all the switches can be interconnected.

Transport, Session, Presentation and Application Layer

All OSI layers above the network layer have to be implemented by the application/component using the CSN for interconnections. The CSN does not provide any interface for a transport protocol or any application layer protocols.

IOB

Like any digital hardware component, the interconnected CEBs have to communicate with the outside world at some point in time. Parameters and results of computations have to be fed into and out of the components. This is done by using IOBs. The IOBs of the MRP are very similar to the IOBs of FPGAs. On FPGAs they are connected to the pins of the chip housing and allow components on the FPGA to communicate with off-chip components. The MRP supports two different kinds of IOBs. Both are connected to the extension ports of a CSN switch and to an OCSN switch.

CSN2OCSN simple bridge

The CSN2OCSN simple bridge maps the signals of the extension ports to internal registers. These registers can be read and written using OCSN network frames. By reading the registers, the values of the connected signal lines can be identified, and by writing them the outgoing signals can be set to specific values. This component is very useful for debugging the CSN because the value of every signal can be read and written. The disadvantage of this bridge is that it cannot react to fast-changing signals because the OCSN requires multiple clock ticks to transmit a frame.

CSN2OCSN bridge

The CSN2OCSN bridge is the preferred IOB for the MRP. It maps a normal OCSN IF to the CSN physical layer. A component in a CEB is connected to the CSN2OCSN bridge with two 32-bit input busses and two 32-bit output busses. One input and one output bus are responsible for data transfer and the others for control lines.
The CEB can create a full OCSN frame by providing data at its output bus and selecting, through the control lines, which part of the frame to set. For example, to set the source and destination addresses of the OCSN frame, the component writes the source address to the upper 16 bit of the data bus and the destination address to the lower 16 bit. Then it selects input zero through the control lines. Reading an OCSN frame works very similarly. The component selects which part of the frame to read through the control lines and reads the data through the data input bus. All control signals of the OCSN IF component are mapped to the control bus within the CSN. All data signals are selectable through the control signals and can be read and written through the data bus.
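The address example above can be illustrated with a small C sketch. Only the bit layout (source address in the upper 16 bit, destination address in the lower 16 bit of the 32-bit data bus) is taken from the text; the function names are hypothetical, and the real bridge is a hardware component driven through control lines rather than a software API.

```c
#include <stdint.h>

/* Hypothetical helper: combine the two 16-bit OCSN addresses into the
 * 32-bit word a component would place on the data bus.
 * Layout from the text: source in bits 31..16, destination in bits 15..0. */
static inline uint32_t ocsn_pack_addresses(uint16_t src, uint16_t dst)
{
    return ((uint32_t)src << 16) | (uint32_t)dst;
}

/* Hypothetical helpers for the read direction. */
static inline uint16_t ocsn_unpack_src(uint32_t word)
{
    return (uint16_t)(word >> 16);
}

static inline uint16_t ocsn_unpack_dst(uint32_t word)
{
    return (uint16_t)(word & 0xFFFFu);
}
```

In hardware this corresponds to a plain concatenation of two 16-bit vectors, so no logic beyond wiring is needed.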

7.4 Operating System Support

A system like the MRP requires some kind of controlling master component, such as a workstation, server or soft-core SoC. But providing the hardware is not enough. The OS of these systems has to support the MRP and the concept of reconfigurable hardware. For the host systems of the MRP, Linux was chosen as the OS because its source code is available as open source and it runs on most platforms, including the PRHS SoC. Linux is a UNIX-like operating system[38]. It is built out of the Linux OS kernel and additional applications. Device drivers extend the Linux kernel and integrate additional hardware and network protocols.

There are two interfaces from the MRP to the host system: an Ethernet bridge (Section 8.2.4) and a native memory-mapped OCSN device for the PRHS SoC. Both have to be integrated into the Linux kernel for accessing the OCSN and the components configured into the CEBs. The OS support is partitioned into the implementation of the network driver and the device driver. The network driver is responsible for the socket interface, which is the interface to the Linux user space. Programmers get access to the OCSN using socket programming. The device driver is responsible for copying the OCSN frames from and to the hardware. For the PRHS memory-mapped I/O device, the driver copies data between memory addresses and internal kernel structures. For the Ethernet bridge this is not necessary because device drivers for Ethernet cards are already available in the kernel. The implementation of the OS support is described in Chapter 9.

Accessing the components connected to the OCSN is currently done through user space programs.
The following programs are available:

lsocsn: list all devices connected to the OCSN
ocsn-ping: check if a device is alive and get its round trip time
ocsn-switch-status: get the status of an OCSN switch (free/used ports, connected devices, received/transmitted frames)
ocsn-file2icap: copy a partial bitfile to an ICAP for configuration
ocsn-file2ram: copy a file to a RAM device
ocsn-ram2file: copy part of a RAM to a file
ocsn-print-ram: print part of a RAM to the output
ocsn-init-ram: initialise part of a RAM to a given value
lscebs: list all CEBs connected to all CSN switches
ocsn-csn-status: get the status of a CSN switch (connected CEBs, if active or not)
ocsn-csn-get-routing: print the routing information of one CSN switch
ocsn-csn-set-single: set the routing for a single signal
ocsn-csb-set-bus: set the routing for a clustered signal
ocsn-csn-ceb-on: activate a configured CEB
ocsn-csn-ceb-off: deactivate a configured CEB

7.5 Design Flow

At the moment the MRP only supports the Xilinx PR design flow (see Section 2.5), which is the base for the MRP design flow. It can be divided into a full design flow, in which all components including the static MRP system are synthesised, placed and routed, and a reduced design flow, in which only the CEB components are synthesised, placed and routed. Figure 7.9 presents the eight-step full design flow.

1. create/adapt the static MRP system in Very High Speed Integrated Circuits HDL (VHDL)
2. add VHDL entities for using as CEB components
3. create the netlist for the static system, using CEBs as black-boxes
4. place and route the static system
5. create bitfile for the whole system with CEBs as black-boxes
6. create netlists for all the CEB components
7. place and route the static system including one CEB component at a time
8. create bitfiles for the whole system, including one CEB component and partial bitfiles for each CEB component and every CEB

Figure 7.9: full MRP design flow

The first five steps are required to create the bitfile for an MRP system without any CEB components. After configuring the created bitfile, all CEBs are empty. The last three steps create bitfiles for all the CEB components. The normal Xilinx PR design flow would create all these components successively. The MRP design flow uses a parallel approach.

The reduced design flow displayed in Figure 7.10 assumes that the MRP static system is already created and running on an FPGA. The already available placement and routing information is used in the reduced design flow to place and route the components for the CEBs only.

1. add VHDL entities for using as CEB components
2. create netlists for all the CEB components
3. place and route the static system including one CEB component at a time
4. create bitfiles for the whole system, including one CEB component and partial bitfiles for each CEB component and every CEB

Figure 7.10: reduced MRP design flow


8 Implementation of the Multicore Reconfiguration Platform

After introducing the MRP in the previous chapter, this chapter describes the implementation of the important MRP components.

8.1 General Components

In the design process of digital circuits some components are reused constantly. These components provide common functionality, like FIFO queues, small BRAM, decoders, and encoders. The general components used throughout the MRP are described in the following subsections.

Clock Domain Crossing

In larger digital circuit designs multiple different clock domains may exist. One clock domain contains all the digital components running at one specific clock rate, for example 25 MHz. Often data has to cross the boundary of two clock domains, differing in speed and polarity. Special actions are required to ensure the integrity of the data. The problem of clock domain crossing is described, among others, by Biddappa[39].

Figure 8.1: Clock Domain Crossing (CDC) component interface (CDC_fifoIF with signals iddata and oddata of width gen_data_size, icwe, icre, ocfull, ocdataavail, icreadclk, icwriteclk, icreset)

The CDC_fifoIF, displayed in Figure 8.1, is a simple component for clock domain crossing, using the solution recommended by Biddappa. It uses a FIFO queue interface to connect to other components, allowing it to replace FIFO queues, which are often used to cross domain boundaries. The usage of FIFO queues is often very expensive because they are built out of a scarce resource, BRAM. Not all designs/components require a queue at the domain boundaries. In these cases the CDC_fifoIF can replace them. Internally a handshake protocol and multiple register stages move the data to the other clock domain. The handshake protocol drives the external FIFO signals ocfull and ocdataavail. The sizes of the data signals (iddata, oddata) are configurable through a generic, a VHDL parameter for configuring individual components.

Dual Port Block RAM

Dual-ported BRAM provides two interfaces to a RAM. Through one interface a component writes data into the RAM while another component reads data from it through the second interface. This is often useful when working on data streams or building FIFO queues. Figure 8.2 describes the signal interface of the dual_port_block_ram component.

Figure 8.2: Dual Port Block RAM interface (dual_port_block_ram with signals icclka, icclkb, icwea, icena, icenb, idaddra and idaddrb of width gen_addr_size, iddataa and oddatab of width gen_width)

The Xilinx tools identify the component as an onboard BRAM if one is available on the FPGA in use. Otherwise, the RAM is built out of logic cells. This kind of implementation allows the flexible usage of this component on any FPGA, without requiring BRAM to be available.

FIFO Queue Component

FIFO queues are a very common component on the RTL. The queues can be used to cross clock boundaries (as described earlier in this section) or to implement buffers. They are often implemented using BRAM components, available on certain FPGAs. This requires the creation of special Intellectual Property (IP) cores for each FPGA. The SimpleFifo, shown in Figure 8.3, implements a simple FIFO using the techniques described by Cummings[40]. It uses the dual_port_block_ram component for saving the queue objects.
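The FIFO design of Cummings passes the read and write addresses between the two clock domains in Gray code. A minimal C sketch of that conversion is given here as a software model only; the SimpleFifo itself is written in VHDL, and the function names are hypothetical.

```c
#include <stdint.h>

/* Binary to Gray code: adjacent counter values differ in exactly one bit. */
uint32_t bin_to_gray(uint32_t b)
{
    return b ^ (b >> 1);
}

/* Gray code back to binary, using a logarithmic prefix XOR over 32 bits. */
uint32_t gray_to_bin(uint32_t g)
{
    uint32_t b = g;
    b ^= b >> 16;
    b ^= b >> 8;
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b;
}

/* Number of differing bits between two words (Hamming distance). */
unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Incrementing a binary counter from 7 to 8 flips four bits at once, whereas the corresponding Gray values differ in a single bit, which is the property the address synchronisation relies on.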
To prevent buffer over- and underflow, the write and read addresses are converted into Gray code and propagated through two register stages into the other clock domain. In Gray code the code distance between two adjacent words is just one (only one bit can change from one Gray count to the next)[40]. This ensures that all changing bits of the address are synchronised into the other clock domain at the same clock tick.

Figure 8.3: SimpleFifo interface (signals iddata and oddata of width gen_width, icwe, icreadenable, ocempty, ocfull, ocaempty, ocafull, icreadclk, icwriteclk, icclkenable, icreset)

The SimpleFifo can be synthesised for any FPGA without the need for a special IP core. The design of the dual_port_block_ram ensures that the Xilinx tools can use BRAM, if available. It supports different read and write clock signals for clock domain crossing. Through the generics gen_width and gen_depth the data width and the maximum number of queue elements can be selected. The thresholds for the ocafull and ocaempty signals are selectable through the generics gen_a_full and gen_a_empty.

8.2 OCSN

The OCSN implementation is divided into multiple components, according to the OSI model.

OCSN Physical Interface Components

The OCSN physical interface consists of the five signals idocsndatain, odocsndataout, icocsnctrlin, ococsnctrlout and icocsnclk. They are used to interconnect all the OCSN devices. Figure 8.4 shows the reception of a single OCSN frame through these five signals. The transmission of a packet works alike.

Figure 8.4: Reception of one OCSN Frame (waveforms of icocsnclk, icocsnctrlin and idocsndatain)

icocsnclk is the clock signal for the whole OCSN on one FPGA. icocsnctrlin and ococsnctrlout are active-low signals that indicate when a transmission is taking place. The transmission in Figure 8.4 starts when icocsnctrlin goes from high to low and ends when it goes from low to high again. The number of required clock ticks varies according to the number of bits transmitted concurrently. The generic data_link determines this number of bits. This simple interface is chosen in favour of a more sophisticated physical interface because it reduces the design complexity of the system. Using a high-speed serial I/O physical interface would require many more components, such as a high-speed serialiser and deserialiser and a special transmission encoding like 8b/10b[41]. The interface to the data link layer consists of 312-bit data input/output signals, control signals for signalling the reception or transmission of the data, and a trigger signal for starting the transmission.

Implementation

The implementation of the OCSN physical layer is done through two components. The ocsn_write component is responsible for transmitting data and the ocsn_read component for the reception of data. ocsn_write is a simple shift register implementing the OCSN physical output interface. The signal interface of ocsn_write is given in Figure 8.5. In addition to the OCSN physical interface it features a 312-bit data input for the OCSN frame and control signals to start the transmission and to signal the end of the transmission (icsend, ocready).

Figure 8.5: OCSN physical transmission component (OCSN_WRITE with signals odocsndata of width data_link, ococsnctrl, icocsnclk, icsend, ocready, icclkenable, icreset, iddata of width 312)

Figure 8.6: OCSN physical reception component (OCSN_READ with signals idocsndata of width data_link, icocsnctrl, icocsnclk, ocreceived, icclkenable, icreset, oddata of width 312)
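As a rough illustration of how the data_link generic influences the transfer time of the shift register, the following C sketch computes the number of clock ticks needed to shift one 312-bit frame. It assumes that every tick moves exactly data_link bits and that no additional ticks are spent on start or end signalling; both are simplifying assumptions, and the function name is hypothetical.

```c
/* Frame size of the physical layer interface, taken from the text. */
#define OCSN_FRAME_BITS 312u

/* Estimated clock ticks to shift one frame, rounding up when the frame
 * size is not a multiple of the link width. Returns 0 for an invalid
 * width of 0. */
unsigned ocsn_frame_ticks(unsigned data_link)
{
    if (data_link == 0) {
        return 0;
    }
    return (OCSN_FRAME_BITS + data_link - 1u) / data_link;
}
```

Under these assumptions an 8-bit link needs 39 ticks per frame, while a purely serial link (data_link = 1) needs 312.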

ocsn_read likewise is a simple shift register implementing the OCSN physical input interface. It works in the opposite direction to ocsn_write. Figure 8.6 displays its signal interface. When a new OCSN frame has been received, its data is only valid for the one clock tick during which the ocreceived signal is high.

OCSN Data-Link Interface Component

The data link layer is implemented in the OCSN IF component. It is responsible for identifying the remote interface and for initiating flow control before the receive buffer overflows. The flowchart in Figure 8.7 describes the identification protocol used.

Figure 8.7: Flowchart of OCSN identification protocol (IF0 and IF1 exchange identify and identity messages)

Both endpoints of the communication send an identification request to the OCSN physical interface. If a remote interface is connected, it responds with an identity response. Sending an identification request is repeated, with a short timeout, until an identification response is received.

The flow control protocol is similarly simple. An example flowchart is given in Figure 8.8. IF1 is transmitting many OCSN frames to IF0. At some point the receive buffer of IF0 will hit an upper bound. At this moment IF0 transmits a wait request to IF1. IF1 stops sending frames as soon as it processes this wait request; until then, some more frames can still be transmitted. Because of these frames, the upper bound cannot be the maximum FIFO queue depth. At some later point in time IF0 has processed most of the frames in its receive buffer and will hit a lower bound. At this moment it transmits a continue request and IF1 starts transmitting again. Both protocols are identified through OCSN frame type zero and the first byte of the payload. Appendix A gives an overview of all available OCSN frame types.

The OCSN IF encapsulates the components of the physical layer. Therefore, it provides the OCSN physical interface to the outside and passes it through to these components.
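The wait/continue flow control described above amounts to a hysteresis on the fill level of the receive buffer. The following C sketch models the receiving side (IF0) only. The type and function names are hypothetical, and the concrete bounds are left to the caller because the text does not give their values.

```c
#include <stdbool.h>

/* Control frame the receiver should send after a fill level change. */
typedef enum { FC_NONE, FC_SEND_WAIT, FC_SEND_CONTINUE } fc_action;

typedef struct {
    unsigned upper_bound;    /* fill level that triggers a wait request */
    unsigned lower_bound;    /* fill level that triggers a continue request */
    bool     remote_stopped; /* true after a wait request has been sent */
} fc_state;

/* Evaluate the current receive buffer fill level and decide whether a
 * wait or continue request has to be transmitted to the remote side. */
fc_action fc_update(fc_state *fc, unsigned fill_level)
{
    if (!fc->remote_stopped && fill_level >= fc->upper_bound) {
        fc->remote_stopped = true;
        return FC_SEND_WAIT;
    }
    if (fc->remote_stopped && fill_level <= fc->lower_bound) {
        fc->remote_stopped = false;
        return FC_SEND_CONTINUE;
    }
    return FC_NONE;
}
```

The upper bound has to leave enough free queue entries for the frames that are still in flight before the sender processes the wait request, which is why it cannot equal the queue depth.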
Figure 8.9 displays the full signal interface of the OCSN IF component. In addition to the OCSN physical interface, it has to provide an interface to the network layer. This interface includes signals for controlling the status of the connection, for working with OCSN frames, for controlling the transmission and reception of frames and for resetting and running the component.

Figure 8.8: Flowchart of OCSN flow control protocol (IF1 sends frames to IF0; when the receive buffer reaches the upper bound IF0 sends wait, when it reaches the lower bound IF0 sends continue)

The following signals are used for controlling the status of the connection between two connected OCSN IF components.

identity: input for the 16-bit OCSN address of the interface
icidentity: this active-high control signal selects whether the identity is automatically set for each transmitted frame
odidentity: 16-bit output of the OCSN address of the remote interface
ocidvalid: active-high validity signal for odidentity

The interface to the network layer consists of the frame and frame controlling signals. It simplifies the usage of OCSN frames by dividing them into individual signals for each frame part.

{id,od}dst: destination address of the OCSN frame
{id,od}src: source address of the OCSN frame
{id,od}dstport: destination port of the OCSN frame
{id,od}srcport: source port of the OCSN frame
{id,od}type: the frame type of this OCSN frame
{id,od}data: the 31byte payload of the OCSN frame

Figure 8.9: OCSN IF signal interface (OCSN_IF with the physical signals idocsndatain and odocsndataout of width data_link, icocsnctrlin, ococsnctrlout, icocsnclk; identity, odidentity, iddst, oddst, idsrc and odsrc of width 16; idtype, odtype, idsrcport, odsrcport, iddstport and oddstport of width 8; iddata and oddata of width 256; icidentity, ocidvalid, icforward, icsend, ocready, ocdataavail, icreaden, icreset, icclken, icclk)

The frame control signals form a simple FIFO queue interface. The active-high ocready signal indicates whether the interface is ready to transmit a new frame. Through the icsend signal, the frame assembled from the frame part signals is transmitted. ocdataavail indicates the availability of OCSN frames in the receive FIFO queue. icreaden removes the first queue element.

The system interface consists of the main clock signal icclk, an active-high asynchronous reset signal icreset and an active-high clock enable signal icclken.
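The frame parts listed above can be mirrored in a C structure, shown here as a sketch for host-side code. The field widths are taken from Figure 8.9 (the payload is modelled with the 256-bit width of iddata/oddata, although the text speaks of a 31byte payload). The type and function names are hypothetical, and the order of the fields inside the 312-bit frame is not specified in this section, so the structure is not meant as a wire format.

```c
#include <stdint.h>
#include <string.h>

/* Field widths in bits as given in Figure 8.9. */
enum {
    OCSN_DST_BITS     = 16,
    OCSN_SRC_BITS     = 16,
    OCSN_TYPE_BITS    = 8,
    OCSN_SRCPORT_BITS = 8,
    OCSN_DSTPORT_BITS = 8,
    OCSN_DATA_BITS    = 256,
    OCSN_FRAME_BITS   = OCSN_DST_BITS + OCSN_SRC_BITS + OCSN_TYPE_BITS +
                        OCSN_SRCPORT_BITS + OCSN_DSTPORT_BITS + OCSN_DATA_BITS
};

/* Software view of one OCSN frame, one member per frame part. */
typedef struct {
    uint16_t dst;
    uint16_t src;
    uint8_t  type;
    uint8_t  srcport;
    uint8_t  dstport;
    uint8_t  data[OCSN_DATA_BITS / 8];
} ocsn_frame;

/* Assemble a frame from its parts; payload bytes beyond the data field
 * are ignored and unused payload bytes are set to zero. */
ocsn_frame ocsn_make_frame(uint16_t dst, uint16_t src, uint8_t type,
                           uint8_t srcport, uint8_t dstport,
                           const uint8_t *payload, size_t len)
{
    ocsn_frame f;
    memset(&f, 0, sizeof f);
    f.dst = dst;
    f.src = src;
    f.type = type;
    f.srcport = srcport;
    f.dstport = dstport;
    if (len > sizeof f.data) {
        len = sizeof f.data;
    }
    if (payload != NULL && len > 0) {
        memcpy(f.data, payload, len);
    }
    return f;
}
```

The widths add up to the 312 bit handled by the physical layer components, which is a useful consistency check between the two interface descriptions.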

Implementation

The OCSN interface is built out of the components ocsn_write, ocsn_read, SimpleFifo, CDC_fifoIF and an FSM controlling all these components. Figure 8.10 displays a simplified block diagram of the OCSN IF buildup.

Figure 8.10: OCSN IF implementation schematic (ocsn_read feeding a register and the SimpleFifo, the FSM, a multiplexer selecting between FSM-generated frames and frames from the CDC component, and ocsn_write)

ocsn_read and ocsn_write are responsible for the physical communication. If an OCSN frame is received, it is cached in a register and the FSM evaluates the frame at the same moment. If the frame belongs to the identification or flow control protocol, the frame is not stored in the FIFO queue. If the frame is a normal OCSN frame, the FSM sets the write enable signal (icwe) of the FIFO queue to append the frame. Through the multiplexer the FSM controls whether a frame from the outside or a control frame generated by the FSM is transmitted through ocsn_write.

Figure 8.11 shows the FSM graph. The FSM starts with the state st_start on the left side. After waiting for the ocsn_write component to become ready, the FSM switches to the st_identify state. In this state it transmits the identify request to the remote interface and switches to st_wait_id to wait for an identity response. The internal signals scsendidentity and scidentityreceived are control flags. The first flag requests that the interface should transmit its own identity and the other shows whether the remote identity has already been received. If the remote interface is identified, the FSM switches to the st_idle state. The st_idle state is the main state of the FSM.

The states st_wait, st_cnt_send and st_wait_send are just intermediate states returning to the st_idle state as soon as an OCSN frame has been successfully sent to the network. All other states are only reachable from st_idle. If a new identify request is received, the FSM switches to the st_identify state. If a wait request is received from the remote interface, the FSM stays in the st_stop state until a continue request is received. If the FIFO queue is almost full, the FSM transmits a wait request in the st_wait state and, if the FIFO is almost empty again, a continue request in st_continue.

Figure 8.11: Graph of the OCSN IF FSM (states st_start, st_identify, st_wait_id, st_idle, st_identity, st_id_send, st_send, st_send_wait, st_stop, st_continue, st_cnt_send, st_wait, st_wait_send; transitions depend on scready, scsendidentity, scidentityreceived, sccdcdataavail, scwait, scalmostfull and scaf)

OCSN Network Component

The OCSN switch implements the network layer of the OCSN. It uses the OCSN IF of the previous section to provide seven ports for interconnecting devices, including additional switches. Because of the addressing scheme introduced in Section 7.1.3, seven is the maximum number of ports at one switch. Figure 8.12 displays the signal interface of an OCSN switch.

Figure 8.12: signal interface of an OCSN Switch (OCSN_Switch_7Port with signals icocsnclk, identity of width 16, idocsndatain and odocsndataout of width 7*data_link, icocsnctrlin and ococsnctrlout of width 7, odled of width 7, icreset, icclken)

Switches are devices of the OCSN too and, as such, require their own address, given by the identity signal. odled is a debug interface showing at which ports a remote interface has been detected. Devices are connected through the OCSN physical signal interface. The switch implements the same interface as an OCSN IF, but has seven control signals and seven times data_link data signals. data_link is the number of data signals for one OCSN IF. The icocsnclk is shared by all the OCSN devices.

The main task of a switch is routing incoming OCSN frames according to their destination address to another port. This includes forwarding frames to other connected switches. Because of the tree structure, a switch has to identify its uplink switch, which can be connected to any of the seven ports. A connected switch A is the uplink of a switch B if the address of B is a postfix of the address of A. The same comparison has to be done for the destination address of each incoming OCSN frame. The addr compare component, shown in Figure 8.13, is responsible for this comparison process. Two OCSN addresses are fed into the component and it calculates whether idaddr2 is a postfix of idaddr1.
It uses a chain of multiplexers to compare every sub-part of the OCSN addresses, leading to very long signal propagation delays and reducing the maximum clock rate for an OCSN switch.

Figure 8.13: signal interface of the addr compare component (inputs idaddr1 and idaddr2, outputs isnet and ocvalid)

The alternative is to implement the component clock triggered and invest multiple clock cycles in the comparison. This would increase the complexity of the FSM controlling the OCSN switch. Furthermore, the comparison of two addresses could require a different number of clock cycles, making it harder to calculate the actual switch throughput. The multiplexer approach is used in this work because a simpler implementation is better suited for a prototype system than the higher-performance solution.

While forwarding OCSN frames, multiple problems can occur, which have to be addressed by the switch. If multiple received frames have the same destination address, the switch has to select one for transmission at a time to prevent a deadlock. The transmission of the frames has to occur as soon as possible and no starvation of interface ports may take place. No frame drop is allowed to occur on switches other than the root switch.

Figure 8.14: OCSN switch implementation schematic (seven OCSN IF components, OCSN IF 0 to OCSN IF 6, each with its own FSM, FSM0 to FSM6, and addr compare components (ac), plus the uplink check and the main FSM)

Figure 8.14 gives a simplified overview of the OCSN switch implementation. Each of the seven OCSN IF components has an FSM connected. For each port six addr compare components (ac) calculate whether any incoming frame is destined for it.
Another seven addr compare components compare the remote interface addresses of each switch port with the address of the switch to identify the uplink port of this switch. The FSMs implement, together with the main FSM, a snapshot-based pulling algorithm. The algorithm ensures fairness by saving the availability of incoming frames of each OCSN port in a snapshot. Every available incoming frame is pulled to its destination port in a round-robin manner. Once the snapshot is processed, another one is created. Listing 8.1 displays this algorithm in a C-like pseudo language.

Lines 3 to 6 are responsible for taking the snapshot by saving the data available signal from each OCSN port and marking each port as not transmitted. In lines 8 to 44, two nested for loops, with the indices s for the source port and d for the destination port, walk through all port combinations. The snapshot is tested for any port combination that has an available and not yet transmitted incoming frame. If source and destination port are the same and the destination address of the frame is the address of the switch, the destination of the frame is the switch itself and the frame has to be processed appropriately. Processing such a frame only if source and destination port are the same ensures that it is processed once. If source and destination ports differ and the destination of the frame at source port s is a sub-address of the remote address at destination port d, the frame is forwarded to d. If d is identified as the uplink port of the switch and the destination of the frame at source port s is not a sub-address of any remote address, the frame is forwarded to d. After working through all ports in the snapshot, all frames are removed from the incoming queue. Frames not transmitted yet are dropped. This happens at the root switch only because all other switches have an uplink port, to which all not directly routable frames are sent.

The hardware implementation of this algorithm uses two different kinds of FSMs. The main FSM takes the snapshot and removes frames from the incoming queues. It synchronises the seven FSMs of the second type.
Each of these FSMs is responsible for one OCSN port. They test whether incoming frames in the snapshot from any port are destined for their assigned port, and implement all the tests described in lines 8 to 44 of Listing 8.1. Partitioning the algorithm into multiple FSMs makes its implementation straightforward and clear.

     1  while(1) {
          // create the snapshot, save which ports have data available
     3    for(int i=0; i<7; i++) {
            snapshot[i].avail = port[i].dataavail;
     5      snapshot[i].transmitted = 0;
          }
     7    // pull frames from source (s) to destination (d) ports
          for (int d=0; d<7; d++) {
     9      for (int s=0; s<7; s++) {
              // only do something if a frame is available and not transmitted yet
    11        if (snapshot[s].transmitted==0 && snapshot[s].avail==1) {
                // destination and source port are the same and the dest.
    13          // address is the same as the switch address of port d
                if (d==s && port[s].frame.dst == switch.address) {
    15            // do something according to the frame type, destination port and payload
                  // e.g. send a ping response
    17          } else
                // if destination and source port differ and the
    19          // destination address is a subaddr of the remoteaddr of
                // port d
    21          if (subaddr(port[s].frame.dst, port[d].remoteaddr)) {
                  // forward frame to this port
    23            send(d, port[s].frame);
                  snapshot[s].transmitted = 1;
    25          } else
                // if d is the uplink port and the frame is not destined for any other port
    27          // forward it to d
                if (uplink(d)==1 && (
    29              !subaddr(port[s].frame.dst, port[(d+1)%7].remoteaddr) &&
                    !subaddr(port[s].frame.dst, port[(d+2)%7].remoteaddr) &&
    31              !subaddr(port[s].frame.dst, port[(d+3)%7].remoteaddr) &&
                    !subaddr(port[s].frame.dst, port[(d+4)%7].remoteaddr) &&
    33              !subaddr(port[s].frame.dst, port[(d+5)%7].remoteaddr) &&
                    !subaddr(port[s].frame.dst, port[(d+6)%7].remoteaddr)
    35            )
                ) {
    37            // forward frame to this port
                  send(d, port[s].frame);
    39            snapshot[s].transmitted = 1;
                }
    41        }
            }
    43    }
          // remove frames in snapshot from fifo queue
    45    for(int i=0; i<7; i++) {
            if (snapshot[i].avail==1) {
    47        snapshot[i].avail = 0;
              port[i].removefromqueue();
    49      }
          }
    51  }

Listing 8.1: basic snapshot based pulling algorithm

OCSN Application Components

The components of the OCSN application layer are connected to OCSN switches through OCSN interfaces. All of them share the same basic structure, consisting of an OCSN IF and an FSM that processes the incoming data. Figure 8.15 displays this basic structure. The device has the OCSN physical signal interface as its minimum set of input/output signals. More signals are added according to the application specific hardware part, such as the GPIO pins of an OCSN GPIO device. The FSM is divided into a general and an application specific part. The application specific part implements actions for incoming OCSN frames that are specific to this device, such as reading and writing internal registers or RAM. The general part implements actions for OCSN frames that are common to all OCSN devices. At the moment, this includes reactions to ICMP ping requests only. Through ICMP ping requests, the identity of an OCSN component can be determined.

Figure 8.15: OCSN application component basic schematic (signals: idocsndatain, icocsnctrlin, odocsndataout, odocsnctrl; blocks: OCSN IF, FSM, application specific hardware)

OCSN BRAM device

The VHDL description of the application specific part is very similar to the description of the dual ported block RAM described earlier, but it uses only one port for read and write access. Each of the supported frames, as described in Section 7.2.2, corresponds to a state in the application specific part of the FSM. Data read from or written to the BRAM has to be encoded into the payload of OCSN frames. The address to read from or to write to is also encoded into the payload. The main function of the FSM states is to read the requested number of bytes from the RAM and write them into the payload of the frame or, the other way round, to write the given number of bytes from the frame to the RAM.

OCSN ICAP device

The ICAP device takes the number of bytes to write and the bytes themselves from an OCSN frame. The FSM always writes 32 bit data words to the ICAP component at 50 MHz.

OCSN GPIO device

The GPIO device maps registers to external input and output pins. The FSM takes bytes from an OCSN frame and writes them into internal registers, leading to a change in the GPIO pins. If the status of the input pins is requested, the FSM returns the internal register connected to these pins.

OCSN PRHS device

The OCSN PRHS device connects the OCSN to the PRHS SoC through a memory mapped input/output interface. The implementation is described by Grebenjuk[37].
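The forwarding tests of Listing 8.1 (match the remote address range of a port, otherwise fall back to the uplink) can be illustrated with a small software model. The following C sketch is not the VHDL implementation. The prefix based `subaddr` test, the `prefix_len` field and the function name `select_output_port` are assumptions made for illustration, because the OCSN address format is defined elsewhere in the thesis.

```c
#include <stdint.h>

#define NUM_PORTS 7

/* Hypothetical port description: remote address plus the number of
 * leading address bits that identify the subtree behind the port. */
struct port_cfg {
    uint16_t remoteaddr;
    unsigned prefix_len;   /* 0..16, assumption for this sketch */
    int      is_uplink;
};

/* Assumed semantics of subaddr(): dst lies in the range of remote
 * if the first prefix_len bits of both addresses agree. */
int subaddr(uint16_t dst, uint16_t remote, unsigned prefix_len)
{
    if (prefix_len == 0)
        return 1;
    uint16_t mask = (uint16_t)(0xFFFFu << (16 - prefix_len));
    return (dst & mask) == (remote & mask);
}

/* Returns the output port for a frame, or -1 if nothing matches
 * and no uplink port is configured. */
int select_output_port(const struct port_cfg ports[NUM_PORTS], uint16_t dst)
{
    int uplink = -1;
    for (int d = 0; d < NUM_PORTS; d++) {
        if (ports[d].is_uplink) {
            uplink = d;            /* remember the fallback port */
            continue;
        }
        if (subaddr(dst, ports[d].remoteaddr, ports[d].prefix_len))
            return d;              /* frame belongs behind port d */
    }
    return uplink;                 /* not destined for any other port */
}
```

The model makes the role of the uplink test explicit: a frame only leaves through the uplink when none of the other ports claims its destination address.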

OCSN Ethernet Bridge

The OCSN Ethernet Bridge device consists of the basic OCSN device structure, an Ethernet MAC IP core and two synchronised FSMs for controlling the transmission and reception of data. Figure 8.16 displays both FSMs. The numbers at the beginning of the transition labels set the priority of each transition. The FSMs implement a simple synchronisation protocol (shown in Figure 8.17) to ensure that the Ethernet MAC addresses of both endpoints are known to each other.

Figure 8.16: OCSN Ethernet Bridge FSMs. (a) Transmission FSM with the states st_start, st_idle, st_discover, st_sel_ack, st_prepare, st_ocsn, st_send and st_wait. (b) Reception FSM with the states st_start, st_idle, st_receive, st_check1, st_check2 and st_send_frame; it accepts frames addressed to idinitialmac with frame type 0x81fc and the OCSN operation op_selection or op_ocsn_frame.

The OCSN2Ethernet bridge starts by sending discovery Ethernet frames through the Ethernet MAC IP core every second. If a host system is available on the other side of the connection, or connected to the same Ethernet switch, it answers with a selection frame to the MAC address of the OCSN2Ethernet bridge. The OCSN2Ethernet bridge confirms the reception of the selection frame by sending a selection ack frame. After this handshake, every OCSN frame is encapsulated into an Ethernet frame and transmitted to the remote device. The FSMs do not support answering OCSN ping frames.
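The discovery handshake described above can be modelled as a small state machine. The sketch below is a software illustration of the bridge side only. The state, event and action names are hypothetical and do not correspond one-to-one to the FSM states of Figure 8.16; ignoring OCSN frames before the handshake has finished is also an assumption of this sketch.

```c
/* Bridge-side view of the handshake:
 * discover -> (selection received) -> selection ack -> forwarding. */
enum bridge_state  { BR_DISCOVER, BR_SEND_SEL_ACK, BR_FORWARDING };
enum bridge_event  { EV_TIMER_1S, EV_SELECTION_RECEIVED,
                     EV_SEL_ACK_SENT, EV_OCSN_FRAME };
enum bridge_action { ACT_NONE, ACT_SEND_DISCOVER,
                     ACT_SEND_SEL_ACK, ACT_SEND_ENCAPSULATED };

struct bridge {
    enum bridge_state state;
};

/* Consume one event and return the action the bridge should take. */
enum bridge_action bridge_step(struct bridge *b, enum bridge_event ev)
{
    switch (b->state) {
    case BR_DISCOVER:
        if (ev == EV_TIMER_1S)
            return ACT_SEND_DISCOVER;     /* repeat discovery every second */
        if (ev == EV_SELECTION_RECEIVED) {
            b->state = BR_SEND_SEL_ACK;   /* remote MAC address is known now */
            return ACT_SEND_SEL_ACK;
        }
        return ACT_NONE;                  /* assumption: ignore OCSN frames */
    case BR_SEND_SEL_ACK:
        if (ev == EV_SEL_ACK_SENT)
            b->state = BR_FORWARDING;
        return ACT_NONE;
    case BR_FORWARDING:
        if (ev == EV_OCSN_FRAME)
            return ACT_SEND_ENCAPSULATED; /* wrap frame into Ethernet */
        return ACT_NONE;
    }
    return ACT_NONE;
}
```

Feeding the events timer, selection received, ack sent and OCSN frame in this order walks the model through the complete protocol of Figure 8.17.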

Figure 8.17: OCSN Ethernet Discovery Protocol (message sequence between Host and OCSN2Ethernet bridge: discover, selection, selection ack)

OCSN UART Bridge

Like all application devices, the OCSN UART Bridge is based on the basic application device structure of Figure 8.15. The application specific hardware consists of a UART component and another FSM, which controls the incoming data from the UART. No special handshake protocol is implemented. The device starts transmitting through the UART as soon as an OCSN frame arrives, and builds an OCSN frame out of the data coming in from the UART. Sending an end-of-frame byte, identified through the parity bit, is the only synchronisation method used between the local and the remote bridge component.

8.3 CSN

Like the description of the OCSN implementation, the description of the CSN implementation is divided into different components according to the OSI model. The required OSI layers have already been described earlier.

Physical Layer Implementation

The CSN uses the interconnection network of the underlying FPGA. This reduces the implementation complexity of the CSN physical layer. The signal interface used to communicate through the CSN is its only implementation specific part. It has already been described earlier.

Network Layer Components

The CSN is an indirect network with crossbar switches as the main network components. Application layer devices connect to the crossbar switches, and further crossbar switches can be attached to extend the network. Figure 8.18 displays the connection schema of one CSN crossbar switch. There are dedicated ports for connecting CEBs, and dedicated extension ports for connecting switches and application layer devices. Each device is connected with four single signal lines and four clustered, or bus, signal lines. One bus line is 32 bit wide.

Figure 8.18: Crossbar Interconnection Schema

The CSN crossbar switch requires a complex signal interface to support this kind of connection schema. Figure 8.19 presents this signal interface. The first six signals on the left side belong to the OCSN physical interface, because the routing table of the CSN crossbar switch is programmable through the OCSN. Additional status information concerning CEBs can be requested through the OCSN too.

icswid identifies all connected switches. Its width is eight bits times the number of connectable switches. Eight identifier bits are available for every switch, limiting the number of switches in one CSN to 256. Each switch connects to this signal, starting with the top switch at bits $8 \cdot nr\_sw - 1$ down to $7 \cdot nr\_sw$.

ocresetceb and ocenabled are control signals to the CEBs. The first resets the component configured into the CEB to a known state, the second enables the clock for the component. The bit width of both signals equals the number of connectable CEBs.

Figure 8.19: CSN Crossbar Switch Signal Interface (OCSN side: idocsndatain and odocsndataout of width data_link, icocsnctrlin, ococsnctrlout, icocsnclk, identity of width 16, icclkenable, icreset, icclk; CSN side: iccebid of width nr_cebs*8, idctrl and odctrl of width 2**ctrl_lines_single, idbus and odbus of width 2**ctrl_lines_bus*bus_size, icswid of width nr_sw*8, ocresetceb and ocenabled of width nr_cebs)

iccebid is the same as icswid but identifies the connected CEBs. The eight bit width per CEB limits the number of CEBs on a reconfiguration platform to 256, but this value is easily extended if necessary.

idctrl, odctrl, idbus and odbus are the data signals of the CSN. The first two have a bit width of $2^{ctrl\_lines\_single}$ and the latter two of $2^{ctrl\_lines\_bus} \cdot bus\_size$. At the moment there are five control lines for single signal lines and five control lines for clustered or bus signal lines. The bus width is 32. Eight components can connect to one crossbar switch, so the $2^5 = 32$ lines of each type are shared by eight components, leading to four signals of each type for one component. The components connect to the crossbar switch according to the connection schema of Figure 8.18.

Implementation

Figure 8.20 displays the main components of a CSN crossbar switch. Its main structure resembles the basic structure of an OCSN application layer component. An OCSN interface and an FSM manage the connection to the OCSN. In this example, the number of single and cluster control lines is reduced to two, which simplifies the display of all required components: the more control lines, the more components are required. With two control lines, four signal lines or signal clusters can be addressed. In this example, four outgoing single signal lines are shown on the left side and four outgoing clustered signals on the right. Each of these outputs is connected to the output port of a multiplexer. The incoming signal lines are connected to the input ports of the multiplexers.
Through a connected routing register, the signal that is passed through to the output is selected. The outgoing signals for resetting and enabling CEBs and the incoming signals for CEB and switch identifiers are connected to registers too. All available registers, except the identification registers, can be set by sending special OCSN frames to the switch. This is how the routing is programmed.
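The structure of multiplexers and routing registers can be illustrated with a small software model. This is only a sketch under assumptions: the names `csn_crossbar`, `crossbar_route_single`, `crossbar_route_bus` and `crossbar_eval` are hypothetical, and the model uses the prototype parameters given in the text (five control lines, hence 32 lines of each type, and a bus width of 32 bit).

```c
#include <stdint.h>

#define CTRL_LINES 5                    /* control lines per routing register */
#define NUM_LINES  (1u << CTRL_LINES)   /* 2^5 = 32 lines of each type */

/* One routing register per output selects which input is passed through. */
struct csn_crossbar {
    uint8_t route_single[NUM_LINES];
    uint8_t route_bus[NUM_LINES];
};

/* Program one routing register, as an OCSN frame would do in hardware. */
void crossbar_route_single(struct csn_crossbar *x, unsigned out, unsigned in)
{
    x->route_single[out % NUM_LINES] = (uint8_t)(in % NUM_LINES);
}

void crossbar_route_bus(struct csn_crossbar *x, unsigned out, unsigned in)
{
    x->route_bus[out % NUM_LINES] = (uint8_t)(in % NUM_LINES);
}

/* Combinational behaviour: every output multiplexer forwards the input
 * selected by its routing register. */
void crossbar_eval(const struct csn_crossbar *x,
                   const uint8_t idctrl[NUM_LINES],
                   const uint32_t idbus[NUM_LINES],
                   uint8_t odctrl[NUM_LINES],
                   uint32_t odbus[NUM_LINES])
{
    for (unsigned o = 0; o < NUM_LINES; o++) {
        odctrl[o] = idctrl[x->route_single[o]];
        odbus[o]  = idbus[x->route_bus[o]];
    }
}
```

Because every output has its own multiplexer, one input can be routed to several outputs at the same time, while each output always has exactly one source.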

Figure 8.20: CSN Crossbar Switch Implementation Schematic (inputs idctrl(3 downto 0) and idbus(127 downto 0); one multiplexer with routing register for each of the outputs odctrl(0) to odctrl(3) and odbus(127 downto 96), odbus(95 downto 64), odbus(63 downto 32), odbus(31 downto 0); OCSN IF and FSM with reset register, enable register, CEB ID and switch ID registers)

Application Layer Components

The application layer components of the CSN are divided into the CEBs and other extension devices. At the moment only one kind of extension device is available: the OCSN2CSN bridge, which is used to communicate with the outside world.

CEB

The interface of the CEBs has already been described in Section 7.3. The implementation is application specific and is not described here.

OCSN2CSNsimple Bridge

Both OCSN2CSN bridges are gateways between the packet switched OCSN and the circuit switched CSN. Therefore, they require a physical OCSN signal interface and a physical CSN signal interface. Figure 8.21 displays these signal interfaces. The OCSN interface is the same as for any other OCSN device and enables the bridge to connect to an OCSN switch or directly to any other OCSN application layer component. The CSN signal interface is designed to connect directly to the extension ports of a CSN crossbar switch.

Figure 8.21: CSN2OCSN Bridge Signal Interface (OCSN side: idocsndatain and odocsndataout of width data_link, icocsnctrlin, ococsnctrlout, icocsnclk, icreset, icclkenable, identity of width 16, icclk; CSN side: idsingle and odsingle of width 4, idbus and odbus of width 4*bus_size)

The OCSN2CSNsimple Bridge is implemented as an OCSN application layer device, as introduced above. It supports four different OCSN network frames:

readsingle    returns the value of the idsingle lines
writesingle   sets the value of the odsingle lines
readbus       returns the value of the idbus lines
writebus      sets the value of the odbus lines

The values returned are sampled at the moment the OCSN frame is processed by the bridge.

OCSN2CSN Bridge

The structure of the OCSN2CSN bridge is nearly the same as that of the OCSN2CSNsimple bridge. The signal interface is the one displayed in Figure 8.21, and the bridge is also an OCSN application layer component. The difference is that the OCSN2CSN bridge enables a CEB to create and transmit a full OCSN frame, and to receive a full OCSN frame. To create the OCSN frame, the following signal mapping on the CSN physical layer is used:

idbus(31 downto 0)    data input from the CSN
odbus(31 downto 0)    data output to the CSN
idbus(32)             directly mapped to the OCSN IF icsend signal
idbus(33)             directly mapped to the OCSN IF icreaden signal
idbus(63 downto 60)   request to which register the incoming data should be written
idbus(59 downto 56)   request which register to put on the output data bus
odbus(32)             directly mapped to the OCSN IF ocidvalid signal
odbus(33)             directly mapped to the OCSN IF ocready signal
odbus(34)             directly mapped to the OCSN IF ocdataavail signal

The CEBs can use this interface to create or read an OCSN frame. Table 8.1 describes the selectable registers. New values are written to the register at the next clock tick.

Address   Register
0000      source address and destination address
0001      source port, destination port and frame type
0010      bits 31 downto 0 of OCSN payload
0011      bits 63 downto 32 of OCSN payload
0100      bits 95 downto 64 of OCSN payload
0101      bits 127 downto 96 of OCSN payload
0110      bits 159 downto 128 of OCSN payload
0111      bits 191 downto 160 of OCSN payload
1000      bits 223 downto 192 of OCSN payload
1001      bits 255 downto 224 of OCSN payload
rest      identity of the remotely connected OCSN device

Table 8.1: Address to register mapping

After an OCSN frame has been created, it is transmitted by setting the icsend signal to high. If an OCSN frame is available, it can also be read through this interface. The interface is necessary because the CSN currently features only four 32 bit busses and four single lines for each connected component. One OCSN frame is 312 bit wide and has to be mapped to fewer signals.
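To make the register interface of the OCSN2CSN bridge more concrete, the following C sketch packs the idbus fields listed above into one 64 bit word and checks that the registers of Table 8.1 add up to the 312 bit of an OCSN frame. It is a software illustration only. The function and macro names are hypothetical, only bits 63 downto 0 of the idbus are modelled, and the field widths of 16 bit per address and 8 bit per port and frame type are assumptions taken from the socket structure shown later in Listing 9.2.

```c
#include <stdint.h>

/* Bit positions taken from the signal mapping in the text. */
#define IDBUS_SEND_BIT     32   /* mapped to icsend */
#define IDBUS_READEN_BIT   33   /* mapped to icreaden */
#define IDBUS_RDSEL_SHIFT  56   /* idbus(59 downto 56): register to read */
#define IDBUS_WRSEL_SHIFT  60   /* idbus(63 downto 60): register to write */

/* Some register addresses from Table 8.1. */
enum {
    REG_ADDR       = 0x0,   /* source and destination address */
    REG_PORTS_TYPE = 0x1,   /* ports and frame type */
    REG_PAYLOAD0   = 0x2,   /* payload bits 31 downto 0 */
    REG_PAYLOAD7   = 0x9    /* payload bits 255 downto 224 */
};

/* Build the lower 64 bit of idbus for one clock cycle. */
uint64_t idbus_pack(uint32_t data, unsigned wr_reg, unsigned rd_reg,
                    int send, int readen)
{
    uint64_t w = data;
    w |= (uint64_t)(send   ? 1 : 0) << IDBUS_SEND_BIT;
    w |= (uint64_t)(readen ? 1 : 0) << IDBUS_READEN_BIT;
    w |= (uint64_t)(rd_reg & 0xFu) << IDBUS_RDSEL_SHIFT;
    w |= (uint64_t)(wr_reg & 0xFu) << IDBUS_WRSEL_SHIFT;
    return w;
}

/* Frame width implied by Table 8.1 under the assumed field widths. */
unsigned ocsn_frame_bits(void)
{
    return 2 * 16    /* source and destination address */
         + 3 * 8     /* source port, destination port, frame type */
         + 8 * 32;   /* payload registers 0010 to 1001 */
}
```

A CEB would write the ten frame registers one after another, each in its own clock cycle, and raise the send bit once the frame is complete.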

One problem arises from the fact that each CEB can be operated at a different clock speed, and this clock speed is not required to match the clock speed of the OCSN2CSN bridge. If the clock signals do not match, the CDC problems described earlier arise. Different solutions exist to ensure that the data is correctly saved into the internal registers:

- The interface can be extended by read and write acknowledge signals. These acknowledge signals ensure that the data can correctly cross the clock boundaries, like a CDC component does. This requires additional hardware in the CEBs and the OCSN2CSN bridge for handling the acknowledge signals.
- Using clock speed selection lines instead of acknowledge signals would reduce the hardware requirements within a CEB, because no FSM is required to handle the acknowledge signals, but it would require the usage of special BUFG-MUX components in the OCSN2CSN bridge. These special components are multiplexers dedicated to the global clock lines of the FPGA and are limited in number. This approach is only feasible if the number of clock signals and the number of OCSN2CSN bridge components is very small.
- The simplest solution is to reduce the flexibility of the overall design and determine one fixed clock rate for communication with OCSN2CSN bridges. This increases the hardware requirements in a CEB only if that CEB runs at a different clock rate than the OCSN2CSN bridge.

For the prototype of the MRP, the last option is chosen because the implementation complexity is very small, and using a simple interface without additional control signals reduces the error probability in CEB implementations. The fixed clock rate is 25 MHz at the moment.

9 Operating System Support Implementation

Section 7.4 described the overall idea of the OS support for the MRP. At the moment, only support for the OCSN is required to interact with the MRP, especially with the CEBs. Linux is chosen as the OS for the host system of the prototype. It is a UNIX-like OS[38] and is divided into the Linux kernel and user applications. The current kernel version is [...]. The MRP operating system support requires adapting the Linux kernel and writing user applications for managing the different tasks of the MRP. Robert Love[42] gives a good introduction to Linux kernel development.

The Linux OS has different ways of extending its functionality. The main and most used way is writing device drivers. These device drivers interact with hardware devices connected to the system, and integrate them into the Linux kernel as character, block or network devices. Character and block devices are represented as ordinary files in the Linux device tree and require the implementation of at least the open, read, write and release callback functions. A network device driver requires read, write and poll callbacks. The kernel uses these callback functions to interact with the hardware devices.

Another extension point of the Linux kernel are network drivers. Network drivers are different from network device drivers. While the latter interact with hardware, network drivers implement the BSD socket API for every supported network. This includes creating a kernel structure representing the addressing schema of the network, and callbacks for bind, connect, release, accept, listen, poll, sendmsg and recvmsg. The socket interface allows user space applications to open sockets and to transmit and receive data through the network. Common network drivers of the Linux kernel are IPv4, IPv6, AppleTalk and Ethernet. All drivers of the Linux kernel register at least one C structure with the kernel.
These C structures contain configuration parameters, like names and sizes of other structures, and function pointers to callbacks.

The OS support for the MRP uses a device driver and a network driver. The network driver for the OCSN allows user applications to directly create, transmit and receive OCSN frames. The frames are en-/decapsulated by the network driver into/from Ethernet frames and transmitted/received using the Ethernet network driver. If the OCSN is connected natively to the host system, for example using the PRHS SoC, an OCSN network device driver interacts with the OCSN network interface hardware. The driver fetches received frames from the interface hardware and encapsulates them into Ethernet frames. The Ethernet frames are passed to the OCSN network driver, which delivers each frame to the corresponding user space process. A frame transmitted from a user space application is first processed by the OCSN network driver and then delivered to the network interface connected to the OCSN.

9.1 OCSN Network Driver

The first part of the network driver initialisation is registering a new network protocol with the Linux kernel, giving its name and the size of its socket data structure (Listing 9.1).

    static struct proto ocsn_proto = {
        .name     = "OCSN",
        .owner    = THIS_MODULE,
        .obj_size = sizeof(struct ocsn_sock)
    };

Listing 9.1: OCSN protocol structure

The ocsn_sock structure represents a network socket. In the OCSN context it consists of the basic kernel socket structure, src and dst address, src and dst port, and the application layer frame type, as presented in Listing 9.2.

    struct ocsn_sock {
        struct sock    sk;             /* basic kernel socket, must come first */
        unsigned short ocsn_dst;
        unsigned short ocsn_src;
        unsigned char  ocsn_src_port;
        unsigned char  ocsn_dst_port;
        unsigned char  protocol;       /* application layer frame type */
    };

Listing 9.2: OCSN socket structure

The basic socket structure sk holds information about the incoming or outgoing network device and a queue for incoming network frames.

The second initialisation step is registering a new sub-packet of an Ethernet packet, with a fixed Ethernet frame type of ETH_P_OCSN (0x81fc) and the callback function ocsn_rcv. This packet type is represented by the structure displayed in Listing 9.3.

    static struct packet_type ocsn_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_OCSN),
        .func = ocsn_rcv
    };

Listing 9.3: OCSN packet structure

This step ensures that all incoming Ethernet frames of type ETH_P_OCSN are forwarded to this network driver by calling the ocsn_rcv function with the Ethernet frame as a parameter.
The ocsn_rcv function is responsible for processing the incoming Ethernet frames: it extracts the OCSN frame from the payload and finds the destination socket in a list of sockets by comparing the destination address and destination port of the incoming frame with those of every existing socket. If the OCSN is connected to the host system through an OCSN Ethernet bridge, ocsn_rcv also has to respond according to the discovery handshake protocol shown in Figure 8.17.

The last step registers the socket interface of the network driver with the kernel. The implemented interface is identified by the structure given in Listing 9.4.

    static const struct proto_ops ocsn_dgram_ops = {
        .family     = PF_OCSN,
        .owner      = THIS_MODULE,
        .release    = ocsn_release,
        .bind       = ocsn_bind,
        .connect    = sock_no_connect,
        .socketpair = sock_no_socketpair,
        .accept     = sock_no_accept,
        .getname    = sock_no_getname,
        .poll       = datagram_poll,
        .ioctl      = sock_no_ioctl,
        .listen     = sock_no_listen,
        .shutdown   = sock_no_shutdown,
        .setsockopt = sock_no_setsockopt,
        .getsockopt = sock_no_getsockopt,
        .sendmsg    = ocsn_sendmsg,
        .recvmsg    = ocsn_recvmsg,
        .mmap       = sock_no_mmap,
        .sendpage   = sock_no_sendpage,
    };

Listing 9.4: OCSN socket interface structure

Only the bind, release, poll, sendmsg and recvmsg callbacks are implemented, because the OCSN does not feature a connection oriented transmission protocol.

bind: The bind function creates a persistent OCSN socket with a fixed OCSN src port. This src port identifies the user space application, and every OCSN frame received with the same destination address is delivered to this socket. The user application can choose a new random src port or request a specific port, if it is available.

release: The release function removes a previously created OCSN socket from the list of sockets and frees the memory it used.

poll: Poll uses a standard datagram polling function.

sendmsg: The sendmsg function creates an OCSN frame out of a given address structure and data buffer. It creates the kernel structure for transmitting Ethernet frames and passes this structure to the network device for transmission.

recvmsg: The recvmsg function is called for receiving data from an OCSN socket. It fetches a received frame from the socket queue and creates an OCSN address structure and data buffer from it.
These are returned to the user application.
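The socket lookup performed by ocsn_rcv, as described above, can be sketched in plain user space C. The structure below mirrors the address fields of Listing 9.2 but leaves out the kernel socket. The names `ocsn_endpoint` and `ocsn_find_socket`, and the choice to match the destination of the frame against the local (source) address and port of each socket, are assumptions for illustration and not the actual driver code.

```c
#include <stddef.h>

/* User space stand-in for the address part of struct ocsn_sock. */
struct ocsn_endpoint {
    unsigned short ocsn_src;        /* local OCSN address of the socket */
    unsigned char  ocsn_src_port;   /* local port chosen in bind */
};

/* Return the first socket whose local address and port equal the
 * destination of the incoming frame, or NULL if none matches. */
struct ocsn_endpoint *ocsn_find_socket(struct ocsn_endpoint *socks, size_t n,
                                       unsigned short frame_dst,
                                       unsigned char frame_dst_port)
{
    for (size_t i = 0; i < n; i++) {
        if (socks[i].ocsn_src == frame_dst &&
            socks[i].ocsn_src_port == frame_dst_port)
            return &socks[i];
    }
    return NULL;   /* no receiver bound: the frame would be dropped */
}
```

A linear search is sufficient for a sketch; a driver with many open sockets would more likely use a hash table keyed by address and port.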

9.2 OCSN Network Device Driver

The network device driver for the OCSN-PRHS-SoC memory mapped I/O interface was written by Grebenjuk[37], and its implementation is only briefly described here. The hardware OCSN network interface is connected to an OCSN IF on one side and to the memory bus of the PRHS SoC on the other. The network device driver is responsible for copying received OCSN frames from the memory mapped registers to kernel space, encapsulating them into Ethernet frames and passing them to the Linux network stack for further processing. In the opposite direction, the network stack delivers Ethernet frames to the network device driver. The device driver extracts the OCSN frame and copies it to the memory mapped I/O registers of the hardware interface.
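The encapsulation step can be illustrated in user space C. The sketch below builds an Ethernet header with the frame type 0x81fc mentioned earlier and appends a 39 byte (312 bit) OCSN frame. The byte layout of the OCSN frame inside the payload, the helper name `ocsn_encapsulate` and the omission of any operation field are assumptions; the real driver works on kernel socket buffers instead of plain arrays.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN        6        /* bytes per MAC address */
#define ETH_HLEN        14       /* destination, source, frame type */
#define ETH_P_OCSN      0x81fc   /* frame type given in the text */
#define OCSN_FRAME_LEN  39       /* 312 bit */

/* Writes header plus OCSN frame into out and returns the number of
 * bytes written, or 0 if the output buffer is too small. */
size_t ocsn_encapsulate(uint8_t *out, size_t out_len,
                        const uint8_t dst_mac[ETH_ALEN],
                        const uint8_t src_mac[ETH_ALEN],
                        const uint8_t ocsn_frame[OCSN_FRAME_LEN])
{
    if (out_len < ETH_HLEN + OCSN_FRAME_LEN)
        return 0;
    memcpy(out, dst_mac, ETH_ALEN);
    memcpy(out + ETH_ALEN, src_mac, ETH_ALEN);
    out[12] = (uint8_t)(ETH_P_OCSN >> 8);     /* network byte order */
    out[13] = (uint8_t)(ETH_P_OCSN & 0xff);
    memcpy(out + ETH_HLEN, ocsn_frame, OCSN_FRAME_LEN);
    return ETH_HLEN + OCSN_FRAME_LEN;
}
```

Decapsulation is the reverse: check the frame type at bytes 12 and 13 and copy the payload starting at byte 14. Padding to the minimum Ethernet frame size is not modelled here.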

10 Evaluation

The usability of the presented framework is evaluated using the two dimensions space and time, and an example application. The space dimension is analysed by looking at the area usage of the MRP. For the time dimension, the maximum clock rates achievable by CEBs interconnected through the CSN are measured. For the example implementation, a small general-purpose processor is ported to the MRP.

10.1 Area Usage

The area required to support the MRP on the FPGA is a very important factor for how efficient designs using the MRP can be. The area is measured in FPGA LUTs (see Section 2.4). The reconfiguration platform of the MRP is configured into a Xilinx xc5vlx330 Virtex-5 FPGA supporting LUTs divided into slices. The CEBs consist of slices only. The integration of special purpose hardware, such as DSPs and BRAM, is not supported at the moment. To use the available special purpose hardware, the usage requirements of the complete MRP infrastructure have to be determined. The available resources have to be evenly distributed over all CEBs. The CEBs have to be placed on the FPGA in such a way that each of them encapsulates all the hardware resources it should support. The size of the used FPGA does not allow that. The MRP uses LUTs of the FPGA, including the area for the CEBs. This is roughly 75% of the available resources. Relocating the CEBs leads to an unroutable design. A larger FPGA can support the placement of CEBs with integrated special purpose hardware.

Table 10.1 displays the area usage of the MRP system. The given percentage relates to the number of LUTs used by the MRP, not to the maximum number available. A CEB consists of 800 CLBs, which equals 3200 LUTs. All the CEBs together require 32.8% of the used FPGA area. The CSN switches differ in size because the components get optimised for area usage during design synthesis. Switch 3 and switch 1 only support two switch extension ports, while the others feature three.
These additional ports and the number of used connections per port determine the size of each switch. The switches are roughly three times larger than a CEB and together require 21.86% of the used FPGA space. The IOB components are only half the size of a CEB. Most of the area is required by the OCSN: altogether it requires 43.31% of the used FPGA space. The reason for this is the complex routing algorithm within the OCSN switches. A simple bus could replace the OCSN and reduce the area usage of the interconnection infrastructure, but it would limit the flexibility of communication, for example with resources like RAM, processor cores and additional FPGAs. Another drawback would be the limited size and extensibility of busses. Looking only at the CSN and the CEBs, the hardware overhead is not that big, because four switches provide interconnectivity for 16 CEBs. The overhead can be reduced even more by increasing the number of CEBs per switch and improving the multiplexer implementation within them.

Table 10.1: Area usage of the MRP (columns: Component, Nr. LUTs, Nr. MUXFX, Nr. BRAM, Area Usage Percentage; rows: clkmanager, three OCSN switches, OCSN2BRAM, OCSNbridgeUART, OCSN2ICAP, 16 CEBs, four CSN switches, CSN2OCSN, CSN2OCSNsimple and a total row)

10.2 Maximum CSN Propagation Delay Measurement

The CSN is a very critical part of the MRP. It is an indirect network and has no direct connections between network components, such as CEBs and IOBs. Virtual paths through CSN switches have to be created to interconnect them. The propagation delay of a path is an important factor in digital circuit design because it determines the maximum clock rate of the overall system. At least two physical paths are necessary to create a virtual path within the CSN, because it has to connect a CEB or IOB to a CSN switch, and this switch has to connect to the other CEB or IOB. If the second component is connected to a different switch, more physical paths are necessary. The propagation delay of the created virtual path is composed of the propagation delays of the individual physical paths and the gate delay within each CSN switch. It is important to analyse all possible path delays within the CSN to determine the maximum overall clock frequency, and to identify areas of the same maximum clock frequency.

The measurement of propagation delays on an FPGA is difficult because the start and end points are not directly accessible from outside. Routing both to I/O pins of the FPGA would greatly distort the measurement result, because the additional path to the I/O buffer and the I/O buffer itself affect the propagation delay by an unknown factor. Grinding the FPGA open to get access to the path is not feasible either. A working solution to analyse the propagation delay of paths on an FPGA was published by Ruffoni and Bogliolo[43].
They used two Ring Oscillators (ROs) $R_0$ and $R_1$ on the FPGA. $R_1$ was extended by the path $p$ to analyse. They determined the periods $T_0$ and $T_1$ of the ROs. The period of an RO is twice the propagation delay of its loop[43]. Adding a path to the loop extends the period by twice the propagation delay of the path $p$: $T_1 = T_0 + 2 d_p$. Hence, the delay $d_p$ of the path is calculated by $d_p = (T_1 - T_0)/2$. This method has been adapted for the MRP.

RO-Component

A special RO component has been developed for configuring into any of the CEBs. It consists of an RO whose path can be extended by using a control output and a control input of the CEB interface. Switching between the base and the extended path is implemented using a 2-1 multiplexer and a 2-1 demultiplexer. The control line of each of them is connected to the enable signal of the CEB (see Figure 7.7). The RO drives the clock input of a 32 bit counter. The enable and reset signals of the counter are driven by an FSM clocked at 50 MHz. Both signals are passed into the clock domain of the RO using two FFs connected in a row. The FSM is responsible for measuring the number of RO ticks within a given amount of time. If it receives the start signal from the outside, the FSM enables the counter, waits for a given number of 50 MHz clock cycles, and disables the counter. The value of the counter is connected to an outgoing 32 bit bus connection. On reception of a reset signal from the outside, the FSM resets the counter. The component can be used to first measure the base period $T_B$ of the RO and afterwards the period $T_E$ of the RO with the extended path. The period in nanoseconds can be calculated from the measured number of ticks by

$$T = 1000 \cdot \frac{1}{\dfrac{RO_{ticks} \cdot f[\mathrm{MHz}]}{ticks_{f[\mathrm{MHz}]}}}$$

where $RO_{ticks}$ is the counter value, $f$ is the clock frequency of the FSM in MHz and $ticks_{f[\mathrm{MHz}]}$ is the number of FSM clock cycles during which the counter was enabled. The propagation delay of the extended path $p$ can then be calculated with:

$$d_p = (T_E - T_B)/2$$

ReRouter-Component

Another component is required to measure the propagation delay of all paths within the CSN. The RO requires an extended path to start and stop at itself. Therefore, a component is necessary which can route the incoming signals of a CEB back through its outputs. This component is called ReRouter. Its implementation is very simple because it just connects its inputs to its outputs.

Measuring Setup

To get as much information as possible out of the propagation delay measurement, all paths between the CEBs are analysed. Figure 10.1 displays one configuration of the measurement setup. This configuration is used to measure all path delays from CEB0 at CSN switch 0 to any other CEB. Hence, the RO component is configured into CEB0 at CSN switch 0. All other CEBs are configured with the ReRouter component. The red line shows one of the measured virtual paths. It consists of six physical paths (CEB0 to SW0, SW0 to SW2, SW2 to CEB0, CEB0 to SW2, SW2 to SW1, SW1 to CEB0). The round trip time between the two CEBs is measured. Therefore, the result has to be divided by two to estimate the one way time. First, the base period of the RO component is determined. After that, the CSN is programmed to every possible virtual path, and the period of the RO extended by that path is measured.
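The period and delay formulas of the RO measurement can be checked numerically with a few lines of C. This is a sketch with hypothetical function names, and it is not the software used for the measurements in this chapter. It assumes that the counter value and the number of FSM clock cycles are both greater than zero.

```c
/* Period of the RO in ns, from the counter value ro_ticks collected
 * while the FSM waited fsm_ticks cycles of its f_mhz clock. */
double ro_period_ns(unsigned long ro_ticks, unsigned long fsm_ticks,
                    double f_mhz)
{
    /* RO frequency in MHz: counted RO ticks per measurement time */
    double f_ro_mhz = (double)ro_ticks * f_mhz / (double)fsm_ticks;
    return 1000.0 / f_ro_mhz;
}

/* Delay of the added path: half the difference between the extended
 * and the base period. */
double path_delay_ns(double t_base_ns, double t_ext_ns)
{
    return (t_ext_ns - t_base_ns) / 2.0;
}

/* The measured virtual path is a round trip between two CEBs, so the
 * one way delay is estimated as half of it. */
double one_way_delay_ns(double round_trip_delay_ns)
{
    return round_trip_delay_ns / 2.0;
}
```

With made-up counter values of 200000 RO ticks for the base path and 125000 RO ticks for the extended path, both counted during 50000 cycles of the 50 MHz clock, the periods are 5 ns and 8 ns, which gives a path delay of 1.5 ns and a one way estimate of 0.75 ns.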
The last step is to calculate the individual virtual path propagation delays.

Measurement Results

Table 10.3 presents the propagation delay matrix for the full MRP. To reduce the table size, the column and row names are shortened. The format x-y denotes CEBy at CSN switch x. The measurement results are symmetric with small variations. The leading diagonal, marked in blue, represents the propagation delay of each CEB to its switch. The results are already divided by two to estimate the one way trip time, not the round trip time. There are a few deviations from the symmetry of the matrix, which need to be explained.

121 10.2 Maximum CSN Propagation Delay Measurement RO ReRouter ReRouter ReRouter CSN2OCSN CSN SW 0 CSN SW ReRouter ReRouter ReRouter ReRouter ReRouter ReRouter ReRouter ReRouter CSN2OCSNsimple CSN SW 2 CSN SW ReRouter ReRouter ReRouter ReRouter Figure 10.1: MRP Measurement Configuration for Setup 1 1. There is always at least a small variant within the propagation delay of the path to a CEB and back. 2. Sometimes a propagation delay from one CEB to another is shorter than the sum of the propagation delay to their switch. An example of this phenomenon is the path between CEB1-2 and CEB1-1. Their propagation delay is measured 1.86ns while their propagation delays to their switch are measured 3.15ns and 2.39ns. The problem with measuring the propagation delay within the CSN is, that it is not regularly placed into the FPGA. Figure 10.2 displays the placement of all four CSN switches. It is clearly visible that all the switches are distributed throughout the FPGA, Switch Clk s (Mhz) Clk c (Mhz) Table 10.2: Maximum clock rates within each switch 101

Table 10.3: Propagation Delay Matrix for all CEBs in ns

Figure 10.2: Floorplan of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)

and are even entangled. This distribution leads to very different gate delays for different parts of the CSN switches. This can lead to the second phenomenon because the route through the used multiplexer to another CEB can be very short, while the path back to the CEB itself is very long. Another problem is the placement within each CEB area. The RO could be placed very near the I/O signals or very far away. Placement is a highly randomised process, so this scenario is likely. Figure 10.3 shows the CEB to CSN switch 0 connections in orange and the connections from CSN switch 0 to switch 2 in pink. The lengths of these paths differ considerably, as can be seen for the paths to the left of CEB0-3. The result of these measurements is that CEBs connected through one switch can be clocked at a higher frequency than CEBs connected to different switches. For example, components configured into the CEBs at switch 0 can be clocked at 135 MHz if only sequential circuits are used and at 67 MHz if a combinational circuit is required in at least one CEB. The clock frequencies are calculated using the worst-case propagation delay at one switch. The clock rates for the other switches are displayed in Table 10.2. Clk_s is the maximum achievable clock rate using sequential circuits only. Clk_c is the maximum clock rate with at least one combinational circuit, ignoring its gate delay. As soon as a CEB at a different switch is connected to a system, the clock rate is at least halved.

10.3 Example Microcontroller Implementation for MRP

Showing that the MRP can support complex digital components is very important for the framework evaluation. Therefore, a small CPU has been ported to run as a distributed core on the MRP. The processor core used was developed for teaching purposes by the Computer Engineering group of the Helmut Schmidt University in Hamburg. It supports 16 32-bit registers, a 32-bit ISA, a 32-bit data bus, and a 16-bit address bus.
A simple assembler is available for easier software development. To port the processor core onto the MRP, it has to be divided into its core parts, such as the fetch and decode unit, control unit, register file, and ALU. These components have to be encapsulated in the CEB signal interface. The fetch and decode unit has to be divided into two units. One unit is responsible for fetching data words from a RAM component within the OCSN using the CSN2OCSN bridge. The second one decodes the fetched words for the datapath of the processor core. The FSM of the control unit was extended by two states to handle the additional fetch stage required by the OCSN access. The fetch unit is accessible from the OCSN to select the address of the OCSN RAM component and its port. Additional command frames are available to start, stop, and reset the processor core. This is necessary because programs running on the MRP's host system manage the processor core and its software. Figure 10.4 presents the MRP configuration for the processor core. All components except the ALU fit into the CEBs of CSN switch 0. The ALU is configured into CEB 1 of switch 1. Without the MRP, configured as a SoC on a Xilinx Virtex-5 FPGA, the processor core can run at 30 MHz. Hence, 25 MHz is the maximum frequency of the core on the MRP. Using

Figure 10.3: Floorplan with interconnects of the reconfiguration platform (yellow: CSN switch 0, red: CSN switch 1, green: CSN switch 2, purple: CSN switch 3)