Online Clock Routing in Xilinx FPGAs for High-Performance and Reliability

Xabier Iturbe, Khaled Benkrid, Raul Torrego, Ali Ebrahim and Tughrul Arslan
System Level Integration Group, The University of Edinburgh, Edinburgh EH9 3JL, Scotland, UK {x.iturbe, k.benkrid, a.ebrahim,
Embedded System-on-Chip Group, IKERLAN-IK4 Research Alliance, Mondragón 20500, Basque Country (Spain) {xiturbe,

Abstract

In this paper, we report the design and implementation of a reconfigurable system that exploits the regional clocking resources of Xilinx Virtex-4 FPGAs for increased performance and, for the first time, enhanced reliability. Unlike previous approaches, our system is able to manage the regional clock buffers (BUFRs) individually in order to adjust the frequency delivered to each hardware task and to detect and recover from faults affecting the clock-tree on-the-fly. To this end, we propose global and regional clock multiplexers, named GCMUX and RCMUX respectively, which allow switching to spare clocking resources whenever needed. These multiplexers are based on the inner programmable interconnection points of the FPGA, leading to zero area overhead.

I. INTRODUCTION

State-of-the-art trends in Reconfigurable Computing (RC) envision substantial gains in performance, reliability and power efficiency over traditional systems by customizing the underlying architecture of the system at runtime to match the specific needs of a given application [1]. Different pieces of circuitry, each specifically designed to implement one type of computation efficiently, are allocated on a dynamically reconfigurable FPGA, executed and finally replaced by other circuits, leading to a continuous stream of input operands, computation and output results. By analogy with the software field, these swappable pieces of circuitry are called hardware tasks. Current research efforts try to improve the performance of RC in each of the domains where computation occurs, namely space and time [2].
In the space domain, these efforts aim to increase the allocatability of the hardware tasks by reducing the fragmentation of the chip. In the time domain, the efforts concentrate on exploiting the parallelism delivered by the FPGA. This includes both process-level parallelism or multitasking, where the objective is to make the largest possible number of hardware tasks run simultaneously, and data-level parallelism, where the objective is to build efficient architectures able to exploit the high bandwidth offered by the tens of thousands of logic blocks and memories included in modern FPGAs. However, these approaches can benefit greatly from advances in precisely the key factor for computing speed in traditional processors: the clock frequency. Indeed, the highest clock frequency at which a hardware task can run depends on the maximum delay between its sequential components, the so-called longest path, which usually depends on the complexity of the task itself. As a result, an RC application typically contains several hardware tasks which can run at different clock frequencies. Clocking the whole system at the slowest rate is the easy option that is often chosen, but it is not the most efficient. Hence, the open question is how to deal with frequency heterogeneity in order to improve the performance of an RC application.

Another issue of concern in RC is fault-tolerance. Researchers have traditionally focused on scrubbing bit upsets [3] and reconfiguring around damaged resources [4], but little attention has been paid to common-source failures such as those in the clock distribution. However, the clock-tree is a single point of failure and must be carefully hardened to increase the reliability of the system [5].

Modern families of Xilinx FPGAs include enhanced clocking capabilities that can be useful when dealing with the above-mentioned problems. These devices allow different portions of the device's reconfigurable area, the so-called clock regions, to be handled independently [7].
Each clock region includes specific clocking resources and thus the clock-tree is not a single resource that must be managed as a whole. Instead, it is divided into several branches that feed the hardware tasks. However, most RC systems still do not exploit these capabilities offered by new FPGAs. This paper aims to exploit the multi-branched clock-tree to increase performance and reliability in an RC system.

The work reported in this paper is part of a larger effort in our group which aims to implement OS-like support for developing reconfigurable applications using Xilinx FPGAs, e.g. support for task scheduling, allocation, deallocation, inter-task communication and synchronization. Our OS is named Reliable Reconfigurable Real-Time Operating System (R3TOS) [6], emphasizing its three major features: reconfigurability, reliability and real-time performance.

The main contributions of this paper are twofold: (i) the implementation of an RC system able to manage the regional clocking resources at runtime to make each hardware task run at its maximum frequency; and (ii) a novel method to detect and recover at runtime from a fault affecting the clock-tree.

The remainder of the paper is organized as follows. Section II introduces the architecture of partially reconfigurable Xilinx FPGAs, with special emphasis on the clock-tree, and reviews related work. Next, in Section III, the main contributions of this paper are described by means of a proof-of-concept implementation. Finally, conclusions are drawn in Section IV.

II. DYNAMICALLY RECONFIGURABLE XILINX FPGAS

Xilinx FPGAs include an array of Configurable Logic Blocks (CLBs), Input-Output Blocks (IOBs), routing resources, a clock-tree and some special resources (e.g. Block-RAM memories, DSPs) [7].

A. The Clock-Tree

Fig. 1a shows the global clocking resources in a Virtex-4 fabric. Each of the 32 matched-skew global nets is driven by a global clock buffer (BUFGCTRL), which can select between two input clock sources. Usually these clock sources are Digital Clock Managers (DCMs), used to eliminate the clock distribution delay, adjust the delay of a clock relative to another clock, or adjust the frequency of a clock source. IOBs can be directly connected to BUFGCTRLs as well. The Programmable Interconnection Points (PIPs) that distribute the output signals of the BUFGCTRLs to the clock regions are especially interesting in this figure. These PIPs act as Global Clock MUltipleXers (GCMUXs), selecting up to 8 out of the 32 input global signals to feed the sequential components in the device.

Starting from Virtex-4, Xilinx FPGAs are divided into different clock regions to improve the clock distribution (See Fig. 1b). The number of regions varies with device size, from 8 regions in the smallest device to 24 regions in the largest one. Each clock region includes two independent regional clock nets (Net1 and Net2) and two regional clock buffers (BUFR1 and BUFR2) with the capability to divide the input clock rate by an integer number, named BUFR DIVIDE, which can range from 1 to 8 (See Fig. 1c).
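As a rough illustration of what the BUFR DIVIDE range means for a task, the following Python sketch picks the smallest divider that keeps the regional clock at or below the maximum frequency of a task. The function names and the 400 MHz clock used in the usage note are hypothetical; the 4-bit code follows the values listed in Table I.

```python
from typing import Optional

# BUFR DIVIDE can range from 1 to 8 (Section II-A).
BUFR_DIVIDE_RANGE = range(1, 9)


def pick_bufr_divide(global_mhz: float, task_max_mhz: float) -> Optional[int]:
    """Return the smallest divider whose output does not exceed task_max_mhz.

    Hypothetical helper: returns None when even divide-by-8 is too fast.
    """
    for divide in BUFR_DIVIDE_RANGE:
        if global_mhz / divide <= task_max_mhz:
            return divide
    return None


def bufr_divide_code(divide: int) -> int:
    """4-bit configuration value as listed in Table I (0x8 for /1 ... 0xF for /8)."""
    if divide not in BUFR_DIVIDE_RANGE:
        raise ValueError("BUFR DIVIDE must be between 1 and 8")
    return 0x8 + (divide - 1)
```

For example, with an assumed 400 MHz global clock, a task limited to 150 MHz would get `pick_bufr_divide(400.0, 150.0) == 3` (about 133 MHz), which corresponds to the Table I value `0xA`.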
The input clock signal of a BUFR can come from an IOB, through an I/O clock Buffer (BUFIO), or directly from a BUFGCTRL. Especially interesting are the PIPs that connect the output signals of the BUFRs with the regional clock nets. These PIPs act as a Regional Clock MUltipleXer (RCMUX), selecting up to 2 out of 6 different input clock signals: 2 coming directly from the BUFRs in the same clock region and the other 4 coming from the BUFRs in the adjacent up and down clock regions. Hence, each BUFR can drive up to three adjacent clock regions. Summing up, the clock-tree of a Xilinx FPGA is hierarchically structured: Clock Source (DCM) → BUFGCTRL → GCMUX → BUFR → RCMUX (See Fig. 1d).

Fig. 1: Clocking in a Virtex-4 XC4VFX12 FPGA. (a) Simplified scheme of the global clocking-tree. (b) Regional clocking resources are located in the middle of each clock region while the global clocking resources are located in the middle of the die. (c) Simplified scheme of a regional clocking branch. (d) Clock-tree hierarchy.

B. Dynamic Partial Reconfiguration and Online Task Reallocation

The physical resources of a Xilinx FPGA are configured by means of a bitstream that is stored in a configuration memory. Dynamic Partial Reconfiguration (DPR) consists of changing the content of some positions of the configuration memory at runtime, thus changing the functionality implemented by a portion of the device. Access to the configuration memory is carried out through the so-called Internal Configuration Access Port (ICAP).

For Xilinx FPGAs the smallest amount of configuration information that can be accessed in the configuration memory is the configuration frame. Each configuration frame spans the whole height of a fabric clock region, defining the minimum set of resources modifiable when using DPR [8]. In Virtex-4 FPGAs each frame includes 1312 bits and is addressed by a 32-bit address which includes five fields: (a) block resource type, (b) top / bottom half, (c) clock region, (d) major column address, which identifies the column within the clock region, and (e) minor intra-column address, which identifies the specific frame within the column. Note that fields (a), (b), (c) and (d) are related to the type and location of the resources the frame configures. For instance, BUFRs and RCMUXes are mapped into IOB type frames in the leftmost and rightmost columns, while BUFGCTRLs and GCMUXes are mapped into dedicated clocking type frames in the middle column.

Due to the relation between the physical location of the resources in the FPGA die and the logical location of the frames in the configuration memory, a hardware task can be physically relocated to different positions by simply writing its associated partial bitstream to the appropriate configuration frames [9]. The only condition is that the target position must be identical to the original one in the type and arrangement of the resources as well as in the communication interfaces it contains. In order to circumvent the latter requirement, i.e. an identical communication interface in the source and target positions, we have recently proposed a technique that harnesses the ICAP for performing inter-task communication and synchronization, thus eliminating the need for fixed proxy logic [10].
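The five address fields listed above can be pictured as a packed word, and relocation as a change of the location fields only. The sketch below is illustrative: the bit positions and widths in `FIELDS` are an assumption that should be checked against the Virtex-4 Configuration User Guide [8], and `FrameAddress` and `relocated` are hypothetical names.

```python
from dataclasses import dataclass, replace

# ASSUMED layout as (shift, width); verify against UG071 before real use.
FIELDS = {
    "top_bottom": (22, 1),    # (b) top / bottom half
    "block_type": (19, 3),    # (a) block resource type
    "clock_region": (14, 5),  # (c) clock region (row)
    "major": (6, 8),          # (d) column within the clock region
    "minor": (0, 6),          # (e) frame within the column
}


@dataclass(frozen=True)
class FrameAddress:
    block_type: int
    top_bottom: int
    clock_region: int
    major: int
    minor: int

    def pack(self) -> int:
        """Pack the five fields into one address word."""
        word = 0
        for name, (shift, width) in FIELDS.items():
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word |= value << shift
        return word

    @classmethod
    def unpack(cls, word: int) -> "FrameAddress":
        """Split an address word back into its fields."""
        return cls(**{
            name: (word >> shift) & ((1 << width) - 1)
            for name, (shift, width) in FIELDS.items()
        })

    def relocated(self, top_bottom: int, clock_region: int) -> "FrameAddress":
        """Same frame type, column and minor index, in another clock region."""
        return replace(self, top_bottom=top_bottom, clock_region=clock_region)
```

Under these assumptions, relocating a task vertically amounts to rewriting each of its frames at `addr.relocated(...)`, leaving the major and minor fields unchanged.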
To support this ICAP-based communication, we have developed a generic Communication Interface (CIF) to be attached to each hardware task in the system. The CIF includes a set of input/output data buffers, which can be accessed through the configuration layer regardless of the final placement of the tasks within the FPGA; i.e. data to process is written to the input buffer and results are retrieved from the output buffer. Consequently, the CIF makes hardware tasks closed, fully relocatable structures where the clock is the only signal that crosses their boundaries. Note that when all of the tasks use the same clock signal, distributed through the global clock net, this scheme does not constrain their allocatability. The problem appears when different clock signals must feed each of the tasks. In this case the global clock net cannot be used. Since the clock signals cannot be routed through conventional resources (e.g. CLBs) because of the high skew and timing instabilities this provokes, the only valid alternative is to use the regional clocking resources.

C. Related Work

Xilinx provides practical information on the use of BUFRs in a partially reconfigurable design in [11], and a successful implementation is reported in [12], [13]. Basically, these approaches include the BUFRs as a component of the hardware tasks. Despite being functional, this solution limits the efficiency of the system described in [13]:

(i) As the partial bitstream of the hardware tasks includes the configuration of the BUFRs, each clock region can host only tasks running at the same clock frequency; every time a new task is configured in a clock region, the configuration of the BUFR in that region is overwritten.
(ii) As the tasks can be allocated only to positions where BUFRs are located, they cannot be horizontally shifted inside a clock region.
(iii) As the hardware tasks are driven by a single BUFR, their height must be no greater than 3 clock regions.
(iv) As the clock rate is switched in the BUFGCTRLs, only two different frequencies can be selected for the tasks, since BUFGCTRLs have only two clock inputs.

III. OUR PROOF-OF-CONCEPT SYSTEM IMPLEMENTATION

In contrast to the approaches presented in Section II, we propose to keep the BUFRs separate from the hardware tasks. This better represents what really occurs: the regional clocking resources are not dedicated to any specific hardware task, but shared among all of the tasks placed in the same clock region. By doing so, tasks can be clocked using multiple BUFRs located in different clock regions and thus their maximum height is not constrained. While all of the clock signals delivered to a task must be of the same frequency, as shown in Fig. 2, they can be routed through different regional clock nets in each clock region (e.g. Net1 or Net2). Moreover, not including the BUFRs in the architecture of the tasks enhances their (horizontal) allocatability. We also propose to adjust the configuration of the BUFRs at runtime by changing the BUFR DIVIDE parameter. This allows up to 8 different clock frequencies to be generated. Finally, we propose to route the clock signal on-the-fly in order to switch away from failed clocking resources.

Fig. 2: Clock feeding to a hardware task using the BUFRs

We note that the hardware tasks include a RAM-LUT which is remotely writable through the ICAP, the so-called Hardware Semaphore (HWS). The HWS acts as the internal reset for all of the sequential components in a task, allowing the computation to be delayed until all of the clock signals are correctly set up for that task; i.e. the task is kept in reset while its clock signals are being configured.

This Section describes a proof-of-concept system that demonstrates the feasibility of these ideas, using a Virtex-4 XC4VFX12 part.

A. Static Infrastructure Synthesis

The static infrastructure gives support for the execution of the hardware tasks. It includes the R3TOS kernel, which acts as the brain of the system, and the clocking resources, which deliver the clock signal across the chip. The R3TOS kernel offers inter-task communication and synchronization services and manages the clock-tree based on performance and reliability criteria. Since most of these functions are carried out through the configuration layer of the FPGA, the kernel includes an ICAP instance as well.

The static infrastructure also includes the instantiation of all the BUFRs located in the leftmost and rightmost columns of the FPGA, which are fed by the same BUFGCTRL. For reliability purposes, each input of the BUFGCTRL is connected to an independent clock source. Moreover, a BUFR diagnostic circuit is included in each clock region to detect damaged regional clocking resources (See Fig. 3). Initially the two BUFRs in each clock region are directly connected to the two regional clock nets. As the BUFRs do not drive any static logic, we must prevent their removal during synthesis by including an S=TRUE attribute on them.

Static routing is a critical issue when building an RC system, as the relocated hardware tasks may use routing resources already assigned to static routes in the target position. To limit the static routing in our design, the R3TOS kernel is defined as partially reconfigurable and constrained to a specific closed region. As a result, it is self-contained in that region and the rest of the chip is free of static routes.
The placement region for the kernel is located between the ICAP (in the centre of the FPGA die) and the IOBs to be used (on the right border of the chip). Since the Xilinx design tools do not allow either IOBs or the ICAP to be included within a partially reconfigurable region, these components are kept outside the region assigned to the R3TOS circuitry and are connected to it by means of Bus Macros (BMs). As shown in Fig. 3, the only signals which extend beyond the boundaries of the region assigned to the R3TOS kernel are the clock lines, which do not constrain the allocatability of the tasks as they are separately managed by R3TOS.

Fig. 3: Layout of the static infrastructure. (a) Placed design. (b) Routed design.

B. Hardware Task Synthesis

The hardware tasks, fed by a BUFR, are separately synthesized and constrained to specific closed regions within the FPGA. Note that the clock route inside the task is adjusted at runtime (See Section III-C).

C. Adjusting Task's Clock Frequency

Up to two different clock frequencies can be distributed in a clock region, one through each of its two regional clock nets. The process of delivering a specific clock signal to a hardware task is done in two steps.

First, one of the BUFRs in each of the clock regions the task spans is appropriately configured in the configuration memory. The BUFR DIVIDE parameter is coded using 4 bits. For BUFR1s, these bits are in positions [ ] within the IOB type frame with MINOR=14 and for BUFR2s, the bits are in identical positions within the IOB type frame with MINOR=22. Table I shows the configuration values to be written in these positions to select each clock frequency.

TABLE I: BUFR DIVIDE configuration values
BUFR DIVIDE     Value
Divided by 1    0x8
Divided by 2    0x9
Divided by 3    0xA
Divided by 4    0xB
Divided by 5    0xC
Divided by 6    0xD
Divided by 7    0xE
Divided by 8    0xF

Then, the clock signal is routed inside the task to feed all of its sequential components; i.e. the components must be driven by the regional clock net connected to the previously configured BUFR. To select the clock Net1 as the clock source for a resource column, the bit in position 655 within the frames with MINOR=18 is activated, and clock Net2 is selected by activating the bit in position 654 within the same frames. This applies to CLB, DSP and BRAM interconnection type frames. Finally, the clock signal is routed through the switch matrices associated with each resource within the column. The positions of the bits to be activated to do so are shown in Table II.

TABLE II: Configuration values to route the clock signals (columns: Net1 / Net2, MINOR, Bit position)
n m n+80m 0 n 3 0 m n+80m 0 n 3 9 m n+80m 46+8n+80m 0 n 1 0 m n+80m 78+8n+80m 0 n 1 9 m n+80m 0 n 3 0 m n+80m 0 n 3 9 m n+80m 41+8n+80m 0 n 1 0 m n+80m 73+8n+80m 0 n 1 9 m 15

In order to reduce power consumption, when only one frequency is needed in a clock region, the other BUFR is disabled. For BUFR1s, this is done by writing 0 in position 654 within the IOB type frames with MINOR=19 and MINOR=21. For BUFR2s, the bits to be cleared are located in position 653 in the same frames.

D. Dealing with Faults in the Clock-Tree

For reliability purposes, each input of the BUFGCTRL, I0 and I1, is connected to an independent clock source, clk_i0 and clk_i1, respectively. The clock selection port of the BUFGCTRL is driven by the Dual-clock Management Circuit (DMC) depicted in Fig. 4. This circuit takes both clock sources, clk_i0 and clk_i1, as inputs. Specifically, clk_i0 is used as the clock of the circuit and clk_i1 is captured in two latches, one working at falling edges and the other at rising edges. Assuming both input clock signals are of the same frequency, the values in the latches can never be the same, except when either clk_i0 or clk_i1 does not work; i.e. it is stuck at the same logic level. Hence, the XOR of the values stored in the latches is used to select the clock source in the BUFGCTRL, automatically switching to clk_i0 when clk_i1 fails.

In order to circumvent problems when the two clock sources are not synchronized, the BUFGCTRL clock selection port is not directly driven by the XOR logic gate. The output of this gate is delayed by at least one clock cycle, ensuring there is sufficient guard time between the last clock pulse generated by the active clock source before it fails and the clock pulse immediately after, which is generated by the newly selected clock source. Finally, we note that errors due to phase and frequency deviations in the clock sources are not directly handled by the DMC circuit, but they can be detected by the DCMs via their LOCKED output.

Although the DMC circuit is able to recover from a single failed clock source, the R3TOS kernel should be aware of this situation in order to prevent potential system failures, i.e. when the remaining clock source fails. Therefore, the signal that drives the BUFGCTRL clock selection (error) is delivered to the R3TOS kernel so that it can keep track of the state of the clock sources.

Fig. 4: Dual-clock Management Circuit (DMC). (a) Schematic. (b) Functioning:
clk_i0   clk_i1   error   clk_o
OK       OK       1       clk_i1
OK       X        0       clk_i0
X        OK       1       clk_i1
X        X        X       clk_i1

To detect damaged regional clocking resources, a diagnostic circuit is implemented in each clock region, next to the BUFRs. As shown in Fig. 5, this circuit is very simple, fitting in only 8 Slices and thus spanning only one CLB column. It takes two clock signals as inputs: the global clock signal which is input to the BUFR under diagnosis and the regional clock signal which is output by it. The global clock feeds a delay line composed of 5 cascaded latches which capture the regional clock signal at rising edges, and another latch which captures the regional clock signal at the falling edge. As the frequency of the regional clock signal can be equal to or up to 8 times slower than that of the global clock, the values stored in all of the latches can never be the same. Note that when the two frequencies are the same, the values captured at falling and rising edges must be different, and when the frequencies are different, the 5 delay line latches store values corresponding to more than half a period of the regional clock signal and hence at least one of them must be different. Therefore, the error signal must always be 0. If this is not the case, then the regional clock signal is stuck at the same logic level, i.e. the BUFR is broken. In order to make the diagnostic result remotely accessible from the R3TOS kernel without using static routes across the chip, the error signal is registered in a RAM-LUT, whose content can be read back using the ICAP. Once this RAM-LUT has been accessed, the R3TOS kernel is responsible for clearing it back to 0.

Fig. 5: BUFR diagnostic circuit

When a BUFR is damaged, the RCMUXes are used to switch to a signal coming from any of the other 4 BUFRs located in the up and down clock regions. The RCMUX configuration values are coded using 4 bits located in positions [ ] within two different IOB type frames. MINOR=23 configures the clock Net1 and MINOR=25 configures the clock Net2. Table III shows the values to be written in these positions to select a specific input clock source in the RCMUXes.

TABLE III: RCMUX configuration values
Input                        Value
BUFR1 (down clock region)    0x8
BUFR2 (down clock region)    0x9
BUFR1                        0xA
BUFR2                        0xB
BUFR1 (up clock region)      0xC
BUFR2 (up clock region)      0xD

Fig. 6, panels (a) and (b): (a) Normal configuration. (b) ...with unused resources in large tasks spanning regions.

When a hardware task spans more than one clock region in height, non-damaged BUFRs of these regions can be used in the regions with damaged BUFRs (See Fig. 6b). Likewise, one of the BUFRs in a region where only one clock frequency is needed can feed the adjacent clock regions with damaged BUFRs (See Fig. 6c).

Despite being more complex to implement, the same strategy can be used to switch to a spare BUFGCTRL when one of the input clock sources of the active one fails; otherwise, the system will fail if the remaining clock source also fails. Note that this diagnosis information can be obtained from the aforementioned DMC circuit.
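The diagnostic check and the selection of a spare BUFR described above can be sketched behaviourally in Python. This is a software model under stated assumptions, not the authors' circuit: the regional clock is assumed to have a 50% duty cycle, time advances in half periods of the global clock, and the preference order in `pick_spare` is hypothetical. The RCMUX codes are those of Table III.

```python
from typing import Callable, Dict, Optional, Tuple


def regional_level(tick: int, divide: int, phase: int = 0) -> int:
    """Level of the regional clock; one global clock period equals 2 ticks.

    Assumption: 50% duty cycle, i.e. high for `divide` ticks out of 2*divide.
    """
    return 1 if (tick + phase) % (2 * divide) < divide else 0


def diagnose(sample: Callable[[int], int], start_cycle: int = 0) -> int:
    """Return error=1 when all six captured values agree (clock stuck)."""
    # Five consecutive rising-edge captures (even ticks) ...
    rising = [sample(2 * (start_cycle + k)) for k in range(5)]
    # ... plus one falling-edge capture (odd tick).
    falling = sample(2 * (start_cycle + 4) + 1)
    return int(len(set(rising + [falling])) == 1)


# Table III values, keyed by (clock region relative to the faulty one, BUFR index).
RCMUX_CODES: Dict[Tuple[str, int], int] = {
    ("down", 1): 0x8, ("down", 2): 0x9,
    ("same", 1): 0xA, ("same", 2): 0xB,
    ("up", 1): 0xC, ("up", 2): 0xD,
}

# Hypothetical preference: same region first, then down, then up.
PREFERENCE = [("same", 1), ("same", 2), ("down", 1),
              ("down", 2), ("up", 1), ("up", 2)]


def pick_spare(healthy: Dict[Tuple[str, int], bool]) -> Optional[Tuple[Tuple[str, int], int]]:
    """Return the first healthy BUFR and its RCMUX code, or None."""
    for key in PREFERENCE:
        if healthy.get(key, False):
            return key, RCMUX_CODES[key]
    return None
```

In this model, `diagnose` returns 0 for every divider from 1 to 8 and every phase, and returns 1 only when the sampled signal never changes, which matches the argument given in the text.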
The global clock switch must be done carefully, however, in order to avoid glitches that could provoke functional errors, e.g. wrong computation by the hardware tasks or a stuck ICAP. Indeed, prior to adjusting the PIPs used to route the global clock signal, the ICAP Controller should be clocked from a different clock source and the active global clock source should be switched off. This can be implemented by taking advantage of the fact that modern FPGAs include two ICAP ports, which allows two ICAP Controllers to be included in the system, each fed with a different clock source (See Fig. 7). Hence, after switching the global clock using the spare ICAP Controller (See Fig. 7b), the main ICAP Controller should be used again to restore the clock diversity in the system (See Fig. 7c).

Fig. 6: Replacing damaged BUFRs... (c) ...with unused resources in regions where only one clock frequency is needed.

Fig. 7: Global clock switch: clock source 1 → clock source 3. (a) Initial situation. (b) Phase 1. (c) Phase 2.

By using all the techniques reported in this paper together, a complete online clock routing mechanism is implemented, where some parts are automatically managed (BUFGCTRL) and the others need to be managed by the R3TOS kernel (clock sources, routing inside the tasks, BUFRs, RCMUX, GCMUX).

IV. CONCLUSIONS

In this paper we have described how to build an RC system able to feed each hardware task with the required clock frequency. We have shown that our system circumvents up to four limitations existing in previous related work. In addition, we have presented a novel mechanism to detect and recover from faults affecting the clock-tree of the underlying FPGA. This is considered significant in reliability-sensitive applications (e.g. space). Our mechanism enables the system to gain control over each of the resources in the FPGA clock-tree hierarchy in order to respond at the appropriate level to each need. Our solution has been implemented using Xilinx Virtex-4 FPGAs and it is also extensible to newer Xilinx families, such as Virtex-5/6/7. Future work includes testing our RC system in various real-world applications.

Last but not least, we note that the work reported in this paper is part of the R3TOS project, which aims to build a Reliable Reconfigurable Real-Time Operating System for Xilinx partially reconfigurable FPGAs. R3TOS also includes a novel fault-handling strategy to cope with transient and permanent faults which affect both the hardware tasks and its own kernel circuitry. Furthermore, the R3TOS kernel includes a set of fault-tolerance-by-design features, including ECC protection of internal BRAMs and finite state machines. The fault-handling strategy used by R3TOS as well as its kernel implementation are to be described in future publications.

REFERENCES

[1] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. Morgan Kaufmann Publishers Inc., San Francisco, USA.
[2] T. Marconi, Y. Lu, K.L.M. Bertels and G. N. Gaydadjiev, 3D Compaction: a Novel Blocking-aware Algorithm for Online Hardware Task Scheduling and Placement on 2D Partially Reconfigurable Devices, Proc. of the International Symposium on Applied Reconfigurable Computing.
[3] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K.A. LaBel, M. Friendlich, H. Kim and A. Phan, Effectiveness of Internal versus External SEU Scrubbing Mitigation Strategies in a Xilinx FPGA: Design, Test, and Analysis, IEEE Transactions on Nuclear Science, 55(4).
[4] D. P. Montminy, R. O. Baldwin, P. D. Williams and B. E. Mullins, Using Relocatable Bitstreams for Fault Tolerance, Proc. of the NASA/ESA Conference on Adaptive Hardware and Systems.
[5] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, Kluwer Academic Publishers, Norwell, USA.
[6] X. Iturbe, K. Benkrid, A. T. Erdogan, T. Arslan, M. Azkarate, I. Martinez and A. Perez, R3TOS: A Reliable Reconfigurable Real-Time Operating System, Proc. of the NASA/ESA Conference on Adaptive Hardware and Systems.
[7] Xilinx Inc., Virtex-4 FPGA User Guide, UG070.
[8] Xilinx Inc., Virtex-4 FPGA Configuration User Guide, UG071.
[9] P. Sedcole, B. Blodget, T. Becker, J. Anderson and P. Lysaght, Modular Dynamic Reconfiguration in Virtex FPGAs, IEE Proceedings Computers and Digital Techniques, 153(3).
[10] X. Iturbe, K. Benkrid, T. Arslan, R. Torrego and I. Martinez, Methods and Mechanisms for Hardware Multitasking: Executing and Synchronizing Fully Relocatable Hardware Tasks in Xilinx FPGAs, Proc. of the International Conference on Field-Programmable Logic and Applications.
[11] E. Eto, Support for BUFR in Partial Reconfigurable Modules, Xilinx White Paper, WP344.
[12] A. Flynn, A. Gordon-Ross and A. D. George, Bitstream Relocation with Local Clock Domains for Partially Reconfigurable FPGAs, Proc. of the Conference on Design, Automation and Test in Europe.
[13] A. Jara-Berrocal and A. Gordon-Ross, VAPRES: A Virtual Architecture for Partially Reconfigurable Embedded Systems, Proc. of the Conference on Design, Automation and Test in Europe, 2010.