Hardware Assertion Checkers in On-line Detection of Faults in a Hierarchical-Ring Network-On-Chip

Jean-Samuel Chenard, Stephan Bourduas, Nathaniel Azuelos, Marc Boulé and Zeljko Zilic
McGill University, Montréal, Québec, Canada
{jsamch, stephan, mboul, zeljko}@macs.ece.mcgill.ca; nathaniel.azuelos@mail.mcgill.ca

Abstract

In this paper, we present a methodology for using assertions in network-based designs to facilitate the debugging and monitoring of SoCs. We rely on our assertion-checker generator (MBAC) to produce efficient RTL checkers from high-level temporal assertions, with optional debugging features. We have further built tools to encapsulate the source design with the assertion checkers generated by MBAC and to coordinate the sending of management flits. Further details of the debug infrastructure are presented as well.

I. INTRODUCTION

The progression of silicon technology has allowed engineers to build systems of increasing complexity with every generation. At a certain integration level, the Network-On-Chip (NoC) [1], [2] paradigm becomes appealing, but it also confronts designers with new challenges, especially in ensuring reliable and failure-free operation in the field. While application errors occurring above the NoC layer can to a large extent be dealt with by the system, communication failures in the NoC can often be catastrophic. By analogy with traditional computer systems, a Bus Error is often an unrecoverable failure leading to system downtime, while other higher-level failures are correctable and the system may recover.

During their operation, systems employing NoCs can experience errors due to a multitude of conditions, such as single-event upsets [3], faults due to untested or unverified corner cases, or network deadlocks and livelocks. Thus, it becomes important to detect imminent failures or errors within the network and to be able to correct their effects whenever possible.
The detection must be carried out as close as possible to the source of the failure so that any inconsistencies can be reported as early as possible. This capability facilitates the diagnosis of the problem and may suggest a way to circumvent it and possibly recover from it.

As assertion-based design methodologies are becoming pervasive in the verification of designs before fabrication, similar efforts can be extended to facilitate NoC monitoring, diagnosis and even self-repair and run-time recovery. A large engineering effort is required to create the assertion libraries, which are currently used mainly in simulation. Re-using those assertions in the final silicon, by transforming the assertion statements into small and efficient hardware checkers, is therefore quite advantageous. Structured assertions running at the same speed as the rest of the networking circuitry can become a precise failure detection and localization tool.

In large NoCs, a single error may trigger a slew of consequent errors in a way similar to alarm showers in supervisory environments [4]. Furthermore, the circuit may operate on several clock domains. Thus, errors detected by the assertion checkers can be reported in an unpredictable order, giving the wrong impression of the original cause of the shower of errors. Re-ordering the temporal sequence of errors simplifies diagnostics by facilitating the identification of the root cause of the failure.

In this paper, we propose a solution adapted to a two-level hierarchical ring interconnect. Our work integrates a Property Specification Language (PSL) assertion compiler [5], [6], which translates PSL statements into efficient, synthesizable Register Transfer Level (RTL) checkers, into a NoC [7] in a way that optimizes its use for debugging. The concepts are applied to our hierarchical-ring NoC interconnect circuit.
In the context of a NoC, centralizing all the assertion checker signals would break the hierarchy encapsulation rules required to design such large systems. If all the signals were propagated to a single assertion aggregation unit, the large number of global signals on various clock domains would make the back-end design of the integrated circuit a much more difficult task. There is therefore a need for a suitable networking infrastructure to aggregate the status of the assertion checkers.

A. Background

Different methods have been proposed to replace the functionality of a logic analyzer with more powerful on-chip alternatives. The IBM team working on the Cell microprocessor [8] is a prime example of this effort. They chose to include an on-chip parallel bus architecture dedicated to routing debug packets. While very attractive, this approach adds a significant amount of area and wiring. Ciordas [9], [10] and a team at Philips research labs have studied the reuse of the existing network for debugging and monitoring, targeting specifically their Æthereal NoC platform and its trade-offs, and found that re-using resources through networking saves a large amount of area compared to a dedicated-bus alternative. Other efforts [11] in the testing field have considered the re-use of the on-chip network, but did not consider the real-time debugging use of these techniques.

Our work specifically targets the two-level hierarchical ring structure [7] illustrated in Figure 1. In the current design, there are four low-level local rings attached to one central high-level global ring. Inter-Ring Interfaces (IRIs) provide a means for passing data from one ring level to the next. Every flit (flow control digit, the smallest unit of data exchanged between stations) passing on the ring contains the sender's as well as the target's address (ring and station).
The rings are designed to account for the case where each station operates in a different clock domain. Therefore, most of the FIFO queues in the rings are asynchronous. Further, all the rings are unidirectional; that is, flits sent by a station in any ring are not immediately accessible by the other stations, with the exception of the next one in the directional ring order. This property ensures that data is received by the target nodes in the order in which it was sent (within the same end-to-end link).

The in-order arrival of flits is helpful in the context of debugging. In our case, however, a problem occurs when data must pass from one ring to another through the central, top-level ring. In that case, the temporal order of the data generation is lost in the network. Since data traversing the network is buffered at each station, two flits generated within a specific time interval in two different rings might not reach a target node within the same time interval and could be received out of order. The differential delay encountered by the flits depends on local congestion levels and relative clock domain frequencies in the network. All this is further complicated by the dynamic clock scaling supported by our platform.

Fig. 1. Two-level Hierarchical Ring Network-on-Chip Overview (figure labels: Station, Inter-Ring Interface).

II. ASSERTION CHECKERS IN DISTRIBUTED ENVIRONMENTS

Our debugging process starts with a module (RTL or behavioral description) and a set of properties representing interface-level requirements written in PSL. A subset of the PSL assertions is selected for hardware implementation after an analysis of their hardware cost overhead. These statements are then translated into efficient hardware checkers and encapsulated within the RTL module by our assertion compiler (called MBAC). During normal execution, the outputs of the checkers are monitored in order to localize functional errors. Each checker's assertion signal triggers dynamically during execution each time its property is violated by the Device Under Test (DUT). The resulting assertion failures are used as a starting point for the debugging process, and are the fundamental reasoning behind the Assertion-Based Design methodology [12].

Programmable checkers can also be implemented such that the assertion checkers can be controlled by a processor for testing or on-line monitoring. Programmable checkers contain one or more CPU-writable registers which control the values of parameters and constants used in the assertion checker. One example is packet tracing, in which a checker is programmed by the CPU to track a particular packet through the system.

A. Sample Hardware PSL Checkers

Within our RTL NoC, we have created a flit tracer that monitors packets going in and out of the various buffers within a station, along with a hardware assertion checker validating the bursting behavior of the FIFOs. The flit tracer is written in PSL and the corresponding RTL code is produced by the MBAC checker generator.
The resulting hardware outputs can then be monitored using a CPU interface connected to the auxiliary AMBA bus.

1) Hardware Flit Tracer: In this case, the flit tracing module is simply written in PSL and some elements of the flit (for example the source and destination addresses) are left as inputs in the PSL code. This translates into input signals on the hardware checker interface. Those signals are then attached to a CPU-controlled register, allowing re-configuration of the tracer. A cover statement written in PSL automatically infers a CPU-accessible register in hardware. Below is an abridged version of a 3-level flit tracer written in PSL, applicable in our infrastructure:

    vunit IRI_north(InterRingIF) {
      default clock = (posedge NR_Clk_p);

      sequence NR_P1 = {
        NR_DIng_DataValid == 1 &&
        NR_DIng_SrcGlobalRing == Reg_Src1Global &&
        NR_DIng_SrcLocalRing == Reg_Src1Local &&
        NR_DIng_Data == Reg_Src1Data
      };

      // NR_P2 and NR_P3 are defined similarly
      // but with different register inputs (Reg_)

      assert always NR_P1 -> eventually! {NR_P2; [*]; NR_P3};

      cover NR_P1;
      cover NR_P2;
      cover NR_P3;
    }

The above PSL checker, once transformed into a hardware module by MBAC, adds several interesting debug features to our station. The three cover statements allow three types of flits to be counted when they pass on an interface. They can be used as monitoring points within the station. The assert ... eventually! statement allows us to monitor a particular trio of flits (programmable through the CPU interface) that are required to pass on the interface in sequence, but that could be interleaved with traffic from other stations. In this particular example, an assertion failure simply indicates that the three flits did not pass in the expected order; it does not indicate that our hardware is misbehaving in any way.

When synthesized for a Stratix II architecture, this hardware checker uses 75 ALUTs and 9 flip-flops. Most of the ALUTs are used to compare the 32-bit NR_DIng_Data bus with the CPU registers.
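The intended behaviour of the tracer can be illustrated with a small software model. The sketch below is only a finite-trace approximation written for this illustration, not the circuit MBAC generates: the names `Flit`, `Pattern` and `trace_flits` are hypothetical, each list element stands for one clock cycle on the monitored interface, and the strong `eventually!` obligation is judged at the end of the trace.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Flit:
    """One cycle on the monitored interface (fields mirror the PSL signals)."""
    valid: bool
    src_global: int
    src_local: int
    data: int


@dataclass(frozen=True)
class Pattern:
    """CPU-programmed match registers (the Reg_* inputs of the checker)."""
    src_global: int
    src_local: int
    data: int

    def matches(self, flit: Flit) -> bool:
        return (flit.valid
                and flit.src_global == self.src_global
                and flit.src_local == self.src_local
                and flit.data == self.data)


def trace_flits(trace: List[Flit], p1: Pattern, p2: Pattern,
                p3: Pattern) -> Tuple[Tuple[int, int, int], List[int]]:
    """Return (cover counts, cycles of P1 hits whose obligation was never met)."""
    hits = [[p.matches(f) for f in trace] for p in (p1, p2, p3)]
    covers = tuple(sum(h) for h in hits)  # one counter per cover statement
    failures = []
    for i, is_p1 in enumerate(hits[0]):
        if not is_p1:
            continue
        # eventually! {P2; [*]; P3}: P2 at some cycle j >= i, then P3 at k > j.
        # Taking the earliest such j is sufficient.
        j = next((c for c in range(i, len(trace)) if hits[1][c]), None)
        if j is None or not any(hits[2][j + 1:]):
            failures.append(i)
    return covers, failures


# Hypothetical register settings and traffic, for illustration only.
A, B, C = Pattern(0, 1, 0xA), Pattern(0, 2, 0xB), Pattern(1, 0, 0xC)


def flit(p: Pattern) -> Flit:
    return Flit(True, p.src_global, p.src_local, p.data)


IDLE = Flit(False, 0, 0, 0)
OTHER = Flit(True, 3, 3, 0x55)  # traffic from an unrelated station

in_order = [IDLE, flit(A), OTHER, flit(B), OTHER, flit(C)]
out_of_order = [flit(A), flit(C), flit(B)]

print(trace_flits(in_order, A, B, C))      # no failing cycle
print(trace_flits(out_of_order, A, B, C))  # the P1 hit at cycle 0 fails
```

In both traces each cover counter ends at 1; only the second trace reports a failure, because the third flit never follows the second one.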
Only a few flip-flops are required to define the temporal expression of the assert statement.

2) Hardware FIFO Monitor: Below are two simple assertions that can be translated to hardware and are used to assess the proper behavior of the output FIFO of a NoC station. They do not require any CPU intervention to operate. However, since an assertion failure from this monitor indicates a fault in the hardware, it is important that the higher-level application be notified in some way. The first property validates that when the FIFO threshold is getting low, the station stops bursting data and goes to an interleaved mode of transmission. The second property validates the proper operation of the flow control: when StopDown is asserted while the station is sending data, the station should stop sending within 2 clock cycles.

    property StopBurst = always { LowThresh } |=> { [*2] ; !BurstEn };
    property FlowCtlDown = always { StopDown & DataValid } |=> { [*2] ; !DataValid };

    assert StopBurst;
    assert FlowCtlDown;

Those two assertion statements, when compiled into hardware, constantly monitor the station's FIFO behavior and latch the assertion failure within one clock cycle of its occurrence.

B. Propagation and Timestamp of Assertion Failures

Finally, each group of assertion checkers within a station is aggregated and monitored by a dedicated unit, central to the NoC. If the module detects a particular assertion failure, it records it and also propagates a special management flit (m-flit) within the NoC to relay the information to the station responsible for analyzing failures. Using the NoC routing structure to propagate assertion failures is much faster than relying on the local station host CPU, and since the results are centralized, a more accurate picture of assertion failure event chains can be obtained. The assertion failure messages within the network are automatically associated with their source address, thus clearly identifying the station responsible for the message. In the event of a failure in the NoC data transport mechanism (interconnect), the assertion information can still be accessed through the local station CPU interface, but more debugging effort will be required to correlate it with other assertion failures.

The special flit containing the assertion information also triggers the generation of a high-priority flit (hp-flit) that is associated with the assertion flit and propagates to the central aggregation unit. The hp-flit helps resolve the timing of a sequence of assertion failures in our two-level hierarchical ring NoC.

III. USING HARDWARE ASSERTION CHECKERS

Each hardware assertion checker representing a PSL statement can be augmented with specialized debugging enhancements [6], such as activity monitoring or sequence completion monitoring, by passing command-line options to the assertion compiler. A database is then built based on the particular signals associated with each assertion checker circuit.

A. Benefits of Automating Assertion Checkers

Designers could implement their own code for validating the internal state of their circuits, but an automated translation from PSL assertions to gates eliminates the risk of altering the behavior of the circuit with the checkers. The MBAC assertion compiler generates hardware that only monitors the circuit behavior, guaranteeing that the hardware will behave as originally intended. Furthermore, automatically translating PSL assertions to RTL code eliminates the risk of introducing errors into the assertion circuitry itself, which is likely to occur if the translation is done manually.
Any errors introduced into the assertion hardware during translation would render it useless. We can therefore conclude that an automated approach is not only preferable, but necessary for large and complex systems such as NoCs. The methodology also benefits from the uniform representation of assertion failures and from high designer productivity, as temporal languages can very concisely describe complex and pipelined behavior. Furthermore, the PSL statement database and the NoC topology information can be combined within a more advanced debugger to automate the localization of failures.

B. Outline of Our Approach

Our approach to the design of Network-on-Chip diagnostic and failure detection modules can be summarized in the following steps:

1) Start with verification units (vunit) containing PSL assertions related to our NoC design that we wish to commit to hardware.
2) Run those assertions through our hardware assertion-checker generator (MBAC), followed by the RTL synthesizer, to extract the size and timing characteristics of the hardware checker circuits.
3) Evaluate the suitability of the PSL statements for hardware implementation based on their area and critical path.
4) Create new PSL statements that are used for debugging and diagnostics, for example a flit tracer or checkers for corner cases that take very long to reach in simulation.
5) Run the automated interface generator for the local CPU and NoC interconnect. This tool produces the hardware module that interfaces with the assertion checkers, as well as a table mapping assertions to memory addresses for integration with the local CPU firmware.

IV. IMPLEMENTATION DETAILS

A. Network Management Flits

To separate application data traffic from specialized network traffic, a new type of flit, the Network Management Flit (m-flit), is used to provide an independent level of processing dedicated to NoC maintenance. Those special flits are marked with an additional bit in the physical layer.
The m-flits are not propagated into the Station CPU FIFOs. They are handled at a low level, directly by the hardware, to exchange network information between the various stations connected to the NoC. In order to communicate through m-flits, one (or many) NoC nodes must use a special memory-mapped interface to send or receive m-flits. Not all stations need to be equipped with such an interface. The hardware can also autonomously respond to or generate m-flits. They are then injected into the network through a rate-limiting scheduler so that they can only use a certain amount of bandwidth in the NoC. The traffic rate for m-flits is expected to be very low, amounting to the transport of the Station performance counters, coverage monitor values and assertion failure information.

B. NoC Station Assertion Checker Module

This unit resides within the Station architecture alongside an automatically generated CPU interface. It consists of a group of input signals representing the input dependencies of the PSL statements. This unit constantly validates the PSL assertions in hardware and its generation is fully automated through the MBAC tool.

C. Generation of the CPU Interface

A rich set of assertion checkers is beneficial to the debugging of complex NoC architectures. The potentially large number of signals connecting each assertion circuit could quickly become unmanageable if they are not organized in a structured and documented way. If they are to be used as a debug methodology, the assertion checkers must be easy to add to the circuit and must not burden the hardware designer. For these goals to be met, the organization and reporting of assertions must be automated. Another benefit of automation in this scenario is that the generated logic is predictable and uniform, thus reducing the chances of introducing a bug in the circuit through the addition of the debugging enhancements.
The following sections detail the procedure for automating the interface creation.

1) Extraction of the Main Interfaces: The first step in the generation of the CPU interface is the extraction of the signals feeding the verification unit and the construction of an empty shell that lets the MBAC tool register the bus widths and attributes of the signals in the design. The PSL code is then combined with this empty shell to automatically generate efficient hardware checkers with the same temporal behavior as described by the PSL. The generated hardware checkers are then parsed to extract two interfaces.

The first interface contains the input signals that are being monitored by the hardware checker. This interface is processed separately since those signals can require special synthesis attributes. For example, the inputs to the checkers can be buffered so that monitoring has as little effect on the circuit performance as possible.

The second interface relates to the control and observability of the hardware checker outputs. In order for the hardware checker outputs to be useful in a complex system, they have to be internally readable and they need to be presented in a uniform manner, such that when an assertion failure is found during the execution of the hardware, the identity of the assertion can be determined and the information can be propagated within the Network-on-Chip.

2) Automated Generation of the Register File: The next step in the process is the creation of a unit that contains the status and control registers. This unit must support basic identification registers (node identification in the NoC, and version information). Some routing control registers must also be provided to direct assertion failure information to a central unit within the Network-on-Chip. The CPU interface also supports two different register types that are customized to work with the assertion signals.

The first register type is an event latch that memorizes any assertion failures (pulses) coming from the hardware checkers. This register supports the clearing of detected assertions by writing a 1 to the appropriate bit location in the register.

The second register type is a counter. For a particular assertion or cover statement in the PSL, a register can be generated to count the number of times the assertion or cover statement has occurred and to allow the local CPU to clear the counter. The counter is built with saturation arithmetic, such that a delay in reading its output yields less information (a saturated counter) but never misleading results.

The generation of the CPU interface module is fully automated using a code generator. Our tool is given a base register definition file in which registers are described in a high-level format by defining their base address, reset behavior and associated documentation. To this basic set of registers, a table of registers is automatically added based on an extracted list of assertions and debug enhancement options (such as coverage counters or activity monitors).
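The behaviour of the two register types can be summarized with a short software model. This is a minimal sketch under our own assumptions, not the generated RTL: the class names `EventLatch` and `SaturatingCounter` are hypothetical, and we assume that a CPU write and an incoming pulse never target the same bit in the same cycle, a case the description above does not specify.

```python
class EventLatch:
    """Sticky assertion-failure bits with write-1-to-clear semantics."""

    def __init__(self, width: int) -> None:
        self.mask = (1 << width) - 1
        self.value = 0

    def clock(self, pulses: int) -> None:
        # Memorize every failure pulse seen on this clock cycle.
        self.value |= pulses & self.mask

    def cpu_write(self, data: int) -> None:
        # Bits written as 1 are cleared; bits written as 0 are left untouched.
        self.value &= ~data & self.mask


class SaturatingCounter:
    """Event counter that stops at its maximum value instead of wrapping."""

    def __init__(self, bits: int = 8) -> None:
        self.max_value = (1 << bits) - 1
        self.value = 0

    def clock(self, event: bool) -> None:
        if event and self.value < self.max_value:
            self.value += 1

    def cpu_clear(self) -> None:
        self.value = 0


latch = EventLatch(width=4)
latch.clock(0b0101)        # checkers 0 and 2 fail
latch.clock(0b0010)        # checker 1 fails one cycle later
latch.cpu_write(0b0001)    # CPU acknowledges checker 0 only
print(bin(latch.value))    # prints 0b110

counter = SaturatingCounter(bits=8)
for _ in range(300):       # more events than an 8-bit counter can hold
    counter.clock(True)
print(counter.value)       # prints 255
```

A wrapping 8-bit counter would read 44 after 300 events, which looks like a plausible count; the saturated value 255 instead tells the reader that the true count is at least 255.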
The resulting signals are automatically organized in a list of registers, and a documentation file is generated based on the description of the assertions (PSL code statements). This methodology allows a large number of assertions to be integrated into a complex design, while automating most of the tedious process of gathering the assertion result vectors in a comprehensive form that hardware designers can use to diagnose particular errors in a complex NoC. Our tool also automates the instantiation of, and the connectivity between, the assertion checker module and the register file.

3) Area Overhead of the Generated CPU Interfaces: To estimate the area overhead of the generated CPU interfaces, we synthesized three circuits. The first is a register file containing the basic set of control registers needed to make the unit identifiable by the CPU and to perform basic routing of the assertion flits (Base circuit). The second is the base register file augmented with 20 assertion checker registers (Asr circuit). The third is the base register file with 20 assertion checker registers and twenty 8-bit coverage counters with independent reset control (Asr+Cnt circuit). The synthesis target was the Stratix II family of FPGAs and the synthesis was performed as a LogicLock module (no I/O pads, no specific device). The results are summarized in Table I.

TABLE I
GENERATED CPU INTERFACES (8-BIT ADDRESS BUS, 16-BIT DATA BUS) AND THEIR SYNTHESIS RESULTS

    Circuit     Assertions   Cover Counters   ALUT   FF
    Base        0            0                31     34
    Asr         20           0                74     54
    Asr + Cnt   20           20               395    234

The larger logic utilization of the circuit with coverage is mostly due to the large number of 8-bit counters and reset controls, resulting in an additional 180 (20*(8+1)) flip-flops and the corresponding arithmetic logic and addressing multiplexers. To those results, the hardware cost of each assertion checker must be added. Interested readers are referred to more comprehensive publications [5], [6] highlighting synthesis results of complex PSL expressions compiled with MBAC.
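The incremental costs implied by Table I can be checked with a few lines of arithmetic. The sketch below only re-derives differences between the published rows; the per-register averages it prints are our own derived figures and are not separately reported synthesis results.

```python
# Synthesis results copied from Table I: circuit -> (ALUTs, flip-flops).
TABLE_I = {"Base": (31, 34), "Asr": (74, 54), "Asr+Cnt": (395, 234)}

N_ASSERTIONS = 20
N_COUNTERS = 20
COUNTER_BITS = 8


def delta(larger: str, smaller: str) -> tuple:
    """Extra (ALUTs, flip-flops) of one circuit relative to another."""
    return tuple(a - b for a, b in zip(TABLE_I[larger], TABLE_I[smaller]))


asr_aluts, asr_ffs = delta("Asr", "Base")      # cost of the 20 event-latch bits
cnt_aluts, cnt_ffs = delta("Asr+Cnt", "Asr")   # cost of the 20 coverage counters

# The flip-flop increase matches 8 counter bits plus 1 extra flip-flop each.
assert cnt_ffs == N_COUNTERS * (COUNTER_BITS + 1) == 180

print(asr_ffs / N_ASSERTIONS)    # 1.0 flip-flop per latched assertion
print(cnt_ffs / N_COUNTERS)      # 9.0 flip-flops per coverage counter
print(cnt_aluts / N_COUNTERS)    # average ALUTs per coverage counter
```

On these numbers, adding the event latches costs 43 ALUTs and 20 flip-flops in total, while the counters account for 321 ALUTs, or roughly 16 ALUTs per counter on average.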
D. Handling of Assertion Failures

In our approach, the assertion failures can be set to trigger the generation of an m-flit indicating the assertion failure and its node within the NoC. The assertion failures are memorized, and a second vector containing the reported assertions is maintained in hardware. Any differences between the two vectors are automatically reported and the reported assertion vector is updated. In such a system, each latched assertion failure is thus reported only once (until the reported assertion vector is cleared through the local CPU interface). This approach enables global monitoring of the NoC without incurring any computation overhead on the CPU units present in the NoC.

V. TIMESTAMPING OF ASSERTION FAILURES

The management flits described in Section II-B are used to provide timing information, and high-priority flits are used to route this information to the central aggregation unit. The structure of the hp-flit is designed to keep the chronology of assertion failures and, through an added operations field, to extend its use beyond debugging to monitoring.

A. Architectural Issues

As explained above, it is useful in the context of debugging to keep the order in which errors occur in the system. The hierarchical ring NoC presents a challenge in this context, as each ring potentially operates in a different clock domain. It thus becomes necessary for the test information to be time-stamped.

The architecture of the interconnect implicitly provides information with respect to local timing. Indeed, every flit must go through a certain number of stations on a ring before arriving at a destination or a different ring level. Unfortunately, every station is buffered, introducing not only some latency but also a loss of information as to the flit's time trace. The hp-flit introduces a simple solution to this problem.
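To make the ordering problem concrete, the sketch below compares the order in which failure reports arrive at the monitoring node with the order recovered from a shared time stamp. It is an illustration under our own assumptions: the `Report` type, the station names and all numeric values are hypothetical, each report is assumed to carry a stamp taken from a single free-running counter, and that counter is assumed not to wrap within the observation window.

```python
from typing import List, NamedTuple


class Report(NamedTuple):
    source: str    # station that raised the assertion failure
    stamp: int     # value of the shared counter attached to the report
    arrival: int   # cycle at which the report reaches the monitoring node


def arrival_order(reports: List[Report]) -> List[str]:
    """Order seen by a monitor that ignores the time stamps."""
    return [r.source for r in sorted(reports, key=lambda r: r.arrival)]


def chronological_order(reports: List[Report]) -> List[str]:
    """Order recovered from the time stamps, independent of buffering delay."""
    return [r.source for r in sorted(reports, key=lambda r: r.stamp)]


reports = [
    # Root cause on a congested ring: stamped early, delivered late.
    Report(source="ring0.station2", stamp=100, arrival=180),
    # Consequent failure on a lightly loaded ring: stamped later, delivered first.
    Report(source="ring3.station1", stamp=112, arrival=140),
]

print(arrival_order(reports))        # consequence appears before its cause
print(chronological_order(reports))  # root cause is listed first
```

Sorting by stamp places `ring0.station2` first, which is the information a debugger needs to identify the root cause of a shower of errors.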
When an error is detected, an hp-flit is sent by the network interface of the module's hardware checker and makes its way with high priority across the local ring. In order to keep the latencies of the hp-flits to a minimum, they bypass the station buffers and get routed to the output port before normal traffic, as shown in Figure 2. The special case of a collision between two hp-flits, one local and the other already on the ring, is a difficult one. Intuitively, the hp-flit already on the ring has priority since it was fired first, in accordance with normal traffic rules. Unfortunately, a starvation problem may occur, causing a disordering of hp-flits. It is quite unlikely, however, that a very long series of hp-flits will be fired sequentially by the same station, causing severe starvation on the other nodes.

Once the hp-flit leaves the local ring and makes its way to the global ring, it is time-stamped by a free-running counter whose value is directly accessible by all IRIs. The hp-flit then loses its high priority and becomes just another m-flit, as the purpose of its priority is voided by the time stamp. It then makes its way to the debug and monitoring reception node.

The hp-flit does not contain any information on the data it represents. It simply provides chronological and circumstantial bits indicating its severity and other issues. Thus, following its emission from the station, the actual data must be sent through an m-flit.

Fig. 2. Mid-level view of local station architecture. The hp-flits bypass buffering FIFOs to accelerate their arrival to the global ring where they are time-stamped. In case of collision, the hp-flit already on the ring has priority. (Figure labels: Incoming ring packet, Ingress FIFO, Device, Local Ring, Egress FIFO, M-Flit, HPF, Output Buffer, Debug Router, Station, Outgoing ring packet.)

A complication arises when two (or more) hp-flits must be sent in a short interval or in succession: the m-flit associated with the first might not yet have been sent when the second hp-flit comes out. Two packets of information will thus be sent in succession. To allow the debug and monitoring reception node to differentiate the first from the second, a few bits must be reserved in both to make the association between the hp-flit and its m-flit payload. Thus an hp-flit has the following structure: 1 bit to mark it as such, 3 bits to identify the type of operation, 4 bits to mark the hp-flit number and 4 don't-care bits, in addition to the bits that hold the global ring time stamp.

VI. CONCLUSION

We presented a novel approach to create and instantiate assertion checker circuits and diagnostic modules in the context of Network-on-Chip development. Our method of transforming PSL assertions into hardware checkers is guaranteed not to influence the logical behavior of the circuit. We also showed that PSL statements combined with automated CPU interfacing can be used to quickly generate circuits that, applied to the SoC, increase the ability to monitor the network operation and diagnose failures. To optimize the usefulness of the assertion checkers, we applied small modifications to our existing NoC to provide the chronology of errors. To this effect, we introduced a NoC-level high-priority messaging system. The only potential side effect of our additions to the NoC is an increased load on the circuit and a modest increase in device area.
We believe that the small modifications to the existing system open the way to a new approach to debugging complex NoCs.

REFERENCES

[1] W. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proceedings of the Design Automation Conference, 2001.
[2] L. Benini and G. D. Micheli, "Networks on chips: a new SoC paradigm," IEEE Computer, vol. 35, 2002.
[3] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, and C. R. Das, "Exploring Fault-Tolerant Network-on-Chip Architectures," in Proceedings of the International Conference on Dependable Systems and Networks (DSN 06), 2006, pp. 93-104.
[4] D. Spoor and R. Chu, "Building causal models for operator aiding in a supervisory control environment," in Proceedings of the IEEE Conference on Systems, Man, and Cybernetics, 1993, pp. 344-347.
[5] M. Boulé and Z. Zilic, "Efficient Automata-Based Assertion-Checker Synthesis of PSL Properties," in Proceedings of the 2006 IEEE International High Level Design Validation and Test Workshop (HLDVT 06), 2006, pp. 69-76.
[6] M. Boulé, J. Chenard, and Z. Zilic, "Adding Debug Enhancements to Assertion Checkers for Hardware Emulation and Silicon Debug," in Proceedings of the 24th IEEE International Conference on Computer Design (ICCD 06), 2006, pp. 294-299.
[7] S. Bourduas, J.-S. Chenard, and Z. Zilic, "A RTL-Level Analysis of a Hierarchical Ring Interconnect for Network-on-Chip Multi-Processors," in Proceedings of the International System-on-a-Chip Design Conference (ISOCC 06), 2006.
[8] M. Riley, N. Clestrom, M. Genden, and S. Sawamura, "Debug of the Cell processor: Moving the lab into silicon," in IEEE International Test Conference, October 2006, pp. 1-9.
[9] C. Ciordas, K. Goossens, A. Rădulescu, and T. Basten, "NoC monitoring: Impact on the design flow," in IEEE International Symposium on Circuits and Systems, 21-24 May 2006, pp. 1981-1984.
[10] C. Ciordas, T. Basten, A. Rădulescu, and K. Goossens, "An event-based monitoring service for networks on chip," ACM Transactions on Design Automation of Electronic Systems, October 2005, pp. 702-723.
[11] É. F. Cota, M. E. Kreutz, C. A. Zeferino, L. Carro, M. Lubaszewski, and A. A. Susin, "The impact of NoC reuse on the testing of core-based systems," in VLSI Test Symposium, 2003, pp. 128-133.
[12] H. Foster, A. Krolnik, and D. Lacey, Assertion-Based Design, 2nd ed. Norwell, Massachusetts: Kluwer Academic Publishers, 2004.