Pausible Clocking: A First Step Toward Heterogeneous Systems

Kenneth Y. Yun, Ryan P. Donohue
Department of Electrical and Computer Engineering
University of California, San Diego
9500 Gilman Drive, La Jolla, CA 92093-0407
kyy@ucsd.edu

Abstract

This paper describes a novel communication scheme, guaranteed to be free of synchronization failures, among multiple synchronous modules operating independently. In this scheme, communication between every pair of modules takes place through an asynchronous FIFO channel; communication between a module and the FIFO uses request-acknowledge handshaking. The handshaking signals are synchronized to the local module clock in an unconventional way [17, 15, 3, 12, 5]: the local clock, built from a ring oscillator, is paused or stretched, if necessary, to ensure that the handshaking signal satisfies setup and hold time constraints with respect to the local clock. We constructed a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220 MHz (according to SPICE simulation); the maximum clock rate is limited by the ring oscillator, not the pausible clocking control. Preliminary test results indicate that the fabricated chips operate correctly as simulated.

1. Introduction

The next generation of digital VLSI systems will necessarily be based on system-on-chip concepts, in order to satisfy unrelenting demands for higher performance and to accommodate the smaller packaging and low power requirements of mobile applications. These on-chip systems will consist of multiple independently synchronized domains: some may be clocked domains, such as synchronous processor cores, while others may be clockless (asynchronous) domains, such as peripheral controllers. These chip designs will resemble today's complex board-level designs.
On-chip modules will be held together by interface glue logic, which must facilitate high-speed communication between synchronous modules operating at different clock frequencies, between synchronous and asynchronous modules, and between asynchronous modules. The key difference between today's board-level designs and future system-on-chip designs is the speed at which communication takes place. A first step toward such heterogeneous system design on chip is a reliable high-speed communication scheme among multiple synchronous modules operating independently.

(This research was supported in part by a gift from Intel Corporation.)

We examined a variety of communication schemes that attempt to mitigate synchronization failure without sacrificing communication throughput. They generally fall into one of two categories: (1) brute-force synchronization of communication signals to each module's free-running clock with an acceptable level of synchronization failure; (2) adjustment of each synchronous module's local clock, when necessary, to avoid synchronization failure.

The first category includes methods such as the well-known double-latching scheme and a natural extension of it called pipeline synchronization [16]. These methods reduce the probability of synchronization failure to an acceptable level by repeatedly resynchronizing communication signals with back-to-back latches. They are simple and inexpensive to implement, but a major drawback is communication latency.

The methods in the second category [17, 15, 3, 12, 5] generally rely on stopping or stretching each synchronous module's local clock to guarantee that communication signals never violate setup and hold time constraints with respect to the local clock. Although these methods are robust and do not incur long communication latency, they require a special clocking circuit that is unfamiliar to most designers.
The simplest conceivable example of this scheme is a synchronous module communicating with an asynchronous peripheral. In this system, the synchronous module latches the handshaking signals from the asynchronous module by stopping or stretching its own clock, when necessary.

In this paper, we describe a general method of communication between two synchronous modules operating independently, i.e., at different clock frequencies or phases, based on the pausible clocking scheme as shown in figure 1. Synchronous modules communicate with each other via an asynchronous FIFO used as a communication channel. The interfaces between the synchronous modules and the FIFO are pausible clocking control (PCC) circuits, i.e., the handshaking signals from the FIFO are sampled by the pausible clock of each synchronous module. Although self-timed FIFOs have been used for communication between synchronous modules elsewhere [6], they have not been utilized in communication between synchronous modules operating independently at different clock frequencies.

Figure 1: Two synchronous modules communicating via an asynchronous channel: (a) one-way communication; (b) bidirectional communication.

In order to validate this scheme, we implemented a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220 MHz; the upper bound on the local clock rate is due to the ring oscillator, not the pausible clocking control.

The rest of this paper is organized as follows: section 2 reviews the mutual exclusion element and the arbiter, key components used in the PCC circuit; section 3 describes the design and implementation of the PCC unit; section 4 describes several system configurations using this scheme and their limitations; section 5 describes the experimental results; section 6 concludes the paper with some remarks on future system design.

2. Background: Mutual Exclusion and Arbiter

In this section, we briefly review the concepts of the mutual exclusion element and the arbiter. A mutual exclusion element (ME) [15, 14] is a circuit (see figure 2a) that allows one request to pass through at a time on a first-come, first-served basis. When two inputs arrive simultaneously, it arbitrarily selects one to pass through. An arbiter [10, 9, 4, 14, 8, 13] is a circuit that propagates one request at a time (as does the ME) but also acknowledges the requesters with grant signals. The circuit used in our design is shown in figure 2b. The symbol C represents a C-element, a self-timed latch which raises its output when both inputs become high, lowers its output when both inputs become low, and keeps the old value if the inputs have different polarities.

Figure 2: (a) CMOS mutual exclusion element circuit; (b) Arbiter circuit using ME.

The detailed behavior of both circuits is explained clearly in many texts and journals, so we will not elaborate here. However, an interesting characteristic of the ME should be noted [2, 7]: the closer the arrival times of the rising transitions of the two inputs, the longer it takes for the internal analog difference circuit to resolve the metastability [2, 11], and hence the longer the latency. In order to use the ME circuit effectively in our design, we simulated it in 1.2 µm CMOS with SPICE. All PMOS transistors in our design have W/L = 12λ/2λ, and all NMOS transistors have W/L = 6λ/2λ. The mean latency from input to output versus the difference in input arrival times is shown in figure 3. The rise time of both inputs was set to 1 ns for the simulation, which is typical for 1.2 µm technology.

Figure 3: Mutual exclusion element mean latency versus difference in input arrival times. (Axes: latency in ns; difference in input arrival times in ns.)

3. Design of Pausible Clocking Control

The pausible clocking control is a scheme to avoid synchronization failure by adjusting the local clock. A synchronization failure at the module interface occurs when the arrival times of an external signal transition and a sampling edge of the clock are indistinguishable by the sampling latch at the module boundary. In our scheme, the synchronization failure is circumvented by pausing or stretching the local module clock when necessary.

Figure 4: Receiver PCC for one-way communication.

A block diagram of the receiver PCC is shown in figure 4. This scheme uses a mutual exclusion element (ME) to force temporal separation of the sampling edges of the clock and external signal transitions. Because MEs require that requesters competing for shared resources be persistent, the clock input to the ME must be stretched when it loses the arbitration. A ring oscillator is used instead of a crystal oscillator so that the duration of the off-phases of the clock can be adjusted. The local module clock, clk, is a buffered version of one of the outputs of the ME. It normally has a 50% duty cycle, except when the clock input loses the arbitration, in which case the off-phase of the clock is stretched.

Figure 5: PCC timing (one-way communication).

As shown in figure 5, one-way communication, for which the synchronous module is the receiver, is straightforward. A request event from the FIFO is forwarded to the ME via the asynchronous finite state machine (AFSM). If clk is low when R1 rises, then the ME immediately raises G1, which prompts the AFSM to generate an event on Sρ. This event is effectively synchronized to clk, i.e., guaranteed not to induce a synchronization failure when sampled by the FSM, under a reasonable timing assumption described below. Note that the clock input to the ME may rise before the ME lowers G1, but the ME will not allow clk to rise until G1 becomes low. On the other hand, if clk is already high when R1 rises, the assertion of G1 is stalled until clk is lowered.
As soon as clk falls, the ME raises G1 and the AFSM generates an event on Sρ. clk may actually rise at about the same time R1 rises. In such situations, in which the temporal separation of R1 and the clock input becomes blurred, the ME simply tosses a coin to determine which signal to service first. If the clock input wins the coin toss, then clk rises first and G1 remains low until clk falls (which happens shortly after the clock input falls). On the other hand, if R1 wins, then the ME raises G1 first and blocks clk from rising. In order to prevent clk from stalling indefinitely (until the next toggling of the request, Rρ), the AFSM lowers R1 immediately after G1 rises, which in turn causes G1 to fall, allowing clk to rise.

The PCC does not differentiate rising edges of Rρ from falling edges: both edges cause R1 to be asserted and, as a result, G1 to be asserted. In fact, the AFSM effectively performs a two-phase to four-phase conversion from Rρ to R1 and a four-phase to two-phase conversion from G1 to Sρ. This conversion is independent of whether a two-phase or four-phase communication protocol is used between the FIFO and the synchronous module. It is done merely so that both edges of Sρ are synchronized to clk.

Figure 6: PCC timing constraint.
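The arbitration cases described above can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions that are not from the paper: an ideal clock that is low for the first half of each period and high for the second half, ME and AFSM delays lumped into one hypothetical parameter `g1_hold` (the time G1 stays high), and a tie on a rising edge always resolved in favor of the clock rather than by a coin toss. The function name and the numbers in the usage note are likewise hypothetical.

```python
def service_request(t_req, period, g1_hold):
    """Return (grant_time, stretch) for one request (R1 rising) at t_req.

    All times are integers in picoseconds. Assumptions of this sketch:
    the unstretched clock is low in the first half of each period and
    high in the second half; g1_hold lumps the ME and AFSM delays that
    keep G1 high; a request exactly on a rising edge loses to the clock.
    """
    half = period // 2
    cycle_start = (t_req // period) * period
    phase = t_req - cycle_start
    if phase < half:
        # clk is low: the ME raises G1 immediately.
        grant = t_req
        next_rise = cycle_start + half
    else:
        # clk is high: G1 is stalled until clk falls at the end of the cycle.
        grant = cycle_start + period
        next_rise = grant + half
    release = grant + g1_hold               # G1 falls after the AFSM lowers R1
    stretch = max(0, release - next_rise)   # clk cannot rise while G1 is high
    return grant, stretch
```

With a hypothetical 4600 ps period and a 1000 ps hold, a request early in the low phase (`t_req=500`) is granted at once with no stretch, a request late in the low phase (`t_req=2000`) stretches the off-phase by 700 ps, and a request during the high phase (`t_req=3000`) waits for the falling edge at 4600 ps.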

In order for the synchronous FSM that generates Aρ to recognize the change in Sρ, Sρ must satisfy setup and hold time constraints with respect to clk. As illustrated in figure 6, in order to recognize Sρ+ (Sρ−)¹ reliably, the path from G1+ to Sρ+ (Sρ−) must be shorter than the path from G1+ to R1−, to G1−, to clk+ by at least the data setup time of the FSM latches. This is easily satisfied because the G1+ to Sρ+ (Sρ−) delay is a single generalized C-element delay (transitions on Sρ are directly triggered by G1+), which is much less than the delay from G1+ to R1−, to G1−, to clk+.

For bidirectional communication, the synchronous module must interface with two FIFOs as shown in figure 1b. As illustrated in figure 8, two handshaking signals must be synchronized to the local clock: the request from the sending FIFO and the acknowledge from the receiving FIFO. In order to simplify the interface to the ME, our design uses an arbiter to select just one external signal to pass through at a time. Synchronization of the handshaking signals is done in the same way as for one-way communication.

Figure 7: Asynchronous finite state machine specification and implementation.

The asynchronous finite state machine (see figure 7) is specified in burst-mode [19, 18] and synthesized using the 3D-gC synthesis tool [20]. This burst-mode state machine has two inputs (Rρ, G1) and two outputs (R1, Sρ). In state 0, when Rρ rises, the machine raises R1 and goes to state 1. In state 1, the machine waits for G1 to rise; when it does, the machine lowers R1 and raises Sρ concurrently and goes to state 2. When G1 falls in state 2, the machine transitions to state 3. The machine then transitions through states 4 and 5 and back to 0, as Rρ− triggers a sequence of signal transitions ending with G1−.
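The state sequence described above can be written down as a transition table. The sketch below is a reading of the prose, not the synthesized circuit: states 0 to 2 follow the text directly, while the outputs for states 3 to 5 are assumed to mirror the first half (with Sρ falling instead of rising), since the text only says that Rρ− triggers a sequence ending with G1−. The ASCII names `Rp` and `Sp` stand in for Rρ and Sρ, and `run_afsm` is a hypothetical helper.

```python
# (state, input event) -> (next state, output events)
AFSM = {
    (0, "Rp+"): (1, ["R1+"]),
    (1, "G1+"): (2, ["R1-", "Sp+"]),
    (2, "G1-"): (3, []),
    (3, "Rp-"): (4, ["R1+"]),          # assumed by symmetry with state 0
    (4, "G1+"): (5, ["R1-", "Sp-"]),   # assumed by symmetry with state 1
    (5, "G1-"): (0, []),
}


def run_afsm(events, state=0):
    """Apply input events in order; return (final state, emitted outputs)."""
    outputs = []
    for event in events:
        # A KeyError here means the event is not allowed in this state.
        state, emitted = AFSM[(state, event)]
        outputs.extend(emitted)
    return state, outputs


final_state, outputs = run_afsm(["Rp+", "G1+", "G1-", "Rp-", "G1+", "G1-"])
```

Under these assumptions, one rising and one falling event on `Rp` each produce a full four-phase pulse on `R1` and a single transition on `Sp`, which is the two-phase to four-phase conversion the section describes.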
Figure 8: PCC for bidirectional communication.

¹ We use a+ and a− to denote rising and falling transitions of a, respectively.

Figure 9: (a) Heterogeneous message-passing multiprocessor using PCCs (ring configuration with synchronous nodes only); (b) heterogeneous system with a mixture of asynchronous and synchronous modules (heterogeneous ring configuration).

4. System Configurations and Limitations

Using the pausible clocking scheme, it is conceivable to construct a heterogeneous multiprocessor system with point-to-point links between every pair of nodes. Each link is a bidirectional FIFO as shown in figure 1. However, as the fanouts from and fanins to each node increase, the arbiter block becomes larger, making the system impractical. We assert instead that it is possible to construct a ring configuration, as shown in figure 9a, similar to the systems proposed in the Scalable Coherent Interface (SCI) specification [1]. In this structure, messages are always transmitted to one side and received from the other side, so that only one level of arbitration is required. A major advantage of our ring configuration over other proposed systems, such as SCI systems, is that it is a truly heterogeneous system with each node operating at its own speed. Another typical system configuration would be a mixture of asynchronous and synchronous modules, as shown in figure 9b.

Figure 10: One-way test configuration.

Systems-on-chip should be designed with as many reusable components as possible. Standard modules, such as CPU cores, should be reused with little or no modification, because these modules are highly optimized for performance and sensitive to timing variation. For the systems proposed in this paper, ideally, the pausible clocking control circuit should simply replace a portion of the system clock generation unit. For state-of-the-art microprocessors, however, the system clock is produced by a phase-locked loop (PLL). We cannot adjust the phase of the output of a PLL instantaneously in an analog fashion, as our pausible clocking control requires. Thus a ring oscillator must be used in place of a PLL, and we then lose control of the nominal frequency. Tuning the ring oscillator frequency would require more control pins and hence would be more expensive. (However, the ring oscillator does have the advantage that its frequency drift closely tracks the logic components on the chip; e.g., if logic components slow down due to an increase in operating temperature, then so does the ring oscillator.) Furthermore, anything in the clock path designed to generate multiple frequencies and/or to minimize jitter creates a problem for pausible clocking.

5. Experimental Results

We performed extensive SPICE simulations of two modules connected via an asynchronous FIFO as shown in figure 10, after back-annotating the layout parasitics into the schematic using Mentor Graphics Accusim. We varied the depth of the FIFO between 1 and 4. The performance of the PCC appears to be independent of the depth of the FIFO. The FIFO is included for the generic reason of smoothing bursty data transfer between two modules operating at different clock rates, not to enhance the performance of the PCCs. The first timing trace (figure 11) shows a receiver module operating at 217 MHz.
The first event on Rρ (a rising transition) is acknowledged normally without pausing. The second event (a falling transition) causes clk to be paused for about 1.8 ns.

Figure 11: PCC simulation trace illustrating a clock pause (clk = 217 MHz).

The second timing trace (figure 12) shows a simulation result of one-way communication between two modules operating at different clock frequencies. In this simulation, the sender module operates at 135 MHz and the receiver at 217 MHz. The sender FSM is simply a rising-edge-triggered flip-flop followed by an inverter. At the first rising edge of clk2 after the system reset signal turns off, the sender FSM generates a request to the FIFO by raising Rs. The FIFO responds immediately by raising As. The sender FSM samples the synchronized version of this signal (Ss) at the next rising edge of clk2 and lowers Rs. As long as the FIFO responds to the sender's request immediately, there is no jitter (pausing) on clk2, because As is synchronized to clk2. The receiver FSM is a rising-edge-triggered flip-flop. When a request from the sender reaches the receiver through the FIFO, it is acknowledged at the next rising edge of clk1. Because the request signal toggles independently of clk1, clk1 pauses occasionally to synchronize the request. Because the receiver operates at a higher clock frequency than the sender, the FIFO never fills up, so the sender never slows down in this simulation.
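The observation that the FIFO never fills when the receiver is faster can be illustrated with a rough occupancy model. This is a sketch under assumptions that are not part of the paper's SPICE setup: the sender offers one item on every cycle (an upper bound on the handshake rate), the receiver removes at most one item per cycle, clock pausing and FIFO propagation delay are ignored, and an item offered exactly on a receiver edge is not seen until the next edge. The function name and its parameters are hypothetical.

```python
from fractions import Fraction


def fifo_peak_and_stalls(f_send_mhz, f_recv_mhz, depth, n_items):
    """Return (peak occupancy, sender stalls) after n_items are accepted.

    Times are exact fractions of a microsecond so that coincident edges
    are handled deterministically (the receiver edge is processed first).
    """
    t_send = Fraction(1, f_send_mhz)
    t_recv = Fraction(1, f_recv_mhz)
    next_send, next_recv = t_send, t_recv
    occupancy = peak = stalls = sent = 0
    while sent < n_items:
        if next_recv <= next_send:
            # Receiver clock edge: take one item if any is waiting.
            if occupancy > 0:
                occupancy -= 1
            next_recv += t_recv
        elif occupancy < depth:
            # Sender clock edge with room in the FIFO: item accepted.
            occupancy += 1
            sent += 1
            peak = max(peak, occupancy)
            next_send += t_send
        else:
            # Sender clock edge with a full FIFO: the sender must wait.
            stalls += 1
            next_send += t_send
    return peak, stalls
```

In this model, a 135 MHz sender feeding a 217 MHz receiver never holds more than one item at any depth from 1 to 4, which is consistent with the depth-independent behavior reported above. Swapping the two frequencies fills a depth-4 FIFO and stalls the sender.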

Figure 12: One-way communication simulation trace (clk1 = 217 MHz; clk2 = 135 MHz).

6. Conclusion

We presented a new communication scheme, based on pausible clocking, for multiple synchronous modules operating independently. To demonstrate its feasibility, we constructed a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220 MHz (according to SPICE simulation); the maximum clock rate is limited by the ring oscillator, not the pausible clocking control. At the time of publication, preliminary test results indicate that the fabricated chips operate correctly as simulated. In the future, we plan to investigate a larger system: a heterogeneous ring configuration with a mixture of synchronous and asynchronous modules. In addition, we will investigate new oscillator designs (other than simple ring oscillator designs).

Acknowledgment

The authors would like to thank Charles Dike of Intel Corporation for pointing out real-world problems associated with implementing pausible clocking control for microprocessor cores.

References

[1] IEEE Standard 1596-1992. Scalable Coherent Interface (SCI).
[2] T. J. Chaney and C. E. Molnar. Anomalous behavior of synchronizer and arbiter circuits. IEEE Transactions on Computers, C-22(4):421-422, April 1973.
[3] Daniel M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, October 1984.
[4] P. Corsini. Speed-independent asynchronous arbiter. IEE Journal on Computers and Digital Techniques, 2(5):221-222, October 1979.
[5] G. Gopalakrishnan and L. Josephson. Towards amalgamating the synchronous and asynchronous styles. In TAU-93.
[6] Mark R. Greenstreet. Implementing a STARI chip. In Proc. International Conf. Computer Design (ICCD), pages 38-43. IEEE Computer Society Press, October 1995.
[7] Lindsay Kleeman. Service and Metastability Performance of Arbiters. PhD thesis, Dept. of Electrical and Computer Eng., Univ. of Newcastle, Australia, August 1986.
[8] Alain J. Martin. On Seitz's arbiter. Technical Report 5212:TR:86, Caltech Computer Science, 1986.
[9] R. C. Pearce, J. A. Field, and W. D. Little. Asynchronous arbiter module. IEEE Transactions on Computers, 24:931-932, September 1975.
[10] W. W. Plummer. Asynchronous arbiters. IEEE Transactions on Computers, 21(1):37-42, January 1972.
[11] Fred U. Rosenberger and Charles E. Molnar. Comments on "Metastability of CMOS latch/flip-flop". IEEE Journal of Solid-State Circuits, 27(1):128-130, January 1992. Reply by Robert W. Dutton, pages 131-132 of same issue.
[12] Fred U. Rosenberger, Charles E. Molnar, Thomas J. Chaney, and Ting-Pien Fang. Q-modules: Internally clocked delay-insensitive modules. IEEE Transactions on Computers, C-37(9):1005-1018, September 1988.
[13] T. Sakurai. Optimization of CMOS arbiter and synchronizer circuits with submicron MOSFETs. IEEE Journal of Solid-State Circuits, 23(4):901-906, August 1988.
[14] Charles L. Seitz. Ideas about arbiters. Lambda, 1(1, First Quarter):10-14, 1980.
[15] Charles L. Seitz. System timing. In Carver A. Mead and Lynn A. Conway, editors, Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
[16] Jakov N. Seizovic. Pipeline synchronization. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 87-96, November 1994.
[17] M. J. Stucki and J. R. Cox Jr. Synchronization strategies. In Charles L. Seitz, editor, Proceedings of the First Caltech Conference on Very Large Scale Integration, pages 375-393, 1979.
[18] K. Y. Yun. Synthesis of Asynchronous Controllers for Heterogeneous Systems. PhD thesis, Stanford University, August 1994. Technical Report CSL-TR-94-644.
[19] K. Y. Yun, D. L. Dill, and S. M. Nowick. Synthesis of 3D asynchronous state machines. In Proc. International Conf. Computer Design (ICCD), pages 346-350. IEEE Computer Society Press, October 1992.
[20] Kenneth Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements. To appear in EURO-DAC-96.