Pausible Clocking Based Heterogeneous Systems

Transcription

1 Pausible Clocking Based Heterogeneous Systems Kenneth Y. Yun, Member, IEEE Ayoob E. Dooply, Student Member, IEEE Abstract This paper describes a novel communication scheme, which is guaranteed to be free of synchronization failures, amongst multiple synchronous and asynchronous modules operating independently. In this scheme, communication between every pair of modules is done through an asynchronous channel; communication between a module and the is done using a requestacknowledge handshaking. hronization of handshake signals to the local module clock is done in an unconventional way [1], [2], [3], [] the local clock built out of a ring oscillator is paused or stretched,if necessary, to ensure that the handshake signal satisfies setup and hold time constraints with respect to the local clock. In order to validate this scheme, we implemented a test chip in 0.5µm CMOS. This chip is designed as a ring, composed of two synchronous modules, an asynchronous module, and two asynchronous s. Each module functions as a receiver to one module and a sender to another module. Test results show that the chip functions reliably up to 56MHz. Keywords lobally asynchronous locally synchronous, Stretchable clock, hronization, Heterogeneous systems I. INTODUCTION The next generation digital VLSI systems will necessarily be based on system-on-a-chip concepts, in order to satisfy unrelenting demands for higher performance and also to accommodate smaller packaging and low power requirements. These on-chip systems will consist of multiple independently synchronized modules: some may be clocked modules, such as synchronous processor cores, while others may be clockless (asynchronous) modules, such as peripheral controllers. These chip designs will resemble today s complex board-level designs. On-chip modules will be held together by interface glue logic which must facilitate high speed communication between synchronous modules operating at different clock frequencies, between synchronous and asynchronous modules, and between asynchronous modules. The key difference between today s board-level designs and the future system-on-a-chip designs, however, is the speed at which communications take place. A first step toward such heterogeneous system design on a chip is a reliable high-speed communication scheme among multiple synchronous modules operating independently. We examined a variety of communication schemes that attempt to mitigate synchronization failure without sacrificing communication throughput. They generally fall into one of two categories: (1) brute-force synchronization of communication signals to each module s free-running clock with an acceptable level of synchronization failure; (2) adjustment of individual synchronous module s local clock, when necessary, to avoid synchronization failure. The first category includes methods such as the well-known double-latching scheme and a natural extension of the doublelatching scheme called pipeline synchronization [5]. These methods reduce the probability of synchronization failure to This work was supported in part by a gift from Intel Corporation and by a National Science Foundation CAEE Award MIP The authors are with ECE Dept., University of California, San Diego, 9500 ilman Drive, La Jolla, CA ; {kyy,adooply}@ucsd.edu. an acceptable level by resynchronizing communication signals with back-to-back latches. These methods are simple and inexpensive to implement, but a major drawback is the latency of communication. The methods in the second category [1], [2], [3], [] generally rely on stopping or stretching each synchronous module s local clock to guarantee that communication signals never violate setup and hold time constraints with respect to the local clock. Although these methods are robust and do not incur long communication latency, they involve designing a special clocking circuit, unfamiliar to most designers. The simplest example using this scheme one can conceive is a synchronous module communicating with an asynchronous peripheral. In this system, the synchronous module latches the handshake signals from the asynchronous module by stopping or stretching its own clock, when necessary. Both categories described above require some form of arbitration. However, if the phase relation between two synchronous modules can be predicted deterministically as in [6], then we can avoid synchronization failure, without arbitration, by timing communications when safe. In this paper, we describe a general method of communication amongst multiple synchronous and asynchronous modules operating independently, i.e., at different clock frequencies or phases, based on the pausible clocking scheme as shown in Fig. 1. hronous modules communicate with one another via asynchronous s 1 used as communication channels. The interfaces between the synchronous modules and the are pausible clocking control (PCC) circuits. In order to validate this scheme, we implemented a test chip in 0.5µm CMOS through MOSIS. This chip is designed as a ring, composed of two synchronous modules, an asynchronous module, and two asynchronous s. Each module functions as a receiver to one module and a sender to another module. Each synchronous module has a frequency control with four settings, which is independently programmable externally. The chip functions reliably for all the settings, albeit substantially faster than expected by simulation. II. DESIN OF PAUSIBLE CLOCKIN CONTOL The pausible clocking control is a scheme to avoid synchronization failure by adjusting the local clock. A synchronization failure at the module interface occurs when the arrival times of an external signal transition and a sampling edge of the clock are indistinguishable by the sampling latch at the module boundary. In our scheme, the synchronization failure is circumvented by pausing or stretching the local module clock when necessary. 1 Self-timed s have been used to facilitate robust communication between synchronous modules operating at the same frequency elsewhere [7]. However, they have not been utilized in communication between synchronous modules operating independently at different clock frequencies.

2 IEEE TANSACTIONS ON VLSI SYSTEMS 2 sender receiver h Module 1 PCC 1ρ A1ρ A1σ 1σ 2σ A2σ A2ρ 2ρ sender receiver PCC h Module 2 Fig. 1. Two synchronous modules communicating via asynchronous channels. ρ Aσ 1 Aρ FSM S ρ 2 σ FSM SA σ 2 Arbiter A. hronization Strategy ME rclk Fig. 2. PCC. ing Oscillator A block diagram of the PCC is shown in Fig. 2. This scheme uses a mutual exclusion element (ME) to force the temporal separation of the sampling edges of the clock and external signal transitions. A mutual exclusion element [2], [8] is a circuit that allows one request to pass through at a time on a first come first serve basis. When two inputs arrive simultaneously, it selects one to pass through arbitrarily. Because MEs require that requesters competing for shared resources must be persistent, the clock input to the ME must be stretched when it loses an arbitration. A ring oscillator is used instead of a crystal oscillator in order to be able to adjust the duration of off-phases of the clock. The local module clock,, is a buffered version of one of the outputs of the ME. It normally has a 50% duty cycle, except when the clock input loses an arbitration, in which case an off-phase of the clock is stretched. Consider a scenario in which a synchronous module equipped with the PCC is receiving inputs from the. The PCC needs ρ 1 1 S ρ A ρ rclk A σ 2 setup paused Fig. 3. PCC timing (bidirectional communication). an arbiter, as depicted in Fig. 2, to decide which of the two independent inputs (the request from the sending [ ρ ]andthe acknowledgment from the receiving [A σ ]) to be presented to the ME. For this scenario, we assume that ρ has won the arbitration. Thus, in this case, the arbiter simply forwards 1 to the ME and from the ME. As shown in the shaded area of Fig. 3, ρ is forwarded to the ME via the asynchronous finite state machine () and the arbiter. If rclk is low when rises, then the ME immediately raises, which prompts the to generate an event on S ρ.this event is effectively synchronized to, i.e., guaranteed not to induce a synchronization failure when sampled by the FSM, under a reasonable timing assumption as described below. Note that rclk may rise before the ME lowers, but the ME will not allow to rise until becomes low. On the other hand, if rclk is already high when rises, the assertion of is stalled until rclk is lowered. As soon as rclk falls, the ME raises and the generates an event on S ρ. clk may actually rise at about the same time rises. In such situations the situations in which temporal separation of + and rclk + becomes blurred the ME simply tosses a coin to determine which signal to service first. If rclk + wins the coin toss, then rises first and remains low until falls (which happensshortlyafter rclk falls). On the other hand, if + wins, then the ME raises first and blocks from rising. In orderto prevent from stalling indefinitely (until the next toggling of the request, ρ ), the lowers 1 immediately after 1 rises, which in turn causes to fall allowing to rise. The PCC does not differentiate rising edges of ρ from falling edges both edges enable 1 to be asserted and 1 to be asserted as a result. In fact, the effectively performs a two-phase to four-phase conversion from ρ to 1 and a four to two-phase conversion from 1 to S ρ. This conversion is independent of whether a two-phase or four-phase communication protocol is used between the and the synchronous module. It is merely done so that both edges of S ρ are synchronized to. In order for the synchronous FSM that generates A ρ to recognize the change in S ρ, we need to ensure that S ρ satisfies setup and hold time constraints with respect to. Inor- der to recognize S ρ + (S ρ ) 2 reliably, the path from 1 + to S ρ + (S ρ ) must be shorter than the path from 1 + to + via and by at least the data setup time for the FSM latches. This is easily satisfied in practice because 1 + to S ρ + (S ρ ) delay is a complex gate delay (transitions on S ρ are directly triggered by 1 +),whichismuchlessthanthe delay from 1 + to to to +. The asynchronous finite state machine (see Fig. ) is specified in burst-mode [9] and synthesized using the 3D-gC synthesis tool [10]. This burst-mode state machine has two inputs ( ρ, 1 ) and two outputs ( 1, S ρ ). In state 0, when ρ rises, the machine raises 1 and goes to state 1. In state 1, the machines waits for 1 to rise; when it does, the machine lowers 1 and raises S ρ concurrently and goes to state 2. When 1 falls in state 2, the machine transitions to state 3. The machine transi- 2 We use a+ and a to denote rising and falling transitions of a respectively.

3 IEEE TANSACTIONS ON VLSI SYSTEMS ρ S ρ + ρ 1 + ρ Sρ Sρ weak reset gme (a) Arbiter with fast forward request S ρ weak S ρ 1 gme T1 2 Fig.. PCC asynchronous finite state machine specification and implementation. tions through states and 5 and back to 0, as ρ triggers a sequence of signal transitions ending with 1. B. Arbitration In order to simplify the interface to the ME, our design uses an arbiter to select just one external signal to pass through at a time. An arbiter [8], [11], [12] is a circuit that propagates one request at a time (as does the ME) but also acknowledges the requesters with grant signals as well. The arbiter used in our design [13] shown in Fig. 5 is different from conventional ones. When 1 or 2 arrives at the arbiter, the arbiter forwards a request to the ME, without determining which one has arrived first. As shown in Fig. 3, 1 and 2 (enabled by ρ and A σ respectively) may arrive at the same time, causing gme (gated ME inside the arbiter) to go into a metastable state. When + arrives at the ME, rclk + may also arrive at about the same time as well. If the ME determines that + has arrived first (after resolving its internal metastability), it raises and blocks from rising (thus stretching it) until falls. Note that the metastability resolution in the gme would be wellunderway, ifnotalreadyfinished, bythetime is asserted. Thus the metastability resolution in the gme and the ME are done concurrently. When is asserted, 1 rises, which in turn resets and 1. enables, which in turn enables to rise. The pending request 2 enables to be asserted again after 1 resets. However, it is blocked until rclk falls. Note that our arbiter design is not speed-independent, i.e., it requires a timing constraint on its environment for correct functionality. Consider an example depicted in Fig. 5(c), in which 2 + starts a chain of events, starting with +. After is asserted, 2 is raised. This event enables both and 2.If 2 side is slow to respond, i.e., 2 does not fall till is negated, then this residual 2 may be viewed as a new request to the gme. In our design, the ASFM (see Fig. ) lowers 2 immediately, once 2 rises, in order to prevent this problem. More precisely, the following timing constraint is always satisfied for both i =1and 2: t + i + t + + t + i >t +. (1) i i 1 T1 2 T (b) ated ME T2 (c) Arbiter timing Fig. 5. (a) Fast-forward arbiter; (b) ated ME (ME with enabled outputs); (c) Fast-forward arbiter timing. III. SYSTEM CONFIUATIONS AND LIMITATIONS Using the pausible clocking scheme, it is conceivable that one can construct a heterogeneous multi-processor system with point-to-point links between every pair of nodes. Each link is a bidirectional as shown in Fig. 1. However, as fanouts from and fanins to each node increase, the arbiter block becomes larger making the system design impractical. However, we assert that it is possible to construct a ring configuration as shown in Fig. 6(a) similar to the systems proposed in Scalable Coherent Interface (SCI) specification [1]. In this structure, messages are always transmitted to one side and received from the other side, so that only one level of arbitration is required in the PCC. A major advantage of our ring configuration over other proposed systems, such as SCI system, is that it is a truly heterogeneous system with each node operating at its own speed. Another possible system configuration would be a conventional bus based architecture depicted in Fig. 6(b). In this config-

4 IEEE TANSACTIONS ON VLSI SYSTEMS ρ Aρ FSM (1) 1 ME ing Oscillator Async (a) Heterogeneous ring configuration (3) S ρ Aσ σ FSM SA σ 2 2 (2) Arbiter rclk up Fig. 7. Performance constraints. Async Bus Arbiter Async (b) Bus configuration Asynchronous bus Fig. 6. (a) A heterogeneous message-passing multiprocessor using PCC s; (b) A conventional bus based system. uration, synchronous modules are connected to an asynchronous bus, e.g., Futurebus+, via s and asynchronous modules are connected directly to the bus. Since the synchronous modules in this configuration interface to the bus only via two s, the PCC s in the synchronous modules require just one level of arbitration. Systems-on-a-chip should be designed with as many reusable components as possible. Standard modules, such as CPU cores, should be reused with little or no modification, because these modules are highly optimized for performance and sensitive to timing variation. For the systems proposed in this paper, ideally, the pausible clocking control circuit should simply replace a portion of the system clock generation unit. However, for state-of-the-art microprocessors, the system clock is produced by a phase locked loop (PLL). We cannot adjust the phase of the output of a PLL instantaneously in an analog fashion, as required in our pausible clocking control. Thus a ring oscillator should be used in place of a PLL. We then lose control of the nominal frequency. Tuning the ring oscillator frequency would require more control pins and hence is more expensive. Furthermore, the ring oscillator based clocks may be more susceptible to jitter problems. However, the ring oscillator does have an advantage that its frequency drift closely tracks logic components on the chip, e.g., if logic components slow down due to an increase in operating temperature, then so does the ring oscillator. IV. PEFOMANCE CONSIDEATIONS In this section, we examine performance limitations of our PCC scheme. The maximum clock frequency of the PCC is constrained by the FSM input setup and hold time requirements as follows (see Fig. 7). Assume that is high (enabled by ρ )whenrclk falls. This activates + and 1 + in sequence, which in turn enables S ρ to toggle. We need to in- sure that S ρ satisfies setup and hold time requirements with respect to +. Assuming that the delays through paths (1) and (2) are t 1 = t rclk + t t 1 + and 1 S+ ρ t 2 = t rclk + + respectively, the following inequality must hold true: T 2 + t h <t 1 t 2 < T 2 t su (2) where t su and t h are the required setup and hold times of S ρ with respect to + and T is the nominal period of the ring oscillator clock with a 50% duty cycle. Therefore, the minimum clock period achievable is: T min =2max(t 1 t 2 + t su,t 2 t 1 + t h ). (3) In general, t 2 >> t 1 for a large clock tree; hence T min =2(t 2 t 1 + t h ). In other words, the minimum clock period is about twice the clock tree delay. Assuming that s + h represents the metastability window of the ME (analogous to the setup and hold time window of a flip-flop) and represents the delay from the time is granted an access to the ME to the time it relinquishes it, we can define window of vulnerability to be + h (if > s ). In other words, may pause (due to A + σ enabled by + σ )ifthe following condition is true: nt < t + + σ + t + σ A + σ + t A + σ + 2 +t rclk t < nt + h for n 1, () where =t t t 1 +. A similar inequality exists for ρ (the same as Eq., but with σ replaced by A ρ, 1 A σ by ρ,and 2 by 1 ). Clearly, the frequency and duration of clock stretching depends on the size of this window ( + h ) the smaller this window, the higher the performance of the system. V. EXPEIMENTAL ESULTS We constructed a heterogeneous ring on a chip as depicted in Fig. 8. The chip consists of two synchronous modules nicknamed CPU (CPU) and Co-processor (Coproc), an asynchronous module (Async), and two asynchronous s (1 and 2). 3 Every module has two neighbors. Messages are always transmitted to one side and received from the 3 Our implementation is a fully-decoupled described in [15].

5 IEEE TANSACTIONS ON VLSI SYSTEMS 5 in A in D in ρ A CPU PCC S ρ SA σ 1 2 Coproc FSM A 1 A 2 FF (a) A σ 2 A A3 1 Async A1 1 Fig. 8. Test chip configuration. A out out D out 2 A 1 =00 2 A 1 =01 2 Coproc A2 2 1 =0 1 A 2 =01 A 2 =0 1 A 2 =11 A 2 =1 1 A 2 =10 A 2 =0 (b) 1 =0 2 A 1 =11 2 A 1 =01 1 A 2 =11 1 =1 Fig. 9. (a) Co-processor implementation; (b) Co-processor FSM specification. Coproc CLK(162MHz) 2 A2 Coproc Dout[1] Coproc Dout[0] 3 A3 Async Dout[1] Async Dout[0] A 120ns 10ns 160ns 180ns 200ns 220ns 20ns 260ns Fig. 10. Simulation trace (f Coproc = 162 MHz, f CPU = 327 MHz). Fig. 11. Die photo. other side. All communications are done using a four-phase return-to-zero protocol. CPU and Coproc are equipped with a PCC whose ring oscillator frequency can be set to one of different frequencies ranging from 162MHz to 327MHz. CPU generates data and sends them to Coproc via 1. It has an FSM that generates 2-bit data in ray code sequence. Coproc generates -bit output by transforming the 2-bit input from CPU and passes them to Async. The Coproc FSM, depicted in Fig. 9(b), handles the flow control of data in a semidecoupled manner: i.e., it asserts an acknowledgment and a request to its input and output s, upon receiving an input request and latching input data, but does not complete handshaking on the input side until its output acknowledges the receipt of data. Async is composed of an and a data processing unit. This controls the flow of data in a similar fashion as the FSM of Coproc but asynchronously. The data processing unit extracts two copies of the original 2-bit data and passes them back to CPU via 2. 1 and 2 (both with depth of two) are included for a generic reason of smoothening the bursty data transfer between two modules operating at different clock rates and decoupling the operations of the modules. We performed extensive SPICE simulations of the chip after backannotating the layout parasitics into the schematic, using Mentor raphics Accusim. The timing trace in Fig. 10 shows a simulation result including handshake and data signals. CPU and Coproc are programmed to run at 327MHz and 162MHz respectively in this case. The frequencies of 2 and A 2 (handshake signals between Coproc and Async) are one fourth that of Coproc clock, because it takes one clock cycle to sample A 2 and another clock cycle to toggle 2. 2 and A 2 have about 50% duty cycle because they are synchronized to Coproc clock; however, 3 and A 3 have much narrower pulses because 2 (output side of Async) is never full and thus responds quickly. The chip (die photo shown in Fig. 11) was fabricated in 0.5µm HP CMOS1TB triple-metal process through MOSIS. Our test results show that the fabricated chips function correctly, but test results are found to be much faster than the simulated results (with a scale factor of 1.). We believe that this is due, in part, to the conservative estimation of parasitics by the Mentor tool. Test results clearly indicate that the clocks do become stretched. Although we were not able to observe high frequency clock signals (due to the pad design limitations), we could clearly observe several instances of clock pausing when the CPU module was programmed to operate at 162MHz (simulation speed). An oscilloscope trace is shown in Fig. 12. Signals labeled CLK and EQ are and ρ of the CPU module observed at the pins of the fabricated chip. CLK is paused if EQ toggles within the window of vulnerability of CLK+,whichwas calculated to be between 3.3ns and 2.3ns from the rising edge of CLK (with a scale factor of 1. taken into account). However, the time difference between CLK and EQ observed on the oscilloscope as shown in Fig. 12 is outside the calculated window. In fact, the instance captured in Fig. 12 shows that CLK pauses even when EQ rises 3.85ns before CLK rises. The source of this discrepancy is unclear. We conjecture that not all delays scale the same. Test measurements were made with HP 5720A Oscilloscope with two Sas plug-in cards. Table I compares the nominal frequencies (without clock We used pads readily available from MOSIS which could not support the high speed clock signals.

6 IEEE TANSACTIONS ON VLSI SYSTEMS 6 Test Simulation No. of Fraction of Operating Nominal Operating Handshake Input Clock Cycle Stretched (without pause) (with pause) Transitions Per HS Transition CPU Coproc CPU Coproc CPU Coproc Per 1000ns CPU Coproc (56MHz) (56MHz) 327MHz 327MHz 290MHz 290MHz (56MHz) 236MHz 327MHz 162MHz 290MHz 150MHz MHz (56MHz) 162MHz 327MHz 150MHz 290MHz MHz 236MHz 162MHz 162MHz 150MHz 150MHz TABLE I COMPAISON OF NOMINAL AND OPEATIN FEQUENCIES (, 27 C, N68MA UN). Fig. 12. Oscilloscope trace illustrating clock stretching ( frequency = 236 MHz tested at and 22 C). pausing) and the operating frequencies obtained from simulation and test. The number shown in parentheses (56MHz) was obtained by extrapolation, not from the actual test. This is due to the limitation imposed by IO pads. However, all the data and handshake signals, measurable because of their low frequencies, have been tested to be correct, even when the clock is set to the maximum frequency. Thus we concluded that the clock internal to the chip does not result in any synchronization failures. We extrapolated the top clock speed from the simulation and lowerspeed clock signals measured from the test. The impact of clock pausing on performance can be deduced from Table I. For example, when the nominal frequencies of CPU and Coproc are programmed to be 327MHz and 162MHz, the actual operating frequencies are 290MHz and 150MHz respectively (according to simulation). Therefore, CPU clock is stretched 37 (nominal) clock cycles per 1000ns and Coproc clock 12 cycles per 1000ns. However, clock stretching potentially occurs only when handshake inputs toggle. We measured 152 handshake input transitions per 1000ns for this case. Thus, in this case, the fraction of clock cycle stretched per handshake input transition for CPU is =0.2. As the frequency is lowered, the number of clock cycles stretched becomes smaller. This is because the window of vulnerability as a fraction of a clock period becomes smaller. Clearly, we are trading off system performance for communication reliability and latency. In fact, the degradation in CPU throughput is about 12% in this configuration. However, we assert that we can gain back this loss and possibly more by operating the system at a higher speed, i.e., by tuning the ring oscillator to a higher frequency. Systems with a ring oscillator clock do not require as much margin for the worst-case design as the ones with a fixed reference clock, because the operating condition changes affect both the logic and ring oscillators similarly. VI. CONCLUSION We presented a new communication strategy, which is based on the pausible clocking scheme, for multiple synchronous modules operating independently. We described the design and implementation of the pausible clock unit in detail. In order to prove its feasibility, we implemented a 0.5µm CMOS chip, mimicking a heterogeneous ring. Our test results show that the chip functions correctly as simulated, up to the top clock speed of 56MHz. We suggested possible system configurations (heterogeneous rings and bus based systems) and practical limitations. We also determined analytical performance bounds and analyzed the effects of clock stretching on overall system performance. Acknowledgment The authors would like to thank yan Donohue, eza Jalilizeinali, and LiJen Ko for their chip design effort. EFEENCES [1] M. J. Stucki and J.. Cox Jr., hronization strategies, in Proceedings of the First Caltech Conference on Very Large Scale Integration, C.L. Seitz, Ed., 1979, pp [2] C. L. Seitz, System timing, in Introduction to VLSI Systems, C. A. Mead and L. A. Conway, Eds., chapter 7. Addison-Wesley, [3] D.M.Chapiro, lobally-asynchronous Locally-hronous Systems, Ph.D. thesis, Stanford University, Oct [] F. U. osenberger, C. E. Molnar, T. J. Chaney, and T. Fang, Q-modules: Internally clocked delay-insensitive modules, IEEE Transactions on Computers, vol. C-37, no. 9, pp , Sept [5] J. N. Seizovic, Pipeline synchronization, in Proc. International Symposium on Advanced esearch in Asynchronous Circuits and Systems, Nov. 199, pp [6] L. F.. Sarmenta,. A. Pratt, and S. A. Ward, ational clocking, in Proc. International Conf. Computer Design (ICCD), Oct. 1995, pp [7] M.. reenstreet, Implementing a STAI chip, in Proc. International Conf. Computer Design (ICCD), 1995, pp [8] C. L. Seitz, Ideas about arbiters, Lambda, vol. 1, no. 1, First Quarter, pp. 10 1, [9] Steven M. Nowick, Automatic Synthesis of Burst-Mode Asynchronous Controllers, Ph.D. thesis, Stanford University, Department of Computer Science, [10] K. Y. Yun and D. L. Dill, Automatic synthesis of extended burst-mode circuits: part II (automatic synthesis), IEEE Transactions on Computer- Aided Design, vol. 18, no. 2, pp , Feb [11] A. J. Martin, On Seitz s arbiter, Tech. ep. 5212:T:86, Caltech Computer Science, [12] T. Sakurai, Optimization of CMOS arbiter and synchronizer circuits with submicron MOSFETs, IEEE Journal of Solid-State Circuits, vol. 23, no., pp , Aug [13] M. B. Josephs and J. T. Yantchev, CMOS design of the tree arbiter element, IEEE Transactions on VLSI Systems, vol., no., pp , Dec [1] IEEE Standard , Scalable coherent interface (SCI), [15] S. B. Furber and P. Day, Four-phase micropipeline latch control circuits, IEEE Transactions on VLSI Systems, vol., no. 2, pp , June 1996.