Clock Distribution in RNS-based VLSI Systems



Similar documents
TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

S. Venkatesh, Mrs. T. Gowri, Department of ECE, GIT, GITAM University, Vishakhapatnam, India

Alpha CPU and Clock Design Evolution

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

International Journal of Electronics and Computer Science Engineering 1482

Clock Distribution Networks in Synchronous Digital Integrated Circuits

路 論 Chapter 15 System-Level Physical Design

Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

Sequential 4-bit Adder Design Report

Power Reduction Techniques in the SoC Clock Network. Clock Power

Understanding CIC Compensation Filters

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 16 Timing and Clock Issues

Latch Timing Parameters. Flip-flop Timing Parameters. Typical Clock System. Clocking Overhead

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

Model-Based Synthesis of High- Speed Serial-Link Transmitter Designs

Timing Methodologies (cont d) Registers. Typical timing specifications. Synchronous System Model. Short Paths. System Clock Frequency

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

CHARGE pumps are the circuits that used to generate dc

High Speed and Efficient 4-Tap FIR Filter Design Using Modified ETA and Multipliers

Two-Phase Clocking Scheme for Low-Power and High- Speed VLSI

Design and analysis of flip flops for low power clocking system

CMOS Power Consumption and C pd Calculation

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

74LS193 Synchronous 4-Bit Binary Counter with Dual Clock

ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7

LOW POWER DESIGN OF DIGITAL SYSTEMS USING ENERGY RECOVERY CLOCKING AND CLOCK GATING

Efficient Interconnect Design with Novel Repeater Insertion for Low Power Applications

Lecture 7: Clocking of VLSI Systems

A 1.62/2.7/5.4 Gbps Clock and Data Recovery Circuit for DisplayPort 1.2 with a single VCO

Low latency synchronization through speculation

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology

DM74LS191 Synchronous 4-Bit Up/Down Counter with Mode Control

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter

DM74LS169A Synchronous 4-Bit Up/Down Binary Counter

HCC/HCF4032B HCC/HCF4038B

A 10,000 Frames/s 0.18 µm CMOS Digital Pixel Sensor with Pixel-Level Memory

Counters and Decoders

DM74LS193 Synchronous 4-Bit Binary Counter with Dual Clock

Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow

Digital to Analog Converter. Raghu Tumati

CMOS, the Ideal Logic Family

An Efficient RNS to Binary Converter Using the Moduli Set {2n + 1, 2n, 2n 1}

Lecture 10: Sequential Circuits

Lecture 11: Sequential Circuit Design

DATA SHEET. HEF40193B MSI 4-bit up/down binary counter. For a complete data sheet, please also download: INTEGRATED CIRCUITS

PLAS: Analog memory ASIC Conceptual design & development status

On-chip clock error characterization for clock distribution system

8 Gbps CMOS interface for parallel fiber-optic interconnects

CHAPTER 11: Flip Flops

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

True Single Phase Clocking Flip-Flop Design using Multi Threshold CMOS Technique

System on Chip Design. Michael Nydegger

A Survey on Sequential Elements for Low Power Clocking System

A true low voltage class-ab current mirror

ECE410 Design Project Spring 2008 Design and Characterization of a CMOS 8-bit Microprocessor Data Path

An On-chip Security Monitoring Solution For System Clock For Low Cost Devices

TIMING-DRIVEN PHYSICAL DESIGN FOR DIGITAL SYNCHRONOUS VLSI CIRCUITS USING RESONANT CLOCKING

Clock- and data-recovery IC with demultiplexer for a 2.5 Gb/s ATM physical layer controller

Experiment # 9. Clock generator circuits & Counters. Eng. Waleed Y. Mousa

NOTE: The Flatpak version has the same pinouts (Connection Diagram) as the Dual In-Line Package.


54LS169 DM54LS169A DM74LS169A Synchronous 4-Bit Up Down Binary Counter

Chapter 9 Latches, Flip-Flops, and Timers

NAME AND SURNAME. TIME: 1 hour 30 minutes 1/6

Automist - A Tool for Automated Instruction Set Characterization of Embedded Processors

Laboratory 4: Feedback and Compensation

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

DM Segment Decoder/Driver/Latch with Constant Current Source Outputs

A 1-GSPS CMOS Flash A/D Converter for System-on-Chip Applications

ICS SPREAD SPECTRUM CLOCK SYNTHESIZER. Description. Features. Block Diagram DATASHEET

DATA SHEET. HEF40374B MSI Octal D-type flip-flop with 3-state outputs. For a complete data sheet, please also download: INTEGRATED CIRCUITS

A Novel Low Power, High Speed 14 Transistor CMOS Full Adder Cell with 50% Improvement in Threshold Loss Problem

Multiple clock domains

DM54161 DM74161 DM74163 Synchronous 4-Bit Counters

Lecture-3 MEMORY: Development of Memory:

74F168*, 74F169 4-bit up/down binary synchronous counter

ISSCC 2003 / SESSION 4 / CLOCK RECOVERY AND BACKPLANE TRANSCEIVERS / PAPER 4.7

Sequential Circuits. Combinational Circuits Outputs depend on the current inputs

Sequential Circuit Design

Flip-Flops, Registers, Counters, and a Simple Processor

Sequential Logic: Clocks, Registers, etc.

74AC191 Up/Down Counter with Preset and Ripple Clock

Low Power AMD Athlon 64 and AMD Opteron Processors

1 TO 4 CLOCK BUFFER ICS551. Description. Features. Block Diagram DATASHEET

Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system

TIMING ISSUES IN DIGITAL CIRCUITS

TS555. Low-power single CMOS timer. Description. Features. The TS555 is a single CMOS timer with very low consumption:

ZL40221 Precision 2:6 LVDS Fanout Buffer with Glitchfree Input Reference Switching and On-Chip Input Termination Data Sheet

Abstract. Cycle Domain Simulator for Phase-Locked Loops

數 位 積 體 電 路 Digital Integrated Circuits

COMBINATIONAL and SEQUENTIAL LOGIC CIRCUITS Hardware implementation and software design

54LS193 DM54LS193 DM74LS193 Synchronous 4-Bit Up Down Binary Counters with Dual Clock

SN54HC191, SN74HC191 4-BIT SYNCHRONOUS UP/DOWN BINARY COUNTERS

css Custom Silicon Solutions, Inc.

Transcription:

Clock Distribution in RNS-based VLSI Systems DANIEL GONZÁLEZ 1, ANTONIO GARCÍA 1, GRAHAM A. JULLIEN 2, JAVIER RAMÍREZ 1, LUIS PARRILLA 1 AND ANTONIO LLORIS 1 1 Dpto. Electrónica y Tecnología de Computadores Universidad de Granada 18071 Granada SPAIN 2 ATIPS Laboratory, Dept. of Electrical and Computer Engineering University of Calgary Calgary, Alberta T2N 1N4 CANADA Abstract: Clock distribution networks synchronize the flow of data in digital systems, and the features of the clock signal affect system performance and reliability. The great advances in integration levels and speed requirements in actual digital systems lead to enormous complexity in synchronization. Differences in delay of clock signals along clock paths, due to different lengths or active elements such buffers, cause loss of performance and poor reliability. However, a perfect synchronization leads to a simultaneously triggering of great amount of devices and, thus, large current demands. This paper presents a new scheme for clock distribution in VLSI systems that avoids both skew and large current demand related problems. Different skewed clock signals synchronize different independent channels in a RNS-based system. This new strategy has been simulated over a three-stage decimation CIC filter, resulting in a diminution of current spikes and current change rate with no significant increase of power consumption. Key-Words: Clock, Clock Distribution, RNS, Synchronization. 1 Introduction The last advances in integrated circuit fabrication have lead to increasing integration levels and operating speeds. These factors make extraordinarily difficult the proper synchronization of integrated systems because the length of clock distribution lines increases along with the number of devices the clock signal has to supply, thus leading to substantial delays that limit system speed. For TSPC (True Single Phase Clock) one clock line must be distributed all over the chip, as well as within each operating block. More complex clocking schemes may even require two or four non-overlapping clock signals [1], thus increasing the resources required for circuit synchronization. Moreover, for clock frequencies over 500 MHz, phase differences between the clock signal at different locations of the chip (skew) start presenting serious problems [2]. Several techniques have been developed for overriding clock skew, being the most common RC-tree analysis and H-tree distribution. The first represents the circuit as a tree, modelling every line through a resistor and a capacitor and every block as a terminal capacitance [3]. In this way, distribution line delay is evaluated and elements to compensate clock skew can be subsequently added. The second option distributes symmetrically the different clock signals forming and H, thus avoiding the appearance of differences in the delay of each path, as illustrated in Fig. 1. However, minimizing skew has negative sides, since the simultaneous triggering of many devices leads to short but large current demands. Because of this, a meticulous design of power supply lines and device sizes is required, with this large current demand resulting in area penalties. If this is not the case, parts of the chip may not receive as much energy as required. This approximation to the problems related to fully synchronous circuits and clock skew has been previously discussed [4]. This paper will present an alternative for efficiently synchronizing RNS-based circuits while keeping current demand to a minimum. The underlying idea is to generate non-overlapping clock signals, each one controlling an RNS channel.

minimizes the skew caused by variations during chip fabrication [11-12] processes. Typically, synchronous systems consist of a chain of registers separated by combinational logic that performs data processing. The maximum clock frequency is derived from: Fig. 1. H-tree distribution This takes advantage of the non-communicating channel structure that characterizes RNS architectures in order to reduce the clock synchronization requirements for high-performance digital signal processing systems. A three-stage CIC (Cascade Integrator Comb) decimation filter [5-6] was used for evaluation of the presented strategy. 2 Skew When clock propagates across different paths, it suffers different delays due to different path lengths and active elements. The difference in the value of the clock in two nodes is known as clock skew. Clock skew produces system performance losses when compared to that obtainable from individual blocks, since it is necessary to guarantee the proper function of the chip with a reduced speed clock. Skew will cause clock distribution problems if the following inequality holds [7]: D k > v (1) f app where k<0.20 (typical value) is a constant, D is the typical size of the system, v is the propagation speed for the clock signal and f app is the applied clock frequency. Existing solutions for clock skew provide two different approaches to the problem; the first one consists in equalizing the length of clock paths to processing elements with buffers and delay elements, or through H-tree, mesh or X-tree topologies [3, 8-10]. The other option eliminates or 1 f T PD + T skew max (2) where f max is the maximum frequency obtainable, T PD is the time between the arrival of the clock signal at the i-th register and stable processed data at the output of the (i+1)-th register and T skew is the time between the arrival of the clock signal at the i- th register and the arrival of the same signal at the (i+1)-th register. Clock skew can be considered as either positive or negative, although the sign criteria is not standardized. If positive, then from equation (2), the minimum system clock period is increased, while if negative, T min decreases. This appears as an advantage, and can be used to improve the system features, but an excessive positive skew results in a race-related problems may arise if data processing time is lower that the skew. 3 New synchronization Strategy The proposed strategy for synchronization of RNSbased systems introduces the generation of several signal clocks from the master clock. These clocks signals are generated slightly out-of-phase with non-overlapping edges. Each one of these clock signals synchronizes one of the RNS channels, while the master clock synchronizes global data (mainly at the global inputs and outputs of the system) and guarantees data coherency. Thus, each channel process data at different time instants, so the current demand is distributed over the whole clock cycle. This has the effect of reducing current spikes on the power supply lines approximately by a factor that is the number of generated clock signals. The phase difference between the generated clocks has to satisfy some specifications: dclk1 dclk2 CLK i CLK i CLK i+1 CLK i+1 CLK dclk_cell 1 dclk_cell 2 T skew Fig. 2. Negative (left) and positive (right) skew. T skew Fig. 3. dclk_cell chain for clock generation

the number of clock signals with overlapping active cycles has to be minimized, as well as the time two or more of these active cycles overlap, clock edges must not coincide, data coherency has to be respected at both the input and output of the system. Clear advantages are obtained when these requisites are satisfied, since current spikes are reduced and power dissipation is distributed over the master clock cycle, rather than concentrated around the master clock edges. The temporal variation in current values is reduced also. Moreover, power supply lines may be scaled and clock distribution resources reduced, thus simplifying the chip design effort. This strategy can be used in systems whose operation mode implicates several noncommunicating modules or channels, as those based on the RNS. RNS [13-14], with noncommunicating channels, perfectly suits this clocking scheme and a generated clock signal is applied to each independent channel, while the master clock signal used to generate these other clocks can also synchronize the global input and output. Moreover, the resources required for implementing this new strategy are minimum and a few transistors, basically three inverters, are required for each channel. Specifically, the master clock signal is routed through an inverter chain, thus being delayed. Meanwhile, the generated clock signals are extracted at adequate points of this chain and conditioned to be used as clock signal for a complete RNS channel. This scheme requires the inverter chain to alternate large and small input capacitances and low driving capabilities, so appropriate delays can be generated. This structure generates low-quality clock signals within the inverter chain, so additional buffers are required for obtaining adequate clock signals. Fig. 3 illustrates the hardware required to generate the proposed synchronization scheme, where dclk stands for generated delayed clocks, while Fig. 4 shows the detailed scheme for the so-called dclk_cell cell, which consists of three inverters. It can be deduced from Fig. 4 that three design parameters, L d, W b and L b, are available in order to obtain the system specifications, while L min represents the feature size of the fabrication process and W min the minimum usual width for pmos transistors. Connecting CHout pads to CHin pads, the inverter chain described above is built, while the master global clock is used as input to this chain. Fig. 4 illustrates how large capacitance inverters are alternated with minimum-size devices. Thus, the low driving capabilities of the latter allow modelling of the required delay using the L d parameter. Meanwhile, the generated clock signal dclk is regenerated by a third inverter that includes the parameters W b and L b. These allow matching of the timing specifications for a given system, also allowing the adaptation of the cell to the overall capacitance to be driven by the generated clock. However, these three design parameters L d, W b and L b are not independent, and their relation needs a careful study of the final W=1.5*W b W=W b W=1.25*W d L= L d W=W min W=W d L= L d W=L min Fig. 4. dclk_cell schematic.

R I I I C C C z -1 z-d Fig. 6. Three-stage CIC decimation filter. Fig. 5. Clock signals generated for the new strategy. system to be synchronized in order to select their optimum values. Fig. 5 shows the resulting generated clocks in a simple design example for a 300 MHz master global clock and 0.1 pf loads for every dclk signal. Note that the requirements enumerated above about non-overlapping edges and active cycles are matched, with every dclk signal being to be used as clock for a given RNS channel. 4 Application to a Design Example and Simulation Results A three-stage decimation RNS-based CIC filter [6] was considered for the evaluation of the proposed synchronization technique. CIC filters [5] have been shown to be a useful alternative for FIR implementation for a variety of communication systems, because no multipliers are required. Although the frequency response out of the pass band may be insufficient for certain applications, low-order conventional filters may correct this effect. An S stage CIC system is defined by the transfer function: S RD RD z k = = z z 1 1 1 1 k = 0 H ( z) (3) Fig. 6 shows the structure of a three-stage CIC filter, illustrating the meaning of R and D parameters. The selected design example was a fully pipelined three-stage CIC decimation filter with R=32, D=2 and 26-bit dynamic range [6]. This RNS-enabled system requires four channels with moduli {256, 63, 61, 59}. This 26-bit dynamic range filter was designed at the transistor level and simulated using PSpice for both a single global clock and the proposed technique. For these S simulations, a public domain MOSIS CMOS 0.6 µm process [15] was used. This is a three-metal, one-poly, 3.3V CMOS process that is available to MOSIS costumers through Agilent Technologies. Since the system is composed just of adders and registers, the well-known two-stage modulo adder [6] was used, while registers were implemented using negative edge triggered D flip-flops (netdff) based on TSPC logic [1], that simplifies the implementation of the proposed alternative. Because of the great locality of the connections for the study case, load driving is kept to a minimum and device sizes can be fixed to the process minimum for most of the transistor involved. Only transistors used for clock management will require larger sizes since they have to drive large loads. The systems under simulation include around 15.000 transistors. In order to get illustrative results, the RNS-enabled CIC filter was simulated under three different clocking strategies: first of all, a single global clock was used to synchronize the whole circuit, through a train pulse voltage source. Because this is a perfect SPICE source, the second clocking alternative considered a buffer in the clock path while keeping a single clock for global synchronization. Finally, the proposed design example was synchronized using four dclk_cell cells and four generated dclk signals, each one synchronizing an RNS channel. The design parameters for the dclk_cell cells, after careful selection, were selected as L d =1 µm, W d =2 µm and W b =9 µm. These three alternatives were simulated for two different clock frequencies, 125 MHz and 300 MHz. A comparison between the proposed strategy and the buffered clock simulation will illustrate the low power penalty introduced by this new clocking strategy, which is an added feature.

80mA 50mA proposed strategy. Finally, if power dissipation is considered, the comparison between the buffered clock and the proposed strategy shows that the latter does not introduce a significant increase in power. 0A -50mA -100mA 500ns 520ns 540ns 560ns 580ns 600ns I(V5) Time 80mA 50mA 0A -50mA -100mA 500ns 520ns 540ns 560ns 580ns 600ns I(V2) Time Fig. 7. Current from power supply line for a single clock (above) and the proposed alternative (below). Fig. 7 shows the current on power supply lines for the CIC full system working with a 300 MHz clock for both a single global clock and the proposed synchronization strategy, while similar results were obtained for the 125 MHz case. The considerable decrease in the magnitude of the current spikes is clearly evident. In this way, current supply to the chip is distributed over time when the new strategy is considered, while for a global clock current spikes are around four times larger. This indicates that the expected benefits derived from the proposed synchronization scheme are confirmed through simulation. Table 1 summarizes the results obtained for the different simulations and both clock frequencies. Note that the maximum current spike is clearly reduced with this new clocking strategy, as well as the maximum value of the current change rate (di/dt). This happens for the ideal clock as well as for the buffered one, thus indicating the validity of the 5 Conclusions This paper has presented a new alternative for synchronizing RNS-based systems and it has been shown how this method can reduce the current demand and its change rate. These advantages can be implemented with minimum resources, while power consumption is kept within reasonable values. The proposed strategy was tested using an RNS-enabled three-stage CIC decimation filter consisting of about 15.000 transistors. The application of this new synchronization scheme leads to reduced skewrelated problems, as well as reduces chip area through the reduction of the power supply line size, caused by the decrease in current and current change rate requirements. Acknowledgments Daniel González, Antonio García, Javier Ramírez, Luis Parrilla and Antonio Lloris were supported by the Comisión Interministerial de Ciencia y Tecnología (CICYT, Spain) under project PB98-1354. Graham A. Jullien acknowledges financial support from Micronet R&D, icore and NSERC. References: [1] J. Yuan and C. Svensson, High-speed CMOS Circuit Technique, IEEE Journal of Solid State Circuits, Vol 24, No. 1, pp. 62-70, Jan. 1989. [2] D. W. Bailey and B. J. Benchsneider, Clocking Design and Analysis for a 600-MHz Alpha Microprocessor, IEEE Journal of Solid State Circuits, Vol 33, pp.1627-1633, Dec 1998. [3] P. Ramanathan, A. J. Dupont and K. G. Shin, Clock Distribution in General VLSI Circuits. IEEE Transactions on Circuits and Systems I: Single clock Single buffered clock Proposed strategy 125 MHz 300 MHz 125 MHz 300 MHz 125 MHz 300 MHz Max pike 78.0 ma 96.5 ma 64.4 ma 93.8 ma 23.4 ma 32.8 ma Max di/dt 150 A/ns 254 A/ns 16.3 A/ns 36.6 A/ns 2.2 A/ns 11.9 A/ns Power 17.0 mw 33.7 mw 19.5 mw 35.8 mw 23.5 mw 57.3 mw Table 1. Summary of simulation results for an RNS-enabled three-stage CIC filter.

Fundamental Theory and Applications, Vol 41, No. 5, pp. 395-404, May 1994. [4] J. Yoo, G. Gopalakrishnan and K. F. Smith, Timing Constraints for High-speed Counterflow-clocked Pipelining, IEEE Transactions on VLSI Systems, Vol. 7, No. 2, pp. 167-173, Jun. 1999. [5] E. B. Hogenauer, An Economical Class of Digital Filters for Decimation and Interpolation, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, No. 2, pp. 155-162, Feb. 1981. [6] U. Meyer-Bäse, A. Garcia and F. J. Taylor, Implementation of a Communications Channelizer using FPGAs and RNS Arithmetic, Journal of VLSI Signal Processing, Vol. 28, No. 1/2, pp. 115-128, May 2001. [7] W. D. Grover, A New Method for Clock Distribution. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol 41, No. 2, pp. 149-160, Feb. 1994. [8] M.A.B. Jackson, A. Srinivasan and E.S. Kuh, Clock Routing for High Performance IC s 27 th ACM/IEEE Design Automation Conference, 1990. [9] D. F. Wann and N. A. Franklin, Asynchronous and Clocked Control Structures for VLSI Based Interconnect Networks, IEEE Transactions on Computers, Vol 32, No.5, pp. 284-293, May 1983. [10] E. G. Friedman, Clock Distribution Networks in Synchronous Digital Integrated Circuits. Proceedings of the IEEE, Vol. 89, No.5, May 2001. [11] E. G. Friedman and S. Powell, Design and Analysis of an Hierarchical Clock Distribution System for Synchronous Cell/macrocell VLSI, IEEE Journal of Solid State Circuits, Vol. 21, No. 2, pp., 240-246, Apr. 1986. [12] M. Shoji, Elimination of Process-dependent Clock skew in CMOS VLSI. IEEE Journal of Solid State Circuits, Vol. 21, pp. 869-880, Oct. 1986. [13] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology, McGraw-Hill, NY, 1967. [14] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, IEEE Press, 1986. [15] MOSIS Process Information, Hewlett Packard AMOS14TB, http://www.mosis.org/technical/ processes/proc-hp-amos14tb.html