Lecture 2 High-Speed I/O Mark Horowitz Computer Systems Laboratory Stanford University horowitz@stanford.edu Copyright 2007 by Mark Horowitz, with material from Stefanos Sidiropoulos, and Vladimir Stojanovic 1
Readings Readings Techniques for High-speed Implementation of Nonlinear Cancellation, Sanjay Kasturia and Jack H. Winters Overview: Your project will be the design of a circuit that processes the input data from a high-speed I/O. This processing is generally done in a mixed signal manner today, but your job will be to build a digital implementation of the algorithm. This lecture will try to give you some background about why I/O rates are important, and what issues need to be resolved to achieve high performance. The next lecture will discuss the operation of the circuit you need to build. 2
Computers Today CPU DVI, HDMI AGP, PCI-E FSB, HT DDR, RDRAM FBDIMM Display >1GB/s Graphics Controller >4GB/s System Controller >4GB/s Memory PCI-X PCI-E Storage Network I/O >0.1GB/s I/O Controller PCI*, *ATA, USB.. 3
Speed of Light: The Difference Between I/O and On-Chip Wires First question: Why is I/O different from on-chip wires? Both send signals to each other Gates send data to each other all the time Don t generally worry about signals, or delay Model the connection between gates as a capacitor Sometimes a capacitor/resistor network Answer: On-chip, ignore the speed of light, assume c infinite For external wires can t make that assumption Wire connecting the pins is not an equipotential References are different 4
Finite Speed of Light Ramifications Signals must have delay in reaching destination Td = L/ν, bits arrive at a different time than when sent Thus must determine right time to sample them Wires store energy Current is set by the geometry of wire (what else?) Signal can t see termination resistor (causality) V/I for the line is called the impedance, Z < 300 Ω When signal is traveling on the wire Power goes into the wire before it hits load Since energy is conserved, wire must be storing energy Signal is ALWAYS a pair of currents 5
Link Issues Signaling: getting the bit to the receiver R TERM R TERM Tx Channel Rx Timing: Determining which bit is which 1 0 0 1 0 1 0 t bit /2 6
Transmission Lines Wire where you notice c is finite Current flows in one terminal And flows out the other Figure from John Poulton Energy is stored in E and B fields But can model with L, C 7
Problems : Material Loss Loss in GETEK : 1m, 8mil μstrip trace H(s) (transfer function) Frequency PCB Loss : skin & dielectric loss Skin Loss f Dielectric loss f : a bigger issue at higher f 8
Dealing With Current Return/References Wire Utilization: Single Ended shared signal return path Differential explicit signal return path - + Pseudo Differential ref - + 9
Transmission Lines Z2 Z1 ------------------- Z1+ Z2 2Z2 ------------------- Z1+ Z2 Z1 Z2 Two constraints govern behavior at any junction: Voltage are equal They are electrically connected Power is conserved Energy flow into junction is equal to transmitted and reflected 10
Z2 High-Speed Wires Are Point to Point Can t split a wire to go to two location You will get a reflection from the junction Z1 will see impedance discontinuity Z1 Z2 11
At High Speeds, Vias are Stubs Top layer signaling results in large via stub Signal energy splits at via If via is short can be modeled as a cap load Causes a reflection in signal Higher the frequency, the more sensitive you are to stubs 12
Backplane Environment Line card trace On-chip parasitic Package (termination resistance and device loading capacitance) Package via Back plane trace Back plane connector Line card via Backplane via Line attenuation Reflections from stubs (vias) 13
Backplane Channel Loss is variable Same backplane Different lengths Different stubs Top vs. Bot Attenuation is large >30dB @ 3GHz But is that bad? Attenuation [db] 0-10 -20-30 -40-50 -60 9" FR4, via stub 9" FR4 26" FR4, via stub 26" FR4 0 2 4 6 8 10 frequency [GHz] 14
Inter-Symbol Interference (ISI) Channel is low pass Our nice short pulse gets spread out pulse response 1 0.8 0.6 0.4 0.2 Tsymbol=160ps Dispersion short latency (skin-effect, dielectric loss) Reflections long latency (impedance mismatches connectors, via stubs, device parasitics, package) 0 0 1 2 3 ns 15
ISI 1 0.8 Error! Amplitude 0.6 0.4 0.2 0 0 2 4 6 8 10 12 14 16 18 Symbol time Middle sample is corrupted by 0.2 trailing ISI (from the previous symbol), 0.1 leading ISI (from the next symbol) resulting in 0.3 total ISI As a result middle symbol is detected in error 16
Equalization For Loss : Goal is to Flatten Response + = Channel is band-limited Equalization : boost high-frequencies; or attenuate low freq 17
Equalization Mechanisms 1 0.8 No equalization 0.6 0.4 Tx equalization Amplitude 0.6 0.4 Amplitude 0.2 0 0.2-0.2 0 0 2 4 6 8 10 12 14 16 18 Symbol time -0.4 0 2 4 6 8 10 12 14 16 18 Symbol time Tx equalization Pre-filter the pulse with the inverse of the channel Filters the low freq. to match attenuation of high freq. Rx feedback equalization Subtract the error from the signal 18
Removing ISI Linear transmit equalizer Tx Data Anticausal taps Sampled Data Deadband Feedback taps Channel Causal taps 50Ω d 50Ω outp outn d Tap Sel Logic Decision-feedback equalizer I eq0 Transmit and Receive Equalization Changes signal to correct for ISI Initial work was at transmitter J. Zerbe et al, "Design, Equalization and Clock Recovery for a 2.5-10Gb/s 2-PAM/4-PAM Backplane Transceiver Cell," IEEE Journal Solid-State Circuits, Dec. 2003. 19
Transmit Equalization Headroom Constraint Tx Data Anticausal taps Peak power constraint Channel Attenuation [db] 0-5 -10-15 unequalized equalized Causal taps -20 frequency [GHz] -25 0 0.5 1 1.5 2 2.5 Amplitude of equalized signal depends on the channel Transmit DAC has limited voltage headroom Unknown target signal levels Harder to make adaptive equalization work Need to tune the equalizer and receive comparator levels If you have multi-level signals 20
Removing Interference at Receiver Could also build a linear filter Could have gain in the filter But either it would need to be analog and have gain Or need high-speed A/D And real multiplication Sum (ai*xi) Increases channel noise too 21
High Frequency Channel Noise: Crosstalk Many sources On-chip Package PCB traces Inside connector Differential signaling can help Minimize xtalk generation & make effects common-mode Both NEXT & FEXT NEXT very destructive if RX and TX pairs are adjacent Full swing-tx coupling into attenuated RX signal Effect on SNR is multiplied by signal loss Simple solution : group RX/TX pairs in connector NEXT typically 3-6%, FEXT typically 1-3% 22
Subtract Out Residual Interference Called Decision feedback equalization (DFE) Subtracts error from input No attenuation 1 0.8 Feedback equalization Problem with DFE Need to know interfering bits ISI must be causal Problem - latency in the decision circuit Receive latency + DAC settling < bit time Can increase allowable time by loop unrolling Receive next bit before the previous is resolved Amplitude 0.6 0.4 0.2 0 0 2 4 6 8 10 12 14 16 18 Symbol time 23
Removing ISI Linear transmit equalizer Tx Data Anticausal taps Sampled Data Deadband Feedback taps Channel Causal taps 50Ω d 50Ω outp outn d Tap Sel Logic Decision-feedback equalizer I eq0 Transmit and Receive Equalization Changes signal to correct for ISI Initial work was at transmitter J. Zerbe et al, "Design, Equalization and Clock Recovery for a 2.5-10Gb/s 2-PAM/4-PAM Backplane Transceiver Cell," IEEE Journal Solid-State Circuits, Dec. 2003. 24
One Bit Loop Unrolling (for 2 level signal) 2PAM signal constellation 1 1 + αd 1 +α K.K. Parhi, "High-Speed architectures for algorithms with quantizer loops," IEEE International Symposium on Circuits and Systems, May 1990 1 +α +1 1 α +α +α 1 α + α 1 = 1 d n d n 0 α α x n dclk DQ d n 1 1 1 +α 1 α 1 +α 1 α α dclk 1 = 0 d n d n Instead of subtracting the error Move the slicer level to include the interference Slice for each possible level, since previous value unknown 25
More Bits/Hz Multi-level signaling (aka PAM) Convert extra voltage margin to more bits Works well when the noise is small Need even more signal processing 26
Internal Speed Limitation Links need good quality clocks with low jitter That means you want them to settle to both Vdd, and Gnd If you make the clock to fast, it will not rail And that means it will be prone to jitter So one limitation for links is internal clock rate For power efficiency want FO on clock to be around 4 Need pulse width 3-4 times the slowest gate Gives around 8 FO4 clock For higher speed bit rates Need to generate multiple bits/clock Use non-static CMOS clock circuits (CML & inductors) 27
Simple Demultiplexing Receiver Input Data_E in ref pre latch Data_O clk clk 2-1 demux at the input Preconditioning stage: filter/integrate, can be clocked to avoid ISI Reject CM Sometimes not used Latch makes decision (4-FO4) 28
Simple Multiplexing Transmitter DDR: send a bit per clock edge Critical issues: 50% duty cycle Tbit > 4-FO4 Data_O Data_E output pulse width closure (%) 30 20 10 0 1 2 3 4 5 bit time (normalized to FO4) 29
I/O Clocking Issues Remember the clocking issues: Long path constraint (setup time) Short path constraint (hold time) Need to worry about them for I/O as well For I/O need to worry about a number of delays Clock skew between chips Data delay between chips Can be larger than a clock cycle (speed of light) Clock skew between external clock and internal clock This can be very large if not compensated It is essentially the insertion delay of the clock tree 30
System Clocking: Simple Synchronous Systems d1 CK X CK X D I CK C1 CK C2 D I d2 on-chip logic CK C1 CK C2 Long bit times compared to on chip delays: Rely on buffer delays to achieve adequate timing margin 31
PLLs: Creating Zero Delay Buffers PLL/DLL CK X CK C CK X D I on-chip logic D I CK C On-chip clock might be a multiple of system clock: Synthesize on-chip clock frequency On-chip buffer delays do not match Cancel clock buffer delay 32
Used to Argue About PLLs vs DLLs VCO VCDL clk clk N ref clk PD ref clk PD Filter Filter Second/third order loop: Stability is an issue Frequency synthesis easy Ref. Clk jitter gets filtered Phase error accumulates First order loop: Stability guaranteed Frequency synthesis problematic Ref. Clk jitter propagates Phase error does not accumulate 33
After Many Years of Research And many papers and products One can mess up either a DLL or PLL Each has it own strengths and weaknesses If designed correctly, either will work well Jitter will be dominated by other sources Many good designs have been published It is now a building block that is often reused We all have our favorites, mine is the dual-loop design And yes, people use ring oscillators Still an open question about how much LC helps (in system) 34
Clocking Structures Synchronous: Same frequency and phase Conventional buses t t F 0 Mesochronous Same frequency, unknown phase Fast memories Internal system interfaces MAC/Packet interfaces t A t A t B F 0 t B Plesiochronous: Almost the same frequency Mostly everything else today F 1 F 2 F 1 F 2 35
Source Synchronous Systems CK SRC PLL/DLL CK RCV data rcvr logic ref CK SRC data D 0 D 1 D 2 D 3 CK RCV Position on-chip sampling clock at the optimal point i.e. maximize timing margin 36
Serial Link Circuit CK R rcvr logic D IN D 0 D 1 D IN CDR CK R Recover incoming data fundamental frequency Position sampling clock at the optimal point 37