To be published in the IEEE Journal of Solid State Circuits April 1997
A 700 Mbps/pin CMOS Signalling Interface Using
Current Integrating Receivers*
Stefanos Sidiropoulos and Mark Horowitz
Center for Integrated Systems
Stanford University
Stanford, CA 94305
Correspondence Address:
Stefanos Sidiropoulos
Center for Integrated Systems, Rm. 125
Stanford University
Stanford, California 94305-4070
Phone: (415) 725-3669
FAX: (415) 725-6949
Abstract — A high speed CMOS signalling interface for application in multiprocessor
interconnection networks has been developed. The interface utilizes 1-V push-pull drivers, a
Delay Line PLL and sampling of the data on both edges of the clock. In order to increase the
noise immunity of the reception a current-integrating input pin sampler is used to receive
the incoming data. Chips fabricated in a 0.8 µm CMOS technology achieve transfer rates of
740 Mbits/sec/pin operating from a 3.3-V supply with a bit error rate of less than 10^-14.
* Funding for this work was provided by ARPA under contract DABT63-94-C-0054.
I. INTRODUCTION
Scaling of semiconductor technology and advances in circuit design led to a rapid increase in
the speed of digital and memory IC’s. This created a demand for higher pin bandwidth on CMOS
chips, which in turn led to the development of synchronous high-speed interfaces with diverse
system applications [1]-[6]. Although the developed interfaces utilize different transmission
media and have diverse external configurations and line voltage swings, they all share a common
characteristic which leads to increased noise sensitivity: the incoming data is sampled only once
per bit period. This paper describes an interface design which overcomes this problem by
utilizing current integrating receivers to effectively filter high frequency noise and increase noise
immunity [7]-[8].
Section II describes the interface architecture and the external signalling scheme. Section III
introduces the concept of the current integrating input pin sampler and describes in detail the
implementation used in this design. Section IV describes the clocking circuit design including the
Delay Line PLL and the associated clock duty cycle adjusting circuits. Section V discusses the
prototype measurement results, and concluding remarks follow in Section VII.
II. SYSTEM ARCHITECTURE AND EXTERNAL INTERFACE
In our system a data link consists of a set of data wires along with one clock wire. The data is
transmitted in phase with the clock as depicted in Figure 1. One unit of parallel information is
transmitted every half period of the clock. Given that the transmission line lengths of the clock
and data wires are carefully equalized, the edges of the reference clock coincide with potential
data transitions, so the receiver can use that timing information to position its sampling clock. Bus-based
interfaces [1] usually have to synchronize their transmission to an existing bus clock, thus
requiring a transmitter DLL or PLL. In this point-to-point system the reference clock is generated
by the transmitter itself (by transmitting an alternating one-zero data stream), thus eliminating the
need for a transmitter clock alignment circuit. On the receiver side the reference clock has to be
amplified to full CMOS levels and then buffered up to drive the receivers of the incoming parallel
data. A conventional receiver would position the sampling clock edges in the middle of the
incoming data eye and use the sampling clock to sample the data once per half-clock period. In
contrast this design employs a delay-locked loop to position the sampling clock in phase with the
incoming data and average the data during the clock phase by employing current integration. At
the end of the half-clock period the receiver determines whether the incoming data was mostly
high or low based on the averaged value. This integration of the incoming data makes the
reception of the signals insensitive to high frequency supply or reflection noise and improves the
interface performance. It should be noted that using an integrating receiver is effective when the
predominant limitation is noise as is the case with moderate length (< 5-m) system interconnects.
When the transmission is limited by the interconnect bandwidth some form of channel
equalization can be employed. For example current integration at the receiver side can be
combined with digital signal predistortion on the transmitter side [6] to achieve the desired
performance.
Figure 2 shows the block diagram of the interface. In order to minimize power consumption
and simultaneous switching noise the transmitter uses a simple push-pull series terminated output
buffer rather than the more conventional parallel terminated open drain driver design. Series
termination offers the advantage of zero static power dissipation since when the output is not
switching there is zero quiescent current on the transmission line. The transmitter’s push-pull
drivers are operated from an off-chip 1-V supply and are series terminated with an
external resistor. Because of the low supply voltage of the drivers, only NMOS transistors are
used in the final stage, which reduces its area. To avoid reflections the sum of the driver
impedance and its external termination resistor should equal the 50-Ω impedance of the
transmission line. The nominal impedance of the buffer was set to 20 Ω, so a 30-Ω external
resistor is required to drive the 50-Ω line. The external resistor was used to decrease the effect
that process variations and transistor nonlinearities have on the impedance matching. Process and
environmental variations that cause the buffer impedance to deviate from its nominal value will
have only a modest impact on the quality of the transmitted signals, since the buffer impedance is
only a fraction of the series driving impedance. Therefore no active impedance control [2] was
integrated in our design. The power dissipation of the transmitting buffers depends on the
electrical length of the wire and the frequency of switching. When the time of flight through the
wire is very short compared to the bit time, the buffer only dissipates dynamic power (1.22 mW/100 MHz).
When the time of flight through the wire is longer than the bit time the buffer drives 10
mA of current until the signal is reflected back, which results in a worst-case power dissipation of
10 mW + 1.22 mW/100 MHz (including the power dissipated in the external resistor).
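As a numerical check on the termination figures above, the following Python sketch treats the driver, the external resistor and the line as a simple resistive divider. The function name and the drifted 25-Ω buffer value are illustrative assumptions and are not taken from the design.

```python
def series_terminated_driver(v_supply, r_driver, r_series, z0):
    """Return (incident wave amplitude in V, line current in A,
    reflection coefficient seen by a wave returning to the driver)."""
    r_source = r_driver + r_series
    v_launch = v_supply * z0 / (r_source + z0)     # resistive divider into the line
    i_line = v_supply / (r_source + z0)            # current while the wave is in flight
    gamma_src = (r_source - z0) / (r_source + z0)  # 0 when the source is matched
    return v_launch, i_line, gamma_src

# Nominal case from the text: 1-V supply, 20-ohm buffer, 30-ohm resistor, 50-ohm line.
v_nom, i_nom, g_nom = series_terminated_driver(1.0, 20.0, 30.0, 50.0)
p_static = 1.0 * i_nom  # power drawn from the 1-V supply while the line conducts

# Hypothetical process corner: buffer impedance drifts from 20 to 25 ohm (+25%).
v_drift, i_drift, g_drift = series_terminated_driver(1.0, 25.0, 30.0, 50.0)

print(v_nom, i_nom, p_static, g_nom)
print(g_drift)
```

With the nominal values the sketch gives the 10 mA line current quoted above (10 mW drawn from the 1-V supply), and the assumed 25% drift in buffer impedance produces a source reflection coefficient below 5%, which illustrates why the external resistor makes the match tolerant of buffer variations.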
In order to compensate for the noise margin degradation that would be introduced from DC
level shifts between the transmitter and the receiver board, the signals are transmitted in a pseudo-differential form. A reference voltage is generated on the transmitter board by using a simple
resistive voltage divider and shipped to the receiver. On the receiver side each data line is input to
a pair of current integrating receivers (one for each clock half-period) which average the data
value relative to the input reference voltage. This way DC level shifts will not affect the reception
of the signals as long as they remain within the common mode range of the input pin sampler.
III. CURRENT INTEGRATING RECEIVER DESIGN
A. Concept of Operation
Conventional high speed interfaces usually function as depicted in Figure 3. A high speed
reference clock is transmitted in phase or in quadrature with the high speed parallel signals. In
order to minimize the system cost differential transmission is avoided. Instead a reference voltage
is generated on the transmitter side and sent to the receiver to indicate the common mode voltage
of the transmitted data. At the receiver end the reference clock is amplified, buffered up and phase
shifted by 90° relative to the incoming data and reference clock by employing a Phase Locked
Loop. This quadrature positioning of the clock gives the input pin receiver maximum setup and
hold time to partially compensate for any skew that might be present between the reference clock
and the parallel data signals. However noise that might occur during sampling can cause the
receiver’s sampler to resolve the wrong value, degrading the interface performance. The sources
of such noise in the environment of a digital chip are numerous. One of the largest sources of
noise comes from the reference voltage being heavily coupled to the receiver’s ground. Therefore
any high frequency ground noise manifests itself as a differential mode offset at the input of the
receiver. This reference noise problem becomes even worse in the case where one chooses to
implement a full-duplex interface which superimposes the transmit and receive signals on the
same wire [2]-[3]. In this case the multiplexed reference voltage must track the transmitter’s
output, which is difficult to accomplish when the output switches. Increasing the signal amplitude
to compensate for the noise is only partly effective since a large fraction of the noise is
proportional to the signal swings. Using fully differential transmission reduces these problems at
the cost of additional pins and wires and increased power dissipation.
A cost-effective solution to the noise problem would be to ensure that switching noise does not
occur during the sampling period, e.g. by restricting the output buffer switching to happen at a
time instant different from that of the input pin sampling. This solution however poses onerous
restrictions on the system designer, and is possible only when the time of flight through the wires
is small compared to a bit time, or all wire delays are a fixed number of bit times. In general such
a solution might be impossible to implement in systems where two or more high-bandwidth
interfaces with unconstrained phase relationships between their clocks are integrated on the same
die. Another approach to the noise problem would be to sample the incoming data more than once
per bit period and use a digital majority voting scheme to determine the value of the transmitted
data. The disadvantages of such a solution are that the required power and area grow linearly with
the number of samples - the minimum required 3 samples would increase the power dissipated on
the sampling clock and the area occupied by the input samplers by a factor of 3. Additionally
positioning the sampling edges precisely with low intra-edge jitter is a difficult problem by itself
[9].
The solution we chose to implement is the analog equivalent of majority voting. The receiver
integrates current on a capacitor based on the polarity of the input voltage Vin-Vref and determines
the polarity of the incoming data at the end of the sampling period. This method requires only a
single clock defining the sampling/integration period and its power and area requirements are
moderate. Figure 4-(a) shows an idealized diagram of a current integrating receiver. This ideal
receiver consists of a current switch, a pair of load capacitors and two reset switches. The level of
clock phase φ indicates the input data-valid period. When φ is low the reset switches are on, equalizing
the integrator outputs. When φ goes high the current switch steers current to one branch of the
integrator or the other, depending on whether the input is higher or lower than the reference
voltage. At the end of clock phase φ the polarity of the output differential voltage ∆Vo indicates
whether the input signal was mostly low or high during the integrating period. Any transient noise
that would cause the input to cross the reference voltage would not affect the correct reception of
the signal as long as the duration of the transient noise is less than half the valid bit period. Such
transient noise affects only the magnitude but not the polarity of voltage ∆Vo. Therefore as long as
the value of ∆Vo is larger than the offset of the amplifier following the first integrating stage the
correct reception of the input signal will not be affected. This implementation integrates the sign
of the input voltage rather than the input voltage itself as is the case with an ideal matched filter.
This is done to avoid the unequal output amplitudes that a matched filter receiver would produce
in case Vref was DC-shifted relative to the middle of Vin.
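The difference between sampling once and integrating the sign of Vin-Vref can be illustrated with a small behavioral model. This is only a sketch: the bit is discretized into 100 time steps, the current is normalized to one unit per step, and the glitch position and width are hypothetical values chosen so that the glitch covers the middle of the bit.

```python
def sample_once(wave, vref):
    """Conventional receiver: decide from the single sample taken at mid-bit."""
    return wave[len(wave) // 2] > vref

def integrate_sign(wave, vref):
    """Idealized current-integrating receiver: a unit current is steered by the
    sign of (Vin - Vref) at every time step; the decision is the polarity of the
    accumulated charge. Returns (decision, normalized output swing)."""
    charge = sum(1 if v > vref else -1 for v in wave)
    return charge > 0, charge / len(wave)

N = 100       # time steps per bit (modeling assumption)
vref = 0.5    # reference at mid-swing

# A transmitted "high" bit corrupted by a glitch lasting 30% of the bit time.
glitch = [1.0] * N
for k in range(40, 70):
    glitch[k] = 0.0

wrong = sample_once(glitch, vref)
decision, swing = integrate_sign(glitch, vref)
print(wrong, decision, swing)
```

In this example the single mid-bit sample resolves the wrong value, while the integrator still resolves a high with a reduced normalized swing of 0.4, consistent with the statement that transient noise shorter than half the bit period changes only the magnitude of ∆Vo.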
Figure 4-(b) shows the phase characteristics of the ideal integrator assuming that its output is
fed to an ideal sample and hold network. In this experiment the input data is a clock waveform of
the same frequency as the sampling clock. The phase shift between the input data and the clock
varies from 0° to 360° and the value of ∆Vo at the end of the integration period is plotted versus
the corresponding phase-shift. The resulting phase characteristics curve has a triangular shape
which corresponds to the phase characteristics of a quadrature phase detector. When the input
signal and the sampling clock are in phase or 180° out of phase the switch current is dumped in
only one of the integrating capacitors for the full bit duration. Therefore in that case the
differential output voltage ∆Vo has its maximum or minimum value of ± (I × tb)/C where I is the
switch current, tb is the bit time (half the clock period), and C the integrating capacitor size. When
the input signal and the sampling clock are in quadrature the switch current is dumped evenly on
each of the two integrating capacitors for half the bit time and ∆Vo ends up being zero. For our
integrating receiver this zero crossing of the phase-shift axis in Figure 4-(b) is the equivalent of
the center of the sampling uncertainty window. In the ideal receiver this center of the setup+hold
time is located in the middle of the bit time. However, in a real implementation the center of the
sampling uncertainty window moves around the ideal 90° point due to offsets in the current
switch and the following amplifying stage.
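The triangular phase characteristic described above can be reproduced numerically. The sketch below assumes an ideal current switch, a square-wave input at the sampling clock frequency, and an integration window equal to the high phase of the sampling clock; the output is normalized to (I × tb)/C, and the function name and step count are illustrative.

```python
def delta_vo(phase_deg, steps=3600):
    """Normalized dVo at the end of the integration window for an input clock
    that lags the sampling clock by phase_deg degrees."""
    acc = 0
    for k in range(steps):
        t = 180.0 * (k + 0.5) / steps              # position inside the window, in degrees
        high = ((t - phase_deg) % 360.0) < 180.0   # level of the input clock at t
        acc += 1 if high else -1
    return acc / steps

for phase in (0, 45, 90, 180, 270):
    print(phase, delta_vo(phase))
```

The result is +1 in phase, -1 at 180°, and zero at the two quadrature points, with a linear slope in between.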
B. Current Integrator Implementation
In a system implementation the input pin sampler is required to have full swing outputs that are
held stable for a full clock cycle. Since the output of a simple current switch does not satisfy the
above requirements the implementation of a complete receiver utilizes some extra circuits. Figure
5-(a) shows the block diagram of the complete implementation of the current integrating receiver.
It consists of a current integrating stage, followed by a sample and hold circuit, an amplifier and a
differential latch. Two receivers are used in parallel per input pin to sample the data transmitted
on both the half-periods of the clock. Figure 5-(b) illustrates the timing of the receiver. During the
sampling period the differential voltage of the integrator is sampled while the amplifier and the
latch are being reset. At the end of phase φ the sample and hold network enters the hold state and
the amplifier amplifies the held voltage. Two inverter delays after the end of φ the latch is
triggered and the first stage and the sample and hold network are reset. This self-timed reset
facilitates equalization of all the intermediate nodes in the circuit, thus minimizing a potential
source of intersymbol interference.
A straightforward implementation of the current integrating stage is shown in Figure 6. In
order to accommodate the low input common mode voltage of our interface, the current switch is
implemented as a PMOS source coupled pair consisting of devices M1-M3. The reset switches are
implemented as NMOS devices S1, S2 and the load capacitors are shown here as the linear
elements C1, C2. In order for the first stage to achieve a behavior as close to the ideal integrator as
possible the source coupled pair must be able to steer almost all of its tail current with only a
fraction of the input differential voltage. An MOS differential pair behaves similarly to a current
switch only as long as the input differential voltage is large enough. We therefore define the
operating margin Vom† of the integrator to be the difference between the nominal input
differential voltage ∆Vin and the voltage required to completely steer the tail current. Another
important requirement for the first stage is that we want to maximize its output swing Vo = (I × tb)/C,
in order to make the reception less sensitive to on-chip noise. To maximize the output swing
one needs to minimize the load capacitance. As long as M1-M2 can be kept in the saturation
region (meaning that the output voltage Vo is less than a body effected VTP) one can set C1=C2=0
and let the differential pair integrate its tail current on its parasitic drain junction capacitance.
Assuming a quadratic MOSFET model and setting the tail current I = (Vo × C)/tb and the
integration capacitor C = cd × W, we can calculate the operating margin of the integrator
implementation:
$$V_{om} = \Delta V_{in} - V_{offs} - \sqrt{\frac{2 \times V_o \times c_d \times L}{t_b \times \mu_p \times C_{ox}}} \qquad (1)$$
where: ∆Vin is the input voltage swing (usually set by system constraints), Vo is the integrator
output swing, cd is the drain junction capacitance of M1 and M2 per unit of their width, L is the
length of M1 and M2, tb is the bit time, µp is the hole mobility and Cox the gate oxide capacitance.
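The last term of (1) follows from the assumptions stated above. The derivation below is a sketch under the quadratic MOSFET model, taking the fully steered condition to be the point where one device of the pair carries the whole tail current and the other is at the edge of cutoff; the symbol ∆V_steer is introduced here only for illustration.

```latex
\begin{align*}
I &= \tfrac{1}{2}\,\mu_p C_{ox}\,\frac{W}{L}\,\Delta V_{steer}^{2}
  && \text{(one device carries all of } I\text{)} \\
\Delta V_{steer} &= \sqrt{\frac{2\,I\,L}{\mu_p\,C_{ox}\,W}} \\
I &= \frac{V_o\,C}{t_b} = \frac{V_o\,c_d\,W}{t_b}
  && \text{(using } C = c_d \times W\text{)} \\
\Delta V_{steer} &= \sqrt{\frac{2\,V_o\,c_d\,L}{t_b\,\mu_p\,C_{ox}}} \\
V_{om} &= \Delta V_{in} - V_{offs} - \Delta V_{steer}
\end{align*}
```

The device width W cancels in the fourth line, which is why the steering voltage is set by technology parameters once Vo and tb are fixed.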
Equation (1) shows that given the input and output swing requirements and a target operating
speed the operating margin of the integrator depends only on technology parameters and increases
as the process transconductance and the parasitic junction capacitance improve. In other words for
any given requirement of Vo and tb there is a set of pairs of widths of M1-M2 and tail currents
which determine the operating margin voltage. In a practical design one has to pick a tail current
and set the device widths based on a trade-off between minimizing power and making the
integrating capacitors large enough to minimize coupling effects. A secondary requirement is that
the tail current has to be large enough so that coupling on the bias line will not introduce a
significant integration error. In our case, where tb = 2 nsec and ∆Vin = 500 mV, we set Vo to 800
mV and the width of M1-M2 to 100 µm for 200 µA of nominal tail current, which gives us a Vom of
300-350 mV over process and temperature - note however that these results are for the actual
implementation described below in which the charge injection error cancellation circuits double
the parameter cd.
For high current gain M1-M2 must run at low current per unit width, which increases the effect
of parasitic capacitances Ci and Ct in Figure 6. The gate-to-drain overlap capacitance Ci of
transistor M1 couples the input transitions onto the output integrating capacitor, introducing a
systematic integration offset. The significant tail capacitance Ct of the source coupled pair creates
another systematic offset. Because the input to the receiver is pseudo-differential, when
the input transitions the tail node must settle at a gate overdrive above min{Vin,Vref} before the
tail current is steered to one of the branches of the source coupled pair. While the tail node
settles, Ct consumes or sources some charge to or from the output node, and this charge is the
origin of the second systematic offset. These systematic offsets are manifested in
the phase characteristics of the integrator by increasing the output voltage of the integrator when
the input is low and decreasing it when the input is high. At the same time the time intervals
where the data is detected as high and low in Figure 4-(b) are correspondingly decreasing and
increasing, which means that the sampling uncertainty window width effectively increases.
Simulation results show that the increase of the sampling uncertainty window width can be as
large as 500 psec, i.e. 25% of our 2-nsec target bit time.
† Note that the operating margin Vom of the integrator is different from the noise margin Vnm=∆Vin−Voffs of the
differential pair. If the input differential voltage is less than Vom the receiver will still work but the center of the sampling uncertainty window for high and low inputs will deviate from its ideal 90° point if VIH≠VIL. Nevertheless as
long as VIH=VIL and Vnm<VIH<Vom the center of setup and hold margin will not deviate - only the maximum output
differential voltage of the integrator will decrease.
C. Charge Injection Error Cancelation
The circuit in Figure 7 (described in more detail in [7]) was used in our first design in order to
compensate for the charge injection errors. The input differential pair consists of devices M1-M3
and the reset network consists of devices S1-S2. The differential pair is augmented by three
capacitively connected devices MC1-MC3. Transistors MC1 and MC2 are employed to cancel the
systematic offset introduced by the gate-to-drain capacitance of M1. These two devices have half
the width of the input devices and in normal operation they always remain off. This way the
transitions of the input node are coupled on both the outputs of the integrator and therefore the
differential mode error introduced by the transitions of Vin is turned to a common mode variation.
The offset introduced by the tail capacitance of the differential pair is canceled by transistor MC3.
When the input transitions MC3 boosts the tail capacitance to its quiescent voltage level and
therefore minimizes the charge injection error if the capacitances Ct and MC3 are properly ratioed.
The phase characteristics simulated over three different process and environmental condition
corners at an operating frequency of 250 MHz are depicted in Figure 8. It can be seen that by
employing the cancelation techniques the worst-case sampling uncertainty window width
decreased from 500-psec to 180-psec. The imperfect cancelation seen in the “fast” and “slow”
simulation corners is mainly introduced by the mismatch in the ratio between gate and drain
capacitances in the tail boosting circuit. These two parameters can change disproportionately over
process variations and increase the sampling uncertainty window width.
To decrease the process sensitivity in the charge injection cancelation circuits our current
design uses the circuit shown in Figure 9. This figure shows the design of the current integrating
stage along with the associated sample and hold circuit. The current integrator in this design
consists of two differential pairs with cross coupled outputs. The main differential pair consists of
transistors M1-M3 and is run at normal current levels. The source coupled transistors of the charge
injection cancelation differential pair MC1-MC2 are identical to M1-M2. In addition the total width
of MC3 and MC4 is equal to that of M3 causing the capacitances on the tail nodes of the two
differential pairs to be the same. Therefore the cancelation differential pair provides a parasitic
induced error that is equal in magnitude and opposite in polarity to that induced by the main
differential pair. In this way all the parasitic induced differential errors are transformed to a
common mode variation which is small enough not to affect the operation of the second stage.
The implication of the canceling differential pair is that the output swing ∆Vo of the integrator is
reduced by a fraction equal to the ratio of the current of the canceling differential pair IC to that of
the main differential pair IM, which in turn suggests that we would want to minimize that ratio.
However making the ratio of currents arbitrarily small affects the canceling action of the second
differential pair. The requirement that bounds the lower limit of the offset canceling current IC is
that the node tailC should reach its quiescent voltage within the integration period in order for the
parasitic induced error of the two differential pairs to be equal. Simulation results indicate that
choosing the ratio of IC to IM to be 1/4 gives us a worst case 20% margin over all process corners
for a 1.6-nsec bit time. Figure 10 shows the phase characteristics of this alternative design. It can
be seen that employing the offset-canceling differential pair reduced the systematic errors to
below 50 psec (2.5% of our target bit time). The remaining small error is due to the nonlinear
nature of the tail junction capacitance: since the offset canceling differential pair is run at a lower
current than the main differential pair, node tail settles at a higher voltage than node tailC,
making its capacitance slightly higher and introducing an imbalance between the charge injected
by the parasitics of the two source coupled pairs.
Transistors S1-S10 in Figure 9 form the sample and hold network and the integrator reset
switches. To compensate for any overlap that might be present between the resetting phase φ and
the sampling phase φ the reset network is formed as a stack qualified by a delayed version of φ.
Phase ψ is generated locally by delaying φ through two inverters. This way the first stage is reset
only after the sample and hold switches S1, S2 have been completely shut off.
D. Current Integrator Biasing
The fact that the current integrator has no explicit load capacitance makes its output swing
sensitive to process variations of the parasitic drain capacitance. In addition the output voltage of
all integrators depends on the integration time. To reduce the variations caused by both of these
effects the integrator stage of all the receivers is biased by the circuit shown in Figure 11. This
replica bias circuit dynamically adjusts the current through the PMOS current sources so that the
integrator output voltage ∆Vo is held constant and independent of the value of the parasitic
integration capacitor and the operating clock frequency. The output of an integrator replica is low
pass filtered through MR1-MC1 and subsequently drives an operational amplifier which compares
it with a preset voltage. The amplifier adjusts the current through the integrator replica so that the
low-pass filtered output voltage VLPF remains equal to the external preset voltage. Compensation
for the feedback biasing circuit is accomplished with the explicit capacitor formed by transistor
Mc. The dominant pole of the circuit is set to 1 MHz, well below the minimum operating
frequency of the receiver clock and the first non-dominant pole formed by MR1-MC1.
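The action of the replica bias loop can be illustrated with a first-order behavioral model. This is a sketch and not the circuit itself: the loop is reduced to a discrete-time integrator that nudges the tail current in proportion to the error between the preset voltage and the replica output Vo = (I × tb)/C. The loop gain, the iteration count and the capacitance values are hypothetical.

```python
def settle_bias(c_farads, tb_seconds, v_preset, gain=2e-5, steps=400):
    """Iterate I <- I + gain * (Vpreset - Vo), with Vo = I * tb / C.
    Returns the settled (tail current, integrator output swing)."""
    i_tail = 0.0
    for _ in range(steps):
        vo = i_tail * tb_seconds / c_farads
        i_tail += gain * (v_preset - vo)   # amplifier corrects the bias current
    return i_tail, i_tail * tb_seconds / c_farads

# Assumed nominal parasitic capacitance of 0.5 pF at a 2-nsec bit time.
i_nom, v_nom = settle_bias(0.5e-12, 2e-9, 0.8)
# Same loop with a hypothetical 50% larger parasitic capacitance.
i_slow, v_slow = settle_bias(0.75e-12, 2e-9, 0.8)
# Same loop at a shorter bit time.
i_fast, v_fast = settle_bias(0.5e-12, 1.6e-9, 0.8)

print(i_nom, v_nom)
print(i_slow, v_slow)
print(i_fast, v_fast)
```

In all three cases the output swing settles to the preset value while the tail current absorbs the change in capacitance or bit time, which is the behavior the replica loop is intended to provide.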
E. Amplifier and Latch Design
Figure 12 shows the schematic diagram of the amplifier and the latch. The amplifier buffers the
output of the sample and hold network isolating the latch kick-back and also level shifts the low
common mode voltage of the sampled integrator output voltage to improve the latch performance.
The combination of cross coupled and diode connected loads forms a high differential impedance
which increases its small signal gain while the diode clamps limit the maximum output swing
reducing kick-back and facilitating faster reset. In order to prevent variations of the bandwidth of
the amplifier from introducing intersymbol interference an explicit reset switch is used at its output.
The differential latch converts the low swing output of the amplifier to a full swing CMOS signal.
The output of the precharged latch is held stable for a full clock cycle by a simple cross coupled
S-R latch. The width of the sampling uncertainty window of the receiver is affected not only by
systematic charge injection offsets in the first stage but also by mismatches between nominally
identical devices in the amplifier and the following latch. Simulation results assuming 25-mV
VTH mismatch between nominally identical devices in all stages indicate that the maximum
increase in the sampling uncertainty window is less than 80 psec.
IV. RECEIVER CLOCKING CIRCUITS
The task of the receiver’s Delay Locked Loop is to position the sampling clock at an optimum
point for integrating the incoming data independent of variations in process, voltage, temperature
and sampler setup and hold time. The current integrating receivers show very little variation in
their setup and hold time window and the optimum point of their sampling clock is exactly in
phase with the incoming data. Therefore the need for a receiver DLL would diminish if the
reference clock was a full swing signal that could drive the on chip input pin samplers. However,
since the reference clock is a 1-V swing signal that needs to be amplified and buffered up to drive
the input pin samplers, a DLL is needed to cancel the amplification and buffering delay.
Figure 13 shows a block diagram for the receiver DLL. The DLL consists of a conventional
core (delay line, charge pump and phase detector) along with a controlling finite state machine
and a pair of duty cycle adjusters. Conventional DLL designs use an amplified version of the
reference clock as the input to the delay line [10]. This approach has the main disadvantage that
the jitter inherently present in the reference clock will be propagated to the sampling clock.
Therefore if the reference clock moves due to noise the sampling clock will move a time instant t
later - where t is the delay through the delay line and the clock buffers. This all-pass filter
behavior creates a phase shift between the sampling clock and the input data at high noise
frequencies. This phase shift peaks at 180° at frequencies which are odd multiples of 1/(2×t). To
eliminate this effect this design uses a differential ECL level clock as the delay line input. This
clock has the same frequency as the reference clock but it carries much less phase noise. In a large
chip such as a multiprocessor router the cost of the extra two pins required for this clock can be
amortized over the multiple interfaces integrated on the same die.
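The frequency dependence of this effect can be sketched by modeling the sampling clock as a copy of the reference clock delayed by t. For sinusoidal jitter at frequency f, the timing error between the sampling clock and the data, per unit of reference jitter, is then |1 - exp(-j2πft)|. The 4-nsec delay used below is a hypothetical value chosen only to make the numbers concrete.

```python
import cmath
import math

def relative_jitter_gain(f_noise, t_delay):
    """Magnitude of the sampling-clock-to-data timing error per unit of
    reference clock jitter, for a sampling clock delayed by t_delay."""
    return abs(1 - cmath.exp(-2j * math.pi * f_noise * t_delay))

t_delay = 4e-9                   # assumed delay-line plus buffer delay
f_peak = 1.0 / (2.0 * t_delay)   # first frequency with a 180-degree shift

for f in (1e6, f_peak, 2 * f_peak, 3 * f_peak):
    print(f, relative_jitter_gain(f, t_delay))
```

Low-frequency jitter is tracked almost perfectly, while at odd multiples of 1/(2×t) the sampling clock moves opposite to the data and the relative error reaches twice the reference jitter.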
The DLL phase detector compares the received reference clock with the sampling clock
generated by the DLL. The phase detector is implemented as a sampling variation of the input
integrating receiver. A pair of NMOS pass transistors sample the input reference clock at the
rising edge of the sampling clock and feed the sampled value to a regular input pin receiver. In
order to compensate for the delay introduced in the phase detector by the input sampling network
all the regular integrating receivers are augmented with the same sampling transistors with their
gates tied to the positive supply. The output of the phase detector is integrated by the charge pump
producing the control voltage Vcp. In order to limit the amount of dither jitter at lower frequencies
the phase detector output is qualified by a pulse with fixed width [10]. This way a fixed charge
packet is delivered to the filter capacitor every cycle keeping the dithering of the control voltage
around the locking point fixed and independent of the operating frequency.
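The effect of delivering a fixed charge packet per cycle can be seen in a simple bang-bang loop model. In this sketch the phase detector reports only the sign of the phase error, and every cycle the delay moves by one fixed step; the initial error of 205 psec and the 10-psec step are hypothetical numbers.

```python
def bang_bang_lock(phase_error0, step, cycles):
    """Each cycle the phase detector gives the sign of the error and the charge
    pump moves the delay by one fixed step. Returns the error after each cycle."""
    err = phase_error0
    history = []
    for _ in range(cycles):
        err += -step if err > 0 else step
        history.append(err)
    return history

# Errors and steps in picoseconds (integers keep the arithmetic exact).
history = bang_bang_lock(205, 10, 40)
print(history[:3], history[-4:])
```

After acquisition the error alternates around zero with an amplitude bounded by one step. Because the step is set by the fixed pulse width rather than by the clock period, the dither does not grow when the loop runs at a lower frequency.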
The delay line is implemented as a series of eight current starved delay elements. In order to
improve the supply sensitivity of the DLL we used differential delay elements with symmetric
impedance loads [9]. Figure 14 shows the delay element schematic diagram. The control voltage
Vcp is a buffered version of the charge pump control voltage, while Vcn is generated by a replica
feedback biasing circuit which ensures that the delay through the elements stays constant
independent of variations on supply and temperature.
In our system the reference clock and the quiet clock input to the delay line might have an
arbitrary phase relationship with each other. Therefore there is no guarantee that upon reset the
DLL will not be in a situation where it will be trying to lock close to or below its minimum delay
point. To avoid this problem a finite state machine controls the DLL capture. Upon reset the
control voltage is set to produce a delay that is in the worst process corner 0.5 nsec more than the
minimum. When the reference clock is activated the finite state machine ignores it for the first 8
cycles. If during the subsequent 16 clock cycles the phase detector produces a “down” signal the
finite state machine phase shifts the sampling clock by 180° - a multiplexing differential delay
element in the delay line is used for this purpose. This way the maximum required delay range of
the delay line is T/2+0.5 ns (where T is the clock period). Resetting the delay line above its
minimum delay point gives the DLL margin across temperature induced drifts of the reference
clock. In order to minimize the loop acquisition time while keeping the dither jitter low the FSM
also controls the charge pump current. After the decision on the relative phase between the input
and reference clocks has been made the FSM increases the charge pump current by a factor of 5
for approximately 300 nsec. After that time interval and when the DLL has obtained coarse lock
the current is brought back to the normal operating levels to obtain low dither jitter [11]. In this
implementation no particular care was taken for minimizing the propagation of metastable states
from the phase detector output to the FSM. This problem can be solved easily by delaying the
phase detector output through a flip-flop chain before driving the FSM.
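The capture sequence described above can be summarized as a small behavioral model. This is a sketch under stated assumptions rather than the actual FSM: the state names, the class `DllCaptureFsm`, and the conversion of the 300 nsec interval into a cycle count are hypothetical, and the synchronizing flip-flop chain is not modeled.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    IGNORE = auto()     # reference clock just activated, first cycles ignored
    DECIDE = auto()     # watch the phase detector for a "down" signal
    FAST_LOCK = auto()  # charge pump current boosted for coarse lock
    TRACK = auto()      # normal current, low dither jitter

@dataclass
class DllCaptureFsm:
    clock_period_s: float
    ignore_cycles: int = 8
    decide_cycles: int = 16
    boost_factor: int = 5
    boost_time_s: float = 300e-9
    state: State = State.IGNORE
    count: int = 0
    saw_down: bool = False
    invert_clock: bool = False  # True selects the 180 degree shifted clock

    @property
    def pump_gain(self) -> int:
        """Charge pump current multiplier for the current state."""
        return self.boost_factor if self.state is State.FAST_LOCK else 1

    def step(self, pd_down: bool) -> None:
        """Advance the model by one reference clock cycle."""
        self.count += 1
        if self.state is State.IGNORE:
            if self.count == self.ignore_cycles:
                self.state, self.count = State.DECIDE, 0
        elif self.state is State.DECIDE:
            self.saw_down = self.saw_down or pd_down
            if self.count == self.decide_cycles:
                self.invert_clock = self.saw_down
                self.state, self.count = State.FAST_LOCK, 0
        elif self.state is State.FAST_LOCK:
            boost_cycles = math.ceil(self.boost_time_s / self.clock_period_s)
            if self.count >= boost_cycles:
                self.state, self.count = State.TRACK, 0

if __name__ == "__main__":
    fsm = DllCaptureFsm(clock_period_s=2.7e-9)
    for cycle in range(200):
        fsm.step(pd_down=(cycle == 10))  # one "down" inside the decision window
    print(fsm.state.name, "invert_clock =", fsm.invert_clock)
```

The model treats a single "down" observation anywhere in the 16-cycle window as sufficient to select the shifted clock, which is one reading of the text.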
Since data is transferred on both half-clock periods, any variation of the sampling clock
from the optimal 50% duty cycle point would cause one of the two input samplers to integrate the
wrong data item, degrading the timing and noise margins of the input pin samplers. Variations in
the clock duty cycle can be either inherited from the ECL input clock to the delay line, or induced
by the amplifiers and the clock buffers at the delay line output. To compensate for these effects
two duty cycle adjusters (DCA) are employed in the DLL. The first DCA is used at the input of the
delay line. This input duty cycle adjuster uses two differential delay elements connected in a
feedback loop with NMOS capacitors that remove the AC component of the voltages [9]. The
output of this circuit is tied to the output of the first delay element and compensates for any duty
cycle errors present in the input clock. The second DCA is embedded in the final stage which
converts the low swing clock output of the delay line to full CMOS swing. The schematic
diagram of this converter is shown in Figure 15. Two amplifiers with current mirror load and an
extra port are employed to generate the complementary signals c and c̄. The extra port of the amplifier is
connected to the output of a band limited delay element which low pass filters the sampling clock
signals. If the sampling clock has any duty cycle imperfection the differential output voltage of
the low pass filter will induce an input offset to the clock amplifier. This negative feedback will
cause the duty cycle of the sampling clock to be adjusted. A second stage current mirror amplifier
increases the output swing before driving the first stage of the clock buffer inverters.
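The negative feedback in the output duty cycle adjuster can be pictured with a first-order behavioral model. The sketch below is an assumption-laden illustration, not a circuit simulation: the loop gain, the filter coefficient, and the function name `settle_duty_cycle` are hypothetical, and the low pass filter is modeled as a single-pole discrete-time filter.

```python
def settle_duty_cycle(input_duty: float, loop_gain: float = 20.0,
                      filter_alpha: float = 0.05, cycles: int = 2000) -> float:
    """Return the settled output duty cycle of a simple correction loop.

    filtered    : low-pass-filtered duty cycle error, standing in for the
                  differential output of the band limited delay element
    loop_gain   : duty cycle correction per unit of filtered error
    The loop is stable in this model when filter_alpha * (1 + loop_gain) < 2.
    """
    filtered = 0.0
    output_duty = input_duty
    for _ in range(cycles):
        # The filtered error acts as an input offset that opposes the error.
        output_duty = input_duty - loop_gain * filtered
        filtered += filter_alpha * ((output_duty - 0.5) - filtered)
    return output_duty

if __name__ == "__main__":
    for duty in (0.40, 0.45, 0.50, 0.55, 0.60):
        print(f"input {duty:.2f} -> output {settle_duty_cycle(duty):.4f}")
```

With a finite loop gain A the residual error in this model is the input error divided by (1 + A), so the correction is never perfect but improves as the gain rises.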
V. EXPERIMENTAL RESULTS
To evaluate the interface performance an experimental prototype was fabricated in a 0.8-µm
(1.0 µm drawn) digital CMOS process. The 2.5×2.5 mm² die (Figure 16) contains the receiver
DLL, four current integrating receivers and eight output drivers. Three of these drivers constitute
the transmitter side of the chip while the rest of them are used to monitor buffered versions of the
receiver’s internal signals. The prototype chip was packaged in a 40-pin dual in-line package. To
ameliorate the effects of the large pin inductance of the package used, the high speed signals were
routed through the pins with the lowest inductance (~10 nH). Additionally a total of 16 pins were
dedicated to the chip power: 6 for the ground/substrate node, 4 for the 3.3-V positive supply and 4
for the 1-V output driver supply.
To assess the bit error rate of the interface, the transmitter can selectively transmit either an
externally configurable 8 bit long data-stream or a pseudo random bit sequence (PRBS). The
receiver contains the corresponding bit error rate (BER) testing circuits. Due to area constraints
the on-chip PRBS generation and error detection circuit can generate a 2×(2⁷−1) long sequence.
Normally a longer PRBS sequence would be desirable to detect any bandwidth limitations. In our
case where the input pin samplers are reset every cycle the use of a shorter PRBS does not affect
the validity of the measured BER.
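For readers unfamiliar with PRBS-based error checking, the following sketch generates a 2⁷−1 bit sequence with a 7-bit linear feedback shift register and counts mismatches against a local reference. The feedback polynomial x⁷ + x⁶ + 1 is an assumption (the text does not state which polynomial the chip uses), the factor of 2 in the on-chip sequence length is not modeled, and the function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator

def prbs7(seed: int = 0x7F) -> Iterator[int]:
    """Yield bits of a 7-bit LFSR sequence (assumed polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("an all-zero seed locks the LFSR up")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit

def count_bit_errors(received: Iterable[int], seed: int = 0x7F) -> int:
    """Compare received bits with a locally generated reference sequence."""
    reference = prbs7(seed)
    return sum(rx != ref for rx, ref in zip(received, reference))

if __name__ == "__main__":
    sent = list(islice(prbs7(), 2 * 127))
    corrupted = sent.copy()
    corrupted[42] ^= 1  # inject a single bit error
    print("period repeats:", sent[:127] == sent[127:])
    print("errors detected:", count_bit_errors(corrupted))
```

This checker assumes the receiver already knows the transmitter seed; a hardware checker would typically synchronize itself to the incoming stream instead.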
In our experimental set-up we used two-sided printed circuit boards for the transmitter and
receiver chips. The packages of the transmitting and receiving chips were mounted on PCB
through ZIF sockets and the high speed signals were carried through 1-m long coaxial cables.
Figure 17 shows the measured transfer rate versus the operating supply voltage. The bit error rate
for this measurement was less than 10⁻¹¹ for all cases. At the low end of supply voltages
(2.7-V) the interface achieves an operating speed of 540 Mbps/pin. At the high operating voltage
end the achievable speed was limited by the ability of the pulse generator used in the set-up to
produce a clean reference clock beyond 455 MHz. A longer measurement at the nominal
operating point of 3.3-V revealed that the actual BER at a 740 Mbps/pin rate is less than 10⁻¹⁴
(three days of continuous operation did not yield a single error).
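As a rough check of the scale involved, the number of bits observed in an error-free run and the corresponding BER bound can be computed as follows. The sketch assumes a single pin running continuously at the stated rate and statistically independent errors; the function names and the use of a confidence-level bound are illustrative additions, not part of the reported measurement.

```python
import math

def bits_transferred(rate_bps: float, duration_s: float) -> float:
    """Number of bits carried by one pin over the given time."""
    return rate_bps * duration_s

def ber_upper_bound(n_bits: float, confidence: float) -> float:
    """BER bound after n_bits with zero observed errors.

    Assuming independent errors, zero errors in n_bits occurs with
    probability (1 - p)**n_bits, which gives p <= -ln(1 - confidence) / n_bits.
    """
    return -math.log(1.0 - confidence) / n_bits

if __name__ == "__main__":
    n = bits_transferred(740e6, 3 * 24 * 3600)  # three days at 740 Mbps
    print(f"bits observed:        {n:.3e}")
    print(f"one error in n bits:  {1.0 / n:.2e}")
    for cl in (0.63, 0.90):
        print(f"bound at {cl:.0%} conf.:  {ber_upper_bound(n, cl):.2e}")
```

Three days at 740 Mbps corresponds to roughly 1.9×10¹⁴ bits, so a single error in that interval would have implied a rate of about 5×10⁻¹⁵.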
Figure 18 shows an eye diagram of the sampled data output of one of the receiver’s samplers
(after being buffered and driven off chip). In this experiment the chip operates in a loopback mode
at a 740 Mbps/pin transfer rate and with a peak-to-peak on chip supply noise of 200 mV. The
pseudo random data eye experiences a jitter of 180 ps peak-peak (28 ps RMS). Furthermore
simulation results show an overall static supply sensitivity of 0.7 ps/mV.
The sampling uncertainty window of the interface was measured by keeping the reference
clock at a fixed position and varying the skew between the reference clock and the data. In this
experiment we assumed that the sampling uncertainty window was violated if the bit error rate
was more than 10⁻⁹. The worst case width of the sampling uncertainty window of the system was found
to be 280 psec with its center located 80 psec from the center of the DLL locking point. Referring
back to Figure 5-(b) the sampling uncertainty window has a width of 25.2° and its center is
located -6.3° off the ideal 90° point. Note however that this is the composite uncertainty window
of the system since it includes transmitter, DLL, and receiver offsets. In order to compare the new
implementation of the integrating receiver with the initial process sensitive implementation
(Figure 7) the experiment was repeated with the input data being a clock waveform and all the
major on-chip noise sources turned off. In that case the window width is 50 psec showing an
improvement of a factor of 3 relative to our previous design [7]. The sensitivity of the input pin
sampler was measured by decreasing the 1-V buffer supply and scaling the reference voltage
appropriately. It was found that the receiver can still operate correctly with a BER<10⁻¹¹
when the input pin voltage is 50 mV around the reference.
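To put the measured window widths in proportion, they can be expressed as a fraction of the bit time. The short sketch below uses only the data rate and the window widths quoted above; the function names are hypothetical, and the bit time is taken as the reciprocal of the per-pin data rate.

```python
def bit_time_s(rate_bps: float) -> float:
    """Duration of one bit at the given per-pin data rate."""
    return 1.0 / rate_bps

def window_fraction(window_s: float, rate_bps: float) -> float:
    """Sampling uncertainty window as a fraction of the bit time."""
    return window_s / bit_time_s(rate_bps)

if __name__ == "__main__":
    rate = 740e6
    print(f"bit time: {bit_time_s(rate) * 1e12:.0f} ps")
    for label, width in (("composite system window", 280e-12),
                         ("receiver alone, noise off", 50e-12)):
        print(f"{label}: {window_fraction(width, rate):.1%} of the bit time")
```

At 740 Mbps/pin the bit time is about 1.35 ns, so the 280 psec composite window occupies roughly a fifth of each bit.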
To evaluate the effectiveness of the duty cycle adjuster we varied the duty cycle of the input
clock and measured the duty cycle of the sampling clock with the two DCA’s selectively enabled
and disabled. The results are shown in Figure 19. It can be seen that with both DCA’s enabled the
chip can accommodate up to 10% duty cycle distortion while the sampling clock would still be
within 1% of its nominal value. Due to the nature of the experiment the effectiveness of the input
DCA is more pronounced in this case.
The maximum power dissipation of the chip operating in loopback mode at 740 Mbps/pin
from a 3.3-V supply was measured to be 300 mW.
VI. CONCLUSION
Single ended high-speed parallel links have been sensitive to high frequency noise induced on
their reference lines. This paper described a method to reduce the effect of this noise, and a
complete interface using this technique. The interface uses an input pin receiver that integrates the
incoming data over its valid period rather than sampling it. To reduce the power dissipated in the
output drivers a 1-V swing NMOS push-pull design is used. To reduce the jitter on the sampling
clock the receiver’s DLL is implemented by using differential buffers with replica feedback
biasing.
A prototype implementation of the described circuit has been integrated in a 1-µm drawn
CMOS technology. The prototype achieves a nominal transfer rate of 700 Mbits/sec/pin with a
BER of less than 10⁻¹⁴. The interface has a sampling uncertainty window of 280 ps maintaining a
BER of less than 10⁻⁹ when operating at the edge of the uncertainty window. Although full-duplex
operation is not supported by the current implementation the receiver’s insensitivity to high
frequency noise in the input signals makes it particularly well suited for such applications. The
current design can achieve a maximum transfer rate of 900 Mbps/pin which demonstrates that
higher than 1 Gbit/sec/pin operation can be achieved with faster sub-micron technologies.
The interface described in this paper has been designed for application in multiprocessor
interconnection networks. However, the interface architecture and the integrating receiver design
are general and can be used in other applications such as high speed processor-to-memory
interfaces or ATM switching systems.
ACKNOWLEDGEMENTS
The authors would like to thank Tom Lee, Norm Jouppi and Mark Johnson for helpful
comments in the initial stages of this design. They are also indebted to Ken Yang and Birdy
Amrutur for assistance and stimulating discussions.
REFERENCES
[1] N. Kushiyama et al., “A 500Mbyte/sec Data-Rate 4.5M DRAM,” IEEE Journal of Solid
State Circuits, vol. 28, no. 4, Apr. 1993.
[2] T. Takahashi et al., “A CMOS Gate Array with 600 Mb/s Simultaneous Bidirectional I/O
Circuits,” IEEE Journal of Solid State Circuits, vol. 30, no. 12, Dec. 1995.
[3] R. Mooney, C. Dike, S. Borkar, “A 900 Mb/s Bidirectional Signalling Scheme,” IEEE
Journal of Solid State Circuits, vol. 30, no. 12, Dec. 1995.
[4] D. Cecchi et al., “A 1GB/s SCI link in 0.8µm BiCMOS,” International Solid State
Circuits Conference Digest of Technical Papers, pp. 326-327, Feb. 1995.
[5] S. Sidiropoulos, K. Yang, M. Horowitz, “A CMOS 500 Mbps/pin Synchronous Point to
Point Link Interface,” IEEE Symposium on VLSI Circuits, Jun. 1994.
[6] A. Widmer et al., “Single-Chip 4x500 Mbaud CMOS Transceiver,” International Solid
State Circuits Conference Digest of Technical Papers, pp. 126-127, Feb. 1996.
[7] S. Sidiropoulos, M. Horowitz, “Current Integrating Receivers for High Speed System
Interconnects,” IEEE Custom Integrated Circuits Conference, May 1995.
[8] S. Sidiropoulos, M. Horowitz, “A 700 Mbps/pin CMOS Signalling Interface Using
Current Integrating Receivers,” IEEE Symposium on VLSI Circuits, Jun. 1996.
[9] J. Maneatis, M. Horowitz, “Precise Delay Generation Using Coupled Oscillators,” IEEE
Journal of Solid State Circuits, vol. 28, no. 12, Dec. 1993.
[10] M. Johnson, E. Hudson, “A Variable Delay Line PLL for CPU-Coprocessor
Synchronization,” IEEE Journal of Solid State Circuits, vol. 23, no. 5, Oct. 1988.
[11] T. Lee et al., “A 2.5 V CMOS Delay-Locked Loop for an 18 Mbit, 500 MB/s DRAM,”
IEEE Journal of Solid State Circuits, vol. 29, no. 12, Dec. 1994.
FIGURE CAPTIONS
Fig. 1 Interface timing.
Fig. 2 Interface block diagram.
Fig. 3 Conventional interface block diagram and timing.
Fig. 4 Idealized current integrating receiver (a), and its phase characteristics (b).
Fig. 5 Block diagram of the input pin receiver (a), and its timing (b).
Fig. 6 Baseline current integrator implementation.
Fig. 7 Initial current integrator implementation.
Fig. 8 Phase characteristics of the initial current integrator implementation.
Fig. 9 Improved current integrator implementation.
Fig. 10 Phase characteristics of the improved implementation.
Fig. 11 Current integrator replica feedback biasing.
Fig. 12 Amplifier and latch schematic diagram.
Fig. 13 Receiver DLL block diagram.
Fig. 14 Delay element schematic diagram.
Fig. 15 Output duty cycle adjuster schematic diagram.
Fig. 16 Prototype die photograph.
Fig. 17 Prototype operating range.
Fig. 18 Received data eye diagram.
Fig. 19 Duty cycle adjuster effectiveness.