To be published in the IEEE Journal of Solid State Circuits, April 1997

A 700 Mbps/pin CMOS Signalling Interface Using Current Integrating Receivers*

Stefanos Sidiropoulos and Mark Horowitz
Center for Integrated Systems, Stanford University, Stanford, CA 94305

Correspondence Address: Stefanos Sidiropoulos, Center for Integrated Systems, Rm. 125, Stanford University, Stanford, California 94305-4070. Phone: (415) 725-3669, FAX: (415) 725-6949

Abstract: A high speed CMOS signalling interface for application in multiprocessor interconnection networks has been developed. The interface utilizes 1-V push-pull drivers, a Delay Line PLL and sampling of the data on both edges of the clock. In order to increase the noise immunity of the reception a current-integrating input pin sampler is used to receive the incoming data. Chips fabricated in a 0.8 µm CMOS technology achieve transfer rates of 740 Mbits/sec/pin operating from a 3.3-V supply with a bit error rate of less than 10^-14.

* Funding for this work was provided by ARPA under contract DABT63-94-C-0054.

I. INTRODUCTION

Scaling of semiconductor technology and advances in circuit design led to a rapid increase in the speed of digital and memory IC’s. This created a demand for higher pin bandwidth on CMOS chips, which in turn led to the development of synchronous high-speed interfaces with diverse system applications [1]-[6].
Although the developed interfaces utilize different transmission media and have diverse external configurations and line voltage swings, they all share a common characteristic which leads to increased noise sensitivity: the incoming data is sampled only once per bit period. This paper describes an interface design which overcomes this problem by utilizing current integrating receivers to effectively filter high frequency noise and increase noise immunity [7]-[8]. Section II describes the interface architecture and the external signalling scheme. Section III introduces the concept of the current integrating input pin sampler and describes in detail the implementation used in this design. Section IV describes the clocking circuit design, including the Delay Line PLL and the associated clock duty cycle adjusting circuits. Section V discusses the prototype measurement results and concluding remarks follow in Section VII.

II. SYSTEM ARCHITECTURE AND EXTERNAL INTERFACE

In our system a data link consists of a set of data wires along with one clock wire. The data is transmitted in phase with the clock as depicted in Figure 1. One unit of parallel information is transmitted every half period of the clock. Given that the transmission line lengths of the clock and data wires are carefully equalized, the edges of the reference clock coincide with potential data transitions, so the receiver can use that timing information to position its sampling clock. Bus based interfaces [1] usually have to synchronize their transmission to an existing bus clock, thus requiring a transmitter DLL or PLL. In this point-to-point system the reference clock is generated by the transmitter itself (by transmitting an alternating one-zero data stream), thus eliminating the need for a transmitter clock alignment circuit.
On the receiver side the reference clock has to be amplified to full CMOS levels and then buffered up to drive the receivers of the incoming parallel data. A conventional receiver would position the sampling clock edges in the middle of the incoming data eye and use the sampling clock to sample the data once per half-clock period. In contrast, this design employs a delay-locked loop to position the sampling clock in phase with the incoming data and averages the data during the clock phase by employing current integration. At the end of the half-clock period the receiver determines whether the incoming data was mostly high or low based on the averaged value. This integration of the incoming data makes the reception of the signals insensitive to high frequency supply or reflection noise and improves the interface performance. It should be noted that using an integrating receiver is effective when the predominant limitation is noise, as is the case with moderate length (< 5-m) system interconnects. When the transmission is limited by the interconnect bandwidth, some form of channel equalization can be employed. For example, current integration at the receiver side can be combined with digital signal predistortion on the transmitter side [6] to achieve the desired performance.

Figure 2 shows the block diagram of the interface. In order to minimize power consumption and simultaneous switching noise, the transmitter uses a simple push-pull series terminated output buffer rather than the more conventional parallel terminated open drain driver design. Series termination offers the advantage of zero static power dissipation, since when the output is not switching there is zero quiescent current on the transmission line. The transmitter’s push-pull drivers are operated from a 1-V supply external to the chip and are series terminated with an external resistor.
Due to the low swing supply voltage of the drivers only NMOS transistors are used in the final stage, making its area smaller. To avoid reflections the series impedance of the driver and its external termination resistor should be equal to the 50-Ω impedance of the transmission line. The nominal impedance of the buffer was set to 20 Ω, so a 30-Ω external resistor is required to drive the 50-Ω line. The external resistor was used to decrease the effect that process variations and transistor nonlinearities have on the impedance matching. Process and environmental variations that cause the buffer impedance to deviate from its nominal value will have only a modest impact on the quality of the transmitted signals, since the buffer impedance is only a fraction of the series driving impedance. Therefore no active impedance control [2] was integrated in our design.

The power dissipation of the transmitting buffers depends on the electrical length of the wire and the frequency of switching. When the time of flight through the wire is very short compared to the bit time, the buffer only dissipates dynamic power (1.22 mW per 100 MHz). When the time of flight through the wire is longer than the bit time the buffer drives 10 mA of current until the signal is reflected back, which results in a worst case power dissipation of 10 mW + 1.22 mW per 100 MHz (including the power dissipated on the external resistor).

In order to compensate for the noise margin degradation that would be introduced from DC level shifts between the transmitter and the receiver board, the signals are transmitted in a pseudodifferential form. A reference voltage is generated on the transmitter board by using a simple resistive voltage divider and shipped to the receiver.
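The series-termination figures quoted earlier in this section can be cross-checked with a short calculation. The following is a minimal sketch, assuming an ideal lossless 50-Ω line and a purely resistive driver; the function name and the 25-Ω "skewed" buffer value are illustrative assumptions, not values from the design.

```python
def series_terminated_driver(v_supply=1.0, r_buffer=20.0, r_ext=30.0, z0=50.0):
    """Launch conditions for a series terminated push-pull driver.

    Assumes an ideal resistive source and a lossless line of impedance z0.
    Returns (launch voltage, line current, source reflection coefficient,
    supply power drawn while the line is still charging).
    """
    r_src = r_buffer + r_ext                   # total series driving impedance
    v_launch = v_supply * z0 / (r_src + z0)    # voltage divider into the line
    i_line = v_supply / (r_src + z0)           # current until the reflection returns
    gamma_src = (r_src - z0) / (r_src + z0)    # mismatch seen by the returning wave
    p_charging = v_supply * i_line             # includes power in the external resistor
    return v_launch, i_line, gamma_src, p_charging


nominal = series_terminated_driver()
skewed = series_terminated_driver(r_buffer=25.0)   # hypothetical 25% buffer deviation

print("nominal:", nominal)
print("skewed :", skewed)
```

Under these assumptions the nominal case launches a 0.5-V step with 10 mA of line current and 10 mW drawn from the 1-V supply, in line with the figures in the text. The skewed case shows why a deviation in the 20-Ω buffer has only a modest effect: the mismatch is diluted by the external resistor, so the source reflection coefficient stays below 5%.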
On the receiver side each data line is input to a pair of current integrating receivers (one for each clock half-period) which average the data value relative to the input reference voltage. This way DC level shifts will not affect the reception of the signals as long as they remain within the common mode range of the input pin sampler.

III. CURRENT INTEGRATING RECEIVER DESIGN

A. Concept of Operation

Conventional high speed interfaces usually function as depicted in Figure 3. A high speed reference clock is transmitted in phase or in quadrature with the high speed parallel signals. In order to minimize the system cost differential transmission is avoided. Instead a reference voltage is generated on the transmitter side and sent to the receiver to indicate the common mode voltage of the transmitted data. At the receiver end the reference clock is amplified, buffered up and phase shifted by 90° relative to the incoming data and reference clock by employing a Phase Locked Loop. This quadrature positioning of the clock gives the input pin receiver maximum setup and hold time to partially compensate for any skew that might be present between the reference clock and the parallel data signals. However, noise that might occur during sampling can cause the receiver’s sampler to resolve the wrong value, degrading the interface performance.

The sources of such noise in the environment of a digital chip are numerous. One of the largest sources of noise comes from the reference voltage being heavily coupled to the receiver’s ground. Therefore any high frequency ground noise manifests itself as a differential mode offset at the input of the receiver. This reference noise problem becomes even worse in the case where one chooses to implement a full-duplex interface which superimposes the transmit and receive signals on the same wire [2]-[3]. In this case the multiplexed reference voltage must track the transmitter’s output, which is difficult to accomplish when the output switches.
Increasing the signal amplitude to compensate for the noise is only partly effective since a large fraction of the noise is proportional to the signal swings. Using fully differential transmission reduces these problems at the cost of additional pins and wires and increased power dissipation.

A cost effective solution to the noise problem would be to ensure that switching noise does not occur during the sampling period, e.g. by restricting the output buffer switching to happen at a time instant different from that of the input pin sampling. This solution however poses onerous restrictions on the system designer, and is possible only when the time of flight through the wires is small compared to a bit time, or all wire delays are a fixed number of bit times. In general such a solution might be impossible to implement in systems where more than one high-bandwidth interface, with unconstrained phase relationships between their clocks, is integrated on the same die.

Another approach to the noise problem would be to sample the incoming data more than once per bit period and use a digital majority voting scheme to determine the value of the transmitted data. The disadvantage of such a solution is that the required power and area grow linearly with the number of samples: the minimum required 3 samples would increase the power dissipated on the sampling clock and the area occupied by the input samplers by a factor of 3. Additionally, positioning the sampling edges precisely with low intra-edge jitter is a difficult problem by itself [9].

The solution we chose to implement is the analog equivalent of majority voting. The receiver integrates current on a capacitor based on the polarity of the input voltage Vin-Vref and determines the polarity of the incoming data at the end of the sampling period.
This method requires only a single clock defining the sampling/integration period, and its power and area requirements are moderate.

Figure 4-(a) shows an idealized diagram of a current integrating receiver. This ideal receiver consists of a current switch, a pair of load capacitors and two reset switches. The level of clock phase φ indicates the input data-valid period. When φ is low the switches are on, equalizing the integrator output. When φ goes high the current switch steers current to one branch of the integrator or the other depending on whether the input is higher or lower than the reference voltage. At the end of clock phase φ the polarity of the output differential voltage ∆Vo indicates whether the input signal was mostly low or high during the integrating period. Any transient noise that would cause the input to cross the reference voltage would not affect the correct reception of the signal as long as the duration of the transient noise is less than half the valid bit period. Such transient noise affects only the magnitude but not the polarity of voltage ∆Vo. Therefore as long as the value of ∆Vo is larger than the offset of the amplifier following the first integrating stage the correct reception of the input signal will not be affected. This implementation integrates the sign of the input voltage rather than the input voltage itself, as would be the case with an ideal matched filter. This is done to avoid the unequal output amplitudes that a matched filter receiver would produce in case Vref was DC-shifted relative to the middle of Vin.

Figure 4-(b) shows the phase characteristics of the ideal integrator assuming that its output is fed to an ideal sample and hold network. In this experiment the input data is a clock waveform of the same frequency as the sampling clock.
The phase shift between the input data and the clock varies from 0° to 360° and the value of ∆Vo at the end of the integration period is plotted versus the corresponding phase-shift. The resulting phase characteristics curve has a triangular shape which corresponds to the phase characteristics of a quadrature phase detector. When the input signal and the sampling clock are in phase or 180° out of phase the switch current is dumped in only one of the integrating capacitors for the full bit duration. Therefore in that case the differential output voltage ∆Vo has its maximum or minimum value of ± (I × tb)/C where I is the switch current, tb is the bit time (half the clock period), and C the integrating capacitor size. When the input signal and the sampling clock are in quadrature the switch current is dumped evenly on each of the two integrating capacitors for half the bit time and ∆Vo ends up being zero. For our integrating receiver this zero crossing of the phase-shift axis in Figure 4-(b) is the equivalent of the center of the sampling uncertainty window. In the ideal receiver this center of the setup+hold time is located in the middle of the bit time. However, in a real implementation the center of the sampling uncertainty window moves around the ideal 90° point due to offsets in the current switch and the following amplifying stage.

B. Current Integrator Implementation

In a system implementation the input pin sampler is required to have full swing outputs that are held stable for a full clock cycle. Since the output of a simple current switch does not satisfy the above requirements the implementation of a complete receiver utilizes some extra circuits. Figure 5-(a) shows the block diagram of the complete implementation of the current integrating receiver. It consists of a current integrating stage, followed by a sample and hold circuit, an amplifier and a differential latch.
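As a numerical aside on the ideal integrator of Section III-A, the triangular phase characteristic and the tolerance to short noise transients can be reproduced with a few lines of code. This is a minimal sketch of the idealized model only (perfect current steering, no offsets or parasitics); the function names, the sample counts and the 40% glitch are illustrative assumptions.

```python
def integrate_sign(samples, vref=0.0):
    """Normalised output of an ideal sign integrator.

    Every sample steers the full switch current to one capacitor or the other,
    so the result is expressed in units of I*tb/C and lies in [-1, +1].
    """
    return sum(1 if v > vref else -1 for v in samples) / len(samples)


def square_wave_window(shift_deg, n=3600):
    """Samples of a unit square wave seen during one integration window.

    The window spans 180 degrees of the sampling clock (one bit time); the
    input clock has the same frequency and is delayed by shift_deg.
    """
    samples = []
    for k in range(n):
        theta = (k + 0.5) * 180.0 / n                  # midpoint of the k-th slice
        high = ((theta - shift_deg) % 360.0) < 180.0   # input is high for half a period
        samples.append(1.0 if high else -1.0)
    return samples


# Extremes when in phase or in anti-phase, zero crossings in quadrature.
for shift in (0, 45, 90, 180, 270):
    print(shift, round(integrate_sign(square_wave_window(shift)), 3))

# A "high" bit pulled below the reference for 40% of the bit time:
# the magnitude shrinks but the polarity is unchanged.
glitchy_bit = [1.0] * 300 + [-1.0] * 400 + [1.0] * 300
print(integrate_sign(glitchy_bit))
```

In this model a transient shorter than half the bit time reduces the magnitude of the normalized output without changing its sign, which is the property the receiver relies on.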
Two receivers are used in parallel per input pin to sample the data transmitted on both the half-periods of the clock. Figure 5-(b) illustrates the timing of the receiver. During the sampling period the differential voltage of the integrator is sampled while the amplifier and the latch are being reset. At the end of phase φ the sample and hold network enters the hold state and the amplifier amplifies the held voltage. Two inverter delays after the end of φ the latch is triggered and the first stage and the sample and hold network are reset. This self-timed reset facilitates equalization of all the intermediate nodes in the circuit, thus minimizing a potential source of intersymbol interference.

A straightforward implementation of the current integrating stage is shown in Figure 6. In order to accommodate the low input common mode voltage of our interface, the current switch is implemented as a PMOS source coupled pair consisting of devices M1-M3. The reset switches are implemented as NMOS devices S1, S2 and the load capacitors are shown here as the linear elements C1, C2. In order for the first stage to achieve a behavior as close to the ideal integrator as possible the source coupled pair must be able to steer almost all of its tail current with only a fraction of the input differential voltage. An MOS differential pair behaves similarly to a current switch only as long as the input differential voltage is large enough. We therefore define the operating margin Vom† of the integrator to be the difference between the nominal input differential voltage ∆Vin and the voltage required to completely steer the tail current. Another important requirement for the first stage is that we want to maximize its output swing Vo = (I × tb)/C, in order to make the reception less sensitive to on chip noise. To maximize the output swing one needs to minimize the load capacitance.
As long as M1-M2 can be kept in the saturation region (meaning that the output voltage Vo is less than a body-affected VTP) one can set C1=C2=0 and let the differential pair integrate its tail current on its parasitic drain junction capacitance. Assuming a quadratic MOSFET model and setting the tail current I = (Vo × C)/tb and the integration capacitor C = cd × W, we can calculate the operating margin of the integrator implementation:

$$V_{om} = \Delta V_{in} - V_{offs} - \sqrt{\frac{2\, V_o\, c_d\, L}{t_b\, \mu_p\, C_{ox}}} \qquad (1)$$

where: ∆Vin is the input voltage swing (usually set by system constraints), Vo is the integrator output swing, cd is the drain junction capacitance of M1 and M2 per unit of their width, L is the length of M1 and M2, tb is the bit time, µp is the hole mobility and Cox the gate oxide capacitance.

Equation (1) shows that given the input and output swing requirements and a target operating speed the operating margin of the integrator depends only on technology parameters and increases as the process transconductance and the parasitic junction capacitance improve. In other words, for any given requirement of Vo and tb there is a set of pairs of widths of M1-M2 and tail currents which determine the operating margin voltage. In a practical design one has to pick a tail current and set the device widths based on a trade-off between minimizing power and making the integrating capacitors large enough to minimize coupling effects. A secondary requirement is that the tail current has to be large enough so that coupling on the bias line will not introduce a significant integration error. In our case, where tb = 2 nsec and ∆Vin = 500 mV, we set Vo to be 800 mV and the width of M1-M2 to be 100 µm for 200 µA of nominal tail current, which gives us a Vom of 300-350 mV over process and temperature. Note, however, that these results are for the actual implementation described below, in which the charge injection error cancellation circuits double the parameter cd.
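The design point quoted above can be turned into an implied integration capacitance by applying the stated relation I = (Vo × C)/tb directly. The sketch below does only that arithmetic; it takes the relation at face value, ignores the current of any cancelation circuitry, and uses hypothetical function and variable names.

```python
def implied_integration_capacitance(i_tail, t_bit, v_out):
    """Capacitance (farads) that gives swing v_out for current i_tail over t_bit."""
    return i_tail * t_bit / v_out


I_TAIL = 200e-6     # nominal tail current from the text, in amperes
T_BIT = 2e-9        # bit time from the text, in seconds
V_OUT = 0.8         # integrator output swing from the text, in volts
WIDTH_UM = 100.0    # width of M1-M2 from the text, in micrometres
DV_IN = 0.5         # input swing from the text, in volts

c_int = implied_integration_capacitance(I_TAIL, T_BIT, V_OUT)
c_per_um = c_int / WIDTH_UM                 # effective capacitance per unit width

# Voltage left for the offset plus the current-steering term, for the quoted
# operating margin range of 300-350 mV.
steering_budget = [DV_IN - v_om for v_om in (0.30, 0.35)]

print(f"C = {c_int * 1e12:.2f} pF, {c_per_um * 1e15:.1f} fF per um of width")
print("offset + steering budget (V):", [round(b, 3) for b in steering_budget])
```

Under these assumptions the implied capacitance is 0.5 pF, or about 5 fF per µm of device width, a figure that already includes the doubling of cd mentioned in the text. The quoted margin of 300-350 mV leaves roughly 150-200 mV of the 500-mV input swing for the offset and steering terms of Equation (1).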
For high current gain M1-M2 must run at low current per unit width, which increases the effect of parasitic capacitances Ci and Ct in Figure 6. The gate to drain overlap capacitance Ci of transistor M1 couples the input transitions on the output integrating capacitor, introducing a systematic integration offset. The significant tail capacitance of the source coupled pair Ct creates another systematic offset. Due to the fact that the input to the receiver is pseudo-differential, when the input transitions the tail node must settle at a gate overdrive above min{Vin,Vref} before the tail current is steered to one of the branches of the source coupled pair. The significant tail capacitance Ct consumes or sources some charge to or from the output node and therefore introduces another systematic offset into the integrator. These systematic offsets are manifested in the phase characteristics of the integrator by increasing the output voltage of the integrator when the input is low and decreasing it when the input is high. At the same time the time intervals where the data is detected as high and low in Figure 4-(b) are correspondingly decreasing and increasing, which means that the sampling uncertainty window width effectively increases. Simulation results show that the increase of the sampling uncertainty window width can be as large as 500 psec, i.e. 25% of our 2-nsec target bit time.

† Note that the operating margin Vom of the integrator is different from the noise margin Vnm=∆Vin−Voffs of the differential pair. If the input differential voltage is less than Vom the receiver will still work but the center of the sampling uncertainty window for high and low inputs will deviate from its ideal 90° point if VIH≠VIL. Nevertheless as long as VIH=VIL and Vnm<VIH<Vom the center of setup and hold margin will not deviate; only the maximum output differential voltage of the integrator will decrease.

C. Charge Injection Error Cancelation

The circuit in Figure 7 (described in more detail in [7]) was used in our first design in order to compensate for the charge injection errors. The input differential pair consists of devices M1-M3 and the reset network consists of devices S1-S2. The differential pair is augmented by three capacitively connected devices MC1-MC3. Transistors MC1 and MC2 are employed to cancel the systematic offset introduced by the gate-to-drain capacitance of M1. These two devices have half the width of the input devices and in normal operation they always remain off. This way the transitions of the input node are coupled on both the outputs of the integrator and therefore the differential mode error introduced by the transitions of Vin is turned to a common mode variation. The offset introduced by the tail capacitance of the differential pair is canceled by transistor MC3. When the input transitions, MC3 boosts the tail capacitance to its quiescent voltage level and therefore minimizes the charge injection error if the capacitances Ct and MC3 are properly ratioed.

The phase characteristics simulated over three different process and environmental condition corners at an operating frequency of 250 MHz are depicted in Figure 8. It can be seen that by employing the cancelation techniques the worst-case sampling uncertainty window width decreased from 500 psec to 180 psec. The imperfect cancelation seen in the “fast” and “slow” simulation corners is mainly introduced by the mismatch in the ratio between gate and drain capacitances in the tail boosting circuit. These two parameters can change disproportionately over process variations and increase the sampling uncertainty window width.

To decrease the process sensitivity in the charge injection cancelation circuits our current design uses the circuit shown in Figure 9. This figure shows the design of the current integrating stage along with the associated sample and hold circuit.
The current integrator in this design consists of two differential pairs with cross coupled outputs. The main differential pair consists of transistors M1-M3 and is run at normal current levels. The source coupled transistors of the charge injection cancelation differential pair MC1-MC2 are identical to M1-M2. In addition the total width of MC3 and MC4 is equal to that of M3, causing the capacitances on the tail nodes of the two differential pairs to be the same. Therefore the cancelation differential pair provides a parasitic induced error that is equal in magnitude and opposite in polarity to that induced by the main differential pair. In this way all the parasitic induced differential errors are transformed to a common mode variation which is small enough not to affect the operation of the second stage.

The implication of the canceling differential pair is that the output swing ∆Vo of the integrator is reduced by a fraction equal to the ratio of the current of the canceling differential pair IC to that of the main differential pair IM, which in turn suggests that we would want to minimize that ratio. However, making the ratio of currents arbitrarily small affects the canceling action of the second differential pair. The requirement that bounds the lower limit of the offset canceling current IC is that the node tailC should reach its quiescent voltage within the integration period in order for the parasitic induced error of the two differential pairs to be equal. Simulation results indicate that choosing the ratio of IM to IC to be 4 gives us a worst case 20% margin over all process corners for a 1.6-nsec bit time. Figure 10 shows the phase characteristics of this alternative design. It can be seen that employing the offset-canceling differential pair reduced the systematic errors to below 50 psec (2.5% of our target bit time).
The remaining small error is due to the nonlinear nature of the tail junction capacitance: since the offset canceling differential pair is run at a lower current than the main differential pair, node tail settles at a higher voltage than node tailC, making its capacitance slightly higher and introducing an imbalance between the charge injected by the parasitics of the two source coupled pairs.

Transistors S1-S10 in Figure 9 form the sample and hold network and the integrator reset switches. To compensate for any overlap that might be present between the resetting phase (the complement of φ) and the sampling phase φ, the reset network is formed as a stack qualified by a delayed version of φ. Phase ψ is generated locally by delaying φ through two inverters. This way the first stage is reset only after the sample and hold switches S1, S2 have been completely shut off.

D. Current Integrator Biasing

The fact that the current integrator has no explicit load capacitance makes its output swing sensitive to process variations of the parasitic drain capacitance. In addition the output voltage of all integrators depends on the integration time. To reduce the variations caused by both of these effects the integrator stage of all the receivers is biased by the circuit shown in Figure 11. This replica bias circuit dynamically adjusts the current through the PMOS current sources so that the integrator output voltage ∆Vo is held constant and independent of the value of the parasitic integration capacitor and the operating clock frequency. The output of an integrator replica is low pass filtered through MR1-MC1 and subsequently drives an operational amplifier which compares it with a preset voltage. The amplifier adjusts the current through the integrator replica so that the low-pass filtered output voltage VLPF remains equal to the external preset voltage.
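The effect of the replica bias loop described above can be illustrated with a simple behavioral model. This is a sketch under loud assumptions, not the circuit of Figure 11: the filtered replica output is modeled as equal to the swing I × tb / C, the amplifier is modeled as a discrete-time integrator with a hypothetical gain `gm`, and the capacitance and bit-time values swept below are made up for illustration.

```python
def settle_replica_bias(c_int, t_bit, v_preset, gm=1e-4, steps=200, i_start=50e-6):
    """Behavioural replica bias loop.

    Each step compares the replica swing I*tb/C with the preset voltage and
    nudges the bias current by gm * error (an integrating controller).
    Returns the settled bias current in amperes.
    """
    i_bias = i_start
    for _ in range(steps):
        v_replica = i_bias * t_bit / c_int      # assumed stand-in for the filtered output
        i_bias += gm * (v_preset - v_replica)   # amplifier corrects the bias current
    return i_bias


V_PRESET = 0.8   # target swing in volts (value taken from the text)

# Hypothetical spread of parasitic capacitance and bit time.
for c_int in (0.4e-12, 0.5e-12, 0.6e-12):
    for t_bit in (1.6e-9, 2.0e-9, 2.5e-9):
        i_bias = settle_replica_bias(c_int, t_bit, V_PRESET)
        swing = i_bias * t_bit / c_int
        print(f"C={c_int * 1e12:.1f} pF  tb={t_bit * 1e9:.1f} ns  "
              f"I={i_bias * 1e6:6.1f} uA  swing={swing:.3f} V")
```

In this model the settled current differs for every combination of capacitance and bit time while the resulting swing stays at the preset value, which is the purpose of the replica loop. The model says nothing about loop stability or bandwidth, which depend on the compensation of the real circuit.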
Compensation for the feedback biasing circuit is accomplished with the explicit capacitor formed by transistor Mc. The dominant pole of the circuit is set to 1 MHz, well below the minimum operating frequency of the receiver clock and the first non-dominant pole formed by MR1-MC1.

E. Amplifier and Latch Design

Figure 12 shows the schematic diagram of the amplifier and the latch. The amplifier buffers the output of the sample and hold network, isolating the latch kick-back, and also level shifts the low common mode voltage of the sampled integrator output voltage to improve the latch performance. The combination of cross coupled and diode connected loads forms a high differential impedance which increases its small signal gain, while the diode clamps limit the maximum output swing, reducing kick-back and facilitating faster reset. In order to prevent variations of the bandwidth of the amplifier from introducing intersymbol interference, an explicit reset switch is used at its output. The differential latch converts the low swing output of the amplifier to a full swing CMOS signal. The output of the precharged latch is held stable for a full clock cycle by a simple cross coupled S-R latch.

The width of the sampling uncertainty window of the receiver is affected not only by systematic charge injection offsets in the first stage but also by mismatches between nominally identical devices in the amplifier and the following latch. Simulation results assuming 25-mV VTH mismatch between nominally identical devices in all stages indicate that the maximum increase in the sampling uncertainty window is less than 80 psec.

IV. RECEIVER CLOCKING CIRCUITS

The task of the receiver’s Delay Locked Loop is to position the sampling clock at an optimum point for integrating the incoming data independent of variations in process, voltage, temperature and sampler setup and hold time.
The current integrating receivers show very little variation in their setup and hold time window and the optimum point of their sampling clock is exactly in phase with the incoming data. Therefore the need for a receiver DLL would diminish if the reference clock was a full swing signal that could drive the on chip input pin samplers. However, since the reference clock is a 1-V swing signal that needs to be amplified and buffered up to drive the input pin samplers, a DLL is needed to cancel the amplification and buffering delay.

Figure 13 shows a block diagram for the receiver DLL. The DLL consists of a conventional core (delay line, charge pump and phase detector) along with a controlling finite state machine and a pair of duty cycle adjusters. Conventional DLL designs use an amplified version of the reference clock as the input to the delay line [10]. This approach has the main disadvantage that the jitter inherently present in the reference clock will be propagated to the sampling clock. Therefore if the reference clock moves due to noise the sampling clock will move a time instant t later, where t is the delay through the delay line and the clock buffers. This all-pass filter behavior creates a phase shift between the sampling clock and the input data at high noise frequencies. This phase shift peaks at 180° at frequencies which are odd multiples of 1/(2×t). To eliminate this effect this design uses a differential ECL level clock as the delay line input. This clock has the same frequency as the reference clock but it carries much less phase noise. In a large chip such as a multiprocessor router the cost of the extra two pins required for this clock can be amortized over the multiple interfaces integrated on the same die.

The DLL phase detector compares the received reference clock with the sampling clock generated by the DLL.
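The all-pass behavior of a conventional DLL mentioned above can be quantified with a small calculation. The sketch below assumes sinusoidal jitter on the reference clock that the data follows immediately and the sampling clock follows a delay t later; the 4-ns delay is a made-up value and the function name is hypothetical.

```python
import cmath
import math


def relative_jitter_gain(f_noise, t_delay):
    """Timing error between data and sampling clock per unit of reference jitter.

    The data sees the jitter directly while the sampling clock sees it delayed
    by t_delay, so the error is |1 - exp(-j*2*pi*f*t)| = 2*|sin(pi*f*t)|.
    """
    return abs(1 - cmath.exp(-2j * math.pi * f_noise * t_delay))


T_DELAY = 4e-9   # hypothetical delay through the delay line and clock buffers

for label, f in (("low frequency (1 MHz)", 1e6),
                 ("f = 1/(2t)", 1 / (2 * T_DELAY)),
                 ("f = 1/t", 1 / T_DELAY),
                 ("f = 3/(2t)", 3 / (2 * T_DELAY))):
    print(f"{label:22s} gain = {relative_jitter_gain(f, T_DELAY):.4f}")
```

Under these assumptions slow jitter is tracked almost perfectly, while at odd multiples of 1/(2×t) the sampling clock moves in anti-phase with the data and the relative error is twice the jitter amplitude. This is the case that the quiet delay line input is intended to avoid.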
The phase detector is implemented as a sampling variation of the input integrating receiver. A pair of NMOS pass transistors samples the input reference clock at the rising edge of the sampling clock and feeds the sampled value to a regular input pin receiver. In order to compensate for the delay introduced in the phase detector by the input sampling network, all the regular integrating receivers are augmented with the same sampling transistors with their gates tied to the positive supply. The output of the phase detector is integrated by the charge pump, producing the control voltage Vcp. In order to limit the amount of dither jitter at lower frequencies the phase detector output is qualified by a pulse with fixed width [10]. This way a fixed charge packet is delivered to the filter capacitor every cycle, keeping the dithering of the control voltage around the locking point fixed and independent of the operating frequency.

The delay line is implemented as a series of eight current starved delay elements. In order to improve the supply sensitivity of the DLL we used differential delay elements with symmetric impedance loads [9]. Figure 14 shows the delay element schematic diagram. The control voltage Vcp is a buffered version of the charge pump control voltage, while Vcn is generated by a replica feedback biasing circuit which ensures that the delay through the elements stays constant independent of variations on supply and temperature.

In our system the reference clock and the quiet clock input to the delay line might have an arbitrary phase relationship with each other. Therefore there is no guarantee that upon reset the DLL will not be in a situation where it will be trying to lock close to or below its minimum delay point. To avoid this problem a finite state machine controls the DLL capture. Upon reset the control voltage is set to produce a delay that is in the worst process corner 0.5 nsec more than the minimum.
When the reference clock is activated the finite state machine ignores it for the first 8 cycles. If during the subsequent 16 clock cycles the phase detector produces a “down” signal, the finite state machine shifts the phase of the sampling clock by 180°; a multiplexing differential delay element in the delay line is used for this purpose. This way the maximum required delay range of the delay line is T/2+0.5 ns (where T is the clock period). Resetting the delay line above its minimum delay point gives the DLL margin for temperature induced drifts of the reference clock. To minimize the loop acquisition time while keeping the dither jitter low, the FSM also controls the charge pump current. After the decision on the relative phase between the input and reference clocks has been made, the FSM increases the charge pump current by a factor of 5 for approximately 300 nsec. After that interval, when the DLL has obtained coarse lock, the current is brought back to its normal operating level to obtain low dither jitter [11]. In this implementation no particular care was taken to minimize the propagation of metastable states from the phase detector output to the FSM. This problem can be solved easily by delaying the phase detector output through a flip-flop chain before it drives the FSM.

Since data is transferred on both half periods of the clock, any deviation of the sampling clock from the optimal 50% duty cycle would cause one of the two input samplers to integrate the wrong data item, degrading the timing and noise margins of the input pin samplers. Variations in the clock duty cycle can be either inherited from the ECL input clock to the delay line, or induced by the amplifiers and the clock buffers at the delay line output. To compensate for these effects the DLL employs two duty cycle adjusters (DCAs). The first DCA is used at the input of the delay line.
This input duty cycle adjuster uses two differential delay elements connected in a feedback loop with NMOS capacitors that remove the AC component of the voltages [9]. The output of this circuit is tied to the output of the first delay element and compensates for any duty cycle errors present in the input clock. The second DCA is embedded in the final stage, which converts the low swing clock output of the delay line to full CMOS swing. The schematic diagram of this converter is shown in Figure 15. Two amplifiers with current mirror loads and an extra port are employed to generate the complementary signals c and c̄. The extra port of each amplifier is connected to the output of a band limited delay element which low pass filters the sampling clock signals. If the sampling clock has any duty cycle imperfection, the differential output voltage of the low pass filter induces an input offset in the clock amplifier. This negative feedback adjusts the duty cycle of the sampling clock. A second stage current mirror amplifier increases the output swing before driving the first stage of the clock buffer inverters.

V. EXPERIMENTAL RESULTS

To evaluate the interface performance an experimental prototype was fabricated in a 0.8-µm (1.0 µm drawn) digital CMOS process. The 2.5×2.5 mm^2 die (Figure 16) contains the receiver DLL, four current integrating receivers and eight output drivers. Three of these drivers constitute the transmitter side of the chip while the rest are used to monitor buffered versions of the receiver’s internal signals. The prototype chip was packaged in a 40-pin dual in-line package. To ameliorate the effects of the large pin inductance of this package, the high speed signals were routed through the pins with the lowest inductance (~10 nH). Additionally a total of 16 pins were dedicated to the chip power: 6 for the ground/substrate node, 4 for the 3.3-V positive supply and 4 for the 1-V output driver supply.

To assess the bit error rate of the interface, the transmitter can selectively transmit either an externally configurable 8-bit long data stream or a pseudo random bit sequence (PRBS). The receiver contains the corresponding bit error rate (BER) testing circuits. Due to area constraints the on-chip PRBS generation and error detection circuit can generate only a 2×(2^7-1) long sequence. Normally a longer PRBS would be desirable to detect any bandwidth limitations. In our case, where the input pin samplers are reset every cycle, the use of a shorter PRBS does not affect the validity of the measured BER.

In our experimental set-up we used two-sided printed circuit boards for the transmitter and receiver chips. The packages of the transmitting and receiving chips were mounted on the PCBs through ZIF sockets and the high speed signals were carried through 1-m long coaxial cables. Figure 17 shows the measured transfer rate versus the operating supply voltage. The bit error rate for this measurement was less than 10^-11 in all cases. At the low end of the supply range (2.7 V) the interface achieves an operating speed of 540 Mbps/pin. At the high end of the supply range the achievable speed was limited by the ability of the pulse generator used in the set-up to produce a clean reference clock beyond 455 MHz. A longer measurement at the nominal operating point of 3.3 V revealed that the actual BER at a 740 Mbps/pin rate is less than 10^-14 (three days of continuous operation did not yield a single error).

Figure 18 shows an eye diagram of the sampled data output of one of the receiver’s samplers (after being buffered and driven off chip). In this experiment the chip operates in loopback mode at a 740 Mbps/pin transfer rate with a peak-to-peak on-chip supply noise of 200 mV. The pseudo random data eye exhibits a jitter of 180 ps peak-to-peak (28 ps RMS).
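The three-day, zero-error measurement quoted above for the 740 Mbps/pin rate can be related to the stated BER bound with a short calculation. This is a hedged sketch rather than the authors' analysis: it assumes a single pin was observed and that bit errors are independent, neither of which the text states, and the function names are illustrative.

```python
import math

BIT_RATE = 740e6            # bits per second per pin, from the text
DURATION_S = 3 * 24 * 3600  # three days of continuous operation


def bits_observed(bit_rate, duration_s):
    """Number of bits received on one pin during the measurement."""
    return bit_rate * duration_s


def confidence_ber_below(ber_limit, n_bits):
    """Confidence that the true BER is below ber_limit after n_bits with
    zero observed errors, assuming independent bit errors (an assumption)."""
    # Probability of seeing no errors if BER equalled ber_limit is ~exp(-p*N).
    return 1.0 - math.exp(-ber_limit * n_bits)


if __name__ == "__main__":
    n = bits_observed(BIT_RATE, DURATION_S)
    print(f"bits observed: {n:.3e}")
    print(f"confidence that BER < 1e-14: {confidence_ber_below(1e-14, n):.3f}")
```

Under these assumptions roughly 1.9×10^14 bits pass through one pin in three days, which supports a BER below 10^-14 with a confidence of roughly 85%. Observing several pins in parallel would raise the bit count and the confidence accordingly.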
Furthermore simulation results show an overall static supply sensitivity of 0.7 ps/mV.

The sampling uncertainty window of the interface was measured by keeping the reference clock at a fixed position and varying the skew between the reference clock and the data. In this experiment we assumed that the sampling uncertainty window was violated if the bit error rate exceeded 10^-9. The worst case width of the system’s sampling uncertainty window was found to be 280 psec, with its center located 80 psec from the center of the DLL locking point. Referring back to Figure 5-(b), the sampling uncertainty window has a width of 25.2° and its center is located -6.3° off the ideal 90° point. Note however that this is the composite uncertainty window of the system, since it includes transmitter, DLL, and receiver offsets. To compare the new implementation of the integrating receiver with the initial process sensitive implementation (Figure 7), the experiment was repeated with the input data being a clock waveform and all the major on-chip noise sources turned off. In that case the window width is 50 psec, an improvement by a factor of 3 relative to our previous design [7].

The sensitivity of the input pin sampler was measured by decreasing the 1-V buffer supply and scaling the reference voltage appropriately. It was found that the receiver still operates correctly with a BER<10^-11 when the input pin voltage is 50 mV around the reference.

To evaluate the effectiveness of the duty cycle adjusters we varied the duty cycle of the input clock and measured the duty cycle of the sampling clock with the two DCAs selectively enabled and disabled. The results are shown in Figure 19. With both DCAs enabled the chip can accommodate up to 10% duty cycle distortion while the sampling clock remains within 1% of its nominal value.
Due to the nature of the experiment the effectiveness of the input DCA is more pronounced in this case.

The maximum power dissipation of the chip operating in loopback mode at 740 Mbps/pin from a 3.3-V supply was measured to be 300 mW.

VI. CONCLUSION

Single ended high-speed parallel links have been sensitive to high frequency noise induced on their reference lines. This paper described a method to reduce the effect of this noise, and a complete interface using this technique. The interface uses an input pin receiver that integrates the incoming data over its valid period rather than sampling it. To reduce the power dissipated in the output drivers a 1-V swing NMOS push-pull design is used. To reduce the jitter on the sampling clock the receiver’s DLL is implemented using differential buffers with replica feedback biasing.

A prototype implementation of the described circuit has been integrated in a 1-µm drawn CMOS technology. The prototype achieves a nominal transfer rate of 700 Mbits/sec/pin with a BER of less than 10^-14. The interface has a sampling uncertainty window of 280 ps, maintaining a BER of less than 10^-9 when operating at the edge of the uncertainty window. Although full-duplex operation is not supported by the current implementation, the receiver’s insensitivity to high frequency noise in the input signals makes it particularly well suited for such applications. The current design can achieve a maximum transfer rate of 900 Mbps/pin, which indicates that operation above 1 Gbit/sec/pin can be achieved with faster sub-micron technologies.

The interface described in this paper has been designed for application in multiprocessor interconnection networks. However, the interface architecture and the integrating receiver design are general and can be used in other applications such as high speed processor-to-memory interfaces or ATM switching systems.

ACKNOWLEDGEMENTS

The authors would like to thank Tom Lee, Norm Jouppi and Mark Johnson for helpful comments in the initial stages of this design. They are also indebted to Ken Yang and Birdy Amrutur for assistance and stimulating discussions.

REFERENCES

[1] N. Kushiyama et al., “A 500Mbyte/sec Data-Rate 4.5M DRAM,” IEEE Journal of Solid State Circuits, vol. 28, no. 4, Apr. 1993.
[2] T. Takahashi et al., “A CMOS Gate Array with 600 Mb/s Simultaneous Bidirectional I/O Circuits,” IEEE Journal of Solid State Circuits, vol. 30, no. 12, Dec. 1995.
[3] R. Mooney, C. Dike, S. Borkar, “A 900 Mb/s Bidirectional Signalling Scheme,” IEEE Journal of Solid State Circuits, vol. 30, no. 12, Dec. 1995.
[4] D. Cecchi et al., “A 1GB/s SCI link in 0.8µm BiCMOS,” International Solid State Circuits Conference Digest of Technical Papers, pp. 326-327, Feb. 1995.
[5] S. Sidiropoulos, K. Yang, M. Horowitz, “A CMOS 500 Mbps/pin Synchronous Point to Point Link Interface,” IEEE Symposium on VLSI Circuits, Jun. 1994.
[6] A. Widmer et al., “Single-Chip 4x500 Mbaud CMOS Transceiver,” International Solid State Circuits Conference Digest of Technical Papers, pp. 126-127, Feb. 1996.
[7] S. Sidiropoulos, M. Horowitz, “Current Integrating Receivers for High Speed System Interconnects,” IEEE Custom Integrated Circuits Conference, May 1995.
[8] S. Sidiropoulos, M. Horowitz, “A 700 Mbps/pin CMOS Signalling Interface Using Current Integrating Receivers,” IEEE Symposium on VLSI Circuits, Jun. 1996.
[9] J. Maneatis, M. Horowitz, “Precise Delay Generation Using Coupled Oscillators,” IEEE Journal of Solid State Circuits, vol. 28, no. 12, Dec. 1993.
[10] M. Johnson, E. Hudson, “A Variable Delay Line PLL for CPU-Coprocessor Synchronization,” IEEE Journal of Solid State Circuits, vol. 23, no. 5, Oct. 1988.
[11] T. Lee et al., “A 2.5 V CMOS Delay-Locked Loop for an 18 Mbit, 500 MB/s DRAM,” IEEE Journal of Solid State Circuits, vol. 29, no. 12, Dec. 1994.

FIGURE CAPTIONS

Fig. 1 Interface timing.
Fig. 2 Interface block diagram.
Fig. 3 Conventional interface block diagram and timing.
Fig. 4 Idealized current integrating receiver (a), and its phase characteristics (b).
Fig. 5 Block diagram of the input pin receiver (a), and its timing (b).
Fig. 6 Baseline current integrator implementation.
Fig. 7 Initial current integrator implementation.
Fig. 8 Phase characteristics of the initial current integrator implementation.
Fig. 9 Improved current integrator implementation.
Fig. 10 Phase characteristics of the improved implementation.
Fig. 11 Current integrator replica feedback biasing.
Fig. 12 Amplifier and latch schematic diagram.
Fig. 13 Receiver DLL block diagram.
Fig. 14 Delay element schematic diagram.
Fig. 15 Output duty cycle adjuster schematic diagram.
Fig. 16 Prototype die photograph.
Fig. 17 Prototype operating range.
Fig. 18 Received data eye diagram.
Fig. 19 Duty cycle adjuster effectiveness.