IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 54, NO. 3, MARCH 2019
A 56-Gb/s PAM4 Receiver With Low-Overhead
Techniques for Threshold and Edge-Based DFE
FIR- and IIR-Tap Adaptation in 65-nm CMOS
Ashkan Roshan-Zamir, Student Member, IEEE, Takayuki Iwai, Yang-Hang Fan,
Ankur Kumar, Student Member, IEEE, Hae-Woong Yang, Student Member, IEEE, Lee Sledjeski,
John Hamilton, Soumya Chandramouli, Member, IEEE, Arlo Aude, Member, IEEE,
and Samuel Palermo, Senior Member, IEEE
Abstract— This paper presents a four-level pulse amplitude
modulation (PAM4) quarter-rate receiver that efficiently
compensates for moderate channel loss in a robust manner
through background adaptation of the receiver thresholds and
equalization taps. The front-end utilizes an input single-stage
continuous-time linear equalizer (CTLE) to boost the main cursor
and relax the pre-cursor cancellation requirement, requiring only
a 2-tap pre-cursor feed-forward equalizer (FFE) on the transmitter side. A 2-tap decision feedback equalizer (DFE) follows that
includes one finite impulse response (FIR) tap and one infinite
impulse response (IIR) tap to cancel first post-cursor and long-tail
inter-symbol interference (ISI), respectively. In addition to the
per-slice main three data samplers, a single error sampler is
utilized for background threshold control and an edge-based sampler performs both phase-locked loop (PLL)-based clock and data
recovery (CDR) phase detection and generates information for
background DFE tap adaptation. Fabricated in general purpose
(GP) 65-nm CMOS, the 56-Gb/s receiver achieves 4.63 mW/Gb/s
and compensates for up to 20.8-dB loss at a bit error rate (BER)
< 10^-12 when operated with a 2-tap FFE transmitter.
Index Terms— Decision feedback equalizer (DFE), DFE adaptation, four-level pulse amplitude modulation (PAM4), infinite
impulse response (IIR), receiver, serial link, threshold adaptation.
Manuscript received July 7, 2018; revised October 4, 2018; accepted
October 31, 2018. Date of publication December 18, 2018; date of current
version February 21, 2019. This paper was approved by Guest Editor Nan
Sun. This work was supported by SRC under Grant 1836.143. (Corresponding
author: Ashkan Roshan-Zamir.)
A. Roshan-Zamir was with the Analog and Mixed Signal Center, Electrical
and Computer Engineering Department, Texas A&M University, College
Station, TX 77843 USA. He is now with Texas Instruments Incorporated,
Santa Clara, CA 95051 USA (e-mail: ashkanroshan@tamu.edu).
T. Iwai was with the Analog and Mixed Signal Center, Electrical and
Computer Engineering Department, Texas A&M University, College Station,
TX 77843 USA. He is now with Toshiba Memory Corporation, Kawasaki
212-8520, Japan.
Y.-H. Fan, A. Kumar, H.-W. Yang, and S. Palermo are with the Department
of Electrical and Computer Engineering, Texas A&M University, College
Station, TX 77843 USA.
L. Sledjeski, J. Hamilton, S. Chandramouli, and A. Aude are with Texas
Instruments Incorporated, Duluth, GA 30096 USA.
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2018.2881278
I. INTRODUCTION
SUPPORTING increased bandwidth demand in datacenters
and high-performance computing systems requires higher
per-lane electrical I/O data rates, motivating the development
of recent high-speed I/O standards that utilize four-level
pulse amplitude modulation (PAM4) with higher spectral
efficiency [1], [2]. However, as shown in Fig. 1, PAM4 signaling increases system complexity at both the transmitter
and receiver sides. Since the feed-forward equalizer (FFE) at
the transmitter must be implemented on both the MSB and
LSB signals, this can result in a high level of output stage
segmentation with large FFE tap counts [3]. At the receiver,
the multi-level decisions require multiple data samplers and
equalizers with sufficient linearity to cancel long-tail multi-level inter-symbol interference (ISI). This paper focuses on
improving the receiver efficiency, while also minimizing the
transmit-side equalization requirements.
While ADC-based receivers [4], [5] are well suited for
PAM4 signaling due to their inherent multi-level detection and
robust digital equalization, their power can be prohibitive for
moderate channel loss applications. This motivates a power-efficient mixed-signal receiver front-end solution. However,
relative to NRZ receivers, transmit swing limitations make
PAM4 receivers more sensitive to noise and residual ISI.
This necessitates stringent ISI cancellation, and can result
in transmit-side FFEs with three or more taps [3], [5], [7]
and receive-side multi-stage continuous-time linear equalizers
(CTLEs) [5]–[7] and decision feedback equalizers (DFEs) with
large tap counts when implemented with conventional finite
impulse response (FIR) feedback filters [6]. Improvement in
DFE efficiency is possible with architectures that combine
conventional FIR and infinite impulse response (IIR) feedback
filters [8], [9]. However, it is difficult to support channels with
over 15 dB of loss at Nyquist utilizing a DFE-only approach
due to excessive sampler sensitivity requirements [10].
Reliable PAM4 receiver operation requires robust configuration of both the equalization settings and sampler thresholds.
DFE taps must be adaptively tuned to support operation over
a wide range of channels. These varying equalization settings
result in different multi-level threshold values for the data samplers.
An approach that utilizes a standard minimum mean square
error algorithm requires four extra error samplers at each of
the four PAM4 levels [6]. Another technique that involves
the symmetrical adaptation of the high and low thresholds
in the presence of Gray coding [11] requires two extra error
samplers and can be sensitive to transmitter and front-end
non-linearity. Overall, both the equalization tap weight and
threshold adaptations should be performed with minimal hardware overhead and also offer compatibility with efficient clock
and data recovery (CDR) architectures that support PAM4
modulation.
Fig. 1. Conceptual PAM4 transceiver with transmit- and receive-side equalization, adaptation for equalization taps and sampler thresholds, and clock recovery circuitry.
The efficiency of the receiver CDR system is also important.
While phase interpolator (PI)-based CDRs are often employed
to generate the optimal sampling time from multi-phase high-speed reference clocks [6], [7], [11], this involves the design
of a separate dedicated phase-locked loop (PLL). In addition,
high-speed global clock distribution and per-channel multi-phase generation circuitry are required in a multi-channel
receiver system.
This paper presents a 56-Gb/s mixed-signal PAM4 receiver
that is targeted for moderate channel loss applications and
addresses the aforementioned issues [12]. Section II compares
different equalization configurations for operating over a moderate loss channel with low complexity. The PAM4 receiver
architecture that employs a single-stage CTLE and DFE with
only 1-FIR tap and 1-IIR tap to efficiently cancel long-tail
ISI is detailed in Section III, along with a discussion on the
bang-bang phase detector (BBPD) PLL-based CDR that recovers the clock using only one per-slice edge sampler. Section IV
describes the proposed background sampler threshold adaptation scheme that does not rely on equal PAM4-level spacing
and uses only one additional per-slice sampler that periodically
scans the top and bottom PAM4 eyes to compensate for nonlinearity. Also discussed is that through utilizing the CDR edge
samplers’ information, the DFE adaptation scheme of [13] is
extended for PAM4 operation with the addition of independent
per-slice tap values for mismatch robustness. Experimental
results from a general purpose (GP) 65-nm CMOS prototype
are presented in Section V. Finally, Section VI concludes this
paper.
II. EQUALIZATION COMPARISONS
The frequency response of a representative refined electrical channel that has a moderate 20.8 dB of loss at the
14-GHz Nyquist frequency for 56-Gb/s PAM4 modulation
is shown in Fig. 2(a). This smoothly decreasing response
is caused by skin effect and dielectric loss, with minimal
performance degradation due to reflections. Fig. 2(b) shows
the channel’s non-equalized 28-GS/s pulse response, which is
well characterized by a fast rising-edge with one significant
pre-cursor ISI term and a slow-decaying long-tail ISI on the
falling edge with significant ISI terms up to the 10th post-cursor.
Transmission of this un-equalized data results in a very poor
bit error rate (BER), as shown in the voltage and timing
margin curves of Fig. 2(c) and (d), respectively. Overall, given
the heightened sensitivity of PAM4 to residual ISI, these ISI
terms should be sufficiently cancelled to achieve the target
BER. A potential power-efficient approach is to not utilize
any CTLE and have a DFE-only receiver with a 1-FIR tap
to cancel the large first post-cursor and 2-IIR taps, with one
IIR tap optimized for the fast-decaying close-in ISI and the
other IIR tap optimized for the slow-decaying tail [10]. While
the equalized pulse response, in this case, displays minimal post-cursor ISI, there still exists a significant pre-cursor
term that is not cancelled by the DFE-only receiver. This
still results in a poor BER > 10^-3 for this moderate-loss
channel.
The large pre-cursor ISI term can potentially be cancelled
by utilizing a 2-tap transmit FFE in combination with the
DFE with 1-FIR and 2-IIR taps. However, utilizing a large
negative pre-cursor tap results in both attenuation of the main
cursor and additional ISI now present in the second pre-cursor
position. While this improves the BER to near 10^-9, forward
error correction (FEC) would still generally be required in
the system. Overall, achieving a better BER would require a
higher complexity transmitter with more pre-cursor taps.
Fixing the transmitter complexity to a 2-tap FFE, further
performance improvement is possible by introducing receive-side CTLE. However, as discussed in Section I, relying solely
on CTLE for long-tail ISI cancellation can require multiple
high-bandwidth stages that employ inductive peaking and
consume a large amount of area and power.
Fig. 2. Refined electrical channel. (a) S21 response. (b) Simulated 28-GS/s pulse responses and 56-Gb/s PAM4 (c) voltage and (d) timing margins with various equalizer configurations. Only the middle eye voltage margins are shown for clarity due to different top and bottom threshold levels changing with the varying equalizer configurations.
Nonetheless, a CTLE with high-frequency peaking offers the benefits of
cancelling pre-cursor ISI, boosting the main cursor, and cancelling
close-in post-cursor ISI in a manner similar to the
first DFE IIR tap. Thus, a mixed receive-side equalization
approach is investigated that consists of a single-stage CTLE,
with a conservative 6-dB high-frequency peaking optimized
to provide relative pre-cursor ISI attenuation, followed by a
DFE with 1-FIR tap for the large first post-cursor ISI and only
one IIR tap, starting at the second post-cursor, optimized for
the long-tail ISI. As depicted in the equalized pulse response,
although similar pre-cursor ISI cancellation is achieved, the
boost in the main cursor makes the system performance less
susceptible to residual ISI. The resultant bathtub curves show
a BER = 10^-12 with timing and voltage margins of 0.22 UI
and 18 mV, respectively. Note that at this performance level a
1-tap FIR, 2-tap IIR DFE-only receiver without CTLE would
require a much more complex 4-tap FFE transmitter. In the
65-nm CMOS technology utilized in this paper, it is estimated
that scaling a 56-Gb/s PAM4 transmitter from 2 taps to 4 taps
would consume an additional 97 mW. This is much higher
than the 11 mW consumed in the receive-side CTLE.
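As a rough illustration of why one FIR tap plus one IIR tap can cover a long post-cursor tail while leaving the pre-cursor untouched, the sketch below builds a synthetic sampled pulse response and subtracts a 1-tap FIR at the first post-cursor and a single-pole IIR response starting at the second post-cursor. This is not the measured channel of Fig. 2: the cursor values, the tail time constant, and all function names are illustrative assumptions, and only the 28-GBd symbol rate comes from the text.

```python
import math

UI_PS = 1e3 / 28.0  # symbol period in ps for 28 GBd (56-Gb/s PAM4)

def synthetic_pulse(n_post=10, main=1.0, pre=0.15, post1=0.45,
                    tail_amp=0.20, tail_tau_ps=100.0):
    """Hypothetical pulse response sampled at 1 UI: [pre, main, post1, post2, ...].
    Post-cursors 2..n_post follow an exponential tail (assumed values)."""
    tail = [tail_amp * math.exp(-(k - 2) * UI_PS / tail_tau_ps)
            for k in range(2, n_post + 1)]
    return [pre, main, post1] + tail

def dfe_residual(pulse, fir_w, iir_amp, iir_tau_ps):
    """Subtract the FIR tap at post-cursor 1 and the IIR tap from post-cursor 2 on.
    The pre-cursor (index 0) and main cursor (index 1) are not affected."""
    out = list(pulse)
    out[2] -= fir_w
    for k in range(2, len(pulse) - 1):  # k is the post-cursor index, stored at k + 1
        out[k + 1] -= iir_amp * math.exp(-(k - 2) * UI_PS / iir_tau_ps)
    return out

def post_cursor_isi(pulse):
    """Sum of absolute post-cursor terms."""
    return sum(abs(v) for v in pulse[2:])

pulse = synthetic_pulse()
matched = dfe_residual(pulse, fir_w=0.45, iir_amp=0.20, iir_tau_ps=100.0)
mistuned = dfe_residual(pulse, fir_w=0.45, iir_amp=0.20, iir_tau_ps=50.0)
```

With a matched time constant the post-cursor residual vanishes in this idealized model, whereas the pre-cursor term survives unchanged, which is the role left to the transmit FFE and CTLE in the discussion above.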
III. RECEIVER ARCHITECTURE
Based on the system analysis of Section II, a PAM4 receiver with
a single-stage CTLE and a 1-tap FIR, 1-tap IIR DFE is
proposed (Fig. 3). After the input CTLE, a quarter-rate DFE
follows, which consists of five samplers. Three data samplers
implement a 2-bit flash ADC for PAM4 symbol detection,
one error sampler periodically scans the top and bottom eyes
for threshold tuning, and one edge sampler provides information for both CDR phase locking and DFE tap adaptation.
The outputs of the four receiver slices are further deserialized
to 1/8 symbol rate, with the data and edge samples driving the
CDR’s PAM4 BBPD. At this point, the data samples are also
probed out for external BER testing. All the data, error, and
edge samples are then further deserialized to 1/32 symbol rate
for processing by the DFE tap and threshold adaptation logic.
A detailed block diagram of the equalizer data path is shown
in Fig. 4. Eight total quarter-rate phase clocks are used for
data and edge detection, with each receiver slice operating
with a single pair of data and edge clocks. This quarter-rate
architecture reduces clocking power and relaxes timing of the
current mode logic (CML) samplers by giving them extra time
to recover from previous decisions. PAM4 symbol detection
is performed with the three data samplers, with the middle
sampler threshold set to zero and the top and bottom samplers’
thresholds set to ±2/3 the post-equalized amplitude of the
received signal by the threshold adaptation circuitry. The error
sampler is clocked by the same data clock as the main three
data samplers and has a threshold that is periodically scanned
to track the top and bottom PAM4 eyes in order to provide
threshold adaptation information. Timing recovery and DFE
adaptation information are provided by the edge sampler,
whose threshold is set to zero.
Fig. 3. 56-Gb/s PAM4 receiver with threshold and DFE tap adaptation.
Fig. 4. Equalizer data path.
The DFE FIR tap, which cancels
the first post-cursor ISI, is efficiently realized by feeding back
the data samplers’ 3-bit thermometer-coded output bits directly
to three equally weighted summer inputs embedded in the data,
error, and edge samplers. This minimizes the DFE FIR tap
critical path delay to meet the stringent 1-UI timing. In order
to minimize the samplers’ internal loading, the DFE IIR tap
is subtracted from the sampler input with preceding CML
summers. This is possible due to the DFE IIR tap starting
at the second post-cursor to cancel the long-tail ISI. The IIR
tap signal is generated by re-serializing the quarter-rate data
samplers’ outputs and passing this full-rate data through a
low-pass filter.
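The direct thermometer-code feedback described above can be sketched behaviorally as follows. The three comparators use thresholds at zero and at ±2/3 of the signal amplitude, and each output bit steers one of three equally weighted tap pairs. The function names and the normalization (the full tap weight corresponds to an outer PAM4 level) are assumptions made for illustration, not details taken from the circuit.

```python
def slice_pam4(v, th1, th3, th2=0.0):
    """Three data samplers acting as a 2-bit flash ADC.
    Returns the thermometer code (bottom, middle, top)."""
    return (int(v > th1), int(v > th2), int(v > th3))

def fir_feedback(thermo, tap_weight):
    """Each thermometer bit steers one tap pair by +/- tap_weight / 3 (assumed
    normalization), so the total feedback takes four values between -w and +w."""
    return sum((tap_weight / 3.0) * (1 if bit else -1) for bit in thermo)

amplitude = 0.3                  # hypothetical post-equalized amplitude in volts
th = 2.0 * amplitude / 3.0       # top/bottom thresholds at +/- 2/3 of the amplitude
levels = [-amplitude, -amplitude / 3.0, amplitude / 3.0, amplitude]
codes = [slice_pam4(v, -th, th) for v in levels]
feedback = [fir_feedback(c, tap_weight=0.09) for c in codes]
```

Because the feedback is proportional to the detected symbol, no thermometer-to-binary decoding sits inside the 1-UI loop in this model.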
Fig. 5. Single-stage CTLE. (a) Schematic and simulated frequency response
with different (b) capacitor DAC and (c) resistor DAC settings.
Fig. 5 shows the single-stage CTLE with manually tunable
3-bit degeneration resistor and capacitor DACs. At a minimum
0-dB gain setting, tuning the capacitor provides up to 6 dB
of peaking close to the 14-GHz Nyquist frequency. This
bandwidth is achieved at a reasonable power efficiency by
employing shunt inductive peaking. Tuning the resistor DAC
provides near 6 dB of low-frequency gain control.
The choice of sampler topology is critical to achieving
reliable PAM4 operation at high data rates.
Fig. 6. CML sampler with DFE FIR-tap and threshold control (a) schematic, (b) simulated normalized FIR-tap offset weight versus differential input amplitude, (c) FIR tap weight and offset versus DAC code, and (d) impulse sensitivity function.
While strong-arm
samplers [15] and modified double-tail latch versions [16]
have advantages that include no dc power, small aperture
time, high gain, and CMOS-level outputs, their multi-stage
implementation can increase delay. Conversely, single-stage
CML samplers [17] can provide higher bandwidth and reduced
delay, while suffering from reduced gain and static power
consumption. Given that this design is targeted for an aggressive 56-Gb/s data rate for the 65-nm CMOS technology,
CML samplers are chosen. Fig. 6(a) shows the schematic
of the CML slicer with the embedded DFE FIR tap and
additional threshold/offset control pairs. In order to avoid
noise propagation in the DFE loop, the input signal must
be amplified sufficiently to make a true decision and fully
switch the CML slicer DFE FIR tap pairs [18]. As shown
in Fig. 6(b), the CML sampler requires a 14-mV differential
input amplitude to achieve switching in the feedback tap
pairs equivalent to 90% of the DFE tap weight at 56 Gb/s.
Independent DFE FIR-tap weights set the tail currents with
6-bit resolution on a per-slice basis to compensate for the
mismatch between the receiver slices, achieving more than
150 mV of range [Fig. 6(c)]. The sampler threshold and offset
are controlled through the DAC-generated Voff/Vth voltage with
7-bit (1-bit sign, 6-bit amplitude) resolution and a maximum
range of more than 250 mV [Fig. 6(c)]. The sampler has
a simulated 11-ps aperture time, obtained from the impulse
sensitivity function shown in Fig. 6(d).
Fig. 7 shows the single-IIR MUX that combines the thermometer quarter-rate data from all the slices and serializes it to
full rate using a current-mode architecture. A tunable RC load
implements the IIR filter, with the time constant controlled
from 22 to 175 ps through the coarse-tuning 3-bit resistor DAC
and the fine-tuning 3-bit capacitor DAC. IIR tap amplitude
control is achieved with the tunable tail current. Per-slice
summers combine the IIR tap signal with the CTLE output.
The input pair that is driven by the CTLE is degenerated to
achieve the required linear range, with additional −12- to 6-dB
gain control provided by the tunable degeneration resistor.
As the IIR tap cancellation starts from second post-cursor
and has a relatively small amplitude, its input pair does not
require degeneration. Here, the IIR summation is done on
a per-slice basis to minimize the summer to slicer routing
and isolate slicer kickback. At the default settings of 6-dB
CTLE peaking and 0-dB summer gain, the entire receiver front
end has a simulated 1-dB compression point at a 324-mV
differential input amplitude. This is sufficient to handle more
than 600 mVppd of input swing.
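The numbers in this paragraph can be related to the symbol period with a short calculation. The sketch below expresses the 22-ps to 175-ps IIR time-constant range in unit intervals and converts the 324-mV compression amplitude to a peak-to-peak differential swing. Reading "differential input amplitude" as the differential peak value (so that peak-to-peak is twice the amplitude) is an assumption, as are the helper names.

```python
BIT_RATE_GBPS = 56.0
BITS_PER_SYMBOL = 2                                 # PAM4
SYMBOL_RATE_GBD = BIT_RATE_GBPS / BITS_PER_SYMBOL   # 28 GBd
UI_PS = 1e3 / SYMBOL_RATE_GBD                       # symbol period in ps

def to_ui(t_ps):
    """Express a time in ps as a number of unit intervals."""
    return t_ps / UI_PS

def amplitude_to_ppd(amplitude_mv):
    """Assumption: amplitude is the differential peak, so ppd = 2 x amplitude."""
    return 2.0 * amplitude_mv

tau_range_ui = (to_ui(22.0), to_ui(175.0))   # IIR time-constant range in UI
p1db_ppd_mv = amplitude_to_ppd(324.0)        # compression point as mVppd
iir_tau_settings = 2 ** 3 * 2 ** 3           # 3-bit resistor DAC x 3-bit capacitor DAC
```

Under these assumptions the time constant spans roughly 0.6 UI to 4.9 UI, and the compression point corresponds to about 648 mVppd, consistent with the stated margin over a 600-mVppd input.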
The PLL-based CDR shown in Fig. 8 provides a power-efficient solution to both generate the eight quarter-rate clocks and
adjust their phase to track the incoming data. A BBPD receives
the 1/8 rate data and edge samples and filters out all but the
symmetric transitions to avoid asymmetric PAM4 transition-induced jitter [19]. In order to reduce loop latency, the BBPD
works with eight parallel early/late signals controlling an
eight-segment charge pump.
Fig. 7. DFE IIR-tap MUX, filter, and per-slice input summer.
Fig. 8. PAM4 PLL-based CDR architecture.
This parallel charge pump drives
the loop filter to produce the control voltage for a 14-GHz
LC voltage-controlled oscillator (VCO) [20]. In addition to
the primary resonator tank, oscillator phase noise is reduced
with tanks also in the source of both cross-coupled transistor
pairs [21]. After the VCO, quarter-rate clocks are generated
by a CML divide-by-two block and then converted to CMOS
levels. Static CMOS phase interpolators then efficiently generate the eight clock phases for the quarter-rate data and
edge samplers. Per-phase skew calibration is achieved with
tunable delay buffers preceding the samplers. The CDR loop
bandwidth is tuned through the loop filter’s 4-bit capacitor
DAC and the charge pump’s 3-bit current DAC. Minimal jitter
tolerance peaking is achieved with damping factor control
provided by the loop filter’s 4-bit resistor DAC.
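The transition filtering performed by the PAM4 BBPD can be illustrated with a small behavioral model. Symbols are represented as {-3, -1, +1, +3}, a transition is treated as symmetric when the two symbols are equal in magnitude and opposite in sign (so the ideal crossing is at zero), and all other transitions produce no vote. The early/late sign convention and the function names are assumptions for this sketch and are not taken from the implemented logic.

```python
def bbpd_vote(d_prev, edge_bit, d_next):
    """One bang-bang decision from two data symbols and the edge sample between them.
    d_prev, d_next: PAM4 symbols in {-3, -1, +1, +3}.
    edge_bit: output of the zero-threshold edge sampler (0 or 1).
    Returns +1 (clock late), -1 (clock early), or 0 (transition ignored)."""
    if d_prev != -d_next:
        return 0                      # not a symmetric transition
    edge_sign = 1 if edge_bit else -1
    next_sign = 1 if d_next > 0 else -1
    # Assumed convention: an edge sample that already matches the next symbol
    # means the edge clock arrived after the crossing, i.e. the clock is late.
    return 1 if edge_sign == next_sign else -1

def parallel_votes(data, edges):
    """Votes for consecutive symbol pairs; edges[i] lies between data[i] and data[i + 1]."""
    return [bbpd_vote(data[i], edges[i], data[i + 1]) for i in range(len(data) - 1)]

votes = parallel_votes([-3, 3, 1, -1, -1], [0, 1, 0, 1])
```

In a parallel implementation such as the one described above, each vote would drive one charge-pump segment, with zero votes leaving the corresponding segment idle.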
IV. THRESHOLD AND DFE TAP ADAPTATION
Given that PAM4 receiver sampler thresholds and equalization settings can vary with channel conditions, adaptive
tuning is necessary to support operation over a wide range of
channels.
Fig. 9. Background sampler threshold adaptation algorithm.
This process is further complicated in the presence
of transmitter and/or front-end non-linearity. This section
describes the background sampler threshold adaptation scheme
that utilizes only one additional per-slice sampler and the DFE
tap adaptation scheme that utilizes the CDR edge samplers’
information.
A. Sampler Threshold Adaptation
Transmitter- and receiver-side non-linearity, which is often
compressive, can result in margin reduction for the top and
bottom eyes. This necessitates increased accuracy in the
top/bottom eye threshold placement relative to the nominally
zero threshold center eye. Threshold adaptation is achieved
with the error sampler periodically estimating the top and
bottom eye heights in order to place the data thresholds
TH1,3 in the middle of these eyes. An initial foreground
calibration step is performed where all 20 samplers are set
to zero offset/threshold by shorting the input to the common
mode and adjusting the per-sampler Voff/Vth DAC codes.
On a per-slice basis, the top sampler’s threshold TH3 is then
incremented up by 1 LSB (Error Offset 1) to come to the
initial condition shown in Fig. 9. The initial coarse adaptation
steps are based on uniform symbol statistics, with both the
top sampler TH3 and error THER increased until a 25% one
detection probability is achieved by the error sampler. Also,
in parallel, the bottom sampler TH1 is stepped at the same
rate in an open-loop manner to improve convergence speed.
The data statistics are computed by averaging 256 symbols per
slice for each threshold step decision. This value is selected
to minimize convergence time with minimal threshold code
dithering. At the end of State 1, the error sampler THER,1 is
residing at the bottom of the top eye and the top sampler TH3
is 1 LSB inside the eye. Next, the polarity of the error sampler
threshold is inverted and then fine-tuned to converge to the top
of the bottom eye based on a 25% zero detection criterion,
with the bottom sampler TH1 following the error sampler
THER,2 by −1 LSB difference (State 2). This independent
top and bottom threshold tuning eliminates errors caused by
PAM4 asymmetry and level spacing mismatch.
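The coarse search of State 1 can be sketched as a simple loop: with uniformly distributed PAM4 symbols, the error sampler reports ones for 50% of the symbols at a zero threshold and for 25% once its threshold has risen above the second-highest level, that is, at the bottom of the top eye. The model below is noiseless and reuses one 256-symbol window for every step; the level values, the 4-mV LSB, and the function names are assumptions for illustration only.

```python
def ones_fraction(samples, threshold):
    """Fraction of samples for which the error sampler outputs a one."""
    return sum(1 for v in samples if v > threshold) / len(samples)

def coarse_top_search(samples, lsb, max_code=63, target=0.25):
    """Step the error-sampler threshold up from code 0 until its ones fraction
    drops to the target. TH3 leads the error threshold by 1 LSB, so it ends
    1 LSB inside the top eye. Returns (THER code, TH3 code)."""
    code = 0
    while code < max_code and ones_fraction(samples, code * lsb) > target:
        code += 1
    return code, code + 1

# 256 symbols with uniform statistics: 64 occurrences of each level (in mV).
window = [-150.0, -50.0, 50.0, 150.0] * 64
ther_code, th3_code = coarse_top_search(window, lsb=4.0)
```

With these assumed values the search stops at the first code whose threshold exceeds the 50-mV level, which marks the lower edge of the top eye.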
In order not to rely on uniform statistics, the process
then transitions to monitoring the relative values of the error
sampler and the bottom/top samplers to track the eye edges in
States 3 and 4, respectively. It should be noted that at the
end of the first State 4, the top and bottom slicers are in
a sub-optimal position inside the eye. While ideally the top
and bottom samplers should be following the error sampler
with ±1/2 eye height, respectively, in States 4 and 3, due to
the lack of eye height estimation at this point, there is only
±1 LSB difference. Next, in order to get an estimation of
the top eye height in State 5, the data samplers’ thresholds
are fixed and the error sampler THER,5 is increased until a
discrepancy is detected between the error sampler and the top
sampler outputs. This implies that the error sampler THER,5
has reached the top edge of the top eye at the end of State
5. This is repeated to find the error sampler THER,6 that
corresponds to the bottom of the bottom eye at the end of
State 6. At this point, the top and bottom eye heights are now
independently found. This results in the following optimum
top and bottom threshold settings:
\begin{align}
\mathrm{TH}_{1} &= \frac{\mathrm{TH}_{ER,3} + \mathrm{TH}_{ER,6}}{2} \tag{1} \\
\mathrm{TH}_{3} &= \frac{\mathrm{TH}_{ER,4} + \mathrm{TH}_{ER,5}}{2}. \tag{2}
\end{align}
Fig. 10. PAM4 DFE FIR and IIR-tap adaptation logic tables.
Next, the TH1 and TH3 thresholds are placed in the middle
of their corresponding eye when the process goes back to
States 3 and 4 for monitoring of the top of the bottom eye
and bottom of the top eye, respectively. The algorithm then
periodically rotates between States 3–6 to track eye height
and optimal threshold position. While not implemented in this
prototype, the middle eye threshold could also be adjusted
with no front-end hardware overhead by adding extra middle
eye monitoring states in the threshold adaptation algorithm.
However, due to the reduced non-linearity present in the
smaller level middle eye, this is not as critical as the top and
bottom eye threshold placement.
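The midpoint placement of (1) and (2) amounts to averaging the two error-sampler thresholds that bracket each eye. The short sketch below applies this to a hypothetical set of eye-edge values in which the top and bottom eyes are not mirror images, as could occur with compressive non-linearity. The numeric values and the function name are illustrative assumptions.

```python
def midpoint_thresholds(th_er3, th_er4, th_er5, th_er6):
    """Apply (1) and (2): place TH1 and TH3 mid-way between the measured eye edges.
    th_er3: top of the bottom eye (State 3)    th_er6: bottom of the bottom eye (State 6)
    th_er4: bottom of the top eye (State 4)    th_er5: top of the top eye (State 5)"""
    th1 = (th_er3 + th_er6) / 2
    th3 = (th_er4 + th_er5) / 2
    return th1, th3

# Hypothetical eye edges in mV with unequal top and bottom eyes.
th1, th3 = midpoint_thresholds(th_er3=-52.0, th_er4=56.0, th_er5=140.0, th_er6=-128.0)
top_eye_height = 140.0 - 56.0
bottom_eye_height = -52.0 - (-128.0)
```

Because each threshold depends only on the edges of its own eye, the result does not assume equal PAM4 level spacing or TH1 = -TH3.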
B. DFE FIR and IIR Tap Adaptation
The edge-based DFE tap background adaptation logic tables
are shown in Fig. 10 and are modified from [13] to allow
for PAM4 operation and independent per-slice DFE FIR-tap
control. Similar to the BBPD logic, the DFE tap adaptation works with symmetric PAM4 data transitions in order
to improve convergence. When a symmetric transition is
detected, the correlation between the edge sample and the
sign of the previous symbols determines the residual ISI
polarity from the corresponding symbol. Adjusting the DFE
tap weights based on this edge information works to maximize
the horizontal timing margin, which correlates with improved
vertical eye height.
Fig. 11. Chip micrograph of the 56-Gb/s PAM4 receiver.
Given that the CTLE is optimized for
pre-cursor cancellation, the DFE FIR-tap is adapted to cancel
the large first post-cursor ISI and the DFE-IIR tap, which starts
with an additional 1-UI delay, is adapted to cancel the long
tail ISI that is considered to start at the second post-cursor
location.
As the DFE FIR-tap cancels the first post-cursor, if the
D−1 symbol polarity matches the edge sample ISI polarity,
this implies that the tap value is too small and the FIR-tap
counter is incremented and vice versa. As PAM4 receivers
require improved sensitivity, independent per-slice adaptation is implemented for the DFE FIR-taps to compensate
for mismatch in the four receiver slices. The DFE IIR-tap
amplitude is set in a similar manner utilizing the D−2 polarity, as this IIR tap compensates for long-tail ISI after the
first post-cursor. Adjustment of the DFE IIR-tap time constant is determined by the correlation from either D−3 or
D−4 and the edge sample. The use of one common DFE
IIR-tap mux allows for the adaptation of only a single set of
IIR values.
Due to the increased sensitivity of PAM4 to residual ISI,
a larger 6-bit resolution is used for all the DFE FIR tap weight,
IIR amplitude, and IIR time constant settings relative to a
previous NRZ receiver that utilized 5-bit resolution [13]. For
each code step, the DFE tap adaptation logic makes decisions
based on the correlation between edge and data samples
averaged over 256 symbols per slice for the FIR settings and
1024 symbols for the IIR weight and time constant settings.
Similar to the threshold adaptation procedure, these values
are selected to minimize convergence time with minimal code
dithering.
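The sign-based update described in this subsection can be sketched as a vote per symmetric transition followed by a majority decision over the averaging window. The logic tables of Fig. 10 are not reproduced here, so the symbol indexing (a reference symbol such as D−1 or D−2, followed by the two symbols that form the transition), the majority rule, and the function names are assumptions. Only the increment-on-match rule, the 6-bit resolution, and the 256- and 1024-symbol windows come from the text.

```python
def sgn(x):
    return 1 if x > 0 else -1

def tap_vote(d_ref, trans_from, trans_to, edge_bit):
    """One vote from an edge sample taken on a symmetric transition.
    d_ref: the earlier symbol whose residual ISI is examined (for example D-1
    for the FIR tap or D-2 for the IIR amplitude).
    Returns +1 (tap too small), -1 (tap too large), or 0 (transition ignored)."""
    if trans_from != -trans_to:
        return 0                          # ideal edge value is not zero, skip
    isi_sign = 1 if edge_bit else -1      # polarity of the residual ISI at the edge
    return 1 if sgn(d_ref) == isi_sign else -1

def update_tap(code, votes, bits=6):
    """Majority decision over one window (256 symbols for the FIR tap, 1024 for
    the IIR settings per the text), saturating at the DAC code limits."""
    total = sum(votes)
    step = (total > 0) - (total < 0)
    return max(0, min((1 << bits) - 1, code + step))

new_code = update_tap(10, [1] * 200 + [-1] * 56)
```

Running one such counter per slice for the FIR tap, and a single shared set for the IIR amplitude and time constant, mirrors the per-slice and common-mux arrangement described above.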
V. EXPERIMENTAL RESULTS
Fig. 11 shows the chip micrograph of the PAM4 receiver,
which was fabricated in a GP 65-nm CMOS process and
occupies a total active area of 0.51 mm^2. The CTLE is placed
close to the bottom right input pads. At the CTLE output
is the DFE circuitry that is followed by the deserialization
logic where the data signals are multiplexed out of the chip
for BER testing.
Fig. 12. PAM4 receiver test setup and channel responses.
The CDR circuitry is above this, with the
LC VCO placed roughly in the middle of the chip and the
clock divider and phase generator placed near the DFE to
minimize routing. In addition to the clocks from the CDR,
the receiver can also be tested via an external bypass clock to
measure timing margin. This clock comes in from the bottom
of the chip through an inductively peaked clock buffer. As the
synthesized threshold and DFE tap adaptation logic run at
a relatively low speed, it is placed at the top of the chip.
An on-chip DAC is included to provide an analog monitor
of the sampler thresholds and DFE tap coefficient adaptation
convergence behavior.
The receiver is characterized utilizing two test channels in
the experimental setup shown in Fig. 12. Channel 1 consists
of a 2-in FR4 test channel trace and the 1-in Rogers RX board
trace, with SMA cables between the pattern generator and
boards and three total sets of SMA connectors. Channel 2
has the same components, except for a longer 4-in FR4 test
channel trace. A PAM4 pattern generator with 1-main and
1-pre-cursor FFE taps generates PRBS15 data which passes
through Channel 1 and Channel 2 with 16.1 and 20.8 dB of
loss at 14 GHz, respectively. The on-die 1/8 rate data MUX
at the receiver output allows for independent verification of
the MSB or LSB outputs with an NRZ BER tester. In order
to measure timing bathtub curves, the CDR is bypassed and
the receiver is clocked with an external half-rate clock from a
pattern generator. This allows for a programmable phase shift
to measure the BER at different sampling times. Fig. 13(a)
shows the transmitter PAM4 pre-channel eye diagram without
any equalization with 600-mVppd swing. Co-optimizing the
2-tap pre-cursor FFE with the receiver equalization results in a
completely closed eye at the output of Channel 2 [Fig. 13(b)].
For all the subsequent reported testing results, the receiver
CTLE is set to have 0-dB dc gain and 6-dB high-frequency
peaking.
Fig. 13. (a) 56-Gb/s eye diagram before Channel 2 without equalization and (b) after Channel 2 with 2-tap pre-cursor FFE.
Utilizing the on-chip monitor DAC, Fig. 14 shows the DFE
tap coefficients and sampler thresholds convergence for both
channels. All of the DFE taps settle within 2 µs, with the
higher loss Channel 2 settings displaying both higher FIR and
IIR tap values. Both channels have FIR[1:4] tap values that
are slightly different due to mismatches in the receiver slices
and residual skew between the sampling clock phases. The
initial threshold adaptation procedure completes within 16 µs,
with the higher loss Channel 2 having lower absolute threshold
values due to the reduced swing at the sampler input. After
the initial convergence, the error samplers continue to scan the
top and bottom eyes for background threshold optimization.
The CDR bypass mode is used to measure the receiver’s
combined MSB/LSB BER timing bathtub curves with various
receiver equalization configurations for Channels 1 and 2, as
shown in Fig. 15(a) and (b), respectively. Utilizing a DFE
is a necessity, as both channels display poor BER with the
CTLE-only configuration. While an optimized combination of
CTLE and a DFE with only the FIR tap enabled allows a
BER < 10^-7 for Channel 1, the BER is worse than 10^-2 for
Channel 2 due to significant ISI from the second post-cursor
and beyond. Enabling the DFE IIR tap allows for efficient
long-tail ISI cancellation to achieve 0.22 UI and 0.19 UI
timing margin at a BER = 10^-12 for Channels 1 and 2,
respectively.
Fig. 14. Measured DFE tap adaptation operating over (a) Channel 1 and (b) Channel 2, and measured sampler threshold adaptation operating over (c) Channel 1 and (d) Channel 2. The edge sampler values are omitted and only error sampler 1 is shown for clarity.
Fig. 15. Measured 56-Gb/s receiver timing bathtub curves operating over (a) Channel 1 and (b) Channel 2 and receiver voltage bathtub curves utilizing a 2-tap TX FFE and RX CTLE and 1-tap FIR 1-tap IIR RX DFE operating over (c) Channel 1 and (d) Channel 2.
It should be noted that the transmitter-side FFE
still plays a critical role in pre-cursor cancellation, as only
a BER > 10^-7 is achieved for Channel 2 without transmitter
FFE. Utilizing the maximum receiver equalization configuration, the CDR is then enabled. This involves initially coarsely
setting the VCO frequency by manually forcing the loop filter
output voltage through an analog MUX and then allowing
the closed-loop PLL-based CDR to achieve phase lock. The
682
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 54, NO. 3, MARCH 2019
TABLE I
PAM4 R ECEIVER P ERFORMANCE C OMPARISONS
Fig. 16. Measured PAM4 jitter tolerance (BER = 10−9 ) operating over
Channel 2.
Fig. 17.
data sampler thresholds are then manually adjusted from their
converged values to measure the voltage bathtub curves of
Fig. 15(c) and (d). Worst eye voltage margins of 23 and
14 mV are achieved at a BER = 10−12 for Channel 1
and Channel 2, respectively. Jitter tolerance measurements
were also performed using the high-loss Channel 2 with the
CDR enabled and the background equalization and threshold
adaptation running. Fig. 16 shows that the CDR has more
than 6 MHz of bandwidth with 0.12 UI of high-frequency
jitter tolerance at a BER = 10−9 . This exceeds the CEI-56G-
56-Gb/s receiver power breakdown.
VSR mask with margin, as the specification only requires a
BER = 10−6 .
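For reference, the UI-based margins quoted in the text can be converted to absolute time. The sketch below assumes that one UI is one PAM4 symbol period (2 bits per symbol, so 28 GBd at 56 Gb/s) and that a quarter-rate clock runs at one quarter of the symbol rate. The helper names are illustrative only.

```python
DATA_RATE_BPS = 56e9   # aggregate data rate stated in the paper
BITS_PER_SYMBOL = 2    # PAM4 carries 2 bits per symbol


def symbol_rate_baud(data_rate_bps=DATA_RATE_BPS, bits_per_symbol=BITS_PER_SYMBOL):
    """Symbol (baud) rate for a multi-level signal."""
    return data_rate_bps / bits_per_symbol


def ui_to_ps(margin_ui, data_rate_bps=DATA_RATE_BPS, bits_per_symbol=BITS_PER_SYMBOL):
    """Convert a margin in unit intervals to picoseconds (assumes 1 UI = 1 symbol)."""
    ui_seconds = 1.0 / symbol_rate_baud(data_rate_bps, bits_per_symbol)
    return margin_ui * ui_seconds * 1e12


def quarter_rate_clock_hz(data_rate_bps=DATA_RATE_BPS, bits_per_symbol=BITS_PER_SYMBOL):
    """Clock frequency of a quarter-rate receiver (assumed symbol rate / 4)."""
    return symbol_rate_baud(data_rate_bps, bits_per_symbol) / 4.0


if __name__ == "__main__":
    print(f"symbol rate: {symbol_rate_baud() / 1e9:.1f} GBd")
    print(f"quarter-rate clock: {quarter_rate_clock_hz() / 1e9:.1f} GHz")
    for label, margin in [("Channel 1 timing margin", 0.22),
                          ("Channel 2 timing margin", 0.19),
                          ("high-frequency jitter tolerance", 0.12)]:
        print(f"{label}: {margin:.2f} UI = {ui_to_ps(margin):.2f} ps")
```

Under these assumptions one UI is about 35.7 ps, so the 0.22 UI and 0.19 UI timing margins correspond to roughly 7.9 ps and 6.8 ps, and the 0.12 UI jitter tolerance to roughly 4.3 ps.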
Fig. 17 shows the 56-Gb/s receiver power breakdown. The
receiver consumes 259 mW of power, with the CML comparators
and clocking circuits contributing the most. Table I
summarizes the receiver performance and compares it with
other PAM4 receivers operating at 32 Gb/s and above. The receiver
achieves a power efficiency of 4.63 mW/Gb/s, which is superior
to both the ADC-based design of [5] and the mixed-signal front
end of [7], which utilizes a two-stage CTLE and an additional TX
FFE tap. Employing the DFE IIR tap allows for a reduction
in the total tap count relative to [6], while also extending
the maximum supported channel loss. Compared to [11],
the presented work extends the maximum achievable data rate
in a similar process.
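As a quick consistency check, the power efficiency follows from dividing the total power by the data rate. The sketch below uses the rounded 259 mW and 56 Gb/s figures from the text. The function name is illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Power efficiency in mW/Gb/s, which is numerically equal to pJ/bit."""
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    efficiency = energy_per_bit_pj(259.0, 56.0)
    print(f"{efficiency:.3f} mW/Gb/s (equivalently pJ/bit)")
```

The result agrees with the quoted 4.63 mW/Gb/s to within the rounding of the 259 mW figure.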
VI. CONCLUSION
This paper has presented a 56-Gb/s PAM4 quarter-rate
receiver that efficiently compensates for moderate channel loss
in a robust manner through background adaptation of the
receiver thresholds and equalization taps. Combining a CTLE
and a DFE with only one FIR tap and one IIR tap allows for low
receive-side equalization complexity and operation with only a
2-tap transmit-side FFE. In addition to the three main per-slice
data samplers, a single edge sampler is utilized for both CDR
phase detection and DFE tap adaptation with independent
per-slice values for the required PAM4 sensitivity. Sampler
threshold adaptation is also achieved with a single per-slice
error sampler that periodically scans the top and bottom
PAM4 eyes. Overall, the proposed PAM4 receiver architecture
enables transmission over channels with up to 20.8 dB of loss
at Nyquist with only a 2-tap pre-cursor transmitter-side FFE,
while improving the power efficiency compared to the
state-of-the-art receivers operating at similar data rates over channels
with comparable channel loss.
ACKNOWLEDGMENT
The authors would like to thank Texas Instruments for
providing laboratory equipment access and S. Finn, P. Crinion,
T. K. Chin, W. Haque, K. Jakoush, and A. Rane for measurement assistance.
REFERENCES
[1] CEI-56G-VSR_PAM4 Very Short Reach Interface, document OIF
2014.230.07, Optical Internetworking Forum, Jun. 2016.
[2] IEEE P802.3bs 200 Gb/s and 400 Gb/s Ethernet Task Force. Accessed:
Nov. 2016. [Online]. Available: http://www.ieee802.org/3/bs/
[3] P. Upadhyaya et al., “A fully adaptive 19-to-56Gb/s PAM-4 wireline
transceiver with a configurable ADC in 16nm FinFET,” in IEEE ISSCC
Dig. Tech. Papers, Feb. 2018, pp. 108–110.
[4] S. Kiran, S. Cai, Y. Luo, S. Hoyos, and S. Palermo, “A 32 Gb/s ADC-based PAM-4 receiver with 2-bit/stage SAR ADC and partially-unrolled
DFE,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2018,
pp. 1–4.
[5] Y. Frans et al., “A 56-Gb/s PAM4 wireline transceiver using a 32-way
time-interleaved SAR ADC in 16-nm FinFET,” IEEE J. Solid-State
Circuits, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
[6] J. Im et al., “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct
decision-feedback equalization in 16-nm FinFET,” IEEE J. Solid-State
Circuits, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
[7] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, “A 56 Gb/s PAM-4/NRZ
transceiver in 40 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers,
Feb. 2017, pp. 110–111.
[8] O. Elhadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, “A 32 Gb/s
0.55 mW/Gbps PAM4 1-FIR 2-IIR tap DFE receiver in 65-nm CMOS,”
in Proc. Symp. VLSI Circuits, Jun. 2015, pp. C224–C225.
[9] S. Shahramian and A. C. Carusone, “A 0.41 pJ/Bit 10 Gb/s hybrid 2 IIR
and 1 discrete-time DFE tap in 28 nm-LP CMOS,” IEEE J. Solid-State
Circuits, vol. 50, no. 7, pp. 1722–1735, Jul. 2015.
[10] A. Roshan-Zamir, O. Elhadidy, H.-W. Yang, and S. Palermo, “A reconfigurable 16/32 Gb/s dual-mode NRZ/PAM4 SerDes in 65-nm CMOS,”
IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2430–2447, Sep. 2017.
[11] L. Tang, W. Gai, L. Shi, X. Xiang, K. Sheng, and A. He, “A 32 Gb/s
133 mW PAM-4 transceiver with DFE based on adaptive clock phase
and threshold voltage in 65 nm CMOS,” in IEEE ISSCC Dig. Tech.
Papers, Feb. 2018, pp. 114–116.
[12] A. Roshan-Zamir et al., “A 56 Gb/s PAM4 receiver with low-overhead
threshold and edge-based DFE FIR and IIR-tap adaptation in 65 nm
CMOS,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2018,
pp. 1–4.
[13] S. Shahramian, B. Dehlaghi, and A. C. Carusone, “A 16 Gb/s 1 IIR
+ 1 DT DFE compensating 28 dB loss with edge-based adaptation
converging in 5 µs,” in IEEE ISSCC Dig. Tech. Papers, Jan./Feb. 2016,
pp. 410–411.
[14] A. Roshan-Zamir, O. Elhadidy, H. W. Yang, and S. Palermo,
“A 16/32 Gb/s dual-mode NRZ/PAM4 SerDes in 65 nm CMOS,” in
Proc. IEEE Compound Semiconductor Integr. Circuit Symp. (CSICS),
Oct. 2016, pp. 1–4.
[15] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714,
Nov. 1996.
[16] E. Mensink, D. Schinkel, E. A. M. Klumperink, E. van Tuijl, and
B. Nauta, “Power efficient gigabit communication over capacitively
driven RC-limited on-chip interconnects,” IEEE J. Solid-State Circuits,
vol. 45, no. 2, pp. 447–457, Feb. 2010.
[17] M. Mizuno et al., “A GHz MOS adaptive pipeline technique using
MOS current-mode logic,” IEEE J. Solid-State Circuits, vol. 31, no. 6,
pp. 784–791, Jun. 1996.
[18] Y. Lu and E. Alon, “Design techniques for a 66 Gb/s 46 mW 3-tap
decision feedback equalizer in 65 nm CMOS,” IEEE J. Solid-State
Circuits, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
[19] N. Qi et al., “A 51Gb/s, 320mW, PAM4 CDR with baud-rate sampling
for high-speed optical interconnects,” in Proc. IEEE Asian Solid-State
Circuits Conf. (A-SSCC), Nov. 2017, pp. 89–92.
[20] J. Lee and B. Razavi, “A 40-Gb/s clock and data recovery circuit in
0.18-µm CMOS technology,” IEEE J. Solid-State Circuits, vol. 38,
no. 12, pp. 2181–2190, Dec. 2003.
[21] E. Hegazi, H. Sjoland, and A. A. Abidi, “A filtering technique to lower
LC oscillator phase noise,” IEEE J. Solid-State Circuits, vol. 36, no. 12,
pp. 1921–1930, Dec. 2001.
Ashkan Roshan-Zamir (S’14) received the B.Sc.
and M.Sc. degrees in electrical engineering from
the University of Tehran, Tehran, Iran, in 2010 and
2013, respectively, and the Ph.D. degree in electrical
engineering from Texas A&M University, College
Station, TX, USA, in 2018.
He was a Design Intern with Samsung Semiconductor Inc., San Jose, CA, USA, in 2015, where
he was involved in the design of clock and data
recovery systems. He was a Design Intern with
Texas Instruments Incorporated, Duluth, GA, USA,
in 2017, where he was involved in the design of integrated circuits for
high-speed wireline communication. Since 2018, he has been a Design
Engineer with Texas Instruments Incorporated, Santa Clara, CA, USA, where
he is involved in designing integrated mixed-signal circuits and circuits for
wireline and optical communication. His current research interests include
analog and mixed-signal integrated circuits, high-speed circuits for electrical
and optical communication, and clock and data recovery circuits.
Takayuki Iwai received the B.Eng. and M.Eng.
degrees from Waseda University, Tokyo, Japan,
in 2004 and 2006, respectively.
In 2006, he joined Toshiba Corporation, Kawasaki,
Japan, where he was involved in the design of
embedded DRAM, stacked chip SoC DRAM, and
high-speed I/O circuits. From 2016 to 2018, he was
a Visiting Scholar with Texas A&M University,
College Station, TX, USA, where he was involved in
research on high-speed electrical and optical transceiver circuits. Since 2017, he has been with Toshiba
Memory Corporation, Kawasaki. His current research interests include high-speed analog and mixed-signal integrated circuits, high-speed electrical and
optical transceiver circuits, and high-speed clocking circuits.
Yang-Hang Fan received the B.S. degree in
engineering and system science and the M.S.
degree from the Institute of Electronics Engineering,
National Tsing Hua University, Hsinchu, Taiwan,
in 2007 and 2009, respectively. He is currently
pursuing the Ph.D. degree in electrical engineering
with Texas A&M University, College Station, TX,
USA.
From 2011 to 2015, he was with Faraday Technology, Hsinchu, Taiwan, where he worked on
the design of mixed-signal integrated circuits for
high-speed wireline communication. Since 2016, he has been a Research
Assistant with the Analog and Mixed Signal Center, Texas A&M University.
Since 2018, he has been an Intern with Hewlett Packard Enterprise, Palo
Alto, CA, USA. His current research interests include mixed-signal integrated
circuits and high-speed electrical and optical link circuits.
Ankur Kumar (S’18) received the B.E. degree
(Hons.) in electrical and electronics engineering and
the M.Sc. degree (Hons.) in mathematics from the
Birla Institute of Technology and Science, Pilani,
India, in 2014. He is currently pursuing the Ph.D.
degree in electrical engineering with Texas A&M
University, College Station, TX, USA.
In 2014, he was a Design Intern with STMicroelectronics Pvt. Ltd., Greater Noida, India. From
2014 to 2016, he was a Senior Systems Engineer
with Hewlett Packard Enterprise, Bangalore, India.
In 2016, he joined Texas A&M University. Since 2018, he has been a
Design Intern with Texas Instruments Incorporated, Duluth, GA, USA, where
he is involved in the design of integrated circuits for high-speed wireline
communication. His current research interests include the design of high-speed
and low-power circuits for electrical and optical communication and clock and
data recovery circuits.
Hae-Woong Yang (S’13) was born in Seoul,
South Korea. He received the B.S. and M.E.
degrees in electrical and computer engineering from
Texas A&M University, College Station, TX, USA,
in 2007 and 2009, respectively, and the Ph.D. degree
from the Analog and Mixed Signal Center, Texas
A&M University, in 2018.
His interests are in low-power high-speed electrical link circuits, clock generation circuits, and signal
integrity.
Dr. Yang was a co-recipient of the Student Best
Paper Award in the 2014 Midwest Symposium on Circuits and Systems.
Lee Sledjeski received the B.S. degree in electrical engineering from the University of Connecticut,
Storrs, CT, USA, in 1989.
He is currently a Staff Applications Engineer with
Texas Instruments (TI) Incorporated, Duluth, GA,
USA, and is heavily involved in high-speed interfaces and direct customer support. He works closely
with other engineers at TI to develop signal conditioning solutions for multi-gigabit communication
and storage standards.
John Hamilton, photograph and biography not available at the time of
publication.
Soumya Chandramouli (S’05–M’08) received the
B.S. degree in electrical and computer engineering
(summa cum laude) from the Lafayette College,
Easton, PA, USA, in 2002, and the M.S. and Ph.D.
degrees in electrical and computer engineering from
the Georgia Institute of Technology, Atlanta, GA,
USA, in 2004 and 2007, respectively.
In 2008, she joined the Interface Group, National
Semiconductor, Duluth, GA, USA, as an Analog
Circuit Designer. She is currently an Analog Design
Manager and a Design Lead in the high-speed signal
conditioning product line and a Member, Group Technical Staff, with
Texas Instruments Incorporated, Duluth. She has authored or co-authored
15 conference and journal papers. She holds three patents.
Dr. Chandramouli was a recipient of the MTT Undergraduate Research
Scholarship in 2001. She has served as a reviewer for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS: EXPRESS BRIEFS.
Arlo Aude (S’93–M’95) received the Bachelor of Science
degree in electrical engineering from the
Georgia Institute of Technology, Atlanta, GA, USA,
in 1995.
He worked with Harris Semiconductor, Melbourne, FL, USA, as a Test Engineer. He is currently
a Senior Member of Technical Staff and a Design
Technologist for the High-Speed Signal Conditioning Group, Texas Instruments Incorporated, Atlanta,
GA, USA. He is an Eagle Scout with The Boy
Scouts of America, Tampa, FL, USA, and an alumnus of Psi Upsilon Fraternity. He is the inventor or co-inventor of over
35 patents in many disciplines. He has authored or co-authored over 12 journal
and industry papers.
Samuel Palermo (S’98–M’07–SM’17) received the
B.S. and M.S. degrees in electrical engineering from
Texas A&M University, College Station, TX, USA,
in 1997 and 1999, respectively, and the Ph.D. degree
in electrical engineering from Stanford University,
Stanford, CA, USA, in 2007.
From 1999 to 2000, he was with Texas Instruments
Incorporated, Dallas, TX, USA, where he worked
on the design of mixed-signal integrated circuits for
high-speed serial data communication. From 2006 to
2008, he was with Intel Corporation, Hillsboro, OR,
USA, where he worked on high-speed optical and electrical I/O architectures.
In 2009, he joined the Electrical and Computer Engineering Department, Texas
A&M University, where he is currently an Associate Professor. His research
interests include high-speed electrical and optical interconnect architectures,
RF photonics, high-performance clocking circuits, and integrated sensor
systems.
Dr. Palermo is a member of Eta Kappa Nu. He has also previously served
as a Distinguished Lecturer for the IEEE Solid-State Circuits Society and on
the IEEE CASS Board of Governors. He was a recipient of the 2013 NSF-CAREER Award, the Texas A&M University Department of Electrical and
Computer Engineering Outstanding Professor Award in 2014, the Best Student
Paper at the 2014 Midwest Symposium on Circuits and Systems, the Engineering Faculty Fellow Award in 2015, and the Best Student Paper at the
2016 Dallas Circuits and Systems Conference. He was a co-recipient of the
Jack Raper Award for Outstanding Technology-Directions Paper at the 2009
International Solid-State Circuits Conference. He was an Associate Editor of
the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II. He is currently
an Associate Editor of IEEE SOLID-STATE CIRCUITS LETTERS.