IBM Research GmbH
Zurich Research Laboratory
Rüschlikon, Switzerland
Low-power High-Speed CMOS I/Os:
Design Challenges and Solutions
Thomas Toifl
TWEPP 2012
Sept. 20th, 2012
I/O Link Technology group @ IBM ZRL
What we do
- I/O PHY circuit design
- Backplane and chip-to-chip links (e.g. processor, memory)
[Figure: link between two CPUs through package, socket, board, connector and card, with 6” and 24” traces]
Technical goals to meet
- Distance: 10-100cm
- Attenuation: 10-40dB
- Data rate: 3-40 Gbit/s
- Energy/bit: <10pJ/bit
- Chip area: 0.1 mm2
General research directions
Data rate
(e.g. 28Gb/s )
Power efficiency
(e.g. 5pJ/bit)
Distance (= allowable channel attenuation ~ equalizer complexity)
(e.g. 28 Gbit/s @ 30dB channel attenuation measured at fbaud/2)
Side conditions: chip area, testability, standard compliance, reliability etc.
Outline
▪ Introduction
• High-Speed I/Os in the wireline link space
• Equalization techniques
▪ Low-power Transmitter
• SST vs. CML driver
• Ultra-low power TX: avoid TX-FFE
▪ Low-power Receiver
• Power-optimal amplification
• Incomplete Settling / Integrating Buffer
• Low-power DFE Architecture
• Low-power clock generation
▪ Digital approaches
• ADC
• DFE implementations
• Analog vs. Digital
High-speed I/Os compared with other wireline links
Parameter | High-speed serial (backplane or chip-to-chip) | Gbit Ethernet 1000Base-T (802.3ab) | 10Gbit Ethernet 10GBase-T (802.3an)
Bidirectional | - | YES | YES
Data rate/lane [Gbps] | 3-20 | 2 * 0.25 | 2 * 2.5
Rate/channel BW (HSS = 1) | 1 | 2 | 10
Coding | NRZ | PAM-5 + Trellis | PAM-16 + LDPC
Equalization | ≤5 tap FFE, ≤15 tap DFE | 12 tap FFE, 14 tap DFE | 16 tap THP
Echo cancellation | - | 60 taps | >200 taps
X-talk cancellation | - | 75 tap NEXT | NEXT+FEXT
Energy/bit | x1 | x10 | x50
Case #1 High-speed I/Os (up to now): Simple, analog, rather low-power
Case #2 Ethernet over copper: Digital (ADC) + lots of signal processing
Goal: More complex equalization at low power
I/O Standards Roadmaps versus Data Rate
[Figure: data rate (0-60 Gb/s) per standard for FibreChannel, Ethernet, InfiniBand, PCI-Express, OIF CEI and SAS; legend: Available, In Development, Proposed]
Current standards mostly in the 10-14Gb/s range
– OIF CEI pushing aggressively to 25G, still in “early adoption” phase
Expect most “major” standards to move to 25Gb/s regime within
the next 12-18 months
– Ethernet, InfiniBand, FibreChannel all in development at those speeds
© 2012 IBM Corporation
Background: Limited Bandwidth Channels Create Inter-Symbol
Interference; Equalization Required to Support High Data Rates
Dispersive nature of channel causes significant spreading
of data pulse
[Figure: received pulse shapes relative to the decision threshold for the bit patterns 0010100 and 0011110]
Adjacent symbols smear into each other, resulting in
inter-symbol interference (ISI) and potential data errors
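The smearing described above can be sketched numerically. The following is a minimal illustration and is not taken from the slides: `PULSE` is a made-up baud-spaced pulse response, and `channel_output` is a hypothetical helper that superposes one pulse per NRZ symbol. It reuses the two bit patterns shown in the figure.

```python
# Hypothetical baud-spaced pulse response (an assumption, not measured data):
# main cursor followed by three post-cursor samples.
PULSE = [0.5, 0.3, 0.15, 0.05]

def channel_output(bits, pulse=PULSE):
    """Return one received sample per bit: y[n] = sum_k pulse[k] * s[n-k]."""
    symbols = [1.0 if b else -1.0 for b in bits]  # NRZ mapping
    received = []
    for n in range(len(symbols)):
        y = 0.0
        for k, p in enumerate(pulse):
            if n - k >= 0:  # symbols before the start of the pattern are ignored
                y += p * symbols[n - k]
        received.append(y)
    return received

isolated = channel_output([0, 0, 1, 0, 1, 0, 0])
run = channel_output([0, 0, 1, 1, 1, 1, 0])
# Sample of the second isolated 1 versus the last 1 in the run of 1s.
print(round(isolated[4], 3), round(run[5], 3))
```

With these assumed numbers the isolated 1 reaches only about 0.3 while the 1 at the end of the run reaches 1.0, so the margin to the decision threshold depends on the neighbouring bits.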
Architecture of current high-speed I/Os
[Block diagram: TX with feed-forward equalizer (FFE; data d0(n), delays T, taps c(-1), c(0), c(1), c(2)) and driver (VTx) → channel → RX (VRx) with continuous-time linear equalizer (CTLE) and decision-feedback equalizer (DFE; decisions d1(n-1) … d1(n-8), delays T, taps h(1) … h(8)); clock generation and clock tracking loop]
Low-power Design Areas
[Figure: link block diagram annotated with the low-power design areas: clock generation for CDR; transmitter (clock generation, data path); input data path (amplifier / CTLE / DFE summation); latch (sampling & amplification); DFE feedback (DFE architecture)]
CML vs. voltage-mode (SST) driver
CML (Current mode logic)
• 1Vppd swing
• Load current : +/- 5mA
• Total current: 20mA
SST (Source-Series Terminated)
• 1Vppd swing
• Load current : +/- 5mA
• Total current: 5mA
- Power is proportional to output swing in both cases
- SST driver allows different termination options (differential, to GND, to VDD)
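The current figures above follow from simple termination arithmetic. The sketch below assumes 50 Ω single-ended terminations at both ends (100 Ω differential), a CML driver whose tail current is fully steered, and an SST driver that places its supply across the two series resistors and the differential load. The function names are illustrative, not from the slides.

```python
def load_current_ma(vppd, z0=50.0):
    """Peak current in the far-end 2*z0 differential termination."""
    return (vppd / 2.0) / (2.0 * z0) * 1e3

def cml_tail_current_ma(vppd, z0=50.0):
    """CML: each output sees the near-end z0 pull-up in parallel with the
    far-end z0, so the peak-to-peak differential swing is I_tail * z0."""
    return vppd / z0 * 1e3

def sst_supply_current_ma(vppd, z0=50.0):
    """SST: the supply current flows through z0 + 2*z0 + z0 in series,
    so all of it reaches the load."""
    return load_current_ma(vppd, z0)

print(load_current_ma(1.0), cml_tail_current_ma(1.0), sst_supply_current_ma(1.0))
```

Under these assumptions a 1 Vppd swing needs about 5 mA in the load, 20 mA of CML tail current and 5 mA of SST supply current, which matches the numbers on the slide.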
Half-rate SST Driver
- Transistors are small due to large gate overdrive
- Small input capacitance → low power in pre-driver
C. Menolfi et al., "A 16Gb/s Source-Series Terminated Transmitter in 65nm SOI," ISSCC 2007.
TX with 4-tap Feed-forward equalizer (FFE)
[Figure: FFE waveform example]
- 3.6 mW/Gbps @ 16 Gb/s and 1V swing
C. Menolfi et al., "A 16Gb/s Source-Series Terminated Transmitter in 65nm SOI," ISSCC 2007.
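A feed-forward equalizer is a baud-spaced FIR filter applied to the transmitted symbols. The sketch below is illustrative only: the two-tap coefficients are made up (they are not those of the cited transmitter), and `tx_ffe` is a hypothetical helper.

```python
def tx_ffe(symbols, taps):
    """Baud-spaced FIR filter: y[n] = sum_k taps[k] * symbols[n - k]."""
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:  # no history before the first symbol
                acc += c * symbols[n - k]
        out.append(acc)
    return out

# Made-up de-emphasis taps; sum of |taps| = 1 keeps the peak swing unchanged.
TAPS = [0.75, -0.25]
levels = tx_ffe([+1, +1, +1, -1, -1], TAPS)
print(levels)
```

Repeated bits are driven at half amplitude while the bit after a transition gets the full swing, which is the high-frequency boost the FFE provides.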
28 Gb/s SST-TX
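[Figure: die photo of the 28 Gb/s SST-TX with T-coils and ESD (annotated dimensions 125 um, 180 um, 290 um) and output eye diagram (annotations 720 mV, 35.7 ps)]
Operating speed: up to 28Gb/s at VDDmin=0.95V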
4 tap bit-rate FIR feed-forward equalization (FFE)
+/- 20ps programmable True/Comp. output skew (enabled by SST concept)
+/- 5% Duty-cycle control
Adjustable driver impedance
Full ESD compliance: 2kV HBM, 100V MM, 250V CDM
High-frequency impedance matching with on-chip T-Coils, <-10dB return loss
Power consumption = 7mW/Gbps (175mW)
C. Menolfi et al., "A 28Gb/s Source-Series Terminated Transmitter in 32nm SOI", ISSCC 2012.
Ultra-Low-power Transmitter Concept
• Equalization done on RX side only
• Avoid TX-FFE
(1) Low power (1mW/Gbps for TX for 1V swing)
(2) Less amplification of TX clock jitter (see next slides)
[Block diagram: TX with NO feed-forward equalizer (FFE) → channel → continuous-time linear equalizer (CTLE) → n-tap decision-feedback equalizer (DFE) → received data]
Jitter Amplification in High-Loss Channels
Jitter event
Case I: Low-loss channel
→ Voltage error in sampling point
Case II: High-loss channel
→ Jitter is distributed along multiple UIs → jitter amplification
Case II can be converted to Case I by
(1) Continuous-time linear equalizer (CTLE) at RX side
(2) Continuous-time linear equalizer (CTLE) at TX side
(3) Discrete-time RX FFE
But not: TX FFE (since it is agnostic of actual jitter value)
→ TX FFE is sub-optimal and should be replaced with (1)-(3) if possible
Jitter Amplification: Comparison TX FFE vs. RX FFE
[Figure, courtesy Troy Beukema: two link simulations with clock jitter applied to the TX. TX-FFE case: TX with FFE → channel → RX. RX-FFE case: TX → channel → RX with FFE]
- 25% duty cycle variation applied = high frequency TX jitter at fb/2
- Same equalizer coefficients for TX-FFE and RX-FFE
RX-FFE suffers much less from TX jitter amplification than TX-FFE
Should make use of this fact in future standard definitions
Power optimal Amplification
[Figure [14]: power x delay versus amplification for SPA (Single-Pole Amp), OMA (Optimized Multi-pole) and RSA (Regenerative Amp)]
- Energy/bit = Power * Delay
- Regenerative amplification most power efficient
J. Wu and B. Wooley, "A 100-MHz Pipelined CMOS Comparator," JSSC 1988.
CML vs. StrongARM (=DCVS) comparator latch
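[Schematics: CML latch (resistive load R, bias vbias) and StrongARM (DCVS) latch, both clocked by φ, with inputs vin/vinb and outputs vout/voutb]
CML latch: $\tau_i = \dfrac{C_n}{g_{mn} - g_{dsn} - 1/R} \cong \dfrac{2C_n}{0.9\,g_{mn}}$
StrongARM latch: $\tau_i = \dfrac{C_n + C_p}{g_{mn} + g_{mp} - g_{dsn} - g_{dsp}} \cong \dfrac{1.25\,C_n}{0.9\,g_{mn}}$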
For Vdd=1V transistors are in similar operating point (Vgs=Vds=0.5V)
CML latch achieves smallest τi when R=2/gmn
StrongARM latch intrinsically faster
P/N ratio was >2, now approaching 1 due to strain engineering
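To see what the two time constants imply, the sketch below evaluates the approximate expressions above and the time needed to regenerate by a gain A, assuming exponential growth v(t) = v0·exp(t/τi). The device values are placeholders chosen only for illustration, and the function names are hypothetical.

```python
import math

def tau_cml(c_n, gm_n):
    """Approximate CML latch time constant from the slide: 2*Cn / (0.9*gmn)."""
    return 2.0 * c_n / (0.9 * gm_n)

def tau_strongarm(c_n, gm_n):
    """Approximate StrongARM latch time constant: 1.25*Cn / (0.9*gmn)."""
    return 1.25 * c_n / (0.9 * gm_n)

def regeneration_time(tau, gain):
    """Time for exp(t/tau) to reach the requested gain."""
    return tau * math.log(gain)

C_N = 10e-15   # placeholder capacitance [F]
GM_N = 10e-3   # placeholder transconductance [S]
for name, tau in (("CML", tau_cml(C_N, GM_N)),
                  ("StrongARM", tau_strongarm(C_N, GM_N))):
    t = regeneration_time(tau, math.exp(5))
    print(f"{name}: tau = {tau * 1e12:.2f} ps, t(A = e^5) = {t * 1e12:.2f} ps")
```

For equal Cn and gmn the CML time constant is 1.6 times the StrongARM one, and an amplification of exp(5) takes five time constants in either case.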
CML vs. StrongARM Latch Energy Consumption
[Schematics: CML latch and StrongARM latch]
[Figure: energy / (Vdd² CL) (0-30) versus Tcycle [ps] (80-200) for CMLspeed, CMLoptim and StrongARM]
CML latch optimized for speed or power
Comparison for given Tcycle and amplification A = exp(5)
StrongARM latch ≈2x as power efficient at Tcycle=100ps
Incomplete Settling → Integrating buffer
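[Schematic [9]: differential buffer with load resistors RL, load capacitors CL and bias vbias; ts = Tcycle/2, τ = RL·CL]
[Figure: power and noise (arbitrary units) versus normalized settling time ts/τ (0-6); RL→∞ corresponds to ts/τ→0]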
First, introduce sampling & reset in data path
For given CL, τ = RL·CL is varied by varying the load resistance RL
Power and noise are functions of the normalized settling time ts/τ
Power can be reduced significantly for small ts/τ (RL→∞, integrator)
But: Noise rises significantly when ts/τ < 1.5
Can drive higher load at same noise with lower power
E. Iroaga and B. Murmann, "A 12-Bit 75-MS/s Pipelined ADC Using Incomplete Settling", JSSC 2007.
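The power trend can be illustrated with a first-order model, which is an assumption of this sketch rather than the analysis of the cited paper: a transconductor gm driving RL in parallel with CL reaches a gain gm·RL·(1 − exp(−ts/τ)) after the time ts. For fixed CL, ts and target gain, the required gm (used here as a proxy for power) relative to the ideal integrator is x / (1 − exp(−x)) with x = ts/τ. Noise is not modelled.

```python
import math

def relative_gm(x):
    """gm needed for a fixed gain at time ts, normalised to the
    integrator limit (RL -> infinity, x = ts/tau -> 0)."""
    if x == 0.0:
        return 1.0
    return x / (1.0 - math.exp(-x))

for x in (0.0, 1.5, 3.0, 5.0):
    print(f"ts/tau = {x:.1f}: relative gm = {relative_gm(x):.2f}")
```

In this simplified model a buffer that settles for three time constants needs roughly three times the transconductance of an integrator delivering the same gain.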
Low-Power Decision Feedback Equalizer (DFE)
DFE principle of operation
Direct DFE
Low-power DFE
Speculative DFE
Integrating DFE
Switched-cap feedback
Low-power DFE at high-speed (28 Gbps)
Integrating DFE with fast switched-cap path and 3-tap speculation
Principle of DFE
[Figure: direct DFE. The input vin(t) is summed with previous decisions dn (+/-1), delayed by T and weighted by the taps h-1, h-2, h-3; the summed signal sn is sliced; critical path t = 1UI]
Vint = A·Vin + dn-4·h4 + … + dn-15·h15
Fundamental problem
Have to close the timing loop in the DFE in 1 UI
Low-power solutions: relax timing → lower supply voltage
- Use speculative DFE (see next slide)
- Improve speed of DFE feedback → switched-cap feedback
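The feedback loop can be sketched behaviourally. The example below is a minimal illustration with made-up channel taps, not a model of the circuits in this talk: `channel`, `slicer` and `direct_dfe` are hypothetical helpers, and the ISI is subtracted from each sample before slicing, so every decision depends on the decisions made just before it.

```python
def channel(symbols, cursor, post):
    """Received samples: main cursor plus post-cursor ISI from earlier symbols."""
    out = []
    for n, s in enumerate(symbols):
        y = cursor * s
        for k, h in enumerate(post, start=1):
            if n - k >= 0:
                y += h * symbols[n - k]
        out.append(y)
    return out

def slicer(samples):
    """Plain decision without equalization."""
    return [1 if y >= 0 else -1 for y in samples]

def direct_dfe(samples, post):
    """Subtract the ISI predicted from earlier decisions, then slice."""
    decisions = []
    for y in samples:
        isi = 0.0
        for k, h in enumerate(post, start=1):
            if len(decisions) >= k:
                isi += h * decisions[-k]
        decisions.append(1 if y - isi >= 0 else -1)
    return decisions

SENT = [-1, -1, +1, -1, +1, -1, -1]
POST = [0.4, 0.2]                      # made-up post-cursor taps h1, h2
rx = channel(SENT, cursor=0.5, post=POST)
print(slicer(rx) == SENT, direct_dfe(rx, POST) == SENT)
```

With these assumed taps the plain slicer misses the first isolated +1, while the DFE recovers the whole pattern. In hardware the decision, tap multiplication and summation must all complete within one UI, which is the timing problem named above.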
Speculation in DFE
[Figure: direct DFE (DFE feedback path ∆t = 1UI) versus DFE with one speculative tap: two comparators with thresholds +h1 and -h1 feed a 2:1 multiplexer selected by dn-1; the remaining feedback path has ∆t = 2UI]
Timing relaxed by 1UI - tMUX
CTLE + 1-tap speculative DFE results in power-efficient solution
Good solution for smooth channels
Avoids analog DFE summation
Power = 0.7pJ/bit @ 20Gb/s
J. Proesel et al., VLSI 2011
Summation node: Current-Integrating DFE
Resistors replaced with
resettable capacitors
Integrating buffer instead of resistively loaded buffer
Lower power
1.4 pJ/bit for 2 taps, 90nm CMOS
M. Park, et al., "A 7 Gb/s 9.3 mW 2-Tap Current Integrating DFE Receiver", ISSCC 2007.
Integrating DFE with I-feedback or switched-cap feedback
Vint = A·Vin + dn-4·h4 + … + dn-15·h15
[Schematics: integrating summer with I-feedback [Park and Bulzacchelli, ISSCC 2007] and with switched-cap feedback, SC-DFE (VLSI 2011); inputs vin, clock φ, load capacitors CL, feedback taps h2 … h8 driven by decisions d-2 … d-8, regulated supply Vreg]
I-feedback is replaced with charge feedback → SC-DFE
Enables faster feedback loop (see next slide)
T. Toifl et al., "A 2.6mW/Gbps 12.5Gbps RX with 8-tap Switched-Cap DFE in 32nm CMOS", VLSI 2011
DFE: Current feedback vs. switched-cap feedback
Current feedback: previous symbol decision must arrive before integration
[Timing diagrams for both feedback schemes: clock, integrator output vo with INTEGRATE / RESET / SAMPLE phases, and tap signal hn]
SC-feedback: previous symbol decision may arrive later → more margin
SC-DFE DAC Implementation and Measurement
[Figure: layout (17um) of SC-DFE sign logic + cap driver cells and capacitor array with unit weights 1, 2, 4, 8, 16, 16, 16]
[Figure: measured DAC output voltage and DNL [LSB] versus code (-64 to 64)]
• ±63 steps (6bit+sign) for taps 2-8
• 1 LSB = 250aF
• Excellent linearity
• Very small area (no I-DAC needed)
• Caps made of min finger M1+M2
• Size of entire SC-DFE tap < 70um2 in 32nm (including DAC)
Low-power DFE at 28 Gbit/s
28 Gbps : 1UI = 35ps
Half-rate receiver reaches limit
• Hard to achieve required amplification in integrator
• 14GHz CMOS clocking difficult
• Duty cycle hard to control, need small FO
• Electromigration rules require adding lots of metal in the layout
→ Quarter-rate receiver more power-efficient
Single-tap speculative DFE too slow
• Feedback time only tfb=70ps
→ Three-tap speculation relaxes timing (tfb=210ps)
→ Can reduce supply voltage and use higher fan-out
3 Tap Speculation
[Figure: DFE with three speculative taps: eight comparators with thresholds ±h1±h2±h3 on vin(t) feed a tree of 2:1 multiplexers selected by dn-1, dn-2 and dn-3; the remaining taps h-4, h-5, h-6 use the DFE feedback path with ∆t = 4UI]
Now 8 latches
Power ~ 150fJ/bit per latch → moderate penalty
Critical path now 4UIs
Timing relaxed by 3UI - 3tMUX
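Speculation (loop unrolling) can be sketched behaviourally as follows. This is an illustration under assumed tap values, not the circuit from the slides: every sample is compared against all 2^n thresholds in parallel, and a multiplexer then selects the comparator result that matches the decisions actually taken. The names and the all -1 start-up history are assumptions of the sketch.

```python
from itertools import product

def speculation_thresholds(h):
    """All 2**n candidate thresholds, keyed by the assumed past decisions
    (d[n-1], ..., d[n-n]); for three taps these are the +/-h1 +/-h2 +/-h3 values."""
    return {past: sum(hk * dk for hk, dk in zip(h, past))
            for past in product((-1, 1), repeat=len(h))}

def speculative_dfe(samples, h, start=None):
    """Slice every sample against all thresholds, then let the multiplexer
    pick the result that matches the decisions actually made."""
    thresholds = speculation_thresholds(h)
    history = tuple(start) if start is not None else (-1,) * len(h)
    out = []
    for y in samples:
        speculated = {past: (1 if y >= th else -1)
                      for past, th in thresholds.items()}
        d = speculated[history]          # the multiplexer selection
        out.append(d)
        history = (d,) + history[:-1]
    return out

H = [0.4, 0.2, 0.1]   # made-up tap weights h1, h2, h3
print(len(speculation_thresholds(H)), speculative_dfe([0.9, -0.2, 0.4, -1.1], H))
```

Three speculative taps give eight thresholds, matching the eight latches on the slide. The decisions are identical to those of a direct DFE with the same taps; only the selection, not the summation, remains in the timing-critical loop.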
DFE + Demux slice (1 slice of 4 total)
[Block diagram: T/H and integrating amplifier (∫-Amp, φint) with SC-DFE taps h4 … h15 produce vint; comparator latches L0-L7 with thresholds ±h1±h2±h3 (outputs s0p/n … s7p/n) and calibration latches LA0/LA1 (outputs sa0p/n, sa1p/n); 5:1 and 2:1 multiplexers select the decision dp/n0 using dp/n-1, dp/n-2 and dp/n-3; FIFO shared between slices 0/2 and 1/3; sgn(h4), PG and XOR logic; 10:1 multiplexer for the spy output]
Actually 10 comparator latches per slice : 8 active + 2 used for calibration
Layout and power breakout @ 30Gb/s
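[Figure: layout (250um) with T-coil & ESD, CTLE, clock generation & bias, I/Q generation, four DFE and demux slices (0-3), SC-DFE (12 cells) and digital logic]
Power breakdown:
- CTLE: 0.2mW/Gbps
- Clkbuf & CML2CMOS: 0.13mW/Gbps
- I/Q gen (CML clock div): 0.27mW/Gbps
- DFE and Demux: 2.5mW/Gbps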
Total: 92mW = 3.1mW/Gbps
Core DFE size = 200x90µm
Need to calibrate 40 comparator offsets and 48 DFE taps
T. Toifl, et al., "A 3.1mW/Gbps 30Gbps Quarter-Rate Triple-Speculation 15-tap SC-DFE RX Data Path
in 32nm CMOS", VLSI 2012.
[Figure: channel loss SDD21 [dB] (0 to -50) versus frequency [GHz] (0-20); -30dB @ 15GHz]
[Figure: 30Gb/s measurement results (vertical axis 0 to -12) versus phase position [%UI] for PRBS 31 with no TX FFE and with TX FFE = [0 67 -29]]
Power =92mW @ VDD=1.15V = 3.1pJ/bit
Low-power RX clock generation
Clock generation for quarter-rate CDR system
Phase-programmable PLL (P-PLL)
Design example:
• 40Gb/s RX using P-PLL
Quarter-rate Clock and Data Recovery
[Figure: data waveform sampled by clock phases φ0-φ7 from the multi-phase clock generator, producing D0-D3; the CDR loop adjusts timing]
4 Received bits per clock cycle
2 Sample bits/symbol (data, edge)
→ Need 4x2 clock phases
→ Need to shift 8 clocks simultaneously
Quarter-rate Dual-loop architecture
[Block diagram, previous solution: φref → PLL/DLL polyphase generator (k phases) → 8 phase rotators → 8 phases → 8 sampling latches (data, 40 Gbps) → 4 data bits → digital loop filter; the digital DLL controls the phase rotators]
Phase rotators:
- Area
- Power
- Mismatch
- Need at least 2 ref phases
Quarter-rate Dual-loop architecture
[Block diagrams: the previous dual-loop architecture (PLL/DLL polyphase generator with k phases, 8 phase rotators, digital DLL) next to the new architecture, in which a P-PLL driven by φref directly generates the 8 phases for the 8 sampling latches (data at 40 Gbps, 4 data bits, digital loop filter)]
⇒ Phase-Programmable PLL (P-PLL)
Phases can be programmed by digital value → enables digital CDR loop
Provides tight lock to high reference frequency → low jitter
40 Gb/s CDR Architecture using P-PLL
[Block diagram: 40 Gbit/s data sampled with 8 phases; 4:8 demux delivers d0-7 and e0-7; early/late signal generation (E/L, up/dn) drives the digital loop filter and the α-DACs (α0 … α7); P-PLL with XOR phase detectors (x 8), loop filter, V/I converter (Vc) and VCO with phases φ0 … φ15, locked to a 10 GHz φref; phase resolution 288 steps / 4 UI; PRBS15 checker with error_out]
[37] T. Toifl et al., "A 72 mW 0.03 mm2 inductorless 40 Gb/s CDR in 65 nm SOI CMOS," ISSCC 2007.
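The clocking numbers on the last few slides can be cross-checked with a few lines of arithmetic. The sketch below only restates figures given in the deck (40 Gb/s, quarter rate, data and edge samples, 288 steps per 4 UI); the function name and dictionary keys are illustrative.

```python
def cdr_timing(data_rate_gbps, subrate=4, samples_per_ui=2, steps_per_period=288):
    """Clocking numbers for a sub-rate CDR that takes data and edge samples."""
    ui_ps = 1000.0 / data_rate_gbps
    period_ps = subrate * ui_ps          # one clock period spans `subrate` UIs
    return {
        "ui_ps": ui_ps,
        "clock_ghz": data_rate_gbps / subrate,
        "phases": subrate * samples_per_ui,
        "phase_spacing_ps": ui_ps / samples_per_ui,
        "phase_step_ps": period_ps / steps_per_period,
    }

timing = cdr_timing(40.0)
for key, value in timing.items():
    print(key, value)
```

At 40 Gb/s this gives a 10 GHz clock, 8 phases spaced 12.5 ps apart, and a phase step of roughly 0.35 ps (72 steps per UI).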
Chip Photograph and Layout of 40Gb/s RX
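[Figure: chip photograph and layout (190 um x 150 um) showing data and φref inputs, samplers + 4:8 demux, VCO, phase detectors, loop filter, and CDR logic + PRBS15 checker (digital)]
[Figure: area efficiency [Tbps/mm2] (0.0-1.5) versus power efficiency [Tbps/W] (0.0-0.6) for this work (CMOS, 40 Gb/s, [37]) and for SiGe 45 Gb/s, CMOS 44 Gb/s, CMOS 25 Gb/s and CMOS 40 Gb/s designs labelled [23]-[26]]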
Power : 1.8pJ/bit @ 40 Gbps
Area: <0.03mm2 in 65nm CMOS, no inductors used
ADC-based I/Os
▪ New standards emerging operating with PAM-4
• IEEE 802.3bj (100Gb Ethernet over backplane)
• PAM-4 mode for high channel loss (>40dB)
▪ Requirements for ADC
• Rather low resolution (<5bits)
• Low latency (for closing CDR loop) → Flash ADC
• High conversion rates → Interleaved flash ADC → Low input capacitance required
Example: Low-power flash ADC with small input cap
Low-power flash ADC with small input cap
[Block diagram: vin/vinb → termination + ESD → T/H → integrator → 30 comparator slices (Comp Slice 0 … Comp Slice 29) → bubble correction → thermometer to binary encoder (5 bit output); clock receiver and clock generation (clk/clkb); 30x12 bit memory]
- Integrating buffer drives big load with small input capacitance
- Comparator slice consists of offset-adjustable SenseAmp latch
(no reference ladder required)
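The back end of such a flash ADC can be sketched in a few lines. The slides do not say which bubble-correction scheme is used, so the three-input majority vote below is an assumption (a common textbook choice), as are the function names and the sample comparator pattern.

```python
def correct_bubbles(therm):
    """Replace each bit by the majority of itself and its two neighbours.
    The code is padded with a 1 below and a 0 above."""
    padded = [1] + list(therm) + [0]
    return [1 if padded[i - 1] + padded[i] + padded[i + 1] >= 2 else 0
            for i in range(1, len(padded) - 1)]

def thermometer_to_binary(therm):
    """For a clean thermometer code the number of 1s is the output level."""
    return sum(therm)

N_COMPARATORS = 30                      # as on the slide
raw = [1] * 12 + [0, 1] + [0] * 16      # made-up comparator outputs with one bubble
clean = correct_bubbles(raw)
level = thermometer_to_binary(clean)
print(level, format(level, "05b"))
```

Thirty comparators resolve 31 levels, which fits in the 5 output bits and is consistent with the stated requirement of fewer than 5 bits of resolution.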
Low-power flash ADC results
[Figure: layout, 65 µm x 120 µm, with input buffer, 30 threshold-adjustable DCVS latches and latch threshold memory]
Technology | 32nm CMOS SOI
Sampling frequency | 5GHz
ENOB @ DC | 4.4
ENOB @ fs/2 | 4.1
DNL, INL | <0.3 LSB, <0.3 LSB
Power consumption | 19mW @ 1V supply
FOM | 230fJ/conversion step
Input cap | 50fF
ADC cell can be time-interleaved to achieve higher conversion rates
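The figure of merit in the table can be cross-checked with the commonly used Walden definition, P / (2^ENOB · fs). The slide does not state which ENOB entered the quoted value, so the sketch below evaluates both table entries; the function name is illustrative.

```python
def walden_fom_fj(power_w, enob, fs_hz):
    """Energy per conversion step, P / (2**ENOB * fs), in femtojoules."""
    return power_w / (2.0 ** enob * fs_hz) * 1e15

fom_nyquist = walden_fom_fj(19e-3, 4.1, 5e9)   # ENOB at fs/2
fom_dc = walden_fom_fj(19e-3, 4.4, 5e9)        # ENOB at DC
print(round(fom_nyquist), round(fom_dc))
```

With the rounded table values the ENOB at fs/2 gives about 0.22 pJ per conversion step, in the neighbourhood of the quoted 230 fJ; the small difference is presumably due to rounding of the inputs.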
Digital DFE
▪ Relatively simple and low-power
• Case I : low speed (<5Gbit/s)
• DFE loop can be implemented by digital adder
• Critical path is adder and comparator
• Taps can also be combined in LUT
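[Block diagram: the ADC output is summed with previous decisions dn weighted by the coefficients c1 … c5 (delays T); a comparator (≥) slices the sum sn to produce dn]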
Digital DFE
▪ Relatively simple and low-power
• Case II : small number of DFE taps (n ≤ 5)
• Digital loop unrolling requires 2^n comparators
- Comparator power: <100uW/Gbps
- Critical path: 2:1 MUX
- Hard problem is long DFE (e.g. 15 taps) at high speed (e.g. 28Gb/s)
[Block diagram: the ADC output sn is compared against the 32 thresholds c00000 … c11111; a 32:1 MUX addressed by the five previous decisions (delays T) selects dn]
Analog vs. Digital: RX power comparison
▪ Power: Optimized Analog RX with 15-tap DFE
• Clock path: 2mW/Gbps
• Linear equalizer (CTLE): 1.5mW/Gbps
• DFE: 3mW/Gbps for 15-tap DFE @ 25Gb/s
• Total: 6.5 mW/Gbps
▪ Power: ADC based RX with 15-tap digital DFE
• Clock path: 1mW/Gbps (no edge samples required with Mueller-Müller CDR)
• Linear equalizer (CTLE): 1.5mW/Gbps
• ADC: 4mW/Gbps
• DSP: FFE: 3mW/Gbps + DFE: 3mW/Gbps
• Total: 12.5mW/Gbps
ADC power alone is ≥ DFE power
Analog will stay lowest power solution for DFE equalization
Why ? Analog well suited for DFE: fast summation with moderate accuracy
BUT: gap will shrink as technology scales
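The two budgets can be tallied directly. The per-block numbers below are copied from the slide; the dictionary layout, the function names and the assumption that the figures scale linearly with data rate are illustrative.

```python
# Per-block figures in mW/Gbps (numerically equal to pJ/bit), from the slide.
ANALOG_RX = {"clock path": 2.0, "CTLE": 1.5, "15-tap DFE": 3.0}
ADC_RX = {"clock path": 1.0, "CTLE": 1.5, "ADC": 4.0,
          "DSP FFE": 3.0, "DSP DFE": 3.0}

def total_mw_per_gbps(budget):
    """Sum of all block contributions."""
    return sum(budget.values())

def power_mw(budget, data_rate_gbps):
    """Absolute power at a given data rate, assuming linear scaling."""
    return total_mw_per_gbps(budget) * data_rate_gbps

for name, budget in (("analog", ANALOG_RX), ("ADC-based", ADC_RX)):
    print(name, total_mw_per_gbps(budget), "mW/Gbps,",
          power_mw(budget, 25.0), "mW at 25 Gb/s")
```

The totals reproduce the 6.5 and 12.5 mW/Gbps on the slide, and the ADC entry alone (4 mW/Gbps) already exceeds the complete analog DFE (3 mW/Gbps).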
Summary
▪ Low-power TX techniques
• SST drivers: low-power, multi-standard
• Avoid TX-FFE: Low power, less jitter amplification
▪ Low-power RX Techniques
• Regenerative amplification, StrongARM (DCVS) latch
• Integrating DFE using switched-cap feedback
• P-PLL phase generation
▪ Digital (ADC-based) I/O approaches
• ADC : Flash ADC <4mW/Gbps
• Analog is lower power solution for long DFE at high speeds
• High-speed I/Os will follow route of Gb Ethernet over UTP
Acknowledgements
IBM: Christian Menolfi, Peter Buchmann, Marcel Kossel, Matthias Brändli, Thomas Morf, Pier Andrea Francese, John Bulzacchelli, Troy Beukema, Tod Dickson, Dan Dreps, Steve Baumgartner, Fran Keyser, John Bergkvist, Martin Schmatz
Miromico: Robert Reutemann, Michael Ruegg, Daniele Gardellini, Andrea Prati, Giovanni Cervelli
Thank you for your attention!
Additional Slides
Summary of Equalization Techniques
Feed-forward Equalization (FFE)
– FIR filter, usually implemented at TX side
Continuous-Time Linear Equalizers (CTLE)
– Increase high-frequency gain and/or decrease low-frequency
gain to compensate for low-pass channel characteristics.
Decision Feedback Equalization (DFE)
– Feed back previous bit decisions to cancel postcursor ISI
caused by those bits.
Source-synchronous Links Application Area
Important parameters:
• throughput (Gb/s per pin)
• power (mW per Gb/s or equivalently pJ per bit)
• limited die area (µm2 per Gb/s, fit beneath C4 balls)
• Latency (memory links, SMP links)
Switched-cap DFE principle
[Schematic: single SC-DFE tap. Sign logic (sgn(c2) … sgn(c8)) and decisions d-2 … d-8 from flip-flops and a ½-rate FIFO drive the capacitor arrays C2 … C8 (1C, 2C, 4C, 8C, 16C, 16C, 16C) between Vreg and the integrator outputs Vo/Vob with load capacitors CL]
- Current feedback replaced by charge feedback
- Capacitive DAC (6 bits+sign, +/-0..63): 4 binary (1,2,4,8) + 3 bits therm-coded (16,16,16)
- unused caps disconnected by PFET pass transistors (reduced cap loading)
- Transistors used as switches (digital design style, don't care about 'analog' parms like gm, gds)
- DFE timing margin is improved with respect to the integrating DFE with current feedback
- Addition of charges is highly linear: needed for a large number (e.g. 20) of DFE or X-talk cancellation taps
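The segmented capacitive DAC described above can be sketched as a simple encoder. The weights (four binary bits plus three thermometer-coded units of 16) come from this slide and the 250 aF LSB from the measurement slide; the function names, the bit ordering and the sign convention are assumptions of the sketch.

```python
LSB_AF = 250                       # capacitance of 1 LSB in attofarads
BINARY_WEIGHTS = (1, 2, 4, 8)
THERMOMETER_WEIGHTS = (16, 16, 16)

def encode_tap(code):
    """Split a signed tap code (-63..63) into sign, binary and thermometer controls."""
    if not -63 <= code <= 63:
        raise ValueError("tap code out of range")
    magnitude = abs(code)
    n_therm = magnitude // 16      # number of 16C units switched in
    remainder = magnitude % 16     # handled by the 1C, 2C, 4C, 8C caps
    binary = [(remainder >> i) & 1 for i in range(4)]
    therm = [1 if i < n_therm else 0 for i in range(3)]
    return (-1 if code < 0 else 1), binary, therm

def switched_cap_af(binary, therm):
    """Total capacitance connected for the given control bits, in attofarads."""
    units = sum(w * b for w, b in zip(BINARY_WEIGHTS, binary))
    units += sum(w * t for w, t in zip(THERMOMETER_WEIGHTS, therm))
    return units * LSB_AF

sign, binary, therm = encode_tap(-37)
print(sign, binary, therm, switched_cap_af(binary, therm), "aF")
```

Thermometer coding of the upper segment means at most one 16C unit changes between adjacent codes, which helps the DNL at the segment boundaries.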
TX Architecture
[Block diagram: data in (d0-d3) → mux4:2 (even/odd) → 4-tap FIR shift register (tap0-tap3, even/odd; ck4/ck4b) → 2:1 → 32 SST units → data out (out+/out-); clock in (cml) → CML-to-CMOS clock generator (ck2/ck2b); configuration register: tapsel[0:31], data/term[0:3], enable, tunen/tunep, sign[0:3]]
▪ Half-rate architecture
▪ 4-tap feed-forward equalizer (FFE)
▪ Programmable output impedance
Courtesy Christian Menolfi [1],[5]
Jitter Amplification in High-Loss Channels
- Jitter can be approximated by Dirac impulse at edge position [1]
- Jitter is then converted to voltage noise in the sampling point
TX jitter convolved with channel response
[1] V. Stojanović et al., "Optimal Linear Precoding with Theoretical and Practical Data Rates in High-Speed Serial-Link Backplane Communication", ICC 2004
Sampling in Data Path vs. continuous time data path
[Figure: track-and-hold (T/H) with sampling capacitor Cs driving latches L; ∆V annotations]
Sampling is done by T/H at the input
Total input capacitance is N·Cs/2
Advantages
- Signal processing now in discrete time domain
-> Reduced bandwidth requirements due to
-> Sub-rate
-> Can use reset to erase history
-> Buffers can use incomplete settling or integration
P-PLL – Key points:
Advantages
• Clocks from VCO go directly into the latches
• No need for phase rotators
• Clock path is very short
• Phase-rotation with XOR phase detectors is inherently linear
• High-frequency noise on input clock signal is filtered out
Disadvantages
• Phase noise is accumulated due to PLL operation
But: PLL bandwidth is extremely high (>1 GHz for 10GHz clock)
→ Noise is attenuated to a large extent
• Phase-rotation now in feedback-path
But: No influence on CDR due to high PLL bandwidth
References
SST- Transmitter
[1] C. Menolfi, T. Toifl, P. Buchmann, M. Kossel, T. Morf, J. Weiss, M. Schmatz, "A 16Gb/s Source-Series
Terminated Transmitter in 65nm CMOS SOI," ISSCC Dig. Tech Papers, pp. 446-447, Feb. 2007.
[2] M. Kossel, C. Menolfi, J. Weiss, P. Buchmann, G. von Bueren, L. Rodoni, T. Morf, T. Toifl, M. Schmatz, "A T-Coil-Enhanced 8.5Gb/s High-Swing Source-Series-Terminated Transmitter in 65nm Bulk CMOS", IEEE
Journal of Solid-State Circuits, Vol. 43, pp. 2905-2920, Dec. 2008.
[3] R. Philpott, J. Humble, R. Kertis, K. Fritz, B. Gilbert, E. Daniel, "A 20Gb/s SerDes transmitter with adjustable
source impedance and 4-tap feed-forward equalization in 65nm bulk CMOS," IEEE Custom Integrated
Circuits Conference (CICC), pp.623-626, 21-24, 2008.
[4] W. Dettloff, J. Eble, Lei Luo, P. Kumar, F. Heaton, T. Stone, B. Daly, "A 32mW 7.4Gb/s protocol-agile source-series-terminated transmitter in 45nm CMOS SOI," ISSCC Dig. Tech Papers, pp.370-371, 7-11 Feb. 2010.
[5] C. Menolfi, T. Toifl, M. Rueegg, M. Braendli, P. Buchmann, M. Kossel, T. Morf, "A 14Gb/s high-Swing Thin-oxide device SST TX in 45nm CMOS SOI", ISSCC 2011.
Latch Modeling and Optimization
[6] B. Wicht, T. Nirschl, D. Schmitt-Landsiedel, "Yield and Speed Optimization of a Latch-Type Voltage Sense
Amplifier", IEEE Journal of Solid-State Circuits, vol. 39, pp. 1148 - 1158, July 2004.
[7] T. Toifl, C. Menolfi, M. Ruegg, R. Reutemann, P. Buchmann, M. Kossel, T. Morf, J. Weiss, M. Schmatz, "A 22Gb/s PAM-4 receiver in 90-nm CMOS SOI technology," IEEE Journal of Solid-State Circuits, Vol. 41, pp.
954 - 965, April 2006.
[8] P. Heydari, R. Mohanavelu, "Design of Ultrahigh-Speed Low-Voltage CMOS CML Buffers and Latches", IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, Vol.12, No. 10, October 2004.
[9] T. Chalvatzis, K.H.K. Yau, P. Schvan, M.T. Yang, S.P. Voinigescu, "A 40-Gb/s Decision Circuit in 90-nm
CMOS," Proceedings of the 32nd European Solid-State Circuits Conference, Vol. 32, pp. 512 - 515,
September 2006.
[10] Y. Okaniwa, H. Tamura, M. Kibune, D. Yamazaki, T. Cheung, J. Ogawa, N. Tzartzanis, W. W. Walker, and T.
Kuroda, "A 40-Gb/s CMOS clocked comparator with bandwidth modulation technique," IEEE Journal of
Solid-State Circuits, vol. 40, pp. 1680 - 1687, August 2005.
[11] P. Nuzzo, F. De Bernardinis, P. Terreni, G. Van der Plas, "Noise Analysis of Regenerative Comparators for
Reconfigurable ADC Architectures", IEEE Trans. on Circuits and Systems I, Vol. 55, No 6, pp 1441-1454,
July 2008.
[12] J. Kim. B. Leibowitz, J. Ren, C. Madden, "Simulation and Analysis of Random Decision Errors in Clocked
Comparators", IEEE Trans on Circuits and Systems I, Vol. 56, pp 1844-1857, August. 2009.
[13] M. Jeeradit, J. Kim, B. Leibowitz, P. Nikaeen, V. Wang, B. Garlepp, C. Werner, "Characterizing Sampling
Aperture of Clocked Comparators", pp 68-69, IEEE Symposium on VLSI Circuits, June 2008.
Low-power techniques
[14] J. Wu and B. Wooley, "A 100-MHz Pipelined CMOS Comparator," IEEE Journal of Solid-State Circuits, Vol.
23, pp. 1379 - 1385, December 1988.
[15] M. Choi, A. Abidi, "A 6-b 1.3-Gsample/s A/D converter in 0.35-um CMOS," IEEE Journal of Solid-State
Circuits, Vol. 36, pp. 1847 - 1858, Dec 2001.
[16] E. Iroaga and B. Murmann, "A 12-Bit 75-MS/s Pipelined ADC Using Incomplete Settling", IEEE Journal of
Solid-State Circuits, Vol. 42, pp. 748 - 756, April 2007.
[17] M. Park, J. Bulzacchelli, M. Beakes, D. Friedman, "A 7Gb/s 9.3mW 2-Tap Current-Integrating DFE Receiver",
ISSCC Dig. Tech Papers, pp. 230-231, Feb. 2007.
[18] T. Dickson, J. Bulzacchelli, D. Friedman, "A 12-Gb/s 11-mW Half-Rate Sampled 5-Tap Decision Feedback
Equalizer With Current-Integrating Summers in 45-nm SOI CMOS Technology", IEEE Journal of Solid-State Circuits, pp 1298-1305, Vol. 44, April 2009
[19] K. Fukuda et al,"A 12.3-mW 12.5Gb/s Complete Transceiver in 65-nm CMOS Process", IEEE Journal of
Solid-State Circuits, Vol. 45, Dec. 2010
DFE Implementations
[20] R. Payne, B. Bhakta, S. Ramaswamy, S. Wu, J. Powers, P. Landman, U. Erdogan, A. Yee, R. Gu, L. Wu, B.
Parthasarathy, K. Brouse, W. Mohammed, K. Heragu, V. Gupta, L. Dyson, W. Lee "A 6.25Gb/s Binary Adaptive
DFE with First Post-Cursor Tap Cancellation for Serial Backplane Communications", ISSCC Dig. Tech Papers,
pp. 68-69, Feb. 2005.
[21] B. Leibowitz, J. Kizer, H. Lee, F. Chen, A. Ho, M. Jeeradit, A. Bansal, T. Greer, S. Li, R. Farjad-Rad, W.
Stonecypher, Y. Frans, B. Daly, F. Heaton, B. W. Garlepp, C. W. Werner, N. Nguyen, V. Stojanović, J L. Zerbe,
"A 7.5 Gb/s 10-Tap DFE Receiver with First Tap Partial Response, Spectrally Gated Adaptation, and 2nd-Order
Data Filtered CDR," ISSCC Dig. Tech Papers, pp. 228-229, Feb. 2007.
[22] T. Beukema, M. Sorna, K. Selander, S. Zier, B. L. Ji, P. Murfet, J. Mason, W. Rhee, H. Ainspan, B. Parker, and M.
Beakes, "A 6.4-Gb/s CMOS SerDes core with feed-forward and decision-feedback equalization," IEEE Journal of
Solid-State Circuits, vol. 40, pp. 2633 - 2645, December 2005.
[23] J. F. Bulzacchelli, M. Meghelli, S. V. Rylov, W. Rhee, A. V. Rylyakov, H. A. Ainspan, B. D. Parker, M. P. Beakes,
A. Chung, T. J. Beukema, P. K. Pepeljugoski, L. Shan, Y. H. Kwark, S. Gowda, and D. J. Friedman, "A 10-Gb/s
5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology," IEEE Journal of Solid-State Circuits, vol. 41, pp.
2885 - 2900, December 2006.
[24] A. Emami-Neyestanak, A. Varzaghani, J. Bulzacchelli, A., C. K. Ken Yang and D. Friedman, "A Low-Power
Receiver with Switched-Capacitor Summation DFE", Symp.VLSI Circuits Dig. Tech. Papers, June 2006.
[25] K. J. Wong, A. Rylyakov, and C. K. Ken Yang, "A 5-mW 6-Gb/s quarter-rate sampling receiver with a 2-tap DFE
using soft decisions," Symp. VLSI Circuits Dig. , pp. 190 - 191, June 2006.
[26] A. Garg, A. C. Carusone, and S. P. Voinigescu, "A 1-tap 40-Gb/s look-ahead decision feedback equalizer in 0.18µm SiGe BiCMOS technology," IEEE Journal of Solid-State Circuits, vol. 41, pp. 2224 - 2232, October 2006.
Source-Synchronous RX /Clean up PLL/ P-PLL
[27] T. Toifl, C. Menolfi, M. Ruegg, R. Reutemann, P. Buchmann, M. Kossel, T. Morf, J. Weiss, M. Schmatz, "A
0.94-ps-RMS-jitter 0.016-mm2 2.5-GHz multiphase generator PLL with 360º digitally programmable phase
shift for 10-Gb/s serial links", IEEE Journal of Solid-State Circuits, Vol. 40, pp. 2700 - 2712, Dec 2005.
[28] E. Prete, D. Scheideler, A. Sanders, “A 100mW 9.6Gb/s Transceiver in 90nm CMOS for Next- Generation
Memory Interfaces,” ISSCC Dig. Tech. Papers, vol. 49, pp. 88-89, Feb. 2006.
[29] R. Palmer, J. Poulton, W. J. Dally, J. Eyles, A. M. Fuller, T. Greer, M. Horowitz, M. Kellam, F. Quan, F.
Zarkeshvari, “A 14mW 6.25Gb/s Transceiver in 90nm CMOS for Serial Chip-to-Chip Communications”,
ISSCC Dig. Tech Papers, pp. 440-441, Feb. 2007.
[30] J. Poulton, R. Palmer, A.M. Fuller, T. Greer, J. Eyles, W.J. Dally, M. Horowitz, "A 14-mW 6.25-Gb/s Transceiver
in 90-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 42, pp. 2745 - 2755, December 2007.
[31] R. Reutemann, M. Ruegg, F. Keyser, J. Bergkvist, D. Dreps, T. Toifl, M. Schmatz, " 4.5 mW/Gb/s 6.4 Gb/s
22+1-Lane Source Synchronous Receiver Core With Optional Cleanup PLL in 65 nm CMOS", IEEE Journal
of Solid-State Circuits, Vol. 45, pp. 2850 - 2860, Dec 2010.
[32] A. Agrawal, A. Lie, P. Kumar Hanumolu, G. Wei, "An 8 x 5 Gb/s Parallel Receiver With Collaborative Timing
Recovery", IEEE Journal of Solid-State Circuits, vol. 44, pp. 3120 - 3130, Nov. 2009.
[33] B. Leibowitz, R. Palmer, J. Poulton, Y. Frans, S. Li, J. Wilson, M. Bucher, A. Fuller, J. Eyles, M. Aleksic, T.
Greer, N. Nguyen, "A 4.3 GB/s Mobile Memory Interface With Power-Efficient Bandwidth Scaling", IEEE
Journal of Solid-State Circuits, vol. 45, pp. 889 - 898, April 2010
[34] N. Kurd, S. Bhamidipati, C. Mozak, J. Miller, P. Mosalikanti, et al., "A Family of 32 nm IA Processors," IEEE
Journal of Solid-State Circuits, vol.46, no.1, pp.119-130, Jan. 2011
25-40 Gbps CDR Circuits
[35] J. Lee, B. Razavi, “A 40-Gb/s clock and data recovery circuit in 0.18-µm CMOS technology,” IEEE Journal of
Solid-State Circuits, vol. 38, pp. 2181 - 2190, Dec. 2003.
[36] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger, H. Jackel “A 25Gb/s CDR in 90nm CMOS for High-Density Interconnects,” ISSCC Dig. Tech Papers, pp. 326-327, Feb. 2006.
[37] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleiter, M. Kossel, T. Morf, J. Weiss, M. Schmatz, “A 72mW 0.03mm2
Inductorless 40Gb/s CDR in 65nm SOI CMOS”, ISSCC Dig. Tech Papers, pp. 410-411, Feb. 2007.
[38] D. Kucharski, K. Kornegay, “2.5 V 43-45 Gb/s CDR Circuit and 55 Gb/s PRBS Generator in SiGe Using a Low-Voltage Logic Family,” IEEE Journal of Solid-State Circuits, vol. 41, pp. 2154 - 2165, Sept. 2006.
[39] N. Nedovic, N. Tzartzanis, H. Tamura, F. Rotella, M. Wiklund, J. Ogawa, W. Walker, “A 40-44Gb/s 3x
Oversampling CMOS CDR/1:16 DEMUX,” ISSCC Dig. Tech Papers, pp. 224-225, Feb. 2007.
[40] C. Liao, S. Liu, "A 40 Gb/s CMOS Serial-Link Receiver With Adaptive Equalization and Clock/Data Recovery,"
IEEE Journal of Solid-State Circuits, vol.43, no.11, pp 2492-2502, Nov. 2008.
[41] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, et al. "A 40 Gb/s Multi-Data-Rate CMOS Transmitter
and Receiver Chipset With SFI-5 Interface for Optical Transmission Systems," IEEE Journal of Solid-State
Circuits, vol.44, no.12, pp.3568-3579, Dec. 2009.
[42] N. Nedovic, A. Kristensson, S. Parikh, S. Reddy et al., "A 3 Watt 39.8–44.6 Gb/s Dual-Mode SFI5.2 SerDes
Chip Set in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol.45, no.10, pp.2016-2029, Oct. 2010.
High-Speed ADC and Digital I/O Implementations
[43] H. Chung, A. Rylyakov, Z. T. Deniz, J. Bulzacchelli, G.-Y. Wei, and D. Friedman, “A 7.5-GS/s 3.8-ENOB 52-mW
flash ADC with clock duty cycle control in 65 nm CMOS,” in Proc. Symp. VLSI Circuits, 2009, pp. 268–269.
[44] K. Deguchi, N. Suwa, M. Ito, T. Kumamoto, and T. Miki, “A 6-bit 3.5-GS/s 0.9-V 98-mW flash ADC in 90-nm
CMOS,” IEEE J. Solid-State Circuits, vol. 43, no. 10, pp. 2303–2310, Oct. 2008.
[45] S. Sarvari, T. Tahmoureszadeh, A. Sheikholeslami, H. Tamura, M. Kibune, "A 5Gb/s Speculative DFE for 2x
Blind ADC-based Receivers in 65-nm CMOS", Symp. VLSI Circuits Dig. Tech. Papers, June 2010.
[46] C. Huang, C. Wang, J. Wu, "A CMOS 6-Bit 16-GS/s Time-Interleaved ADC with Digital Background
Calibration", Symp. VLSI Circuits Dig. Tech. Papers, June 2010.
[47] M. El-Chammas, B. Murmann, "A 12-GS/s 81-mW 5-bit Time-Interleaved Flash ADC with Background Timing
Skew Calibration", Symp. VLSI Circuits Dig. Tech. Papers, June 2010.
[48] E. Chen, C. Ken Yang, "ADC-Based Serial I/O Receivers", IEEE Transactions on Circuits and Systems I, Vol.
57, No. 9, Sept. 2010.
[49] M. Harwood, N. Warke, R. Simpson, T. Leslie et al., “A 12.5 Gb/s SerDes in 65nm CMOS using a baud-rate
ADC with digital equalization and clock recovery,” ISSCC Dig. Tech Papers, pp. 436-591, Feb. 2007.
[50] H. Yamaguchi, H. Tamura, et al., “A 5 Gb/s transceiver with an ADC-based feedforward CDR and CMA adaptive
equalizer in 65 nm CMOS,” ISSCC Dig. Tech Papers, pp. 168–169, Feb. 2010.
[51] O. Tyshchenko, A. Sheikholeslami, H. Tamura, M. Kibune, H. Yamaguchi, J. Ogawa, "A 5-Gb/s ADC-Based
Feed-Forward CDR in 65 nm CMOS", IEEE J. Solid-State Circuits, vol. 45, no. 6, pp. 1091–1098, June
2010.
Link modelling
[52] V. Stojanović, A. Amirkhany, M. Horowitz, "Optimal Linear Precoding with Theoretical and Practical Data Rates in
High-Speed Serial-Link Backplane Communication", International Conference on Communications (ICC),
2004.
[53] G. Balamurugan, N. Shanbhag, "Modeling and Mitigation of Jitter in Multi-Gbps Source-Synchronous I/O Links",
Proceedings of the 21st International Conference on Computer Design (ICCD), 2003.
[54] G. Balamurugan, B. Casper, J. Jaussi, M. Mansuri, F. O'Mahony, "Modeling and Analysis of High-Speed I/O
Links", IEEE Transactions on Advanced Packaging, Vol. 32, Feb. 2009.
[55] J. Ren, D. Oh, S. Chang, "Hybrid Statistical and Time-Domain Simulation Methodology for High-Speed Links",
DesignCon 2010.
[56] S. Chun, G. Peterson, R. Mandrekar, D. Dreps, M. Sorna, T. Beukema, "Method to Determine Optimum
Equalization for Maximum Eye in High-Speed Computer System", IEEE 19th Conference on Electrical
Performance of Electronic Packaging and Systems (EPEPS), pp. 293-296, Oct. 2010.
Additional/General Papers
[57] M. Lee, W. Dally, P. Chiang, "Low-power area-efficient high-speed I/O circuit techniques," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1591 - 1599, November 2000.
[58] J. Kim, M. Horowitz, "Adaptive supply serial links with sub-1-V operation and per-pin clock recovery," IEEE
Journal of Solid-State Circuits, vol. 37, pp. 1403 - 1413, November 2002.
[59] K. J. Wong, H. Hatamkhani, M. Mansuri, and C. K. Ken Yang, "A 27-mW 3.6-Gb/s I/O transceiver," IEEE Journal
of Solid-State Circuits, vol. 39, pp. 602 - 612, April 2004.
[60] B. Casper, J. Jaussi, F. O'Mahony, M. Mansuri, K. Canagasaby, J. Kennedy, E. Yeung, and R. Mooney, "A
20Gb/s forwarded clock transceiver in 90nm CMOS," IEEE International Solid-State Circuits Conference, vol.
XLIX, pp. 90 - 91, February 2006.
[61] R. Gonzalez, B. Gordon, and M. A. Horowitz, "Supply and threshold voltage scaling for low power CMOS," IEEE
Journal of Solid-State Circuits, vol. 32, pp. 1210 - 1216, August 1997.
Acronyms
ASST      At-Speed Structural Test
BER       Bit Error Rate
CDR       Clock and Data Recovery
CML       Current Mode Logic
CTLE      Continuous Time Linear Equalizer
DCVS      Differential Cascode Voltage Switch (circuit)
DFE       Decision Feedback Equalizer
DLL       Delay Locked Loop
FFE       Feed-forward Equalizer
FFT       Fast Fourier Transform
FS-CMOS   Full-swing CMOS (=inverter based clocking)
INL       Integral Non-Linearity
ISI       Intersymbol Interference
LSSD      Level Sensitive Scan Design
PLL       Phase-Locked Loop
P-PLL     Phase Programmable PLL
PPF       Poly-phase Filter
RX        Receiver
SC-DFE    Switched-Cap DFE
S.E.      Single-Ended
SOI       Silicon On Insulator
SST       Source-Series Terminated
VCO       Voltage Controlled Oscillator
T/H       Track and Hold
TX        Transmitter