Analog Devices
TigerSHARC® DSP
Family
Presented By: Mike Lee and
Mike Demcoe
Date: April 8th, 2002
1
TigerSHARC Architectural
Overview
High performance, 128-bit successor to the ADSP-2106x SHARC
family
ADSP-TS101S, the newest TigerSHARC DSP, operates at 250MHz!
Multiple computational units
Two compute blocks, each containing a register file, ALU, multiplier, and
shifter.
Two additional integer ALUs
Two hardware loop counter registers
Can execute up to four independent 32-bit instructions at a time
Or, eight 16-bit instructions
Very wide word widths for high precision arithmetic
Designed to be used in a multiple processor environment
2
TigerSHARC Architecture Overview
(cont…)
BTB (Branch Target Buffer) as a means of
alleviating issues with the deep pipeline
32-instruction,
4-way set-associative cache
User controlled Branch Prediction
Three, 128-bit blocks of memory which provide
access to a program and two data operands
without causing instruction/data conflicts.
Load-store, Harvard architecture, like SHARC.
Native support for complex number instructions
3
The TS101S Architecture
4
Details of Multiple Compute Blocks
Two computational units, each containing:
file – Multi-ported to allow multiple accesses
to registers in a single clock cycle
Register
General purpose registers!
Contains 32 words, each word being 32-bits in length.
– Fixed-point and floating point
Multiplier – Fixed-point and floating point
ALU
Also features MAC (multiply-and-accumulate) capabilities
– Standard logical and arithmetic shifts as well
as bit manipulation
Shifter
5
The TS101S Pipeline
Fetch 1
Fetch
Stages
Fetch 2
Fetch 3
IAB
Decode
Integer
Execution
Stages
Access
Execute 1
Execute 2
6
Pipelines and Instruction Related
Information
ADSP-21061
Three
stage pipeline
20ns instruction cycle
SISD but can put instructions in parallel
ADSP-TS101S
Eight
stage pipeline with IAB
4ns instruction cycle
MIMD and can also put instructions in parallel
7
Loops, Branching and Timers
ADSP-21061
Zero-overhead
hardware loop support
Delayed Branching
One timer
ADSP-TS101S
Little
support for zero-overhead hardware loops
32-entry 4-way associative BTB cache with Branch
prediction
Two timers
8
Memory and Buses
ADSP-21061
1
Mbit dual ported SRAM
Shared by three buses (PM, DM, I/O)
PM and DM share a port while the I/O receives it’s
own
ADSP-TS101S
6
Mbit of SRAM (Quad Ported??)
User defined partitions
Each block is accessed by one 128-bit bus
9
Multiplication and other Nifty
Tricks
ADSP-21061
MAC
instructions (MRF and MRB)
Various precision output (32, 40, or 80 bit)
ADSP-TS101S
Each
compute block has it’s own set of MAC registers
8 16-bit MAC with 40-bit accumulation or 2 32-bit
MAC with 80-bit accumulation
Complex number MAC instructions
128-bit accelerator
Trellis decoding (8 Trellis butterflies per cycle)
10
Data Address Generation
ADSP-21061
2
data address generation units (DAGS)
8 circular buffers per DAG
ADSP-TS101S
2
data address generation units (IALU)
4 circular buffers per IALU
Both support modulo arithmetic, bit reversal
addressing, and post and pre-modify instructions
11
Ease of Use
ADSP-21061
Easy to use
Algebraic instruction set
Visual DSP environment
ADSP-TS101S
Similar
to 21061 but know have to consider 2
compute blocks
ADI suggests leaving parallelization to their optimizing
compiler
Visual DSP environment
12
Specific DSP Algorithms and the
TigerSHARC
In ENEL515 (and/or related articles) we’ve
studied the FIR, IIR, and FFT algorithms
TigerSHARC has a massively parallel
architecture that is tailored to performing
these algorithms.
13
FIR Filter Characteristics
Think back (or forward, depending on how much you’ve
procrastinated) to Lab #3.
FIR Characteristics
Simple, long loop
Repetitive calculations (multiply, then add!)
Access to an array of coefficients, and an array of “delay-line”
values
Few data dependency issues during the calculation of a single
output
For a filter of length N, require N multiplications and N
adds to obtain a single output value.
14
TigerSHARC and the FIR Filter
The general idea is: Divide and conquer!
Take a filter of size N and split it into two groups of N/2
Utilize the TigerSHARC’s multiple computational units and MAC
instructions to perform the algorithm in ½ the time (plus some overhead)
Two hardware loop counters to simultaneously control the two new
“N/2” size FIR loops with no overhead!
Can do all of the following SIMULTANEOUSLY!
Fetch two operands (one coefficient, one delay line value) from two
separate memory banks
Fetch the next instruction
Perform arithmetic operations on the PREVIOUS operands!
Unlike SHARC, instruction/data clashes are non-existant due to the
numerous bus paths linking computational units to memory space
15
TigerSHARC and the FIR Filter
(continued….)
8-cycle-deep pipeline
The long loop characteristic of the FIR filter algorithm
allows us to keep the 8-cycle-deep pipeline full
Stalls are expensive..
Branch Target Buffer reduces performance loss that results from
branching in a deeply pipelined processor
Full pipeline means fast algorithm
FIR Filter algorithms rely heavily on data sets that are
aligned in memory
Post-increment is your friend
TigerSHARC Quad Data Accesses – Supply four aligned words
to one compute block or two aligned words to each compute
block.
16
Example Instructions
X/Y
Conditional Compute
if xALE; do, R0=R1+R2
Condition codes,
AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1.
A = Adder, M = Multiplier, S = Shifter
Memory Addessing
Indirect post-modify with update, register offset:
YR20=[J1+=J2]
Indirect post-modify with update, 8-bit immediate offset:
Q[K1+=0xF8]=XYR3:0
Indirect pre-modify no update, register offset:
J3:2=L[K1+K2]
Indirect pre-modify no update, immediate offset:
YR3:2=L[K1+0x0003333]
Complex Quad 16-bit Fixed Point Multiplication Instructions
{X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})}
{X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}
17
FIR Code Example
18
TigerSHARC and the IIR Filter
Short, simple loop characteristic
Means loop overhead is more of a concern
Means keeping the pipeline full is tougher!
Time to unroll the loop, although ADI says to let
VisualDSP do it for you.
Again, split up the calculations on an N-tap IIR filter into two
N/2 sets operating simultaneously
Idea: One computational block does feedforward
calculations, one does feedback!
Complex numbers commonly required
Hardware support for complex MAC in TigerSHARC
Again, Quad Data Access comes in handy for aligned data
Post-increment is still your friend
19
TigerSHARC and the FFT
Does not use the same MAC modes that IIR and FIR filters do.
Requires more complicated addressing modes
Example: Bit reverse addressing
Difficult to split onto separate computational units and even more
difficult to split amongst distributed processors
Requires large arrays of complex variables and fixed coefficients
Found on both SHARC and TigerSHARC
Hardware complex number MAC comes in handy again!
Large arrays of aligned data – Quad Data access again!
Requires HIGH-PRECISION arithmetic
Luckily we have 64-bit fixed point arithmetic and 40-bit extended floating
point arithmetic.
80-bit MAC precision
FFT Requires many intermediate values
32 GP registers in a single computational block
20
http://www.analog.com/technology/dsp/Sharc/benchmarks.html
21
http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html
22
Conclusion
TigerSHARC have a very SHARC-like
architecture, except it’s MUCH more complex.
Highly
optimized for parallelism
Major features: Complex number support,
multiple computational units, high instruction
throughput, wider buses.
Performs DSP algorithms including FIR, IIR, FFT
significantly faster than SHARC!
23
References
1. http://www.analog.com/productSelection/pdf/ADSP-21061_L_b.pdf
2. http://products.analog.com/products/info.asp?product=ADSP-TS101-S
3. http://www.analog.com/technology/dsp/TigerSHARC/backgrounder.html
4. http://www.analog.com/library/dspManuals/Tigersharc_hardware.html
5. http://www.analog.com/library/dspManuals/Tigersharc_instruction.html
6. http://www.btid.com/procsum/tsfloat.htm
7. http://www.analog.com/library/applicationNotes/dsp/tigerSharc/EE-147.pdf
8. http://www.analog.com/technology/dsp/TigerSHARC/architecture.html
9. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsintr.pdf
(2-182 - 2-188)
10. ADSP-2106x SHARC User’s Manual, Second Edition
11. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsin_flw.pdf
(3-9 - 3-16)
24
Note from Dr. Smith
Information on Burg algorithm outside
ICT536. It is essentially an FIR filter used
for prediction (i.e. what FIR coefficients
are needed so that the filtered signal is
"white noise" )
25