Scalar Operand Networks
for Tiled Microprocessors
Michael Taylor
Raw Architecture Project
MIT CSAIL
(now at UCSD)
Until 3 years ago, computer architects used the N-way superscalar
to encapsulate the ideal for a parallel processor…
- nearly "perfect" but not attainable

                                Superscalar (or VLIW)
"PE"->"PE" communication        Free
Exploitation of parallelism     Implicit (hw scheduler or compiler)
Clean semantics                 Yes
Scalable                        No
Power efficient                 No

  mul $2,$3,$4
  add $6,$5,$2      <- the dependent add receives the mul's result "for free"
What's great about superscalar microprocessors?
 It's the networks!
Fast, low-latency, tightly-coupled networks
(0-1 cycles of latency, no occupancy)
- For lack of a better name, let's call them Scalar Operand Networks (SONs)
- Can we combine the benefits of superscalar communication with multicore scalability?
- Can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled … network
interfaces" – Jose Duato, OCIN, Dec 6, 2006)
The industry shift toward Multicore
- attainable but hardly ideal

                                Superscalar     Multicore
"PE"->"PE" communication        Free            Expensive
Exploitation of parallelism     Implicit        Explicit
Clean semantics                 Yes             No
Scalable                        No              Yes
Power efficient                 No              Yes
What we'd like – neither superscalar nor multicore

                                Superscalar     Multicore
"PE"->"PE" communication        Free            Expensive
Exploitation of parallelism     Implicit        Explicit
Clean semantics                 Yes             No
Scalable                        No              Yes
Power efficient                 No              Yes

Superscalars have fast networks and great usability;
multicore has great scalability and efficiency.
Why communication is expensive on multicore

[Diagram: an operand traveling from Multiprocessor Node 1 to Node 2 pays
send occupancy, send overhead, and send latency on the sending side, the
transport cost of the network itself, and receive latency, receive overhead,
and receive occupancy on the receiving side.]
Multiprocessor SON Operand Routing

[Diagram, sender side: send occupancy covers the launch sequence
(destination node name, sequence number, value); send latency covers commit
latency and network injection. Receiver side: receive occupancy covers the
receive sequence, demultiplexing, and branch mispredictions; plus injection
cost and receive latency.]

.. similar overheads for shared-memory multiprocessors: store instruction,
commit latency, spin locks (+ attendant branch mispredicts)
Defining a figure of merit for scalar operand networks

5-tuple <SO, SL, NHL, RL, RO>:
  SO  - Send Occupancy
  SL  - Send Latency
  NHL - Network Hop Latency
  RL  - Receive Latency
  RO  - Receive Occupancy
(Tip: the ordering follows the timing of a message from sender to receiver.)

We can use this metric to quantitatively differentiate SONs from existing
multiprocessor networks…
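As an illustration only, here is a minimal sketch of how the 5-tuple could be
folded into an end-to-end operand cost. The additive model and the
operand_cost helper are assumptions made for this sketch (the talk only
defines the tuple itself); the two example tuples are the Power4 and Raw
points quoted on later slides.

```c
#include <stdio.h>

/* 5-tuple figure of merit for a scalar operand network. */
struct son_cost {
    int so;   /* send occupancy:  cycles the sender's ALU is tied up     */
    int sl;   /* send latency                                            */
    int nhl;  /* network hop latency, per hop                            */
    int rl;   /* receive latency                                         */
    int ro;   /* receive occupancy: cycles the receiver's ALU is tied up */
};

/* Assumed additive model: SO + SL + NHL*hops + RL + RO cycles end to end. */
static int operand_cost(struct son_cost n, int hops)
{
    return n.so + n.sl + n.nhl * hops + n.rl + n.ro;
}

int main(void)
{
    struct son_cost power4 = {2, 14, 0, 14, 4};  /* Power4 on-chip, quoted later */
    struct son_cost raw    = {0,  0, 1,  2, 0};  /* Raw, quoted later            */

    printf("Power4, 3 hops: %d cycles\n", operand_cost(power4, 3));  /* 34 */
    printf("Raw,    3 hops: %d cycles\n", operand_cost(raw, 3));     /*  5 */
    return 0;
}
```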
Impact of Occupancy ("o" = so + ro)
[Diagram: Proc 0 and Proc 1, with one processor left with nothing to do
while it waits.]
If (o * "surface area" > "volume"), it is not worth it to offload:
the overhead is too high (the parallelism is too fine-grained).

Impact of Latency
The lower the latency, the less work is needed to keep myself busy while
waiting for the answer; otherwise it is not worth it to offload:
I could have done it myself faster (not enough parallelism to hide latency).
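A minimal sketch of the two break-even tests implied above, assuming
"surface area" means the number of operands crossing into and out of the
offloaded region and "volume" means the cycles of work inside it; the
function names and example numbers are illustrative, not from the talk.

```c
#include <stdbool.h>
#include <stdio.h>

/* Occupancy test: o = send occupancy + receive occupancy per operand.
   Offloading is not worth it when communication cycles exceed the work moved. */
static bool worth_offloading_occupancy(int so, int ro,
                                       int surface_operands, int volume_cycles)
{
    int o = so + ro;
    return o * surface_operands <= volume_cycles;  /* false => overhead too high */
}

/* Latency test: offloading is not worth it when the round trip exceeds the
   independent work available to overlap with it. */
static bool worth_offloading_latency(int round_trip_cycles, int independent_work_cycles)
{
    return round_trip_cycles <= independent_work_cycles;  /* false => do it yourself */
}

int main(void)
{
    /* Fine-grained region: 8 operands cross, only 10 cycles of work inside. */
    printf("%d\n", worth_offloading_occupancy(2, 4, 8, 10));   /* 0: too fine-grained */
    /* 30-cycle round trip hidden behind 100 cycles of independent work.      */
    printf("%d\n", worth_offloading_latency(30, 100));         /* 1: latency hidden   */
    return 0;
}
```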
The interesting region

  Power4 (on-chip):            <2, 14, 0, 14, 4>
  Superscalar (not scalable):  <0, 0, 0, 0, 0>
Tiled Microprocessors (or "Tiled Multicore")
(with a scalable SON)

                                 Superscalar   Multicore   Tiled Multicore
PE-PE (ALU-ALU) communication    Free          Expensive   Cheap
Exploitation of parallelism      Implicit      Explicit    Both
Scalable                         No            Yes         Yes
Power efficient                  No            Yes         Yes
Transforming from multicore or superscalar to tiled

  Superscalar    -- add scalability  -->  Tiled
  CMP/multicore  -- add scalable SON -->  Tiled
The interesting region

  Power4 (on-chip):            <2, 14, 0, 14, 4>
  Raw:                         <0, 0, 1, 2, 0>
  Tiled "Famous Brand 2":      <0, 0, 1, 0, 0>
  Superscalar (not scalable):  <0, 0, 0, 0, 0>
Scalability Problems in Wide-Issue Microprocessors

[Diagram: a 16-issue datapath: a PC and wide-fetch (16 inst) control, a
unified load/store queue, a register file (RF), and a bypass network
connecting 16 ALUs.]
Area and Frequency Scalability Problems

[Diagram: as the number of ALUs N grows, the centralized register file and
bypass network grow roughly as ~N² and ~N³ in area. Ex: Itanium 2. Without
modification, frequency decreases linearly or worse.]
Operand Routing is Global

[Diagram: every result (e.g. from a "+" or a ">>") is broadcast across the
full bypass network to the register file and all of the ALUs.]
Idea: Make Operand Routing Local

[Diagram: the same ALUs, register file, and bypass network, but operands are
routed only between nearby ALUs instead of broadcast globally.]
Idea: Exploit Locality

[Diagram: the ALU array with communicating operations placed close together
so that most operands travel only a short distance.]
Replace the crossbar with a point-to-point,
pipelined, routed scalar operand network.

[Diagram: the global bypass crossbar is replaced by a mesh of ALUs, each
with a local register file, connected by point-to-point routed links;
operands (e.g. from a "+" and a ">>") hop between neighboring ALUs.]
Operand Transport Scaling – Bandwidth and Area

For N ALUs and N½ bisection bandwidth:

                                         Local BW    Area
  Un-pipelined crossbar bypass           ~N½         ~N²
  (as in a conventional superscalar)
  Point-to-point routed mesh network     ~N          ~N   (scales as 2-D VLSI)

We can route more operands per unit time if we are able
to map communicating instructions nearby.
Operand Transport Scaling – Latency

Time for an operand to travel between instructions mapped to different ALUs:

                              Un-pipelined   Point-to-point
                              crossbar       routed mesh network
  Non-local placement         ~N             ~N½
  Locality-driven placement   ~N             ~1

Latency bonus if we map communicating instructions
nearby so communication is local.
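To make the asymptotics above concrete, here is a toy evaluation of the
growth rates in the two tables; the constants are arbitrary, and only the
scaling exponents come from the slides.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Growth rates from the tables: crossbar area ~N^2, latency ~N;
       routed mesh area ~N, latency ~N^(1/2) (non-local placement) or ~1
       (locality-driven placement).  Units are arbitrary. */
    for (int n = 16; n <= 1024; n *= 4) {
        double xbar_area = (double)n * n;
        double mesh_area = n;
        double xbar_lat  = n;
        double mesh_lat  = sqrt((double)n);
        printf("N=%4d  area: crossbar %7.0f vs mesh %4.0f"
               "   latency: crossbar %4.0f vs mesh %5.1f vs local ~1\n",
               n, xbar_area, mesh_area, xbar_lat, mesh_lat);
    }
    return 0;
}
```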
Distribute the Register File

[Diagram: the monolithic register file is split into many small per-ALU
register files, while the PC, wide fetch (16 inst) control, and unified
load/store queue remain centralized.]
More Scalability Problems

[Diagram: the PC, wide fetch (16 inst) control, unified load/store queue,
and the instruction and data caches (I$, D$) are the remaining centralized
structures; they too must be distributed.]
Distribute the rest:
Raw – a Fully-Tiled Microprocessor

[Diagram: the design becomes an array of identical tiles, each with its own
PC, I$, D$, RF, and ALU. Tiles!]

Tiled Microprocessors
- fast inter-tile communication through the SON
- easy to scale (same reasons as multicore)
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw,
a scalable microprocessor with a scalar operand network.
Raw Microprocessor
Tiled scalable microprocessor
Point-to-point pipelined networks
16 tiles, 16 issue
Each 4 mm x 4 mm tile:
MIPS-style compute processor
- single-issue 8-stage pipe
- 32b FPU
- 32K D Cache, I Cache
4 on-chip networks
- two for operands
- one for cache misses
- one for message passing
Raw Microprocessor Components

[Diagram: each tile pairs a compute processor with on-tile routers.
 - Compute processor: fetch unit, instruction cache, data cache, functional
   units, execution core, and the intra-tile SON.
 - Static router (switch processor): its own instruction cache plus a
   crossbar, forming the inter-tile SON.
 - Two dynamic routers: the "MDN" (used by the trusted core, e.g. for cache
   misses) and the "GDN" (used by the untrusted core), the generalized
   transport networks.
 - Inter-tile network links connect neighboring tiles.]
Raw Compute Processor Internals

[Diagram: the compute processor pipeline (stages labeled F, A, TL, E, M1,
M2, U, P in the figure) with registers r24–r27 mapped onto the on-chip
networks.
 Ex: fadd r24, r25, r26 reads both operands from, and writes its result to,
 network-mapped registers, so the operation communicates over the SON as a
 side effect of ordinary register reads and writes.]
Tile-Tile Communication

  Sending tile:        add $25,$1,$2    (the result targets network-mapped $25)
  Sender's switch:     Route $P->$E     (processor port to the East neighbor)
  Receiver's switch:   Route $W->$P     (West neighbor to the processor port)
  Receiving tile:      sub $20,$1,$25   (reads the arriving operand as $25)
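A toy C model of what the sequence above does, for readers who have not seen
register-mapped networks before. This is an illustration under stated
assumptions, not the Raw ISA: the shared one-entry FIFO stands in for the
static network, net_send/net_recv are hypothetical helpers, and a real SON
would stall the reader until the operand arrives rather than assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int  net_value;            /* one-entry operand FIFO between the two tiles */
static bool net_full = false;

/* Writing the "network register" launches the operand toward the neighbor. */
static void net_send(int v) { assert(!net_full); net_value = v; net_full = true; }

/* Reading it consumes the operand; a real SON would stall until it arrives. */
static int net_recv(void) { assert(net_full); net_full = false; return net_value; }

/* Sending tile:   add $25,$1,$2  -> the sum goes straight onto the network.   */
static void tile_sender(int r1, int r2) { net_send(r1 + r2); }

/* Receiving tile: sub $20,$1,$25 -> one source comes straight off the network. */
static int tile_receiver(int r1) { return r1 - net_recv(); }

int main(void)
{
    tile_sender(3, 4);                  /* producer computes 3 + 4 and sends 7 */
    printf("%d\n", tile_receiver(10));  /* consumer prints 10 - 7 = 3          */
    return 0;
}
```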
Compilation

RawCC assigns instructions to the tiles, maximizing locality. It also
generates the static router instructions that transfer operands between
tiles.

  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ….

[Diagram: the compiler expands this code into a dataflow graph of scalar
operations (pval4=pval5+2.0, tmp3.6=pval4/3.0, v1.8=pval7*3.0, …), then
partitions the graph across the tiles and inserts the inter-tile operand
routes.]
One cycle in the life of a tiled micro

[Diagram: a snapshot of the tile array: some tiles run a 4-way automatically
parallelized C program, two run a 2-thread MPI app, one runs httpd, and one
is idle (Zzz...); memory ports (mem) and direct I/O streams feed straight
into the Scalar Operand Network at the edges.]

An application uses only as many tiles as needed to
exploit the parallelism intrinsic to that application…
One Streaming Application on Raw

[Diagram: all 16 tiles (Tile 0 through Tile 15) occupied by a single
streaming application, with very different traffic patterns than
RawCC-style parallelization.]
Auto-Parallelization Approach #2:
StreamIt Language + Compiler

[Diagram: the original stream graph: splitters and joiners feeding parallel
FIRFilter chains and Vec Mult -> FIRFilter -> Magnitude -> Detector
pipelines. After fusion, the compiler merges filters into fewer, coarser
nodes.]

End results – auto-parallelized by MIT StreamIt to 8 tiles.
AsTrO Taxonomy: Classifying SON diversity

  Assignment (Static/Dynamic): Is instruction assignment to ALUs
                               predetermined?
  Transport (Static/Dynamic):  Are operand routes predetermined?
  Ordering (Static/Dynamic):   Is the execution order of instructions
                               assigned to a node predetermined?
Microprocessor SON diversity using the AsTrO taxonomy

                Assignment   Transport   Ordering
  Raw           Static       Static      Static
  Scale         Static       Static      Dynamic
  RawDyn        Static       Dynamic     Static
  TRIPS         Static       Dynamic     Dynamic
  ILDP          Dynamic      Dynamic     Static
  WaveScalar    Dynamic      Dynamic     Dynamic
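As a small illustration, the taxonomy can be written down as data; the
entries below simply restate the table above (itself reconstructed from the
slide), so treat them as illustrative rather than authoritative.

```c
#include <stdio.h>

typedef enum { STATIC, DYNAMIC } Policy;

/* One SON, classified along the three AsTrO axes. */
typedef struct {
    const char *name;
    Policy assignment;  /* is instruction-to-ALU assignment predetermined? */
    Policy transport;   /* are operand routes predetermined?               */
    Policy ordering;    /* is per-node execution order predetermined?      */
} AstroClass;

static const char *pol(Policy p) { return p == STATIC ? "Static" : "Dynamic"; }

int main(void)
{
    const AstroClass sons[] = {
        {"Raw",        STATIC,  STATIC,  STATIC},
        {"Scale",      STATIC,  STATIC,  DYNAMIC},
        {"RawDyn",     STATIC,  DYNAMIC, STATIC},
        {"TRIPS",      STATIC,  DYNAMIC, DYNAMIC},
        {"ILDP",       DYNAMIC, DYNAMIC, STATIC},
        {"WaveScalar", DYNAMIC, DYNAMIC, DYNAMIC},
    };
    for (size_t i = 0; i < sizeof sons / sizeof sons[0]; i++)
        printf("%-10s  A=%-7s T=%-7s O=%s\n", sons[i].name,
               pol(sons[i].assignment), pol(sons[i].transport),
               pol(sons[i].ordering));
    return 0;
}
```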
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw,
a scalable microprocessor with a scalar operand network.
Raw Chips
October 02
Raw
16 tiles (16 issue)
180 nm ASIC (IBM SA-27E)
~100 million transistors
1 million gates
3-4 years of development
1.5 years of testing
200K lines of test code
Core Frequency:
425 MHz @ 1.8 V
500 MHz @ 2.2 V
18 W average power
Frequency competitive with IBM-implemented PowerPCs in the same process.
Raw motherboard
Support Chipset implemented in FPGA
A Scalable Microprocessor in Action
[Taylor et al, ISCA ’04]
Conclusions
- Scalability problems in general-purpose processors can be addressed by
  tiling resources across a scalable, low-latency, low-occupancy scalar
  operand network (SON).
- These SONs can be characterized by a 5-tuple and the AsTrO classification.
- The 180 nm, 16-issue Raw prototype shows the approach is feasible;
  64+-issue is possible in today's VLSI processes.
- Multicore machines could benefit from adding an inter-node SON for cheap
  communication.