
Final Project
Super-Scalar
Stream Buffer
Victim Cache
CAM-based Cache
Group

Steve Fang
Kent Lin
Jeff Tsai
Qian Yu
Abstract
The final design is a super-scalar MIPS microprocessor. This allows the processor to
handle two instructions in parallel if there are no dependencies between them.
Furthermore, because more instructions are executed per cycle compared to the
single-issue processor, instruction cache misses become a significant bottleneck for the
processor. Therefore, a stream buffer is included to relieve this problem. Additionally, a
victim cache is also added to reduce the miss rate of the data-cache. Overall, these two
features are added as solutions to the problem of a very costly DRAM access. Finally, a
CAM style cache organization has been added to the first level cache.
Division of Labor
Implementation of the super-scalar design was split up into three parts: schematic
drawing and wiring, control logic and hazard handling, and memory. Memory consists of
the memory controller for the new DRAM component, the stream buffer and the victim
cache. The assignment breakdown was Jeff handled the memory controller, Qian worked
on the datapath drawing/wiring, Steve worked on the stream buffer, and Kent worked on
the victim cache. Each member was responsible for testing and verification of their
assigned task or component. In addition, everyone contributed to the control logic and
hazard handling due to the complexity of the subject.
Detailed Strategy
Datapath
The structure of the datapath itself is fairly simple; much like the single-issue, 5-stage
pipeline covered in class, the super-scalar version features the same stages, but many
components are duplicated to allow for complete parallelism. Moreover, the datapath is
aligned such that only even instructions can go through the even (top) pipeline,
while only odd instructions can enter the odd one. A high-level diagram is given below.
[High-level diagram: a shared Fetch stage feeds two parallel Decode and Execution pipelines (even and odd), which merge back into shared Memory and Write-back stages.]
In general, many components of the datapath were duplicated for the parallel execution of
the even and the odd instructions, while some components were modified with more
ports. As a result, the instruction bandwidth of the datapath increases, and so does the
throughput of the processor (that is, the CPI decreases).
The following table presents a big picture of how the classic 5-stage datapath was
changed at each pipeline stage.
Table 1. Datapath Modification

PIPELINE STAGE | COMPONENTS DUPLICATED             | COMPONENTS WITH WIDER BANDWIDTH
IF             | Pipeline registers                | Instruction Cache outputs
ID             | Instruction Decode Controller,    | Register File outputs,
               | Extender, Branch Comparator,      | Hazard Controller inputs/outputs,
               | Forwarding muxes, Branch PC bus,  | Stall Unit inputs,
               | Jump PC bus, Pipeline registers   | Forwarding mux inputs
EX             | ALU, SLT Unit, Shifter,           | None
               | Pipeline registers                |
ME             | Pipeline registers,               | None
               | Memory Source Muxes               |
WB             | Register File Source Select Mux,  | Register File inputs
               | Monitors                          |
In the instruction fetch stage, the output bandwidth of the cache is doubled so that the
datapath can fetch both the even and odd instructions in parallel. Since the even and the
odd instructions are in the same 2-word cache line of the cache, the change was made in
the cache datapath. On the other hand, the data cache of the MEM stage does not need to be
modified, because our memory system allows only one memory access at a time; the
stall units of the decode stage separate two parallel memory-access instructions.
The major modifications of the datapath are in the instruction decode stage and the
execution stage. Components that operate on an individual instruction, such as the
branch comparator, the extender, and the ALU, are duplicated. The hazard and stall
units, however, have to take instructions of the ID, EX, and MEM stages from both the
even and the odd pipelines in order to determine dependencies. One special case is the
register file: to work with the super-scalar scheme, it must be able to write and read two
different registers, as well as the same register, in parallel. If both the even and the odd
instructions write back to the same register file location, the odd instruction must win in
order to preserve program order and prevent a WAW hazard.
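
A minimal VHDL sketch of this write-port priority follows; all entity and signal names are illustrative, not taken from the actual schematic. Writing the even port first and the odd port second within one clocked process lets the odd instruction win whenever both target the same register, since the last assignment in a VHDL process takes effect.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity regfile_wb is
  port (
    clk        : in std_logic;
    even_we    : in std_logic;                      -- even instruction writes back
    odd_we     : in std_logic;                      -- odd instruction writes back
    even_waddr : in std_logic_vector(4 downto 0);
    odd_waddr  : in std_logic_vector(4 downto 0);
    even_wdata : in std_logic_vector(31 downto 0);
    odd_wdata  : in std_logic_vector(31 downto 0)
  );
end regfile_wb;

architecture rtl of regfile_wb is
  type reg_array is array (0 to 31) of std_logic_vector(31 downto 0);
  signal regs : reg_array := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- write the even result first ...
      if even_we = '1' then
        regs(to_integer(unsigned(even_waddr))) <= even_wdata;
      end if;
      -- ... then the odd result, so the odd (later) instruction wins
      -- whenever both ports target the same register (the WAW case)
      if odd_we = '1' then
        regs(to_integer(unsigned(odd_waddr))) <= odd_wdata;
      end if;
    end if;
  end process;
end rtl;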
Beyond the general modifications, the 2-way super-scalar pipeline adds complexity in
calculating new PCs. The following table shows the PC calculation of the different types
of instructions.
Table 2. PC Calculation

INSTRUCTION TYPE          | NEW PC CALCULATION
Arithmetic, Logic, Memory | PC = PC + 8
Even Branch               | PC = PC + 4 + Branch Value
Odd Branch                | PC = PC + 8 + Branch Value
Jump, Jr                  | PC = Jump Value
Even JAL                  | PC = Jump Value; $R31 = PC + 4
Odd JAL                   | PC = Jump Value; $R31 = PC + 8
To give a more intuitive view of the implementation, schematic captures of the branch and
jump PC calculations are presented below.
Picture 1 shows the implementation of the branch PC calculation. We cannot use PC[31:0]
for the calculation of the even branch address, since PC = PC + 8 in the super-scalar
pipeline. Therefore, PCD[31:3] (the PC of the decode stage, the "current" PC) is used as
the base PC. The result signal of the even branch comparator is used as bit 2, which in
effect adds 4 to the base PC.
Picture 1. Branch PC calculation
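
A minimal VHDL sketch of the even-branch case described above, with illustrative names; the odd-branch case (PC + 8 + Branch Value) is omitted here.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity branch_pc_calc is
  port (
    pcd           : in  std_logic_vector(31 downto 0); -- "current" PC in decode
    branch_offset : in  std_logic_vector(31 downto 0); -- sign-extended, shifted left by 2
    even_taken    : in  std_logic;                     -- even branch comparator result
    branch_target : out std_logic_vector(31 downto 0)
  );
end branch_pc_calc;

architecture rtl of branch_pc_calc is
  signal base : std_logic_vector(31 downto 0);
begin
  -- PCD[31:3] with the comparator result spliced in as bit 2: when the even
  -- branch is taken the base PC becomes PCD + 4, giving PC + 4 + Branch Value
  -- as in Table 2
  base <= pcd(31 downto 3) & even_taken & "00";
  branch_target <= std_logic_vector(unsigned(base) + unsigned(branch_offset));
end rtl;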
Picture 2 shows the implementation of the jump PC calculation. The first-level mux selects
between Jump and Jr instructions, whereas the second-level mux selects between the even
and the odd instruction.
Picture 2. Jump PC calculation
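
A minimal behavioral sketch of this two-level selection, again with illustrative names rather than those of the actual schematic.

library ieee;
use ieee.std_logic_1164.all;

entity jump_pc_calc is
  port (
    even_jump_target : in  std_logic_vector(31 downto 0); -- jump-immediate target
    even_jr_value    : in  std_logic_vector(31 downto 0); -- register value for Jr
    odd_jump_target  : in  std_logic_vector(31 downto 0);
    odd_jr_value     : in  std_logic_vector(31 downto 0);
    even_is_jr       : in  std_logic;
    odd_is_jr        : in  std_logic;
    select_odd       : in  std_logic;  -- '1' when the odd instruction jumps
    jump_pc          : out std_logic_vector(31 downto 0)
  );
end jump_pc_calc;

architecture rtl of jump_pc_calc is
  signal even_pc, odd_pc : std_logic_vector(31 downto 0);
begin
  -- first level: Jump vs. Jr
  even_pc <= even_jr_value when even_is_jr = '1' else even_jump_target;
  odd_pc  <= odd_jr_value  when odd_is_jr  = '1' else odd_jump_target;
  -- second level: even vs. odd instruction
  jump_pc <= odd_pc when select_odd = '1' else even_pc;
end rtl;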
Super-scalar Structure
As mentioned before, the microprocessor features two pipelines for parallel operation,
and complicated hazard and stall cases arise from this.
Stalls
There are six different stall cases that must be handled. The first two occur when either
one of the instructions in the execution stage is a branch or jump instruction. In that
case, the delay slot must be handled. More precisely, if the branch occurred in the
even slot, then the delay slot has already been executed in parallel with it. Thus, the
instructions in the decode stage (both even and odd) must be replaced with no-ops. (See
figure 1)
Figure 1 – The branch occupies the even slot and its delay slot the odd slot of the same pair; both instructions in the decode stage need to be ignored.
However, if the branch is in the odd pipeline, then the even instruction at the decode
stage needs to be executed while the odd one must be swapped with a no-op. (See figure
2)
Figure 2 – Since the branch now occurs in the odd pipeline, the delay slot comes after it,
but the following instruction must be ignored.
The third case occurs when the instruction in the execution stage is a JAL or a LW and
the even instruction in the decode stage depends on it. In that case, forwarding cannot
resolve the dependency, and both instructions in the decode stage must be stalled.
Figure 3 – A load word (LW $1, 0($10)) followed by an instruction that uses the loaded
data (ADDIU $1, $3, 2) forces a stall in both pipelines.
The fourth stall case occurs when the same dependency exists for the odd instruction at the
decode stage. Here, the even instruction is allowed to execute while the odd instruction is
stalled for one cycle. Afterwards, the even instruction is replaced with a no-op, since it has
already been executed once, and the odd instruction is permitted to enter the execution
stage. The fifth case is when both instructions in the decode stage access memory.
Since there is only one data cache, a parallel access is impossible, so the even
instruction goes first while the odd is stalled until the following cycle. Finally,
the last case occurs when the odd instruction in the decode stage has a dependency on the
even instruction in the same stage. Because both instructions are in the decode stage,
nothing has been calculated yet, so forwarding is impossible. Therefore, this is handled
identically to the previous case; that is, the odd instruction is stalled for one cycle.
(Figure 4)
Figure 4 – Two memory accesses (SW $1, 0($10) and SW $3, 4($10)) cannot be executed in
parallel, so the even instruction is executed first while the odd one is stalled until the
following cycle.
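
To make the fifth and sixth cases concrete, the following is a minimal sketch of the decode-pair check, with illustrative names; a real implementation would also ignore writes to register $0 and fold in the remaining stall cases.

library ieee;
use ieee.std_logic_1164.all;

entity decode_pair_check is
  port (
    even_writes : in  std_logic;                    -- even instruction writes a register
    even_rd     : in  std_logic_vector(4 downto 0); -- even destination register
    odd_rs      : in  std_logic_vector(4 downto 0); -- odd source register 1
    odd_rt      : in  std_logic_vector(4 downto 0); -- odd source register 2
    even_mem    : in  std_logic;                    -- even instruction accesses memory
    odd_mem     : in  std_logic;                    -- odd instruction accesses memory
    stall_odd   : out std_logic                     -- stall the odd, let the even proceed
  );
end decode_pair_check;

architecture rtl of decode_pair_check is
  signal raw_hazard, mem_conflict : std_logic;
begin
  -- sixth case: RAW dependency inside the decode pair (nothing to forward yet)
  raw_hazard <= '1' when (even_writes = '1') and
                         ((odd_rs = even_rd) or (odd_rt = even_rd))
                else '0';
  -- fifth case: only one data cache, so paired memory accesses are serialized
  mem_conflict <= even_mem and odd_mem;
  stall_odd   <= raw_hazard or mem_conflict;
end rtl;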
Because of the alignment of the pipelines, a major problem occurs when there is a branch
or jump to an odd instruction. Instructions come in pairs on this processor, so when the
target of the branch is the odd instruction, it is crucial that the even instruction paired
with it (the instruction just before the target) is not executed. Thus, there is also a
branch handler which watches for branches and jumps to odd addresses. Once the branch is
handled, the stall handler previously discussed handles the delay slot so that only one
instruction is executed there. Then, after the new PC is calculated and its corresponding
instruction fetched, the branch handler checks the target PC: if the target PC is even,
the pipeline behaves normally, whereas if the target is odd, the handler ignores the even
instruction. (See figure 5)
Figure 5 – If a branch/jump targets an odd instruction, the corresponding even
instruction must be ignored. This is handled by a special branch/jump handler that is
separate from the stall handler.
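
Because instruction pairs are 8-byte aligned, bit 2 of the target PC indicates whether the target sits in the odd slot. A minimal sketch of this check, with illustrative names:

library ieee;
use ieee.std_logic_1164.all;

entity odd_target_check is
  port (
    branch_taken : in  std_logic;                     -- a branch/jump is redirecting the PC
    target_pc    : in  std_logic_vector(31 downto 0);
    ignore_even  : out std_logic                      -- squash the even instruction of the pair
  );
end odd_target_check;

architecture rtl of odd_target_check is
begin
  -- bit 2 of an 8-byte-aligned pair distinguishes the even slot from the odd slot
  ignore_even <= branch_taken and target_pc(2);
end rtl;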
Stream Buffer
The stream buffer sits between the first-level cache and the DRAM. With a slow
memory access time, an instruction fetch on every cycle would be very inefficient. The
stream buffer improves efficiency because it performs a burst to fetch four instructions
on every miss in the stream buffer. One normal read from the DRAM takes 9 cycles, but one
burst (two consecutive reads) takes 10 cycles (a comparison is worked out after the list
below). There are three cases in which a burst occurs.
Case 1: At the very beginning of the program, the stream buffer is empty and needs to
fetch the first four instructions from the DRAM.
Case 2: The program branches/jumps to an instruction that is neither in the stream
buffer nor in the first-level cache.
Case 3: Two hits in the stream buffer are followed by a miss in the stream buffer, meaning
that the next instruction has not been prefetched.
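
Concretely, with the DRAM timings above, fetching four instructions with two separate reads versus one burst works out to

$$2 \times 9 = 18 \text{ cycles (two normal reads)} \qquad \text{vs.} \qquad 10 \text{ cycles (one burst)},$$

a saving of 8 cycles for every four sequential instructions prefetched.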
Stream Buffer Controller
The stream buffer controller works on the positive edge of a phase-shifted clock. The
delay of the phase shift is 10 ns. The reason for this phase-shifted clock signal is that the
registers in the first-level cache work on the negative edges of the clock. The request
signal from the first-level cache goes high 5 ns after a cache miss. Therefore, by
using the phase-shifted clock, we can send a fake wait_sig to the cache to tell it to stall
the instruction (before the negative edge of the normal clock) if there is a miss in the
stream buffer. The trick is to coordinate the phase-shifted clock for the stream
buffer controller with the negative edge of the normal clock for the cache and pipelined
datapath.
Without the stream buffer, the first-level cache operates mainly by looking at the wait_sig
from the DRAM. With the stream buffer in place, the stream buffer controller takes in the
wait_sig from the DRAM and sends a fake wait_sig to the first-level cache. This is
necessary because during burst mode the DRAM controller drives wait_sig low for one
cycle as each pair of words arrives, and this raw wait_sig cannot be sent directly to the
first-level cache. Therefore, the fake wait_sig is required: it stays continuously high,
as the cache's specification expects. The dips matter to the stream buffer controller,
which uses the first dip to enable the first half of the stream buffer to store the first
two instructions, and the second dip to enable the second half to store the next two. An
example of the wait_sig timing from the DRAM during a burst is shown below.
Figure 6 – Example of the wait signal behavior during a burst read: wait_sig dips low twice, once for each pair of words.
The following is the pseudo code, cleaned up into a VHDL-style process (burst_count tracks how many times wait_sig_from_DRAM has dipped low during the current burst):

process (shifted_clk)
begin
  if rising_edge(shifted_clk) then
    if match_in_buffer = '1' or request_from_cache = '0' then
      -- hit in the stream buffer, or no request pending: pass through
      wait_sig_to_cache  <= '0';
      request_to_DRAM    <= '0';
      enable_first_half  <= '0';
      enable_second_half <= '0';
      burst_count        <= 0;
    else
      -- miss: start a burst and hold the cache with the fake wait signal
      request_to_DRAM    <= '1';
      wait_sig_to_cache  <= '1';
      enable_first_half  <= '0';
      enable_second_half <= '0';
      if wait_sig_from_DRAM = '0' and burst_count = 0 then
        enable_first_half <= '1';   -- first dip: latch instructions 0 and 1
        burst_count       <= 1;
      elsif wait_sig_from_DRAM = '0' and burst_count = 1 then
        enable_second_half <= '1';  -- second dip: latch instructions 2 and 3
        burst_count        <= 2;
      end if;
    end if;
  end if;
end process;
Once the four instructions are in the stream buffer, any first-level cache miss that hits in
the buffer carries a miss penalty of only one cycle. Every time the cache controller
sends a request and the PC to the stream buffer, the stream buffer compares the PC
with the address tag associated with each half of the buffer and, based on the comparator
results, selects the matching half to output to the cache.
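
A minimal sketch of this hit check and output selection, with illustrative names and the bus widths from the schematic below:

library ieee;
use ieee.std_logic_1164.all;

entity sb_hit_select is
  port (
    addr_tag   : in  std_logic_vector(9 downto 0);  -- tag from the 1st-level cache
    tag_half0  : in  std_logic_vector(9 downto 0);  -- tag of the first 2-word half
    tag_half1  : in  std_logic_vector(9 downto 0);  -- tag of the second 2-word half
    data_half0 : in  std_logic_vector(63 downto 0); -- two 32-bit instructions
    data_half1 : in  std_logic_vector(63 downto 0);
    hit        : out std_logic;
    data_out   : out std_logic_vector(63 downto 0)  -- to the 1st-level cache
  );
end sb_hit_select;

architecture rtl of sb_hit_select is
  signal match0, match1 : std_logic;
begin
  -- both tag comparators work in parallel
  match0 <= '1' when addr_tag = tag_half0 else '0';
  match1 <= '1' when addr_tag = tag_half1 else '0';
  hit    <= match0 or match1;
  -- the comparator results drive the output mux
  data_out <= data_half0 when match0 = '1' else data_half1;
end rtl;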
High-Level Schematic of the Stream Buffer
[Schematic: four 32-bit registers organized as two 2-word halves, each with its own even address tag. The 10-bit address tag from the first-level cache is compared against both stored tags; the comparator results produce the hit signal and select the mux that drives the two 32-bit output buses to the first-level cache. Data arrives from the DRAM as even and odd words on a 64-bit bus. The stream buffer controller takes the request and address from the first-level cache and the wait_sig from the DRAM, and drives the wait_sig to the first-level cache, the request and address to the DRAM and the internal tag registers, and the register enable signals.]
Victim Cache
The implementation of the victim cache is similar to the first level cache in terms of
control logic and schematic design. The main differences are in the handling of data
input and output. The victim cache contains four cache lines worth of information, is
fully associative, and uses a FIFO (first in, first out) replacement policy. The reason for
using FIFO instead of random (as in the first level cache) is that random replacement is
more effective for larger caches. The small size of the victim cache makes it probable
that certain cache blocks would be replaced more often than others. Because the victim
cache writes out to the DRAM on a cache miss, replacing some blocks on a more frequent
basis would result in a higher AMAT (hit time + miss rate * miss penalty). A FIFO
replacement policy ensures each block has the same chance to be replaced. LRU is
theoretically the best replacement policy, but it is too difficult to implement with more
than two cache lines.
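
A minimal sketch of the FIFO replacement pointer, with illustrative names: a 2-bit counter selects the next line to evict and advances each time a new line is written, so every line waits its turn.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fifo_pointer is
  port (
    clk          : in  std_logic;
    reset        : in  std_logic;
    line_written : in  std_logic;                    -- a new line entered the cache
    replace_line : out std_logic_vector(1 downto 0)  -- next line to evict
  );
end fifo_pointer;

architecture rtl of fifo_pointer is
  signal ptr : unsigned(1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        ptr <= (others => '0');
      elsif line_written = '1' then
        ptr <= ptr + 1;   -- wraps 3 -> 0, giving first-in, first-out order
      end if;
    end if;
  end process;
  replace_line <= std_logic_vector(ptr);
end rtl;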
Top Level Schematic of Victim Cache in Memory Hierarchy
[Schematic: the datapath connects to the first-level cache (valid / dirty / tag[9:0], word 0, word 1); the victim cache, with four cache lines and two holding registers (reg 1 and reg 2), sits between the first-level cache and the arbiter/DRAM.]
Because the victim cache is fully associative, each cache block component contains a
comparator to determine a hit between the address and the tag. However, the input to each
cache block comes only from the first level cache, so the victim cache blocks are
simplified versions of the first level cache blocks. Schematic-wise, the victim cache sits
between the first level cache and the DRAM. The goal is to make the victim cache
transparent to the system so that the first level cache thinks it is talking directly to the
DRAM. Thus the victim cache intercepts all intermediate signals and must output data as
well as control lines. Muxes are used to select which cache block to output and where to
output it (first level cache or memory). To avoid losing the cache data while swapping the
two cache lines, two additional registers are added to hold the outputs from the first level
cache and the victim cache. In addition to sending data back to the first level cache, the
victim cache must also send the dirty bit: the first level cache needs to know whether the
data being sent up is dirty, so that the correct replacement behavior is used on subsequent
memory accesses.
Changes to the datapath are local to the memory system. The first level cache needs
extra output signals to send the tag, valid and dirty bits to the victim cache. In addition, it
needs an extra input to receive the dirty bit from the victim cache. The DRAM needs a
mux to choose the correct burst signal from the victim cache or the stream buffer.
Victim Cache Control
The first level cache control works on the rising edge of the clock, while the memory
controller works on the falling edge. To act transparently in the system, the victim cache
must output its control signals within this window, at the correct times for both the first
level cache and the memory controller. A delayed clock is used to set up this timing.
The following RTL describes the behavior of the victim cache control.
if (first_level_cache_request = '1') then
  reg1 <= first_level_cache_line;     -- line leaving the 1st level cache
  reg2 <= victim_cache_line;          -- hit line, or the line chosen for replacement
  if (replaced_block_dirty = '1') then
    DRAM <= reg2;                     -- write the dirty victim line back to memory
  end if;
  if (victim_cache_hit = '1') then
    first_level_cache <= reg2;        -- swap: the victim line moves up to the 1st level
  else
    first_level_cache <= DRAM;        -- miss: fetch the line from the DRAM
  end if;
  victim_cache <= reg1;               -- the evicted 1st-level line enters the victim cache
end if;
Control Signals

- To first level cache:
Because the first level cache interacts with the DRAM solely through the wait signal, the
victim cache must simulate the wait signal to get the appropriate response from the first
level cache. The wait signal must be driven high when the victim cache is writing a cache
line to the DRAM or when the first level cache is reading from the DRAM.
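
A minimal sketch of this simulated wait signal, with illustrative names; the exact conditions in the real controller may differ.

library ieee;
use ieee.std_logic_1164.all;

entity vc_wait_gen is
  port (
    writeback_to_dram : in  std_logic;  -- victim cache writing a line to the DRAM
    read_from_dram    : in  std_logic;  -- 1st-level read being serviced by the DRAM
    wait_to_cache     : out std_logic   -- simulated wait signal to the 1st-level cache
  );
end vc_wait_gen;

architecture rtl of vc_wait_gen is
begin
  -- hold the first-level cache whenever either DRAM transaction is in flight
  wait_to_cache <= writeback_to_dram or read_from_dram;
end rtl;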
- To DRAM:
Because the victim cache makes only one memory access at a time, the burst signal is held
low. The request signal comes only from the victim cache; any request from the first
level cache is first interpreted by the victim cache, which determines which memory
requests, if any, to perform.
CAM-based cache
The major difference between the CAM-based cache and a standard cache is how the
address tags are stored and handled. In the CAM-based cache, the addresses are stored
in registers entirely separate from the data words. Each tag address register is
accompanied by a comparator whose function is to determine whether the register
contains a match with the input address. The match signals from the registers are used to
output a hit data word quickly, instead of waiting on the control logic.
Top Level Schematic of CAM-Based Cache
[Schematic: a 10-bit address and 32- or 64-bit data enter the cache; the CAM array (8 cache lines x 10 bits of tag) feeds a priority encoder, which works alongside the control logic.]
The general CAM-based design uses the scheme above. At the very least, the CAM-based
cache must output a hit signal and a line select signal. All comparators work in
parallel, and each outputs a single-bit match signal. All the match signals pass to an 8-to-3
priority encoder. The hit signal (the OR of the match signals) is used by the cache control to
determine whether the data resides in the cache or a request to the DRAM is necessary. The
line select signal is used with an 8-to-1 mux to choose which cache line's data to
output. In an effort to make the cache output a hit value as quickly as possible, both the
encoder and the 8-to-1 mux are built out of gates.
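
A behavioral sketch of the hit logic and priority encoder follows; the real version is gate-level for speed, and all names here are illustrative.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity cam_hit_encoder is
  port (
    match : in  std_logic_vector(7 downto 0); -- one match bit per cache line
    hit   : out std_logic;                    -- OR of all match bits
    line  : out std_logic_vector(2 downto 0)  -- line select for the 8-to-1 mux
  );
end cam_hit_encoder;

architecture rtl of cam_hit_encoder is
begin
  hit <= '1' when match /= "00000000" else '0';

  process (match)
  begin
    line <= "000";
    -- the lowest-numbered matching line wins; in practice at most one
    -- line should ever match a given address
    for i in 7 downto 0 loop
      if match(i) = '1' then
        line <= std_logic_vector(to_unsigned(i, 3));
      end if;
    end loop;
  end process;
end rtl;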
The CAM-based cache was already mostly built during lab 6, so most of the functional
testing had been done on it there. Tests for the final project implementation were done at
the cache level to measure the delay time and the improvement over the previous
VHDL-based components. Improvement times varied according to the input patterns,
but the reduction in output delay averaged a few nanoseconds.
Results
Overall, the super-scalar processor works on simple code where there are no
dependencies. It supports all the instructions that are required in the final
processor, but only runs reliably on simple arithmetic programs. When there are
complicated, overlapping dependencies, the controller sometimes fails to insert no-ops
correctly, so certain tests fail. In particular, it is unable to exit from the partial
sums program.
Currently our processor, minus the super-scalar design, has a minimum cycle time of 76
ns. The critical path is located in the memory stage, beginning at the control logic of
the first level cache, running to the victim cache, passing data through the arbiter, and
ending at the input to the DRAM controller. Because our design has the first level cache
working on the rising edge of the clock, we effectively have only one half clock period to
get the data stable at the memory controller side. The exact timing of the critical path is
determined by the phase shifter, the delay of the victim cache control, and the load delay
of several muxes and gates in the arbiter. The total reaches 38 ns for the half clock
cycle and thus results in the 76 ns clock cycle. Compared with lab 6, which had a 42 ns
clock cycle, note that lab 6 did not have a victim cache operating on the
delayed clock; this victim cache is most likely the main reason for the increase in cycle
time.
Regarding the performance improvements from the super-scalar design and the added
memory components, the results are mixed. The longer cycle time is already a
disadvantage for the new processor. In addition, the provided test program, partial_sum,
runs tight loops with many memory loads but few memory stores, whereas our
processor is best suited not for heavy branching but for sequential code and a high volume
of memory stores. The reason is our use of the stream buffer to prefetch
sequential data and the victim cache to increase the effective cache capacity. The following
shows the results of our processor at various stages compared with the lab 6 processor.
[Chart: cycle counts for the original processor (lab 6), processor + stream buffer, processor + victim cache, and processor + SB + VC; the axis runs from 0 to 3000 cycles.]
Conclusion
In conclusion, this final project took us around 200 hours as a group to complete. The
strength of the project is that we tried to keep everything simple: when there is a bug, we
know how to attack the problem. However, when the problem lay with VHDL or
Viewlogic, we were often stuck for a while before we could think of a way around it. Two
weeks for the final project is really limited; if we had more time, we would definitely
improve our hardware components. One improvement for the stream buffer would be to
start prefetching the next two instructions as soon as the first two have been fetched by
the first-level cache. Another improvement on the first-level instruction cache that we did
not have time to do is to make it a FIFO.
The victim cache currently uses four states during a write-back to memory and a read. The
fourth state is a reset state and could probably be pushed to the falling edge of the third
state, thus saving one cycle. Finally, branching and jumping performance can be increased
by using a more aggressive stalling scheme. A lot of people surely encountered
problems with VHDL and Viewlogic: sometimes the clock will not run after cycling for
a certain time, and sometimes the clock is undefined at the very beginning for no
apparent reason. The biggest challenge still lies in getting the VHDL components to work
properly. In the stall controller, there was one case that the VHDL code never picked up;
this was eventually resolved by removing the specific cases and using a more general
scheme to deal with that one particular case (the LW dependency when the current
instruction was even was combined with the even case).
Appendix
See the attached zip file