In this question, assume a static-pipeline machine with the following
Consider the following loop:
- Integer arithmetic always takes just one clock cycle.
- There are no memory system delays.
- Note that the example exploits a delayed conditional branch.
You should find that no stalls are caused the branch.
- FP operations are identified at the ID stage and are dispatched
to separate non-pipelined functional units. Results are routed
direct to the WB stage where they are handled in issue order.
- There are two FP functional units, one for add/subtract, the
other for multiply/divide.
FP addition takes 2 cycles, FP multiplication takes 5 cycles, and
FP division takes 19 cycles (although you will not need
- There is full bypassing, i.e. wherever a hazard can be resolved using
bypassing, the bypass circuitry is present.
LOOP: SD 0(R1),F4 ; stores into M[i]
ADDD F4,F0,F2 ; adds to M[i-1]
LD F0,-16(R1) ; loads M[i-2]
BNEZ R1,LOOP ; delayed branch
SUB R1,R1,#8 ; executed whether branch taken or not
- Using your paper sideways, draw a diagram showing the timing for the
execution of the loop. You will need to look at more than one
- Indicate clearly where forwarding occurs
- If you find a stall is necessary, explain clearly why
- What is the average CPI when the loop is executed many times?
- Explain what this loop does, and describe how it is likely to be used
to perform a straightforward useful function. Show any initialisation needed. What
is going on? How does the performance of this loop compare with a
Paul Kelly, Imperial College, 2000