Figure 1 shows the Tomasulo processor structure, extended with the Re-Order Buffer, as presented in the lecture slides. Figure 2 shows a variation of this design, in which the Reservation Stations and the Re-Order Buffer are combined into a single structure, the Register Update Unit (RUU). This is the design simulated by SimpleScalar.
In this question, assume a static-pipeline machine with the following characteristics:
Consider the following loop:
LOOP: SD   0(R1),F4    ; stores into M[i]
      ADDD F4,F0,F2    ; adds to M[i+1]
      LD   F0,16(R1)   ; loads M[i+2]
      BLT  R1,R2,LOOP  ; delayed branch if R1 < R2
      ADD  R1,R1,#8    ; executed whether branch taken or not

Assume that R1 initially points to the start of an array M of doubles, and R2 points to the end.
The loop in Exercise 5.2 is the core of a fast implementation of a simple operation. Write down the assembler code for the straightforward version of the loop. Consider how it would be executed on a dynamically-scheduled version of the machine using Tomasulo's algorithm with a re-order buffer, and compare its timing with the behaviour of the loop shown in Exercise 5.2. What do you conclude?
Paul Kelly, Imperial College, 2012
Fetch; Issue (which includes decode and register fetch); Read Operands (where the Reservation Station collects operands from completing instructions; possibly omitted if the operands are available at issue); Execute; Write Back (on the CDB); and Commit. Different sources give these stages confusingly different names. A stage may be split across two (or more) cycles for implementation reasons. Some designs introduce further stages to account for delays in moving data around the chip. See, for example, the Pentium 4, which has two "Drive" stages in its pipeline.
In the Tomasulo design, the RSs are co-located with the functional units. When an instruction is issued, the issue unit has to pick an RS fronting an FU appropriate for that instruction. In the RUU design, the RSs form a uniform pool: any one would do. It is the Dispatch unit that assigns a ready instruction to an FU. This seems easier and more efficient, as we know at dispatch time which FU is actually free (though if service times were known in advance there might be no actual performance advantage).
The Tomasulo+ROB Issue stage has to read the source registers, find a free RS, write the resulting tag to the destination register, and route the instruction to the RS and the ROB.
In the RUU design the Issue stage's work is slightly simpler: it is easy to find a free RS, since the RUU is managed as a circular buffer, with instructions issued at one end and committed at the other (when an instruction commits, its RUU slot becomes free). Also, the instruction only has to be passed to the RUU; it does not have to be routed across the chip to the FU. Instead, it is the Dispatch stage that does this.
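The circular-buffer management described here can be sketched in a few lines of C. This is a minimal illustration under assumed names and an assumed 8-entry capacity, not SimpleScalar's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal sketch of an RUU managed as a circular buffer: instructions
   are issued at the tail and committed, in program order, at the head.
   A slot becomes free again only when its instruction commits. */
#define RUU_SIZE 8                 /* assumed capacity, for illustration */

typedef struct {
    int head;                      /* oldest in-flight instruction */
    int tail;                      /* next free slot               */
    int count;                     /* slots currently occupied     */
} ruu_t;

/* Issue: take the slot at the tail; report failure if the RUU is full. */
static bool ruu_issue(ruu_t *r, int *slot) {
    if (r->count == RUU_SIZE)
        return false;              /* structural hazard: issue must stall */
    *slot = r->tail;
    r->tail = (r->tail + 1) % RUU_SIZE;
    r->count++;
    return true;
}

/* Commit: retire the oldest instruction, freeing its slot. */
static bool ruu_commit(ruu_t *r) {
    if (r->count == 0)
        return false;
    r->head = (r->head + 1) % RUU_SIZE;
    r->count--;
    return true;
}
```

Note how finding a free RS is trivial (it is always the tail slot, if any slot is free at all), which is exactly the simplification claimed for the RUU's Issue stage.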
As we see in the Pentium 4, the optimum may be a compromise between these two extremes.
In this question, assume a static-pipeline machine with the following characteristics:
LOOP: SD   0(R1),F4    ; stores into M[i]
      ADDD F4,F0,F2    ; adds to M[i+1]
      LD   F0,16(R1)   ; loads M[i+2]
      BLT  R1,R2,LOOP  ; delayed branch if R1 < R2
      ADD  R1,R1,#8    ; executed whether branch taken or not

Assume that R1 initially points to the start of an array M of doubles, and R2 points to the end.
Take care to identify and explain all possible hazards.
SD 0(R1),F4      IF ID EX M
ADDD F4,F0,F2       IF ID EX EX EX EX WB
LD F0,16(R1)           IF ID EX M  M  M  WB
BLT R1,R2,LOOP            IF ID EX
ADD R1,R1,#8                 IF ID EX M  WB
SD 0(R1),F4                     IF ID EX M
ADDD F4,F0,F2                      IF ID EX EX EX EX WB
LD F0,16(R1)                          IF ID EX M  M  M  WB
BLT R1,R2,LOOP                           IF ID EX
ADD R1,R1,#8                                IF ID EX M  WB

Basic idea: (1 mark)

No stalls occur. (1 mark)
Hazards:
for i = 0 to N do A[i] := A[i] + K

(2 marks)
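In C, the operation the loop implements is simply the following (the array and parameter names are illustrative, not from the exercise; K is the constant held in F2):

```c
#include <assert.h>

/* The operation the loop implements: add a constant K to every
   element of an array M of n doubles. */
void add_k(double *M, int n, double K) {
    for (int i = 0; i < n; i++)
        M[i] = M[i] + K;
}
```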
The code is designed to allow as much time as possible for load and FP add operations to take place, without unrolling. To do this, instructions are migrated backwards to the earliest point they can be issued. An instruction can be migrated round the loop, so it initiates work belonging to a later iteration.
When this happens, it is also necessary to migrate the instruction into the start-up sequence, so that the first two iterations are primed appropriately:
      LD   F0,0(R1)    ; loads M[0]
      ADDD F4,F0,F2    ; computes new M[0], ready for SD
      LD   F0,8(R1)    ; loads M[1], ready for ADDD
Loop: SD   0(R1),F4    ; stores into M[i]
      ADDD F4,F0,F2    ; adds to M[i+1]
      LD   F0,16(R1)   ; loads M[i+2]
      BLT  R1,R2,LOOP
      ADD  R1,R1,#8

(2 marks)
(You might also want to handle the last two iterations separately, to avoid loading elements beyond the end of M.)
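The effect of the migration, including the priming sequence and the separate handling of the final iterations, can be mirrored in C with explicit temporaries standing in for F0 and F4. This is a sketch under my own naming; the real point is the assembly schedule above:

```c
/* Software-pipelined version of add-K-to-every-element, mirroring the
   scheduled assembly: f0 plays the role of F0 (the value just loaded),
   f4 the role of F4 (the value ready to store). The three statements
   before the loop prime the first two iterations; the two statements
   after it drain the last two, avoiding loads past the end of M. */
void add_k_pipelined(double *M, int n, double K) {
    if (n < 2) {                       /* too short to pipeline */
        for (int i = 0; i < n; i++) M[i] += K;
        return;
    }
    double f0 = M[0];                  /* LD   F0,0(R1)  : load M[0]       */
    double f4 = f0 + K;                /* ADDD F4,F0,F2  : new M[0] for SD */
    f0 = M[1];                         /* LD   F0,8(R1)  : load M[1]       */
    for (int i = 0; i + 2 < n; i++) {
        M[i] = f4;                     /* SD   0(R1),F4  : store M[i]      */
        f4 = f0 + K;                   /* ADDD F4,F0,F2  : new M[i+1]      */
        f0 = M[i + 2];                 /* LD   F0,16(R1) : load M[i+2]     */
    }
    M[n - 2] = f4;                     /* drain the last two iterations    */
    M[n - 1] = f0 + K;
}
```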
A straightforward implementation of the loop above would look like:
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2   ; adds to M[i]
      SD   0(R1),F4   ; stores into M[i]
      BLT  R1,R2,LOOP
      ADD  R1,R1,#8

This implementation suffers stalls in a static pipeline:
cycle            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17

LD F0,0(R1)      IF ID EX M  M  M  WB
ADDD F4,F0,F2       IF ID EX EX EX EX WB   (stall due to F0)
SD 0(R1),F4            IF ID EX M  WB      (forwarding of F4 to the M stage)
BLT R1,R2,LOOP            IF ID EX M  WB
ADD R1,R1,#8                 IF ID EX M  WB
LD F0,0(R1)                     IF ID EX M  M  M  WB
...

Exactly how many stalls occur depends on the design of the floating-point units. In a statically-scheduled design, the constraint that ID stages access registers in order leads to very poor performance.
The simple loop gets a CPI of 10/5 = 2.0. With a 10 GHz clock, the clever loop would take 100 µs to do 1M iterations, the simple one 200 µs. The simple one is 100% slower; the clever one takes 100/200 = 50% as long, i.e. it is twice as fast.
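The CPI arithmetic can be checked in a few lines (the per-iteration cycle counts, 10 versus 5 for the 5-instruction loop body, are the figures assumed in this answer):

```c
#include <assert.h>

/* CPI = cycles per iteration / instructions per iteration. At the same
   clock rate, execution time scales with cycles per iteration, so the
   ratio of the two CPIs is also the ratio of the two running times. */
static double cpi(int cycles_per_iter, int instrs_per_iter) {
    return (double)cycles_per_iter / instrs_per_iter;
}
```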
Total for 5.2: (8 marks)
(This exercise is based on an example in Hennessy and Patterson (first ed), page 325).
The loop in Exercise 5.2 is the core of a fast implementation of a simple operation. Write down the assembler code for the straightforward version of the loop. Consider how it would be executed on a dynamically-scheduled version of the machine using Tomasulo's algorithm, and compare its timing with the behaviour of the loop shown in Exercise 5.2. What do you conclude?
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2   ; adds to M[i]
      SD   0(R1),F4   ; stores into M[i]
      ADD  R1,R1,#8
      BLT  R1,R2,LOOP

(1 mark)
cycle            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19

Iteration 1:
LD F0,0(R1)      IF IS RO M  M  M  WB
ADDD F4,F0,F2       IF IS RO EX EX EX EX WB   (F0 forwarded from LD via the CDB)
SD 0(R1),F4            IF IS RO M             (F4 forwarded from ADDD via the CDB)
ADD R1,R1,#8              IF IS RO EX WB
BLT R1,R2,LOOP               IF IS RO EX WB

Iteration 2:
LD F0,0(R1)      IF IS RO M  M  M  WB         (loop-carried structural hazard due to the CDB)
ADDD F4,F0,F2       IF IS RO EX EX EX EX WB
SD 0(R1),F4            IF IS RO M
ADD R1,R1,#8              IF IS RO EX WB
BLT R1,R2,LOOP               IF IS RO EX WB

Iteration 3:
LD F0,0(R1)      IF IS RO M  M  M  WB
ADDD F4,F0,F2       IF IS RO EX EX EX EX WB
SD 0(R1),F4            IF IS RO M
ADD R1,R1,#8              IF IS RO EX WB
BLT R1,R2,LOOP               IF IS RO EX WB

(3 marks) (As in the previous question, various variations on this answer are acceptable.)
Conclusion: you can sometimes, maybe often, get the performance advantage of dynamic scheduling via compile-time instruction scheduling. (2 marks)
Note the need for RO to initiate two operations in one cycle.
There is contention for the CDB, which delays forwarding in some iterations (e.g. iteration 2, cycles 12-13). However, the delay does not accumulate, since at most one instruction is issued per cycle and the CDB handles one result per cycle.
Total for 5.3: (6 marks)
Total: 20 marks