# 332 Advanced Computer Architecture Chapter 2: part 2

## Dynamic scheduling, out-of-order execution, register renaming with speculative execution

Hennessy and Patterson 6<sup>th</sup> ed Section 3.6 pp208-217 and pp234-238

October 2022 Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (4-6<sup>th</sup> ed), and on the lecture slides of David Patterson's Berkeley course (CS252)

#### Course materials online on

https://scientia.doc.ic.ac.uk/2223/modules/60001/materials and https://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/aca20/

## What about Precise Interrupts?

Tomasulo had:

In-order issue, out-of-order execution, and out-of-order completion

- Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream
  - Suppose we have a page fault or a divide-by-zero exception?
- Actually we have the same issue with branch speculation...
- The answer: add a stage that "commits" the state
- In issue order

#### Four Steps of the Speculative Tomasulo Algorithm

#### 1. Issue—get instruction from FP Op Queue

If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")

#### 2. Execution—operate on operands (EX)

When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")

#### 3. Write result—finish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

#### 4. Commit—update register with reorder result

When an instruction is at the head of reorder buffer, *and* its result is present:

update the (commit-side) register with the result (or store to memory), and remove the instruction from the reorder buffer.

#### Mispredicted branch flushes reorder buffer



Tomasulo without Re-order Buffer



















#### Some subtleties to think about...

 It's vital to reduce the branch misprediction penalty. Does the Tomasulo+ROB scheme described here roll-back as soon as the branch is found to be mispredicted?

- This discussion has assumed a single-issue machine. How can these ideas be extended to allow multiple instructions to be issued per cycle?
  - Issue
  - Monitoring CDBs for completion
  - Handling multiple commits per cycle

#### Some subtleties to think about...

 What if a second conditional branch is encountered, before the outcome of the first is resolved?







#### Some subtleties to think about...

- Stores are buffered in the ROB, and committed only when the instruction is committed.
- A load can be issued while several stores (perhaps to the same address) are uncommitted. We need to make sure the load gets the right data. See:

Shen and Lipasti "Modern Processor Design" pg 271, or <a href="http://home.eng.iastate.edu/~zzhang/courses/cpre585\_f03/slides/lecture11.pdf">http://home.eng.iastate.edu/~zzhang/courses/cpre585\_f03/slides/lecture11.pdf</a>

 This lies beyond the depth we have time to cover properly in this course, but let's look at some of the issues

## Stores and loads with speculation

- We need to make sure stores are not sent to memory until the store instruction is committed
- We need to stall loads until all preceding stores have committed
  - **?**
  - Or: until all possibly-aliasing stores have committed?
  - Or: until the addresses of all preceding uncommitted stores have been determined
- If/when the addresses of a load and all preceding uncommitted stores are known...
  - And if none of the store addresses match the load
  - Then the load can proceed
  - If the address of the load matches the address of an uncommitted store, we can forward the store's data to the load



## Store-to-load forwarding

- The Tomasulo scheme works on registers it derives dependences between register-register instructions
- The registers being used are always known at issue time
- Loads and stores use computed addresses, which may or may not be known at issue time – consider:

```
    i1 SD F0 0(R3) // store F0 at address R3
    i2 LD R2 0(R1) // load an address from memory
    i3 SD F1 0(R2) // store F1 to that address
    i4 LD F2 0(R3) // load F1 from address R3
```

- Can we (should we?) forward F0 from i1 to i4?
- What if R1=R3?
- We could wait (as shown in previous slide)
- We could speculate! And then check for the misprediction
- We could add a forwarding predictor, to improve the speculation

## Store-to-load forwarding

- Memory dependence \*speculation\* is the idea that we might allow a load to proceed\* before we know for sure which, if any, prior uncommitted store instruction writes to its address\*\*.
- (\* proceed either by forwarding a value from some store whose \*value\* is known, or proceed by going to memory)
- (\*\* we may know the load's address but not (all) the addresses of the older stores. We might not know the load's address)
- Memory dependence speculation is when we use a predictor to decide when to do this.
- See <u>Memory dependence prediction Wikipedia</u>
- I think this article (start at page 8) is particularly clear:
- https://www.jilp.org/vol2/v2paper13.pdf

## Design alternatives for o-o-o processor architectures

#### • See:

- The Microarchitecture of the Pentium 4 Processor (Hinton et al, Intel Tech Jnl Q1 2001)
- The SimpleScalar Tool Set, Version 2.0 (Burger and Austin, http://www.simplescalar.com/docs/users\_guide\_v2.pdf)
- Wattch: a framework for architectural-level power analysis and optimizations (Brooks et al, ISCA 2000)
   <u>www.tortolaproject.com/papers/brooks00wattch.pdf</u>

#### Specifically:

- Register Update Unit (RUU, as in Simplescalar) versus
   Re-Order Buffer
- Realisation in Pentium III and Pentium 4 ("Netburst")
  - Frontend and Retirement Register Alias Tables (RATs)



Figure 5. Pipeline for sim-outorder

- Simplescalar is a software simulation of a processor microarchitecture
- It simulates a multi-issue out-of-order design with speculative execution
- Many aspects of the design can be controlled by parameters
- Simplescalar uses a Register Update Unit, which combines ROB and reservation stations in a single pool



#### RUU vs ROB

- In the Tomasulo+ROB design shown in these slides, registers and ROB entries have a tag
  - Every register, ROB entry and reservation station needs a comparator to monitor the CDB

- With the RUU, the tags are the ROB entry numbers
  - So the ROB is indexed by the tag on the CDB
  - The ROB entry serves are a renamed register for its instruction's result



- A Register Alias Table keeps track of latest alias for logical registers
- Once retired, data is copied from the ROB to the RRF

#### Q: How are registers allocated and freed?

- 128 Register File (RF) is separated from the ROB - which now only consists of status fields
- A unique, in-order sequence number is allocated for each uop that points to the corresponding ROB entry



#### **Basic Pentium III Processor Misprediction Pipeline**

| 1     | 2     | 3      | 4      | 5      | 6      | 7      | 8       | 9        | 10   |  |
|-------|-------|--------|--------|--------|--------|--------|---------|----------|------|--|
| Fetch | Fetch | Decode | Decode | Decode | Rename | ROB Rd | Rdy/Sch | Dispatch | Exec |  |

#### **Basic Pentium 4 Processor Misprediction Pipeline**

| 1    | 2     | 3    | 4    | 5     | 6     | 7   | 8   | 9   | 10  | 11  | 12  | 13   | 14   | 15 | 16 | 17 | 18   | 19    | 20    |  |
|------|-------|------|------|-------|-------|-----|-----|-----|-----|-----|-----|------|------|----|----|----|------|-------|-------|--|
| TC N | xt IP | TC F | etch | Drive | Alloc | Ren | ame | Que | Sch | Sch | Sch | Disp | Disp | RF | RF | Ex | Flgs | Br Ck | Drive |  |

#### Out-of-order processing – Four instructions per cycle

#### Example: Naive implementation (roughly from cc -S):

```
void f() {
  int i, a;
  for (i=1;
  i <= 1000000000;
                i++)
     a = a+i;
```

Real example

X86 code (slightly tidied but without register allocation)

```
movl $1,-4(%ebp)
 jmp .L4
.L5
  movl -4(%ebp),%eax
  addl %eax,-8(%ebp)
  incl -4(%ebp)
.L4:
  cmpl $100000000,-4(%ebp)
 ile .L5
```

# Unoptimised: movl \$1,-4(%ebp) jmp .L4 .L5 movl -4(%ebp),%eax addl %eax,-8(%ebp) incl -4(%ebp) .L4: cmpl \$1000000000,-4(%ebp)

## Optimised:

movl \$1,%edx
.L6:
addl %edx,%eax
incl %edx
cmpl \$1000000000,%edx
jle .L6

5 instructions in the loop Execution time on 2.13GHz Intel Core2Duo: 3.87 seconds (3.87 nanoseconds/iteration, 8.24 cycles)

ile .L5

4 instructions in the loop, no references to main memory

Execution time on 2.13GHz

Intel Core2Duo: 0.48 seconds
(0.48 nanoseconds/iteration,
1.02 cycles)

Time per instruction fell: 0.77 nanoseconds to 0.12 Optimised code runs at four instructions per cycle

## Wikipedia (!): Resources

http://en.wikipedia.org/wiki/Register\_renaming

#### Papers:

- Instruction issue logic for high-performance, interruptable pipelined processors. G. S. Sohi, S. Vajapeyam. International Conference on Computer Architecture, 1987 (<a href="http://doi.acm.org/10.1145/30350.30354">http://doi.acm.org/10.1145/30350.30354</a>)
- Towards Kilo-instruction processors. Cristal, Santana, Valero, Martinez ACM Trans. Architecture and Code Optimization (<a href="http://doi.acm.org/10.1145/1044823.1044825">http://doi.acm.org/10.1145/1044823.1044825</a>)

#### Other simulators:

- Simplescalar: <u>www.simplescalar.com/</u>
- Gem5: <a href="http://www.gem5.org">http://www.gem5.org</a>
- Liberty: <a href="http://liberty.cs.princeton.edu/">http://liberty.cs.princeton.edu/</a>
- SimFlex: <a href="http://parsa.epfl.ch/simflex/">http://parsa.epfl.ch/simflex/</a>
- SIMICS: <a href="http://www.windriver.com/products/simics/">http://www.windriver.com/products/simics/</a>

## Dynamic scheduling - summary

- Dynamic instruction scheduling is attractive:
  - Reduced dependence on compile-time instruction scheduling (and compiler knowledge of hardware)
  - Handles dynamic stalls due to cache misses
  - Register renaming frees architecture from constraints of the instruction set
- Comes with costs
  - Increases pipeline depth, and misprediction latency
  - Increased power consumption and area (but not by all that much if you are careful and clever)
  - Increased complexity and risk of design error
  - Hard to predict performance, hard to optimise code