Advanced Computer Architecture
Chapter 6: Prediction
Lecturer: Paul Kelly (p.kelly@ic.ac.uk)
Course web pages: http://www.doc.ic.ac.uk/~phjk/AdvancedComputerArchitecture

Motivation
There are many opportunities for speculation; in each case, effectiveness relies on good prediction:
Control hazards: given a conditional jump, and limited resources, which branch should be speculatively executed?
Cache block replacement: which of the candidates for replacement will be reused furthest in the future?
Adaptive cache allocation: is the newly-loaded value more likely to be reused than the cached value it would overwrite?
Prefetching: can we predict the next memory load address sufficiently well to initiate the access in advance?
Value prediction: given a data hazard, can we guess what the register value will be?

Prediction - a cross-cutting issue
Interestingly, this concept cuts across many aspects of system architecture, including
branch prediction
caching
prefetching
cache coherency
profile-driven dynamic re-optimisation
process synchronisation, etc.
In this segment of the course we look at a number of techniques, and try to identify some underlying principles.

Overview
the n-bit branch history table revisited
branch prediction by branch correlation
tradeoff: using too little context vs trying to use too much
combining predictors
the predicated execution/correlated branches anomaly
The prediction cost matrix

The n-bit branch history table (BHT) - revisited

n-bit BHT - how well does it work?

n-bit BHT - why does it work so well?
The n-bit BHT predictor is essentially a saturating counter: a taken outcome increments it, a not-taken outcome decrements it
predict taken if the most significant bit is set
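
To make the mechanism concrete, here is a minimal behavioural sketch of an n-bit saturating counter in Python (the weakly-taken starting value and the class layout are illustrative choices, not any particular machine's design):

```python
class SatCounter:
    """n-bit saturating counter: counts in [0, 2**bits - 1];
    the upper half of the range predicts taken."""
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1        # 3 for a 2-bit counter
        self.count = self.max // 2 + 1    # start weakly taken

    def predict(self):
        # predict taken iff the most significant bit is set
        return self.count > self.max // 2

    def update(self, taken):
        if taken:
            self.count = min(self.count + 1, self.max)  # saturate high
        else:
            self.count = max(self.count - 1, 0)         # saturate low
```

A single anomalous not-taken outcome moves a saturated counter only one step, so the next prediction for a heavily-biased branch is still taken; this hysteresis is why the scheme handles loop-closing branches so well.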

Bias

Is local history all there is to it?
The bimodal predictor uses the BHT to record “local history” - the prediction information used to predict a particular branch is determined only by its memory address
Consider the following sequence:

Global history
Definition: Global history. The taken/not-taken history of all previously-executed branches.
Idea: use global history to improve branch prediction
Compromise: use only the m most recently-executed branches
Implementation: keep an m-bit Branch History Register (BHR) - a shift register recording the taken/not-taken direction of the last m branches
Question: How to combine local information with global information?

Two-level correlating branch predictor

This is an (m,n) “gselect” correlating predictor:
m global bits record the behaviour of the last m branches
These m bits select which of the 2^m n-bit BHTs to use
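
A behavioural sketch of the (m,n) gselect scheme in Python (the table sizes, PC index width, and weakly-taken initial state are illustrative assumptions):

```python
class GSelect:
    """(m, n) gselect: the m-bit global history selects one of 2**m
    branch-history tables; low-order PC bits index within each table.
    Entries are n-bit saturating counters."""
    def __init__(self, m=2, n=2, pc_bits=10):
        self.bhr = 0                          # global branch history register
        self.bhr_mask = (1 << m) - 1
        self.max = (1 << n) - 1
        self.pc_mask = (1 << pc_bits) - 1
        # 2**m tables, each initialised weakly taken
        self.tables = [[self.max // 2 + 1] * (1 << pc_bits)
                       for _ in range(1 << m)]

    def predict(self, pc):
        ctr = self.tables[self.bhr][pc & self.pc_mask]
        return ctr > self.max // 2            # MSB set => predict taken

    def update(self, pc, taken):
        t = self.tables[self.bhr]
        i = pc & self.pc_mask
        t[i] = min(t[i] + 1, self.max) if taken else max(t[i] - 1, 0)
        # shift the outcome into the global history register
        self.bhr = ((self.bhr << 1) | int(taken)) & self.bhr_mask
```

A strictly alternating branch (taken, not-taken, taken, …), which a bimodal predictor gets wrong about half the time, is learned quickly here because each global-history pattern trains its own counter.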

How many bits of branch history should be used?
(2,2) is good, (4,2) is better, (10,2) is worse

Variations
There are many variations on the idea:
gselect: many combinations of n and m
global: use only the global history to index the BHT - ignore the PC of the branch being predicted (an extreme (m,n) gselect scheme)
gshare: arrange bimodal predictors in single BHT, but construct its index by XORing low-order PC address bits with global branch history shift register - claimed to reduce conflicts
Per-address Two-level Adaptive using Per-address pattern history (PAp): for each branch, keep a k-bit shift register recording its history, and use this to index a BHT for this branch (see Yeh and Patt, 1992)
Each suits some programs well but not all
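
The variants differ mainly in how the BHT index is formed from the PC and the global history. A sketch of two of the index functions (the bit widths are illustrative assumptions):

```python
def gshare_index(pc, bhr, bits=12):
    """gshare: XOR low-order PC bits with the global history register.
    History and address bits share the full index width, which is
    claimed to reduce destructive conflicts."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ bhr) & mask      # >> 2: drop the byte offset

def gselect_index(pc, bhr, pc_bits=10):
    """gselect: concatenate the history bits above the PC bits -
    equivalent to using the BHR to select among separate BHTs."""
    return (bhr << pc_bits) | ((pc >> 2) & ((1 << pc_bits) - 1))
```

In gshare, two executions of the same branch under different global histories map to different counters without reserving dedicated index bits for the history.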

Horses for courses

Extreme example - “go”
“go” is a SPEC95 benchmark code with highly-dynamic, highly-correlated branch behaviour

Some dynamic applications have highly-correlated branches
For “go”, optimum BHR size (m) is much larger

Combining predictors
Suppose we have two predictors, A and B
Suppose A works well for some applications (for example those like go with dynamic but highly-correlated branches)
Suppose B works well for highly biased branches
A is best for some programs, B is best for others
Idea: build a “selective” predictor to choose, for each branch, the predictor with the best track record to date
With care, we can guarantee to do better than A or B separately
At considerable expense!
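
One way to build such a selective (tournament) predictor is a table of 2-bit "track record" counters that drift toward whichever component has been right more often for that branch. This sketch assumes components exposing predict(pc)/update(pc, taken), and indexes the chooser by PC alone for simplicity:

```python
class Selective:
    """Meta-predictor choosing between predictors A and B per branch."""
    def __init__(self, a, b, entries=1024):
        self.a, self.b = a, b
        self.choice = [2] * entries       # >= 2 means trust A, else B

    def predict(self, pc):
        i = pc % len(self.choice)
        return self.a.predict(pc) if self.choice[i] >= 2 else self.b.predict(pc)

    def update(self, pc, taken):
        i = pc % len(self.choice)
        pa, pb = self.a.predict(pc), self.b.predict(pc)
        # move the chooser toward whichever predictor was right
        if pa == taken and pb != taken:
            self.choice[i] = min(self.choice[i] + 1, 3)
        elif pb == taken and pa != taken:
            self.choice[i] = max(self.choice[i] - 1, 0)
        self.a.update(pc, taken)          # both components keep learning,
        self.b.update(pc, taken)          # so the loser can win back later
```

Updating both components on every branch is part of the expense, but it means a currently-unused predictor stays warm.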

Warm-up effects and context-switching
In real life, applications are interrupted and some other program runs for a while (if only the OS)
This means the branch predictor's state is regularly trashed
Simple predictors re-learn fast
in a 2-bit bimodal predictor, all executions of a given branch update the same 2 bits
Sophisticated predictors re-learn more slowly
for example, in a (2,2) gselect predictor, prediction updates for a given branch are spread across 4 BHTs
A selective predictor may choose the fast-learning predictor until the better predictor warms up

Warm-up...
The best predictor takes 20,000 instructions to overtake the bimodal predictor

Correlating branch prediction - Summary
Saturating counter (bimodal) scheme handles highly-biased branches well
Some applications have highly dynamic branches
The idea of correlating prediction is to use recent branch history to predict the future
Issues:
How much branch history (BHR size)
How to minimise interference in the BHT
How to weight local history vs global history
Combining different policies dynamically (selective scheme) is expensive - but effective

Acknowledgements; finding out more
Hennessy and Patterson section 4.3
Sima, Fountain, Kacsuk, “Advanced Computer Architecture, a design space approach”, section 8.4.4
Zhendong Su and Min Zhou, A comparative analysis of branch prediction schemes.  (http://www.cs.berkeley.edu/~zhendong/cs252/project.html)
Tse-Yu Yeh and Yale N. Patt, Alternative implementations of two-level adaptive branch prediction.  ISCA’92

Prediction pay-off
Prediction pay-off matrix shows costs of false negatives and false positives

When to gamble - predication
If a branch is hard to predict, expected branch delay is large
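
The gamble can be made explicit as expected-cost arithmetic. In this hedged illustration the misprediction penalty and per-path costs are made-up numbers, not any particular pipeline's:

```python
def expected_branch_cost(mispredict_rate, penalty_cycles):
    """Average cycles lost per execution of a predicted branch."""
    return mispredict_rate * penalty_cycles

def should_predicate(mispredict_rate, penalty_cycles, both_paths_extra_cycles):
    """Predicate when the expected misprediction cost exceeds the
    fixed cost of executing both sides of the hammock."""
    return expected_branch_cost(mispredict_rate, penalty_cycles) > both_paths_extra_cycles
```

With a 10-cycle misprediction penalty, a 50% mispredict rate costs 5 cycles per execution on average, so paying 3 extra cycles to execute both sides of a short hammock wins; at a 5% mispredict rate it does not.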

Instructions for predication
Idea #1: conditional move instruction

Instructions for predication
Idea #2: predicated instructions

Predication versus prediction
Prediction: gamble that you know whether the branch is taken, and follow just one trace
Predication: anticipate that you don’t really know whether the branch will be taken or not, but realise that you can afford to execute both traces

Predication - issues
Predication is useful when
conditional body is short
branch prediction accuracy is poor
Predication requires space in the instruction encoding
Particularly valuable with wide multi-issue (why?)
Can affect the prediction accuracy of other branches
Not to be confused with speculative execution:
with predication, whether an instruction takes effect depends on a predicate computed by preceding instructions
with speculation, whether an instruction's effect is kept depends on branches resolved by later instructions

Predicting multiple branches
In a multi-issue superscalar, several branches could appear in a single instruction packet
Can we predict the outcome of all of them as soon as the packet is fetched?
In many cases yes, though branch prediction table structure somewhat messy
Extend the idea: can we predict which trace control will take through a sequence of branches?
yes… suits both bimodal and correlated patterns
get branch target buffer to cache this trace as a single-issue packet
this “trace cache” idea is becoming important
eg Pentium 4
But how is cache filled?  What can we do in the fill unit?

More applications of prediction - data prefetching
Suppose the processor has recently issued loads to addresses a_1, a_2, …, a_n
Hardware data prefetch unit monitors the sequence and calculates
What the value of a_{n+1} is most likely to be
Whether this is known with sufficient confidence
If so, initiates a speculative load
If a_1, a_2, …, a_n have a fixed stride, prediction is easy
Typical programs access several different arrays with different strides/patterns
Issues:
Could compiler do better?
Bad decision could make performance worse (how?)
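
A per-PC stride detector can be sketched as follows (the two-hit confidence threshold and the dictionary-based table are illustrative assumptions; real units use small fixed-size tables):

```python
class StridePrefetcher:
    """Remembers the last address and stride for each load instruction
    and predicts the next address only after the same stride has been
    seen twice in a row (a simple confidence check)."""
    def __init__(self):
        self.table = {}   # load PC -> (last_addr, stride, confidence)

    def observe(self, pc, addr):
        """Record a load; return a predicted next address, or None."""
        last, stride, conf = self.table.get(pc, (None, None, 0))
        if last is not None:
            s = addr - last
            conf = conf + 1 if s == stride else 0   # reset on stride change
            stride = s
        self.table[pc] = (addr, stride, conf)
        if conf >= 2:                 # confident: same stride twice running
            return addr + stride      # candidate a_{n+1} to prefetch
        return None
```

Gating on confidence matters: a mispredicted prefetch wastes memory bandwidth and can evict useful data, which is one way a bad decision makes performance worse.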

More applications of prediction - cache bypass
Suppose the cache line loaded by this instruction turns out never to be reused
If this happens a lot, we could decide not to allocate data into the cache - so cache space is not used unnecessarily
Often combined with hardware prefetching
Eg in IBM Power3:
hardware prefetch engine identifies fixed-stride access stream, initiates prefetching
prefetched data is allocated directly into L1 cache but bypasses L2 cache
(similar ideas in Pentium 4, Itanium etc).

More applications of prediction - value prediction
Suppose computation is stalled waiting for data from memory
What can we do in the meantime?
What if the data always turns out to be the same…
Speculatively execute the computation assuming it’s the same again!
Sensible plan?

More applications of prediction - value prediction ...
Sensible plan?
Lipasti and Shen (1996) define value locality as
“the likelihood of a previously-seen value appearing in a storage location”
claim that for SPEC92, a simple hardware structure can predict load values correctly 49%-61% of the time
Compare with optimising compiler:
opportunistic common subexpression elimination
achieve more parallelism than the data dependences allow (!)
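
A last-value predictor in the spirit of Lipasti and Shen can be sketched with a per-PC table and a confidence counter (the threshold and table organisation are illustrative assumptions):

```python
class LastValuePredictor:
    """Guess that a load returns the same value as last time,
    gated by a small confidence counter."""
    def __init__(self, threshold=1):
        self.table = {}          # load PC -> (last_value, confidence)
        self.threshold = threshold

    def predict(self, pc):
        v, conf = self.table.get(pc, (None, 0))
        return v if conf >= self.threshold else None   # None: don't speculate

    def update(self, pc, value):
        v, conf = self.table.get(pc, (None, 0))
        conf = min(conf + 1, 3) if value == v else 0   # reset on a new value
        self.table[pc] = (value, conf)
```

A predicted value lets dependent instructions start speculatively while the real load is in flight; their results must be squashed if the loaded value turns out to differ.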

Prediction - summary
Speculation implies prediction
many conditional branches are bimodal but not static
a significant minority are dynamic - not bimodal
many of those are correlated with other branches
analysis of payoff matrix can tell you when it’s better to execute both branches - predication
many branch predictor variants, including the ‘selective’ predictor, which attempts to use the best of several other predictors
multi-issue processors need to predict a trace of branches
similar ideas can be applied to other forms of speculation, including prefetching, load-no-allocate, adaptive load-exclusive, value prediction etc