This exercise concerns a familiar computational kernel: matrix multiplication. Here's a simple implementation:
/*
 * mm
 *
 * Multiply A by B, leaving the result in C.
 * A is assumed to be an l x m matrix, B an m x n matrix.
 * The result matrix C is of course l x n.
 * The result matrix is assumed to be initialised to zero.
 */
void mm1(double A[SZ][SZ], double B[SZ][SZ], double C[SZ][SZ])
{
    int i, j, k;

    for (i = 0; i < SZ; i++)
        for (j = 0; j < SZ; j++)
            for (k = 0; k < SZ; k++)
                C[i][j] += A[i][k] * B[k][j];
}
Your job is to characterise the behaviour of several variants of the matrix multiply program: the straightforward version shown above, an interchanged ``ikj'' variant, and a more sophisticated ``tiled'' version. You can find the source code in
/homes/phjk/ToyPrograms/ACA07/MM
prompt> mkdir ACA07
prompt> cd ACA07
prompt> cp -r /homes/phjk/ToyPrograms/ACA07/MM ./
(The ``./'' above is the destination of the copy - your current working directory.) You should now have a copy of the MM directory. Now list the contents:
prompt> cd MM
prompt> ls
Makefile  MM1.c  MM2.c  MM3.c  MM4.c  scripts  TaskGraph
The files MM1.c, MM2.c, etc. contain the various versions of the matrix multiply that we're going to play with. The TaskGraph directory is there to explore on your own if you have time.
Compile the programs using make:
prompt> make all
This uses the GNU C compiler to build the binary for the Linux x86 machines we're using. It also builds binaries for the SimpleScalar simulator.
Now run the programs as follows:
prompt> ./MM1.x86
prompt> ./MM2.x86
prompt> ./MM3.x86
prompt> ./MM4.x86
The goal of this exercise is for you to figure out why the four versions run at different speeds.
To use the simulators you need to work with a smaller problem size
(for MM3 and MM4 this must be a multiple of the block size, which is
64 by default). The problem size is determined by the #define
``SZ'', which you can set using make (if you don't, it defaults to 576 = 512 + 64).
Type:
prompt> make clean
prompt> make MYFLAGS=-DSZ=256
(The ``make clean'' deletes the old binaries.)
Cachegrind is a cache simulator that works on x86 binaries. Try:
prompt> valgrind --tool=cachegrind ./MM1.x86
This produces a summary of the total memory accesses, and the cache miss rates for instructions and data: ca. 154M instructions fetched, 34M data reads, 17M writes, an L1D read miss rate of 6%, and an L1D write miss rate of 0.1%.
Note that the cache behaviour of the application changes markedly with problem size. With the default problem size, for MM1, the results are quite different:
I   refs:      1,736,711,935
I1  misses:            1,283
L2i misses:              744
I1  miss rate:           0.0%
L2i miss rate:           0.0%

D   refs:        771,148,947  (576,695,016 rd + 194,453,931 wr)
D1  misses:      194,653,251  (194,528,626 rd +     124,625 wr)
L2d misses:       24,097,352  ( 23,972,772 rd +     124,580 wr)
D1  miss rate:          25.2% (       33.7%   +         0.0%  )
L2d miss rate:           3.1% (        4.1%   +         0.0%  )

L2 refs:         194,654,534  (194,529,909 rd +     124,625 wr)
L2 misses:        24,098,096  ( 23,973,516 rd +     124,580 wr)
L2 miss rate:            0.9% (        1.0%   +         0.0%  )
When you run cachegrind, it produces a log file in your current directory, called something like cachegrind.out.12345. Suppose you have just executed MM1.x86; now try:
prompt> cg_annotate --12345 MM1.c
(where 12345 is the number from the log file's name - it is the process id). This should produce a source-code listing showing which lines of MM1.c suffer substantial cache misses. As a shorthand, you can type:
prompt> cg_annotate --`./scripts/lastcgpid` MM2.c
(This uses a little shell script to find the pid of the most recent cachegrind run.)
You can give cachegrind parameters to control the cache being simulated; by default it chooses parameters similar to those of the machine you are using. For further details see
http://developer.kde.org/~sewardj/docs-2.2.0/cg_main.html#cg-top
(or google ``cachegrind first off'').
Now, rebuild for problem size 128:
prompt> make clean
prompt> make MYFLAGS=-DSZ=128
Then try using the SimpleScalar simulator:
prompt> /homes/phjk/simplesim-3.0/sim-outorder ./MM1.ss
The first thing this does is list the parameters of the architecture being simulated - all of these details can be changed using command-line arguments. Then it runs the application and outputs details of how the processor microarchitecture and memory hierarchy were used.
A script ``varycachesize'' has been provided for running a sequence of experiments over a range of direct-mapped cache sizes. For example:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192
(It is worth finding a fast, unshared machine for this.) The output format has been designed to be easy to use with ``gnuplot''; the miss rate is column 4:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192 > varycachesize_MM1_256_8192.dat
prompt> gnuplot
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with linespoints
Edit the script to study the effect of other cache parameters. Edit it to use sim-outorder, so you can study other microarchitecture features.
Plot a graph that shows how the L1 data cache miss rate (as measured with SimpleScalar) varies with cache size for MM1, MM2, MM3 and MM4, and explain the result. The range 64-8192 is interesting. To plot several curves on one graph:
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with lp, \
              "varycachesize_MM2_256_8192.dat" using 4 with lp
You might enjoy having a look at the TaskGraph subdirectory:
prompt> cd TaskGraph
prompt> make
prompt> ./mm
This version uses the TaskGraph Library, a prototype tool developed in my research group. You can write a program that builds the original loop nest, then programmatically applies transformations, before running the generated code. You could extend this to search systematically for the optimum version of the code.
Paul Kelly, Imperial College, 2007