This exercise concerns a familiar computational kernel: matrix multiplication. Here's a simple implementation:
/*
 * mm
 *
 * Multiply A by B, leaving the result in C.
 * A is assumed to be an l x m matrix, B an m x n matrix.
 * The result matrix C is of course l x n.
 * The result matrix is assumed to be initialised to zero.
 */
void mm1(double A[SZ][SZ], double B[SZ][SZ], double C[SZ][SZ])
{
    int i, j, k;

    for (i = 0; i < SZ; i++)
        for (j = 0; j < SZ; j++)
            for (k = 0; k < SZ; k++)
                C[i][j] += A[i][k] * B[k][j];
}
Your job is to characterise the behaviour of several variants of the matrix multiply program: the straightforward version shown above, an interchanged ``ikj'' variant, and a more sophisticated ``tiled'' version. You can find the source code in
/homes/phjk/ToyPrograms/ACA07/MM
prompt> mkdir ACA07
prompt> cd ACA07
prompt> cp -r /homes/phjk/ToyPrograms/ACA07/MM ./
(The ``./'' above is the destination of the copy - your current working directory.) You should now have a copy of the MM directory. Now list the contents:
prompt> cd MM
prompt> ls
Makefile  MM1.c  MM2.c  MM3.c  MM4.c  scripts  TaskGraph
The files MM1.c, MM2.c, etc. contain the various versions of the matrix multiply that we're going to play with. The TaskGraph directory is there to explore on your own if you have time.
Compile the programs using make:
prompt> make all
This uses the GNU C compiler to build the binary for the Linux x86 machines we're using. It also builds binaries for the SimpleScalar simulator.
Now run the programs as follows:
prompt> ./MM1.x86
prompt> ./MM2.x86
prompt> ./MM3.x86
prompt> ./MM4.x86
The goal of this exercise is for you to figure out why the four versions run at different speeds.
To use the simulators you need to work with a smaller problem size
(for MM3 and MM4 this must be a multiple of the block size, which is
64 by default). The problem size is determined by the #define
``SZ'', which you can set using make (if you don't, it defaults to 576 = 512 + 64).
Type:
prompt> make clean
prompt> make MYFLAGS=-DSZ=256
(The ``make clean'' deletes the old binaries.)
Cachegrind is a cache simulator that works on x86 binaries. Try:
prompt> valgrind --tool=cachegrind ./MM1.x86
This produces a summary of the total memory accesses, and the cache miss rates for instructions and data: ca. 154M instructions fetched, 34M data reads, 17M writes, an L1D read miss rate of 6%, and an L1D write miss rate of 0.1%.
Note that the cache behaviour of the application changes markedly with problem size. With the default problem size, for MM1, the results are quite different:
I   refs:      1,736,711,935
I1  misses:            1,283
L2i misses:              744
I1  miss rate:           0.0%
L2i miss rate:           0.0%

D   refs:        771,148,947  (576,695,016 rd + 194,453,931 wr)
D1  misses:      194,653,251  (194,528,626 rd +     124,625 wr)
L2d misses:       24,097,352  ( 23,972,772 rd +     124,580 wr)
D1  miss rate:          25.2% (       33.7%   +         0.0%  )
L2d miss rate:           3.1% (        4.1%   +         0.0%  )

L2 refs:         194,654,534  (194,529,909 rd +     124,625 wr)
L2 misses:        24,098,096  ( 23,973,516 rd +     124,580 wr)
L2 miss rate:            0.9% (        1.0%   +         0.0%  )
When you run cachegrind, it produces a log file in your current directory, called something like cachegrind.out.12345. Suppose you have just executed MM1.x86; now try:
prompt> cg_annotate --12345 MM1.c
(where 12345 is the number from the log file's name - it is the process id). This should produce a source-code listing showing which lines of MM1.c suffer substantial cache misses. As a shorthand, you can type:
prompt> cg_annotate --`./scripts/lastcgpid` MM2.c
(This uses a little shell script to find the pid of the most recent cachegrind run.)
You can give cachegrind parameters to control the cache being simulated; by default it chooses parameters similar to those of the machine you are using. For further details see
http://developer.kde.org/~sewardj/docs-2.2.0/cg_main.html#cg-top
(or google ``cachegrind first off'').
Now, rebuild for problem size 128:
prompt> make clean
prompt> make MYFLAGS=-DSZ=128
Then try using the SimpleScalar simulator:
prompt> /homes/phjk/simplesim-3.0/sim-outorder ./MM1.ss
The first thing this does is list the parameters of the architecture being simulated - all of these details can be changed using command-line arguments. Then it runs the application and outputs details of how the processor microarchitecture and memory hierarchy were used.
A script ``varycachesize'' has been provided for running a sequence of experiments over a range of direct-mapped cache sizes. For example:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192
(It is worth finding a fast, unshared machine for this.) The output format has been designed to be easy to use with ``gnuplot''; the miss rate is column 4:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192 > varycachesize_MM1_256_8192.dat
prompt> gnuplot
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with linespoints
Edit the script to study the effect of other cache parameters. Edit it to use sim-outorder, so you can study other microarchitecture features.
Plot a graph that shows how the L1 data cache miss rate (as measured with SimpleScalar) varies with cache size for MM1, MM2, MM3 and MM4, and explain the result. The range 64-8192 is interesting. To plot several curves on one graph:
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with lp, \
              "varycachesize_MM2_256_8192.dat" using 4 with lp
You might enjoy having a look at the TaskGraph subdirectory:
prompt> cd TaskGraph
prompt> make
prompt> ./mm
This version uses the TaskGraph Library, a prototype tool developed in my research group. You can write a program that builds the original loop nest, then programmatically applies transformations, before running the generated code. You could extend this to search systematically for the optimum version of the code.
Paul Kelly, Imperial College, 2007