This document is available on the web at
http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/Exercises/Ex3-MMExploration/Ex3-MMExploration.html
This exercise concerns a familiar computational kernel: matrix-matrix multiplication. Here's a simple C implementation:
/*
 * Matrix-matrix multiplication for studying cache performance:
 * C = AB, where all the matrices are of dimensions N x N
 *
 * (The matrices can actually be submatrices of larger
 * matrices of dimensions M x M, where M > N, e.g. N=576,
 * M=1024, to study effects of associativity conflicts.)
 */
void mm(double A[M][M], double B[M][M], double C[M][M])
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i][j] = 0;
      for (k = 0; k < N; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}
Your job is to characterise the behaviour of various variants of the matrix multiply program: the straightforward version shown in the code above, an interchanged ``ikj'' variant, and a more sophisticated ``tiled'' version. You can find source code at
~phjk/ToyPrograms/ACA14/MM
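For orientation, the sketch below shows what the interchanged ``ikj'' and tiled variants typically look like. It is illustrative only: the tile size TILE and the handling of the loop bounds are assumptions made here, and the MM2.c, MM3.c and MM4.c sources in that directory are the definitive versions.

  /* "ikj" interchange (sketch): the same arithmetic as the code above,
   * but the j and k loops are swapped, so the innermost loop walks a
   * row of B rather than a column. */
  void mm_ikj(double A[M][M], double B[M][M], double C[M][M])
  {
    int i, j, k;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++)
        C[i][j] = 0;
      for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
          C[i][j] += A[i][k] * B[k][j];
    }
  }

  /* Tiled (blocked) version (sketch): the iteration space is broken into
   * TILE x TILE blocks so that each block's working set fits in cache.
   * TILE is an illustrative value; N is assumed to be a multiple of TILE. */
  #define TILE 32
  void mm_tiled(double A[M][M], double B[M][M], double C[M][M])
  {
    int i, j, k, ii, jj, kk;
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        C[i][j] = 0;
    for (ii = 0; ii < N; ii += TILE)
      for (kk = 0; kk < N; kk += TILE)
        for (jj = 0; jj < N; jj += TILE)
          for (i = ii; i < ii + TILE; i++)
            for (k = kk; k < kk + TILE; k++)
              for (j = jj; j < jj + TILE; j++)
                C[i][j] += A[i][k] * B[k][j];
  }

Both transformations perform the same arithmetic but in a different memory access order, which is what this exercise asks you to measure.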
Copy the directory into your own working area:

prompt> mkdir ACA14
prompt> cd ACA14
prompt> cp -r ~phjk/ToyPrograms/ACA14/MM ./

(The ./ above is the destination of the copy - your current working directory). You should now have a copy of the MM directory. Now enter it, build the programs and list the contents:
prompt> cd MM
prompt> make
prompt> ls
Makefile  MM1.ss   MM2.c   MM2.x86  MM3.ss   MM4.c   MM4.x86
MM1.c     MM1.x86  MM2.ss  MM3.c    MM3.x86  MM4.ss  scripts
Now run the programs as follows:
prompt> ./MM1.x86
prompt> ./MM2.x86
prompt> ./MM3.x86
prompt> ./MM4.x86
The goal of this exercise is for you to figure out why the four versions run at different speeds.
The default problem size is 1056 (=1024+32). You can set a different problem size, e.g. type:
prompt> make clean
prompt> make MYFLAGS=-DSZ=2048 x86

(the make clean deletes the old binaries).
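The -DSZ flag works by overriding a compile-time constant. The sketch below is a guess at the mechanism - the macro names and defaults shown here are assumptions, so check MM1.c and the Makefile for the real definitions:

  /* Hypothetical sketch: how a -DSZ=... definition on the compiler
   * command line might select the problem size. */
  #ifndef SZ
  #define SZ 1056          /* default problem size: 1024 + 32 */
  #endif
  #define N SZ             /* dimension of the matrices actually multiplied */
  #define M SZ             /* leading dimension of the arrays (M >= N) */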
Cachegrind is a cache simulator that works on x86 binaries.
Set a smaller problem size, e.g.
prompt> make clean
prompt> make MYFLAGS=-DSZ=256 x86
Try:
prompt> valgrind --tool=cachegrind ./MM1.x86
This produces a summary of the total memory accesses, and the cache miss rates for instructions and data: ca. 154M instructions fetched, 34M data reads, 17M writes, an L1D read miss rate of about 6%, and an L1D write miss rate of about 0.1%. These figures are roughly what the loop nest predicts: with SZ=256 the innermost statement executes 256^3, i.e. about 16.8M, times, each time reading A[i][k] and B[k][j] and writing C[i][j] - about 34M reads and 17M writes.
Note that the cache behaviour of the application depends strongly on the problem size. With the default problem size (1056), MM1 gives quite different results:
==6187== I   refs:      1,544,592,375
==6187== I1  misses:              817
==6187== L2i misses:              814
==6187== I1  miss rate:          0.00%
==6187== L2i miss rate:          0.00%
==6187==
==6187== D   refs:        577,037,006  (383,256,512 rd + 193,780,494 wr)
==6187== D1  misses:      194,591,990  (194,467,353 rd +     124,637 wr)
==6187== L2d misses:          237,436  (    112,812 rd +     124,624 wr)
==6187== D1  miss rate:          33.7%  (       50.7%  +         0.0%  )
==6187== L2d miss rate:           0.0%  (        0.0%  +         0.0%  )
==6187==
==6187== L2 refs:         194,592,807  (194,468,170 rd +     124,637 wr)
==6187== L2 misses:           238,250  (    113,626 rd +     124,624 wr)
==6187== L2 miss rate:            0.0%  (        0.0%  +         0.0%  )
==6187==
==6187== Branches:        192,122,440  (192,122,252 cond +        188 ind)
==6187== Mispredicts:         337,203  (    337,165 cond +         38 ind)
==6187== Mispred rate:            0.1%  (        0.1%    +       20.2%   )
When you run cachegrind, it produces a log file in your current directory, called something like cachegrind.out.6187 (the number is the process id, and appears in the output above as well). kcachegrind is a user interface for exploring a program's profile data. Try:
prompt> kcachegrind cachegrind.out.6187

(you will have to adjust this: use the pid from your valgrind run instead of ``6187'').
You can give cachegrind parameters to control the cache being simulated - by default it chooses parameters similar to those of the machine you are using. For further details see http://valgrind.org/docs/manual/cg-manual.html
For example, to enable branch-prediction simulation and simulate a 1KB direct-mapped L1 data cache with 64-byte lines:
valgrind --tool=cachegrind --branch-sim=yes --D1=1024,1,64 ./MM1.x86
Now, rebuild for problem size 128:
prompt> make clean
prompt> make MYFLAGS=-DSZ=128

And try using the SimpleScalar simulator:
prompt> /homes/phjk/simplesim/sim-outorder ./MM1.ss

The first thing this does is list the parameters of the architecture being simulated - all of these details can be changed using command-line arguments. Then it runs the application, and outputs details of how the processor microarchitecture and memory hierarchy were used.
A script ``varycachesize'' has been provided for running a sequence of experiments over a range of sizes of a direct-mapped cache. For example:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192

(it is worth finding a fast, unshared machine for this). The output format has been designed to be easy to use with ``gnuplot''; the miss rate is column 4:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192 > varycachesize_MM1_256_8192.dat
prompt> gnuplot
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with linespoints

Edit the script to study the effect of other cache parameters. Edit it to use sim-outorder, so you can study other microarchitecture features.
Plot a graph that shows how the L1 data cache miss rate (as measured with SimpleScalar) varies with cache size for MM1, MM2 and MM3, and explain the result. The range 64-8192 is interesting. You can plot several datasets on one graph, e.g.:
plot "varycachesize_MM1_256_8192.dat" using 4 with lp,\ "varycachesize_MM2_256_8192.dat" using 4 with lp
Paul H.J. Kelly, Anton Lokhmotov, Carlo Bertolli and Graham Markall, Imperial College, 2012