This exercise concerns a familiar computational kernel: matrix multiplication. Here's a simple implementation:
/*
 * mm
 *
 * Multiply A by B, leaving the result in C.
 * A is assumed to be an l x m matrix, B an m x n matrix.
 * The result matrix C is of course l x n.
 * The result matrix is assumed to be initialised to zero.
 */
void mm1(double A[SZ][SZ], double B[SZ][SZ], double C[SZ][SZ])
{
  int i, j, k;

  for (i = 0; i < SZ; i++)
    for (j = 0; j < SZ; j++)
      for (k = 0; k < SZ; k++)
        C[i][j] += A[i][k] * B[k][j];
}
Your job is to characterise the behaviour of various variants of the matrix multiply program: the straightforward version shown above, an interchanged ``ikj'' variant, and a more sophisticated ``tiled'' version. You can find the source code in
/homes/phjk/ToyPrograms/ACA07/MM
prompt> mkdir ACA07
prompt> cd ACA07
prompt> cp -r /homes/phjk/ToyPrograms/ACA07/MM ./
(The ``./'' above is the destination of the copy: your current working directory.) You should now have a copy of the MM directory. Now list its contents:
prompt> cd MM
prompt> ls
Makefile  MM1.c  MM2.c  MM3.c  MM4.c  scripts  TaskGraph
The files MM1.c, MM2.c, etc. contain the various different versions of the matrix multiply that we're going to play with. The TaskGraph directory is to explore on your own if you have time.
Compile the programs using make:
prompt> make all
This uses the GNU C compiler to build the binary for the Linux x86 machines we're using. It also builds binaries for the SimpleScalar simulator.
Now run the programs as follows:
prompt> ./MM1.x86
prompt> ./MM2.x86
prompt> ./MM3.x86
prompt> ./MM4.x86
The goal of this exercise is for you to figure out why the four versions run at different speeds.
To use the simulators you need to work with a smaller problem size
(for MM3 and MM4 this must be a multiple of the block size, which is
64 by default). The problem size is determined by the #define
``SZ'', which you can set using make (if you don't, it defaults to 576 = 512 + 64).
Type:
prompt> make clean
prompt> make MYFLAGS=-DSZ=256
(The ``make clean'' deletes the old binaries.)
Cachegrind is a cache simulator that works on x86 binaries. Try:
prompt> /homes/phjk/valgrind/bin/valgrind --tool=cachegrind --branch-sim=yes ./MM1.x86
This produces a summary of the total memory accesses and the cache miss rates for instructions and data: ca. 154M instructions fetched, 34M data reads, 17M writes, an L1D read miss rate of 6%, and an L1D write miss rate of 0.1%.
Note that the cache requirements of the application are quite different with a smaller problem size. With the default problem size, for MM1, the results look quite different:
==6187== I   refs:      1,544,592,375
==6187== I1  misses:              817
==6187== L2i misses:              814
==6187== I1  miss rate:          0.00%
==6187== L2i miss rate:          0.00%
==6187==
==6187== D   refs:        577,037,006  (383,256,512 rd + 193,780,494 wr)
==6187== D1  misses:      194,591,990  (194,467,353 rd +     124,637 wr)
==6187== L2d misses:          237,436  (    112,812 rd +     124,624 wr)
==6187== D1  miss rate:          33.7% (       50.7%   +         0.0%  )
==6187== L2d miss rate:           0.0% (        0.0%   +         0.0%  )
==6187==
==6187== L2 refs:         194,592,807  (194,468,170 rd +     124,637 wr)
==6187== L2 misses:           238,250  (    113,626 rd +     124,624 wr)
==6187== L2 miss rate:            0.0% (        0.0%   +         0.0%  )
==6187==
==6187== Branches:        192,122,440  (192,122,252 cond +       188 ind)
==6187== Mispredicts:         337,203  (    337,165 cond +        38 ind)
==6187== Mispred rate:            0.1% (        0.1%     +      20.2%   )
When you run cachegrind, it produces a log file in your current directory, called something like cachegrind.out.6187 (the number is the process id, and appears in the output above as well). kcachegrind is a user interface for exploring a program's profile data. Try:
prompt> kcachegrind cachegrind.out.6187
(You will have to adjust this: use the pid from your valgrind run instead of ``6187''.)
You can give cachegrind parameters to control the cache being simulated; by default it chooses parameters similar to those of the machine you are using. For further details see
http://valgrind.org/docs/manual/cg-manual.html
For example:
/homes/phjk/valgrind/bin/valgrind --tool=cachegrind --branch-sim=yes --D1=1024,1,64 ./MM1.x86
Now, rebuild for problem size 128:
prompt> make clean
prompt> make MYFLAGS=-DSZ=128
And try using the SimpleScalar simulator:
prompt> /homes/phjk/simplesim/sim-outorder ./MM1.ss
The first thing this does is list the parameters of the architecture being simulated; all of these details can be changed using command-line arguments. Then it runs the application, and outputs details of how the processor microarchitecture and memory hierarchy were used.
A script ``varycachesize'' has been provided for running a sequence of experiments over a range of direct-mapped cache sizes. For example:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192
(It is worth finding a fast, unshared machine for this.) The output format has been designed to be easy to use with ``gnuplot''; the miss rate is in column 4:
prompt> ./scripts/varycachesize ./MM1.ss 256 8192 > varycachesize_MM1_256_8192.dat
prompt> gnuplot
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with linespoints
Edit the script to study the effect of other cache parameters. Edit it to use sim-outorder, so you can study other microarchitecture features.
Plot a graph that shows how the L1 data cache miss rate (as measured with SimpleScalar) varies with cache size for MM1, MM2, MM3 and MM4, and explain the result. The range 64-8192 is interesting.
You can plot several datasets on one graph, for example:
gnuplot> plot "varycachesize_MM1_256_8192.dat" using 4 with lp, \
              "varycachesize_MM2_256_8192.dat" using 4 with lp
You might enjoy having a look at the TaskGraph subdirectory:
prompt> cd TaskGraph
prompt> make
prompt> ./mm
This version uses the TaskGraph Library, a prototype tool developed in my research group. You can write a program that builds the original loop nest, then programmatically applies transformations, before running the generated code. You could extend this to systematically search for the optimum version of the code.
You might also like to look at some of the other simplescalar simulators; see
http://www.simplescalar.com/v4test.html
http://www.simplescalar.com/
These include microarchitecture simulators for ARM, Alpha and PowerPC, together with energy models, microarchitecture extensions and multiprocessor simulations.
Paul Kelly, Imperial College, 2008