This is the first of two equally-weighted assessed coursework exercises. Hand your solution into the Dept of Computing Student General Office (Huxley 345) by 5pm on Wednesday February 18th.
A sparse matrix is one with a large proportion of zero elements. Many important algorithms involve multiplying a sparse matrix A by a vector x to form y = Ax. This key computational ``kernel'' is the focus of the main laboratory project for this course.
Applications include:
The dense version of the kernel looks like this:

    double A[A_rows][A_cols];
    double x[A_cols], y[A_rows];

    for (i = 0; i < A_rows; i++) {
        sum = 0.0;
        for (k = 0; k < A_cols; k++)
            sum += A[i][k]*x[k];
        y[i] = sum;
    }

To make this sparse we need to choose a storage scheme. A popular choice is ``compressed row'':
    struct MatrixElement { long col; double val; };

    /* no. of non-zeroes in each row */
    int SA_entries[A_rows];

    /* array of pointers to the list of non-zeroes in each row */
    struct MatrixElement **SA =
        (struct MatrixElement **) malloc(sizeof(struct MatrixElement *) * A_rows);

    /* When SA_entries is known we can allocate just enough memory for the data */
    for (i = 0; i < A_rows; i++) {
        SA[i] = (struct MatrixElement *)
            malloc(sizeof(struct MatrixElement) * SA_entries[i]);
    }

The idea here is that for each row i, we keep an array of the non-zero matrix elements, pointed to by SA[i]. For each element SA[i][j] we need its value and its column number. So, if A[i][j] is the n-th non-zero in row i, its value will be stored in SA[i][n].val, and SA[i][n].col will be j.
Now the matrix-vector multiply can be written as:
    for (i = 0; i < A_rows; i++) {
        sum = 0.0;
        for (j = 0; j < SA_entries[i]; j++) {
            sum += SA[i][j].val * x[SA[i][j].col];
        }
        y[i] = sum;
    }
Copy the source code directory tree to your own directory:
    cd
    cp -r /homes/phjk/ToyPrograms/C/smv-V2 ./

Now run some setup scripts:
    cd smv-V2
    source SOURCEME.csh
    ./scripts/MakeMatrices.sh 1

This creates a directory /tmp/USERNAME/spmv/ and builds some sample matrices there. You should compile the application using the following commands:
    pushd solver/csteady/
    make
    popd

Now you can run the program:
    ./solver/csteady/csteady inputs/courier1/

This reads its input from the directory for problem size 1, and writes its output to a file called ``STEADY'' in the same directory. Problem size 1 is very tiny; problem size 3 is more substantial.
The makefile in solver/csteady also builds a binary for execution using the SimpleScalar simulator. You can execute it by typing:
sim-outorder ./solver/csteady/csteady-ss inputs/courier1/
Use the fastest machine you can find. On a 3GHz Pentium 4 each simulation run (for problem size 1) takes about a minute. Use the ``top'' command to make sure you're not sharing it.
The most important output from the simulator is ``sim_cycle'' - the total number of cycles to complete the run. It is often also useful to look at ``sim_IPC'', the instructions per cycle - provided you always execute the same number of instructions. The time taken to perform the simulation, ``sim_elapsed_time'', simply tells you how long the simulator itself took to run.
Other outputs from the simulator can be helpful in guiding your search - eg ``ruu_full'', the proportion of cycles when the RUU is full.
To do this you need to write a shell script. You might find the following Bash script useful:
    #!/bin/bash
    for ((x = 2; x <= 128; x *= 2))
    do
        echo -n "ruu-size" $x " "
        sim-outorder -ruu:size $x \
            ./solver/csteady/csteady-ss inputs/courier1/ \
            2>&1 | grep sim_IPC
    done

The script writes to standard output, which you can redirect to a file such as ``results''. You can find this script in
    scripts/ruu-loop.sh

See also ``lsq-loop.sh''.
Try using the gnuplot program. Run the script above, and save the output in a file ``table''. Type ``gnuplot''. Then, at its prompt type:
    set logscale x 2
    plot [][] 'table' using 2:4 with linespoints

To save the plot as a postscript file, try:
    set term postscript eps
    set output "psfile.ps"
    plot [][] 'table' using 2:4 with linespoints

Try ``help postscript'', ``help plot'' etc. for further details.
Compile the application using the ``-S'' flag.
    cd solver/csteady
    ~phjk/simplescalar/bin/gcc -O3 -S csteady.c

The assembler code generated by the compiler is delivered in ``csteady.s''.
The compiler accepts many parameters to control optimisation. Try ``man gcc'' to read about them. For example, ``-funroll-loops'' encourages the compiler to unroll loops:
    ~phjk/simplescalar/bin/gcc -O3 -funroll-loops csteady.c

Does this improve the performance of the simulated machine? What happens to the IPC?
Paul Kelly, Imperial College London, 2004