Ex7-FullDiagOddChallenge

332 Advanced Computer Architecture

Exercise 7 (Assessed): the "FullDiagOdd" Challenge

This is the second of two assessed coursework exercises, both based on the "FullDiagOdd" computational chemistry application.

You may work in groups of two or three if you wish, but your report must include an explicit statement of who did what.

Submit your work electronically via CATE.

Compiling and running FullDiagOdd

Copy the source code directory tree to your own directory:

cd 
cp -r /homes/awb01/Teaching/ACA06/FullDiagOdd ./

Now compile the program:

cd FullDiagOdd
make

Now you can run the program:

./mat.x86 <5a.dat

This reads input from the file 5a.dat and writes its output to the screen.

You have been provided with a selection of input files of various sizes: the short-running ones are for use with simulators; for serious runs on real hardware, use the longer-running examples such as "11a.dat".

All-out performance

Basically, your job is to figure out how to run this program as fast as you possibly can, and to write a brief report explaining how you did it.

Rules

You can choose any hardware platform you wish. You are encouraged to find interesting and diverse machines to experiment with. The goal is high performance on your chosen platform, so it is OK to choose an interesting machine even if it's not the fastest available. On Linux, type "cat /proc/cpuinfo" to find out what processor a machine has.
Try the Apple G5s, or possibly PDAs, DSP processors, graphics co-processors or FPGAs.
Make sure the machine is quiescent before doing timing experiments. Always repeat experiments for statistical significance.
Choose a problem size which suits the performance of the machine you choose: the runtime must be long enough for any improvement to be evident.
The numerical results reported by the application need not be absolutely identical to those of the original program, but if they differ you must justify the correctness of your results¹.
You can achieve full marks even if you do not achieve the maximum performance.
Marks are awarded for:
- Systematic analysis of the application's behaviour
- Systematic evaluation of performance improvement hypotheses
- Drawing conclusions from your experience
- A professional, well-presented report detailing the results of your work.
You should produce a compact report in the style of an academic paper for presentation at an international conference such as Supercomputing (www.sc2000.org). The report must not be more than 7 pages in length.
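
As an illustration of the rule about repeating timing experiments, here is a minimal sketch of a shell harness that runs a command several times and summarises the wall-clock times. The helper names time_runs and summarise are hypothetical and not part of the FullDiagOdd distribution, and the sketch assumes GNU date (for the %N nanosecond format) and awk are available.

```shell
# Hypothetical helpers, not part of the FullDiagOdd distribution.

# Run COMMAND N times with stdin taken from INPUT and print the
# wall-clock time of each run in seconds, one per line.
# Assumes GNU date, whose %N format gives nanoseconds.
# usage: time_runs N INPUT COMMAND [ARGS...]
time_runs() {
    n=$1; input=$2; shift 2
    i=0
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s.%N)
        "$@" < "$input" > /dev/null
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done
}

# Read one number per line and print the count, the mean and the
# sample standard deviation.
summarise() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { if (n == 0) exit 1
               m = s / n
               v = (n > 1) ? (ss - n * m * m) / (n - 1) : 0
               if (v < 0) v = 0   # guard against rounding error
               printf "n=%d mean=%.3f sd=%.3f\n", n, m, sqrt(v) }'
}
```

For example, "time_runs 5 11a.dat ./mat.x86 | summarise" would time five runs on the larger input. A standard deviation that is large relative to the mean suggests the machine was not quiescent.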

Hints, tools and techniques

Performance analysis tools:

You may find it useful to find out about:

Cachegrind and cg_annotate
kcachegrind - kcachegrind.sourceforge.net - graphical interface to cachegrind
gprof - standard command-line profiling tool.
kprof - kprof.sourceforge.net - graphical interface to gprof
VTune - Intel's (Windows and Linux) tool for understanding CPU performance issues and mapping them back to source code
(http://www.intel.com/software/products/vtune/). Free trial.
AMD's CodeAnalyst (installed on CSG Athlon machines - Start → Programming → AMD) (if you have an AMD machine)².
Sun's Performance Analyzer
http://docs.sun.com/source/806-3562/ (if you have a Sun Sparc machine)
oprofile
http://oprofile.sourceforge.net/news/ (requires kernel rebuild)
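
As a starting point for the first tool in this list, the following sketch wraps a Cachegrind run and its annotation step in one shell function. The function name cg_profile and the default output file name are hypothetical; the valgrind and cg_annotate commands and the --cachegrind-out-file option are standard, but check them against the version installed on your machine.

```shell
# Hypothetical wrapper, not part of the coursework distribution.
# Runs PROGRAM under Cachegrind with stdin from INPUT, then prints the
# per-function profile produced by cg_annotate.
# usage: cg_profile PROGRAM INPUT [OUTFILE]
cg_profile() {
    prog=$1; input=$2; out=${3:-cachegrind.out.run}
    command -v valgrind > /dev/null || { echo "valgrind not found" >&2; return 1; }
    # The program's own output is discarded; Cachegrind's summary goes to stderr.
    valgrind --tool=cachegrind --cachegrind-out-file="$out" "$prog" < "$input" > /dev/null
    cg_annotate "$out"
}
```

For example, "cg_profile ./mat.x86 5a.dat" profiles the short-running input, which is the sensible choice here because programs run many times slower under Cachegrind. gprof works differently: the program must first be compiled and linked with gcc's -pg flag, and how to pass that flag through the supplied Makefile depends on how the Makefile is written.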

Compilers

You could investigate the potential benefits of more sophisticated compiler techniques:

Intel's compilers
(http://www.intel.com/software/products/compilers/ and installed on various Linux systems in the Department).
Codeplay's compilers (www.codeplay.com) (free demo download?)
IBM's compilers for Apple G5 - XL C/C++ Advanced Edition (a beta download was available, possible donation from Apple or IBM?)

Your report should explain:
- what hardware and software you used,
- what hypothesis (or hypotheses) you investigated,
- how you evaluated what the potential advantage could be,
- how you explored the effectiveness of the approach experimentally,
- what conclusions you can draw from your work,
- if you worked in a group, who was responsible for what.

Please do not write more than seven pages.

Paul Kelly, Imperial College, 2006

Footnotes

¹ The gcc flag -ffloat-store is sometimes useful for checking whether a difference in output is due purely to register allocation.
² To do this you will need to build the code using a native Windows compiler. This is easier if you can use the Fortran sources; see the "OriginalFortran/" subdirectory.
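
When justifying the correctness of results that are not bit-identical (the rule that footnote 1 refers to), it can help to compare two output files numerically rather than with a plain diff. The sketch below is one possible way to do this; the function name numdiff_tol is hypothetical, and it assumes the program's output can be treated as whitespace-separated tokens, which has not been checked against the real output format. It also assumes the first file is not empty.

```shell
# Hypothetical helper, not part of the FullDiagOdd distribution.
# Compares two files token by token. Numeric tokens may differ by a
# relative tolerance RELTOL; all other tokens must match exactly.
# Exit status is 0 if the files agree and 1 otherwise.
# usage: numdiff_tol FILE_A FILE_B [RELTOL]
numdiff_tol() {
    awk -v tol="${3:-1e-9}" '
        function isnum(s) {
            return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/
        }
        function abs(x) { return x < 0 ? -x : x }
        # First file: remember every token in order.
        NR == FNR { for (i = 1; i <= NF; i++) a[++na] = $i; next }
        # Second file: compare each token with its counterpart.
        { for (i = 1; i <= NF; i++) {
              nb++; t = $i; r = a[nb]
              if (isnum(t) && isnum(r)) {
                  d = abs(t - r)
                  m = abs(r) > abs(t) ? abs(r) : abs(t)
                  if (d > tol * m) { bad++; print "token " nb ": " r " vs " t }
              } else if (t != r) { bad++; print "token " nb ": " r " vs " t }
          } }
        END { if (na != nb) { bad++; print "token counts differ: " na " vs " nb }
              exit (bad > 0) }
    ' "$1" "$2"
}
```

For example, after saving the output of the original build and of an optimised build to two files, "numdiff_tol original.out optimised.out 1e-6" lists every token that differs by more than one part in a million. The file names here are placeholders.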