Automatic Parallelisation of Scientific Codes for Shared Memory

line

Motivation

The aim of this project is to enable scientific application programmers to exploit transparently the extra processing power of SMPs (shared-memory multi-processors, i.e. for example Linux boxes with 2 or more processors). In commodity SMPs, the extra processors are currently often used for getting better performance when running multiple applications; very few individual application utilise all processors.

OpenMP (http://www.openmp.org) is a standard interface for allowing C/C++ and Fortran programmers to exploit parallelism on shared-memory multiprocessors, ranging from very simple dual processor Linux machines to large supercomputers. It is actually very simple:

#pragma omp parallel for
for ( j = 0; j < sz; ++j ) {
  b[j] += alpha * a[j];
}
	
This simple annotation indicates to an OpenMP compiler that this loop is to be executed in parallel. It is moderately successful: Using the Intel C compiler, which performs OpenMP parallelisation, and 1000000-element arrays a and b, this loop runs on my desktop charis (a dual Athlon 1600+) without parallelisation in 27.2 milliseconds, whereas with parallelisation it runs in 18.4 milliseconds (interestingly, both are slightly faster under Windows). With more compute-intensive problems and being more careful to avoid unnecessary overheads of thread-creation, it should be possible to get better speedups. Now, here is another example:
r = 0.0;
#pragma omp parallel for
for ( j = 0; j < sz; ++j ) {
  res += a[j] * b[j];
}
	
This dot-product loop runs in 23.5 milliseconds on a single processor, without the annotation. However, with the annotation shown above, it runs in 60.8 ms! What went wrong? This loop is actually not parallel, because there is a loop-carried dependency (carried by res). Therefore, running it as a parallel loop first causes res to be invalidated in each processor's cache on each iteration and also gives the wrong answer! The solution:
res = 0.0;
#pragma omp parallel for reduction (+: res) 
for ( j = 0; j < sz; ++j ) {
  res = res + (a[j] * b[j]);
}
	
This new annotation advises the OpenMP compiler that this is in fact a reduction over +, and that res is the reduction variable. The compiler will then generate code that re-associates this reduction, and which duly runs in 13.8 ms.

The Task:

Exploit shared memory parallelism in loops that occur in scientific codes.

As demonstrated above, OpenMP provides an easy way of generating parallel code. The hard part is to know what the annotation should be. The main focus of this project should be to extract automatically the dependence information which defines whether parallelisation is valid. There are several possible tools that you could use:

  1. The TaskGraph Library is the result of a prize-winning individual project in this Department. It is a tool for runtime dependence analysis and code optimisation that uses its own small C++-like language to represent the computation to be performed.
  2. You could also look at compiler tools such as SUIF...

Once the analysis has been achieved and we can reliably generate OpenMP annotations, you should also look at the following:

Is this project for you?

This is a research-level project. To do this project, you must not be scared to pick up and work with an existing large software system. You should enjoy the challenge of getting software to work and do what you want. If you would like to discuss this project, please feel free to talk to either Paul Kelly or Olav Beckmann. Our EuroPar paper gives an overview of the DESO framework, whilst the THEMIS paper gives are more big-picture overview of the aims of our research project.

-----------------------------

Olav Beckmann
Last modified: Tue Oct 4 15:14:19 BST 2005