The aim of this project is to enable scientific application programmers to exploit transparently the extra processing power of SMPs (shared-memory multi-processors, i.e. for example Linux boxes with 2 or more processors). In commodity SMPs, the extra processors are currently often used for getting better performance when running multiple applications; very few individual application utilise all processors.
OpenMP (http://www.openmp.org) is a standard interface for allowing C/C++ and Fortran programmers to exploit parallelism on shared-memory multiprocessors, ranging from very simple dual processor Linux machines to large supercomputers. It is actually very simple:
#pragma omp parallel for
for ( j = 0; j < sz; ++j ) {
b[j] += alpha * a[j];
}
This simple annotation indicates to an OpenMP compiler that
this loop is to be executed in parallel. It is moderately
successful: Using the Intel C compiler, which performs OpenMP
parallelisation, and 1000000-element arrays a and
b, this loop runs on my desktop charis (a
dual Athlon 1600+) without parallelisation in
27.2 milliseconds, whereas with
parallelisation it runs in 18.4 milliseconds
(interestingly, both are slightly faster under Windows). With
more compute-intensive problems and being more careful to
avoid unnecessary overheads of thread-creation, it should be
possible to get better speedups. Now, here is another example:
r = 0.0;
#pragma omp parallel for
for ( j = 0; j < sz; ++j ) {
res += a[j] * b[j];
}
This dot-product loop runs in 23.5
milliseconds on a single processor, without the
annotation. However, with the annotation shown above, it runs
in 60.8 ms! What went wrong? This loop is
actually not parallel, because there is a loop-carried
dependency (carried by res). Therefore, running it as
a parallel loop first causes res to be invalidated in
each processor's cache on each iteration and also gives the
wrong answer! The solution:
res = 0.0;
#pragma omp parallel for reduction (+: res)
for ( j = 0; j < sz; ++j ) {
res = res + (a[j] * b[j]);
}
This new annotation advises the OpenMP compiler that this is
in fact a reduction over +, and that res is the
reduction variable. The compiler will then generate code that
re-associates this reduction, and which duly runs in
13.8 ms.
As demonstrated above, OpenMP provides an easy way of generating parallel code. The hard part is to know what the annotation should be. The main focus of this project should be to extract automatically the dependence information which defines whether parallelisation is valid. There are several possible tools that you could use:
Once the analysis has been achieved and we can reliably generate OpenMP annotations, you should also look at the following:
This is a research-level project. To do this project, you must not be scared to pick up and work with an existing large software system. You should enjoy the challenge of getting software to work and do what you want. If you would like to discuss this project, please feel free to talk to either Paul Kelly or Olav Beckmann. Our EuroPar paper gives an overview of the DESO framework, whilst the THEMIS paper gives are more big-picture overview of the aims of our research project.
Olav Beckmann
Last modified: Tue Oct 4 15:14:19 BST 2005