One of the recent research efforts in the Software Performance Optimisation Research Group is a DESO (Delayed Evaluation, Self-Optimising) library of parallel scientific software components. This library comes with both C and C++ bindings. An example of the C++ interface is the following simple numerical algorithm:
/*
* A is an N-by-N matrix
* r, z, p, q, x are N-element vectors
* rho, rho_o, beta, alpha and err are scalars
*
* Uses the Conjugate Gradient algorithm to solve
* the equation Ax = b for x.
*/
for ( i = 1; i <= max_iter; i++ ) {
/* ... */
rho = r * z;
if (i == 1) {
p = z;
} else {
z = z + (rho / rho_o) * p;
p = z;
}
q = A * p;
alpha = rho / (p * q);
x = x + alpha * p;
r = r - alpha * q;
err = nrm2(r); /* Check convergence */
/* ... */
}
See our EuroPar paper for an illustration for how a simple high-level piece of code like the above can become a working parallel program.
The Challenge of Cross-component Optimisation. It is
sadly not true that the composition of optimised software
components is optimal. For example: Suppose we have two large
optimised numerical functions f and g,
then f(); g() is not necessarily optimal, in
particular if these are parallel functions. One possible solution
is of course to make sure that we recompile f and
g each time we want to call them and then use a
compiler with powerful interprocedural optimisations. However, in
the case of libraries, we have no information available at
compile-time about how library functions will be used: we lose all
context information. In our library of parallel scientific
components, we overcome this problem of lost context information
by using delayed evaluation to capture the data-flow
graph of library operations before they are executed at program
runtime. We then have the opportunity to perform runtime
cross-component optimisations.
Code (Component) Selection. For a vector update
statement such as x = x + alpha * p, the data-flow
graph nodes created are as follows:
+------------------+
| temp = alpha * p |
+------------------+
/
/
|
\|/
|
+--------------+
| x = x + temp |
+--------------+
The reason why two nodes are created is the manner in which DESO's
C++ interface "parses" the statement x = x + alpha *
p. This approach is clearly sub-optimal, because
We should fuse the two loops temp = alpha * p
and x = x + temp.
On many platforms, the hardware vendor will have
supplied an optimised implementation of the operation y
<- alpha × x + y. In our current implementation,
we lose the opportunity to use that optimised
implementation. The problem of selecting the optimal set of
vendor-supplied functions for a particular computation is a
code selection problem.
For a larger illustration of this problem, consider the following data flow graph, which is the actual DAG generated when running the first iteration of the above sample program.
The aim of this project is to use runtime techniques to optimise these kind of DAGs. It would be a real bonus if we could also know whether our optimisation has just resulted in an improvement or found the actual optimal code selection for this DAG.
Similar problems have to be solved by optimising compilers for sequential code, and several published algorithms are available for that context.
There are ways in which both the code selection problem and the loop fusion problem could be addressed by using compile-time methods, specifically C++ Template Metaprogramming, see for example Blitz++ ( http://www.oonumerics.org/blitz/). There should be interesting trade-offs between the runtime and compile-time approaches.
Extend our existing parallel linear algebra library to perform runtime code selection, exploiting available optimised library routines. Evaluate by comparing with loop fusion and compile-time approaches.
Install and use the DESO library, understand how the C++ interface and the delayed evaluation system work and how data-flow of parallel operations is stored internally.
Research existing algorithms in the compiler literature for solving this problem (we will point you towards the key papers).
Implement a suitable algorithm in the existing library, together with the required model of what optimised operations are available.
Evaluate your work using various benchmarks and compare with other solutions (e.g. loop fusion and compile-time solutions).
This is a research-level project. To do this project, you must not be scared to pick up and work with an existing large software system. You should enjoy the challenge of getting software to work and do what you want. If you would like to discuss this project, please feel free to talk to either Paul Kelly or Olav Beckmann. Our EuroPar 2002 paper gives an overview of the software framework which you would be working with, whilst the THEMIS paper gives are more big-picture overview of the aims of our research project.
Olav Beckmann
Last modified: Tue Oct 4 15:28:10 BST 2005