We have a library which allows application programmers to execute automatically and more or less transparently certain computationally intensive parts of scientific programs in parallel. This library comes with both C and C++ APIs, for example:
/*
* A is an N-by-N matrix
* r, z, p, q, x are N-element vectors
* rho, rho_o, beta, alpha and err are scalars
*/
for ( i = 1; i <= max_iter; i++ ) {
/* ... */
rho = r * z;
if (i == 1) {
p = z;
} else {
z = z + (rho / rho_o) * p;
p = z;
}
q = A * p;
alpha = rho / (p * q);
x = x + alpha * p;
r = r - alpha * q;
err = nrm2(r); /* Check convergence */
/* ... */
}
This is a cut-down version of a real program that uses the non-preconditioned conjugate gradient method to solve the equation Ax = b (where x is the unknown vector to be found) in parallel.
Data Parallelism. Our library uses the data-parallel programming model for achieving parallelism. This means that the data is fully distributed (partitioned) over all available processors and that computation is then performed according to the "owner computes" rule: each processor computes that part of the overall data that it owns according to the manner in which the data has been distributed. (Not all data-parallel programs use the owner computes rule, but the principle that the data is fully distributed is always the same.)
Delayed Evaluation. The performance of data-parallel programs is largely determined by inter-processor communication. It is possible to minimise the cost of such communications by carefully choosing the data placements used. Our existing library performs such data placement optimisation. This is based on
Loop Fusion. On careful inspection, it is clear that
the statements z = z + (rho / rho_o) * p; and
p = z; in the above program could be fused. It may not
even be necessary to write out both z and
p. In this particular application program this fusion
would probably only result in a very marginal performance
improvement; however, in other benchmarks the benefit should be
much higher.
The TaskGraph Library is the result of a prize-winning individual project in this Department. It is a tool for runtime dependence analysis and code optimisation that uses its own small C++-like language to represent the computation to be performed. The data structure used by this library to represent this computation - by storing an abstract syntax tree (AST) for the code to be executed - is called a TaskGraph. The TaskGraph library is capable of performing loop fusion within one TaskGraph, i.e. if an AST we have constructed contains two loops, the library's code optimiser will check whether the loops are fusible and perform the fusion if valid.
Extend our existing parallel linear algebra library to perform runtime loop fusion.
Install and use our existing data-parallel library, understand how the delayed evaluation system work and how data-flow of parallel operations is stored internally.
Install and use the TaskGraph library, understand the internal representation of the AST.
One possible way of solving this problem to to re-write the routines our library provides as TaskGraphs. We then need to "fuse" these TaskGraphs at runtime in order to create one larger TaskGraph containing several loops that can then be analysed for fusion. There may be an alternative solution in constructing one large TaskGraph at the top-level program which calls library operators that each augment the AST for the computation to be performed.
There is a possible conflict between the objectives of fusing some loops at runtime and optimal code selection (see http://www.doc.ic.ac.uk/~ob3/ProjectIdeas/graph-matching.html). Investigate this issue
Evaluate your work using various benchmarks.
This is a research-level project. To do this project, you must not be scared to pick up and work with an existing software system, and you must have a reasonably good understanding of basic linear algebra in order to work with the existing data placement representation.
Olav Beckmann
Last modified: Tue Oct 4 15:11:37 BST 2005