We have a library which allows application programmers to execute automatically and more or less transparently certain computationally intensive parts of scientific programs in parallel. This library comes with both C and C++ APIs, for example:
/*
* A is an N-by-N matrix
* r, z, p, q, x are N-element vectors
* rho, rho_o, beta, alpha and err are scalars
*/
for ( i = 1; i <= max_iter; i++ ) {
/* ... */
rho = r * z;
if (i == 1) {
p = z;
} else {
z = z + (rho / rho_o) * p;
p = z;
}
q = A * p;
alpha = rho / (p * q);
x = x + alpha * p;
r = r - alpha * q;
err = nrm2(r); /* Check convergence */
/* ... */
}
This is a cut-down version of a real program that uses the non-preconditioned conjugate gradient method to solve the equation Ax = b (where x is the unknown vector to be found) in parallel.
Data Parallelism. Currently, our library only uses the
data-parallel programming model for achieving parallelism. This
means that the data is fully distributed (partitioned)
over all available processors and that computation is then
performed according to the "owner computes" rule: each
processor computes that part of the overall data that it owns
according to the manner in which the data has been
distributed. (Not all data-parallel programs use the owner
computes rule, but the principle that the data is fully
distributed is always the same.) In the data-parallel programming
model, it is therefore generally true that all processors will at
any one point in time be working on the same operation, such as
the vector update x = x + alpha * p (except that some
may be idle). The advantage of this is that it is a relatively
easy parallel programming model to reason about; the drawback is
that it can sometimes (often??) be sub-optimal.
Optimising Data-Parallel Programs. The performance of data-parallel programs is almost completely determined by any inter-processor communication that is required. It is possible to minimise the cost of such communications by carefully choosing the data placements used. Our existing library performs such data placement optimisation. This is based on
Task Parallelism. On more careful inspection, it is
clear that the operations x = x + alpha * p; and
r = r - alpha * q; in the above program are in fact
independent, and could therefore, provided we have a suitable
processor configuration, be executed in parallel with each
other, as well as each being parallel operations
themselves. In order to achieve this, we would have to place
x and r on disjoint sets of
processors.
Optimising Task-Parallel Programs. The issues involved in optimising task-parallel programs are wider than in the data-parallel case:
Extend our existing parallel linear algebra library to search for and exploit opportunities for using task parallelism.
Install and use our existing data-parallel library, understand how the delayed evaluation system works and how data-flow of parallel operations is stored internally.
Develop a model for the computation cost of operations; investigate whether this could be transparently determined at runtime, rather than having to be provided beforehand.
Search data-flow graphs for possible task parallelism, check for load balance (Optimising load balance is very complex and possibly not that interesting - what we want to do is make sure that we do not make anything worse by introducing task parallelism.)
Possibly extend the current model of the parallel platform and/or the current model of data placement in order to generate data placements that make task-parallel execution of independent operations possible.
Evaluate your work using various benchmarks.
This is a research-level project with enormous potential for extending the work. To do this project, you must not be scared to pick up and work with an existing software system, and you must have a reasonably good understanding of basic linear algebra in order to work with the existing data placement representation.
Olav Beckmann
Last modified: Tue Oct 4 15:13:24 BST 2005