Task Parallelism in Scientific Linear Algebra Codes

-----------------------

Motivation

We have a library which allows application programmers to execute automatically and more or less transparently certain computationally intensive parts of scientific programs in parallel. This library comes with both C and C++ APIs, for example:

/* 
 * A is an N-by-N matrix
 * r, z, p, q, x are N-element vectors
 * rho, rho_o, beta, alpha and err are scalars 
 */
for ( i = 1; i <= max_iter; i++ ) {
  /* ... */
  rho = r * z;
  if (i == 1) { 
    p = z;
  } else { 
    z = z + (rho / rho_o) * p;
    p = z;
  }

  q = A * p;
  alpha = rho / (p * q);
  x = x + alpha * p;
  r = r - alpha * q;

  err = nrm2(r); /* Check convergence */ 
  /* ... */ 
}

This is a cut-down version of a real program that uses the non-preconditioned conjugate gradient method to solve the equation Ax = b (where x is the unknown vector to be found) in parallel.

Data Parallelism. Currently, our library only uses the data-parallel programming model for achieving parallelism. This means that the data is fully distributed (partitioned) over all available processors and that computation is then performed according to the "owner computes" rule: each processor computes that part of the overall data that it owns according to the manner in which the data has been distributed. (Not all data-parallel programs use the owner computes rule, but the principle that the data is fully distributed is always the same.) In the data-parallel programming model, it is therefore generally true that all processors will at any one point in time be working on the same operation, such as the vector update x = x + alpha * p (except that some may be idle). The advantage of this is that it is a relatively easy parallel programming model to reason about; the drawback is that it can sometimes (often??) be sub-optimal.

Optimising Data-Parallel Programs. The performance of data-parallel programs is almost completely determined by any inter-processor communication that is required. It is possible to minimise the cost of such communications by carefully choosing the data placements used. Our existing library performs such data placement optimisation. This is based on

  1. Using delayed evaluation to capture the data-flow graph of library operations before they are executed
  2. A mathematical model of data placement
  3. A mathematical model of the placement requirements of data-parallel operations
  4. A very simple cost model for any communications that are required
  5. A simple search algorithm that tries to find such data placements that minimise the overall cost of communication.

Task Parallelism. On more careful inspection, it is clear that the operations x = x + alpha * p; and r = r - alpha * q; in the above program are in fact independent, and could therefore, provided we have a suitable processor configuration, be executed in parallel with each other, as well as each being parallel operations themselves. In order to achieve this, we would have to place x and r on disjoint sets of processors.

Optimising Task-Parallel Programs. The issues involved in optimising task-parallel programs are wider than in the data-parallel case:

  1. We need to search the data-flow graph of library operations for independent sections.
  2. We need a model not only of the cost of communication, but also of the cost of the actual operations in order to be able to balance the amount of time each independent computation takes.
  3. We may have to extend our current model of data placement in order to place the data involved in independent computations on distinct processor groups.

The Task

Extend our existing parallel linear algebra library to search for and exploit opportunities for using task parallelism.

This is a research-level project with enormous potential for extending the work. To do this project, you must not be scared to pick up and work with an existing software system, and you must have a reasonably good understanding of basic linear algebra in order to work with the existing data placement representation.

--------------------

Olav Beckmann
Last modified: Tue Oct 4 15:13:24 BST 2005