Automatic Loop Fusion in Scientific Linear Algebra Codes

-------------------------

Motivation

We have a library which allows application programmers to execute automatically and more or less transparently certain computationally intensive parts of scientific programs in parallel. This library comes with both C and C++ APIs, for example:

/* 
 * A is an N-by-N matrix
 * r, z, p, q, x are N-element vectors
 * rho, rho_o, beta, alpha and err are scalars 
 */
for ( i = 1; i <= max_iter; i++ ) {
  /* ... */
  rho = r * z;
  if (i == 1) { 
    p = z;
  } else { 
    z = z + (rho / rho_o) * p;
    p = z;
  }

  q = A * p;
  alpha = rho / (p * q);
  x = x + alpha * p;
  r = r - alpha * q;

  err = nrm2(r); /* Check convergence */ 
  /* ... */ 
}

This is a cut-down version of a real program that uses the non-preconditioned conjugate gradient method to solve the equation Ax = b (where x is the unknown vector to be found) in parallel.

Data Parallelism. Our library uses the data-parallel programming model for achieving parallelism. This means that the data is fully distributed (partitioned) over all available processors and that computation is then performed according to the "owner computes" rule: each processor computes that part of the overall data that it owns according to the manner in which the data has been distributed. (Not all data-parallel programs use the owner computes rule, but the principle that the data is fully distributed is always the same.)

Delayed Evaluation. The performance of data-parallel programs is largely determined by inter-processor communication. It is possible to minimise the cost of such communications by carefully choosing the data placements used. Our existing library performs such data placement optimisation. This is based on

  1. Using delayed evaluation to capture the data-flow graph of library operations before they are executed
  2. A mathematical model of data placement
  3. A mathematical model of the placement requirements of data-parallel operations
  4. A very simple cost model for any communications that are required
  5. A simple search algorithm that tries to find such data placements that minimise the overall cost of communication.

Loop Fusion. On careful inspection, it is clear that the statements z = z + (rho / rho_o) * p; and p = z; in the above program could be fused. It may not even be necessary to write out both z and p. In this particular application program this fusion would probably only result in a very marginal performance improvement; however, in other benchmarks the benefit should be much higher.

The TaskGraph Library is the result of a prize-winning individual project in this Department. It is a tool for runtime dependence analysis and code optimisation that uses its own small C++-like language to represent the computation to be performed. The data structure used by this library to represent this computation - by storing an abstract syntax tree (AST) for the code to be executed - is called a TaskGraph. The TaskGraph library is capable of performing loop fusion within one TaskGraph, i.e. if an AST we have constructed contains two loops, the library's code optimiser will check whether the loops are fusible and perform the fusion if valid.

The Task

Extend our existing parallel linear algebra library to perform runtime loop fusion.

This is a research-level project. To do this project, you must not be scared to pick up and work with an existing software system, and you must have a reasonably good understanding of basic linear algebra in order to work with the existing data placement representation.

-----------------------------

Olav Beckmann
Last modified: Tue Oct 4 15:11:37 BST 2005