Compiler for Skeleton-based Parallel Programming
The goal of this project is to write a compiler which
automatically re-tunes a parallel program when it is ported
to a new parallel machine.
"Skeletons" are what make this possible. Skeletons encourage structured parallel
programming, in which a parallel program is built from
ready-made components for which efficient implementations
are known to exist. These components may be ordinary
parallel operations, or, more interestingly, they may
be "skeletons" which can be filled in with user-code.
The main issue in building such a compiler is to use analytical
performance models of the different skeletons used in a
program to find the most efficient allocation of resources.
Background:
A fundamental problem in parallel programming is that the way one part
of the program should be parallelised can depend on the parallelism in
the rest of the program.
- Example: Matrix multiply:
The way an input array is distributed may determine how a parallel
matrix multiply should be implemented, and may in turn influence
the distribution of the result.
- Example: Poisson solver: suppose one part of a
program finds the heat flow in a 2-D region, and a second part tests
whether all points in the region satisfy some convergence criterion. We
could allocate a square array of processors to solve for the heat
flow, and a tree of processors to compute the convergence test. In
reality, however, this may not be the most efficient allocation of
resources. It is likely to be more efficient to allocate all the
processors to solving for the heat flow, and to use the same
processors to do the (relatively fast) convergence testing (a
back-of-envelope version of this argument follows this list).
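A back-of-envelope version of the Poisson argument, with invented
numbers: suppose the solve phase costs roughly W_s/p time units per
iteration on p processors, the convergence test costs W_c/q + log2(q)
on q processors, and W_s = 10^6, W_c = 10^3, with 64 processors in
total. The two phases run one after the other in each iteration, so
reusing all 64 processors for both gives about 10^6/64 + 10^3/64 + 6,
roughly 15,650 units per iteration, whereas a 56/8 split gives about
10^6/56 + 10^3/8 + 3, roughly 17,990 units: reuse wins here. The
models and numbers are purely illustrative.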
The idea of skeletons is to get the programmer to write the program so
that as much of the application's parallelism as possible is exposed
in the form of calls to standard parallel operations/skeletons. Then
the compiler's job is to decide how each parallel operation should be
implemented (in parallel, sequentially, using which parallel
implementation "template"), and, if in parallel, how data structures
should be distributed and how processors should be allocated to the
different parallel activities.
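To give the flavour, a Poisson-style program in an invented, SCL-like
pseudocode (the real SCL syntax in the Darlington et al. paper below
differs) might look like:

    grid   = distribute block_2d initial_grid       -- choose a data placement
    grid'  = iterUntil converged (map relax) grid   -- repeat until converged
    answer = gather grid'

Here distribute, map, iterUntil and gather would be standard skeletons,
while relax and converged are user-supplied C fragments which the
compiler treats as black boxes with measured costs.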
To do this, the compiler must have an estimate of the execution time
for each user-provided code fragment (written in C for example). It
should also have a performance model of how each skeleton's
implementation template performs on the given target architecture.
The job:
- Design/select an input language. Although we should discuss it,
I think the "SCL" language proposed by John Darlington and colleagues
(see "Reading" below) would be more or less suitable, perhaps in
simplified form. This is a data-parallel functional language which
allows fragments of C to be included (actually I think I prefer an
imperative language whose expressions can use the full power of a
functional language).
- Write/find a small set of example programs with which to evaluate
your work. Identify the minimum set of SCL operations needed to
run them.
- For each of your SCL operations, design a parallel implementation
scheme (called a "template") and write down a simple mathematical
model of its expected performance (one plausible shape for such a
model is sketched after this list). For some skeletons, it may make
sense to have several different templates for different situations,
but don't worry about that yet.
- Using the MPI (Message Passing Interface) standard for portable
parallel programming, build quick and simple implementations of
your templates (a minimal example appears after this list). Do some
performance analysis to get parameters for your performance models.
- You should now be able to translate your example programs into
implementations by hand, so you can get some results.
- Write an automatic translator to translate SCL into a C program
which calls templates and user code fragments appropriately
(note that you need to add monitoring code to gather execution
times for C fragments; a possible shape for that code is sketched
after this list).
- Add a resource allocation phase to your compiler. It should
build a representation of the combined performance of the complete
system, and should then use simple heuristics to make resource
allocations which minimise the expected execution time (a toy
version of such a heuristic is sketched after this list).
- Evaluate your work by studying the performance of your system on
your chosen benchmarks.
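To make some of the steps above concrete, here are a few illustrative
sketches; all names, numbers and models in them are invented rather
than prescribed. First, one plausible shape for a "simple mathematical
model" of a map template which scatters n elements over p processors
through a root processor, applies a user fragment of measured
per-element cost t_f, and gathers the results:

    T_map(n, p) = t_0 + 2 * n * t_elem + (n / p) * t_f

where t_0 (startup overhead) and t_elem (per-element transfer cost
through the root) are machine parameters to be estimated in the
benchmarking step, and the 2*n term charges for scattering the input
and gathering the result.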
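Second, a minimal MPI sketch of such a map template, timed so that
measurements can calibrate the model. The user fragment, problem size
and error handling are placeholders, and p is assumed to divide n:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Stand-in for a user-provided C fragment. */
    double user_fragment(double x) { return x * x; }

    /* Map template: scatter, apply the fragment locally, gather,
       and report the elapsed time for model calibration. */
    void map_template(const double *in, double *out, int n,
                      MPI_Comm comm, double *elapsed) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int local_n = n / p;                 /* assumes p divides n */
        double *local = malloc(local_n * sizeof *local);

        MPI_Barrier(comm);                   /* start everyone together */
        double t0 = MPI_Wtime();
        MPI_Scatter(in, local_n, MPI_DOUBLE,
                    local, local_n, MPI_DOUBLE, 0, comm);
        for (int i = 0; i < local_n; i++)
            local[i] = user_fragment(local[i]);
        MPI_Gather(local, local_n, MPI_DOUBLE,
                   out, local_n, MPI_DOUBLE, 0, comm);
        *elapsed = MPI_Wtime() - t0;
        free(local);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, n = 1 << 20;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double *in = NULL, *out = NULL, elapsed;
        if (rank == 0) {                     /* root holds global arrays */
            in = malloc(n * sizeof *in);
            out = malloc(n * sizeof *out);
            for (int i = 0; i < n; i++) in[i] = i;
        }
        map_template(in, out, n, MPI_COMM_WORLD, &elapsed);
        if (rank == 0) {
            printf("map of %d elements took %g s\n", n, elapsed);
            free(in); free(out);
        }
        MPI_Finalize();
        return 0;
    }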
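Third, the monitoring code emitted by the translator around each user
C fragment might take this shape (all names invented):

    double t0 = MPI_Wtime();
    y = user_fragment_3(x);              /* generated call to fragment 3 */
    frag_time[3] += MPI_Wtime() - t0;    /* accumulate total cost ...    */
    frag_count[3]++;                     /* ... and calls, for mean cost */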
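Finally, a toy version of the resource-allocation heuristic, for the
special case of two independent parallel operations with analytical
models tA and tB (invented placeholders): it compares running the two
operations one after the other on all P processors against every
concurrent split of the machine:

    #include <stdio.h>
    #include <math.h>

    /* Invented placeholder models: computation/p plus a log-cost
       communication term. Real models come from the templates. */
    static double tA(int p) { return 1.0e6 / p + 50.0 * log2(p); }
    static double tB(int p) { return 4.0e4 / p + 50.0 * log2(p); }

    int main(void) {
        const int P = 64;
        double best = tA(P) + tB(P);    /* phases in sequence, all P procs */
        int best_p = 0;                 /* 0 encodes "run in sequence" */
        for (int p = 1; p < P; p++) {   /* concurrent split: p and P - p */
            double t = fmax(tA(p), tB(P - p));
            if (t < best) { best = t; best_p = p; }
        }
        if (best_p == 0)
            printf("best: both phases on all %d procs, T = %.0f\n", P, best);
        else
            printf("best: split %d/%d, T = %.0f\n", best_p, P - best_p, best);
        return 0;
    }

A real allocator would walk the program's skeleton structure and apply
this kind of comparison at each composition, rather than to a single
pair of operations.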
Reading:
- Fortran 90D/HPF Compiler for Distributed-Memory MIMD Computers:
  Design, Implementation and Performance Results.
  Zeki Bozkus, Alok Choudhary, Geoffrey Fox, Tomasz Haupt and
  Sanjay Ranka. In Proceedings, Supercomputing '93.
- A Methodology for the Development and the Support of Massively
  Parallel Programs.
  M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti and
  M. Vanneschi.
  Future Generation Computer Systems, Vol. 8, pp. 205-220 (Aug 1992).
- Parallel Implementation of FP using a Template-based Approach.
  Marco Danelutto and Susanna Pelagatti, University of Pisa.
- Parallel Skeletons for Structured Composition.
  John Darlington, Yike Guo, Wing To and Jin Yang.
  In Fifth ACM SIGPLAN Symposium on Principles and Practice of
  Parallel Programming (PPOPP), July 1995.
Equipment: Unix workstation, Fujitsu AP1000.
Recommended tools:
Unix, C or C++, MPI (Message Passing Interface).
Suitability: This is a challenging research-level
project. The basic prerequisites are (1) basic compiler-writing skills,
(2) insight into performance, architecture and applications issues, and
(3) some aptitude for manipulating analytical performance models (we're
not talking about complicated maths here).