Compiler for Skeleton-based Parallel Programming
The goal of this project is to write a compiler which
automatically re-tunes a parallel program when it is ported
to a new parallel machine.
"Skeletons" are what make this possible. Skeletons encourage structured parallel
programming, in which a parallel program is built from
ready-made components for which efficient implementations
are known to exist. These components may be ordinary
parallel operations, or, more interestingly, they may
be "skeletons" which can be filled in with user-code.
The main issue in building such a compiler is to use analytical
performance models of the different skeletons used in a
program to find the most efficient allocation of resources.
Background:
A fundamental problem in parallel programming is that the way one part
of the program should be parallelised can depend on the parallelism in
the rest of the program.
- Example: Matrix multiply:
The way an input array is distributed may determine how a parallel
matrix multiply should be implemented, and may in turn influence
the distribution of the result.
- Example: Poisson solver: suppose one part of a
program finds the heat flow in a 2-D region, and a second part tests
whether all points in the region satisfy some convergence criterion. We
could allocate a square array of processors to solve for the heat
flow, and a tree of processors to compute the convergence test. In
reality, however, this may not be the most efficient allocation of
resources. It is likely to be more efficient to allocate all the
processors to solving for the heat flow, and to use the same
processors to do the (relatively fast) convergence testing (a
back-of-envelope version of this argument follows this list).
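A back-of-envelope version of the Poisson argument, with invented
numbers: suppose the solve phase costs roughly W_s/p time units per
iteration on p processors, the convergence test costs W_c/q + log2(q)
on q processors, and W_s = 10^6, W_c = 10^3, with 64 processors in
total. The two phases run one after the other in each iteration, so
reusing all 64 processors for both gives about 10^6/64 + 10^3/64 + 6,
roughly 15,650 units per iteration, whereas a 56/8 split gives about
10^6/56 + 10^3/8 + 3, roughly 17,990 units: reuse wins here. The
models and numbers are purely illustrative.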
The idea of skeletons is to get the programmer to write the program so
that as much of the application's parallelism as possible is exposed
in the form of calls to standard parallel operations/skeletons. Then
the compiler's job is to decide how each parallel operation should be
implemented (in parallel, sequentially, using which parallel
implementation "template"), and, if in parallel, how data structures
should be distributed and how processors should be allocated to the
different parallel activities.
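To give the flavour, a Poisson-style program in an invented, SCL-like
pseudocode (the real SCL syntax in the Darlington et al. paper below
differs) might look like:

    grid   = distribute block_2d initial_grid       -- choose a data placement
    grid'  = iterUntil converged (map relax) grid   -- repeat until converged
    answer = gather grid'

Here distribute, map, iterUntil and gather would be standard skeletons,
while relax and converged are user-supplied C fragments which the
compiler treats as black boxes with measured costs.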
To do this, the compiler must have an estimate of the execution time
for each user-provided code fragment (written in C for example). It
should also have a performance model of how each skeleton's
implementation template performs on the given target architecture.
The job:
- Design/select an input language. Although we should discuss it,
I think the "SCL" language proposed by John Darlington and colleagues
(see "Reading" below) would be more or less suitable, perhaps in
simplified form. This is a data-parallel functional language which
allows fragments of C to be included (actually I think I prefer an
imperative language whose expressions can use the full power of a
functional language).
- Write/find a small set of example programs with which to evaluate
your work. Identify the minimum set of SCL operations needed to
run them.
- For each of your SCL operations, design a parallel implementation
scheme (called a "template") and write down a simple mathematical
model of its expected performance (one plausible shape for such a
model is sketched after this list). For some skeletons, it may make
sense to have several different templates for different situations,
but don't worry about that yet.
- Using the MPI (Message Passing Interface) standard for portable
parallel programming, build quick and simple implementations of
your templates (a minimal example appears after this list). Do some
performance analysis to get parameters for your performance models.
- You should now be able to translate your example programs into
implementations by hand, so you can get some results.
- Write an automatic translator to translate SCL into a C program
which calls templates and user code fragments appropriately
(note that you need to add monitoring code to gather execution
times for C fragments; a possible shape for that code is sketched
after this list).
- Add a resource allocation phase to your compiler. It should
build a representation of the combined performance of the complete
system, and should then use simple heuristics to make resource
allocations which minimise the expected execution time (a toy
version of such a heuristic is sketched after this list).
- Evaluate your work by studying the performance of your system on
your chosen benchmarks.
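To make some of the steps above concrete, here are a few illustrative
sketches; all names, numbers and models in them are invented rather
than prescribed. First, one plausible shape for a "simple mathematical
model" of a map template which scatters n elements over p processors
through a root processor, applies a user fragment of measured
per-element cost t_f, and gathers the results:

    T_map(n, p) = t_0 + 2 * n * t_elem + (n / p) * t_f

where t_0 (startup overhead) and t_elem (per-element transfer cost
through the root) are machine parameters to be estimated in the
benchmarking step, and the 2*n term charges for scattering the input
and gathering the result.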
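Second, a minimal MPI sketch of such a map template, timed so that
measurements can calibrate the model. The user fragment, problem size
and error handling are placeholders, and p is assumed to divide n:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Stand-in for a user-provided C fragment. */
    double user_fragment(double x) { return x * x; }

    /* Map template: scatter, apply the fragment locally, gather,
       and report the elapsed time for model calibration. */
    void map_template(const double *in, double *out, int n,
                      MPI_Comm comm, double *elapsed) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int local_n = n / p;                 /* assumes p divides n */
        double *local = malloc(local_n * sizeof *local);

        MPI_Barrier(comm);                   /* start everyone together */
        double t0 = MPI_Wtime();
        MPI_Scatter(in, local_n, MPI_DOUBLE,
                    local, local_n, MPI_DOUBLE, 0, comm);
        for (int i = 0; i < local_n; i++)
            local[i] = user_fragment(local[i]);
        MPI_Gather(local, local_n, MPI_DOUBLE,
                   out, local_n, MPI_DOUBLE, 0, comm);
        *elapsed = MPI_Wtime() - t0;
        free(local);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, n = 1 << 20;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double *in = NULL, *out = NULL, elapsed;
        if (rank == 0) {                     /* root holds global arrays */
            in = malloc(n * sizeof *in);
            out = malloc(n * sizeof *out);
            for (int i = 0; i < n; i++) in[i] = i;
        }
        map_template(in, out, n, MPI_COMM_WORLD, &elapsed);
        if (rank == 0) {
            printf("map of %d elements took %g s\n", n, elapsed);
            free(in); free(out);
        }
        MPI_Finalize();
        return 0;
    }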
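Third, the monitoring code emitted by the translator around each user
C fragment might take this shape (all names invented):

    double t0 = MPI_Wtime();
    y = user_fragment_3(x);              /* generated call to fragment 3 */
    frag_time[3] += MPI_Wtime() - t0;    /* accumulate total cost ...    */
    frag_count[3]++;                     /* ... and calls, for mean cost */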
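Finally, a toy version of the resource-allocation heuristic, for the
special case of two independent parallel operations with analytical
models tA and tB (invented placeholders): it compares running the two
operations one after the other on all P processors against every
concurrent split of the machine:

    #include <stdio.h>
    #include <math.h>

    /* Invented placeholder models: computation/p plus a log-cost
       communication term. Real models come from the templates. */
    static double tA(int p) { return 1.0e6 / p + 50.0 * log2(p); }
    static double tB(int p) { return 4.0e4 / p + 50.0 * log2(p); }

    int main(void) {
        const int P = 64;
        double best = tA(P) + tB(P);    /* phases in sequence, all P procs */
        int best_p = 0;                 /* 0 encodes "run in sequence" */
        for (int p = 1; p < P; p++) {   /* concurrent split: p and P - p */
            double t = fmax(tA(p), tB(P - p));
            if (t < best) { best = t; best_p = p; }
        }
        if (best_p == 0)
            printf("best: both phases on all %d procs, T = %.0f\n", P, best);
        else
            printf("best: split %d/%d, T = %.0f\n", best_p, P - best_p, best);
        return 0;
    }

A real allocator would walk the program's skeleton structure and apply
this kind of comparison at each composition, rather than to a single
pair of operations.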
Reading:
- Fortran 90D/HPF Compiler for Distributed-Memory MIMD Computers:
  Design, Implementation and Performance Results.
  Zeki Bozkus, Alok Choudhary, Geoffrey Fox, Tomasz Haupt and
  Sanjay Ranka. In Proceedings, Supercomputing '93.
- A Methodology for the Development and the Support of Massively
  Parallel Programs.
  M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti and
  M. Vanneschi.
  Future Generation Computer Systems, Vol. 8, pp. 205-220 (Aug 1992).
- Parallel Implementation of FP using a Template-based Approach.
  Marco Danelutto and Susanna Pelagatti, University of Pisa.
- Parallel Skeletons for Structured Composition.
  John Darlington, Yike Guo, Wing To and Jin Yang.
  In Fifth ACM SIGPLAN Symposium on Principles and Practice of
  Parallel Programming (PPOPP), July 1995.
Equipment: Unix workstation, Fujitsu AP1000.
Recommended tools:
Unix, C or C++, MPI (Message Passing Interface).
Suitability: This is a challenging research-level
project. The basic prerequisites are (1) basic compiler-writing skills,
(2) insight into performance, architecture and applications issues, and
(3) some aptitude for manipulating analytical performance models (we're
not talking about complicated maths here).