Paul Kelly: PhD theses supervised

Andrew J Bennett

Parallel Graph Reduction for Shared-Memory Architectures

Imperial College, 1993

Frank S Taylor

Parallel Functional Programming by Partitioning

Imperial College, 1997


Caliban is a declarative language which addresses the area of static distributed memory parallel computing. It is an annotation language that allows the programmer to partition a functional program and data amongst the computational resources available. It is integrated into the source language so that the full power of the host language can be used to express the partitioning of the program. Partial evaluation is used to determine a complete version of the annotation at compile time. Program transformation is then used to make the parallelism explicit.

This thesis describes the Caliban language and its pilot implementation. It then continues by presenting extensions and improvements to the basic language. Implementation techniques for the improved language are discussed in relation to an implementation on the Fujitsu AP1000 distributed memory multiprocessor computer. Two application case studies together with some performance results are presented. Finally, there is a critical appraisal of the language and its approach.

Caliban has good support for general data and computation partitioning. It also aids software reuse with its ability to abstract common computational structures into higher order forms which are concretised at compile time by partial evaluation. However, there do remain some open issues relating to evaluation order control. Finally, Caliban can be implemented reasonably efficiently on standard parallel hardware.

Frank Taylor's thesis can be found at his home page,

Ariel N Burton

The use of replayable traces in the design and evaluation of operating systems

Imperial College, 1998


An important problem faced by operating system and file system designers and administrators, is evaluating or predicting the performance consequences of their decisions. To the designer, it is vital to know how different alternatives will affect real applications, whilst for the administrator it is important that a system is configured so that its resources are used optimally. However, when considering different options it is essential that any decision is based on knowledge of how the system will behave in the field under realistic conditions. Existing techniques and tools fail either to model realistic workloads, or are not easily repeatable or reproducible, making comparisons of alternatives difficult. Nor are they simple or convenient to use.

This thesis is an investigation into efficient tracing and rerunning of complete application programs. The main aim is to develop low-overhead techniques for capturing and rerunning complete traces of workloads' activities. The methodology presented distinguishes itself from other trace based methods in that the traces can be rerun and used to conduct repeatable, realistic, experiments to evaluate alternative operating system designs and configurations. Two forms of trace rerun are presented and evaluated using a number of real applications.

As a more speculative contribution, the tracing and reexecution technologies developed in the early part of the thesis are exploited and used for a very different, but novel purpose: a file system in which, when appropriate, files may be recreated from the sequence of operations by which they were created, instead of being read from backing store.

Paulo Sergio Soares de Almeida.

Control of Object Sharing in Programming Languages

Imperial College, 1998


Current data abstraction mechanisms are not adequate to control sharing of state in the general case involving objects in linked structures. They only prevent the direct access to the state variables of single objects, as opposed to considering the state reachable by an object and the inter-object references, neglecting the fact that an object is not, in general, self-contained. The pervading possibility of sharing is a source of errors and an obstacle both to reasoning about programs and to language implementation techniques.

This thesis presents balloon types, a general extension to programming languages which makes the ability to share state a first class property of a data type, resolving a long-standing flaw in existing data abstraction mechanisms. Balloon types provide the balloon invariant, which expresses a strong form of encapsulation of state: it is guaranteed that no state reachable (directly or transitively) by an object of a balloon type is referenced by any `external' object.

The mechanism is syntactically very simple, relying on a non-trivial static analysis to perform checking. The static analysis is presented as an abstract interpretation based on a denotational semantics of a simple imperative first-order language with constructs for creating and manipulating objects.

Balloon types are applicable in a wide range of areas such as program transformation, memory management and distributed systems. They are the key to obtaining self-contained composite objects, truly opaque data abstractions and value types---important concepts for the development of large scale, provably correct programs.

Sergio Almeida's papers etc can be found at his home page,

Sarah A M Bennett (nee Talbot)

Shared-Memory Multiprocessors With Stable Performance

Imperial College, 1999


The shared-memory programming model is attractive to programmers of parallel computers because they are not required to control the placement and communication of application data. Unfortunately, access to data is the root cause of performance problems on distributed shared-memory multiprocessors. Severe performance degradation can occur when there are many access requests competing for service at one or a few processing nodes. The request messages must join a queue for service, thereby increasing the delay between the original issue of a remote access request and the receipt of the data.

This thesis presents a new architectural mechanism for remote read requests which reduces excessive queueing. The reduction in queueing comes from using intermediate nodes, called proxies, to service read requests. By combining read requests for data at a proxy, only one request has to be forwarded to the data's home node.

The use of proxies introduces overheads, primarily due to the extra messages which can arise from sending read requests via a proxy, and the local storage space which a proxy uses to retain data to satisfy later requests from its clients. The overheads can cause performance degradation if proxies are used when they are not needed. To avoid such adverse effects, several variants of the proxy technique are investigated: invoking proxies automatically when excessive queueing occurs at run-time (reactive and adaptive proxies), not holding proxy data copies, and using separate storage buffers for proxy data copies.

The different implementations of the proxy technique are evaluated using execution-driven simulations of eight benchmark applications. The results show that substantial performance benefits can be obtained by using proxies for applications which have widely-shared data structures, and using adaptive proxies in conjunction with separate proxy buffers can improve the performance for all the benchmark applications.



Olav Beckmann

Interprocedural Optimisation of Regular Parallel Computations at Runtime

Imperial College, 2001


This thesis concerns techniques for efficient runtime optimisation of regular parallel programs that are built from separate software components.

High-quality, high-performance parallel software is frequently built from separately-written reusable software components such as functions from a library of parallel routines. Apart from the strong
case from the software engineering point-of-view for constructing software in such a way, there is often also a large performance benefit in hand-optimising individual, frequently used routines.

Hitherto, a problem with such libraries of separate software components has been that there is a performance penalty, both because of invocation and indirection overheads, and because opportunities
for cross-component optimisations are missed. The techniques we describe in this thesis aim to reverse this disadvantage by making use of high-level abstract information about the components for
performing cross-component optimisation. The key is to specify, generate and make use of metadata which characterise both data and software components, and to take advantage of run-time information.

We propose a delayed evaluation, self-optimising (DESO) library of data-parallel numerical routines. Delayed evaluation allows us to capture the control-flow of a user program from within the
library at runtime. When evaluation is eventually forced, we construct, at runtime, an optimised execution plan for the computation to be performed by the user program. The fact that our routines are
purely data-parallel means that the key optimisation we have to perform is to determine parallel data placements which minimise the overall execution time of the parallel program.

A key challenge in optimising at runtime is that optimisation itself must not be allowed to slow down the user program. We describe a range of techniques which permit us to fulfil this requirement:
we work from optimised aggregate components which are not re-optimised at runtime; we use carefully constructed mathematical metadata that facilitate efficient optimisation; we re-use previously
calculated optimisation results at runtime and we use an incremental optimisation strategy where we invest more optimisation time as optimisation contexts are encountered more frequently.

We propose specific algorithms for optimising affine alignment of arrays in data parallel programs, for detecting opportunities to re-use previous optimisation results, and for optimising replication in
data parallel programs.

We have implemented our techniques in a parallel version of the widely used Basic Linear Algebra Subroutines (BLAS) library, and we provide initial performance results obtained with this library.

Olav Beckmann's thesis is available from his home page,

Kwok Cheung Yeung

Dynamic performance optimisation of distributed Java applications

Imperial College, 2004


This thesis describes a novel approach to optimising the performance of distributed Java applications that make use of Remote Method Invocation (RMI) automatically while preserving the original application semantics.

The key enabling optimisation is call aggregation, in which we delay the execution of remote calls for as long as possible on the client side until a dependency forces their execution. The delayed calls are sent over to the server as a single unit, along with meta-data describing the remote calls. This reduces the number of network crossings, and the increased context information in conjunction with the meta-data enables the application of cross-call optimisations such as data sharing and dead-variable elimination. Other optimisations include server forwarding, which can reroute data communications to exploit fast connections, and plan caching, which is used to reduce run time overheads. Experimental evaluation suggests that the performance of applications under these optimisations can be comparable to that of implementing the aggregation and forwarding optimisations by hand.

The Veneer virtual Java Virtual Machine (vJVM) is presented as a flexible platform on which to base the RMI optimisations by providing a level of control over an executing program close to that of a customised interpreter while running on a standard JVM. Veneer can intercept selected methods of a program and delegate the process of execution to a user-defined executor, which is essentially a simple interpreter that may deviate from normal execution while executing the method if required.

There are many circumstances in which the optimisations might alter the semantics of an RMI program, and practical ways to detect and correct this are investigated. We develop a simple logical framework on which to reason about the optimisations, and we use it to show that call aggregation is safe provided that certain conditions are met.


Thiyagalingam Jeyarajan

Alternative Array Storage Layouts for Regular Scientific Programs

Imperial College, 2005


This thesis concerns techniques for using hierarchical storage formats such as Morton layout as an alternative storage layout for regular scientific programs operating over dense two-dimensional arrays.

Programming languages with support for two-dimensional arrays use one of two linear mappings from two-dimensional array indices to locations in the machine's one-dimensional address space: row-major or column-major. Unfortunately, a wrong choice of one of these two layouts can lead to very poor spatial locality if an array is not traversed in the order in which it is stored. Although a simple loop interchange transformation may resolve the situation, such a transformation may not always be valid due to actual dependencies or dependencies assumed by conservative compiler analyses.

An attractive strategy is to store elements in such a way that both row-major and column-major traversals offer good spatial locality. Hierarchically-blocked non-linear storage layouts, such as Morton ordering have been proposed as a compromise between row-major and column-major layouts. Morton layout offers some spatial locality whether traversed row-wise or column-wise. The contributions of the thesis are:

  1. An experimental exploration of performance issues of Morton layout using a suite of microbenchmark kernels on several hardware platforms.
  2. We show that the performance of the basic Morton scheme can be improved by aligning the base address of Morton arrays to the largest significant size in the memory hierarchy, namely page size.
  3. We show that unrolling the loops with strength reduction reduces address calculation overhead associated with the usage of Morton layouts and significantly improves performance of basic Morton scheme.
  4. We discuss the design issues for implementing a prototype compiler to support Morton layout in large scientific programs including required transformations.
The optimisations we propose here enable Morton layout to be a promising alternative to conventional array layouts. Further, we support our claims through experimental results using selected benchmark kernels on several hardware platforms.


David J Pearce

Some directed graph algorithms and their application to pointer analysis

Imperial College, 2005


This thesis is focused on improving execution time and precision of scalable pointer analysis. Such an analysis statically determines the targets of all pointer variables in a program. We formulate the analysis as a directed graph problem, where the solution can be obtained by a computation similar, in many ways, to transitive closure. As with transitive closure, identifying strongly connected components and transitive edges offers significant gains. However, our problem differs as the computation can result in new edges being added to the graph and, hence, dynamic algorithms are needed to efficiently identify these structures. Thus, pointer analysis has often been likened to the dynamic transitive closure problem.

Two new algorithms for dynamically maintaining the topological order of a directed graph are presented. The first is a unit change algorithm, meaning the solution must be recomputed immediately following an edge insertion. While this has a marginally inferior worse-case time bound, compared with a previous solution, it is far simpler to implement and has fewer restrictions. For these reasons, we find it to be faster in practice and provide an experimental study over random graphs to support this. Our second is a batch algorithm, meaning the solution can be updated after several insertions, and it is the first truly dynamic solution to obtain an optimal time bound of O(v+e+b) over a batch b of edge insertions. Again, we provide an experimental study over random graphs comparing this against the standard approach to topological sort. Furthermore, we demonstrate how both algorithms can be extended to the problem of dynamically detecting strongly connected components (i.e. cycles), thus achieving the first solutions which do not need to traverse the entire graph for half of all edge insertions.

Several other new techniques for improving pointer analysis are also presented. These include difference propagation, which avoids redundant work by tracking changes in the points-to sets, and a novel approach to field-sensitive analysis of C. Finally, a detailed study of numerous solving algorithms, evaluating our techniques and algorithms against previous work, is contained herein. Our benchmark suite consists of many common C programs ranging in size from 15,000-200,000 lines of code.

pdf available from David's home page

Jay Cornwall

A metadata-enhanced framework for high performance visual effects

Imperial College, 2010


This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific op- timising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a ven- dor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation.

Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations – to the dynamic working sets and schedules of a runtime- parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by iden- tifying inter-thread and intra-core data sharing opportunities in memory access metadata.

A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation.