This thesis describes the Caliban language and its pilot implementation. It then continues by presenting extensions and improvements to the basic language. Implementation techniques for the improved language are discussed in relation to an implementation on the Fujitsu AP1000 distributed memory multiprocessor computer. Two application case studies together with some performance results are presented. Finally, there is a critical appraisal of the language and its approach.
Caliban has good support for general data and computation partitioning. It also aids software reuse through its ability to abstract common computational structures into higher-order forms which are concretised at compile time by partial evaluation. However, some open issues remain concerning control of evaluation order. Finally, Caliban can be implemented reasonably efficiently on standard parallel hardware.
Frank Taylor's thesis can be found at his home page, http://www.lieder.demon.co.uk/frank.html
This thesis is an investigation into efficient tracing and rerunning of complete application programs. The main aim is to develop low-overhead techniques for capturing and rerunning complete traces of workloads' activities. The methodology presented distinguishes itself from other trace-based methods in that the traces can be rerun and used to conduct repeatable, realistic experiments to evaluate alternative operating system designs and configurations. Two forms of trace rerun are presented and evaluated using a number of real applications.
As a more speculative contribution, the tracing and re-execution technologies developed in the early part of the thesis are exploited for a very different, novel purpose: a file system in which, when appropriate, files may be recreated from the sequence of operations by which they were created, instead of being read from backing store.
This thesis presents balloon types, a general extension to programming languages which makes the ability to share state a first class property of a data type, resolving a long-standing flaw in existing data abstraction mechanisms. Balloon types provide the balloon invariant, which expresses a strong form of encapsulation of state: it is guaranteed that no state reachable (directly or transitively) by an object of a balloon type is referenced by any `external' object.
The mechanism is syntactically very simple, relying on a non-trivial static analysis to perform checking. The static analysis is presented as an abstract interpretation based on a denotational semantics of a simple imperative first-order language with constructs for creating and manipulating objects.
Balloon types are applicable in a wide range of areas such as program transformation, memory management and distributed systems. They are the key to obtaining self-contained composite objects, truly opaque data abstractions and value types---important concepts for the development of large scale, provably correct programs.
Sergio Almeida's papers etc can be found at his home page, http://shiva.di.uminho.pt/~psa/
This thesis presents a new architectural mechanism for remote read requests which reduces excessive queueing. The reduction in queueing comes from using intermediate nodes, called proxies, to service read requests. By combining read requests for data at a proxy, only one request has to be forwarded to the data's home node.
The use of proxies introduces overheads, primarily due to the extra messages which can arise from sending read requests via a proxy, and the local storage space which a proxy uses to retain data to satisfy later requests from its clients. The overheads can cause performance degradation if proxies are used when they are not needed. To avoid such adverse effects, several variants of the proxy technique are investigated: invoking proxies automatically when excessive queueing occurs at run-time (reactive and adaptive proxies), not holding proxy data copies, and using separate storage buffers for proxy data copies.
The different implementations of the proxy technique are evaluated using execution-driven simulations of eight benchmark applications. The results show that substantial performance benefits can be obtained by using proxies for applications which have widely-shared data structures, and using adaptive proxies in conjunction with separate proxy buffers can improve the performance for all the benchmark applications.
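The combining behaviour at the heart of the proxy technique can be illustrated with a small sketch. This is a hypothetical model, not the thesis implementation: names such as `ProxyNode` are illustrative, and the simulation ignores timing, buffering policy, and the reactive/adaptive variants.

```python
# Hypothetical sketch of proxy-based read combining: concurrent read
# requests for the same location are merged at a proxy, so only the first
# is forwarded to the data's home node; the reply fans out to all clients.

class ProxyNode:
    def __init__(self, home):
        self.home = home              # home node's memory (here, a dict)
        self.pending = {}             # address -> list of waiting clients
        self.forwarded = 0            # requests actually sent to the home node

    def read(self, client, addr):
        if addr in self.pending:
            # Combine: piggyback on the request already in flight.
            self.pending[addr].append(client)
        else:
            self.pending[addr] = [client]
            self.forwarded += 1       # only this request reaches the home node

    def reply(self, addr):
        # The home node's single reply is fanned out to every combined client.
        clients = self.pending.pop(addr)
        data = self.home[addr]
        return {c: data for c in clients}

home_memory = {0x10: "value"}
proxy = ProxyNode(home_memory)
for client in ["c1", "c2", "c3"]:
    proxy.read(client, 0x10)          # three reads of widely-shared data
replies = proxy.reply(0x10)
print(proxy.forwarded, len(replies))  # 1 forwarded request, 3 clients served
```

The point of the sketch is the ratio visible in the last line: three client requests generate a single message to the home node, which is what relieves queueing there.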
This thesis concerns techniques for efficient runtime optimisation of regular parallel programs that are built from separate software components.
High-quality, high-performance parallel software is frequently built
from separately-written reusable software components such as functions
from a library of parallel routines. Apart from the strong case for constructing software in this way from the software engineering point of view, there is often also a large performance benefit in hand-optimising individual, frequently used routines.
Hitherto, a problem with such libraries of separate software components
has been that there is a performance penalty, both because of invocation
and indirection overheads, and because opportunities
for cross-component optimisations are missed. The techniques we describe in this thesis aim to reverse this disadvantage by making use of high-level abstract information about the components for
performing cross-component optimisation. The key is to specify, generate and make use of metadata which characterise both data and software components, and to take advantage of run-time information.
We propose a delayed evaluation, self-optimising (DESO) library of data-parallel
numerical routines. Delayed evaluation allows us to capture the control-flow
of a user program from within the
library at runtime. When evaluation is eventually forced, we construct, at runtime, an optimised execution plan for the computation to be performed by the user program. The fact that our routines are
purely data-parallel means that the key optimisation we have to perform is to determine parallel data placements which minimise the overall execution time of the parallel program.
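The delayed-evaluation mechanism can be sketched in a few lines. This is an illustrative model only, not the DESO library itself: the `Deferred` class and the `add`/`mul` wrappers are invented names, and the real library chooses optimised parallel data placements at the point marked in the comment.

```python
import numpy as np

# Illustrative sketch of delayed evaluation: library calls build a DAG of
# deferred operations instead of computing immediately; evaluation is forced
# only when a result is needed, at which point the whole computation is
# visible and an optimised execution plan could be constructed.

class Deferred:
    def __init__(self, op, args):
        self.op, self.args = op, args
        self.value = None             # cached result once forced

    def force(self):
        # In a self-optimising library, this is where an execution plan
        # (e.g. parallel data placements) would be chosen for the whole DAG.
        if self.value is None:
            vals = [a.force() if isinstance(a, Deferred) else a
                    for a in self.args]
            self.value = self.op(*vals)
        return self.value

def add(x, y): return Deferred(np.add, [x, y])
def mul(x, y): return Deferred(np.multiply, [x, y])

a = np.ones(4)
b = np.full(4, 2.0)
expr = mul(add(a, b), b)      # nothing is computed yet: control flow captured
result = expr.force()         # evaluation forced; plan + compute happen here
print(result)                 # [6. 6. 6. 6.]
```

Because the user program's calls only build the DAG, the library sees the control flow of the whole computation before committing to any data placement.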
A key challenge in optimising at runtime is that optimisation itself
must not be allowed to slow down the user program. We describe a range
of techniques which permit us to fulfil this requirement:
we work from optimised aggregate components which are not re-optimised at runtime; we use carefully constructed mathematical metadata that facilitate efficient optimisation; we re-use previously
calculated optimisation results at runtime and we use an incremental optimisation strategy where we invest more optimisation time as optimisation contexts are encountered more frequently.
We propose specific algorithms for optimising affine alignment of arrays
in data parallel programs, for detecting opportunities to re-use previous
optimisation results, and for optimising replication in
data parallel programs.
We have implemented our techniques in a parallel version of the widely used Basic Linear Algebra Subroutines (BLAS) library, and we provide initial performance results obtained with this library.
Olav Beckmann's thesis is available from his home page, http://www.doc.ic.ac.uk/~ob3
This thesis describes a novel approach to automatically optimising the performance of distributed Java applications that make use of Remote Method Invocation (RMI), while preserving the original application semantics.
The key enabling optimisation is call aggregation, in which we delay the execution of remote calls for as long as possible on the client side until a dependency forces their execution. The delayed calls are sent over to the server as a single unit, along with meta-data describing the remote calls. This reduces the number of network crossings, and the increased context information in conjunction with the meta-data enables the application of cross-call optimisations such as data sharing and dead-variable elimination. Other optimisations include server forwarding, which can reroute data communications to exploit fast connections, and plan caching, which is used to reduce run time overheads. Experimental evaluation suggests that the performance of applications under these optimisations can be comparable to that of implementing the aggregation and forwarding optimisations by hand.
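The shape of call aggregation can be sketched as follows. This is a hypothetical model, not the thesis system: `BatchingClient` and `Server` are invented names, the "dependency" is modelled crudely as an explicit `force()`, and the real work of shipping a plan plus meta-data to the server is reduced to counting network crossings.

```python
# Hypothetical sketch of client-side call aggregation: remote calls are
# queued rather than executed, and the queue is flushed as a single batch
# only when a dependency forces a result to materialise.

class BatchingClient:
    def __init__(self, server):
        self.server = server
        self.queue = []          # delayed (method, args) pairs
        self.round_trips = 0     # network crossings actually made

    def call(self, method, *args):
        self.queue.append((method, args))   # delay: no network traffic yet

    def force(self):
        # A dependency on a result: ship the whole batch in one crossing,
        # where a real system would also attach meta-data describing the
        # calls to enable cross-call optimisation on the server.
        batch, self.queue = self.queue, []
        self.round_trips += 1
        return [getattr(self.server, m)(*a) for m, a in batch]

class Server:
    def put(self, k, v): return None
    def get(self, k): return k.upper()

client = BatchingClient(Server())
client.call("put", "x", 1)
client.call("put", "y", 2)
client.call("get", "x")
results = client.force()      # three remote calls, one network crossing
print(client.round_trips)     # 1
```

With naive RMI the three calls above would each incur a round trip; aggregation reduces that to one, and handing the server the whole batch at once is what creates the context for cross-call optimisations.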
The Veneer virtual Java Virtual Machine (vJVM) is presented as a flexible platform on which to base the RMI optimisations by providing a level of control over an executing program close to that of a customised interpreter while running on a standard JVM. Veneer can intercept selected methods of a program and delegate the process of execution to a user-defined executor, which is essentially a simple interpreter that may deviate from normal execution while executing the method if required.
There are many circumstances in which the optimisations might alter the semantics of an RMI program, and practical ways to detect and correct this are investigated. We develop a simple logical framework on which to reason about the optimisations, and we use it to show that call aggregation is safe provided that certain conditions are met.
This thesis concerns techniques for using hierarchical storage formats such as Morton layout as an alternative storage layout for regular scientific programs operating over dense two-dimensional arrays.
Programming languages with support for two-dimensional arrays use one of two linear mappings from two-dimensional array indices to locations in the machine's one-dimensional address space: row-major or column-major. Unfortunately, a wrong choice of one of these two layouts can lead to very poor spatial locality if an array is not traversed in the order in which it is stored. Although a simple loop interchange transformation may resolve the situation, such a transformation may not always be valid due to actual dependencies or dependencies assumed by conservative compiler analyses.
An attractive strategy is to store elements in such a way that both row-major and column-major traversals offer good spatial locality. Hierarchically-blocked non-linear storage layouts, such as Morton ordering, have been proposed as a compromise between row-major and column-major layouts. Morton layout offers some spatial locality whether traversed row-wise or column-wise. The contributions of the thesis are:
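Morton layout itself is easy to make concrete. The sketch below uses one common convention (row bits in odd positions, column bits in even positions); it is illustrative, not the thesis's implementation, which would use table lookups or bit tricks rather than a bit-by-bit loop.

```python
# Sketch of the Morton (Z-order) index for a 2D array: the bits of the row
# and column indices are interleaved, so elements nearby in either a
# row-wise or a column-wise traversal tend to stay nearby in memory.

def morton_index(i, j, bits=16):
    """Interleave the bits of row index i and column index j."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)   # row bits -> odd positions
        idx |= ((j >> b) & 1) << (2 * b)       # column bits -> even positions
    return idx

# For a 4x4 array, row-major would assign offsets 0..15 row by row;
# Morton layout instead visits elements in recursive 2x2 blocks.
table = [[morton_index(i, j) for j in range(4)] for i in range(4)]
for row in table:
    print(row)
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

Reading the printed table across a row or down a column, consecutive offsets stay within small blocks, which is the compromise between the two canonical layouts that the abstract describes.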
This thesis is focused on improving execution time and precision of scalable pointer analysis. Such an analysis statically determines the targets of all pointer variables in a program. We formulate the analysis as a directed graph problem, where the solution can be obtained by a computation similar, in many ways, to transitive closure. As with transitive closure, identifying strongly connected components and transitive edges offers significant gains. However, our problem differs as the computation can result in new edges being added to the graph and, hence, dynamic algorithms are needed to efficiently identify these structures. Thus, pointer analysis has often been likened to the dynamic transitive closure problem.
Two new algorithms for dynamically maintaining the topological order of a directed graph are presented. The first is a unit change algorithm, meaning the solution must be recomputed immediately following an edge insertion. While this has a marginally inferior worst-case time bound, compared with a previous solution, it is far simpler to implement and has fewer restrictions. For these reasons, we find it to be faster in practice and provide an experimental study over random graphs to support this. The second is a batch algorithm, meaning the solution can be updated after several insertions, and it is the first truly dynamic solution to obtain an optimal time bound of O(v+e+b) over a batch b of edge insertions. Again, we provide an experimental study over random graphs comparing this against the standard approach to topological sort. Furthermore, we demonstrate how both algorithms can be extended to the problem of dynamically detecting strongly connected components (i.e. cycles), thus achieving the first solutions which do not need to traverse the entire graph for half of all edge insertions.
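The flavour of the unit change approach can be sketched as follows. This is a simplified illustration in the spirit of such an algorithm, not the thesis pseudocode: cycle detection is omitted, and only the "affected region" between the endpoints of an inserted edge is reordered.

```python
# Hedged sketch of a unit-change online topological order algorithm.
# ord[v] is v's current position. Inserting edge x->y only requires work
# when ord[y] < ord[x]; then nodes reachable forward from y and backward
# from x, within the affected region, have their positions reallocated.

class OnlineTopo:
    def __init__(self, n):
        self.succ = [set() for _ in range(n)]
        self.pred = [set() for _ in range(n)]
        self.ord = list(range(n))          # arbitrary initial valid order

    def insert(self, x, y):
        self.succ[x].add(y)
        self.pred[y].add(x)
        lb, ub = self.ord[y], self.ord[x]
        if lb >= ub:
            return                          # order already consistent
        # Discover the affected region (cycle detection omitted for brevity).
        rf = self._dfs(y, self.succ, lambda v: self.ord[v] < ub)
        rb = self._dfs(x, self.pred, lambda v: self.ord[v] > lb)
        # Reallocate the union of their old positions: rb nodes first,
        # then rf nodes, each group keeping its internal relative order.
        nodes = (sorted(rb, key=lambda v: self.ord[v]) +
                 sorted(rf, key=lambda v: self.ord[v]))
        slots = sorted(self.ord[v] for v in nodes)
        for v, s in zip(nodes, slots):
            self.ord[v] = s

    def _dfs(self, root, adj, in_region):
        seen, stack = {root}, [root]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen and in_region(w):
                    seen.add(w)
                    stack.append(w)
        return seen

g = OnlineTopo(4)
for x, y in [(3, 2), (2, 1), (1, 0)]:   # edges that fight the initial order
    g.insert(x, y)
# Every edge now respects the maintained order.
assert all(g.ord[x] < g.ord[y] for x in range(4) for y in g.succ[x])
```

Crucially, nodes outside the region between `ord[y]` and `ord[x]` are never touched, which is why such algorithms beat recomputing a topological sort from scratch on each insertion.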
Several other new techniques for improving pointer analysis are also presented. These include difference propagation, which avoids redundant work by tracking changes in the points-to sets, and a novel approach to field-sensitive analysis of C. Finally, a detailed study of numerous solving algorithms, evaluating our techniques and algorithms against previous work, is contained herein. Our benchmark suite consists of many common C programs ranging in size from 15,000-200,000 lines of code.
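Difference propagation can be illustrated with a toy inclusion-based solver. This is a hedged sketch, not the thesis algorithm: the constraint graph is fixed in advance (a real pointer analysis adds edges as it discovers indirect flows), and `solve` is an invented helper name.

```python
from collections import defaultdict, deque

# Sketch of difference propagation in a worklist-based points-to solver:
# along each subset constraint dst >= src we propagate only the *delta* of
# src's points-to set since its last propagation, not the whole set.

def solve(edges, initial):
    """edges: (src, dst) subset constraints; initial: seed points-to sets."""
    succ = defaultdict(set)
    for s, d in edges:
        succ[s].add(d)
    pts = defaultdict(set)        # current points-to sets
    delta = defaultdict(set)      # additions not yet propagated onward
    work = deque()
    for v, targets in initial.items():
        pts[v] |= targets
        delta[v] |= targets
        work.append(v)
    while work:
        v = work.popleft()
        d, delta[v] = delta[v], set()
        for w in succ[v]:
            new = d - pts[w]      # only genuinely new targets flow onward
            if new:
                pts[w] |= new
                delta[w] |= new
                work.append(w)
    return pts

# Models: p = &a; q = &b; r = p; r = q; s = r
pts = solve([("p", "r"), ("q", "r"), ("r", "s")],
            {"p": {"a"}, "q": {"b"}})
print(sorted(pts["s"]))           # ['a', 'b']
```

The saving is in the `d - pts[w]` step: targets already known at `w` are never re-propagated through `w`'s outgoing edges, avoiding the redundant work that plain set-union propagation performs.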
This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata.
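The loop fusion and array contraction composition underlying the first optimisation can be shown on a one-dimensional toy effect. This example is illustrative only (the function names and the blur-then-gain pipeline are invented, and real visual effects operate on 2D images with far richer schedules):

```python
# Illustrative composition of loop fusion and array contraction: two passes
# over the data, linked by a full-size intermediate array, become one pass
# in which the intermediate is contracted to a scalar.

def blur_then_gain_naive(src, gain):
    n = len(src)
    tmp = [0.0] * n                    # full-size intermediate working set
    for i in range(n):                 # pass 1: 3-point blur
        tmp[i] = (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3
    return [gain * tmp[i] for i in range(n)]   # pass 2: apply gain

def blur_then_gain_fused(src, gain):
    n = len(src)
    out = []
    for i in range(n):                 # fused: a single pass over the data
        t = (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3
        out.append(gain * t)           # tmp contracted to the scalar t
    return out

src = [1.0, 2.0, 3.0, 4.0]
assert blur_then_gain_naive(src, 2.0) == blur_then_gain_fused(src, 2.0)
```

The fused version touches each input element once and keeps the intermediate value in a register-sized scalar instead of a cache-straining array, which is precisely the space and schedule benefit the metadata-driven transformation seeks for whole effect pipelines.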
A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation.