(See end of page for slides from various talks, including my inaugural lecture (May 13 2009).)
We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Optimizing their performance is challenging: even though affine loop nests are generally present, the short trip counts and the complexity of the mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Finally, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvements are achieved. The generality of the approach and the applicability of the proposed code transformations to other domains are also discussed.
The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
We present a symbolic execution-based technique for cross-checking programs accelerated using SIMD or OpenCL against an unaccelerated version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a tool based on the symbolic execution engine KLEE that supports symbolic reasoning on the equivalence between expressions involving both integer and floating-point operations. While the current generation of constraint solvers provide effective support for integer arithmetic, the situation is different for floating-point arithmetic, due to the complexity inherent in such computations. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. This allows us to use an algorithm based on symbolic expression matching augmented with canonicalisation rules to determine path equivalence. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. To support the Intel Streaming SIMD Extension (SSE) instruction set, we lower SSE instructions to equivalent generic vector operations, which in turn are interpreted in terms of primitive integer and floating-point operations. To support OpenCL programs, we symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. We detect data races by keeping track of all memory accesses using a memory log, and reporting a race whenever we detect that two accesses conflict. By representing the memory log symbolically, we are also able to detect races associated with symbolically-indexed accesses of memory objects. 
We used KLEE-CL to prove the bounded equivalence between scalar and data-parallel versions of floating-point programs and find a number of issues in a variety of open source projects that use SSE and OpenCL, including mismatches between implementations, memory errors, race conditions and a compiler bug.
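A hypothetical sketch of the key insight (illustrative tuples, not KLEE-CL's actual data structures): expression trees are matched structurally after canonicalisation rules that are sound for IEEE 754 floating point, such as commutativity of addition and multiplication, while reassociation is deliberately not applied because it can change results:

```python
# Expression trees are tuples (op, arg, ...); leaves are symbolic names.
def canon(e):
    if isinstance(e, tuple):
        op, *args = e
        args = [canon(a) for a in args]
        if op in ("fadd", "fmul"):  # commutative, but *not* associative, in FP
            args.sort(key=repr)     # fixed ordering of operands
        return (op, *args)
    return e

def equivalent(e1, e2):
    # two FP expressions are deemed equal only if they are essentially
    # built by the same operations, modulo the canonicalisation above
    return canon(e1) == canon(e2)
```

For example, `x + y` and `y + x` match, but `(x + y) + z` and `x + (y + z)` do not, reflecting that reassociated floating-point sums can differ.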
Data-dependent GPU kernels, whose data or control flow are dependent on the input of the program, are difficult to verify because they require reasoning about shared state manipulated by many parallel threads. Existing verification techniques for GPU kernels achieve soundness and scalability by using a two-thread reduction and making the contents of the shared state nondeterministic each time threads synchronise at a barrier, to account for all possible thread interactions. This coarse abstraction prohibits verification of data-dependent kernels. We present barrier invariants, a novel abstraction technique which allows key properties about the shared state of a kernel to be preserved across barriers during formal reasoning. We have integrated barrier invariants with the GPUVerify tool, and present a detailed case study showing how they can be used to verify three prefix sum algorithms, allowing efficient modular verification of a stream compaction kernel, a key building block for GPU programming. This analysis goes significantly beyond what is possible using existing verification techniques for GPU kernels.
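To see why prefix sums are the building block of stream compaction, here is the sequential reference semantics of the pattern (the GPU kernels parallelise the scan, which is precisely the data-dependent shared state that barrier invariants reason about):

```python
def exclusive_scan(xs):
    # exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def compact(values, keep):
    # stream compaction: keep[i] flags the survivors; the exclusive scan
    # of the flags gives each survivor its position in the output
    flags = [1 if k else 0 for k in keep]
    pos = exclusive_scan(flags)
    out = [None] * sum(flags)
    for i, f in enumerate(flags):
        if f:
            out[pos[i]] = values[i]
    return out
```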
OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
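A hypothetical sketch of the serial semantics of an OP2-style parallel loop (illustrative names, not the actual OP2 API): the user writes a per-edge kernel, and the framework supplies the iteration over the edge set and the indirection through the edge-to-node mapping. Real OP2 generates parallel implementations and uses colouring to make the indirect increments race-free:

```python
def par_loop(kernel, edges, edge_to_node, node_dat, edge_dat):
    # reference (serial) semantics of an OP2-loop over an edge set
    for e in edges:
        n0, n1 = edge_to_node[e]
        kernel(edge_dat[e], node_dat, n0, n1)

def scatter_half(w, nodes, a, b):
    # indirect increment: each edge contributes half its datum to each endpoint;
    # these writes are the data dependencies the framework must handle
    nodes[a] += w / 2
    nodes[b] += w / 2
```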
Partitioning a parallel computation into finitely sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space where each partition is subject to a fine-grained distribution of resources that has a direct yet hard to estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled codes for affine loop programs on GPUs which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size and a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. Our experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
We present the major advantages of a new 'object oriented' 3D SLAM paradigm, which takes full advantage, in the loop, of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. As a hand-held depth camera browses a cluttered scene, real-time 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. The object graph enables predictions for accurate ICP-based camera-to-model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object-level scene description with the potential to enable interaction.
This paper demonstrates a methodology to help practitioners maximise the utility of complex multidisciplinary engineering models implemented as spreadsheets, an area presenting unique challenges. As motivation we investigate the expanding use of Integrated Resource Management (IRM) models which assess the sustainability of urban masterplan designs. IRM models reflect the inherent complexity of multidisciplinary sustainability analysis by integrating models from many disciplines. This complexity makes their use time-consuming and reduces their adoption.
We present a methodology and toolkit for analysing multidisciplinary engineering models implemented as spreadsheets to alleviate such problems and increase their adoption. For a given output a relevant slice of the model is extracted, visualised and analysed by computing model and interdisciplinary metrics. A sensitivity analysis of the extracted model supports engineers in their optimisation efforts. These methods expose, manage and reduce model complexity and risk whilst giving practitioners insight into multidisciplinary model composition. We report application of the methodology to several generations of an industrial IRM model and detail the insight generated, particularly considering model evolution.
There is a significant, established code base in the computational science community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering dictated by data dependencies which, as part of the abstraction, are exposed explicitly to avoid interprocedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or runtime system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, Chombo, used in full applications with structured grids, and a domain-specific library, OP2, used in full applications with unstructured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer “time” loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism.
One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
This paper introduces a method to combine the advantages of both task parallelism and fine-grained co-design specialisation to achieve faster execution times than either method alone on distributed heterogeneous architectures. The method uses a novel mixed integer linear programming formalisation to assign code sections from parallel tasks to share computational components with the optimal trade-off between acceleration from component specialism and serialisation delay. The paper provides results for software benchmarks partitioned using the method and formal implementations of previous alternatives to demonstrate both the practical tractability of the linear programming approach and the increase in program acceleration potential deliverable.
Automated code generators for finite element local assembly have facilitated the exploration of a number of different implementation strategies within the generated code. However, even with respect to a theoretical performance indicator such as operation count, there is not currently a known optimal strategy for performing local assembly. We explore a code generation strategy based on symbolic integration and polynomial factorisation designed to expose optimisation opportunities. We present our implementation of a finite element assembly code generator based on these techniques. We compare the operation count and performance of our generated code against quadrature and tensor contraction based implementations generated by the FEniCS Form Compiler (FFC) under multiple compilers. Although the approach of using symbolic integration and factorisation for optimising finite element assembly is not new, we distinguish our work through certain design choices and our strategy for optimising common sub-expressions. Under one compiler, we show reductions in the operation count of the compiled code across a number of variational forms of up to 4 times. We also explore the impact on numerical accuracy of some of the FFC tensor contraction optimisations.
This paper introduces a novel execution paradigm called the Write-Only Architecture (WOA) that reduces communication latency overheads by up to a factor of five over previous methods. The WOA writes data through distributed control flow logic rather than using a read–write paradigm or a centralised message hub which allows tasks to be partitioned at a fine-grained level without suffering from excessive communication overheads on distributed systems. In this paper we provide formal assignment results for software benchmarks partitioned using the WOA and previous execution paradigms for distributed heterogeneous architectures along with bounds and complexity information to demonstrate the robust performance improvements possible with the WOA.
Mesh smoothing is an important algorithm for the improvement of element quality in unstructured mesh finite element methods. A new optimisation based mesh smoothing algorithm is presented for anisotropic mesh adaptivity. It is shown that this smoothing kernel is very effective at raising the minimum local quality of the mesh. A number of strategies are employed to reduce the algorithm's cost while maintaining its effectiveness in improving overall mesh quality. The method is parallelised using hybrid OpenMP/MPI programming methods, and graph colouring to identify independent sets. Different approaches are explored to achieve good scaling performance within a shared memory compute node.
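As a minimal sketch of the colouring step (greedy first-fit colouring; the paper's colouring algorithm may differ): vertices that share a colour form an independent set, so no two of them are mesh neighbours and they can be smoothed concurrently without conflicting updates:

```python
def greedy_colouring(adj):
    # assign each vertex the smallest colour unused by its neighbours
    colour = {}
    for v in sorted(adj):
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

def independent_sets(adj):
    # group vertices by colour: each group can be processed in parallel
    sets = {}
    for v, c in greedy_colouring(adj).items():
        sets.setdefault(c, []).append(v)
    return list(sets.values())
```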
Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation.
In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be applied at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
OP2 is an ``active'' library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2’s recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs.
We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. We first present the design of a new OP2 back-end that enables the execution of applications on distributed memory clusters, and benchmark its performance during the solution of a 1.5M and 26M edge-based CFD application written using OP2. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. We then apply performance modeling to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling us to compare OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.
Power gating reduces static power by either disabling whole units or dynamically resizing units to meet application demands. The Loop-Directed Mothballing (LDM) technique power gates execution units by recording the utilization of individual units within loops, and by power gating units according to two utilization thresholds. LDM offers on average 10.3 percent total power savings with low performance loss.
Publisher's site (doi: 10.1109/MM.2011.92)
We present an effective technique for cross-checking a C or C++ program against an accelerated OpenCL version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE and KLEE-FP that supports symbolic reasoning on the equivalence between symbolic values.
Our approach is to symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. Using this model we are able to run OpenCL programs symbolically, keeping track of memory accesses for the purpose of race detection. We then compare the symbolic result against the plain C or C++ implementation in order to detect mismatches between the two versions.
We applied KLEE-CL to the Parboil benchmark suite, the Bullet physics library and the OP2 library, in which we were able to find a total of seven errors: two mismatches between the OpenCL and C implementations, three memory errors, one OpenCL compiler bug and one race condition.
We demonstrate that radically differing implementations of finite element methods are needed on multi-core (CPU) and many-core (GPU) architectures, if their respective performance potential is to be realised. Our numerical investigations using a finite element advection-diffusion solver show that increased performance on each architecture can only be achieved by committing to specific and diverse algorithmic choices that cut across the high-level structure of the implementation. Making these commitments to achieve high performance for a single architecture leads to a loss of performance portability. Data structures that include redundant data but enable coalesced memory accesses are faster on many-core architectures, whereas redundancy-free data structures that are accessed indirectly are faster on multi-core architectures. The Addto algorithm for global assembly is optimal on multi-core architectures, whereas the Local Matrix Approach is optimal on many-core architectures despite requiring more computation than the Addto algorithm. These results demonstrate the value in making the correct choice of algorithm and data structure when implementing finite element methods, spectral element methods and low-order discontinuous Galerkin methods on modern high-performance architectures.
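A minimal sketch of the Addto global-assembly pattern discussed above (a dict-of-keys sparse matrix is an illustrative stand-in for a real sparse format): each element's local matrix is scattered-and-added into the global matrix, whereas the Local Matrix Approach keeps the local matrices and applies them during the matrix-vector product:

```python
def addto_assemble(elements, local_matrices):
    # elements: per-element lists of global node numbers
    # local_matrices: per-element dense local matrices Ke
    A = {}  # (row, col) -> value
    for elem, Ke in zip(elements, local_matrices):
        for i, gi in enumerate(elem):
            for j, gj in enumerate(elem):
                # the "addto": contributions from elements sharing a node
                # accumulate into the same global entry
                A[(gi, gj)] = A.get((gi, gj), 0.0) + Ke[i][j]
    return A
```

On many-core hardware these concurrent accumulations into shared entries are exactly what makes Addto less attractive than the redundant but coalesced Local Matrix Approach.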
OP2 is an "active" library framework for the development and solution of unstructured mesh-based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. In this work, an industrial representative CFD application written using the OP2 framework is benchmarked during the solution of an unstructured mesh of 1.5M and 26M edges. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. Performance modeling is then used to predict the application’s performance on an NVIDIA Tesla C2070-based GPU cluster, enabling the comparison of OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of application.
Anisotropic mesh smoothing is used to generate optimised meshes for Computational Fluid Dynamics (CFD). Adapting the size and shape of elements in an unstructured mesh to a specification encoded in a metric tensor field is done by relocating mesh vertices. This computationally intensive task can be accelerated by engaging NVIDIA's CUDA-enabled GPUs. This article describes the algorithmic background, the design choices and the implementation details that led to a mesh-smoothing application running in double precision on a Tesla C2050 board. Engaging CUDA's texturing hardware to manipulate the metric tensor field accelerates execution by up to 6.2 times, leading to a total speedup of up to 148 times over the serial CPU code and up to 15 times over the 12-threaded OpenMP code.
We present an effective technique for cross-checking an IEEE 754 floating-point program and its SIMD-vectorized version, implemented in KLEE-FP, an extension to the KLEE symbolic execution tool that supports symbolic reasoning on the equivalence between floating-point values. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. As a result, our technique works by lowering the Intel Streaming SIMD Extension (SSE) instruction set to primitive integer and floating-point operations, and then using an algorithm based on symbolic expression matching augmented with canonicalization rules. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. We applied KLEE-FP to OpenCV, a popular open source computer vision library. KLEE-FP was able to successfully cross-check 51 SIMD/SSE implementations against their corresponding scalar versions, proving the bounded equivalence of 41 of them (i.e., on images up to a certain size), and finding inconsistencies in the other 10.
This paper presents a benchmarking, performance analysis and optimization study of the OP2 "active" library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to execute on different multi-core/many-core hardware. Runtime performance results are presented for a representative unstructured mesh application on a variety of many-core processor systems, including traditional X86 architectures from Intel (Xeon based on the older Penryn and current Nehalem micro-architectures) and GPU offerings from NVIDIA (GTX260, Tesla C2050). Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution of an industrial-sized unstructured mesh consisting of about 1.5 million edges. Results show the significance of choosing the correct partition and thread-block configuration, the factors limiting the GPU performance and insights into optimizations for improved performance.
There is a growing interest in high-order finite and spectral/hp element methods using continuous and discontinuous Galerkin formulations. In this paper we investigate the effect of h- and p-type refinement on the relationship between runtime performance and solution accuracy. The broad spectrum of possible domain discretisations makes establishing a performance-optimal selection non-trivial. Through comparing the runtime of different implementations for evaluating operators over the space of discretisations with a desired solution tolerance, we demonstrate how the optimal discretisation and operator implementation may be selected for a specified problem. Furthermore, this demonstrates the need for codes to support both low- and high-order discretisations.
A spectral/hp element discretisation permits both geometric flexibility and beneficial convergence properties to be attained simultaneously. The choice of elemental polynomial order has a profound effect on the efficiency of different implementation strategies with their performance varying substantially for low and high order spectral/hp discretisations. We examine how careful selection of the strategy minimises computational cost across a range of polynomial orders in three dimensions and compare how different operators, and the choice of element shape, lead to different break-even points between the implementations. In three dimensions, higher expansion orders quickly lead to a large increase in the number of element-interior modes, particularly in hexahedral elements. For a typical boundary/interior modal decomposition, this can rapidly lead to poor performance from a global approach, while a sum-factorisation technique, exploiting the tensor-product structure of elemental expansions, leads to better performance. Furthermore, increased memory requirements may cause an implementation to show poor runtime performance on a given system, even if the strict operation count is minimal, due to detrimental caching effects and other machine-dependent factors.
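An illustrative comparison of the two evaluation strategies (assumed notation: P basis functions, Q quadrature points per direction, B[q][p] the value of 1D basis function p at point q). The naive 2D evaluation costs O(Q²P²) operations, while sum-factorisation exploits the tensor-product structure to perform two 1D contractions at O(QP(P+Q)):

```python
def eval_naive(coef, B):
    # direct evaluation: sum over both basis indices at every point
    P, Q = len(coef), len(B)
    return [[sum(coef[i][j] * B[q1][i] * B[q2][j]
                 for i in range(P) for j in range(P))
             for q2 in range(Q)] for q1 in range(Q)]

def eval_sumfact(coef, B):
    # sum-factorisation: contract over j first, then over i
    P, Q = len(coef), len(B)
    tmp = [[sum(coef[i][j] * B[q2][j] for j in range(P))
            for q2 in range(Q)] for i in range(P)]
    return [[sum(tmp[i][q2] * B[q1][i] for i in range(P))
             for q2 in range(Q)] for q1 in range(Q)]
```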
We argue that producing maintainable high-performance implementations of finite element methods for multiple targets requires that they are written using a high-level domain-specific language. We make the case for using one such language, the Unified Form Language (UFL), by discussing how it allows the generation of high-performance code from maintainable sources. We support this case by showing that optimal implementations of a finite element solver written for a Graphics Processing Unit and a multicore CPU require the use of different algorithms and data formats that are embodied by the UFL representation. Finally we describe a prototype compiler that generates low-level code from high-level specifications, and outline how the high-level UFL representation can be lowered to facilitate optimisation using existing techniques prior to code generation.
The dynamic topological order problem is that of efficiently updating a topological order after some edge(s) are inserted into a graph. Much prior work exists on the unit-change version of this problem, where the order is updated after every single insertion. No previous (non-trivial) algorithms are known for the batch version of the problem, where the order is updated after every batch of insertions. We present the first such algorithm. This requires O(min{k·(v+e), v·e}) time to process any sequence of k insertion batches. This is achieved by only recomputing those region(s) of the order affected by the inserted edges. In many cases, our algorithm will only traverse small portions of the graph when processing a batch. We empirically evaluate our algorithm against previous algorithms for this problem, and find that it performs well when the batch size is sufficiently large.
We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.
We describe the evaluation of several implementations of a simple image processing filter on an NVIDIA GTX 280 card. Our experimental results show that performance depends significantly on low-level details such as data layout and iteration space mapping which complicate code development and maintenance. We propose extending a CUDA or OpenCL like model with decoupled Access/Execute (AEcute) metadata, describing its iteration space ordering and partitioning (execute metadata) and its memory access patterns (access metadata).
SIMT (Single-Instruction Multiple-Thread) is an emerging programming paradigm for high-performance computational accelerators, pioneered in current and next generation GPUs and hybrid CPUs. We present a domain-specific active library supported approach to SIMT code generation and optimisation in the field of visual effects. Our approach uses high-level metadata and runtime context to guide and to ensure the correctness of optimisation-driven code transformations and to implement runtime-context-sensitive optimisations. Our advanced optimisations require no analysis of the original C++ kernel code and deliver 1.3× to 6.6× speedups over syntax-directed translation on GeForce 8800 GTX and GTX 260 GPUs with two commercial visual effects.
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both memory access pattern and execution schedule of a computation kernel, the compiler or runtime system can derive efficient data movement even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled access/execute specifications, allowing for automatic communication optimisations such as software pipelining and double buffering. We have used these classes to implement a set of benchmarks, which exhibit data reuse and non-affine access functions. We demonstrate the ease and efficiency of programming the Cell BE architecture using our techniques by comparing against alternative benchmark implementations, which use hand-written DMA transfers and software-based caching.
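The separation at the heart of the access/execute approach can be sketched in a few lines. This is a hypothetical, simplified API, not the C++ framework from the paper: the kernel never touches memory directly, so a runtime given the access metadata could equally well gather operands by DMA, double-buffer them, or software-pipeline the transfers.

```python
def execute(kernel, access, iterations, memory):
    """Run `kernel` over `iterations`, with all data movement performed
    by the runtime: `access(i)` is the access metadata, naming the
    locations iteration i reads, and `iterations` is the execute
    metadata. The gather below stands in for a DMA transfer."""
    results = []
    for i in iterations:
        operands = [memory[a] for a in access(i)]  # runtime-driven gather
        results.append(kernel(*operands))
    return results
```

For example, a 3-point stencil declares its pattern as `access = lambda i: (i - 1, i, i + 1)`; because the pattern is explicit, no analysis of the kernel body is needed to know which pages to stage.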
This paper explores the use of dependence metadata for optimising composition in component-based parallel programs. The idea is for each component to carry additional information about how points in its iteration space map to memory locations associated with its input and output data structures. When two components are composed this information can be used to implement optimisations that would otherwise require expensive analysis of the components' code at the time of composition. This dependence metadata facilitates a number of cross-component optimisations -- in this paper we focus on loop fusion and array contraction. We describe a prototype framework, based on the CLooG loop generator tool, that embodies these ideas and report experimental performance results for three non-trivial parallel benchmarks. Our results show execution time reductions of up to 50% using the proposed framework on an eight-core Intel Xeon system.
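The kind of reasoning the dependence metadata enables can be illustrated with a toy legality check for loop fusion. This is an illustrative sketch, not the framework's actual test (which is built on the polyhedral machinery of CLooG): each component's metadata is modelled as a function from iteration number to the memory locations it writes or reads.

```python
def can_fuse(producer_writes, consumer_reads, n):
    """Fusion of two n-iteration loops is legal, in this simplified
    model, if at every iteration i the consumer reads only locations
    the producer has already written by iteration i. Both arguments
    are metadata functions: iteration -> list of locations."""
    written = set()
    for i in range(n):
        written |= set(producer_writes(i))
        if not set(consumer_reads(i)) <= written:
            return False  # would read a value not yet produced
    return True
```

A consumer that reads location `i` fuses with a producer that writes location `i`; one that reads `i + 1` does not, since the value it needs is produced a trip later. Crucially, the check uses only metadata, never the components' code.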
Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code. This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction.
Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from ITL (Iterative Template Library), and is compared with MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra) and IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), for larger matrix sizes, performance matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL.
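The delayed-evaluation idea behind DESOLA can be sketched in miniature. This is a hypothetical Python analogue, not DESOLA's C++ implementation: library calls build an expression tree instead of computing, and forcing the result evaluates the whole tree in one fused loop, so intermediate vectors are never materialised (array contraction to scalars).

```python
class Vec:
    """Delayed-evaluation vector: arithmetic builds an expression tree;
    `force` walks it elementwise in a single fused loop."""
    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return Vec(op=lambda a, b: a + b, args=(self, other))

    def __mul__(self, other):
        return Vec(op=lambda a, b: a * b, args=(self, other))

    def at(self, i):
        # Leaf: read stored data. Node: combine children's element i,
        # so each intermediate exists only as a scalar.
        if self.data is not None:
            return self.data[i]
        return self.op(*(a.at(i) for a in self.args))

    def force(self, n):
        return [self.at(i) for i in range(n)]
```

Evaluating `(x + y) * x` here performs one pass over the data rather than two, which is exactly the fusion opportunity the runtime code generator exploits, with the added benefit (in the real system) of compiling the fused loop to native code.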
This is a study of a technique for deriving the session type of a program written in a statically typed imperative language from its control flow. We impose on our unlabelled session type syntax a well-formedness constraint based upon normalisation and explore the effects thereof. We present our inference algorithm declaratively and in a form suitable for implementation, and illustrate it with examples. We then present an implementation of the algorithm using a program analysis and transformation toolkit.
GCC needs a strategy to support future multicore architectures, which will probably include heterogeneous accelerator-like designs with explicit management of scratchpad memories; some have further restrictions, for example SIMD, with limited synchronization capabilities. Some platforms will probably offer hardware support for streaming, transactions and speculation. The purpose of this paper is to give a survey and evaluation of some automatic and manual techniques for improving support for such targets in GCC. We focus on translation of sequential code for such platforms, i.e. the translation to task graphs and their communication and memory access operations. The paper provides an evaluation of the communication library support on a quad-core AMD Phenom 9550 processor. We use these experiments to tune the automatic task partitioning algorithm implemented in GCC. The paper concludes with recommendations for strategic developments of GCC to support a stream programming language, and to improve the automatic generation of streamized tasks.
This paper presents a methodology for generating floating-point arithmetic hardware designs which are, for suitable applications, dramatically reduced in size, while still retaining performance. We use a profiling tool for floating-point value ranges to identify arithmetic operations where the shifting required for alignment and normalisation is almost always small. We synthesise hardware with reduced-size barrel-shifters, but always detect when operands lie outside the range this optimised hardware can handle. These rare out-of-range operations are handled by a separate full floating-point implementation, either on-chip or by returning calculations to the host. Thus the system suffers no compromise in IEEE754 compliance. This paper presents results for two benchmark applications which profiling suggested would be profitable. We demonstrate the potential for this technique to yield an increase in parallel computing power of up to 43%, with a (correctable) error rate of less than 5%.
The subject of this paper is flow- and context-insensitive pointer analysis. We present a novel approach for precisely modelling struct variables and indirect function calls. Our method emphasises efficiency and simplicity and consists of an extension to the existing system of set-constraints. We obtain an O(n^4) bound on the time needed to solve constraint sets from our extended language. This gives, for the first time, some insight into the hardness of performing field-sensitive pointer analysis of C. Furthermore, we experimentally evaluate the time versus precision trade-off for our method by comparing against the field-insensitive equivalent. Our benchmark suite consists of 11 common C programs ranging in size from 15,000 to 200,000 lines of code. Our results indicate the field-sensitive analysis is more expensive to compute, but yields significantly better precision. In addition, our technique has been integrated into the GNU Compiler GCC (version 4.1). Finally, we identify several previously unknown issues with an alternative and less precise approach to modelling struct variables, known as field-based analysis.
Developers need to be able to write code using high-level, reusable black-box components. Also essential is confidence that code can be mapped to an efficient implementation on the available hardware, with robust high performance. In this paper we present a prototype component library being developed to deliver this for industrial visual effects applications. Components are based on abstract algorithmic skeletons that provide metadata characterizing data accesses and dependence constraints. Metadata is combined at run-time to build a polytope representation which supports aggressive inter-component loop fusion. We present results for a wavelet-transform-based degraining filter running on multicore PC hardware, demonstrating 3.4×–5.3× speed-ups, improved parallel efficiency and a 30% reduction in memory consumption without compromising the program structure.
DeepWeaver-1 is a tool supporting cross-cutting program analysis and transformation components, called ``weaves''. Like an aspect, a DeepWeaver weave consists of a query part, and a part which may modify code. DeepWeaver's query language is based on Prolog, and provides access to data-flow and control-flow reachability analyses. DeepWeaver provides a declarative way to access the internal structure of methods, and supports cross-cutting weaves which operate on code blocks from different parts of the codebase simultaneously. DeepWeaver operates at the level of bytecode, but offers predicates to extract structured control flow constructs. This paper motivates the design, and demonstrates some of its power, using a sequence of examples including performance profiling and domain-specific performance optimisations for database access and remote method invocation.
Reconfigurable architectures offer potential for performance enhancement by specializing the implementation of floating-point arithmetic. This paper presents FloatWatch, a dynamic execution profiling tool designed to identify where an application can benefit from reduced precision or reduced range in floating-point computations. FloatWatch operates on x86 binaries, and generates a profile output file recording, for each instruction and line of source code, the overall range of floating-point values, the bucketised sub-ranges of values, and the maximum difference between 64-bit and 32-bit executions.
We present results from the tool on a suite of four benchmark codes. Our tool indicates potential performance loss due to denormal values, and helps to identify opportunities for using dual fixed-point arithmetic representation, which has proved effective for reconfigurable designs. Our results show that applications often have highly modal value distributions, offering promise for aggressive floating-point arithmetic optimisations.
Active libraries can be defined as libraries which play an active part in the compilation (in particular, the optimisation) of their client code. This paper explores the idea of delaying evaluation of expressions built using library calls, then generating code at runtime for the particular compositions that occur. We explore this idea with a dense linear algebra library for C++. The key optimisations in this context are loop fusion and array contraction.
Current and potential users for field-programmable gate arrays (FPGAs) are increasingly looking to high-level languages as a means to widen applicability and cope with ever-increasing transistor counts. We present a transformation system for one such high-level language for electronic circuit design, which goes some way towards bridging the growing gulf between the domain and architecture experts. We demonstrate its effectiveness by realistic, albeit relatively small, case studies; performance improvements of up to 70% have been achieved.
We consider the problem of maintaining the topological order of a directed acyclic graph (DAG) in the presence of edge insertions and deletions. We present a new algorithm and, although this has inferior time complexity compared with the best previously known result, we find that its simplicity leads to better performance in practice. In addition, we provide an empirical comparison against the three main alternatives over a large number of random DAGs. The results show our algorithm is the best for sparse digraphs and only a constant factor slower than the best on dense digraphs.
This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelisable code fragments and replaces them with code for a graphics co-processor. We report on our experience with accelerating an industrial image processing library. To increase the effectiveness of our approach, we exploit some domain-specific knowledge of the library's semantics. We outline the architecture of our translator and how it uses the ROSE source-to-source transformation library to overcome complexities in the C++ language. Techniques for parallel analysis and source transformation are presented in light of their uses in GPU code generation. We conclude with results from a performance evaluation of two examples, image blending and an erosion filter, hand-translated with our parallelisation techniques. We show that our approach has potential and explain some of the remaining challenges in building an effective tool.
Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Traversing a row-major array in column-major order, or vice-versa, leads to poor spatial locality. With large arrays the performance loss can be a factor of 10 or more. This paper explores the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order.
Using a small suite of dense kernels working on two-dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. We show that Morton layout can lead to better performance than the worse of the two canonical layouts; however, the performance of Morton layout compared to the better choice of canonical layout is often disappointing. We further study one simple improvement of the basic Morton scheme: we show that choosing the correct alignment for the base address of an array in Morton layout can sometimes significantly improve the competitiveness of this layout.
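Morton layout works by interleaving the bits of the row and column indices, so that elements nearby in either dimension are usually nearby in memory. A minimal sketch of the index calculation (real implementations avoid the bit loop, typically via per-byte lookup tables, as the lookup-table variant mentioned in this work does):

```python
def morton_index(i, j, bits=16):
    """Z-order (Morton) offset of element (i, j): interleave the bits
    of the two indices, row bits in odd positions, column bits in even.
    Illustrative bit-by-bit version; table-driven code is much faster."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bit -> odd position
        z |= ((j >> b) & 1) << (2 * b)      # column bit -> even position
    return z
```

Incrementing either `i` or `j` by one changes only low-order bits of the offset for three out of every four steps, which is the source of the layout's direction-neutral spatial locality; the cost is the address calculation itself and the alignment sensitivity studied above.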
Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been shown to be a potentially attractive compromise between row-major and column-major for two-dimensional arrays. When combined with appropriate optimizations, Morton layout offers some spatial locality whether traversed row- or column-wise. However, for linear algebra routines with larger problem sizes, the layout shows diminishing returns. It is our hypothesis that associativity conflicts between Morton blocks cause this behavior, and we show that carefully arranging the Morton blocks can minimize this effect. We explore one such arrangement and report our preliminary results.
We describe a technique for performing domain-specific optimisation based on the formation of an execution plan from calls made to a domain-specific library. The idea is to interpose a proxy layer between the application and the library that delays execution of the library code and, in so doing, captures a recipe for the computation required. This creates the opportunity for a "domain-specific interpreter" to analyse the recipe and generate an optimised execution plan. We demonstrate the idea by showing how it can be used to implement coarse grained tiling and parallelisation optimisations in MayaVi, a 44,000-line visualisation application written in Python and VTK, with no change to the MayaVi code base. We present a generic mechanism for interposing a domain-specific interpreter in Python applications, together with experimental results demonstrating the technique's effectiveness in the context of MayaVi. For certain visualisation problems, in particular the rendering of isosurfaces in an unstructured mesh fluid flow simulation, we demonstrate significant speedups from coarse grained tiling, and from both SMP and distributed-memory parallelisation.
This paper investigates whether AspectJ can be used for efficient profiling of Java programs. Profiling differs from other applications of AOP (e.g. tracing), since it necessitates efficient and often complex interactions with the target program. As such, it was uncertain whether AspectJ could achieve this goal. Therefore, we investigate four common profiling problems (heap usage, object lifetime, wasted time and time-spent) and report on how well AspectJ handles them. For each, we provide an efficient implementation, discuss any trade-offs or limitations and present the results of an experimental evaluation into the costs of using it. Our conclusion is that, while there are some minor language issues to be resolved, AspectJ does offer an efficient platform for profiling.
Performance programming is characterized by the need to structure software components to exploit the context of use. Relevant context includes the target processor architecture, the available resources (number of processors, network capacity), prevailing resource contention, the values and shapes of input and intermediate data structures, the schedule and distribution of input data delivery, and the way the results are to be used. This paper concerns adapting to dynamic context: adaptive algorithms, malleable and migrating tasks, and application structures based on dynamic component composition. Adaptive computations use metadata associated with software components --- performance models, dependence information, data size and shape. Computation itself is interwoven with planning and optimizing the computation process, using this metadata. This reflective nature motivates metaprogramming techniques. We present a research agenda aimed at developing a modelling framework which allows us to characterize both computation and dynamic adaptation in a way that allows systematic optimization.
This paper presents a profiling plug-in for IBM's Eclipse development environment. Our approach characterises profiling as an interactive exploration of a large virtual database of information about the execution of a program. We define a high-level, graphical environment for programming a profiling pipeline, which specifies how profiling data is collected, filtered and visualized, and allows the user to write custom Java code that can intercept and manipulate the profiling data passed between these stages. We use Aspect-Oriented Programming (AOP) to program the collection of profiling information, allowing this process to be tailored to particular program contexts and domain-specific program characteristics.
This paper explores the potential for automatic cross-component optimisation in the Python / VTK-based MayaVi modular visualisation environment. The idea is to delay execution of the VTK components called from the MayaVi tool, which requires no significant structural change to the MayaVi code base, but which opens up the possibility for dynamic performance optimisations such as tiling, fusion, memoisation and shared-memory parallelisation. The paper concludes with experimental results on an unstructured mesh hierarchy model from an adaptive three-dimensional gravity current simulation.
This paper presents preliminary results from a project which aims at overcoming barriers to cross-component optimisation in the Python/vtk-based MayaVi modular visualisation environment. We use a remarkably effective delayed-evaluation technique which requires no significant structural change to the MayaVi code base --- but which opens up the possibility for performance optimisations such as tiling, fusion and memoisation, as well as adding functionality by supporting demand-driven execution. The paper concludes with experimental results on an unstructured mesh hierarchy model from an adaptive three-dimensional gravity current simulation.
(Unpublished workshop paper: please email for a copy).
The subject of this paper is flow- and context-insensitive pointer analysis. We present a novel approach for precisely modelling struct variables and indirect function calls. Our method emphasises efficiency and simplicity and consists of an extension to the existing system of set-constraints. Furthermore, we evaluate the trade-off of time versus precision of using our method, versus a less accurate analysis. Our benchmark suite consists of 7 common C programs ranging in size from 5,000 to 150,000 lines of code. Our results indicate the field-sensitive analysis is more expensive to compute, but yields significantly better precision.
We consider how to maintain the topological order of a directed acyclic graph (DAG) in the presence of edge insertions and deletions. We present a new algorithm and, although this has marginally inferior time complexity compared with the best previously known result, we find that its simplicity leads to better performance in practice. In addition, we provide an empirical comparison against three alternatives over a large number of random DAGs. The results show our algorithm is the best for sparse graphs and, surprisingly, that an alternative with poor theoretical complexity performs marginally better on dense graphs.
The TaskGraph Library is a C++ library for dynamic code generation, which combines specialisation with dependence analysis and loop restructuring. A TaskGraph represents a fragment of code which is constructed and manipulated at run-time, then compiled, dynamically linked and executed. TaskGraphs are initialised using macros and overloading, which forms a simplified, C-like sub-language with first-class arrays and no pointer arithmetic. Once a TaskGraph has been constructed, we can analyse its dependence structure and perform optimisations. In this paper, we present the design of the TaskGraph library, and two sample applications to demonstrate its use for runtime code specialisation and restructuring optimisation.
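The pay-off of run-time code construction is specialisation to values known only at run time. A toy Python analogue (not the TaskGraph C++ API) of the idea: the "generator" bakes a run-time constant into freshly built source, which is then compiled and returned as an ordinary function.

```python
def specialised_scale(k):
    """Build, at run time, a scaling function specialised to the
    constant k. In TaskGraph the analogous step constructs a C-like
    AST and invokes the native compiler; here we use Python's own
    compile/exec as a stand-in."""
    src = "def scale(xs):\n    return [x * {} for x in xs]\n".format(k)
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["scale"]
```

Because `k` is a literal in the generated code, the compiler can constant-fold and strength-reduce it; the real library goes further, analysing dependences of the constructed fragment to apply loop restructuring before compilation.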
Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been proposed as a compromise between row-major and column-major for two-dimensional arrays. Morton layout offers some spatial locality whether traversed row-wise or column-wise. The goal of this paper is to make this an attractive compromise, offering close to the performance of row-major traversal of row-major layout, while avoiding the pathological behaviour of column-major traversal. We explore how spatial locality of Morton layout depends on the alignment of the array's base address, and how unrolling has to be aligned to reduce address calculation overhead. We conclude with extensive experimental results using five common processors and a small suite of benchmark kernels.
This paper presents and evaluates a number of techniques to improve the execution time of interprocedural pointer analysis in the context of C programs. The analysis is formulated as a graph of set constraints and solved using a worklist algorithm. Indirections lead to new constraints being added during this procedure. The solution process can be simplified by identifying cycles, and we present a novel online algorithm for doing this. We also present a difference propagation scheme which avoids redundant work by tracking changes to each solution set. The effectiveness of these and other methods is shown in an experimental study over 12 common C programs ranging from 1,000 to 150,000 lines of code.
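The baseline set-constraint formulation can be sketched as a minimal Andersen-style worklist solver. This is an illustrative sketch only, without the paper's cycle-detection and difference-propagation refinements: subset constraints form a graph, and indirect (load/store) constraints add edges as points-to sets grow.

```python
from collections import defaultdict

def andersen(base, simple, load, store):
    """Tiny Andersen-style solver over set constraints.
    base:   (p, x) pairs meaning x in pts(p)        -- p = &x
    simple: (p, q) pairs meaning pts(p) >= pts(q)   -- p = q
    load:   (p, q) pairs meaning pts(p) >= pts(*q)  -- p = *q
    store:  (p, q) pairs meaning pts(*p) >= pts(q)  -- *p = q
    """
    pts = defaultdict(set)
    succ = defaultdict(set)            # subset edges: src -> dst
    for p, x in base:
        pts[p].add(x)
    for p, q in simple:
        succ[q].add(p)
    work = list(pts)
    while work:
        n = work.pop()
        # Indirections inject new subset edges during solving.
        for p, q in load:              # p = *q
            if q == n:
                for t in pts[n]:
                    if p not in succ[t]:
                        succ[t].add(p); work.append(t)
        for p, q in store:             # *p = q
            if p == n:
                for t in pts[n]:
                    if t not in succ[q]:
                        succ[q].add(t); work.append(q)
        # Propagate along subset edges until a fixed point is reached.
        for m in succ[n]:
            if not pts[n] <= pts[m]:
                pts[m] |= pts[n]; work.append(m)
    return pts
```

The refinements in the paper attack the two obvious inefficiencies here: cycles in `succ` (whose members provably share one solution and can be collapsed) and the re-propagation of whole sets where only the difference since the last visit is new.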
This is a substantially-extended and revised version of the SCAM2003 workshop paper below.
This paper presents and evaluates a number of techniques to improve the execution time of interprocedural pointer analysis in the context of large C programs. The analysis is formulated as a graph of set constraints and solved using a worklist algorithm. Indirections lead to new constraints being added during this process.
In this work, we present a new algorithm for online cycle detection, and a difference propagation technique which records changes in a variable's solution. The effectiveness of these and other methods is evaluated experimentally using nine common C programs ranging from 1,000 to 55,000 lines of code.
An extended version of this paper is published in IEE Proceedings - Software.
We have developed a prototype tool that supports instrumentation of distributed Java applications by on-the-fly deployment of interposition code at user-selectable program points. This paper explores the idea, originated in the Paradyn Performance Consultant, of systematically searching for performance bottlenecks by progressive refinement. We present the callgraph search algorithm in detail, and discuss a number of shortcomings with the approach, some of which can be addressed by improving the search strategy. We support our conclusions with two application examples. This is a report of work in progress, aimed at stimulating further investigation of this interesting approach.
Morton layout is a compromise storage layout between the programming language mandated layouts row-major and column-major, providing substantial locality of reference when traversed in either direction. This paper explores the performance of Morton, row-major and column-major layouts in detail on some representative architectures. Using a small suite of dense kernels working on two-dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. Whether Morton layout is better than traversing a column-major array in row-major order (or vice versa) depends on problem size and architecture. Morton layout generally leads to much more consistent performance and only a small improvement in its performance could make it an attractive alternative.
We present an automated run-time optimization framework that can improve the performance of distributed applications written using Java RMI whilst preserving its semantics. Java classes are modified at load-time in order to intercept RMI calls as they occur. RMI calls are not executed immediately, but are delayed for as long as possible. When a dependence forces execution of the delayed calls, the aggregated calls are sent over to the remote server to be executed in one step. This reduces network overhead and the quantity of data sent, since data can be shared between calls. The sequence of calls may be cached on the server side along with any known constants in order to speed up future calls. A remote server may also make RMI calls to another remote server on behalf of the client if necessary. Our results show that the techniques can speed up distributed programs significantly, especially when operating across slower networks. We also discuss some of the challenges involved in maintaining program semantics, and show how the approach can be used for more ambitious optimizations in the future.
A trace of a workload's system calls can be obtained with minimal interference, and can be used to drive repeatable experiments to evaluate system configuration alternatives. Replaying system call traces alone sometimes leads to inaccurate predictions because paging, and access to memory-mapped files, are not modelled.
This paper extends tracing to handle such workloads. At trace capture time, the application's page-level virtual memory access is monitored. The size of the page access trace, and capture overheads, are reduced by excluding recently-accessed pages. This leads to a slight loss of accuracy. Using a suite of memory-intensive applications, we evaluate the capture overhead and measure the predictive accuracy of the approach.
Dynamic instrumentation, meaning modification of an application's instructions at run-time in order to monitor its behaviour, is a very powerful foundation for a wide range of program manipulation tools. This paper concerns the problem of implementing dynamic instrumentation for a managed run-time environment such as a Java Virtual Machine (JVM). We present a flexible new approach based on a ``virtual'' JVM, which runs above a standard JVM but intercepts application control flow in order to allow it to be modified at run-time. Our Veneer Virtual JVM works by fragmenting each method's bytecode at specified points (such as basic blocks). The fragmentation process can include static analysis passes which associate dependence and liveness metadata with each block in order to facilitate run-time optimization. We conclude with some preliminary performance results, and discuss further applications of the tool.
We argue that delayed-evaluation scientific software components, which dynamically change their behaviour according to their calling context at runtime offer a possible way of bridging the apparent conflict between the quality of scientific software and its performance. Rather than equipping scientific software components with a \emph{performance interface} which allows the caller to supply the context information that is lost when building abstract software components, we propose to recapture this lost context information at runtime. This paper is accompanied by a public release of a parallel linear algebra library with both C and C++ language interfaces which implements this proposal. We demonstrate the usability of this library by showing that it can be used to supply linear algebra component functionality to an existing external software package. We give preliminary performance figures and discuss avenues for future work.
Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Sophisticated programmers, or occasionally sophisticated compilers, match the loop structure to the language's storage layout in order to maximise spatial locality. Unsophisticated programmers do not, and the performance loss is often dramatic --- up to a factor of 20. With knowledge of how the array will be used, it is often possible to choose between the two layouts in order to maximise spatial locality. In this paper we study the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order. We present results from a suite of simple application kernels which show that, on the AMD Athlon and Pentium III, for arrays larger than 256 x 256, Morton array layout, even implemented with a lookup table with no compiler support, is always within 61% of both row-major and column-major --- and is sometimes faster.
In this paper we study the use of idle cycles in a network of desktop workstations under unfavourable conditions: we aim to use idle cycles to improve the responsiveness of interactive applications through parallelism. Unlike much prior work in the area, our focus is on response time, not throughput, and short jobs - of the order of a few seconds. We therefore assume a high level of primary activity by the desktop workstations' users, and aim to keep interference with their work within reasonable limits. We present a fault-tolerant, low-administration service for identifying idle machines, which can usually assign a group of processors to a task in less than 200ms. Unusually, the system has no job queue: each job is started immediately with the resources which are predicted to be available. Using trace-driven simulation we study allocation policy for a stream of parallel jobs. Results show that even under heavy load it is possible to accommodate multiple concurrent guest jobs and obtain good speedup with very small disruption of host applications.
CFL (Communication Fusion Library) is an experimental C++ library which supports shared reduction variables in MPI programs. It uses overloading to distinguish private variables from replicated, shared variables, and automatically introduces MPI communication to keep replicated data consistent. This paper concerns a simple but surprisingly effective technique which improves performance substantially: CFL operators are executed lazily in order to expose opportunities for run-time, context-dependent optimization such as message aggregation and operator fusion. We evaluate the idea using both toy benchmarks and a 'production' code for simulating plankton population dynamics in the upper ocean. The results demonstrate the software engineering benefits that accrue from the use of the library and show that performance close to that of manually optimized code can be achieved automatically in many cases.
Domain-specific performance optimisations (DSOs) can prove extremely profitable. My group at Imperial has worked on six or seven different DSO projects, mostly in computational science applications. This talk aims to reflect on our experiences. One aspect, of course, is whether we have a stand-alone domain-specific language (DSL), a DSL embedded in a general host language, or an “active library” whose implementation delivers DSOs, perhaps across sequences of calls. A key question, though, is just what enables us to deliver a DSO. Is it some special semantic property deriving from the domain? Is it because the DSL abstracts from implementation details – enabling the compiler to make choices that would be committed in lower-level code? Is it that the DSL captures large-scale dataflows that are obscured when coded in a conventional general-purpose language? Is it simply that we know that particular optimisations are good for a particular context? The talk will explore this question with reference to our DSO projects in finite-element methods, unstructured meshes, linear algebra and Fourier interpolation. This is joint work with many collaborators.
What is the right code to generate, for a given hardware platform? How does this change as problem parameters change? This talk presents some recent work-in-progress exploring domain-specific languages and active libraries as a way to automate code generation for multicore and manycore platforms, and to capture the space of alternative implementation choices at a higher level than a compiler for a general-purpose language can. Paul Kelly will illustrate the potential for this idea by looking at our recent work on unstructured-mesh fluid dynamics, and finite element methods. By choosing the abstraction carefully, we can capture design choices far beyond what a conventional compiler can do - and, in the extreme, engage the users in selecting algorithms and numerical methods that match the capabilities of the underlying hardware to meet end-to-end objectives for solution quality.
Programming seems to have a fundamentally serial nature: planning the steps a computer will take. The reality of computing machines is not like that at all: computation and data are distributed in space, and multiple activities can take place in complex, overlapped and interacting ways. Engineering software to do this is really hard, because the best design choices involve partitioning data and scheduling work at multiple scales. Furthermore, partitioning and scheduling often cut across the logical structures that we rely on for abstraction and reuse of software components. This talk will trace this problem back to the earliest days of computing, then show some of our most recent work that offers the prospect of overcoming these challenges. The key idea is to build software tools that support, and promote, programming at a conceptual level that exposes performance optimisations without committing to details prematurely. That is: build software libraries that capture the computational structure. Then, by using knowledge and experience of particular application domains, generate efficient parallel code from it.
Performance programming is characterised by the need to structure
software components to exploit the context of use. Relevant context
includes the target processor architecture, the available resources
(parallelism, network, contention), the values and shapes of input and
intermediate data structures, the schedule and distribution of input
data delivery, and the way the results are to be used. This talk
focuses on adapting to dynamic context: adaptive algorithms, malleable
and migrating tasks, and application structures based on dynamic
component composition. Adaptive computations use metadata --
performance models, dependence information, data size and shape.
Computation itself is interwoven with planning and optimising the
computation process, using this metadata. This reflective nature
motivates both multi-stage and metaprogramming techniques. This talk
presents a research agenda aimed at developing a modelling framework
which allows us to characterise both computation and dynamic
adaptation in a way that allows systematic optimisation.
The TaskGraph Library is a C++ library for dynamic code generation,
which combines specialisation with dependence analysis and loop
restructuring. A TaskGraph represents a fragment of code which is
constructed and manipulated at run-time, then compiled, dynamically
linked and executed. TaskGraphs are initialised using macros and
overloading, which forms a simplified, C-like sub-language with
first-class arrays and no pointer arithmetic. Once a TaskGraph has
been constructed, we can analyse dependence structure and perform
optimisations. In this talk, we present the design of the
TaskGraph library, describe several sample applications that
demonstrate its use for runtime code specialisation and
optimisation, and discuss related work and future applications
and developments of the library.
Invited talk CMPP2004
(Constructive Methods in Parallel Programming), July 2004.
Multi-stage programming languages internalise the idea of generating
code at runtime, then executing it as part of your program. This talk
is about our library for doing this in C++, and how it might be useful
in scientific applications. Specialisation is certainly powerful -
generating code based on runtime knowledge of data values. We also
support meta-programming - having built some code, you can
programmatically mess with it. Our "TaskGraph" library supports a
suite of loop restructuring transformations, including loop fusion,
interchange, tiling and skewing (this comes for free as we build on
the Stanford SUIF framework). We present a couple of motivating
examples, which develop the idea of "domain-specific optimisation
components" - a library of metaprogram code that encodes expertise
about particular loops, applications or tuning strategies.
Slides from presentation at Dagstuhl
Workshop 03131 on Domain-Specific Program Generation (March 2003)
Superseded by the similar, updated talk above.
Slides from presentation at Dagstuhl
workshop on Performance Analysis and Distributed Computing (August 2002)
Invited talk at the Workshop on Parallel Functional Programming, UCL, September 1998
Slides from presentation at AADEBUG'97.