(See end of page for slides from various talks, including my inaugural lecture (May 13 2009).)
We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Optimizing their performance is challenging: even though affine loop nests are generally present, the short trip counts and the complexity of the mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Finally, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvements are achieved. The generality of the approach and the applicability of the proposed code transformations to other domains are also discussed.
The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
We present a symbolic execution-based technique for cross-checking programs accelerated using SIMD or OpenCL against an unaccelerated version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a tool based on the symbolic execution engine KLEE that supports symbolic reasoning on the equivalence between expressions involving both integer and floating-point operations. While the current generation of constraint solvers provide effective support for integer arithmetic, the situation is different for floating-point arithmetic, due to the complexity inherent in such computations. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. This allows us to use an algorithm based on symbolic expression matching augmented with canonicalisation rules to determine path equivalence. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. To support the Intel Streaming SIMD Extension (SSE) instruction set, we lower SSE instructions to equivalent generic vector operations, which in turn are interpreted in terms of primitive integer and floating-point operations. To support OpenCL programs, we symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. We detect data races by keeping track of all memory accesses using a memory log, and reporting a race whenever we detect that two accesses conflict. By representing the memory log symbolically, we are also able to detect races associated with symbolically-indexed accesses of memory objects. 
We used KLEE-CL to prove the bounded equivalence between scalar and data-parallel versions of floating-point programs and find a number of issues in a variety of open source projects that use SSE and OpenCL, including mismatches between implementations, memory errors, race conditions and a compiler bug.
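A hypothetical sketch of the key insight (illustrative tuples, not KLEE-CL's actual data structures): expression trees are matched structurally after canonicalisation rules that are sound for IEEE 754 floating point, such as commutativity of addition and multiplication, while reassociation is deliberately not applied because it can change results:

```python
# Expression trees are tuples (op, arg, ...); leaves are symbolic names.
def canon(e):
    if isinstance(e, tuple):
        op, *args = e
        args = [canon(a) for a in args]
        if op in ("fadd", "fmul"):  # commutative, but *not* associative, in FP
            args.sort(key=repr)     # fixed ordering of operands
        return (op, *args)
    return e

def equivalent(e1, e2):
    # two FP expressions are deemed equal only if they are essentially
    # built by the same operations, modulo the canonicalisation above
    return canon(e1) == canon(e2)
```

For example, `x + y` and `y + x` match, but `(x + y) + z` and `x + (y + z)` do not, reflecting that reassociated floating-point sums can differ.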
Data-dependent GPU kernels, whose data or control flow are dependent on the input of the program, are difficult to verify because they require reasoning about shared state manipulated by many parallel threads. Existing verification techniques for GPU kernels achieve soundness and scalability by using a two-thread reduction and making the contents of the shared state nondeterministic each time threads synchronise at a barrier, to account for all possible thread interactions. This coarse abstraction prohibits verification of data-dependent kernels. We present barrier invariants, a novel abstraction technique which allows key properties about the shared state of a kernel to be preserved across barriers during formal reasoning. We have integrated barrier invariants with the GPUVerify tool, and present a detailed case study showing how they can be used to verify three prefix sum algorithms, allowing efficient modular verification of a stream compaction kernel, a key building block for GPU programming. This analysis goes significantly beyond what is possible using existing verification techniques for GPU kernels.
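To see why prefix sums are the building block of stream compaction, here is the sequential reference semantics of the pattern (the GPU kernels parallelise the scan, which is precisely the data-dependent shared state that barrier invariants reason about):

```python
def exclusive_scan(xs):
    # exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def compact(values, keep):
    # stream compaction: keep[i] flags the survivors; the exclusive scan
    # of the flags gives each survivor its position in the output
    flags = [1 if k else 0 for k in keep]
    pos = exclusive_scan(flags)
    out = [None] * sum(flags)
    for i, f in enumerate(flags):
        if f:
            out[pos[i]] = values[i]
    return out
```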
OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
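A hypothetical sketch of the serial semantics of an OP2-style parallel loop (illustrative names, not the actual OP2 API): the user writes a per-edge kernel, and the framework supplies the iteration over the edge set and the indirection through the edge-to-node mapping. Real OP2 generates parallel implementations and uses colouring to make the indirect increments race-free:

```python
def par_loop(kernel, edges, edge_to_node, node_dat, edge_dat):
    # reference (serial) semantics of an OP2-loop over an edge set
    for e in edges:
        n0, n1 = edge_to_node[e]
        kernel(edge_dat[e], node_dat, n0, n1)

def scatter_half(w, nodes, a, b):
    # indirect increment: each edge contributes half its datum to each endpoint;
    # these writes are the data dependencies the framework must handle
    nodes[a] += w / 2
    nodes[b] += w / 2
```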
Partitioning a parallel computation into finitely sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space where each partition is subject to a fine-grained distribution of resources that has a direct yet hard to estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled codes for affine loop programs on GPUs which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size and a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. Our experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
We present the major advantages of a new 'object oriented' 3D SLAM paradigm, which takes full advantage, in the loop, of prior knowledge that many scenes consist of repeated, domain-specific objects and structures. As a hand-held depth camera browses a cluttered scene, real-time 3D object recognition and tracking provides 6DoF camera-object constraints which feed into an explicit graph of objects, continually refined by efficient pose-graph optimisation. This offers the descriptive and predictive power of SLAM systems which perform dense surface reconstruction, but with a huge representation compression. The object graph enables predictions for accurate ICP-based camera-to-model tracking at each live frame, and efficient active search for new objects in currently undescribed image regions. We demonstrate real-time incremental SLAM in large, cluttered environments, including loop closure, relocalisation and the detection of moved objects, and of course the generation of an object-level scene description with the potential to enable interaction.
This paper demonstrates a methodology to help practitioners maximise the utility of complex multidisciplinary engineering models implemented as spreadsheets, an area presenting unique challenges. As motivation we investigate the expanding use of Integrated Resource Management (IRM) models which assess the sustainability of urban masterplan designs. IRM models reflect the inherent complexity of multidisciplinary sustainability analysis by integrating models from many disciplines. This complexity makes their use time-consuming and reduces their adoption.
We present a methodology and toolkit for analysing multidisciplinary engineering models implemented as spreadsheets to alleviate such problems and increase their adoption. For a given output a relevant slice of the model is extracted, visualised and analysed by computing model and interdisciplinary metrics. A sensitivity analysis of the extracted model supports engineers in their optimisation efforts. These methods expose, manage and reduce model complexity and risk whilst giving practitioners insight into multidisciplinary model composition. We report application of the methodology to several generations of an industrial IRM model and detail the insight generated, particularly considering model evolution.
There is a significant, established code base in the computational science community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering dictated by data dependencies which, as part of the abstraction, are exposed explicitly to avoid interprocedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or runtime system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, Chombo, used in full applications with structured grids, and a domain-specific library, OP2, used in full applications with unstructured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer “time” loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism.
One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
This paper introduces a method to combine the advantages of both task parallelism and fine-grained co-design specialisation to achieve faster execution times than either method alone on distributed heterogeneous architectures. The method uses a novel mixed integer linear programming formalisation to assign code sections from parallel tasks to share computational components with the optimal trade-off between acceleration from component specialism and serialisation delay. The paper provides results for software benchmarks partitioned using the method and formal implementations of previous alternatives to demonstrate both the practical tractability of the linear programming approach and the increase in program acceleration potential deliverable.
Automated code generators for finite element local assembly have facilitated the exploration of a number of different implementation strategies within the generated code. However, even with respect to a theoretical performance indicator such as operation count, there is not currently a known optimal strategy for performing local assembly. We explore a code generation strategy based on symbolic integration and polynomial factorisation designed to expose optimisation opportunities. We present our implementation of a finite element assembly code generator based on these techniques. We compare the operation count and performance of our generated code against quadrature and tensor contraction based implementations generated by the FEniCS Form Compiler (FFC) under multiple compilers. Although the approach of using symbolic integration and factorisation for optimising finite element assembly is not new, we distinguish our work through certain design choices and our strategy for optimising common sub-expressions. Under one compiler, we show reductions in the operation count of the compiled code across a number of variational forms of up to 4 times. We also explore the impact on numerical accuracy of some of the FFC tensor contraction optimisations.
This paper introduces a novel execution paradigm called the Write-Only Architecture (WOA) that reduces communication latency overheads by up to a factor of five over previous methods. The WOA writes data through distributed control flow logic rather than using a read–write paradigm or a centralised message hub which allows tasks to be partitioned at a fine-grained level without suffering from excessive communication overheads on distributed systems. In this paper we provide formal assignment results for software benchmarks partitioned using the WOA and previous execution paradigms for distributed heterogeneous architectures along with bounds and complexity information to demonstrate the robust performance improvements possible with the WOA.
Mesh smoothing is an important algorithm for the improvement of element quality in unstructured mesh finite element methods. A new optimisation based mesh smoothing algorithm is presented for anisotropic mesh adaptivity. It is shown that this smoothing kernel is very effective at raising the minimum local quality of the mesh. A number of strategies are employed to reduce the algorithm's cost while maintaining its effectiveness in improving overall mesh quality. The method is parallelised using hybrid OpenMP/MPI programming methods, and graph colouring to identify independent sets. Different approaches are explored to achieve good scaling performance within a shared memory compute node.
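As a minimal sketch of the colouring step (greedy first-fit colouring; the paper's colouring algorithm may differ): vertices that share a colour form an independent set, so no two of them are mesh neighbours and they can be smoothed concurrently without conflicting updates:

```python
def greedy_colouring(adj):
    # assign each vertex the smallest colour unused by its neighbours
    colour = {}
    for v in sorted(adj):
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

def independent_sets(adj):
    # group vertices by colour: each group can be processed in parallel
    sets = {}
    for v, c in greedy_colouring(adj).items():
        sets.setdefault(c, []).append(v)
    return list(sets.values())
```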
Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation.
In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be applied at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
OP2 is an ``active'' library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2’s recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs.
We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. We first present the design of a new OP2 back-end that enables the execution of applications on distributed memory clusters, and benchmark its performance during the solution of a 1.5M and 26M edge-based CFD application written using OP2. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. We then apply performance modeling to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling us to compare OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.
Power gating reduces static power by either disabling whole units or dynamically resizing units to meet application demands. The Loop-Directed Mothballing (LDM) technique power gates execution units by recording the utilization of individual units within loops, and by power gating units according to two utilization thresholds. LDM offers on average 10.3 percent total power savings with low performance loss.
Publisher's site (doi: 10.1109/MM.2011.92)
We present an effective technique for cross-checking a C or C++ program against an accelerated OpenCL version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE and KLEE-FP that supports symbolic reasoning on the equivalence between symbolic values.
Our approach is to symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. Using this model we are able to run OpenCL programs symbolically, keeping track of memory accesses for the purpose of race detection. We then compare the symbolic result against the plain C or C++ implementation in order to detect mismatches between the two versions.
We applied KLEE-CL to the Parboil benchmark suite, the Bullet physics library and the OP2 library, in which we were able to find a total of seven errors: two mismatches between the OpenCL and C implementations, three memory errors, one OpenCL compiler bug and one race condition.
We demonstrate that radically differing implementations of finite element methods are needed on multi-core (CPU) and many-core (GPU) architectures, if their respective performance potential is to be realised. Our numerical investigations using a finite element advection-diffusion solver show that increased performance on each architecture can only be achieved by committing to specific and diverse algorithmic choices that cut across the high-level structure of the implementation. Making these commitments to achieve high performance for a single architecture leads to a loss of performance portability. Data structures that include redundant data but enable coalesced memory accesses are faster on many-core architectures, whereas redundancy-free data structures that are accessed indirectly are faster on multi-core architectures. The Addto algorithm for global assembly is optimal on multi-core architectures, whereas the Local Matrix Approach is optimal on many-core architectures despite requiring more computation than the Addto algorithm. These results demonstrate the value in making the correct choice of algorithm and data structure when implementing finite element methods, spectral element methods and low-order discontinuous Galerkin methods on modern high-performance architectures.
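A minimal sketch of the Addto global-assembly pattern discussed above (a dict-of-keys sparse matrix is an illustrative stand-in for a real sparse format): each element's local matrix is scattered-and-added into the global matrix, whereas the Local Matrix Approach keeps the local matrices and applies them during the matrix-vector product:

```python
def addto_assemble(elements, local_matrices):
    # elements: per-element lists of global node numbers
    # local_matrices: per-element dense local matrices Ke
    A = {}  # (row, col) -> value
    for elem, Ke in zip(elements, local_matrices):
        for i, gi in enumerate(elem):
            for j, gj in enumerate(elem):
                # the "addto": contributions from elements sharing a node
                # accumulate into the same global entry
                A[(gi, gj)] = A.get((gi, gj), 0.0) + Ke[i][j]
    return A
```

On many-core hardware these concurrent accumulations into shared entries are exactly what makes Addto less attractive than the redundant but coalesced Local Matrix Approach.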
OP2 is an "active" library framework for the development and solution of unstructured mesh-based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. In this work, an industrial representative CFD application written using the OP2 framework is benchmarked during the solution of an unstructured mesh of 1.5M and 26M edges. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. Performance modeling is then used to predict the application’s performance on an NVIDIA Tesla C2070-based GPU cluster, enabling the comparison of OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of application.
Anisotropic mesh smoothing is used to generate optimised meshes for Computational Fluid Dynamics (CFD). Adapting the size and shape of elements in an unstructured mesh to a specification encoded in a metric tensor field is done by relocating mesh vertices. This computationally intensive task can be accelerated by engaging NVIDIA's CUDA-enabled GPUs. This article describes the algorithmic background, the design choices and the implementation details that led to a mesh-smoothing application running in double precision on a Tesla C2050 board. Engaging CUDA's texturing hardware to manipulate the metric tensor field accelerates execution by up to 6.2 times, leading to a total speedup of up to 148 times over the serial CPU code and up to 15 times over the 12-threaded OpenMP code.
We present an effective technique for cross-checking an IEEE 754 floating-point program and its SIMD-vectorized version, implemented in KLEE-FP, an extension to the KLEE symbolic execution tool that supports symbolic reasoning on the equivalence between floating-point values. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. As a result, our technique works by lowering the Intel Streaming SIMD Extension (SSE) instruction set to primitive integer and floating-point operations, and then using an algorithm based on symbolic expression matching augmented with canonicalization rules. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. We applied KLEE-FP to OpenCV, a popular open source computer vision library. KLEE-FP was able to successfully cross-check 51 SIMD/SSE implementations against their corresponding scalar versions, proving the bounded equivalence of 41 of them (i.e., on images up to a certain size), and finding inconsistencies in the other 10.
This paper presents a benchmarking, performance analysis and optimization study of the OP2 "active" library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to execute on different multi-core/many-core hardware. Runtime performance results are presented for a representative unstructured mesh application on a variety of many-core processor systems, including traditional X86 architectures from Intel (Xeon based on the older Penryn and current Nehalem micro-architectures) and GPU offerings from NVIDIA (GTX260, Tesla C2050). Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution of an industrial-sized unstructured mesh consisting of about 1.5 million edges. Results show the significance of choosing the correct partition and thread-block configuration, the factors limiting the GPU performance and insights into optimizations for improved performance.
There is a growing interest in high-order finite and spectral/hp element methods using continuous and discontinuous Galerkin formulations. In this paper we investigate the effect of h- and p-type refinement on the relationship between runtime performance and solution accuracy. The broad spectrum of possible domain discretisations makes establishing a performance-optimal selection non-trivial. Through comparing the runtime of different implementations for evaluating operators over the space of discretisations with a desired solution tolerance, we demonstrate how the optimal discretisation and operator implementation may be selected for a specified problem. Furthermore, this demonstrates the need for codes to support both low- and high-order discretisations.
A spectral/hp element discretisation permits both geometric flexibility and beneficial convergence properties to be attained simultaneously. The choice of elemental polynomial order has a profound effect on the efficiency of different implementation strategies with their performance varying substantially for low and high order spectral/hp discretisations. We examine how careful selection of the strategy minimises computational cost across a range of polynomial orders in three dimensions and compare how different operators, and the choice of element shape, lead to different break-even points between the implementations. In three dimensions, higher expansion orders quickly lead to a large increase in the number of element-interior modes, particularly in hexahedral elements. For a typical boundary/interior modal decomposition, this can rapidly lead to poor performance from a global approach, while a sum-factorisation technique, exploiting the tensor-product structure of elemental expansions, leads to better performance. Furthermore, increased memory requirements may cause an implementation to show poor runtime performance on a given system, even if the strict operation count is minimal, due to detrimental caching effects and other machine-dependent factors.
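An illustrative comparison of the two evaluation strategies (assumed notation: P basis functions, Q quadrature points per direction, B[q][p] the value of 1D basis function p at point q). The naive 2D evaluation costs O(Q²P²) operations, while sum-factorisation exploits the tensor-product structure to perform two 1D contractions at O(QP(P+Q)):

```python
def eval_naive(coef, B):
    # direct evaluation: sum over both basis indices at every point
    P, Q = len(coef), len(B)
    return [[sum(coef[i][j] * B[q1][i] * B[q2][j]
                 for i in range(P) for j in range(P))
             for q2 in range(Q)] for q1 in range(Q)]

def eval_sumfact(coef, B):
    # sum-factorisation: contract over j first, then over i
    P, Q = len(coef), len(B)
    tmp = [[sum(coef[i][j] * B[q2][j] for j in range(P))
            for q2 in range(Q)] for i in range(P)]
    return [[sum(tmp[i][q2] * B[q1][i] for i in range(P))
             for q2 in range(Q)] for q1 in range(Q)]
```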
We argue that producing maintainable high-performance implementations of finite element methods for multiple targets requires that they are written using a high-level domain-specific language. We make the case for using one such language, the Unified Form Language (UFL), by discussing how it allows the generation of high-performance code from maintainable sources. We support this case by showing that optimal implementations of a finite element solver written for a Graphics Processing Unit and a multicore CPU require the use of different algorithms and data formats that are embodied by the UFL representation. Finally we describe a prototype compiler that generates low-level code from high-level specifications, and outline how the high-level UFL representation can be lowered to facilitate optimisation using existing techniques prior to code generation.
The dynamic topological order problem is that of efficiently updating a topological order after some edge(s) are inserted into a graph. Much prior work exists on the unit-change version of this problem, where the order is updated after every single insertion. No previous (non-trivial) algorithms are known for the batch version of the problem, where the order is updated after every batch of insertions. We present the first such algorithm. This requires O(min{k·(v+e), v·e}) time to process any sequence of k insertion batches. This is achieved by only recomputing those region(s) of the order affected by the inserted edges. In many cases, our algorithm will only traverse small portions of the graph when processing a batch. We empirically evaluate our algorithm against previous algorithms for this problem, and find that it performs well when the batch size is sufficiently large.
We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.
We describe the evaluation of several implementations of a simple image processing filter on an NVIDIA GTX 280 card. Our experimental results show that performance depends significantly on low-level details such as data layout and iteration space mapping which complicate code development and maintenance. We propose extending a CUDA or OpenCL like model with decoupled Access/Execute (AEcute) metadata, describing its iteration space ordering and partitioning (execute metadata) and its memory access patterns (access metadata).
SIMT (Single-Instruction Multiple-Thread) is an emerging programming paradigm for high-performance computational accelerators, pioneered in current and next generation GPUs and hybrid CPUs. We present a domain-specific active library supported approach to SIMT code generation and optimisation in the field of visual effects. Our approach uses high-level metadata and runtime context to guide and to ensure the correctness of optimisation-driven code transformations and to implement runtime-context-sensitive optimisations. Our advanced optimisations require no analysis of the original C++ kernel code and deliver 1.3× to 6.6× speedups over syntax-directed translation on GeForce 8800 GTX and GTX 260 GPUs with two commercial visual effects.
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both memory access pattern and execution schedule of a computation kernel, the compiler or runtime system can derive efficient data movement even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled access/execute specifications, allowing for automatic communication optimisations such as software pipelining and double buffering. We have used these classes to implement a set of benchmarks, which exhibit data reuse and non-affine access functions. We demonstrate the ease and efficiency of programming the Cell BE architecture using our techniques by comparing against alternative benchmark implementations, which use hand-written DMA transfers and software-based caching.
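The separation at the heart of the access/execute approach can be sketched in a few lines. This is a hypothetical, simplified API, not the C++ framework from the paper: the kernel never touches memory directly, so a runtime given the access metadata could equally well gather operands by DMA, double-buffer them, or software-pipeline the transfers.

```python
def execute(kernel, access, iterations, memory):
    """Run `kernel` over `iterations`, with all data movement performed
    by the runtime: `access(i)` is the access metadata, naming the
    locations iteration i reads, and `iterations` is the execute
    metadata. The gather below stands in for a DMA transfer."""
    results = []
    for i in iterations:
        operands = [memory[a] for a in access(i)]  # runtime-driven gather
        results.append(kernel(*operands))
    return results
```

For example, a 3-point stencil declares its pattern as `access = lambda i: (i - 1, i, i + 1)`; because the pattern is explicit, no analysis of the kernel body is needed to know which pages to stage.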
This paper explores the use of dependence metadata for optimising composition in component-based parallel programs. The idea is for each component to carry additional information about how points in its iteration space map to memory locations associated with its input and output data structures. When two components are composed this information can be used to implement optimisations that would otherwise require expensive analysis of the components' code at the time of composition. This dependence metadata facilitates a number of cross-component optimisations -- in this paper we focus on loop fusion and array contraction. We describe a prototype framework, based on the CLooG loop generator tool, that embodies these ideas and report experimental performance results for three non-trivial parallel benchmarks. Our results show execution time reductions of up to 50% using the proposed framework on an eight-core Intel Xeon system.
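The kind of reasoning the dependence metadata enables can be illustrated with a toy legality check for loop fusion. This is an illustrative sketch, not the framework's actual test (which is built on the polyhedral machinery of CLooG): each component's metadata is modelled as a function from iteration number to the memory locations it writes or reads.

```python
def can_fuse(producer_writes, consumer_reads, n):
    """Fusion of two n-iteration loops is legal, in this simplified
    model, if at every iteration i the consumer reads only locations
    the producer has already written by iteration i. Both arguments
    are metadata functions: iteration -> list of locations."""
    written = set()
    for i in range(n):
        written |= set(producer_writes(i))
        if not set(consumer_reads(i)) <= written:
            return False  # would read a value not yet produced
    return True
```

A consumer that reads location `i` fuses with a producer that writes location `i`; one that reads `i + 1` does not, since the value it needs is produced a trip later. Crucially, the check uses only metadata, never the components' code.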
Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code. This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction.
Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from ITL (Iterative Template Library), and is compared with MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra) and IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), for larger matrix sizes, performance matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL.
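The delayed-evaluation idea behind DESOLA can be sketched in miniature. This is a hypothetical Python analogue, not DESOLA's C++ implementation: library calls build an expression tree instead of computing, and forcing the result evaluates the whole tree in one fused loop, so intermediate vectors are never materialised (array contraction to scalars).

```python
class Vec:
    """Delayed-evaluation vector: arithmetic builds an expression tree;
    `force` walks it elementwise in a single fused loop."""
    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return Vec(op=lambda a, b: a + b, args=(self, other))

    def __mul__(self, other):
        return Vec(op=lambda a, b: a * b, args=(self, other))

    def at(self, i):
        # Leaf: read stored data. Node: combine children's element i,
        # so each intermediate exists only as a scalar.
        if self.data is not None:
            return self.data[i]
        return self.op(*(a.at(i) for a in self.args))

    def force(self, n):
        return [self.at(i) for i in range(n)]
```

Evaluating `(x + y) * x` here performs one pass over the data rather than two, which is exactly the fusion opportunity the runtime code generator exploits, with the added benefit (in the real system) of compiling the fused loop to native code.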
This is a study of a technique for deriving the session type of a program written in a statically typed imperative language from its control flow. We impose on our unlabelled session type syntax a well-formedness constraint based upon normalisation and explore the effects thereof. We present our inference algorithm declaratively and in a form suitable for implementation, and illustrate it with examples. We then present an implementation of the algorithm using a program analysis and transformation toolkit.
GCC needs a strategy to support future multicore architectures, which will probably include heterogeneous accelerator-like designs with explicit management of scratchpad memories; some have further restrictions, for example SIMD, with limited synchronization capabilities. Some platforms will probably offer hardware support for streaming, transactions and speculation. The purpose of this paper is to give a survey and evaluation of some automatic and manual techniques for improving support for such targets in GCC. We focus on translation of sequential code for such platforms, i.e. the translation to task graphs and their communication and memory access operations. The paper provides an evaluation of the communication library support on a quad-core AMD Phenom 9550 processor. We use these experiments to tune the automatic task partitioning algorithm implemented in GCC. The paper concludes with recommendations for strategic developments of GCC to support a stream programming language, and to improve the automatic generation of streamized tasks.
This paper presents a methodology for generating floating-point arithmetic hardware designs which are, for suitable applications, dramatically reduced in size, while still retaining performance. We use a profiling tool for floating-point value ranges to identify arithmetic operations where the shifting required for alignment and normalisation is almost always small. We synthesise hardware with reduced-size barrel-shifters, but always detect when operands lie outside the range this optimised hardware can handle. These rare out-of-range operations are handled by a separate full floating-point implementation, either on-chip or by returning calculations to the host. Thus the system suffers no compromise in IEEE754 compliance. This paper presents results for two benchmark applications which profiling suggested would be profitable. We demonstrate the potential for this technique to yield an increase in parallel computing power of up to 43%, with a (correctable) error rate of less than 5%.
The subject of this paper is flow- and context-insensitive pointer analysis. We present a novel approach for precisely modelling struct variables and indirect function calls. Our method emphasises efficiency and simplicity and consists of an extension to the existing system of set-constraints. We obtain an O(n^4) bound on the time needed to solve constraint sets from our extended language. This gives, for the first time, some insight into the hardness of performing field-sensitive pointer analysis of C. Furthermore, we experimentally evaluate the time versus precision trade-off for our method by comparing against the field-insensitive equivalent. Our benchmark suite consists of 11 common C programs ranging in size from 15,000 to 200,000 lines of code. Our results indicate the field-sensitive analysis is more expensive to compute, but yields significantly better precision. In addition, our technique has been integrated into the GNU Compiler GCC (version 4.1). Finally, we identify several previously unknown issues with an alternative and less precise approach to modelling struct variables, known as field-based analysis.
Developers need to be able to write code using high-level, reusable black-box components. Also essential is confidence that code can be mapped to an efficient implementation on the available hardware, with robust high performance. In this paper we present a prototype component library being developed to deliver this for industrial visual effects applications. Components are based on abstract algorithmic skeletons that provide metadata characterizing data accesses and dependence constraints. Metadata is combined at run-time to build a polytope representation which supports aggressive inter-component loop fusion. We present results for a wavelet-transform-based degraining filter running on multicore PC hardware, demonstrating 3.4×–5.3× speed-ups, improved parallel efficiency and a 30% reduction in memory consumption without compromising the program structure.
DeepWeaver-1 is a tool supporting cross-cutting program analysis and transformation components, called ``weaves''. Like an aspect, a DeepWeaver weave consists of a query part, and a part which may modify code. DeepWeaver's query language is based on Prolog, and provides access to data-flow and control-flow reachability analyses. DeepWeaver provides a declarative way to access the internal structure of methods, and supports cross-cutting weaves which operate on code blocks from different parts of the codebase simultaneously. DeepWeaver operates at the level of bytecode, but offers predicates to extract structured control flow constructs. This paper motivates the design, and demonstrates some of its power, using a sequence of examples including performance profiling and domain-specific performance optimisations for database access and remote method invocation.
Reconfigurable architectures offer potential for performance enhancement by specializing the implementation of floating-point arithmetic. This paper presents FloatWatch, a dynamic execution profiling tool designed to identify where an application can benefit from reduced precision or reduced range in floating-point computations. FloatWatch operates on x86 binaries, and generates a profile output file recording, for each instruction and line of source code, the overall range of floating-point values, the bucketised sub-ranges of values, and the maximum difference between 64-bit and 32-bit executions.
We present results from the tool on a suite of four benchmark codes. Our tool indicates potential performance loss due to denormal values, and helps to identify opportunities for using dual fixed-point arithmetic representation, which has proved effective for reconfigurable designs. Our results show that applications often have highly modal value distributions, offering promise for aggressive floating-point arithmetic optimisations.
Active libraries can be defined as libraries which play an active part in the compilation (in particular, the optimisation) of their client code. This paper explores the idea of delaying evaluation of expressions built using library calls, then generating code at runtime for the particular compositions that occur. We explore this idea with a dense linear algebra library for C++. The key optimisations in this context are loop fusion and array contraction.
Current and potential users for field-programmable gate arrays (FPGAs) are increasingly looking to high-level languages as a means to widen applicability and cope with ever-increasing transistor counts. We present a transformation system for one such high-level language for electronic circuit design, which goes some way towards bridging the growing gulf between the domain and architecture experts. We demonstrate its effectiveness by realistic, albeit relatively small, case studies; performance improvements of up to 70% have been achieved.
We consider the problem of maintaining the topological order of a directed acyclic graph (DAG) in the presence of edge insertions and deletions. We present a new algorithm and, although this has inferior time complexity compared with the best previously known result, we find that its simplicity leads to better performance in practice. In addition, we provide an empirical comparison against the three main alternatives over a large number of random DAGs. The results show our algorithm is the best for sparse digraphs and only a constant factor slower than the best on dense digraphs.
This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelisable code fragments and replaces them with code for a graphics co-processor. We report on our experience with accelerating an industrial image processing library. To increase the effectiveness of our approach, we exploit some domain-specific knowledge of the library's semantics. We outline the architecture of our translator and how it uses the ROSE source-to-source transformation library to overcome complexities in the C++ language. Techniques for parallel analysis and source transformation are presented in light of their uses in GPU code generation. We conclude with results from a performance evaluation of two examples, image blending and an erosion filter, hand-translated with our parallelisation techniques. We show that our approach has potential and explain some of the remaining challenges in building an effective tool.
Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Traversing a row-major array in column-major order, or vice-versa, leads to poor spatial locality. With large arrays the performance loss can be a factor of 10 or more. This paper explores the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order.
Using a small suite of dense kernels working on two-dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. We show that Morton layout can lead to better performance than the worse of the two canonical layouts; however, the performance of Morton layout compared to the better choice of canonical layout is often disappointing. We further study one simple improvement of the basic Morton scheme: we show that choosing the correct alignment for the base address of an array in Morton layout can sometimes significantly improve the competitiveness of this layout.
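Morton layout works by interleaving the bits of the row and column indices, so that elements nearby in either dimension are usually nearby in memory. A minimal sketch of the index calculation (real implementations avoid the bit loop, typically via per-byte lookup tables, as the lookup-table variant mentioned in this work does):

```python
def morton_index(i, j, bits=16):
    """Z-order (Morton) offset of element (i, j): interleave the bits
    of the two indices, row bits in odd positions, column bits in even.
    Illustrative bit-by-bit version; table-driven code is much faster."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bit -> odd position
        z |= ((j >> b) & 1) << (2 * b)      # column bit -> even position
    return z
```

Incrementing either `i` or `j` by one changes only low-order bits of the offset for three out of every four steps, which is the source of the layout's direction-neutral spatial locality; the cost is the address calculation itself and the alignment sensitivity studied above.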
Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been shown to be a potentially attractive compromise between row-major and column-major for two-dimensional arrays. When combined with appropriate optimizations, Morton layout offers some spatial locality whether traversed row- or column-wise. However, for linear algebra routines with larger problem sizes, the layout shows diminishing returns. It is our hypothesis that associativity conflicts between Morton blocks cause this behavior, and we show that carefully arranging the Morton blocks can minimize this effect. We explore one such arrangement and report our preliminary results.
We describe a technique for performing domain-specific optimisation based on the formation of an execution plan from calls made to a domain-specific library. The idea is to interpose a proxy layer between the application and the library that delays execution of the library code and, in so doing, captures a recipe for the computation required. This creates the opportunity for a "domain-specific interpreter" to analyse the recipe and generate an optimised execution plan. We demonstrate the idea by showing how it can be used to implement coarse grained tiling and parallelisation optimisations in MayaVi, a 44,000-line visualisation application written in Python and VTK, with no change to the MayaVi code base. We present a generic mechanism for interposing a domain-specific interpreter in Python applications, together with experimental results demonstrating the technique's effectiveness in the context of MayaVi. For certain visualisation problems, in particular the rendering of isosurfaces in an unstructured mesh fluid flow simulation, we demonstrate significant speedups from coarse grained tiling, and from both SMP and distributed-memory parallelisation.
This paper investigates whether AspectJ can be used for efficient profiling of Java programs. Profiling differs from other applications of AOP (e.g. tracing), since it necessitates efficient and often complex interactions with the target program. As such, it was uncertain whether AspectJ could achieve this goal. Therefore, we investigate four common profiling problems (heap usage, object lifetime, wasted time and time-spent) and report on how well AspectJ handles them. For each, we provide an efficient implementation, discuss any trade-offs or limitations and present the results of an experimental evaluation into the costs of using it. Our conclusion is that, while there are some minor language issues to be resolved, AspectJ does offer an efficient platform for profiling.
Performance programming is characterized by the need to structure software components to exploit the context of use. Relevant context includes the target processor architecture, the available resources (number of processors, network capacity), prevailing resource contention, the values and shapes of input and intermediate data structures, the schedule and distribution of input data delivery, and the way the results are to be used. This paper concerns adapting to dynamic context: adaptive algorithms, malleable and migrating tasks, and application structures based on dynamic component composition. Adaptive computations use metadata associated with software components --- performance models, dependence information, data size and shape. Computation itself is interwoven with planning and optimizing the computation process, using this metadata. This reflective nature motivates metaprogramming techniques. We present a research agenda aimed at developing a modelling framework which allows us to characterize both computation and dynamic adaptation in a way that allows systematic optimization.
This paper presents a profiling plug-in for IBM's Eclipse development environment. Our approach characterises profiling as an interactive exploration of a large virtual database of information about the execution of a program. We define a high-level, graphical environment for programming a profiling pipeline, which specifies how profiling data is collected, filtered and visualized, and allows the user to write custom Java code that can intercept and manipulate the profiling data passed between these stages. We use Aspect-Oriented Programming (AOP) to program the collection of profiling information, allowing this process to be tailored to particular program contexts and domain-specific program characteristics.
This paper explores the potential for automatic cross-component optimisation in the Python / VTK-based MayaVi modular visualisation environment. The idea is to delay execution of the VTK components called from the MayaVi tool, which requires no significant structural change to the MayaVi code base, but which opens up the possibility for dynamic performance optimisations such as tiling, fusion, memoisation and shared-memory parallelisation. The paper concludes with experimental results on an unstructured mesh hierarchy model from an adaptive three-dimensional gravity current simulation.
This paper presents preliminary results from a project which aims at overcoming barriers to cross-component optimisation in the Python/vtk-based MayaVi modular visualisation environment. We use a remarkably effective delayed-evaluation technique which requires no significant structural change to the MayaVi code base --- but which opens up the possibility for performance optimisations such as tiling, fusion and memoisation, as well as adding functionality by supporting demand-driven execution. The paper concludes with experimental results on an unstructured mesh hierarchy model from an adaptive three-dimensional gravity current simulation.
(Unpublished workshop paper: please email for a copy).
The subject of this paper is flow- and context-insensitive pointer analysis. We present a novel approach for precisely modelling struct variables and indirect function calls. Our method emphasises efficiency and simplicity and consists of an extension to the existing system of set-constraints. Furthermore, we evaluate the trade-off of time versus precision of using our method, versus a less accurate analysis. Our benchmark suite consists of 7 common C programs ranging in size from 5,000 to 150,000 lines of code. Our results indicate the field-sensitive analysis is more expensive to compute, but yields significantly better precision.
We consider how to maintain the topological order of a directed acyclic graph (DAG) in the presence of edge insertions and deletions. We present a new algorithm and, although this has marginally inferior time complexity compared with the best previously known result, we find that its simplicity leads to better performance in practice. In addition, we provide an empirical comparison against three alternatives over a large number of random DAGs. The results show our algorithm is the best for sparse graphs and, surprisingly, that an alternative with poor theoretical complexity performs marginally better on dense graphs.
The TaskGraph Library is a C++ library for dynamic code generation, which combines specialisation with dependence analysis and loop restructuring. A TaskGraph represents a fragment of code which is constructed and manipulated at run-time, then compiled, dynamically linked and executed. TaskGraphs are initialised using macros and overloading, which forms a simplified, C-like sub-language with first-class arrays and no pointer arithmetic. Once a TaskGraph has been constructed, we can analyse its dependence structure and perform optimisations. In this paper, we present the design of the TaskGraph library, and two sample applications to demonstrate its use for runtime code specialisation and restructuring optimisation.
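The pay-off of run-time code construction is specialisation to values known only at run time. A toy Python analogue (not the TaskGraph C++ API) of the idea: the "generator" bakes a run-time constant into freshly built source, which is then compiled and returned as an ordinary function.

```python
def specialised_scale(k):
    """Build, at run time, a scaling function specialised to the
    constant k. In TaskGraph the analogous step constructs a C-like
    AST and invokes the native compiler; here we use Python's own
    compile/exec as a stand-in."""
    src = "def scale(xs):\n    return [x * {} for x in xs]\n".format(k)
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["scale"]
```

Because `k` is a literal in the generated code, the compiler can constant-fold and strength-reduce it; the real library goes further, analysing dependences of the constructed fragment to apply loop restructuring before compilation.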
Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been proposed as a compromise between row-major and column-major for two-dimensional arrays. Morton layout offers some spatial locality whether traversed row-wise or column-wise. The goal of this paper is to make this an attractive compromise, offering close to the performance of row-major traversal of row-major layout, while avoiding the pathological behaviour of column-major traversal. We explore how spatial locality of Morton layout depends on the alignment of the array's base address, and how unrolling has to be aligned to reduce address calculation overhead. We conclude with extensive experimental results using five common processors and a small suite of benchmark kernels.
This paper presents and evaluates a number of techniques to improve the execution time of interprocedural pointer analysis in the context of C programs. The analysis is formulated as a graph of set constraints and solved using a worklist algorithm. Indirections lead to new constraints being added during this procedure. The solution process can be simplified by identifying cycles, and we present a novel online algorithm for doing this. We also present a difference propagation scheme which avoids redundant work by tracking changes to each solution set. The effectiveness of these and other methods is shown in an experimental study over 12 common C programs ranging from 1,000 to 150,000 lines of code.
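The baseline set-constraint formulation can be sketched as a minimal Andersen-style worklist solver. This is an illustrative sketch only, without the paper's cycle-detection and difference-propagation refinements: subset constraints form a graph, and indirect (load/store) constraints add edges as points-to sets grow.

```python
from collections import defaultdict

def andersen(base, simple, load, store):
    """Tiny Andersen-style solver over set constraints.
    base:   (p, x) pairs meaning x in pts(p)        -- p = &x
    simple: (p, q) pairs meaning pts(p) >= pts(q)   -- p = q
    load:   (p, q) pairs meaning pts(p) >= pts(*q)  -- p = *q
    store:  (p, q) pairs meaning pts(*p) >= pts(q)  -- *p = q
    """
    pts = defaultdict(set)
    succ = defaultdict(set)            # subset edges: src -> dst
    for p, x in base:
        pts[p].add(x)
    for p, q in simple:
        succ[q].add(p)
    work = list(pts)
    while work:
        n = work.pop()
        # Indirections inject new subset edges during solving.
        for p, q in load:              # p = *q
            if q == n:
                for t in pts[n]:
                    if p not in succ[t]:
                        succ[t].add(p); work.append(t)
        for p, q in store:             # *p = q
            if p == n:
                for t in pts[n]:
                    if t not in succ[q]:
                        succ[q].add(t); work.append(q)
        # Propagate along subset edges until a fixed point is reached.
        for m in succ[n]:
            if not pts[n] <= pts[m]:
                pts[m] |= pts[n]; work.append(m)
    return pts
```

The refinements in the paper attack the two obvious inefficiencies here: cycles in `succ` (whose members provably share one solution and can be collapsed) and the re-propagation of whole sets where only the difference since the last visit is new.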
This is a substantially-extended and revised version of the SCAM2003 workshop paper below.
This paper presents and evaluates a number of techniques to improve the execution time of interprocedural pointer analysis in the context of large C programs. The analysis is formulated as a graph of set constraints and solved using a worklist algorithm. Indirections lead to new constraints being added during this process.
In this work, we present a new algorithm for online cycle detection, and a difference propagation technique which records changes in a variable's solution. The effectiveness of these and other methods is evaluated experimentally using nine common C programs ranging from 1,000 to 55,000 lines of code.
An extended version of this paper is published in IEE Proceedings - Software.
We have developed a prototype tool that supports instrumentation of distributed Java applications by on-the-fly deployment of interposition code at user-selectable program points. This paper explores the idea, originated in the Paradyn Performance Consultant, of systematically searching for performance bottlenecks by progressive refinement. We present the callgraph search algorithm in detail, and discuss a number of shortcomings with the approach, some of which can be addressed by improving the search strategy. We support our conclusions with two application examples. This is a report of work in progress, aimed at stimulating further investigation of this interesting approach.
Morton layout is a compromise storage layout between the programming language mandated layouts row-major and column-major, providing substantial locality of reference when traversed in either direction. This paper explores the performance of Morton, row-major and column-major layouts in detail on some representative architectures. Using a small suite of dense kernels working on two-dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. Whether Morton layout is better than traversing a column-major array in row-major order (or vice versa) depends on problem size and architecture. Morton layout generally leads to much more consistent performance and only a small improvement in its performance could make it an attractive alternative.
We present an automated run-time optimization framework that can improve the performance of distributed applications written using Java RMI whilst preserving its semantics. Java classes are modified at load-time in order to intercept RMI calls as they occur. RMI calls are not executed immediately, but are delayed for as long as possible. When a dependence forces execution of the delayed calls, the aggregated calls are sent over to the remote server to be executed in one step. This reduces network overhead and the quantity of data sent, since data can be shared between calls. The sequence of calls may be cached on the server side along with any known constants in order to speed up future calls. A remote server may also make RMI calls to another remote server on behalf of the client if necessary. Our results show that the techniques can speed up distributed programs significantly, especially when operating across slower networks. We also discuss some of the challenges involved in maintaining program semantics, and show how the approach can be used for more ambitious optimizations in the future.
A trace of a workload's system calls can be obtained with minimal interference, and can be used to drive repeatable experiments to evaluate system configuration alternatives. Replaying system call traces alone sometimes leads to inaccurate predictions because paging, and access to memory-mapped files, are not modelled.
This paper extends tracing to handle such workloads. At trace capture time, the application's page-level virtual memory access is monitored. The size of the page access trace, and capture overheads, are reduced by excluding recently-accessed pages. This leads to a slight loss of accuracy. Using a suite of memory-intensive applications, we evaluate the capture overhead and measure the predictive accuracy of the approach.
Dynamic instrumentation, meaning modification of an application's instructions at run-time in order to monitor its behaviour, is a very powerful foundation for a wide range of program manipulation tools. This paper concerns the problem of implementing dynamic instrumentation for a managed run-time environment such as a Java Virtual Machine (JVM). We present a flexible new approach based on a ``virtual'' JVM, which runs above a standard JVM but intercepts application control flow in order to allow it to be modified at run-time. Our Veneer Virtual JVM works by fragmenting each method's bytecode at specified points (such as basic blocks). The fragmentation process can include static analysis passes which associate dependence and liveness metadata with each block in order to facilitate run-time optimization. We conclude with some preliminary performance results, and discuss further applications of the tool.
We argue that delayed-evaluation scientific software components, which dynamically change their behaviour according to their calling context at runtime offer a possible way of bridging the apparent conflict between the quality of scientific software and its performance. Rather than equipping scientific software components with a \emph{performance interface} which allows the caller to supply the context information that is lost when building abstract software components, we propose to recapture this lost context information at runtime. This paper is accompanied by a public release of a parallel linear algebra library with both C and C++ language interfaces which implements this proposal. We demonstrate the usability of this library by showing that it can be used to supply linear algebra component functionality to an existing external software package. We give preliminary performance figures and discuss avenues for future work.
Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Sophisticated programmers, or occasionally sophisticated compilers, match the loop structure to the language's storage layout in order to maximise spatial locality. Unsophisticated programmers do not, and the performance loss is often dramatic --- up to a factor of 20. With knowledge of how the array will be used, it is often possible to choose between the two layouts in order to maximise spatial locality. In this paper we study the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order. We present results from a suite of simple application kernels which show that, on the AMD Athlon and Pentium III, for arrays larger than 256 x 256, Morton array layout, even implemented with a lookup table with no compiler support, is always within 61% of both row-major and column-major --- and is sometimes faster.
In this paper we study the use of idle cycles in a network of desktop workstations under unfavourable conditions: we aim to use idle cycles to improve the responsiveness of interactive applications through parallelism. Unlike much prior work in the area, our focus is on response time, not throughput, and short jobs - of the order of a few seconds. We therefore assume a high level of primary activity by the desktop workstations' users, and aim to keep interference with their work within reasonable limits. We present a fault-tolerant, low-administration service for identifying idle machines, which can usually assign a group of processors to a task in less than 200ms. Unusually, the system has no job queue: each job is started immediately with the resources which are predicted to be available. Using trace-driven simulation we study allocation policy for a stream of parallel jobs. Results show that even under heavy load it is possible to accommodate multiple concurrent guest jobs and obtain good speedup with very small disruption of host applications.
CFL (Communication Fusion Library) is an experimental C++ library which supports shared reduction variables in MPI programs. It uses overloading to distinguish private variables from replicated, shared variables, and automatically introduces MPI communication to keep replicated data consistent. This paper concerns a simple but surprisingly effective technique which improves performance substantially: CFL operators are executed lazily in order to expose opportunities for run-time, context-dependent optimization such as message aggregation and operator fusion. We evaluate the idea using both toy benchmarks and a 'production' code for simulating plankton population dynamics in the upper ocean. The results demonstrate the software engineering benefits that accrue from the use of the library and show that performance close to that of manually optimized code can be achieved automatically in many cases.
Domain-specific performance optimisations (DSOs) can prove extremely profitable. My group at Imperial has worked on six or seven different DSO projects, mostly in computational science applications. This talk aims to reflect on our experiences. One aspect, of course, is whether we have a stand-alone domain-specific language (DSL), a DSL embedded in a general host language, or an “active library” whose implementation delivers DSOs, perhaps across sequences of calls. A key question, though, is just what enables us to deliver a DSO. Is it some special semantic property deriving from the domain? Is it because the DSL abstracts from implementation details – enabling the compiler to make choices that would be committed in lower-level code? Is it that the DSL captures large-scale dataflows that are obscured when coded in a conventional general-purpose language? Is it simply that we know that particular optimisations are good for a particular context? The talk will explore this question with reference to our DSO projects in finite-element methods, unstructured meshes, linear algebra and Fourier interpolation. This is joint work with many collaborators.
What is the right code to generate, for a given hardware platform? How does this change as problem parameters change? This talk presents some recent work-in-progress exploring domain-specific languages and active libraries as a way to automate code generation for multicore and manycore platforms, and to capture the space of alternative implementation choices at a higher level than a compiler for a general-purpose language can. Paul Kelly will illustrate the potential for this idea by looking at our recent work on unstructured-mesh fluid dynamics, and finite element methods. By choosing the abstraction carefully, we can capture design choices far beyond what a conventional compiler can do - and, in the extreme, engage the users in selecting algorithms and numerical methods that match the capabilities of the underlying hardware to meet end-to-end objectives for solution quality.
Programming seems to have a fundamentally serial nature: planning the steps a computer will take. The reality of computing machines is not like that at all: computation and data are distributed in space, and multiple activities can take place in complex, overlapped and interacting ways. Engineering software to do this is really hard, because the best design choices involve partitioning data and scheduling work at multiple scales. Furthermore, partitioning and scheduling often cut across the logical structures that we rely on for abstraction and reuse of software components. This talk will trace this problem back to the earliest days of computing, then show some of our most recent work that offers the prospect of overcoming these challenges. The key idea is to build software tools that support, and promote, programming at a conceptual level that exposes performance optimisations without committing to details prematurely. That is: build software libraries that capture the computational structure. Then, by using knowledge and experience of particular application domains, generate efficient parallel code from it.
Performance programming is characterised by the need to structure
software components to exploit the context of use. Relevant context
includes the target processor architecture, the available resources
(parallelism, network, contention), the values and shapes of input and
intermediate data structures, the schedule and distribution of input
data delivery, and the way the results are to be used. This talk
focuses on adapting to dynamic context: adaptive algorithms, malleable
and migrating tasks, and application structures based on dynamic
component composition. Adaptive computations use metadata --
performance models, dependence information, data size and shape.
Computation itself is interwoven with planning and optimising the
computation process, using this metadata. This reflective nature
motivates both multi-stage and metaprogramming techniques. This talk
presents a research agenda aimed at developing a modelling framework
which allows us to characterise both computation and dynamic
adaptation in a way that allows systematic optimisation.
The TaskGraph Library is a C++ library for dynamic code generation,
which combines specialisation with dependence analysis and loop
restructuring. A TaskGraph represents a fragment of code which is
constructed and manipulated at run-time, then compiled, dynamically
linked and executed. TaskGraphs are initialised using macros and
overloading, which forms a simplified, C-like sub-language with
first-class arrays and no pointer arithmetic. Once a TaskGraph has
been constructed, we can analyse dependence structure and perform
optimisations. In this talk, we present the design of the
TaskGraph library, describe several sample applications that
demonstrate its use for runtime code specialisation and
optimisation, and discuss related work and future applications
and developments of the library.
Invited talk CMPP2004
(Constructive Methods in Parallel Programming), July 2004.
Multi-stage programming languages internalise the idea of generating
code at runtime, then executing it as part of your program. This talk
is about our library for doing this in C++, and how it might be useful
in scientific applications. Specialisation is certainly powerful -
generating code based on runtime knowledge of data values. We also
support meta-programming - having built some code, you can
programmatically mess with it. Our "TaskGraph" library supports a
suite of loop restructuring transformations, including loop fusion,
interchange, tiling and skewing (this comes for free as we build on
the Stanford SUIF framework). We present a couple of motivating
examples, which develop the idea of "domain-specific optimisation
components" - a library of metaprogram code that encodes expertise
about particular loops, applications or tuning strategies.
Slides from presentation at Dagstuhl
Workshop 03131 on Domain-Specific Program Generation (March 2003)
Superseded by the similar, updated talk above.
Slides from presentation at Dagstuhl
workshop on Performance Analysis and Distributed Computing (August 2002)
Invited talk at the Workshop on Parallel Functional Programming, UCL, September 1998
Slides from presentation at AADEBUG'97.