Next!

Symbolic Testing of OpenCL Code

Peter Collingbourne

Thursday 24st November, 3:30pm, William Penney Lab, Room 219a

We present an effective technique for crosschecking a C program against an accelerated OpenCL version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE and KLEE-FP that supports symbolic reasoning on the equivalence between symbolic values. Our approach is to symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. Using this model we are able to run OpenCL programs symbolically, keeping track of memory accesses for the purpose of race detection. We then compare the symbolic result against the plain C program in order to detect mismatches between the two versions. We applied KLEE-CL to the Parboil benchmark suite, the Bullet physics library and the OP2 library, in which we were able to find a total of seven errors: three mismatches between the OpenCL and C implementations, two memory errors, one OpenCL compiler bug and one race condition.

Upcoming

Past Talks

How Prefix Sums Work

Nathan Chong

Tuesday 1st November, 2pm, William Penney Lab, Room 219a

Imagine you have a multithreaded program where each thread produces an arbitrary number of outputs (only known at runtime). You would like to write these outputs into an array so that the results are packed contiguously (that is, all of thread0’s outputs followed by all of thread1’s outputs, and so on). The serial solution is easy, but what about a parallel one?

Prefix sums are an elegant parallel solution and it turns out they are fundamental to a large number of other parallel problems. So many that they may be considered to be a parallel primitive. The purpose of this talk is to appreciate how prefix sums work under the hood. If I have time I’ll also talk about how prefix sums have been essential to my work in particle-based problems.

prefix sum
PDF

GNU epsilon: an Extensible Programming Language

Luca Saiu

Monday 20th June, 3pm, Huxley Building, Room 342

Programming languages should be designed to be growable by users, built upon a very small language and a set of syntactic abstraction features aiming at rewriting complex programs into a combination of simple forms which are easy to analyse, reason upon and implement, according to a user-supplied specification directing syntax, control, static analysis and optimization. This work presents the advantages and challenges of such a reductionistic design philosophy and documents the experience of building upon one specific core language, starting from its neighbourhood and exploring much further.


Bio Luca Saiu is finishing his PhD at the LIPN Computer Science Laboratory of Université Paris 13, France. His research revolves around programming language semantics and implementation; he is particularly interested in improving programming language expressivity by employing minimalistic, extensible designs. His other interests include networks and parallel architectures. Luca Saiu got his Bachelor's and Master's Degrees in Computer Science at the University of Pisa, Italy, and has long been a programmer and free software activist. He's co-author of Marionnet, a network simulation tool: http://www.marionnet.org.

PDF

The Tensor Contraction Engine (TCE): A Domain-Specific Approach to Synthesizing High-Performance Codes for Quantum Chemistry

J. (Ram) Ramanujam

Thursday 16th June, 3pm, Huxley Building, lecture theater 145

The performance of large scientific computations is influenced by the choice of program optimizations, among other things. The accurate modeling of the electronic structure of atoms and molecules in quantum chemistry involves computationally intensive tensor contractions over large multi-dimensional arrays. We are developing a program synthesis system, the Tensor Contraction Engine, that translates a high-level specification of such computations into efficient, parallel code tailored to the characteristics of the target architecture. This talk provides an overview of TCE and discusses several components of the synthesis system are described, focusing on performance optimization issues.


Bio J. (Ram) Ramanujam received the B. Tech. degree in Electrical Engineering from the Indian Institute of Technology, Madras, India in 1983, and his M.S. and Ph. D. degrees in Computer Science from The Ohio State University in 1987 and 1990 respectively. He is currently the John E. and Beatrice L. Ritter Distinguished Professor in the Department of Electrical and Computer Engineering at Louisiana State University (LSU). In addition, he holds a joint faculty appointment with the LSU Center for Computation and Technology. His research interests are in compilers and runtime systems for high-performance computing, domain-specific languages and compilers for parallel computing, embedded systems, and high-level hardware synthesis. He has participated in several NSF- funded projects including the Tensor Contraction Engine and the Pluto project for automatic parallelization. Additional details can be found at http://www.ece.lsu.edu/jxr/.

Results on GPU-accelerated Bundle Adjustment with CUDA

Renato Salas-Moreno

Monday 6th June, 3pm, William Penney Lab, Access Grid room

Previously I talked about preliminary background on Bundle Adjustment, which in computer vision is one of the final steps in feature-based 3D reconstruction with a moving camera, aiming to optimise the estimated set of point and camera parameters (increasing trajectory accuracy and reducing error build-up).

In this talk I will present results of a hybrid GPU/CPU implementation showing up to 10x speedup compared to a traditionally serial algorithm. I will also discuss some difficulties experienced during the implementation that limit the amount of parallelism that could be further enhanced.

PDF

Optimisations for OP2 Programs

Carlo Bertolli

Monday 11th April, 3pm, William Penney Lab, Access Grid room

In this SPO group brainstorming session we will discuss optimisation opportunities for OP2 programs, focusing in the case of their CUDA implementation. Optimisation techniques that will be subject of the meeting will be loop fusion and splitting, as well as optimised GPU memory management, and the underlying performance models used to evaluate the improvement provided by an optimisation and to characterise best optimisations.

PDF

WCET Estimates of Component-Based Systems using Traces

Adam Betts

Monday 28th March, 3pm, William Penney Lab, Access Grid room

No software to analyse = no program model = no WCET estimate = potential catastrophe. This is the daunting reality of tomorrow's real-time software infrastructure as component-based software engineering becomes ubiquitous.
The issue at hand is that components get shipped to clients under the proviso that its disassembly is strictly forbidden, to protect intellectual property. Static analysis immediately chokes. The only alternative - end-to-end testing - is dusted off and readied for action, but it lacks support from test-generation techniques, which generally target functional behaviour alone.
This talk exposes a grey-box solution which builds the program model on the fly from a set of execution traces, rather than through inspection of the source code or disassembly. In effect, instrumentation points provide visibility of the paths taken through the component during testing. We contrast and compare different ways that instrumentation points can be placed and how they affect the accuracy of the WCET estimate. Our results indicate that simple coverage metrics, such as basic block and branch, provide a bounded WCET estimate on the benchmarks under analysis, whereas end-to-end testing almost always fails.

PDF

An Optimization Theory for Data Parallel Stencil-based Parallel Applications

Massimiliano Meneghin

Monday 21st March, 3pm, William Penney Lab, Access Grid room

During the talk a new optimization theory for stencil-based applications is presented. The theory is centered both on a modification of the well known owner-computes rule and on basic but powerful properties of toroidal spaces. The proposed optimization techniques provide notable results in different computational aspects: from the reduction of communication overhead to the reduction of computation time, through the minimization of memory requirements without performance loss.
All classical optimization theory is based on defining transformations that can produce optimized programs which are computationally equivalent to the original ones. According to Kennedy, two programs are equivalent if, from the same input data, they produce identical output data.
As other proposed modifications to the owner-computes rule, stencil application feature of being characterized by a set of consecutive steps is exploited. For such configurations, it is possible to define specific two phase optimizations. The first phase is characterized by the application of program transformations which result in an efficient computation of an output that can be easily converted into the original one. In other words the transformed program defined by the first phase is not computational equivalent with respect to the original one.
The second phase converts the output of the previous phase back into the original one exploiting optimized technique in order to introduce the lowest additional overhead. The phase guarantees the computational equivalence of the approach. Obviously, in order to define an interesting new optimization technique, it has to be proved that the overall performance of the two phases sequence is greater than the one of the original program.
Exploiting a structured approach and studying this optimization theory on stencils featuring specific patterns of functional dependencies, a set of novel transformations, which result in significant optimizations, have been discovered. Among the new transformations, the most notable one, which aims to reduce the number of communications necessary to implement a stencil-based application, turns out to be the best optimization technique amongst those cited in the literature.

PDF

Symbolic Crosschecking of Floating-Point and SIMD Code

Peter Collingbourne

Monday 14th March, 1pm, William Penney Lab, Access Grid room

We present an effective technique for crosschecking an IEEE 754 floating-point program and its SIMD-vectorized version, implemented in KLEE-FP, an extension to the KLEE symbolic execution tool that supports symbolic reasoning on the equivalence between floating-point values.
The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. As a result, our technique works by lowering the Intel Streaming SIMD Extension (SSE) instruction set to primitive integer and floating-point operations, and then using an algorithm based on symbolic expression matching augmented with canonicalization rules.
Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation.
We applied KLEE-FP to OpenCV, a popular open source computer vision library. KLEE-FP was able to successfully crosscheck 51 SIMD/SSE implementations against their corresponding scalar versions, proving the bounded equivalence of 41 of them (i.e., on images up to a certain size), and finding inconsistencies in the other 10.

PDF

A Framework for Analysing and Profiling City Designs

David Birch

Monday 28th February, 3pm, William Penney Lab, Access Grid room

The process of designing a new suburb or city from scratch is ripe for the application of software performance optimisation techniques. This presentation will explore these opportunities and introduce a framework to automatically analyse a city design across multiple disciplines for multiple metrics on multiple design levels under multiple scenarios. The goal of the system is to gain cross-discipline insight into the design not otherwise attainable on a scale not otherwise attainable.
We will demonstrate:

  • Analysing a city design for carbon, water, heat, gas and electricity usage using an Integrated Resource Management Model.
  • Analysing photovoltaic potential using an open source lighting simulator.
  • Using the Syncity toolkit from Imperial’s Urban Energy Systems group we will analyse the placement of power generation and inter-conversion technology within a city as well as the resource transport network required.
  • As these analyses are integrated into our Framework we will them seek to profile and “debug” the city design using results from each analysis to identify cross-discipline design insight.

    city scape

    The talk will consist of a discussion of the urban masterplanning process and the opportunities identified for the application of software performance optimisation techniques. Before focusing on the analysis framework developed and its data structures before finishing by demoing results from each analysis integrated and a 3d computer game like environment for interactive exploration of city performance.
    The research is demonstrated in collaboration with engineering consultancy Arup.

    Accelerating Bundle Adjustment with CUDA

    Renato Salas Moreno

    Monday 21st February, 3pm, William Penney Lab, Access Grid room

    In computer vision, Bundle Adjustment is one of the final steps performed after feature-based 3D reconstruction of a moving camera with unknown motion. By optimising the estimated 3D points positions and camera parameters, it helps to increase the accuracy of the camera trajectory (decreasing the effects of measurement noise) and reduce the error-buildup (preventing tracking failure).
    Traditionally it was viewed as an offline or background process involving an iterative solution of a non-linear least squares equation across all the available information used for reconstruction. Current optimisations often involve exploiting the sparse nature of various matrices to avoid unnecessary computations.
    In this talk I will explain in a little more detail how Bundle Adjustment is solved using the Levenberg–Marquardt algorithm and discuss insights on my ongoing implementation using CUDA, which will exploit the inherit parallelism found at various steps, aiming to speedup the process for real-time camera tracking.

    A Programming Model for Adaptive Emergency Management Applications

    Carlo Bertolli

    Monday 14th February, 3pm, William Penney Lab, Access Grid room

    Several complex and time-critical applications require the existence of novel distributed and dynamical platforms composed of a variety of fixed and mobile processing nodes and networks. Notable examples of such applications are crisis and emergency management and natural phenomenon prediction. Typically, emergency forecasting consists of complex fluid dynamic applications modelling a specific area (e.g. a river near a town), with the aim of predicting critical situations (e.g. a flood) well in advance (e.g. a week), to enable civil protection personnel to minimise the impact on the environment and population. Nevertheless, real-time prediction information when the emergency is in progress would help the optimisation of certain aspects of management. This requires to apply prediction techniques also during the evolution of the emergency, and to scatter the information to a set of distributed and possibly mobile nodes. In some cases it may be impractical or inefficient to map all forecasting computations to a high-performance server, a more practical solution consisting in mapping them to geographically localised computing nodes and networks w.r.t. the input and output of the computation, at the cost of a lower quality prediction. It is hence clear that in this scenario we need the development of applications able to adapt their behaviour according to the dynamical platform conditions, such as the presence of specific classes of computing resources and the actual network availability. In this presentation I will introduce a solution to the development of adaptive applications based on the design of a new programming language, featuring multi-versioning and adaptivity constructs.

    PDF

    Model & Design Insight in Integrated Resource Management

    David Birch

    Monday 7th February, 3pm, William Penney Lab, Access Grid room

    In architecture and civil engineering Integrated Resource Management (IRM) models are an emerging design support tool for Urban Masterplanning (the design of suburb, city or region level architecture). IRM models allow a design to be analysed from many disparate engineering disciplines to give a composite, coherent assessment of the sustainability of a design.
    Despite their recent success IRM models are kept from reaching their potential by their inherent complexity, high data requirements and the difficulty of gaining sufficient model comprehension to enable project specific adaptation & implementation.
    In this presentation we will explore the application of techniques from software profiling and optimisation in an attempt to address these problems.
    Specifically we apply model extraction, slicing & visualisation to an IRM model. We then demonstrate the insight gained from model metrics before using a model slice as input for a sensitivity analysis to gain design insight for optimisation.
    Ultimately the work presented should empower analysts and designers with greater insight into their design space enabling them to work more efficiently toward a more sustainable built environment.
    This work is the result of collaboration with Helen Liang from Bath University & is demonstrated in conjunction with engineering firm Arup.

    Loop-Directed Mothballing: Power-Gating Execution Units Using Fast Analysis of Inner Loops

    Craig Court

    Monday 31st January, 3pm, William Penney Lab, Access Grid room

    Static power dissipation has been identified as a limiting factor in future microprocessor technologies. As current technology is approaching that limit, techniques are required both at the technology level and in the architecture design to reduce static power leakage. Power-gating is a popular architectural tool to reduce static power either by disabling whole units, or by dynamically resizing units to meet the application demands. In this presentation we will present Loop-Directed Mothballing to power-gate entire execution units, by recording the utilisation of individual units in innermost loops and power-gating the units according to two utilisation thresholds. An average of 10% total power saving is achieved with negligible performance loss. Two prior techniques are compared, and at first glance appear superior due to their power savings. However, taking performance into account using the energy-delay product (EDP), the two techniques show average results similar or inferior to an architecture without power-gating. Conversely, LDM shows a 10% average saving for EDP due to the low performance penalty.

    AVI

    Anisotropic Mesh Adaptivity on NVIDIA's CUDA

    Georgios Rokos

    Monday 24th January, 3pm, William Penney Lab, Access Grid room

    Mesh smoothing is a component of anisotropic mesh adaptivity, a numerical technology of some importance in Computational Fluid Dynamics (CFD). Adapting the size and shape of elements in an unstructured mesh to a specification encoded in a metric tensor field is done by relocating mesh vertices using an optimisation algorithm, like the one proposed by Pain et al. in 2001. This computationally heavy task can be accelerated by engaging NVIDIA's CUDA, a massively parallel and floating-point capable architecture. In order to ensure correct parallel execution, we implemented the parallel framework based on the use of independent sets, proposed by Freitag et al. in 1998. In this presentation we will demonstrate all related algorithms and architectural details, giving design and implementation descriptions and listing various CUDA code optimisations which can lead to a speedup of up to 190 times over the serial CPU code and up to 45 times over an eight-threaded OpenMP code. Performance analysis shows that CUDA's texture memory can assist in accelerating execution by 2.5 times and that high register usage is the main limiting factor for better performance.

    PDF

    Worst-Case Execution Time (WCET): the past, present & future

    Adam Betts

    Monday 17th January, 3pm, William Penney Lab, Access Grid room

    The holy grail of performance analysis in real-time systems is worst-case, rather than average-case, behaviour due to safety-critical requirements. In the past 20 years, considerable academic effort has been pooled into answering the WCET question. But what exactly does WCET analysis entail? Why does industry still continue to ignore the vast majority of WCET techniques? How does future software and hardware trends affect the complexity of WCET analysis? Indeed, will research into WCET still exist in 20 years? This seminar sheds light on these questions and much more.

    PDF

    Fault Tolerance for High-Performance Applications

    Carlo Bertolli

    Monday 6th december, 3pm, William Penney Lab, Access Grid room

    Existing Fault Tolerance techniques include replication, checkpointing and rollback recovery, and atomic actions. All these techniques have been employed in the past to support High-Performance applications, and their design is based on general programming models. The result is that performance analysis as well as its static and dynamic tuning is difficult, where not impossible. In other cases specialised techniques, developed for specific application cases, are employed. Unfortunately such techniques cannot be applied to a wide range of parallel applications, and/or they do not permit performance tuning. In this talk we start by considering high-level classes of parallel applications (namely task farm and data parallel programs) and we show how to develop checkpointing and rollback recovery protocols supporting a sufficiently general set of parallel applications, while avoiding main issues suffered by applications running protocols for more general programming models.

    PDF

    Updating System Software On-the-Fly with Arachne

    Nicolas Loriant

    Monday 29th november, 3pm, William Penney Lab, Access Grid room

    High performance software systems such as OS kernels, and Web servers, must achieve high performance and be available 24/7. Traditional updating techniques (i.e., patch, rebuild and restart) are impracticable for such systems. To takle this problem, we proposed an Aspect-Oriented (AO) approach to describing software adaptations. These adaptations are woven into running systems by rewriting its binary code on-the-fly. Our approach guarantees the continuous execution of the system under update without impeding its performance.
    In this talk, we present Arachne, An aspect oriented system. Arachne features an AO language similar to C dedicated to describing adaptations of high-performance C systems. In the Arachne language, adaptations are described in terms of what they perform, and when they are executed. Therefore, it allows developers to express adaptations according to the semantic of the base system. For example, the code of an adaptation may be executed before the call to a given function.
    In order to apply adaptations described in our AO language, Arachne features a runtime aspect weaver using on-the-fly binary code rewriting. Arachne performs aspect weaving by dynamically modifying the code executed by the processor without interrupting the execution of the woven system. It requires no prior modification to the modified system, the compilation process or the execution environment. Arachne can perform aspect weaving on both user- and kernel-space code. Moreover, it provides specialized rewriting mechanisms that trigger adaptation code significantly faster (up to 500 times) than generic ones (e.g., those found in debuggers).

    PDF

    An Active-Library Based Investigation into the Performance Optimisation of Linear Algebra and the Finite Element Method

    Francis Russell

    Monday 22nd november, 3pm, William Penney Lab, Access Grid room

    Active libraries are a technique whereby a library can take responsibility for the optimisation of its own code. By shifting responsibility for performance into the library, they enable library clients to write code without having to concern themselves with performance, and enable the library writer to focus on how to make the abstractions fast.
    This presentation will discuss an investigation of active libraries in the context of computational science. First, I consider sparse linear algebra as an extension to previous work from my MEng. Next, I examine active libraries as a tool to specify and explore optimisations a solver for partial differential equations using the finite element method.

    PDF

    Automated Finite Element Computations in the FEniCS Framework using General Purpose Graphics Processing Units - On the Road Towards Interactive Simulation

    Florian Rathgeber

    Monday 15th november, 3pm, William Penney Lab, Access Grid room

    Graphics Processing Units (GPUs) are established as a computational hardware platform superior in price/performance compared to general CPUs, but typically require specialized implementations and expert low-level knowledge. In this thesis the FEniCS framework for automated finite element solution of partial differential equations is extended to automatically generate GPU implementations, and achieve the expected speedup without sacrificing generality. An implementation using NVIDIAs Compute Unified Device Architecture (CUDA) of general finite element assembly and a conjugate gradient (CG) solver for the linear system of equations are presented for the DOLFIN problem solving environment. Extending the FEniCS form compiler FFC to generate specific CUDA kernels for assembly from a mathematical notation of the variational form retains the flexibility and degree of automation which is the basis of FEniCS. A matrix-free method, computing the matrix-vector product $Ax$ in the CG iteration without assembling the matrix $A$, is evaluated against the assembly method, and is shown to perform better for a class of problems. Benchmarking and profiling variational forms with different characteristics on a workstation show a significant speedup on the NVIDIA Tesla architecture of up to a factor 9 over the serial CPU implementation and reveal bottlenecks of both the CPU and GPU codes.
    Furthermore, a prototype implementation of computational steering for the FEniCS application Unicorn is presented, allowing to interactively change parameters of a running simulation and getting visual feedback in close to real time, presenting a possible application for GPU acceleration.

    PDF

    Consistency and Coherence in the ARM Architecture

    Nathan Chong

    Monday 1st november, 3pm, William Penney Lab, Access Grid room

    Memory consistency models define the ordering of memory accesses in a shared-memory multiprocessor. Therefore their importance stems from the fact that they are (a) part of the underlying semantics of any shared-memory multiprocessor program and (b) they are frequently unintuitive for programmers used to a sequentially consistent view of memory.
    Coherence is a related concept that defines the ordering of memory access *to the same location*. Coherence is often managed in hardware in a cache-coherence protocol such as MESI. High-level properties such as deadlock freedom are easy to verify at the specification level but can be trickier to prove if considering detail such as non-atomic transactions and protocol-level hardware dependencies (such as finite-sized buffers).
    This talk will discuss the ARM weakly-consistent memory model using litmus tests (small multiprocessor programs) and build up an intuitive understanding of the specification. I'll also briefly cover the formal verification of the latest AMBA4 coherence specification.

    PDF

    An Introduction to Polyhedral Compilation

    Athanasios Konstantinidis

    Monday 25th october, 3pm, William Penney Lab, Access Grid room

    Efficient programming of the new emerging microprocessor architectures like GPGPUs, multi-core, Cell BE etc, has proven to be challenging for programmers who are required to put significant effort in order to exploit their power. As a result, new programming languages and compiler technologies have been developed in order to reduce programming effort while increasing software performance. One of these technologies - the polyhedral model - is the only one that can provide fully automatic parallelization, loop-nest transformations and data movement of static control program segments. Although the applicability of the model is somewhat restrictive, computation kernels that fall into its scope can be found in many high performance applications with considerable impact on their overall performance. Recent studies though suggest that its applicability can actually be extended beyond static control kernels. In fact the real power of polyhedral compilation comes from its ability to represent complex compositions of program transformations as a single scheduling function thus eliminating phase ordering limitations of abstract syntax trees.
    This presentation will provide an introduction to the polyhedral model, highlighting its underlying principles and advantages compared with abstract syntax trees. Furthermore, the design of our new prototype polyhedral compiler will be presented in order to demonstrate how the technique is used in practice.

    PDF