Past Talks
Nathan Chong
Tuesday 1st November, 2pm, William Penney Lab, Room 219a
Imagine you have a multithreaded program where each thread
produces an arbitrary number of outputs (only known at runtime).
You would like to write these outputs into an array so that the
results are packed contiguously (that is, all of thread0’s
outputs followed by all of thread1’s outputs, and so on). The
serial solution is easy, but what about a parallel one?
Prefix sums are an elegant parallel solution and it turns
out they are fundamental to a large number of other parallel
problems. So many that they may be considered to be a parallel
primitive. The purpose of this talk is to appreciate how prefix
sums work under the hood. If I have time I’ll also talk about
how prefix sums have been essential to my work in particle-based
problems.
Luca Saiu
Monday 20th June, 3pm, Huxley Building, Room 342
Programming languages should be designed to be growable by
users, built upon a very small language and a set of syntactic
abstraction features aiming at rewriting complex programs into
a combination of simple forms which are easy to analyse, reason
upon and implement, according to a user-supplied specification
directing syntax, control, static analysis and optimization.
This work presents the advantages and challenges of such a
reductionistic design philosophy and documents the experience
of building upon one specific core language, starting from its
neighbourhood and exploring much further.
Bio Luca Saiu is finishing his PhD at the LIPN
Computer Science Laboratory of Université Paris 13, France.
His research revolves around programming language semantics and
implementation; he is particularly interested in improving
programming language expressivity by employing minimalistic,
extensible designs. His other interests include networks and
parallel architectures. Luca Saiu got his Bachelor's and
Master's Degrees in Computer Science at the University of Pisa,
Italy, and has long been a programmer and free software
activist. He's co-author of Marionnet, a network simulation
tool:
http://www.marionnet.org.
J. (Ram) Ramanujam
Thursday 16th June, 3pm, Huxley Building, lecture theater 145
The performance of large scientific computations is
influenced by the choice of program optimizations, among
other things. The accurate modeling of the electronic
structure of atoms and molecules in quantum chemistry
involves computationally intensive tensor contractions over
large multi-dimensional arrays. We are developing a program
synthesis system, the Tensor Contraction Engine, that
translates a high-level specification of such computations
into efficient, parallel code tailored to the
characteristics of the target architecture. This talk
provides an overview of TCE and discusses several components
of the synthesis system are described, focusing on
performance optimization issues.
Bio J. (Ram) Ramanujam received the B. Tech. degree in
Electrical Engineering from the Indian Institute of
Technology, Madras, India in 1983, and his M.S. and Ph. D.
degrees in Computer Science from The Ohio State University
in 1987 and 1990 respectively. He is currently the John E.
and Beatrice L. Ritter Distinguished Professor in the
Department of Electrical and Computer Engineering at
Louisiana State University (LSU). In addition, he holds a
joint faculty appointment with the LSU Center for
Computation and Technology. His research interests are in
compilers and runtime systems for high-performance
computing, domain-specific languages and compilers for
parallel computing, embedded systems, and high-level
hardware synthesis. He has participated in several NSF-
funded projects including the Tensor Contraction Engine and
the Pluto project for automatic parallelization. Additional
details can be found at http://www.ece.lsu.edu/jxr/.
Renato Salas-Moreno
Monday 6th June, 3pm, William Penney Lab, Access Grid room
Previously I talked about preliminary background on Bundle
Adjustment, which in computer vision is one of the final steps
in feature-based 3D reconstruction with a moving camera, aiming
to optimise the estimated set of point and camera parameters
(increasing trajectory accuracy and reducing error build-up).
In this talk I will present results of a hybrid GPU/CPU
implementation showing up to 10x speedup compared to a
traditionally serial algorithm. I will also discuss some
difficulties experienced during the implementation that limit
the amount of parallelism that could be further enhanced.
Carlo Bertolli
Monday 11th April, 3pm, William Penney Lab, Access Grid room
In this SPO group brainstorming session we will discuss
optimisation opportunities for OP2 programs, focusing in the
case of their CUDA implementation. Optimisation techniques that
will be subject of the meeting will be loop fusion and splitting,
as well as optimised GPU memory management, and the underlying
performance models used to evaluate the improvement provided by
an optimisation and to characterise best optimisations.
Adam Betts
Monday 28th March, 3pm, William Penney Lab, Access Grid room
No software to analyse = no program model = no WCET estimate
= potential catastrophe. This is the daunting reality of
tomorrow's real-time software infrastructure as component-based
software engineering becomes ubiquitous.
The issue at hand is that components get shipped to clients
under the proviso that its disassembly is strictly forbidden, to
protect intellectual property. Static analysis immediately
chokes. The only alternative - end-to-end testing - is dusted
off and readied for action, but it lacks support from
test-generation techniques, which generally target functional
behaviour alone.
This talk exposes a grey-box solution which builds the program
model on the fly from a set of execution traces, rather than
through inspection of the source code or disassembly. In effect,
instrumentation points provide visibility of the paths taken
through the component during testing. We contrast and compare
different ways that instrumentation points can be placed and
how they affect the accuracy of the WCET estimate. Our results
indicate that simple coverage metrics, such as basic block and
branch, provide a bounded WCET estimate on the benchmarks under
analysis, whereas end-to-end testing almost always fails.
Massimiliano Meneghin
Monday 21st March, 3pm, William Penney Lab, Access Grid room
During the talk a new optimization theory for stencil-based
applications is presented. The theory is centered both on a
modification of the well known owner-computes rule and on basic
but powerful properties of toroidal spaces. The proposed
optimization techniques provide notable results in different
computational aspects: from the reduction of communication
overhead to the reduction of computation time, through the
minimization of memory requirements without performance
loss.
All classical optimization theory is based on defining
transformations that can produce optimized programs which are
computationally equivalent to the original ones. According to
Kennedy, two programs are equivalent if, from the same input
data, they produce identical output data.
As other proposed modifications to the owner-computes rule,
stencil application feature of being characterized by a set
of consecutive steps is exploited. For such configurations, it
is possible to define specific two phase optimizations. The
first phase is characterized by the application of program
transformations which result in an efficient computation of
an output that can be easily converted into the original one.
In other words the transformed program defined by the first
phase is not computational equivalent with respect to the
original one.
The second phase converts the output of the previous phase
back into the original one exploiting optimized technique in
order to introduce the lowest additional overhead. The phase
guarantees the computational equivalence of the approach.
Obviously, in order to define an interesting new optimization
technique, it has to be proved that the overall performance of
the two phases sequence is greater than the one of the original
program.
Exploiting a structured approach and studying this optimization
theory on stencils featuring specific patterns of functional
dependencies, a set of novel transformations, which result in
significant optimizations, have been discovered. Among the new
transformations, the most notable one, which aims to reduce the
number of communications necessary to implement a stencil-based
application, turns out to be the best optimization technique
amongst those cited in the literature.
Peter Collingbourne
Monday 14th March, 1pm, William Penney Lab, Access Grid room
We present an effective technique for crosschecking an IEEE 754
floating-point program and its SIMD-vectorized version, implemented in
KLEE-FP, an extension to the KLEE symbolic execution tool that supports
symbolic reasoning on the equivalence between floating-point values.
The key insight behind our approach is that floating-point values
are only reliably equal if they are essentially built by the same
operations. As a result, our technique works by lowering the Intel
Streaming SIMD Extension (SSE) instruction set to primitive integer
and floating-point operations, and then using an algorithm based on
symbolic expression matching augmented with canonicalization rules.
Under symbolic execution, we have to verify equivalence along
every feasible control-flow path. We reduce the branching factor
of this process by aggressively merging conditionals, if-converting
branches into select operations via an aggressive phi-node folding
transformation.
We applied KLEE-FP to OpenCV, a popular open source computer vision
library. KLEE-FP was able to successfully crosscheck 51 SIMD/SSE
implementations against their corresponding scalar versions, proving
the bounded equivalence of 41 of them (i.e., on images up to a certain
size), and finding inconsistencies in the other 10.
David Birch
Monday 28th February, 3pm, William Penney Lab, Access Grid room
The process of designing a new suburb or city from scratch is ripe for the
application of software performance optimisation techniques. This presentation
will explore these opportunities and introduce a framework to automatically
analyse a city design across multiple disciplines for multiple metrics on
multiple design levels under multiple scenarios. The goal of the system is to
gain cross-discipline insight into the design not otherwise attainable on a
scale not otherwise attainable.
We will demonstrate:
Analysing a city design for carbon, water, heat, gas and electricity
usage using an Integrated Resource Management Model.
Analysing photovoltaic potential using an open source lighting
simulator.
Using the Syncity toolkit from Imperial’s Urban Energy Systems group
we will analyse the placement of power generation and inter-conversion
technology within a city as well as the resource transport network
required.
As these analyses are integrated into our Framework we will them seek to
profile and “debug” the city design using results from each analysis to
identify cross-discipline design insight.
The talk will consist of a discussion of the urban masterplanning process
and the opportunities identified for the application of software performance
optimisation techniques. Before focusing on the analysis framework developed
and its data structures before finishing by demoing results from each analysis
integrated and a 3d computer game like environment for interactive exploration
of city performance.
The research is demonstrated in collaboration with engineering consultancy
Arup.
Renato Salas Moreno
Monday 21st February, 3pm, William Penney Lab, Access Grid room
In computer vision, Bundle Adjustment is one of the final steps
performed after feature-based 3D reconstruction of a moving camera
with unknown motion. By optimising the estimated 3D points positions
and camera parameters, it helps to increase the accuracy of the
camera trajectory (decreasing the effects of measurement noise)
and reduce the error-buildup (preventing tracking failure).
Traditionally it was viewed as an offline or background process
involving an iterative solution of a non-linear least squares
equation across all the available information used for
reconstruction. Current optimisations often involve exploiting
the sparse nature of various matrices to avoid unnecessary
computations.
In this talk I will explain in a little more detail how Bundle
Adjustment is solved using the Levenberg–Marquardt algorithm and
discuss insights on my ongoing implementation using CUDA, which
will exploit the inherit parallelism found at various steps,
aiming to speedup the process for real-time camera tracking.
Carlo Bertolli
Monday 14th February, 3pm, William Penney Lab, Access Grid room
Several complex and time-critical applications require the existence
of novel distributed and dynamical platforms composed of a variety of
fixed and mobile processing nodes and networks. Notable examples of
such applications are crisis and emergency management and natural
phenomenon prediction. Typically, emergency forecasting consists of
complex fluid dynamic applications modelling a specific area (e.g.
a river near a town), with the aim of predicting critical situations
(e.g. a flood) well in advance (e.g. a week), to enable civil
protection personnel to minimise the impact on the environment and
population. Nevertheless, real-time prediction information when the
emergency is in progress would help the optimisation of certain
aspects of management. This requires to apply prediction techniques
also during the evolution of the emergency, and to scatter the
information to a set of distributed and possibly mobile nodes. In
some cases it may be impractical or inefficient to map all
forecasting computations to a high-performance server, a more
practical solution consisting in mapping them to geographically
localised computing nodes and networks w.r.t. the input and output
of the computation, at the cost of a lower quality prediction. It is
hence clear that in this scenario we need the development of
applications able to adapt their behaviour according to the dynamical
platform conditions, such as the presence of specific classes of
computing resources and the actual network availability. In this
presentation I will introduce a solution to the development of
adaptive applications based on the design of a new programming
language, featuring multi-versioning and adaptivity constructs.
David Birch
Monday 7th February, 3pm, William Penney Lab, Access Grid room
In architecture and civil engineering Integrated Resource Management (IRM) models
are an emerging design support tool for Urban Masterplanning (the design of suburb,
city or region level architecture). IRM models allow a design to be analysed from
many disparate engineering disciplines to give a composite, coherent assessment of
the sustainability of a design.
Despite their recent success IRM models are kept from reaching their potential by
their inherent complexity, high data requirements and the difficulty of gaining
sufficient model comprehension to enable project specific adaptation & implementation.
In this presentation we will explore the application of techniques from software profiling and optimisation
in an attempt to address these problems.
Specifically we apply model extraction, slicing & visualisation to an IRM model. We then
demonstrate the insight gained from model metrics before using a model slice as input for a
sensitivity analysis to gain design insight for optimisation.
Ultimately the work presented should empower analysts and designers
with greater insight into their design space enabling them to work more
efficiently toward a more sustainable built environment.
This work is the result of collaboration with Helen Liang from Bath University & is
demonstrated in conjunction with engineering firm Arup.
Craig Court
Monday 31st January, 3pm, William Penney Lab, Access Grid room
Static power dissipation has been identified as a limiting factor in
future microprocessor technologies. As current technology is approaching
that limit, techniques are required both at the technology level and in
the architecture design to reduce static power leakage. Power-gating is
a popular architectural tool to reduce static power either by disabling
whole units, or by dynamically resizing units to meet the application
demands. In this presentation we will present Loop-Directed Mothballing
to power-gate entire execution units, by recording the utilisation of
individual units in innermost loops and power-gating the units according
to two utilisation thresholds. An average of 10% total power saving is
achieved with negligible performance loss. Two prior techniques are
compared, and at first glance appear superior due to their power
savings. However, taking performance into account using the energy-delay
product (EDP), the two techniques show average
results similar or inferior to an architecture without power-gating.
Conversely, LDM shows a 10% average saving for EDP due to the low
performance penalty.
Georgios Rokos
Monday 24th January, 3pm, William Penney Lab, Access Grid room
Mesh smoothing is a component of anisotropic mesh adaptivity, a numerical technology of some importance in Computational Fluid Dynamics (CFD). Adapting the size and shape of elements in an unstructured mesh to a specification encoded in a metric tensor field is done by relocating mesh vertices using an optimisation algorithm, like the one proposed by Pain et al. in 2001. This computationally heavy task can be accelerated by engaging NVIDIA's CUDA, a massively parallel and floating-point capable architecture. In order to ensure correct parallel execution, we implemented the parallel framework based on the use of independent sets, proposed by Freitag et al. in 1998. In this presentation we will demonstrate all related algorithms and architectural details, giving design and implementation descriptions and listing various CUDA code optimisations which can lead to a speedup of up to 190 times over the serial CPU code and up to 45 times over an eight-threaded OpenMP code. Performance analysis shows that CUDA's texture memory can assist in accelerating execution by 2.5 times and that high register usage is the main limiting factor for better performance.
Adam Betts
Monday 17th January, 3pm, William Penney Lab, Access Grid room
The holy grail of performance analysis in real-time systems is worst-case, rather than average-case, behaviour due to safety-critical requirements. In the past 20 years, considerable academic effort has been pooled into answering the WCET question. But what exactly does WCET analysis entail? Why does industry still continue to ignore the vast majority of WCET techniques? How does future software and hardware trends affect the complexity of WCET analysis? Indeed, will research into WCET still exist in 20 years? This seminar sheds light on these questions and much more.
Carlo Bertolli
Monday 6th december, 3pm, William Penney Lab, Access Grid room
Existing Fault Tolerance techniques
include replication, checkpointing and
rollback recovery, and atomic actions. All
these techniques have been employed in the
past to support High-Performance applications,
and their design is based on general
programming models. The result is that
performance analysis as well as its static and
dynamic tuning is difficult, where not
impossible. In other cases specialised
techniques, developed for specific application
cases, are employed. Unfortunately such
techniques cannot be applied to a wide range
of parallel applications, and/or they do not
permit performance tuning. In this talk we
start by considering high-level classes of
parallel applications (namely task farm and
data parallel programs) and we show how to
develop checkpointing and rollback recovery
protocols supporting a sufficiently general
set of parallel applications, while avoiding
main issues suffered by applications running
protocols for more general programming
models.
Nicolas Loriant
Monday 29th november, 3pm, William Penney Lab, Access Grid room
High performance software systems such as OS kernels, and Web servers, must achieve high performance and be available 24/7. Traditional updating techniques (i.e., patch, rebuild and restart) are impracticable for such systems. To takle this problem, we proposed an Aspect-Oriented (AO) approach to describing software adaptations. These adaptations are woven into running systems by rewriting its binary code on-the-fly. Our approach guarantees the continuous execution of the system under update without impeding its performance.
In this talk, we present Arachne, An aspect oriented system. Arachne features an AO language similar to C dedicated to describing adaptations of high-performance C systems. In the Arachne language, adaptations are described in terms of what they perform, and when they are executed. Therefore, it allows developers to express adaptations according to the semantic of the base system. For example, the code of an adaptation may be executed before the call to a given function.
In order to apply adaptations described in our AO language, Arachne features a runtime aspect weaver using on-the-fly binary code rewriting. Arachne performs aspect weaving by dynamically modifying the code executed by the processor without interrupting the execution of the woven system. It requires no prior modification to the modified system, the compilation process or the execution environment. Arachne can perform aspect weaving on both user- and kernel-space code. Moreover, it provides specialized rewriting mechanisms that trigger adaptation code significantly faster (up to 500 times) than generic ones (e.g., those found in debuggers).
Francis Russell
Monday 22nd november, 3pm, William Penney Lab, Access Grid room
Active libraries are a technique whereby a library can take responsibility for the optimisation of its own code. By shifting responsibility for performance into the library, they enable library clients to write code without having to concern themselves with performance, and enable the library writer to focus on how to make the abstractions fast.
This presentation will discuss an investigation of active libraries in the context of computational science. First, I consider sparse linear algebra as an extension to previous work from my MEng. Next, I examine active libraries as a tool to specify and explore optimisations a solver for partial differential equations using the finite element method.
Florian Rathgeber
Monday 15th november, 3pm, William Penney Lab, Access Grid room
Graphics Processing Units (GPUs) are established as a computational hardware platform superior in price/performance compared to general CPUs, but typically require specialized implementations and expert low-level knowledge. In this thesis the FEniCS framework for automated finite element solution of partial differential equations is extended to automatically generate GPU implementations, and achieve the expected speedup without sacrificing generality. An implementation using NVIDIAs Compute Unified Device Architecture (CUDA) of general finite element assembly and a conjugate gradient (CG) solver for the linear system of equations are presented for the DOLFIN problem solving environment. Extending the FEniCS form compiler FFC to generate specific CUDA kernels for assembly from a mathematical notation of the variational form retains the flexibility and degree of automation which is the basis of FEniCS. A matrix-free method, computing the matrix-vector product $Ax$ in the CG iteration without assembling the matrix $A$, is evaluated against the assembly method, and is shown to perform better for a class of problems. Benchmarking and profiling variational forms with different characteristics on a workstation show a significant speedup on the NVIDIA Tesla architecture of up to a factor 9 over the serial CPU implementation and reveal bottlenecks of both the CPU and GPU codes.
Furthermore, a prototype implementation of computational steering for the FEniCS application Unicorn is presented, allowing to interactively change parameters of a running simulation and getting visual feedback in close to real time, presenting a possible application for GPU acceleration.
Nathan Chong
Monday 1st november, 3pm, William Penney Lab, Access Grid room
Memory consistency models define the ordering of memory accesses in a shared-memory multiprocessor. Therefore their importance stems from the fact that they are (a) part of the underlying semantics of any shared-memory multiprocessor program and (b) they are frequently unintuitive for programmers used to a sequentially consistent view of memory.
Coherence is a related concept that defines the ordering of memory access *to the same location*. Coherence is often managed in hardware in a cache-coherence protocol such as MESI. High-level properties such as deadlock freedom are easy to verify at the specification level but can be trickier to prove if considering detail such as non-atomic transactions and protocol-level hardware dependencies (such as finite-sized buffers).
This talk will discuss the ARM weakly-consistent memory model using litmus tests (small multiprocessor programs) and build up an intuitive understanding of the specification. I'll also briefly cover the formal verification of the latest AMBA4 coherence specification.
Athanasios Konstantinidis
Monday 25th october, 3pm, William Penney Lab, Access Grid room
Efficient programming of the new emerging microprocessor architectures like GPGPUs, multi-core, Cell BE etc, has proven to be challenging for programmers who are required to put significant effort in order to exploit their power. As a result, new programming languages and compiler technologies have been developed in order to reduce programming effort while increasing software performance. One of these technologies - the polyhedral model - is the only one that can provide fully automatic parallelization, loop-nest transformations and data movement of static control program segments. Although the applicability of the model is somewhat restrictive, computation kernels that fall into its scope can be found in many high performance applications with considerable impact on their overall performance. Recent studies though suggest that its applicability can actually be extended beyond static control kernels. In fact the real power of polyhedral compilation comes from its ability to represent complex compositions of program transformations as a single scheduling function thus eliminating phase ordering limitations of abstract syntax trees.
This presentation will provide an introduction to the polyhedral model, highlighting its underlying principles and advantages compared with abstract syntax trees. Furthermore, the design of our new prototype polyhedral compiler will be presented in order to demonstrate how the technique is used in practice.