Imperial's Software Performance Engineering Laboratory

The Software Performance Optimization research group

Publications: Here

Group leader:

Dr Paul Kelly  


Academic staff:

Dr Tony Field, Department of Computing

Professor Wayne Luk, Department of Computing (Computer Systems Section, Custom Computing Group)

Professor Chris Hankin, Department of Computing

Professor Chris Pain, Department of Earth Science & Engineering

Dr Gerard Gorman, Department of Earth Science & Engineering


Research students:

Ashley Brown (PhD student, jointly supervised with Wayne Luk)

Jay Cornwall (PhD student, Industrial CASE Award with The Foundry Ltd)

Lee Howes (PhD student)

Michael Mellor (PhD student)

Francis Russell (PhD student)

We also have a steady stream of student members of the research group.  A partial list includes:

            MEng/BEng projects 2006-7: Peter Collingbourne, James Huggett, Alex LaMaison, Vishal Mistry.

            MSc 2006-7: Steve Lilly, Tadhg Pearson, David Olivier

            MEng/BEng projects 2005-6: Numaan Chaudry, Henry Falconer, Andrew Huff, Ekaterina Itskova, John Pappin

            UROPs 2005-6: Hubert Plociniczak, David Ingram


External collaborators:

Professor Christian Lengauer (University of Passau)

Scott Baden (Associate Professor, University of California, San Diego)

Susanna Pelagatti (Ricercatore, University of Pisa)


iSPEL Seminar listing

Alumni and PhD Graduates: see here

The Software Performance Optimization research group at Imperial College is based in the Department of Computing, and forms part of the Computer Systems Section.  We are fortunate in having lively linkages with several other groups, spanning theory, programming languages, parallel software, computational science applications, program analysis, custom computing and computer architecture.

Our research objective is to make it easier for software engineers to achieve adequate performance. Software performance really matters:

In computational science and engineering, users need to assemble simulation, analysis, visualisation, data repository and optimization components quickly and without compromising responsiveness. They need their software to combine sophisticated data structures and numerical methods with architecture-specific program optimisations. They need tools to scale up functional prototypes to handle full-size problems, and they need programs to adapt dynamically to the processing, networking and other resources available at run-time.

In embedded systems and mobile appliances, engineers need to derive the optimum hardware-software configuration to achieve the application's real-time demands while minimising power requirements and retaining flexibility. They need to generate code for multiple system-on-a-chip variants, consisting, for example, of processor core, signal processor and graphics device, from a common code base. They need to derive a custom system-on-a-chip configuration from the program structure. Standard libraries need to be automatically tuned for each custom CPU configuration. Dynamically-loaded code needs to exploit the resources of the specific device on which it runs.

In business information systems, software architects need to design secure, distributed systems by building and assembling re-usable components. Team leaders need to be able to rely on software services with minimal consideration for how they are implemented and distributed. Tools are needed to diagnose and rectify performance problems by automatically aggregating communications, caching data, proxying access methods and combining updates. Such tools involve optimisations which are not just inter-procedural, but which span a distributed system and cross security domains.

Our approach: run-time optimization of dynamically-composed software

The challenge we have identified is to extend existing compiler optimisations to the component level. Inter-component optimization goes beyond classical inter-procedural optimization in two crucial respects:

    • Components are often composed dynamically: the best way to execute a component depends on the way its operands are generated and its results are used, on the available computing and communications resources, on the prevailing contention, and on the end-to-end quality-of-service requirement.
    • A component can carry component metadata: the scope for finding the best way to execute a component depends on how much CPU time and memory it needs, how efficiently it uses multiple processors, and the distribution and sequence with which it consumes its operands and produces its results; these may in turn depend on the input data characteristics or the pattern of use to which it is put.

Component metadata is similar to the interprocedural summary information modern compilers associate with separately-analysed procedures. This covers the analysis side of the picture; component metadata also describes the performance characteristics of the available component implementations, together with the parameters which can be adjusted to match the implementation to the context in which it is being used.

Current projects

    • DESOBLAS: "Delayed-evaluation self-optimizing basic linear algebra subroutines". Olav Beckmann's PhD project is to build a parallel implementation of a standard numerical library (the BLAS routines for dense linear algebra). Olav's implementation runs in parallel on workstation clusters, and on MPPs such as the Fujitsu AP3000, IBM SP2 and Origin 2000. The interesting part is to optimize the distribution of intermediate arrays as they are passed from one library call to another. Olav's implementation does this entirely automatically, at run-time. Each BLAS library function is characterised by an invertible set of data placement constraints based on affine functions. The optimizer assembles these constraints from the program's data flow graph, then solves them iteratively. The run-time system maintains a cache of previous optimisation results, which are used when a previously-optimized data flow graph is re-encountered. For further details, see our papers, Beckmann & Kelly, EuroPar97, Beckmann & Kelly, PCW97, Beckmann & Kelly, EuroPar98, Beckmann & Kelly, LCR98 and Beckmann & Kelly, LCPC99. Future work will address uniprocessor performance (eg run-time loop fusion) and less-regular computations.
    • DESORMI: "Delayed-evaluation self-optimizing remote method invocation". Kwok Yeung's PhD project is to optimize distributed object-oriented programs based on Java's RMI (remote method invocation) mechanism. Kwok's implementation builds a data flow graph for the computation being executed, then constructs an optimised plan to execute it using the minimum communication delays. By operating at run-time, the implementation can adapt to the pattern of usage, the physical locations of the servers and client, and the prevailing network traffic. Major challenges include dependence analysis for linked structures, avoiding unnecessary serialisation, and relaxing the ordering imposed by Java's exceptions.
    • Run-time profiling and optimization of the Linux kernel: David Pearce has built a software tool for systematically and safely instrumenting, modifying and optimising Linux kernel code on-the-fly. Ariel Burton is using similar ideas to extend and apply his work on using traces in server performance analysis (see our papers Burton & Kelly, IPCCC98, Burton & Kelly, CEE98 and Burton & Kelly, UKPEW99).  Future work is intended to investigate code motion optimizations and run-time specialisation which span the operating system's protection boundaries. The plan is to inline device drivers into user code, and conversely to migrate user code into intelligent I/O devices.
    • Code motion for mobile code (Comic): a common factor in much of the group's work is to extend conventional code motion optimisations (eg moving loop-invariant operations out of loops) to operate in a distributed (and dynamically-bound) context. This raises fresh issues in data flow analysis, primarily because of the security issues - code migrated from a client to a server, or from user mode to kernel mode, must be trustable. It also raises new synthesis issues - eg moving code from the client to the server may reduce the client's workload at the expense of the server's. In general, changing circumstances mean the run-time system has to navigate the landscape of implementation alternatives adaptively. To do this, we need component metadata which allows efficient optimization algorithms.

Long-term objectives

1.      Develop a software infrastructure based on Java and remote method invocation (RMI) which makes it possible to apply compiler code motion optimizations to distributed applications

2.      Extend current understanding of code motion optimizations and the underlying static analyses, in order to provide a principled basis for optimising distributed applications without compromising security or availability

3.      Develop software tools for building distributed scientific applications which adapt, at run-time, to the problem, the available resources, and to the available precursors cached from earlier computation

4.      Build a theoretical framework for run-time optimization of irregular scientific applications, which balances the potential advantage of having full knowledge of how future execution will develop, against the time and space needed to compute and maintain this information

5.      Develop our understanding of the search space for optimally matching the execution strategy for a computational kernel to the available hardware, and extend it to larger-scale applications

Technical challenges

Underlying the projects listed above are some pervasive technical problems that are unlikely to be solved soon:

Dependence analysis for rich data structures

Smart programs use rich data structures, both to be flexible and to use sophisticated algorithms. Unfortunately, rich data structures are hard to analyse. Pointers lead to the possibility of two objects having shared substructure, which blocks many of our optimisation techniques. Sergio Almeida, a PhD graduate from the group, has an interesting proposal; see his ECOOP97 paper (available from his home page).
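A tiny illustration (not drawn from any system above) of why shared substructure is the obstacle: once two nominally distinct structures can reach the same cell, an update through one is visible through the other, so a compiler that cannot rule sharing out must forgo reordering and caching.

```python
# Two linked lists sharing a tail. A dynamic reachability check can
# detect the sharing; a static analysis must conservatively assume it
# whenever it cannot prove otherwise.

class Cell:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def cells(head):
    seen = set()
    while head is not None and id(head) not in seen:
        seen.add(id(head))
        head = head.next
    return seen

def may_share(a, b):
    """True if lists a and b reach any common cell."""
    return not cells(a).isdisjoint(cells(b))

tail = Cell(3)
xs = Cell(1, Cell(2, tail))   # [1, 2, 3]
ys = Cell(9, tail)            # [9, 3] -- shares its tail with xs
assert may_share(xs, ys)
ys.next.value = 42            # an update through ys...
assert xs.next.next.value == 42   # ...is observed through xs
```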

Reasoning about performance issues with rich data structures

It is usually very difficult to reason about performance issues with rich, linked data structures. How large would the serialised version of this parameter be? How much work is involved in this subtree? This is analogous to a static analysis problem, but to solve it we need to add code to maintain the information we want as the computation proceeds. We need the compiler to generate code to provide the information we will need to make good run-time optimization decisions.
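The kind of compiler-generated bookkeeping this suggests can be sketched as follows (an illustrative structure, not code from any project above): each node maintains its subtree's serialised size incrementally, so "how large would this parameter be on the wire?" is an O(1) lookup at the moment a run-time optimisation decision must be made, rather than a full traversal.

```python
# A tree that keeps its serialised size up to date as it is mutated.
# Updates propagate along the spine to the root, not over the whole tree.

class Tree:
    def __init__(self, payload_bytes):
        self.payload_bytes = payload_bytes
        self.children = []
        self.subtree_bytes = payload_bytes   # maintained incrementally
        self.parent = None

    def add(self, child):
        self.children.append(child)
        child.parent = self
        node, delta = self, child.subtree_bytes
        while node is not None:              # walk up to the root
            node.subtree_bytes += delta
            node = node.parent
        return child

root = Tree(16)
a = root.add(Tree(100))
a.add(Tree(200))
root.add(Tree(50))
assert root.subtree_bytes == 16 + 100 + 200 + 50   # O(1) at decision time
```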

Programming languages and libraries

We need to get people to use much better programming languages. It's unlikely that the problems with rich data structures are going to be solved by better analysis - the best bet is to get programmers to tell us what we need to know. New language concepts may be necessary. Perhaps it's possible to do it with metadata for container class libraries?

Linking static analysis to dynamic program properties

Conventional optimizing compilers separate analysis from synthesis: find out what you can about the program, and use it to generate the best possible code.  In the run-time context, we need to turn this round: analyse the code to find out what you need to know about the run-time context to achieve optimisations, then generate code (or construct component metadata) to acquire this information and use it.
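A hypothetical sketch of this reversal (the decorator and its name are inventions for illustration): the analysis has decided that operand size is the decision-relevant run-time fact, so the generated code's only job is to record it for a later specialisation pass.

```python
import functools

def observe_sizes(f):
    """Compiler-inserted instrumentation (hypothetical): log argument
    sizes so a later pass can choose between implementations."""
    profile = []
    @functools.wraps(f)
    def wrapper(xs):
        profile.append(len(xs))
        return f(xs)
    wrapper.profile = profile
    return wrapper

@observe_sizes
def total(xs):
    return sum(xs)

total([1, 2, 3])
total(list(range(1000)))
# From total.profile the run-time system can now judge whether a
# small-input fast path or a parallel implementation is worth having.
assert total.profile == [3, 1000]
```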

Program optimisation as mathematical optimisation: understanding the search space

Although a run-time optimizer has to run fast initially, a long-running program can be subjected to very intensive optimization.  We can therefore consider searching for the best possible code, for example by backtracking (eg when loop unrolling leads to register spills).  A more systematic approach is to formulate the optimisation problem in mathematical terms, and use ideas from operational research to characterise and partition the search space.
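The loop-unrolling example can be made concrete with a toy cost model (the register count, live-value estimate and overhead formula below are all invented for illustration): each unroll factor is a point in the search space, and factors whose live values exceed the register file are rejected, which is exactly where the backtracking arises.

```python
# Search over unroll factors with a made-up cost model. Unrolling
# amortises loop overhead until the live values no longer fit in
# registers, at which point spills dominate and the choice is rejected.

N_REGISTERS = 16
LIVE_PER_ITERATION = 3   # assumed live values per unrolled iteration

def cost(unroll):
    live = unroll * LIVE_PER_ITERATION
    if live > N_REGISTERS:
        return float("inf")          # register spill: backtrack
    return 10.0 / unroll             # amortised branch/index overhead

best = min([1, 2, 4, 8, 16], key=cost)
assert best == 4   # unroll=8 needs 24 live values, more than 16 registers
```

An operational-research formulation would replace this enumeration with a characterisation of the feasible region, which is what makes the approach scale beyond toy parameter spaces.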