Protocols by Default: Safe MPI Code Generation based on Session Types

Nicholas Ng, Jose G.F. Coutinho, Nobuko Yoshida

Abstract

This paper presents a code generation framework for type-safe and deadlock-free Message Passing Interface (MPI) programs. The code generation process starts with the definition of the global topology using a protocol specification language based on parameterised multiparty session types (MPST). An MPI parallel program backbone is automatically generated from the global specification. The backbone code can then be merged with the sequential code describing the application behaviour, resulting in a complete MPI program. This merging process is fully automated through the use of an aspect-oriented compilation approach. In this way, programmers only need to supply the intended communication protocol and provide sequential code to automatically obtain parallelised programs that are guaranteed free from communication mismatch, type errors or deadlocks. The code generation framework also integrates an optimisation method that overlaps communication and computation, and can derive not only representative parallel programs with common parallel patterns (such as ring and stencil), but also distributed applications from any MPST protocols. We show that our tool generates efficient and scalable MPI applications, and improves productivity of programmers. For instance, our benchmarks involving representative parallel and application-specific patterns speed up sequential execution by up to 31 times and reduce programming effort by an average of 39%.



Repository

Protocol (Pabble)   Backbone (C/MPI)   Diagram
ring pipeline       Backbone           Ring protocol
stencil             Backbone           Stencil protocol
scatter-gather      Backbone           ScatterGather protocol
master-worker       Backbone           Farm protocol
all-to-all          Backbone           All to All protocol
fft64 *             Backbone           FFT protocol

* Repeated for 64 processes with r=5

Benchmarks

Overall

Overall

The Overall benchmark compares the performance of all the applications we evaluated in our framework. The protocols used by the implementations come both from the repository above and from custom specifications. All implementations are spawned with 64 processes (the maximum available on our target cluster), which provides a common environment for comparing the speedup of each implementation.

Performance: Nbody

Nbody

The Nbody benchmark compares the performance of an N-body simulation implemented with the Ring protocol and shows how it scales with an increasing number of parallel processes. Since the algorithm is well suited to the communication-computation overlapping (asynchronous) optimisation described in our paper, we also show the impact of enabling this optimisation on performance.

Performance: Solver

Solver

The Solver benchmark compares the performance of a linear equation solver implemented with a custom wraparound-mesh protocol (see below). The aim of this benchmark is to demonstrate the flexibility of our framework with a user-specified protocol that is not in the protocol repository above.

module custom;
const N = 2..max;
global protocol Solver(role Worker[1..N][1..N], group Col={Worker[1..N][1]}) {
  rec Converge {
    BeforeRing() from __self to __self;
    // Horizontal propagation: ring with wraparound on the second index
    Ring(double) from Worker[i:1..N][j:1..N-1] to Worker[i][j+1];
    Ring(double) from Worker[i:1..N][N] to Worker[i][1];
    AfterRing() from __self to __self;

    // Vertical propagation
    Propagate(double) from Col to Col;
    continue Converge;
  }
}

Reusability: MapReduce

MapReduce

The MapReduce benchmark compares two different applications, AdPredictor and Word Count, that share the scatter-gather protocol but use two different sets of kernels.

Pre-compiled packages for Debian/Ubuntu Linux: libsesstype, libsesstype-dev, libscribble, libscribble-dev, pabble-mpi

Workflow

$ scribble-tool /path/to/protocol.spr --codegen --project ProjectionRole | pabble-mpi-tool -