Dependent Session Types for Evolving Multiparty Communication Topologies

Benchmark Results for a Parallel n-Body Simulation

Section 6 of the paper discusses how the greater expressiveness of dependent global session types in comparison to binary session programming permits significant performance improvement. This page describes further details regarding the benchmark application and execution environment, and presents the full source code and results.

The Benchmark Application and Source Code

The n-Body problem involves calculating the motion within a system of bodies modelled as particles with mass. The parallel n-Body simulation used for this benchmark features an advanced interaction structure based on the ring topology (see Figure 4a and Section 6). The basic idea is to divide the particles being modelled between a group of processes; each process is responsible for maintaining the state (i.e. position, velocity and acceleration) for its own particle set. To complete a single simulation step, the current state of each particle set (update packet) is forwarded from each process around the ring so that every process will eventually see the current state of the whole system. The forwarding of each update packet is interleaved with the partial calculation using the results of that packet. After seeing all update packets, the calculation for the current simulation step is complete.

To illustrate the potential impact of dependent global types on practical programming and performance, we have implemented the above algorithm using:

direct binary session programming (using SJ)
binary session implementations projected from the global dependent type (see Section 6)

and compare their performance. As explained in the paper, the restrictions of binary session typing in (1), where the session types specify the interactions over each separate ring link, require one fresh session to be created in each simulation step (to form the final ring link). However, dependent global types in (2) can avoid this problem by giving a holistic specification of the full ring system for m participants.

The complete source code for both implementation versions are available here.

The Benchmark Parameters, Environment and Results

The benchmarks were executed using a cluster of Sun Fire x4100 nodes with two 64-bit dual core Opteron 275 processors and 8GB of RAM, running 64-bit Mandriva Linux 10.2 (kernel 2.6.11). The nodes are connected via gigabit Ethernet with latency measured (56 byte ping) to be ~0.5ms on average. The benchmark applications were compiled and executed using the standard Sun Java SE compiler and runtime versions 1.5.0_03. We used the noalias versions of each implementation but, since each participant is a separate process, this is not an important factor in this experiment. We measured the time to run the simulation with 5 and 10 processes for 1, 10, 100 and 1000 simulation steps, and 1, 10, 100 and 1000 particles allocated to each processes; each parameter combination was measured 100 times.

The full results are available here.