

# **CAP-VMs: Capability-Based Isolation and Sharing in the Cloud**

Vasily A. Sartakov Imperial College London Lluís Vilanova Imperial College London David Eyers University of Otago

Takahiro Shinagawa The University of Tokyo Peter Pietzuch Imperial College London

## Abstract

Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the MMU enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large TCBs in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for *memory capabilities* offer new opportunities to implement isolation and sharing at a finer granularity.

We describe *cVMs*, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. cVMs share a single virtual address space safely, each having only capabilities to access its own memory. A cVM may include a library OS, thus minimizing its dependency on the cloud environment. cVMs efficiently exchange data through two capability-based primitives assisted by a small trusted monitor: (i) an asynchronous read/write interface to buffers shared between cVMs; and (ii) a call interface to transfer control between cVMs. Using these two primitives, we build more expressive mechanisms for efficient cross-cVM communication. Our prototype implementation using CHERI RISC-V capabilities shows that cVMs isolate services (Redis and Python) with low overhead while improving data sharing.

## 1 Introduction

Cloud environments require application compartmentalization. Today, isolation between application components is enforced by virtual machines (VMs) [10, 32, 63] and containers [2, 40], either separately or in combination. Yet, current applications push the limits of these mechanisms in terms of performance and security: when application components communicate heavily with each other, VMs and containers add substantial overheads, even when they are co-located to improve communication performance; furthermore, the implementation of the isolation mechanisms may also rely on a large trusted computing base (TCB). *VMs* provide strong isolation through a relatively narrow hardware interface. Since a guest VM has its own OS kernel, its TCB can be reduced to a relatively small hypervisor, which multiplexes VM access to the hardware [56]. Efficient inter-VM data sharing, however, is challenging to achieve due to performance and page granularity trade-offs [17,71].

In contrast, *containers* isolate processes into groups [2] and provide faster inter-process communication (IPC) primitives, including pipes, shared memory, and sockets. Similar to VMs, they face problems of page-level sharing granularity and overheads due to frequent user/kernel transitions. Their richer IPC primitives for data sharing come at the cost of a larger TCB—a shared OS kernel implements both namespace isolation between process groups and complex IPC primitives, increasing the likelihood of security vulnerabilities.

Existing cloud stacks thus face a fundamental tension when application components are compartmentalized but must communicate. They must either copy data or modify page tables, both of which are expensive operations that involve a privileged intermediary, e.g., a hypervisor or OS kernel, and lead to coarse-grained interfaces designed around page granularity.

In this work, we explore a different approach to designing a cloud stack that isolates application components, while supporting efficient sharing. We ask the question "if the hardware supported dynamic, low-overhead sharing of arbitrary-sized memory regions between otherwise isolated regions, how would this impact the cloud stack design?" We exploit hardware support for *memory capabilities* [23, 70], which impose flexible bounds on all memory accesses, allowing components to be isolated without page table modifications or adherence to page boundaries. This offers a new opportunity to design memory sharing primitives between isolated compartments with zero-copy semantics.

We describe **CAP-VMs** (**cVMs**), a new VM-like abstraction for executing isolated components and sharing data across them. cVMs are enforced by a small TCB that uses memory capabilities to isolate and share data between compartments efficiently. Through the use of a *hybrid* capability model [66], cVMs avoid having to port application components to use capability instructions, circumventing compatibility issues that typically plague capability architectures.

Using memory capabilities as part of a cloud stack, however, raises new challenges: the cloud stack must (i) support existing capability-unaware software without cumbersome code changes, bespoke compiler support, or manual management of capabilities across isolation boundaries; (ii) remain compatible with existing OS abstractions, e.g., POSIX interfaces, all while keeping the TCB small; and (iii) offer efficient IPC-like primitives for otherwise untrusted components to share data safely and take advantage of the potential zero-copy sharing enabled by capabilities.

To address the above challenges, cVMs make the following design contributions:

(1) Strong isolation through capabilities. Multiple cVMs share a single virtual address space safely through capabilities. Each cVM is sandboxed by a pair of *default* capabilities, which confine the accesses of all instructions inside a cVM to its own memory boundaries. To avoid having to port existing application components to a capability architecture, cVMs allow them to execute unmodified by using CHERI's *hybrid* capability architecture [66], which integrates capabilities with a conventional MMU architecture. In addition, cVMs strictly limit how CHERI capabilities can be used to avoid known capability revocation overheads: cVMs are not permitted to store or export capabilities, and the transitions of communication capabilities are controlled by a trusted component.

(2) Bespoke OS support through a library OS. cVMs are self-contained with a small TCB, reducing reliance on the external cloud stack, while providing POSIX compatibility. They include a bespoke *library OS* with POSIX interfaces for, e.g., filesystem and network operations with cryptography for transparent protection, which is protected from application code using capabilities. In the library OS, each cVM implements its own namespace for filesystem objects, virtual devices, cryptographic I/O keys etc. Only low-level resources, e.g., execution contexts for threads and I/O device operations, are shared and provided by an external host OS kernel.

(3) Efficient data sharing primitives. cVMs offer two low-level primitives to share data efficiently without exposing application code to capabilities, which are hidden behind a small, trusted *Intravisor*: (i) a *CP\_File* API allows application components to share arbitrary buffers through an asynchronous read/write interface. Under the hood, the cVM implementation uses capability-aware instructions to exchange the rights to safely access each other's memory, and read/write data at byte granularity at the cost of a single memory copy (whereas traditional file-oriented IPC would require two copies); and (ii) a *CP\_Call* API transfers control between cVMs, which, e.g., can be used to implement synchronization mechanisms. By combining these two primitives, higher-level APIs are possible: (iii) a *CP\_Stream* API supports efficient stream-oriented data exchange between cVMs with one memory copy.

We implement cVMs on the CHERI RISC-V64 architecture, executable on FPGA hardware with CHERI support and multicore RISC-V hardware. Our evaluation shows that cVMs provide a practical isolation abstraction with efficient data sharing: using the CP\_Stream API for inter-cVM communication reduces latency for Redis by up to 54% compared to classical socket interfaces, and reduces its standard deviation by up to $2.1\times$. When isolating a cryptography component of a Python-based service, cVMs introduce an overhead of up to 12% compared to a monolithic baseline.

## 2 Hardware Isolation Support

Next, we survey the design space for isolation and sharing in cloud environments in more detail (§2.1), provide background on capability support on modern hardware (§2.2), and describe our threat model (§2.3).

### 2.1 Isolation and sharing in the cloud

We argue that VMs and containers are two extremes of component isolation. VMs virtualize hardware interfaces such as page tables, instructions, traps, and physical device interfaces to manage both isolation and communication; containers virtualize pure software interfaces such as processes, files, and sockets for the same purposes.

**Compatibility.** Both VMs and containers are compatible with existing applications, which is critical for adoption in cloud environments. VMs can execute an unmodified guest OS on top of a hypervisor, making virtualization transparent to applications inside VMs. Conversely, containers execute unmodified applications on top of the same host OS kernel that manages other containerized and non-containerized applications. In both cases, OS interfaces and semantics used by the virtualized applications remain unmodified compared to a non-virtualized environment.

But the compatibility offered by these technologies lowers communication performance, which is often exacerbated as we try to achieve better isolation between components.

**Isolation.** Despite strict isolation between the memory of containers, there is a lack of isolation of the TCB that manages the virtualization mechanism itself. Conventional container platforms, e.g., Linux containers [2], share privileged state, as they employ namespace virtualization: the OS kernel creates separate process identifiers, devices, filesystem views etc., which offer the illusion that a process group exists in isolation. In reality, containers share kernel data structures, and privilege escalation inside one container may lead to the compromise of all containers [3,5]. In comparison, VMs are virtualized through narrower interfaces, resulting in a conceptually simpler hypervisor that is harder to compromise [15, 56].

Unfortunately, stronger isolation comes at a performance price from both known hardware inefficiencies [14,41,61] as well as less flexible mechanisms for data sharing.

**Sharing.** Components of cloud applications typically use networking as a means of communication. Even if multiple components are co-located on the same host, they may use a reliable network transport protocol, e.g., TCP. While this helps with scalability, it adds overhead for co-located components, making optimizations based on direct memory sharing attractive. Both VMs and containers use page-based memory isolation, which limits the performance of memory sharing: mechanisms must be aware of page boundaries to avoid leaking sensitive data, and page table modifications for on-demand sharing are known to be expensive [62].

Co-location opens up two avenues for performance improvements: (1) sharing can transparently speed up communication of co-located components [44, 47]; and (2) new communication interfaces can be tailored toward efficient sharing between components.

### 2.2 CHERI capability architecture

In cloud applications with many services [26], traditional network-based communication shows its performance limits between tightly-coupled components [33]. Therefore, we aim to co-locate components and design a cloud stack with efficient isolation and communication interfaces and mechanisms. This requires, however, new hardware support for isolation and sharing that is free of the "MMU tax" of page-level privileged memory protection.

*Memory capabilities* [18] are a protection and sharing mechanism supported by the hardware. The *CHERI* architecture [64, 70] implements capabilities as an alternative to traditional memory pointers. A capability is stored in memory or registers, and encodes an address range with permissions, e.g., referring to a read-only buffer or a callable function.

CHERI protects capabilities by enforcing three properties: (1) *provenance validity* ensures that a capability can only be "derived", i.e., constructed, from another valid capability, i.e., it is not possible to cast an arbitrary byte sequence to a capability; (2) *capability integrity* means that capabilities stored in memory cannot be modified, which CHERI achieves through transparent memory tagging [70]; and (3) *capability monotonicity* requires that, if a capability is stored in a register, its bounds and permissions can only be reduced, e.g., a read-only capability cannot be turned into a read-write one.
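The three properties can be illustrated with a small software model. The sketch below is not CHERI code: real capabilities are tagged hardware values, whereas `cap_model_t`, `cap_derive()` and `cap_allows()` are hypothetical names introduced here, and the tag is modelled as an ordinary boolean field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only: real CHERI capabilities are hardware-protected,
 * tagged values, not plain structs. All names here are hypothetical. */
enum { PERM_LOAD = 1u << 0, PERM_STORE = 1u << 1, PERM_EXECUTE = 1u << 2 };

typedef struct {
    bool      tag;    /* validity tag; false means "not a capability" */
    uintptr_t base;   /* first byte the capability may access */
    size_t    length; /* number of accessible bytes */
    unsigned  perms;  /* bitmask of PERM_* */
} cap_model_t;

/* Derive a capability from a parent. The result is valid only if the
 * parent is valid (provenance) and the requested bounds and permissions
 * do not exceed the parent's (monotonicity). */
static cap_model_t cap_derive(cap_model_t parent, uintptr_t base,
                              size_t length, unsigned perms)
{
    cap_model_t child = { false, base, length, perms };
    if (!parent.tag)
        return child;                       /* no valid parent */
    if (base < parent.base || length > parent.length ||
        base - parent.base > parent.length - length)
        return child;                       /* bounds would grow */
    if ((perms & ~parent.perms) != 0)
        return child;                       /* permissions would grow */
    child.tag = true;
    return child;
}

/* Check whether an access of `size` bytes at `addr` is permitted. */
static bool cap_allows(cap_model_t c, uintptr_t addr, size_t size,
                       unsigned perm)
{
    return c.tag && (c.perms & perm) == perm &&
           addr >= c.base && size <= c.length &&
           addr - c.base <= c.length - size;
}
```

In this model, deriving a read-only view of a sub-range succeeds, while any attempt to re-derive a read-write or wider capability from that view yields an untagged result that `cap_allows()` rejects.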

**Building capability-based compartments.** CHERI capabilities can be used to compartmentalize software components, e.g., plugins or libraries in a program, by giving each capabilities to separate memory regions. The above properties enforced by CHERI ensure that compartments can coexist in the same address space, and remain isolated as long as their initial capabilities point to disjoint data and code in memory. The application can, of course, grant each compartment extra capabilities, e.g., to allow particular cross-compartment memory accesses or function calls.

**Pure- and hybrid-cap code.** CHERI distinguishes between two execution modes [66]: (i) in *pure-cap* mode, all pointers must be capabilities,<sup>1</sup> and code must use a new set of capability-aware instructions; and (ii) in *hybrid-cap* mode, code can mix ordinary and capability-aware instructions, which allows the coexistence of capability-unaware and pure-cap code via wrapping functions. This facilitates the incremental adoption of capabilities in software.

When accessing memory, pure-cap code must use new instructions that use capability registers instead of regular registers. In addition, secure calls across capability-isolated components must use a CInvoke instruction, which requires a pair of capabilities: the target function address, and an arbitrary value that is meaningful to the callee function (e.g., an identifier for an object managed by the callee).

To ensure that both capabilities are used correctly by CInvoke, e.g., thwarting a malicious caller from passing a callee object identifier that was meant for a different callee function, the callee can "seal" pairs of capabilities together using the CSeal instruction. CInvoke only accepts correctly sealed pairs of capabilities.
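The pairing check can be sketched as a software model. The code below assumes that sealing records an object type and that an invocation requires both capabilities to be sealed with the same type; `sealed_cap_model_t`, `model_seal()` and `model_invoke()` are hypothetical names, and the model ignores permissions and bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software model only; field and function names are hypothetical. */
typedef struct {
    bool      tag;     /* valid capability */
    bool      sealed;  /* sealed capabilities cannot be used directly */
    uint32_t  otype;   /* object type recorded at sealing time */
    uintptr_t addr;    /* entry point (code) or object address (data) */
} sealed_cap_model_t;

/* Seal a valid, unsealed capability with the given object type. */
static sealed_cap_model_t model_seal(sealed_cap_model_t c, uint32_t otype)
{
    if (!c.tag || c.sealed) {
        c.tag = false;   /* sealing an invalid or already sealed cap fails */
        return c;
    }
    c.sealed = true;
    c.otype = otype;
    return c;
}

/* Model of the invocation check: both capabilities must be valid, sealed,
 * and carry the same object type. On success, the unsealed targets are
 * written to *new_pc and *callee_data and true is returned. */
static bool model_invoke(sealed_cap_model_t code, sealed_cap_model_t data,
                         uintptr_t *new_pc, uintptr_t *callee_data)
{
    if (!code.tag || !data.tag || !code.sealed || !data.sealed)
        return false;
    if (code.otype != data.otype)
        return false;    /* object was sealed for a different callee */
    *new_pc = code.addr;
    *callee_data = data.addr;
    return true;
}
```

A caller that combines a function sealed with one type and an object sealed with another is rejected by `model_invoke()`, which mirrors the protection against passing an object identifier to the wrong callee.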

Hybrid-cap code relies on two new capability registers, the *default data capability* (ddc) and the *program counter capability* (pcc), which are used implicitly by capability-unaware instructions. The OS starts all processes by setting ddc and pcc to the entire virtual address space. Capability-aware code then creates new capabilities from these registers, preserving CHERI's provenance, integrity and monotonicity properties.

Pure-cap code thus introduces compatibility challenges:

- All pointers in pure-cap code are capabilities that occupy 16 bytes instead of the ordinary 8 bytes, and must be 16-byte aligned. This decreases CPU cache effectiveness, and may require extra effort to align capability and non-capability elements in data structures.
- It is not possible to cast between addresses and various types of capability-based pointers, because CHERI distinguishes between them and imposes bounds on pointers [65].
   C/C++ code that uses raw casts—a commonly found idiom in low-level system software—requires substantial modifications. For example, the strict bounds in capabilities are typically incompatible with memory allocators that place metadata before allocated data.
- While CHERI compresses capabilities, they can still result in memory bloat, because larger sizes are subject to coarser address discretization. Large allocations with capabilities may require stronger alignment and extra padding [69].
- CHERI advocates for a trusted, system-wide *garbage collector* to manage capabilities to dynamically-allocated memory [66]. It is important to ensure that allocations are not reused while valid capabilities pointing to them still exist. Since new capabilities can be derived from existing ones, and stored on the heap, stack, and in registers, all capabilities derived from an allocation must be either invalidated (i.e., revoked), or allocations cannot be reused while such capabilities are valid. A garbage collector (as opposed to expensive hardware support for capability revocation) addresses this issue, but it is a disruptive change in cloud environments, potentially leading to delays in resource reclamation and increased tail latencies.

<sup>1</sup>CHERI has separate registers for regular data and capabilities.
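The size and alignment cost in the first item of the list above can be made concrete with a layout comparison. The sketch uses stand-in types rather than real CHERI pointers: `ptr64_model_t` and `cap128_model_t` are hypothetical names that only reproduce the 8-byte and 16-byte, 16-byte-aligned footprints.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the two pointer representations (hypothetical names). */
typedef struct { uint64_t addr; } ptr64_model_t;               /* 8 bytes */
typedef struct { alignas(16) unsigned char bits[16]; } cap128_model_t;

/* The same linked-list node with each pointer representation. */
struct node_ptr { uint32_t key; ptr64_model_t  next; };
struct node_cap { uint32_t key; cap128_model_t next; };

/* Bytes of padding the compiler inserts between `key` and `next`. */
static size_t node_ptr_padding(void)
{
    return offsetof(struct node_ptr, next) - sizeof(uint32_t);
}

static size_t node_cap_padding(void)
{
    return offsetof(struct node_cap, next) - sizeof(uint32_t);
}
```

With the 16-byte alignment requirement, `struct node_cap` occupies 32 bytes, 12 of which are padding after `key`; on a typical 64-bit ABI the ordinary-pointer variant occupies 16 bytes, so half as many nodes fit in a cache line once pointers become capabilities.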

Removing the need to use capability-aware code is important in cloud environments with limited control over tenant code. Therefore, we want to explore a design for a cloud stack that compartmentalizes application components using CHERI's hybrid-cap mode, without the disadvantages of pure capability-aware code.

### 2.3 Threat model

Cloud environments support multiple, isolated application components, and thus we consider attacks in which an attacker controls a malicious component that interferes with another component by probing interfaces or trying to escape its sandbox. We assume that the attacker has full control over the application components and a library OS, e.g., by exploiting vulnerabilities inside the compartment or by executing arbitrary code that includes capability-aware instructions.

Our TCB includes the underlying host OS kernel, but the entire application stack (program, libraries and library OS) is considered untrusted. We assume that the CHERI hardware implementation is correct. We do not analyse side-channel attacks against CHERI, which is an important, yet orthogonal consideration that affects both the architectural and micro-architectural levels [67].

## 3 cVM Design

cVMs are a new virtualisation and compartmentalization abstraction for application components. Such components can often be co-located and exchange data, and cVMs isolate them with support for low-overhead data exchange using CHERI capabilities. The design of cVMs has the following features:

**Separate namespaces.** Unlike containers, cVMs do not rely on a shared OS kernel for namespace isolation. They use capabilities to add a new userspace-level isolation boundary, moving OS kernel functionality from a privileged to an unprivileged layer. cVMs only use the host OS for execution contexts, synchronisation, and I/O, thus resembling VMs.

**Bypassed communication.** cVMs are mutually untrusted, but communication bypasses the host OS kernel for performance. They use capabilities for on-demand access to memory regions used for communication, without compromising neighbouring memory.

**Low-overhead isolation.** cVMs use capabilities for low-overhead isolation of both process and program modules. For example, cVMs can isolate shared libraries with minimal changes to the calling interface.

**Compatibility.** cVMs use CHERI's hybrid-cap mode. Capabilities are thus hidden from application code, which only needs changes to use new communication APIs.



Fig. 1: cVM architecture

### 3.1 Architecture overview

Fig. 1 shows the architecture of cVMs. Each cVM (a) is an application component, such as a process or library, and has three parts: (i) program binaries and their libraries; (ii) a standard C library; and (iii) a library OS.

cVMs add two new isolation boundaries, enforced through capabilities. The *Intravisor boundary* separates the *Intravisor* from all cVMs, and cVMs from each other. The Intravisor is responsible for the lifecycle and isolation of cVMs, allows safe communication between them, and provides other primitives that cannot be implemented inside the unprivileged library OS (e.g., storage and networking I/O, time, threading and synchronisation). It has access to the memory of all cVMs, but not the other way around.

The *Program boundary* separates programs from the library OS that provides them with the namespace for all OS primitives. A single library OS instance can thus host multiple, mutually-isolated programs with their own code and data (left-most cVM in Fig. 1).

These isolation boundaries are enforced by CHERI capabilities; compartmentalized content cannot access memory beyond its boundary, except through the controlled interfaces described next. Finally, there is a classical separation from the host OS, using CPU rings and MMU-based isolation.

### 3.2 Isolation boundaries

We now describe how cVMs are isolated in more detail (see Fig. 2). Each program compartment contains the code and data of its binary, its dependencies (shared libraries), and the standard C library; the cVM also contains the library OS, which provides the OS functionality.

Isolation boundaries are enforced by giving each compartment its own default CHERI capabilities using the pcc and ddc registers (see §2.2) with non-overlapping address ranges; compartmentalized code thus cannot load, store or jump into memory outside that granted by the capabilities that it holds. To allow ① program $\rightarrow$ libOS and ② libOS $\rightarrow$ Intravisor calls, cVMs use extra capabilities that grant controlled access to functions outside the respective compartment.
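The non-overlap requirement on default capabilities can be expressed as a simple check over address ranges. The following is a minimal sketch under the assumption that each compartment is described only by the bounds of its two default capabilities; `range_model_t`, `compartment_model_t` and the functions are hypothetical names, not part of the cVM implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base; size_t length; } range_model_t;

/* Bounds of the default capabilities of one compartment (model only). */
typedef struct { range_model_t pcc; range_model_t ddc; } compartment_model_t;

/* Half-open ranges [base, base + length) overlap iff each starts before
 * the other ends. Assumes base + length does not wrap around. */
static bool ranges_overlap(range_model_t a, range_model_t b)
{
    return a.base < b.base + b.length && b.base < a.base + a.length;
}

/* True if no default capability of one compartment reaches into the
 * code or data range of another compartment. */
static bool compartments_disjoint(const compartment_model_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (ranges_overlap(c[i].pcc, c[j].pcc) ||
                ranges_overlap(c[i].pcc, c[j].ddc) ||
                ranges_overlap(c[i].ddc, c[j].pcc) ||
                ranges_overlap(c[i].ddc, c[j].ddc))
                return false;
    return true;
}

/* Would a capability-unaware load or store of `size` bytes at `addr`
 * pass the bounds check of this compartment's default data capability? */
static bool ddc_permits(const compartment_model_t *c, uintptr_t addr,
                        size_t size)
{
    return addr >= c->ddc.base && size <= c->ddc.length &&
           addr - c->ddc.base <= c->ddc.length - size;
}
```

An access that starts inside the data range but extends past its end is rejected, as is any access to an address that belongs to a neighbouring compartment.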

Tab. 1: cVM API

| Type      | API function | Description |
|-----------|--------------|-------------|
| Creation  | <pre>cp_cvm_make(cp_config_t *cfg, char *libos, char *disk.img, int argc, char *argv[])</pre> | Create new cVM |
| CP_File   | <pre>cp_file_make(char *key, size_t key_size, void *addr, size_t size)<br/>cp_file_destroy(int file)<br/>cp_file_get(char *key, size_t key_size)<br/>cp_file_read,cp_file_write(int file, char *key, size_t key_size)<br/>cp_file_wait,cp_file_notify(int file)</pre> | Make CP_File for buffer addr & publish with key<br/>Destroy CP_File<br/>Get CP_File with key from another cVM<br/>Read/write data via CP_File file<br/>Wait/notify signal via CP_File file |
| CP_Call   | <pre>cp_call_make(char *key, size_t key_size, void *func)<br/>cp_call_destroy(int call)<br/>cp_call_get(char *key, size_t key_size)<br/>cp_call(int call, bool async, void *arg, size_t size)</pre> | Make CP_Call for func & publish with key<br/>Destroy previously created CP_Call<br/>Get CP_Call with key from another cVM<br/>Call CP_Call call with arguments |
| CP_Stream | <pre>cp_stream_make(char *key, size_t key_size)<br/>cp_stream_destroy(int stream)<br/>cp_stream_get(char *key, size_t key_size)<br/>cp_stream_send(int stream, void *buf, size_t size)<br/>cp_stream_recv(int stream, long id, void *buf, size_t size)<br/>cp_stream_poll(int stream, long *id, size_t nid, int timeout)</pre> | Make CP_Stream & publish with given key<br/>Destroy CP_Stream<br/>Get CP_Stream with key from another cVM<br/>Send buffer through CP_Stream<br/>Post buffer to receive through CP_Stream<br/>Poll for data on receive buffers of CP_Stream |



Fig. 2: Anatomy of a cVM

cVMs need to implement the equivalent of user/kernel separation using CHERI capabilities in userspace. When loading a program, a set of capabilities is therefore given to the syscall handler functions of the library OS. The standard C library uses these capabilities to invoke system calls on the library OS through the CInvoke instruction, while the rest of the application remains capability-unaware. The library OS has full access to the programs that it manages.

cVMs also need to implement the equivalent of guest/host (or VM/hypervisor) separation using CHERI capabilities in userspace. When creating a cVM, the Intravisor installs capabilities to its own host system call handlers on the new library OS instance; in turn, the library OS uses CInvoke to invoke Intravisor operations.

### 3.3 Creation and communication API

cVMs combine compatibility and flexibility when isolating cloud services. They support the execution of complete application components using a process isolation abstraction, but also that of individual library components.

Tab. 1 shows the cVM API. New cVMs are created by cp\_cvm\_make(); similar to fork()/exec(), it accepts a disk image file, a program binary to load into the cVM, and a function in that binary to launch. If a cVM isolates a standalone library, cp\_call() invokes functions in the library.

cVMs use CHERI capabilities for efficient inter-cVM communication. The Intravisor exchanges an initial set of capabilities between cVMs to allow communication.

**CP\_File.** This primitive introduces a file-like API to access memory from another cVM at arbitrary granularity; the use of capabilities in CP\_File permits bypassed access to memory without repeated mediation by the Intravisor.

A *donor* cVM registers a memory region with the Intravisor to share with other cVMs via cp\_file\_make(); a *recipient* cVM calls cp\_file\_get() with the same key to obtain access. The cVMs then access data in the memory region via cp\_file\_read/write(). Internally, the library OS uses capability-aware code to copy data directly between the cVMs (using capcpy; see §4).

To support asynchronous data transfers, cp\_file\_wait() and cp\_file\_notify() allow callers to wait for and notify events on a CP\_File, respectively. Finally, the donor cVM calls cp\_file\_destroy() to destroy it, revoking all access.
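The donor/recipient flow can be sketched as an in-process model. This is not the cVM implementation: the registry stands in for the Intravisor, plain `memcpy` stands in for the capability-aware copy, and all `model_*` names, as well as the offset and buffer arguments of the read/write functions, are assumptions made for illustration rather than the signatures listed in Tab. 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_FILES 8
#define MODEL_KEY_MAX   32

/* One published buffer. Hypothetical model, not the cVM implementation. */
typedef struct {
    int            live;
    char           key[MODEL_KEY_MAX];
    size_t         key_size;
    unsigned char *addr;   /* donor buffer */
    size_t         size;   /* exact byte length, no page rounding */
} model_file_t;

static model_file_t model_files[MODEL_MAX_FILES];

/* Donor side: publish `addr` under `key`; returns a handle or -1. */
static int model_file_make(const char *key, size_t key_size,
                           void *addr, size_t size)
{
    if (key_size == 0 || key_size > MODEL_KEY_MAX)
        return -1;
    for (int i = 0; i < MODEL_MAX_FILES; i++) {
        if (model_files[i].live)
            continue;
        model_files[i].live = 1;
        memcpy(model_files[i].key, key, key_size);
        model_files[i].key_size = key_size;
        model_files[i].addr = addr;
        model_files[i].size = size;
        return i;
    }
    return -1;
}

/* Recipient side: look up a published buffer by key. */
static int model_file_get(const char *key, size_t key_size)
{
    for (int i = 0; i < MODEL_MAX_FILES; i++)
        if (model_files[i].live && model_files[i].key_size == key_size &&
            memcmp(model_files[i].key, key, key_size) == 0)
            return i;
    return -1;
}

/* Handle and bounds check shared by read and write. */
static model_file_t *model_file_check(int file, size_t off, size_t n)
{
    if (file < 0 || file >= MODEL_MAX_FILES || !model_files[file].live)
        return NULL;
    model_file_t *f = &model_files[file];
    return (off <= f->size && n <= f->size - off) ? f : NULL;
}

/* A single copy from the caller's buffer into the shared buffer. */
static int model_file_write(int file, size_t off, const void *src, size_t n)
{
    model_file_t *f = model_file_check(file, off, n);
    if (!f)
        return -1;
    memcpy(f->addr + off, src, n);
    return 0;
}

/* A single copy from the shared buffer into the caller's buffer. */
static int model_file_read(int file, size_t off, void *dst, size_t n)
{
    model_file_t *f = model_file_check(file, off, n);
    if (!f)
        return -1;
    memcpy(dst, f->addr + off, n);
    return 0;
}

/* Donor side: destroying the entry invalidates every handle to it. */
static void model_file_destroy(int file)
{
    if (file >= 0 && file < MODEL_MAX_FILES)
        model_files[file].live = 0;
}
```

Each transfer in the model performs one bounds-checked copy at byte granularity, and after `model_file_destroy()` both lookups and accesses through existing handles fail.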

**CP\_Call.** This primitive invokes functions outside the calling cVM, e.g., a callback function in the library OS, or a function in a shared library. cVMs manage CP\_Calls as follows: cp\_call\_make() registers a function in the donor that recipients can look up using cp\_call\_get() and then call with cp\_call(). The call is received by the Intravisor, which creates a new thread in the donor's cVM, sets it to execute the target function with the given arguments and, optionally, waits for its completion, based on the async argument.

**CP\_Stream.** By composing the CP\_Files and CP\_Calls APIs, it is possible to construct more complex communication mechanisms. For example, we have built a stream-oriented API for inter-cVM communication in which the sender does not need to know where data is copied.

A recipient cVM calls cp\_stream\_recv() to register buffers for incoming messages (internally, a list of CP\_Files); a sender cVM calls cp\_stream\_send() to copy data into any of the buffers available in the recipient. The recipient is then informed of data transfers when calling cp\_stream\_poll().
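The post/send/poll protocol can be sketched as a single-threaded model. It is an illustration under stated assumptions: the `model_*` names are hypothetical, the stream is passed as a pointer instead of an integer handle, the timeout of cp\_stream\_poll() is omitted, and the `lens` output of the poll function is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_STREAM_SLOTS 4

enum slot_state { SLOT_FREE, SLOT_POSTED, SLOT_FILLED };

typedef struct {
    enum slot_state state;
    long   id;        /* recipient-chosen identifier of the buffer */
    void  *buf;       /* receive buffer posted by the recipient */
    size_t capacity;  /* size of the receive buffer */
    size_t len;       /* bytes written by the sender */
} model_slot_t;

typedef struct { model_slot_t slots[MODEL_STREAM_SLOTS]; } model_stream_t;

/* Recipient: post a buffer for incoming data. Returns 0, or -1 if full. */
static int model_stream_recv(model_stream_t *s, long id, void *buf,
                             size_t size)
{
    for (size_t i = 0; i < MODEL_STREAM_SLOTS; i++) {
        model_slot_t *sl = &s->slots[i];
        if (sl->state != SLOT_FREE)
            continue;
        sl->state = SLOT_POSTED;
        sl->id = id;
        sl->buf = buf;
        sl->capacity = size;
        sl->len = 0;
        return 0;
    }
    return -1;
}

/* Sender: copy into the first posted buffer that is large enough. The
 * sender never names a destination. Returns bytes copied, or -1. */
static long model_stream_send(model_stream_t *s, const void *buf,
                              size_t size)
{
    for (size_t i = 0; i < MODEL_STREAM_SLOTS; i++) {
        model_slot_t *sl = &s->slots[i];
        if (sl->state != SLOT_POSTED || sl->capacity < size)
            continue;
        memcpy(sl->buf, buf, size);   /* the single memory copy */
        sl->len = size;
        sl->state = SLOT_FILLED;
        return (long)size;
    }
    return -1;
}

/* Recipient: report up to `nid` filled buffers and release their slots. */
static size_t model_stream_poll(model_stream_t *s, long *ids, size_t *lens,
                                size_t nid)
{
    size_t n = 0;
    for (size_t i = 0; i < MODEL_STREAM_SLOTS && n < nid; i++) {
        model_slot_t *sl = &s->slots[i];
        if (sl->state != SLOT_FILLED)
            continue;
        ids[n] = sl->id;
        lens[n] = sl->len;
        sl->state = SLOT_FREE;
        n++;
    }
    return n;
}
```

The sender only supplies its data, and the recipient learns which of its posted buffers was used from the identifiers returned by the poll call.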

### 3.4 Capability management

The use of CHERI capabilities introduces two problems that cVMs must address: the need for application code to become capability-aware, and performance problems when revoking capabilities.

As explained in §2.2, making an application fully capability-aware requires code changes. The design of cVMs avoids this by limiting the use of capability-aware code to a small portion of the standard C library, the library OS and the Intravisor, which explicitly handle the CP\_Files and CP\_Calls abstractions through syscall trampolines.

In the cVM design, we want to avoid centralized trusted mechanisms for capability revocation (see §2.2), as this goes against our goal of minimizing overheads and TCB size. Therefore, only the Intravisor is permitted to store CHERI capabilities in memory: all capabilities that are passed by the Intravisor to cVMs have the CAP\_STORE permission withheld. Instead of having to perform expensive garbage collection, revocation can now be done by clearing a small number of capability registers. This can be done efficiently when programs call the cVM API to avoid interrupting execution.
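The reason revocation becomes cheap can be shown with a small model. It assumes that, because cVMs cannot store capabilities in memory, every copy of a granted capability lives in one of a fixed number of per-thread capability registers; `model_thread_t`, `MODEL_CAP_REGS` and the functions below are hypothetical names, not the Intravisor's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAP_REGS 4   /* capability registers per thread (assumed) */

typedef struct {
    bool tag;      /* register currently holds a valid capability */
    int  object;   /* identifier of the shared region it refers to */
} model_reg_t;

typedef struct { model_reg_t regs[MODEL_CAP_REGS]; } model_thread_t;

/* Grant: place a capability for `object` in a free register.
 * Returns the register index, or -1 if no register is free. */
static int model_grant(model_thread_t *t, int object)
{
    for (int i = 0; i < MODEL_CAP_REGS; i++) {
        if (t->regs[i].tag)
            continue;
        t->regs[i].tag = true;
        t->regs[i].object = object;
        return i;
    }
    return -1;
}

/* Does the thread still hold a capability for `object`? */
static bool model_holds(const model_thread_t *t, int object)
{
    for (int i = 0; i < MODEL_CAP_REGS; i++)
        if (t->regs[i].tag && t->regs[i].object == object)
            return true;
    return false;
}

/* Revoke: clear every register that refers to `object`. The work is
 * bounded by nthreads * MODEL_CAP_REGS; no memory scan is needed because
 * no copies can exist outside the registers. Returns the number cleared. */
static size_t model_revoke(model_thread_t *threads, size_t nthreads,
                           int object)
{
    size_t cleared = 0;
    for (size_t t = 0; t < nthreads; t++)
        for (int i = 0; i < MODEL_CAP_REGS; i++)
            if (threads[t].regs[i].tag &&
                threads[t].regs[i].object == object) {
                threads[t].regs[i].tag = false;
                cleared++;
            }
    return cleared;
}
```

In contrast to a garbage collector, the cost of `model_revoke()` does not depend on the size of the heap or stack of any cVM.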

## 4 Implementation

Next, we report implementation details of cVMs on the CHERI RISC-V64 platform. Our implementation consists of 5,200 lines of C code and 100 lines of assembly for the Intravisor, and 1,800 lines of C code and 200 lines of assembly for the Init service, the Hostcall interface and CAP Devices. It uses the Linux Kernel Library (LKL) v4.17.0 [36] as the library OS and the musl standard C library v1.2.1 [42]. As the host OS kernel, we use CheriBSD [25].

### 4.1 cVM lifecycle

**Initialisation.** The boot process of a cVM is triggered by the Intravisor. It receives a deployment configuration for the cVM, which includes the heap size, the disk image location, the permitted interfaces, etc. It also defines the version and location of an Init service (see below) and the library OS binaries. The Intravisor first allocates memory for the cVM binary, stack and heap. It also allocates memory for the thread stack pool. Our implementation of cVMs cannot change the size of the heap and stack at runtime, but this is a minor limitation: the sizes refer to virtual memory, which is only committed to physical memory on demand. Just as cloud providers prefer re-instantiating VMs over the use of memory ballooning, we expect large changes in resource size to be handled by re-instantiating cVMs.

All threads must be created inside a compartment's memory, thus the Intravisor pre-allocates memory for future thread stacks. After that, the Intravisor deploys the image of the Init service into the cVM and spawns the initial thread in the context of the cVM. This thread prepares the hostcall callback tables, and enters the cVM via the CInvoke-based interface created by the Intravisor.



The Init service (see Fig. 2) is responsible for initializing all components at deployment, and creates the communication interface between the library OS and the host system. It is part of the library OS isolation layer, which means that it can access the memory of the application component. It initialises the library OS, builds the syscall interface for the program (or library), deploys its binary and calls the entry function (e.g., c\_start()). For an executable binary, it launches the program; for a library, the entry function initializes a CP\_Stream and registers the public library functions with the Intravisor.

**Execution.** cVMs use the Linux kernel library (LKL) [36] as a library OS that provides a Linux-compatible environment. LKL processes system calls and requests the host OS kernel to perform actions as needed.

LKL's storage and networking backends implement lean interfaces for hardware I/O devices: disk I/O has three hostcalls (disk\_read/write(), disk\_getsize()); networking uses only net\_read/write(). The disk\_read/write functions are applied to a file descriptor of the disk image; the network functions are invoked on a TAP device. The remaining functions in the hostcall interface are straightforward: they offer support for time and timer functions, debug output, threading and locking, and management of CAP Devices (see §4.3).

**Threading.** For simplicity, cVMs use a 1-to-1 threading model. When a cVM creates a thread, the pthread library requests an execution context from LKL, which in turn, requests a new thread from the host OS kernel. This requires the integration of the pthread implementations inside the cVM and the host—both must maintain their own thread-local stores, pointers to thread\_structs, etc.

When LKL requests a thread, it prepares a structure with an address of the entrance function, and a pointer to the arguments. This is passed to the host OS kernel, and the Intravisor creates a new thread with the provided arguments: it allocates a stack for the thread from the thread stack pool, pre-allocated at boot. After that, the new thread is ready to enter the cVM using CInvoke and the capabilities created by the hostcall interface. Prior to entering, the Intravisor switches the thread pointer tp register. Inside a cVM, threads have LKL TP values; when processing hostcalls, they have host ones.
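As a rough illustration of the two ideas in this paragraph, the sketch below models a pre-allocated stack pool and the switch between the host and LKL thread pointers around a hostcall. The class names, the free-list policy and the `hostcall` wrapper are assumptions made for the example only.

```python
class ThreadStackPool:
    """Fixed pool of equally sized stacks, carved out at boot."""

    def __init__(self, base, stack_size, count):
        self.stack_size = stack_size
        self._free = [base + i * stack_size for i in range(count)]

    def alloc(self):
        if not self._free:
            raise MemoryError("thread stack pool exhausted")
        return self._free.pop()

    def free(self, stack_base):
        self._free.append(stack_base)


class ThreadContext:
    """Keeps one thread pointer per side; `tp` models the live register."""

    def __init__(self, host_tp, lkl_tp):
        self.host_tp, self.lkl_tp = host_tp, lkl_tp
        self.tp = host_tp

    def enter_cvm(self):
        self.tp = self.lkl_tp          # inside the cVM: LKL TP value

    def hostcall(self, fn, *args):
        self.tp = self.host_tp         # while processing the hostcall: host TP
        try:
            return fn(self, *args)
        finally:
            self.tp = self.lkl_tp      # restored before returning to the cVM
```

The pool never grows, which mirrors the constraint that thread stacks must come from memory reserved inside the compartment at boot.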

#### 4.2 Calls between nested compartments

cVMs use the CInvoke instruction to call functions between isolation layers, both (i) from an outer to an inner layer (ICALL), e.g., when the Intravisor invokes Init; and (ii) from an inner to an outer layer (OCALL), e.g., when performing a syscall or hostcall.

CInvoke takes two *sealed* capabilities (see §2.2) as arguments: (i) one with a new Program Counter Capability (pcc) value; and (ii) another that points to a memory region that becomes accessible after the instruction execution. The pcc is replaced by the first unsealed capability; the second capability moves to the ct6 (C31) register in the unsealed form.
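The effect of CInvoke can be modelled in a few lines. This is a simplified sketch, not the ISA definition: the `Cap` and `Cpu` types are hypothetical, tags and permission bits are omitted, and the checks for matching object types and for one executable and one non-executable operand are assumptions based on the CHERI ISA description rather than on this paper.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Cap:
    """Toy capability; otype is None while the capability is unsealed."""
    base: int
    length: int
    executable: bool = False
    otype: Optional[int] = None

    @property
    def sealed(self) -> bool:
        return self.otype is not None


def seal(cap: Cap, otype: int) -> Cap:
    if cap.sealed:
        raise PermissionError("capability is already sealed")
    return replace(cap, otype=otype)


class Cpu:
    def __init__(self) -> None:
        self.pcc: Optional[Cap] = None
        self.ct6: Optional[Cap] = None


def cinvoke(cpu: Cpu, code: Cap, data: Cap) -> None:
    """Unseal both operands atomically: code becomes pcc, data lands in ct6."""
    if not (code.sealed and data.sealed):
        raise PermissionError("both operands must be sealed")
    if code.otype != data.otype:
        raise PermissionError("operands are sealed with different object types")
    if not code.executable or data.executable:
        raise PermissionError("expected one code and one data capability")
    cpu.pcc = replace(code, otype=None)
    cpu.ct6 = replace(data, otype=None)
```

The point of the model is that the caller never holds the unsealed forms: it can only hand the sealed pair to `cinvoke`, which installs them and transfers control in a single step.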

Next, we explain how CInvoke is used to implement both ICALLs and OCALLs:

**ICALLs.** Fig. 3 shows the switching mechanism for ICALLs. In this example, the Intravisor in the outer layer calls Init in the inner layer. To make the call, the caller prepares the first capability that points to the entry point inside the compartment. This capability, together with the corresponding data capability, defines the default capabilities of the inner compartment. Inside the compartment, these capabilities, COMP.DDC and ENTRY.PCC, become ddc and pcc, respectively. While the ENTRY.PCC capability can be passed as the first argument of CInvoke, COMP.DDC must be loaded by the caller prior to switching (see Fig. 3).

To return from the compartment or grant permission to invoke functions in the outer layer from the inner layer, further capabilities are needed: these are stored in memory by the Intravisor before CInvoke is called, in a structure that we call the *Affix*. They include a sealed ddc of the outer layer (MON.DDC.sealed). Without this capability, the Intravisor could not change ddc from the inner to the outer layer on return in order to access the Intravisor's data. This capability can only be fetched from the inner layer—the accessible memory is restricted by the ddc of the inner layer.

The Affix also includes RET.sealed and OCALL.sealed, which are two sealed pcc capabilities to entry functions in the outer layer. The former is used to return from the compartment; the latter points to an entry function, which is used when the inner layer calls a function of the outer layer (e.g., print()) and returns to continue execution inside the compartment. This is used for the syscall and hostcall interfaces. Capabilities in the Affix are created by the Intravisor and stored on the stack and inside per-compartment private stores.

**OCALLs** share many similarities with ICALLs. The caller prepares a sealed capability of the return address. After the end of a function, the callee uses CInvoke and the execution of the caller continues from the desired address. Together with CInvoke, the callee passes the sealed capability MON.DDC.sealed, which was passed originally inside the Affix. It is put into ddc after the function returns.



Fig. 4: Implementation of communication mechanisms

#### 4.3 Communication mechanisms

The data sharing API between cVMs from §3.3 is also based on capabilities. Data referenced by capabilities, however, can only be manipulated by capability-aware instructions, which do not exist in native code. To resolve this issue, we *mediate* the interaction between hybrid-cap code and capabilities using virtual devices called *CAP Devices*.

The CP\_Files, CP\_Calls, and CP\_Streams primitives are implemented using character devices, which are created by the library OS and Intravisor. A program can read/write from/to these devices, and the corresponding operations are performed by capability-aware code inside drivers.

This design has two advantages: (i) despite its one memory copy, it is faster than traditional communication interfaces (see §6.5); and (ii) it supports a simple mechanism to revoke capabilities. A remote cVM can inform the Intravisor of the revocation, which then requests the library OS to destroy the corresponding CAP Device. To revoke capabilities in pure-cap code, the Intravisor would have to stop the cVM execution and destroy capabilities manually.

**CP\_Files** support regular POSIX file operations. In contrast to ordinary files, the content of CP\_Files is not cached by the page cache, and read/write operations can be unaligned.

Fig. 4a shows the implementation. A donor cVM advertises one or more memory regions defined by *keys*, and a recipient cVM probes the Intravisor for a given key. The Intravisor verifies the access control list and builds a CAP Device for the target CP\_File (e.g., /dev/cf0). For the donor cVM to revoke access, it uses its own CAP Device to request revocation, and the Intravisor, together with the library OS, destroys the CP\_Files (cf0) driver along with its capabilities.

When the recipient cVM issues a cp\_file\_read() call, the driver uses *capcpy* to copy data. For cp\_file\_read(), it uses ld.cap to read data from a remote cVM and store it via sd; a cp\_file\_write() does the reverse.
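The following sketch models the driver-side copy under simplifying assumptions: a capability is represented as a bounds-checked window into a shared `bytearray`, and `driver_read`/`driver_write` stand in for the capcpy paths of the read and write operations. All names and signatures are hypothetical; the real driver uses capability load instructions rather than Python slicing.

```python
class DataCap:
    """Toy data capability: a window [base, base + length) into shared memory."""

    def __init__(self, memory: bytearray, base: int, length: int,
                 writable: bool = True):
        self._mem = memory
        self.base, self.length, self.writable = base, length, writable

    def _check(self, offset: int, n: int) -> None:
        if offset < 0 or n < 0 or offset + n > self.length:
            raise IndexError("access outside capability bounds")

    def load(self, offset: int, n: int) -> bytes:
        self._check(offset, n)
        start = self.base + offset
        return bytes(self._mem[start:start + n])

    def store(self, offset: int, data: bytes) -> None:
        if not self.writable:
            raise PermissionError("capability does not permit stores")
        self._check(offset, len(data))
        start = self.base + offset
        self._mem[start:start + len(data)] = data


def driver_read(remote: DataCap, offset: int, local: bytearray, n: int) -> int:
    """Read path: load through the capability, store into local memory."""
    if n > len(local):
        raise ValueError("local buffer too small")
    local[:n] = remote.load(offset, n)
    return n


def driver_write(remote: DataCap, offset: int, local: bytes) -> int:
    """Write path: the reverse direction, storing through the capability."""
    remote.store(offset, bytes(local))
    return len(local)
```

Offsets are relative to the capability base and need not be aligned, which matches the statement that CP\_File reads and writes can be unaligned.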

**CP\_Calls.** To expose a function, a cVM creates an ICALL entry and registers it with the Intravisor (see Fig. 4b). The Intravisor maintains a table of exported functions for each cVM, called cVM-RPCs. It consists of access control records with capabilities, name identifiers and permissions. Application components interact with the cVM-RPCs via CAP Devices, a management interface (/dev/cf), and the Intravisor.
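A minimal sketch of such a table is shown below, assuming that each record is keyed by the exporting cVM and a function name and carries a set of permitted callers. The `RpcTable` class and its methods are hypothetical, and a plain Python callable stands in for the sealed entry capabilities that the real table would hold.

```python
class RpcTable:
    """Per-cVM table of exported functions with access control records."""

    def __init__(self):
        self._entries = {}

    def register(self, owner, name, entry, allowed):
        # `entry` stands in for the capabilities of the ICALL entry point.
        self._entries[(owner, name)] = (entry, frozenset(allowed))

    def lookup(self, caller, owner, name):
        try:
            entry, allowed = self._entries[(owner, name)]
        except KeyError:
            raise LookupError(f"{owner}:{name} is not exported") from None
        if caller not in allowed:
            raise PermissionError(f"{caller} may not call {owner}:{name}")
        return entry

    def unregister(self, owner, name):
        self._entries.pop((owner, name), None)


table = RpcTable()
table.register("crypto", "encrypt", lambda data: data[::-1], allowed={"python"})
encrypt = table.lookup("python", "crypto", "encrypt")
print(encrypt(b"abc"))
```

Keeping the access check in the lookup step means that a caller only ever obtains an entry it is entitled to invoke.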

Any function can be invoked by CP\_Calls, including ones inside the library OS. This enables the use of CP\_Calls as a notification mechanism between CP\_Files. The donor blocks execution until the recipient cVM reads data. It makes the wait() call with the driver; the driver puts the execution thread in the work queue and waits for the signal. Prior to blocking, it registers a wake-up CP\_Call with the Intravisor. The recipient cVM, in turn, finishes its operations with the CP\_Files, and notifies the donor via this CP\_Call.

These basic operations can be composed to create higher-level protocols, and a single CAP Device can handle multiple memory regions. For example, for Redis (see §6.3), we use a series of read/write operations with a single notification as well as batched reads with different capabilities.

**CP\_Streams.** In contrast to CP\_Files, when sending data, the destination for CP\_Streams is unknown, and cp\_stream\_send() only knows the source. Therefore, one side of the communication pre-registers one or more destination buffers via cp\_stream\_recv(), and uses cp\_stream\_poll() to block. The remote side uses CP\_Call to enter the remote compartment, atomically fetches one destination buffer from a pre-registered queue of buffers, and copies data into this buffer via capcpy. It then wakes up the poll queue and returns.
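The protocol can be sketched with a queue of pre-registered buffers and a condition variable. This is a toy model under stated assumptions: the `CpStream` class and its method signatures are hypothetical, threads stand in for cVMs, a slice copy stands in for capcpy, and truncating oversized messages or returning `False` when no buffer is registered are choices made for the example only.

```python
import threading
from collections import deque


class CpStream:
    def __init__(self):
        self._free = deque()    # destination buffers registered by the receiver
        self._ready = deque()   # (buffer, nbytes) pairs filled by the sender
        self._cond = threading.Condition()

    def recv(self, buf: bytearray) -> None:
        """Receiver: pre-register one destination buffer."""
        with self._cond:
            self._free.append(buf)

    def send(self, data: bytes) -> bool:
        """Sender: take one registered buffer, copy into it, wake the poller."""
        with self._cond:
            if not self._free:
                return False
            buf = self._free.popleft()
            n = min(len(data), len(buf))
            buf[:n] = data[:n]              # stands in for capcpy
            self._ready.append((buf, n))
            self._cond.notify()
            return True

    def poll(self, timeout=None):
        """Receiver: block until a buffer has been filled, or time out."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._ready, timeout):
                return None
            return self._ready.popleft()
```

The sender never learns the destination address itself; it only obtains a buffer that the receiver chose to expose, which is the property the text describes.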

**Hostcall Interface.** The Intravisor does not impose restrictions on the number of calls in the hostcall interface. For the LKL library OS, the Intravisor provides 24 hostcalls for minimal operation. In addition, 3 hostcalls are necessary for disk I/O, 2 for network I/O, and 10 for the capability-based communication primitives.

#### 4.4 Capability revocation

Data transfers (capcpy) are performed by the drivers of CAP Devices without direct involvement of the Intravisor, which enhances performance and reduces the TCB. This, however, means that the driver must have access to the capabilities provided by the donor. We do not consider the driver trusted, thus it may be compromised by an adversary who obtains access to capabilities and memory outside the cVM after the end of a communication session. To mitigate against this threat, cVMs support a revocation mechanism. It guarantees that, once the donor cVM revokes capabilities, they are destroyed, and a recipient cVM cannot use them.

First, cVMs or communication capabilities are not created with the PERMIT\_STORE\_CAP permission. Code inside a cVM thus cannot store capabilities to memory: it can load them, modify, create new capabilities, but it fails on ST. The communication capabilities are stored once by the Intravisor, when the communication is established, and destroyed at the end. Second, the revoked capabilities in the CPU context are destroyed after a context switch by the host OS kernel.
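The first mechanism can be illustrated with a small model in which storing a capability to memory is authorised by the capability used for the store. Everything here is a hypothetical simplification: permissions are plain strings, `CapStore` represents capability-holding memory, and the second mechanism (destroying register copies after a context switch) is not modelled.

```python
class Cap:
    def __init__(self, name, perms):
        self.name = name
        self.perms = frozenset(perms)


class CapStore:
    """Memory slots that may hold capabilities."""

    def __init__(self):
        self._slots = {}

    def store_cap(self, via: Cap, slot: str, value: Cap) -> None:
        # A capability store is only allowed through an authorising
        # capability that carries the store-capability permission.
        if "STORE_CAP" not in via.perms:
            raise PermissionError(f"{via.name} lacks PERMIT_STORE_CAP")
        self._slots[slot] = value

    def load_cap(self, slot: str):
        return self._slots.get(slot)

    def clear(self, slot: str) -> None:
        self._slots.pop(slot, None)


memory = CapStore()
intravisor = Cap("intravisor", {"LOAD", "STORE", "LOAD_CAP", "STORE_CAP"})
cvm_ddc = Cap("cvm.ddc", {"LOAD", "STORE", "LOAD_CAP"})   # no STORE_CAP
session = Cap("session", {"LOAD", "STORE"})

# The Intravisor stores the communication capability once, at setup.
memory.store_cap(intravisor, "cf0", session)

# Code inside the cVM can load it, but cannot stash a second copy.
loaded = memory.load_cap("cf0")
try:
    memory.store_cap(cvm_ddc, "hidden-copy", loaded)
    copied = True
except PermissionError:
    copied = False

# Revocation: the single stored copy is destroyed.
memory.clear("cf0")
print(copied, memory.load_cap("cf0"))
```

Since only one in-memory copy can exist, clearing that slot is sufficient to remove the capability from memory in this model.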

## 5 Security Analysis

According to our threat model from §2.3, an attacker can gain control over a cVM. However, we guarantee that they cannot escape the compartment or access memory beyond its boundary due to the CHERI architectural properties (see §2.2): the ddc and pcc capabilities always apply, are non-extensible, and are controlled by the Intravisor.

Hybrid-cap code may be vulnerable to attacks that attempt to break execution flow. An adversary may inject capability-aware instructions (e.g., CLD/CSD, CInvoke) to access data and code outside of the compartment. To do this, the adversary requires capabilities, which they cannot construct from the available data inside a cVM.

To escape a compartment, an adversary must obtain appropriate capabilities. Each cVM, however, only maintains a few capabilities: a compartment (i) receives three sealed capabilities via Affixes, which can be inspected by an adversary but not unsealed to create new capabilities; and (ii) may receive capabilities used by CP\_Files and CP\_Streams. These capabilities can be exploited by an adversary after gaining full control over the library OS. Since these are data capabilities, they cannot be used to create code capabilities, which are needed to escape the compartment. The adversary also cannot store these capabilities due to their permissions. Finally, they also cannot be exported outside of the compartment via the hostcall interface, because the interface does not handle capabilities and instead corrupts them.

Hybrid-cap code may contain security flaws, but an adversary cannot escape confinement, unless a flaw in the outer level provides them with unsealed capabilities. In our design, this is unlikely due to the Intravisor's small TCB. The adversary cannot export or import capabilities via the hostcall interface or use them beyond a communication session. Vulnerable hybrid-cap code cannot abuse host system calls, escalate privileges or attack other cVMs, because the host OS kernel ignores all direct system calls from cVMs.

cVMs are intra-process compartments that share microarchitectural state and rely on the correctness of the CHERI architecture, which does not have special mechanisms to prevent side-channel attacks. Nonetheless, there are plans for CHERI to include explicit compartment identifiers (CIDs) in a future version of the architecture [67]. This will ensure that sensitive micro-architectural state is appropriately tagged by each cVM, similar to tagged TLB entries. This can be used to prevent attacks, such as training the branch predictor by one cVM to direct speculative execution in another cVM.

## 6 Evaluation

We now explore the performance of cVMs and the proposed communication interfaces. We begin with an overview of our evaluation platforms and workloads (§6.1). We then compare the performance of applications deployed with cVMs and Docker containers (§6.2). In §6.3, we validate the efficiency of inter-cVM communication mechanisms; in §6.4, we explore the use of cVMs for component compartmentalisation; and in §6.5, we compare inter-cVM communication mechanisms with existing OS mechanisms. Finally, §6.6 explores the deployment performance of cVMs and Docker containers.

#### 6.1 Experimental environment

The CHERI architecture is under active development and, while ARM's Morello board with CHERI support has been announced [9], it is unavailable at the time of writing. Therefore, we use two **evaluation platforms**: (1) a single-core FPGA-based CHERI implementation [21]; and (2) a multicore SiFive RISC-V implementation without CHERI support.

*FPGA CHERI*. We synthesize an FPGA image from DARPA's CHERI FETT program [22] (agfi-026d853003d6c433a), that ships with a single-core RISC-V64 CHERI system based on the FLUTE core (5-stage, in-order pipeline, running at 100 MHz) [49], and execute it on AWS F1 [8]. We use CheriBSD as the host OS kernel, compiled as a hybrid-cap system with LLVM v11.0.0 and cheribuild [16].

The FPGA implementation enables a quantitative evaluation of cVMs, but has limitations: (i) it has a single-core CPU with low clock frequency; (ii) its peripheral devices, in particular storage devices, are emulated by the host; and (iii) DRAM latency is disproportionately low compared to the CPU clock speed. As a consequence, we cannot realistically execute typical cloud workloads that are memory- and I/O-bound and use multiple CPU cores. We also cannot eliminate system noise by pinning tasks to separate cores.

*SiFive RISC-V.* To avoid the abovementioned limitations, we also evaluate cVMs on a HiFive Unmatched RISC-V board [30], which has 4 RISC-V64 (dual-issue, in-order) CPU cores running at 1.2 GHz. The CPU does not have CHERI support, and we instead replace all CHERI instructions with their native RISC-V versions. Our applications execute on Ubuntu v20.04 with Linux v5.11.0 and the RISC-V Docker port [48] with Alpine containers [7]. Our IPC microbenchmarks execute on FreeBSD 14, as the FPGA version uses CheriBSD, and we run them on both platforms.

This approach allows us to execute realistic cloud applications. We run CHERI-equivalent code and data paths while remaining compatible with existing RISC-V platforms (e.g., by replacing capability loads/stores with ordinary ld/st instructions, CInvoke with jr, etc.). Note that security is therefore not enforced.

**Application workloads.** We explore cVMs using several cloud applications and micro-benchmarks to evaluate their performance and isolation requirements:

*NGINX/Redis (§6.2).* This is a two-tier microservice deployment that evaluates the YCSB benchmark [72] using the NGINX [43] web server and the Redis [46] key/value store. NGINX acts as an API gateway and translates REST requests into Redis queries. When co-located, these services have a substantial amount of communication between them. We demonstrate that the cVM interfaces, CP\_Files and CP\_Streams, significantly reduce overhead, using the SiFive platform to compare cVMs against a deployment using Docker containers [40].

Fig. 5: Control/data flow in multi-tier deployment (NGINX/Redis)

*Redis* (*§6.3*). We execute a single-core Redis instance [46] and measure the latency of fixed-size GET and SET operations, comparing sockets and the equivalent cVM interface with CP\_Streams. This experiment validates our previous results by also comparing the FPGA and SiFive environments.

*Python/Library (§6.4).* We measure the cost of using cVMs to isolate the components of a simple cryptographic application in Python, by deploying the Python runtime [58] and the PyCryptodome cryptographic library [1] in mutually isolated cVMs that use the CP\_Call and CP\_File interfaces to communicate. This experiment runs on the FPGA environment.

#### 6.2 Multi-tier deployment with NGINX/Redis

First, we evaluate the benefits of using cVMs when co-locating communicating components, compared to a traditional deployment with Docker containers [40].

The computational limitations of our FPGA and SiFive platforms make it infeasible to execute a complete microservice benchmark suite such as DeathStarBench [26]. Instead, we deploy a representative YCSB benchmark [72] (workloadb; 1 KB records; read/update ratio of 95%/5%) on the SiFive platform with two tiers: the NGINX web server [43] acts as an API gateway that redirects incoming HTTP requests to the Redis key/value store [46], which acts as a cache for frequently used data. We use *wrk2* [6] to generate NGINX requests over a 1 GbE network, measuring the latency of different configurations (10 connections; 4 I/O threads).

The application components benefit from co-location due to the frequent interaction between the (NGINX) API gateway and its (Redis) cache. Fig. 5 compares the Docker and cVM deployments. Docker incurs multiple data copies between the components and the TCP/IP network stacks. As Fig. 5a shows, Redis copies values into a send buffer that is passed to the TCP/IP stack, which NGINX copies into an output buffer that is, in turn, passed to the client's network stack (for a total of 4 copies, including the kernel's TCP/IP stack).

Fig. 6: Multi-tier deployment performance (NGINX/Redis)

In contrast, cVMs reduce the number of copies. Fig. 5b shows that the CP\_Stream primitive requires only 2 copies: Redis values are always copied directly into NGINX's output buffer. To support this optimization, NGINX and Redis must replace their use of sockets with CP\_Streams. NGINX registers the output buffer with a CP\_Stream, and the CP\_Stream write in Redis uses capabilities to copy data directly into the output buffer, which NGINX can then send to the client.

Fig. 6 shows the median and 95<sup>th</sup> percentile latencies for the 4 YCSB queries under various throughput regimes, comparing the baseline Docker deployment with cVMs. We can see that cVMs are more efficient: they have lower latencies in all cases (20–40% for median latency), and substantially higher throughput, with send latencies below 5 ms (33–50% for median latency).

**Conclusion.** In a typical deployment with multiple application components, cVMs can achieve isolation while lowering latencies and increasing throughput compared to containers. This performance gain is due to a reduced number of memory copies (via CP\_Stream), using fast calls to the capability-hiding TCB in cVMs (via CP\_Call within CP\_Streams). Furthermore, cVMs come with a smaller TCB compared to containers. We also expect cVMs to outperform VMs because of VMs' higher overheads caused by memory virtualization (especially for memory-bound applications) and communication mechanisms (e.g., extra data copies by the guest OS and/or hypervisor, or cross-VM copies via PCIe with directly assigned devices).

Fig. 7: Latency CDF for Redis (platform validation)

#### 6.3 Platform validation with Redis

We now validate our results by comparing the FPGA and SiFive platforms. We use Redis with a single connection that measures the latency of 1000 GET or SET operations with fixed-size keys (1 byte) and values (100 bytes). We use a simple client application that is co-located with the Redis instance. The baseline system uses separate processes and TCP/IP sockets; we use separate cVMs for each application and CP\_Stream for communication (similarly to §6.2).

Fig. 7 shows the latency distribution of the GET and SET requests for all configurations. The results indeed validate our observations from the multi-tier YCSB benchmark in §6.2. cVMs exhibit lower latencies with less deviation on both platforms, compared to a native system with TCP/IP sockets: 90% of cVM requests take 14–19 ms; the baseline takes 19–35 ms on the FPGA platform. The SiFive platform supports the same conclusions, albeit with different absolute numbers. This is because the FPGA device runs at a lower clock frequency, and two processes must be co-scheduled on the same core (with both the baseline and cVMs).
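For readers who want to reproduce this kind of latency summary, the sketch below times a batch of operations and reports a nearest-rank percentile. It is not the measurement harness used for Fig. 7; `measure`, `percentile` and the stand-in operation are hypothetical, and integer arithmetic is used for the rank to avoid floating-point rounding at the boundary.

```python
import time


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def measure(op, n=1000):
    """Return the wall-clock latency, in seconds, of n calls to op()."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        op()
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in for a GET or SET request against a co-located server.
store = {}
latencies = measure(lambda: store.__setitem__("k", b"v" * 100))
print(f"p50={percentile(latencies, 50):.2e}s  p90={percentile(latencies, 90):.2e}s")
```

Plotting the sorted samples against their rank divided by the sample count yields the kind of CDF shown in the figure.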

**Conclusion.** The CP\_Stream primitive in cVMs shows better performance on both the FPGA and SiFive platforms, achieving lower communication latencies across the whole throughput spectrum. We thus conclude that our end-to-end evaluation in §6.2 is representative of how cVMs would perform on a real-world CHERI-enabled CPU. In §6.5, we re-validate this by comparing cVMs against IPC primitives on all platforms.



Fig. 8: cVMs with Python (AES cryptographic performance)

#### 6.4 Process compartmentalization with Python library

Next, we explore the overhead of compartmentalizing a shared library with cryptographic operations in Python. In this case, we harden the security of a cloud application by mutually isolating the Python runtime and a native cryptographic module, PyCryptodome [1]. By using separate cVMs, we can safeguard the application against malicious interference by package managers [59], or protect the library against unauthorized access to its cryptographic keys [4].

Python creates CP\_Files for the input/output buffers that it passes to the PyCryptodome library, and it uses CP\_Call to transfer control to the library, using the CP\_Files as arguments. (The original version instead passes raw buffer pointers.) PyCryptodome then uses these CP\_Files to read its input and encrypt/decrypt it into the output buffer (using AES-128). Finally, it uses CP\_Call to return execution to Python.

Fig. 8 shows the average throughput for encryption/decryption with different buffer sizes for cVMs, using the FPGA platform, and the baseline (non-isolated) system. Note that the low absolute numbers and variance (shown as shaded areas) are due to the platform limitations (single core), described in §6.1. The results in §6.3, however, show the same trend on a platform without these limitations.

We observe that cVMs have a negligible performance impact. Throughput grows until its peak with 32 KB buffers, where the encryption/decryption rates of cVMs are only 7% and 12% lower than the baseline, respectively. This amounts to 0.79 MB/s and 0.96 MB/s for the baseline, and 0.74 MB/s and 0.85 MB/s for cVMs, respectively. As expected, these overheads become even smaller as the buffer sizes grow.

Our experiment shows that CP\_Call and calls into the Intravisor are reasonably efficient. For reference, the mean execution time for the AES cryptographic code with a 16 byte buffer is comparable to the time for a C binding invocation in Python. At such sizes, CP\_Call invocations account for half of the overhead, which is at 97% and 101% for encryption and decryption, respectively, only slightly above a C binding invocation. The overhead reduces to 7% with larger buffers.

**Conclusion.** cVMs are effective at hardening applications by isolating some of their components, such as shared libraries. The required changes are minimal and do not change the semantics of the application interfaces, because the CP\_File and CP\_Call primitives follow well-understood memory copy and function call semantics. Note that CP\_Streams are constructed on top of these. The cost of this extra isolation is small, even for small buffers, and it becomes negligible as the amount of work performed between cVM-enabled operations increases.

Fig. 9: Comparison of communication mechanisms

#### 6.5 Inter-cVM communication

We compare cVMs to other IPC primitives in a baseline system, and re-validate our performance results across our two platforms (FPGA and SiFive). The baseline system uses two threads in a single process instead of cVMs; otherwise the FPGA implementation shows low TLB performance. We measure the performance of CP\_Files and CP\_Streams, pipes (PIPE), Unix sockets (UNIX), TCP/IP sockets (TCP) and a combination of mmap+memcpy+munmap (MAP+CPY). For comparison, we also consider a raw local memcpy (MEMCPY; 4 instructions; aligned data; double-word load/store operations) as an upper performance bound. We do not evaluate CP\_Calls due to the lack of an equivalent operation in the baseline kernel.

Fig. 9 shows the results under different buffer sizes on both the FPGA and SiFive platforms. First, the peak performance of MEMCPY on the FPGA platform is limited and fluctuates due to the TLB size and simple indexing function of its Flute CPU—these issues carry onto the other primitives, too.

The overhead of CP\_Files is 6% compared to MEMCPY on the FPGA platform and negligible for SiFive; it significantly outperforms all baseline IPC mechanisms. This is because we do a simple cross-cVM memcpy using CHERI's ld.cap and cincoffsetimm instructions to perform the memory access and to increment the capability offset, respectively. The results also show that domain transitions via CInvoke are efficient, as every CP\_File operation requires one capability call and its return (user $\rightarrow$ library OS, and back).

All baseline IPC primitives have $2\times$ overhead or more, because they perform more data copies than MEMCPY and CP\_Files, the latter closely following ideal performance. Interestingly, CP\_Streams have worse performance on the FPGA platform, despite the lower number of copies, whereas they show performance close to CP\_File on the SiFive platform. This is because CP\_Streams offer an asynchronous communication primitive in which two concurrent processes time-share a single CPU core on the FPGA platform when using the cVM API. For the same reason, all IPC primitives have lower relative performance on the FPGA platform compared to SiFive.

UNIX sockets are the closest to CP\_Streams, because both are bi-directional, support more than two parties, and have sequenced packet modes. They exhibit only 10% and 54% of the performance of CP\_Streams for 4 MB buffers on the FPGA and SiFive platforms, respectively. Here, the impact of MMU manipulation can be seen: the combination of memory copies and remapping reaches 3.4 MB/s and 89 MB/s on the FPGA and SiFive platforms, respectively. This mechanism lacks a notification primitive, and, compared to CP\_Files, it is  $15 \times$  and  $1.5 \times$  slower on each platform, respectively.

**Conclusion.** For a multi-core CPU architecture with CHERI, we would expect the results to be close to those of the SiFive platform, with a minor performance decrease, similar to the difference between memcpy and CP\_Files in Fig. 9a. This potential performance degradation is significantly smaller than the measured improvements: they range between  $2\times$  for the multi-core SiFive platform against the best baseline primitive, and  $2\times$  to an order of magnitude for the single-core FPGA platform, depending on the mechanism and buffer size.

#### 6.6 Deployment time

We compare the deployment time of cVMs with that of Docker containers. We create a Docker image with a simple "hello world" program and measure the time to execute it using a cVM and a container. For the cVM, we use a debug-free binary with the LKL library OS and the musl standard C library ($\approx$30 MB in size) and a 10 MB application disk image. We measure two intervals, averaged over 5 runs: from the start until the output of the program, and until its termination.
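The two intervals can be measured with a small harness such as the one below. This is a sketch, not the script used for the reported numbers: the function names are hypothetical, and the command passed in is a placeholder for whatever launches the container or the cVM.

```python
import subprocess
import sys
import time


def time_to_output_and_exit(cmd):
    """Return (seconds until first output line, seconds until termination)."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    proc.stdout.readline()                 # blocks until the first output line
    t_output = time.monotonic() - start
    proc.stdout.read()                     # drain any remaining output
    proc.wait()
    t_exit = time.monotonic() - start
    proc.stdout.close()
    return t_output, t_exit


def average(cmd, runs=5):
    """Average both intervals over several runs."""
    results = [time_to_output_and_exit(cmd) for _ in range(runs)]
    return (sum(r[0] for r in results) / runs,
            sum(r[1] for r in results) / runs)


# Placeholder workload: a "hello world" run by the local Python interpreter.
t_out, t_exit = average([sys.executable, "-c", "print('hello world')"], runs=2)
print(f"output after {t_out:.3f}s, terminated after {t_exit:.3f}s")
```

A monotonic clock is used so that the intervals are unaffected by wall-clock adjustments during the run.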

On average, the Docker container requires 1.9 s to produce the output, and 2.8 s until container termination. The times for the cVM deployment are comparable, which demonstrates their low overhead: 1.7 s and 2.6 s, respectively.

## 7 Related Work

**Intra-process compartments.** Various projects apply intra-process isolation or introduce isolation primitives. CubicleOS [52] isolates components of a user-level library OS using Intel MPK; unlike cVMs, it cannot readily and efficiently support legacy POSIX calls. Shreds [20], Janus [28], Erim [60], Hodor [29], and Donkey [53] use page tag-based isolation (ARM Domains, Intel MPK, or a custom RISC-V implementation) to implement protection domains and communication. In cases in which tags can be manipulated directly by user code, e.g., using MPK's wrpkru instruction, the system requires a trusted toolchain or program verifier, unlike cVMs. Page tags also limit the number of compartments and communication buffers, as well as their granularity, which is not a problem for cVMs with capabilities.

NaCl [73] and WASM [27] face similar problems, as they require obsolete Intel segmentation and/or proof-carrying code that must be verified by a toolchain or loader. Conf-LLVM [12] also uses MPK to isolate code inside a process, but only supports two domains with asymmetric data exchange: trusted code can only interact with untrusted code. cVMs do not limit the number of protection domains, and inter-cVM communication is symmetric.

LwCs [37] are an OS abstraction for intra-process protection, but they have page granularity, and switching domains comes at the cost of switching page tables. XFI [24] provides fine-grained memory protection and control flow integrity by extending software-based fault isolation (SFI), but SFI incurs runtime overheads and is error-prone due to its complexity.

**Compartmentalisation frameworks.** cVMs allow the deployment of isolated shared libraries. Prior work proposes frameworks for such compartmentalization: Wedge [11] identifies code parts that can be isolated; PrivTrans [13] is a source-code partitioning tool that separates trusted and untrusted components; Glamdring [35] does the same for trusted execution. These approaches are orthogonal to cVMs, and they could be used to generate application components.

**Trusted execution.** Intel SGX [31, 38, 39] provides *enclaves* as an intra-process isolation primitive. Enclaves are part of processes and cannot be accessed by privileged software or other enclaves. Frameworks, such as Graphene-SGX [19], SGX-LKL [45], Panoply [55], and Spons and Shields [51], deploy programs inside enclaves together with a library OS. Such designs decrease the potential impact of the untrusted OS kernel on enclaved software.

cVMs also use a library OS and share design features with these frameworks, but provide effective data sharing that cannot be implemented using enclaves. Enclaves can only share untrusted memory and cannot access each other's memory, which is necessary for fast inter-cVM communication. Since enclaves do not trust the host, they must use encryption, impacting performance [50]. Therefore, an interface similar to CP\_Files cannot be implemented with enclaves.

**Library OSs** can be used to de-privilege OS kernel components or create user-level containers.  $\mu$ Kontainer [57] offers containers based on the LKL library OS [36]; Williams et al. [68] show that library OSs can be executed efficiently on top of processes instead of bare VMs; X-Containers [54] offer a cloud platform using library OSs. cVMs share similarities with user-level library OS-based containers but enhance them with strong isolation and a secure communication mechanism using capabilities.

**Machine and process isolation.** As discussed in §2.1, traditional process-based isolation has shortcomings in terms of performance and TCB size when compared to cVMs. One could envision using virtualization and Intel's vmfunc to strike a balance between shared TCB size and communication performance [34]. Virtualization introduces well-known I/O and memory translation overheads, which are costly in a cloud stack, but are not present in cVMs.

## 8 Conclusions

cVMs are a new VM-like abstraction for cloud applications that uses memory capabilities for secure isolation. cVMs include a library OS to minimize how much of the cloud environment is within the TCB. Multiple cVMs safely share an address space, allowing more efficient interaction between application components than when crossing current VM/container boundaries. Their asynchronous read/write and synchronous call interfaces allow capability-unaware, legacy code to run within cVMs.

**Acknowledgements.** This work was funded by the UK Government's Industrial Strategy Challenge Fund (ISCF) under the Digital Security by Design (DSbD) Programme (UKRI grant EP/V000365 "CloudCAP"), and the Technology Innovation Institute (TII) through its Secure Systems Research Center (SSRC). It was also supported by JSPS KAKENHI grant number 18KK0310. We thank our shepherd, Ana Klimovic, and the anonymous reviewers for their helpful comments.

**Source code availability.** The source code of cVMs, the Intravisor, and various application examples can be found at https://github.com/lsds/intravisor.

## References

- [1] A self-contained cryptographic library for Python. https://github.com/Legrandin/pycryptodome. Last accessed: June 1, 2022.
- [2] Linux containers. https://linuxcontainers.org. Last accessed: June 1, 2022.
- [3] CVE-2013-6441. Available from MITRE, CVE-ID CVE-2013-6441, December 2013.
- [4] CVE-2014-0160. Available from MITRE, CVE-ID CVE-2014-0160, December 2014.
- [5] CVE-2021-21284. Available from MITRE, CVE-ID CVE-2021-21284, December 2021.
- [6] A constant throughput, correct latency recording variant of wrk. https://github.com/giltene/wrk2. Last accessed: June 1, 2022.

- [7] Alpine Linux. https://alpinelinux.org. Last accessed: June 1, 2022.
- [8] Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/. Last accessed: June 1, 2022.
- [9] Arm Morello program. https://developer.arm.com/architectures/cpu-architecture/a-profile/morello. Last accessed: June 1, 2022.
- [10] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In *Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles*, SOSP '03, pages 164–177. ACM, 2003.
- [11] Andrea Bittau, Petr Marchenko, Mark Handley, and Brad Karp. Wedge: Splitting Applications into Reduced-Privilege Compartments. In 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI 08). USENIX Association, April 2008.
- [12] Ajay Brahmakshatriya, Piyus Kedia, Derrick P. McKee, Deepak Garg, Akash Lal, Aseem Rastogi, Hamed Nemati, Anmol Panda, and Pratik Bhatu. ConfLLVM: A Compiler for Enforcing Data Confidentiality in Low-Level Code. In *Proceedings of the Fourteenth European Conference on Computer Systems*, EuroSys '19. ACM, 2019.
- [13] David Brumley and Dawn Song. Privtrans: Automatically Partitioning Programs for Privilege Separation. In 13th USENIX Security Symposium (USENIX Security 04). USENIX Association, August 2004.
- [14] Jeffrey Buell, Daniel Hecht, Jin Heo, Kalyan Saladi, and R Taheri. Methodology for performance analysis of VMware vSphere under Tier-1 applications. VMware Technical Journal, 2(1):19–28, 2013.
- [15] Reto Buerki and Adrian-Ken Rueegsegger. Muen - An x86/64 Separation Kernel for High Assurance. University of Applied Sciences Rapperswil (HSR), Tech. Rep, 2013.
- [16] Building system for CHERI software. https://github.com/CTSRD-CHERI/cheribuild. Last accessed: June 1, 2022, commit a37f5cc.
- [17] Anton Burtsev, Kiran Srinivasan, Prashanth Radhakrishnan, Kaladhar Voruganti, and Garth R. Goodson. Fido: Fast Inter-Virtual-Machine Communication for Enterprise Appliances. In 2009 USENIX Annual Technical Conference (USENIX ATC 09), San Diego, CA, June 2009. USENIX Association.

- [18] Nicholas P. Carter, Stephen W. Keckler, and William J. Dally. Hardware Support for Fast Capability-Based Addressing. *SIGPLAN Not.*, 29(11):319–327, November 1994.
- [19] Chia-Che Tsai, Donald E. Porter, and Mona Vij. Graphene-SGX: A Practical Library OS for Unmodified Applications on SGX. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 645–658, Santa Clara, CA, July 2017. USENIX Association.
- [20] Yaohui Chen, Sebassujeen Reymondjohnson, Zhichuang Sun, and Long Lu. Shreds: Fine-Grained Execution Units with Private Memory. In 2016 IEEE Symposium on Security and Privacy, pages 56–71, 2016.
- [21] CHERI-modified versions of the Flute processor. https://github.com/CTSRD-CHERI/Flute. Last accessed: June 1, 2022.
- [22] Darpa FETT Bug Bounty Program. https://fett.darpa.mil. Last accessed: June 1, 2022.
- [23] Joe Devietti, Colin Blundell, Milo M. K. Martin, and Steve Zdancewic. Hardbound: Architectural Support for Spatial Safety of the C Programming Language. *SIGOPS Oper. Syst. Rev.*, 42(2):103–114, March 2008.
- [24] Úlfar Erlingsson, Martín Abadi, Michael Vrable, Mihai Budiu, and George Necula. XFI: Software Guards for System Address Spaces. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI 06), pages 75–88, January 2006.
- [25] FreeBSD adapted for CHERI-MIPS, CHERI-RISC-V, and Arm Morello. https://github.com/CTSRD-CHERI/cheribsd. Last accessed: June 1, 2022.
- [26] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, pages 3–18. ACM, 2019.
- [27] Andreas Haas, Andreas Rossberg, Derek L. Schuff, Ben L. Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. Bringing the web up to speed with WebAssembly. *SIGPLAN Notices*, 52(6):185–200, June 2017.

- [28] Mohammad Hedayati, Spyridoula Gravani, Ethan Johnson, John Criswell, Michael Scott, Kai Shen, and Mike Marty. Janus: Intra-Process Isolation for High-Throughput Data Plane Libraries. Technical report, Technical Report UR CSD/1004, 2018.
- [29] Mohammad Hedayati, Spyridoula Gravani, Ethan Johnson, John Criswell, Michael L. Scott, Kai Shen, and Mike Marty. Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 489–504, Renton, WA, July 2019. USENIX Association.
- [30] HiFive Unmatched. https://www.sifive.com/boards/hifive-unmatched. Last accessed: June 1, 2022.
- [31] Simon P. Johnson. Scaling Towards Confidential Computing. https://systex.ibr.cs.tu-bs.de/systex19/slides/systex19-keynote-simon.pdf, 2019.
- [32] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. kvm: the Linux Virtual Machine Monitor. In *Proceedings of the Linux Symposium*, volume 1, pages 225–230, 2007.
- [33] Marios Kogias, George Prekas, Adrien Ghosn, Jonas Fietz, and Edouard Bugnion. R2P2: Making RPCs first-class datacenter citizens. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, July 2019.
- [34] Koen Koning, Xi Chen, Herbert Bos, Cristiano Giuffrida, and Elias Athanasopoulos. No Need to Hide: Protecting Safe Regions on Commodity Hardware. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 437–452. ACM, 2017.
- [35] Joshua Lind, Christian Priebe, Divya Muthukumaran, Dan O'Keeffe, Pierre-Louis Aublin, Florian Kelbert, Tobias Reiher, David Goltzsche, David Eyers, Rüdiger Kapitza, et al. Glamdring: Automatic Application Partitioning for Intel SGX. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 285–298. USENIX Association, 2017.
- [36] Linux Kernel Library. https://github.com/lkl. Last accessed: June 1, 2022.
- [37] James Litton, Anjo Vahldiek-Oberwagner, Eslam Elnikety, Deepak Garg, Bobby Bhattacharjee, and Peter Druschel. Light-Weight Contexts: An OS Abstraction for Safety and Performance. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 49–64. USENIX Association, November 2016.

- [38] Frank McKeen, Ilya Alexandrovich, Ittai Anati, Dror Caspi, Simon Johnson, Rebekah Leslie-Hurd, and Carlos Rozas. Intel® Software Guard Extensions (Intel® SGX) Support for Dynamic Memory Management Inside an Enclave. In *Proceedings of the Hardware and Architectural Support for Security and Privacy 2016*, pages 1–9. 2016.
- [39] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos V Rozas, Hisham Shafi, Vedvyas Shanbhogue, and Uday R Savagaonkar. Innovative Instructions and Software Model for Isolated Execution. *HASP@ISCA*, 10, 2013.
- [40] Dirk Merkel. Docker: Lightweight Linux Containers for Consistent Development and Deployment. *Linux Journal*, 2014(239):2, 2014.
- [41] Gal Motika and Shlomo Weiss. Virtio network paravirtualization driver: Implementation and performance of a de-facto standard. *Computer Standards & Interfaces*, 34(1):36–47, 2012.
- [42] musl libc. https://musl.libc.org. Last accessed: June 1, 2022.
- [43] nginx HTTP and reverse proxy server. https://nginx.org. Last accessed: June 1, 2022.
- [44] Fengfeng Ning, Chuliang Weng, and Yuan Luo. Virtualization I/O optimization based on shared memory. In *IEEE Intl. Conference on Big Data*, 2013.
- [45] Christian Priebe, Divya Muthukumaran, Joshua Lind, Huanzhou Zhu, Shujie Cui, Vasily A Sartakov, and Peter Pietzuch. SGX-LKL: Securing the Host OS Interface for Trusted Execution. arXiv preprint arXiv:1908.11143, 2019.
- [46] Redis is an in-memory database that persists on disk. https://github.com/redis/redis. Last accessed: June 1, 2022.
- [47] Yi Ren, Ling Liu, Qi Zhang, Qingbo Wu, Jianbo Guan, Jinzhu Kong, Huadong Dai, and Lisong Shao. Shared-Memory Optimizations for Inter-Virtual-Machine Communication. ACM Computing Surveys, February 2016.
- [48] RISC-V bring-up tracker. https://github.com/carlosedp/riscv-bringup. Last accessed: June 1, 2022.
- [49] RISC-V CPU, simple 5-stage in-order pipeline. https://github.com/bluespec/Flute. Last accessed: June 1, 2022.

- [50] Vasily A. Sartakov, Stefan Brenner, Sonia Ben Mokhtar, Sara Bouchenak, Gaël Thomas, and Rüdiger Kapitza. EActors: Fast and Flexible Trusted Computing Using SGX. In Proceedings of the 19th International Middleware Conference, Middleware '18, pages 187–200, New York, NY, USA, 2018. ACM.
- [51] Vasily A. Sartakov, Daniel O'Keeffe, David Eyers, Lluís Vilanova, and Peter Pietzuch. Spons & Shields: Practical Isolation for Trusted Execution. In Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE 2021, pages 186–200, New York, NY, USA, 2021. ACM.
- [52] Vasily A. Sartakov, Lluís Vilanova, and Peter Pietzuch. CubicleOS: A Library OS with Software Componentisation for Practical Isolation. In Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '21, pages 575–587. ACM, 2021.
- [53] David Schrammel, Samuel Weiser, Stefan Steinegger, Martin Schwarzl, Michael Schwarz, Stefan Mangard, and Daniel Gruss. Donky: Domain Keys – Efficient In-Process Isolation for RISC-V and x86. In 29th USENIX Security Symposium (USENIX Security 20), pages 1677–1694. USENIX Association, August 2020.
- [54] Zhiming Shen, Zhen Sun, Gur-Eyal Sela, Eugene Bagdasaryan, Christina Delimitrou, Robbert Van Renesse, and Hakim Weatherspoon. X-Containers: Breaking Down Barriers to Improve Performance and Isolation of Cloud-Native Containers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, pages 121–135. ACM, 2019.
- [55] Shweta Shinde, DL Tien, Shruti Tople, and Prateek Saxena. Panoply: Low-TCB Linux Applications With SGX Enclaves. In Proceedings of the Annual Network and Distributed System Security Symposium (NDSS), 2017.
- [56] Udo Steinberg and Bernhard Kauer. NOVA: A Microhypervisor-Based Secure Virtualization Architecture. In Proceedings of the Fifth European Conference on Computer Systems, EuroSys '10, pages 209–222. ACM, 2010.
- [57] Hajime Tazaki, Akira Moroo, Yohei Kuga, and Ryo Nakamura. How to Design a Library OS for Practical Containers? In Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE 2021, pages 15–28. ACM, 2021.
- [58] The Python programming language. https://github.com/python/cpython. Last accessed: June 1, 2022.

- [59] There's a RAT in my code: new npm malware with Bladabindi trojan spotted. https://blog.sonatype.com/bladabindi-njrat-rat-in-jdb.js-npm-malware. Last accessed: June 1, 2022.
- [60] Anjo Vahldiek-Oberwagner, Eslam Elnikety, Nuno O. Duarte, Michael Sammler, Peter Druschel, and Deepak Garg. ERIM: Secure, efficient in-process isolation with protection keys (MPK). In 28th USENIX Security Symposium (USENIX Security 19), pages 1221–1238. USENIX Association, August 2019.
- [61] Lluís Vilanova, Nadav Amit, and Yoav Etsion. Using SMT to accelerate nested virtualization. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, pages 750–761. ACM, 2019.
- [62] Carlos Villavieja, Vasileios Karakostas, Lluís Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrian Cristal, and Osman Unsal. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In *Intl. Conf. on Parallel Arch. and Compilation Techniques (PACT)*, pages 340–349, October 2011.
- [63] Carl A. Waldspurger. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev., 36(SI):181–194, December 2003.
- [64] Robert NM Watson, Peter G Neumann, Jonathan Woodruff, Michael Roe, Jonathan Anderson, David Chisnall, Brooks Davis, Alexandre Joannou, Ben Laurie, Simon W Moore, et al. Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 8). Technical report, University of Cambridge, Computer Laboratory, 2019.
- [65] Robert NM Watson, Alexander Richardson, Brooks Davis, John Baldwin, David Chisnall, Jessica Clarke, Nathaniel Filardo, Simon W Moore, Edward Napierala, Peter Sewell, et al. CHERI C/C++ Programming Guide. Technical report, University of Cambridge, Computer Laboratory, 2020.
- [66] Robert NM Watson, Jonathan Woodruff, Peter G Neumann, Simon W Moore, Jonathan Anderson, David Chisnall, Nirav Dave, Brooks Davis, Khilan Gudka, Ben Laurie, et al. CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization. In 2015 IEEE Symposium on Security and Privacy, pages 20–37. IEEE, 2015.

- [67] Robert NM Watson, Jonathan Woodruff, Michael Roe, Simon W Moore, and Peter G Neumann. Capability hardware enhanced RISC instructions (CHERI): Notes on the Meltdown and Spectre attacks. Technical report, University of Cambridge, Computer Laboratory, 2018.
- [68] Dan Williams, Ricardo Koller, Martin Lucina, and Nikhil Prakash. Unikernels as Processes. In Proceedings of the ACM Symposium on Cloud Computing, SoCC '18, pages 199–211. ACM, 2018.
- [69] Jonathan Woodruff, Alexandre Joannou, Hongyan Xia, Anthony Fox, Robert M Norton, David Chisnall, Brooks Davis, Khilan Gudka, Nathaniel W Filardo, A Theodore Markettos, et al. CHERI Concentrate: Practical Compressed Capabilities. *IEEE Transactions on Computers*, 68(10):1455–1469, 2019.
- [70] Jonathan Woodruff, Robert N. M. Watson, David Chisnall, Simon W. Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G. Neumann, Robert Norton, and Michael Roe. The CHERI capability model: Revisiting RISC in an age of risk. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 457–468, 2014.
- [71] Xen project. Event channel internals. https://wiki.xenproject.org/wiki/Event\_Channel\_Internals. Last accessed: June 1, 2022.
- [72] Yahoo! Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB. Last accessed: June 1, 2022.
- [73] Bennet Yee, David Sehr, Gregory Dardyk, J Bradley Chen, Robert Muth, Tavis Ormandy, Shiki Okasaka, Neha Narula, and Nicholas Fullagar. Native Client: A Sandbox for Portable, Untrusted x86 Native Code. In 2009 IEEE Symposium on Security and Privacy, pages 79–93. IEEE, 2009.