Performance  

 Introduction
• Overview of GO!
• Performance    
• Publications
• The GO! Team
• Current Projects
• Download GO! 

Go! Component-based Operating System

Performance Results*

This page gives results of performance experiments performed on Go! There are three types of experiments documented here: temporal performance, space requirements and stability. Temporal performance measures the time taken to perform various operations. Space requirements document how much memory various components and operations require. Stability is measured by a `stress test', designed to load the system heavily, with the intention of exposing any bugs.

Temporal Performance is measured using two sets of results: best-case and worst-case. The best-case results measure the time taken when all memory references hit the level 1 cache and branches are in the `Branch Target Buffer' (that is, branch-prediction works). Worst-case measurements are made immediately after the cache and branch-target-buffers have been flushed. No average-case timings are given, because what exactly constitutes the average case depends on Go!'s use. Assuming high cache hit rates and branch prediction success, the best-case can be assumed to be the average case.

All experiments were conducted on an Intel Pentium P54C running at 90 MHz, with 32 MB of 90 ns EDO DRAM. The P54C uses 8K two-way set-associative caches for both code and data. All results are in machine cycles, measured using the Pentium's rdtsc instruction.

Go! Compared to Other OSs

Go! is extremely quick. The table below compares the times of different OSs to perform round-trip null-RPCs and one-way domain transfers (or IPC message transmission).

OS Architecture Null- RPC Domain Transfer
BSD Pentium 55,000 13,000
Mach 2.5 MIPS 3,000 1,600
Spring SparcStation 2 - 880
L4 Pentium 665 121
Pebble MIPS - 114
Go! Pentium 73 29

Detailed ORB Performance

Perhaps the most important measurement is the `null-RPC time': the time taken to call a method on a server component that takes no inputs, performs no work and produces no outputs.

There are several ways a client may call a method on a server. Recall that the xfer operation is one-way only, hence the xfer experiment measures only the time to xfer to a method on another component, not to return.

Best-case timings use a hot cache and branch-prediction buffers (i.e. the experiments are run in a tight loop, and the times averaged). The worst-case experiments ensure that the cache and branch-target buffers are cold by flushing them immediately before timings are taken. No average-case timings are given, because what exactly constitutes the average-case depends on Go!'s use; assuming high cache hit rates and branch prediction success, the best-case can be assumed to be the average case.

call, f-call and t-call measure round-trip, intra-machine, null-RPC. That is, the result is the time taken to call a method on another component and return from it. The callee component's method does nothing except return. xfer does not return, so the xfer experiments measure the time taken for a one-way trip (the time between issuing the xfer and arriving at the target component).

Primitive Worst Case Best Case
call 748 201
t-call 606 99
f-call 412 73
xfer 606 29

Exception Latency

Go! supports software exceptions. Throwing an exception is much like issuing a RPC return, except that control does not return normally, but via interrupt vector 30h.

The results for throw measure the time taken between throwing the exception, and receiving the interrupt. However, since this involves unwinding the stack, the time will depend on what type of call was last issued (call, f-call or t-call). For this reason there are three sets of results for throw: one for each type of call.

Last-Call Worst Case Best Case
call 465 116
t-call 318 63
f-call 409 86

Other ORB Methods

This section presents the results for measuring the times taken for all other ORB methods. Where the time is dependent on the parameters, the time is taken for the least amount of work possible. For example, create implicitly calls a component's constructor: times are given where the constructor performs no work other than issuing an RPC return.

Operation Worst Case (cycles) Best Case (cycles)
create 2795 673
create (null) 1247 131
create (unlinked stack) 2065 320
create (linked stack) 2657 416
destroy 2043 522
destroy (null) 1509 289
destroy (unlinked stack) 1811 444
destroy (linked stack) 2110 497
install 2298 965
uninstall 564 187
get_self 435 17
get_type 524 22
get_stack 650 22
switch_stacks 641 34
lock 560 29
unlock 650 28
sel2ref 432 23
reject 337 23

Spatial Performance

The ORB uses very little memory. Perhaps most impressive is the space required per component: just 32 bytes for each interface. This is around two orders of magnitude improvement over traditional, page-based protection models.

The ORB itself is also small. The code section is just 6K, with 2K of static data. The size of the ORB's data segment depends mostly on the size of the GDT (configurable at compile time from 1K to 64K). The larger the GDT, the fewer selector misses will be incurred --- this is a conventional trade-off of space against time.

GTE Performance

Unlike the Go! ORB, GTE has not been developed with high performance in mind throughout. Given the number of GTE operations, and that most have not been optimised, an extensive break-down of the timing for each GTE method is laborious and of little interest. Instead, this section presents experiments that look at performance in a wider context, such as interrupt latency and scheduling overhead. The size of each GTE component is also given.

Component Sizes

As mentioned, GTE consists of 11 distinct components. This section lists the code size, data size, and BSS size for each component. Code size corresponds the number of bytes a component's code segment consumes; data size is the amount of statically initialised data; and BSS the amount of zero initialised data.

Component Code (bytes) Data (bytes) BSS (bytes)
comp_lib 5664 1742 0
mem_mgr 2432 704 2(32)??
idisp 6912 944 832
sched 3040 512 196656
scan 4288 3488 1067616
xcp_mgr 560 240 0
thread 176 128 16
video 1344 80 0
keyb 1056 288 80
cli 13974 800 3200

sched overhead

Each timer interrupt schedules a new thread, resulting in the interaction of 7 distinct components (the out-going and incoming thread and stack components, the interrupt-dispatcher, the scheduler and the ORB). Contrasting this with conventional systems (including micro-kernels) that incorporate these 7 distinct components into a single entity (namely, the kernel), it is clear that the above approach might incur performance problems.

An experiment was conducted to assess whether the decomposition of scheduling introduces unacceptable overhead. The experiment was conducted with two runnable threads, operating with interrupts disabled. A timer interrupt was simulated by issuing an int 32 instruction from the first thread, and the time was measured for the second thread to receive control. Using the Pentium's rdtsc instruction to count processor cycles, this time was found to be just under 2,500 cycles.

With the default interrupt frequency of 100 ticks per second, this is an overhead of less than 0.3% on the 90MHz test machine used. Furthermore, relatively little of this overhead can be attributed to the decomposition. Firstly, the single interrupt causes 2 threads to be scheduled (the scheduler-thread, as well as the second runnable thread). Secondly, for simplicity the current implementation uses the Intel's TSS task-switching mechanism, known to perform poorly. Thirdly, the scheduling code is not heavily optimised. Implementing these three optimisations is likely to reduce significantly this already small overhead.

GDT miss

It is reasonable to assume a very high hit-ratio of components cached in an 8K slot GDT (although exact figures will depend on the application). Still, it is important to ascertain the penalty for a `cache miss' (that is, the penalty involved when invoking a method on a component with its data segment not cached in the GDT). It is even more important that the scheme does not adversely affect the inter-component method call latency when a component's data segment is cached in the GDT (that is, the common case).

The experiments examine the time taken to transfer control between two components using Go!'s xfer primitive. Three experiments were conducted: one for an implementation using data segment selectors to identify components, and one each using 26 bit ObjRefs where the callee component's data segment was and was not cached in the GDT (the caller component will always be in the GDT since its data segment selector is loaded into the data-segment register).

Primitive Best Case
Using 16 bit selector 30
Using ObjRef cached in GDT 28
Using ObjRef not cached in GDT 1080

The experiments show a hit of around 30 fold when a call is made on a component with its data segment not cached in the GDT. Given that taking a miss on the GDT is analogous to taking a page-miss on a virtual memory system, the miss penalty is quite acceptable (as with virtual memory systems, components with critical latency can be locked in the GDT). More importantly, indirecting ObjRefs via the CDT does not affect performance in the common case (other than a slight improvement!).

Stability

Stress-Test

To measure the overall system's stability, a stress-test was conducted. This test consisted of randomly creating and destroying instances of a stress component so that between 1 and 10,000 existed at a time. In addition, threads were created and destroyed so that between 1 and 1,000 were runnable at any time. Each stress component either called into one of the others using call, f-call, t-call or xfer, or returned (the choices are made randomly). A counter was kept so that the call-depth did not exceed 20 (and thus exhaust stack space).

Note that `random' in this context means pseudo-random (specifically, the FIXME algorithm is used). To produce genuinely random (and asynchronous) stimuli, key-presses were generated and processed at various times during the test.

To really stress the system, the GDT was reduced to having just 256 entries (thus thrashing the GDT), and the timer frequency was increased to 1,000 ticks per second.

This stress test was run for 4 days without failing.

Error-Test

To test the ORB adequately checks parameters, a component was built that called random ORB methods with random register contents and random data on the stack. The component would simply continue stressing the ORB on receipt of an exception. This test also ran for 4 days without failing.

Dynamism

An experiment was conducted to demonstrate Go!'s suitability for implementing dynamic implementation replacement. A swapper component was developed that dynamically replaced GTE's keyb component implementation from UK layout to US. While a relative trivial example, this test proves that the 'hot-swapping' support works.

Conclusion

This section has presented several experiments and their results. Performance experiments have provided very pleasing results: Go! is approaching 3 orders of magnitude faster than commercial systems, and one order of magnitude faster than even the leanest research systems.

A stress-test has also shown that the system is stable. Without this stability, the impressive performance results would be meaningless, since bug-fixes would likely adversely affect performance.

Back to Top  
Go! was sponsored by City University, School of Informatics' PhD Studentship Award. All reference to the algorithms, technology, publications listed on these pages must be acknowledged accordingly.