Performance Results*
T his
page gives results of performance experiments
performed on Go! There are three types of
experiments documented here: temporal performance,
space requirements and stability. Temporal
performance measures the time taken to perform
various operations. Space requirements document how
much memory various components and operations
require. Stability is measured by a `stress test',
designed to load the system heavily, with the
intention of exposing any bugs.
Temporal
Performance is measured using two sets of results:
best-case and worst-case. The best-case results
measure the time taken when all memory references
hit the level 1 cache and branches are in the
`Branch Target Buffer' (that is, branch-prediction
works). Worst-case measurements are made immediately
after the cache and branch-target-buffers have been
flushed. No average-case timings are given, because
what exactly constitutes the average case
depends on Go!'s use. Assuming high cache hit rates
and branch prediction success, the best-case can be
assumed to be the average case.
All
experiments were conducted on an Intel Pentium P54C
running at 90 MHz, with 32 MB of 90 ns EDO DRAM. The
P54C uses 8K two-way set-associative caches for both
code and data. All results are in machine cycles,
measured using the Pentium's rdtsc
instruction.
Go! Compared
to Other OSs
Go! is
extremely quick. The table below compares the times
of different OSs to perform round-trip null-RPCs and
one-way domain transfers (or IPC message
transmission).
| OS |
Architecture |
Null-
RPC |
Domain
Transfer |
| BSD |
Pentium |
55,000 |
13,000 |
| Mach
2.5 |
MIPS |
3,000 |
1,600 |
| Spring |
SparcStation
2 |
- |
880 |
| L4 |
Pentium |
665 |
121 |
| Pebble |
MIPS |
- |
114 |
| Go! |
Pentium |
73 |
29 |
Detailed ORB
Performance
Perhaps the
most important measurement is the `null-RPC time':
the time taken to call a method on a server
component that takes no inputs, performs no work and
produces no outputs.
There are
several ways a client may call a method on a server.
Recall that the xfer operation is one-way
only, hence the xfer
experiment measures only the time to xfer
to a method on another component, not to return.
Best-case
timings use a hot cache and branch-prediction
buffers (i.e. the experiments are run in a tight
loop, and the times averaged). The worst-case
experiments ensure that the cache and branch-target
buffers are cold by flushing them immediately before
timings are taken. No average-case timings are
given, because what exactly constitutes the
average-case depends on Go!'s use; assuming high
cache hit rates and branch prediction success, the
best-case can be assumed to be the average case.
call,
f-call and t-call measure round-trip,
intra-machine, null-RPC. That is, the result is the
time taken to call a method on another component and
return from it. The callee component's method does
nothing except return.
xfer
does not return, so the xfer
experiments measure the time taken for a one-way
trip (the time between issuing the xfer
and arriving at the target component).
| Primitive |
Worst
Case |
Best
Case |
| call |
748 |
201 |
| t-call |
606 |
99 |
| f-call |
412 |
73 |
| xfer |
606 |
29 |
Exception
Latency
Go! supports
software exceptions. Throwing an exception is much
like issuing a RPC return, except that control does
not return normally, but via interrupt vector 30h.
The results
for throw measure the time taken between
throwing the exception, and receiving the interrupt.
However, since this involves unwinding the stack,
the time will depend on what type of call was last
issued (call, f-call or t-call).
For this reason there are three sets of results for throw:
one for each type of call.
| Last-Call |
Worst
Case |
Best
Case |
| call |
465 |
116 |
| t-call |
318 |
63 |
| f-call |
409 |
86 |
Other ORB
Methods
This section
presents the results for measuring the times taken
for all other ORB methods. Where the time is
dependent on the parameters, the time is taken for
the least amount of work possible. For example, create
implicitly calls a component's constructor: times
are given where the constructor performs no work
other than issuing an RPC return.
| Operation |
Worst
Case (cycles) |
Best
Case (cycles) |
| create |
2795 |
673 |
| create
(null) |
1247 |
131 |
| create
(unlinked stack) |
2065 |
320 |
| create
(linked stack) |
2657 |
416 |
| destroy |
2043 |
522 |
| destroy
(null) |
1509 |
289 |
| destroy
(unlinked stack) |
1811 |
444 |
| destroy
(linked stack) |
2110 |
497 |
| install |
2298 |
965 |
| uninstall |
564 |
187 |
| get_self |
435 |
17 |
| get_type |
524 |
22 |
| get_stack |
650 |
22 |
| switch_stacks |
641 |
34 |
| lock |
560 |
29 |
| unlock |
650 |
28 |
| sel2ref |
432 |
23 |
| reject |
337 |
23 |
Spatial
Performance
The ORB uses
very little memory. Perhaps most impressive is the
space required per component: just 32 bytes for each
interface. This is around two orders of magnitude
improvement over traditional, page-based protection
models.
The ORB itself
is also small. The code section is just 6K, with 2K
of static data. The size of the ORB's data segment
depends mostly on the size of the GDT (configurable
at compile time from 1K to 64K). The larger the GDT,
the fewer selector misses will be incurred --- this
is a conventional trade-off of space against time.
GTE
Performance
Unlike the Go!
ORB, GTE has not been developed with high
performance in mind throughout. Given the number of
GTE operations, and that most have not been
optimised, an extensive break-down of the timing for
each GTE method is laborious and of little interest.
Instead, this section presents experiments that look
at performance in a wider context, such as interrupt
latency and scheduling overhead. The size of each
GTE component is also given.
Component
Sizes
As mentioned,
GTE consists of 11 distinct components. This section
lists the code size, data size, and BSS size for
each component. Code size corresponds the number of
bytes a component's code segment consumes; data size
is the amount of statically initialised data; and
BSS the amount of zero initialised data.
| Component |
Code
(bytes) |
Data
(bytes) |
BSS
(bytes) |
| comp_lib |
5664 |
1742 |
0 |
| mem_mgr |
2432 |
704 |
2(32)?? |
| idisp |
6912 |
944 |
832 |
| sched |
3040 |
512 |
196656 |
| scan |
4288 |
3488 |
1067616 |
| xcp_mgr |
560 |
240 |
0 |
| thread |
176 |
128 |
16 |
| video |
1344 |
80 |
0 |
| keyb |
1056 |
288 |
80 |
| cli |
13974 |
800 |
3200 |
sched
overhead
Each timer
interrupt schedules a new thread, resulting in the
interaction of 7 distinct components (the out-going
and incoming thread and stack components, the
interrupt-dispatcher, the scheduler and the ORB).
Contrasting this with conventional systems
(including micro-kernels) that incorporate these 7
distinct components into a single entity (namely,
the kernel), it is clear that the above approach
might incur performance problems.
An experiment
was conducted to assess whether the decomposition of
scheduling introduces unacceptable overhead. The
experiment was conducted with two runnable threads,
operating with interrupts disabled. A timer
interrupt was simulated by issuing an int 32
instruction from the first thread, and the time was
measured for the second thread to receive control.
Using the Pentium's rdtsc instruction to
count processor cycles, this time was found to be
just under 2,500 cycles.
With the
default interrupt frequency of 100 ticks per second,
this is an overhead of less than 0.3% on the 90MHz
test machine used. Furthermore, relatively little of
this overhead can be attributed to the
decomposition. Firstly, the single interrupt causes
2 threads to be scheduled (the scheduler-thread, as
well as the second runnable thread). Secondly, for
simplicity the current implementation uses the
Intel's TSS task-switching mechanism, known to
perform poorly. Thirdly, the scheduling code is not
heavily optimised. Implementing these three
optimisations is likely to reduce significantly this
already small overhead.
GDT miss
It is
reasonable to assume a very high hit-ratio of
components cached in an 8K slot GDT (although exact
figures will depend on the application). Still, it
is important to ascertain the penalty for a `cache
miss' (that is, the penalty involved when invoking a
method on a component with its data segment not
cached in the GDT). It is even more important that
the scheme does not adversely affect the
inter-component method call latency when a
component's data segment is cached in the GDT (that
is, the common case).
The
experiments examine the time taken to transfer
control between two components using Go!'s xfer
primitive. Three experiments were conducted: one for
an implementation using data segment selectors to
identify components, and one each using 26 bit
ObjRefs where the callee component's data segment
was and was not cached in the GDT (the caller
component will always be in the GDT since its data
segment selector is loaded into the data-segment
register).
| Primitive |
Best
Case |
| Using
16 bit selector |
30 |
| Using
ObjRef cached in GDT |
28 |
| Using
ObjRef not cached in GDT |
1080 |
|