ssential Evaluation Etiquette
Performability, at first impression, appears to be simply some measure of performance. "The ability to perform," one may think. In actuality, performance makes up only half of a performability evaluation. Performability is a composite measure a system's performance and its dependability. This measure is the vital evaluation method for degradable systems - highly dependable systems which can undergo a graceful degradation of performance in the presence of faults (malfunctions) allowing continued "normal" operation. An example of a degradable system is a spacecraft control system containing three CPUs. A failure in this system would be catasrophic, possibly resulting in loss of life. Thus, the system is designed to degrade upon failure of CPU "1", i.e. CPUs "2" and "3" will drop their lower priority work in order to complete high priority work that the failed CPU would have done. Most of the applications of performability at the moment deal with degradable computing and communication systems like this. However, the concept can be applied to degradable systems in a wide variety of areas from economics to biology. This paper discusses the basic concept of performability and introduces the Markov Reward Modelling method using a simple example. Lastly, a few simplification and solution methods are mentioned.
Performance can be thought of as the "quality of service (QOS), provided the system is correct"[1]. Performance modelling involves representing the probabilistic nature of user demands and predicting the system capacity to perform, under the assumption that the system structure remains constant. Dependability is an all-encompassing definition for reliability, availability, safety and security. Thus it is what allows "reliance to be justifiably placed on the service the system delivers"[2]. Dependability modelling deals with the representation of changes in the structure of the system being modelled, which are generally due to faults, and how such changes affect the availability of the system. Perfomability modelling, then, considers the effect of structural changes and their impact on the overall performance of the system.
In the past, most modelling work kept performance and dependability separate. Initially, the dependability of the system might have been satisfied, then the performance optimised. This lead to systems having good performance when the system was fully functional but a drastic decline in performance when, inevitably, failure occured. Basically, the system was either 'on' and running perfectly or 'off' when it crashed. Improvements on this lead to the design of degradable systems. Because degradable systems are designed to continue their operation even in the presence of component failures (albeit at a reduced performance level), their performance can not be accurately evaluated without taking into account the impact of the structural changes (malfunctions & repairs). Analysis of the systems from a pure performance viewpoint tended to be optimistic since it ignored the failure-repair behaviour of the systems. At the other extreme, pure dependability analysis tended to be conservative, since performance considerations were not taken into account. Thus, it was essential that methods for the combined evaluation of performance and dependability be developed: enter Performability analysis!
The above network connection is given to two separate analysts to optimise. One performs a separate dependability (more specifically reliability in this case, a subset of dependability) & performance evaluation, while the other carries out a performability evaluation.
The specification of the network is such that the nodes are highly reliable, and the central node has three back-ups. the problem areas are the links between the nodes.
The first analyst (performance/dependability) carries out his dependability
study and determines that the network is not well-connected, hence undependable.
He recommends that additional links be put in, as shown in figure 7(a), to
make it sufficiently dependable. He then carries out the performance evaluation
to optimise this now dependable network. His studies provide the following
figures for degradation in the event of failure.
The second analyst, as he is doing a performability evaluation, studies the
dependability and performance criteria concurrently before making his
recommendations. His figures for the degradation are similar, but provide a
much clearer picture in this case :
His recommendation therefore is to upgrade link 3 to a more highly reliable connection, and to provide a single back-up route for link 2, via the new highly reliable link 3. This is shown in figure 7(b).
Although this is just a first draft assessment on their parts, it becomes obvious from the two different figures that the first method is more costly and less efficient than the second, performability, evaluation. In analysing the two criteria separately, the first evaluator has been led astray from the correct implementation, and has cost network owners more. This example demonstrates how the different techniques of evaluating with the same criteria, performance & dependability, can lead to disparate solutions, and indeed that in such cases performability analysis is the better method.
This efficient compromise is the ultimate attribute of performability. Consider the spacecraft control system from the first paragraph. In order to make the spacecraft ultra-dependable, it could be built with 10 back-up CPUs. This alternative would be extremely costly. Instead, a performability evaluation provides the designers with the most efficient combination of performance and dependability needed to complete the mission.
The following sections focus on performability modelling, particularly the Markov Reward Model. Following, a few simplification and solution methods to these models are mentioned as well as some tools used to build and solve performability models.
The models of the degradable systems in performability analysis come under two general headings : Cyclic and Acyclic[1]. A Cyclic model is one that describes a system that can undergo repairs, e.g. Banking computer systems or Telephone networks. Acyclic models represent nonrepairable systems (no oppurtunity for repairs) e.g. Fault-tolerant satellite systems or Space-craft control systems[3]. Performability modelling was first applied to acyclic models, and only later on to cyclic ones. The greater complexity of cyclic models makes the solutions harder to find.
There are several types of model that can be used for performability analysis, of which we shall focus on one main type. There is the Petri Net method which includes Stochastic Petri Nets (SPN), Generalised SPN (GSPN) and Extended SPN (ESPN)[2]. An SPN-based modelling technique was also developed called Stochastic Activity Networks (SAN). It is, however, the Markov performability models that is of interest to us, specifically the Markov reward model.
Degradable systems are probabilistic in their behaviour over time, therefore stochastic processes are used to model their behaviour. Stochastic processes are "mathematical models of systems which vary in time in a random manner". The problem is that although the model is required to represent the time-dependent behaviour of the system, time cannot be considered in the representation of the system. Markov process representation, a type of stochastic process, allows the consideration of time in a very controlled manner, thus overcoming the apparent obstacle. The Markov property is such that "given the present state, the past states have no influence on the future". A Markov process describes the structural changes in the system as faults and/or repairs occur. For the cases typically modelled, there must be bounds on the state space size. For example, a fault-tolerant computer system can only occupy so many different states...not an infinite number. So, in general Markov chains are used, which are "discrete parameter Markov processes whose state space is finite or countably infinite".
There are some assumptions in our model to take note of. It is assumed that not
more
than one processor can fail at one time; there is no simultaneous failures of
CPUs. This is described by the transition arrows (only one possible transition
to & from a state). Another assumption is that the buffers are ultra-reliable,
so buffer failure is not modelled for, although such a failure might result in
a complete system
breakdown. There are no limits on buffer capacity, this eliminates the
possibility
of job loss that would occur with a full buffer. It is also assumed that all
our processors are identical,
so that if for example one CPU fails, only a single state represents the
three possible cases. (Either processor 1, 2 or 3 failing, with the other two
up). This lumping method reduces the state space of the problem.
More details on the State Space Explosion problem, and methods of
dealing with it are mentioned later.
Any one processor may fail, thus causing the system to degrade. This degradable
system is fault-tolerant because the work can still be shared out among the
remaining operational one(s), should one (or two) CPUs fail. Thus the system
has several different stages between fully functional and completely failed. The
system can still do work in less than fully operational status, although at a
reduced performance level. This provides the perfect opportunity for performability
evaluation of the system.
The MRM is built up of two distinct models : the behaviour model and the
reward structure[1].
The behaviour model, funnily enough, describes the possible behaviour of the
system. A degradable system, depending on the faults that occur, can be in
different states at different times. If all the processors were running that
would be one state, should one fail it would be another state, etc... The states,
and the transition links between them representing by failures and repairs,
make up the behaviour model.
Each different state, representing a different number of processors 'up', is
capable of
a different amount of work. In other words each has a certain performance level
associated with it. The amount of performance achievable has a certain reward
related to it. This reward rate quantifies the ability of the system to perform.
If the system goes to a state with a higher reward, a higher performance level
is reached. The set of these rewards, associated to the individual states, make
up the reward structure.
Together the behaviour model and reward model describe the MRM :
Regarding figure 2, you see that there are four states describing the system.
These are :
There are two possible ways of assigning rewards in the reward structure. The
structure can associate reward rates with state occupancies or associate
reward impulses with state transitions. These two techniques can be
implemented in the
model together or in mutual exclusion. In the latter case a fixed reward
is associated with each transition, to be awarded at the moment of the transition's
occurrence. It is, however, the former technique that is adopted in our example,
whereby the reward is assigned per unit time (a rate). It can be interpreted as
the rate at which reward is accumulated in each state : The longer the system
remains in a state, the more reward it accumulates; the higher the reward rate,
the more reward per unit time.
The rates described in figure 2 are summed up below. The highest reward rate
(R0) is associated to state 1, and the lowest (R3) to state 4 :
[4]The finite-state stochastic process Z(t) represents the
evolution (path) of the
system in time. It is the state of the system at any time t. This is
shown in figure 3. In this case the path it describes goes from state
1->2->1->2->3->4->3->2->1. As can be seen from the differing lengths of the
horizontal lines in Z(t), the holding time in each state is random.
X(t) defines the reward rate of the system at time t, as shown
in figure 3. In this case the rate goes from 3->2->3->2->1->0->1->2->3. This
follows directly from the state changes shown in the plot of Z(t).
Y(t) is the crucial variable , entirely dependant on the
previous two graphs. It is the accumulated reward until time t, which is the
area under the
X(t) curve. The higher the reward rate in X(t), the steeper the slope of the
Y(t) curve, which implies more work done per unit time. With this
the performability of the system is determined. By interpreting rewards as
performance levels,
the distribution of accumulated reward characterizes systems
that evolve through states with different performance levels (degradable
systems).
J.F.Meyer, from whom the concept of performability evaluation originated,
formally
described the performability of a system as "the probability that the system
reaches an accomplishment level y over a utilization interval (0,t)"[6]. Looking
at the graph of accumulated reward Y(t) in figure 3, y can be interpreted
as a value on the y-axis. We
see that a certain time t' is needed before a certain reward level (height
of the curve) is reached.
(Note : The PDF, probability distribution function is different from the 'pdf',
probability
density function). The distribution of accumulated reward by time t
evaluated at x is denoted by :
The PDF in figure 4 is a fairly average system probability distribution. To
the unindoctrinated, it will not be particularly
meaningful. These further figures will clarify the idea slightly.
A PDF graph literally shows how the probabilities of accomplishing the work are
distributed. The PDF is the integral of the area under the pdf (probability
density function),
so is a direct translation of it. The higher the density, the steeper the slope
of the PDF.
It stands to reason then that a system that operates for the majority of time
at fully operational status will have a higher density of probabilities nearer
to the maximum accomplishment level. (R0*t...the furthest right of the PDF,
which represents the greatest possible accumulated reward).
If the system fails quite frequently, however, and generally runs at a lower
performance level it would tend to look more like this :
So using the formula given earlier :
y(x,t) = Prob[Y(t)<=x]
The ideal case is a failure free system that operates at maximum capacity all
the time. In this case, there is zero probability of accomplishing anything less
than the maximum amount of work, and 100% probability of doing that. This
distribution is described by the impulse shape shown in figure 6; this is the
ideal shape that optimisation techniques strive to attain. Figure 6 illustrates
how the different evaluation techniques might compare to each other. As can
be seen, performability provides the best means to find the most efficient
use of resources that ensure mission completion (considering time factors).
In order to lessen the complexity of a model as well as aid in the solution, simplification is
often needed. State Space Explosion is a problem in many performability models. A
model may contain millions of states because of the fact that the number of states grows
exponentially with the number of components that can fail[2].
If a model's state space is very large, it is difficult to solve (particularly by
numerical techniques) and costly to construct (use of computing resources, man hours, etc.).
Following, one method used to simplify models is the Reduction of
model size. Two such techniques are the Elimination of states that have a
low probability of occuring and the Lumping of equivalent states, as mentioned in the previous
section. Another method of simplification is Decomposition. An example of decomposition is
the splitting of a model into specific groups of states that are occupied frequently and
groups that are occupied rarely (time-scale distinction)[2]. These methods, among others, allow
for much simpler (and cheaper) analysis of performability models. In fact, some models can't
even be solved without simplification.
Just to name a few, Randomisation/Uniformisation, Partial differential
equations,
and Transformation are techniques used to solve performability models. These four methods, among
others, are used to solve Markovian models. In different modelling cases,
Stochastic Petri Nets, Stochastic Ativity Networks (SANs) and Generalised Petri
Nets may be solved by Simulation[2]. However, depending on model composition,
parts of the model may be solved by analytic means, others parts by simulation.
Approximate analytic methods are often used for their simplicity as compared to exact means.
Similarly, accuracy in simulation is proportional to solution time, so the evaluators must decide on
accuracy required[1]. Accordingly, many different algorithms have
been created although they may be constrained in their applicability.
The nature of the solution methods obviously depends on the type of performability model to be solved.
In Markov Reward Modelling, as in the example in the previous
section, the solution usually consists of
a solution of the probability distribution function (PDF) of accumulated reward (see section on
METASAN which
uses Stochastic Activity Networks (SANs) to create and evaluate performability models.
A second tool is METFAC,
which is written in Fortran 77 and makes use of Markov processes to create the models[1]
(see appendix for more information on METASAN and METFAC).
As this is simply a mention of the use of modelling tools, see Shezard Jeffry Careem's article,
An Introduction to Performability and Some Modeling Tools...
(http://www-students.doc.ic.ac.uk:80/~sjc2/article1.html) for more information
on modelling tools.
The use of performability evaluations has become widespread as a result of the proliferation of
degradable multi-processor systems. As these systems become more complex, there is increased
demand for use of performability, the most accurate technique to understand and predict the
behaviour of these machines. The crucial questions that need to be answered
are "How long can the system be expected to work without interruption ?"; "How much work can the
system be expectected to accomplish before a failure ?"; "What is the probability that the system
operates above a certain level of efficiency during an observation period ?".
Currently, these questions are applied heavily
in the computing and communications realm of degradable systems. However, it is seen that
systems designers in many fields will soon consider performability
as a cost effective method of evaluation[1]. As an example of varied applications
of this concept, one solution technique (double Laplace-transform inversion) used in performability
today was derived by a mathematician, Prem S. Puri, in the 1960's in his investigation of the
lethal effect of injecting bacteria into an animal[5].
One wonders if the developers of performability anticipated how widely their ideas would be applied.
Regardless, performability will continue to be applied in more diverse applications, such as
flexible manufacturing systems and vehicle-highway systems, as the tools used to apply performability
evaluations become cheaper, more powerful, and easier to use[1].
The Markov Reward Model
The most commonly used performability model today is the Markov Reward
Model (MRM). To illustrate the modelling technique, applied to a cyclic
case, a 3-CPU multi-processor system is used, that begins running in fully operational
mode. Jobs arrive at the buffer and are stored until
a processor(CPU) becomes available, then the job at the head of the buffer is
sent to this CPU to be processed. In this manner jobs are shared equally between
the 3 identical processors.
Fig 1: Model of the multi-processor system
Fig 2: The Markov reward Model
The possible transitions between each state are described by the arrows joining
them. The system will only spend a certain amount of time in each state, called
the holding time.The holding times in each state are typically exponential
distributed, therefore we can associate to each a probability of changing state, over
time. These are the labels on the transition arrows, Lambda and Mju.
Lambda refers to a state switch due to a failure caused by faults occuring
in the system; Mju to one caused by repairs to the system. From these,
and the state transition diagram, we can build up Q, the
Generator matrix shown in figure 2. Each transition rate from state i
to state j is denoted by the term q(i,j) in the matrix.
Fig 3 : Sample paths of Z(t), X(t) & Y(t) processes
Fig 4 : The Probability Distribution Function of Y(t)
Probability Density Functions (PDFs) Illuminated
Fig 5(a) : Low performability system PDF
Fig 5(b) : High performability system PDF
It becomes clear that in the latter case, as one progresses further from the
origin, the probability of accomplishing at most a certain level of
work is quite high, i.e. There is a greater chance of not accomplishing a
specified amount of work in the latter case than in the former.
Fig 6 : Comparative PDF's
Brief Overview of Some Simplification and Solution Methods
Performability On the Rise In the Future
By
Frederick Abou Jawad
&
Eric Johnsen
June 12, 1995