Performability, at first impression, appears to be simply some measure of performance. "The ability to perform," one may think. In actuality, performance makes up only half of a performability evaluation. Performability is a composite measure of a system's performance and its dependability. This measure is the vital evaluation method for degradable systems: highly dependable systems which can undergo a graceful degradation of performance in the presence of faults (malfunctions), allowing continued "normal" operation. An example of a degradable system is a spacecraft control system containing three CPUs. A failure in this system could be catastrophic, possibly resulting in loss of life. Thus, the system is designed to degrade upon failure of CPU "1", i.e. CPUs "2" and "3" will drop their lower-priority work in order to complete the high-priority work that the failed CPU would have done. Most applications of performability at the moment deal with degradable computing and communication systems like this. However, the concept can be applied to degradable systems in a wide variety of areas, from economics to biology. This paper discusses the basic concept of performability and introduces the Markov Reward Modelling method using a simple example. Lastly, a few simplification and solution methods are mentioned.
Performance can be thought of as the "quality of service (QOS), provided the system is correct"[1]. Performance modelling involves representing the probabilistic nature of user demands and predicting the system's capacity to perform, under the assumption that the system structure remains constant. Dependability is an all-encompassing term for reliability, availability, safety and security. Thus it is what allows "reliance to be justifiably placed on the service the system delivers"[2]. Dependability modelling deals with the representation of changes in the structure of the system being modelled, which are generally due to faults, and how such changes affect the availability of the system. Performability modelling, then, considers the effect of structural changes and their impact on the overall performance of the system.
In the past, most modelling work kept performance and dependability separate. Initially, the dependability of the system might have been satisfied, then the performance optimised. This led to systems having good performance when the system was fully functional, but a drastic decline in performance when, inevitably, failure occurred. Basically, the system was either 'on' and running perfectly or 'off' when it crashed. Improvements on this led to the design of degradable systems. Because degradable systems are designed to continue their operation even in the presence of component failures (albeit at a reduced performance level), their performance cannot be accurately evaluated without taking into account the impact of the structural changes (malfunctions & repairs). Analysis of these systems from a pure performance viewpoint tended to be optimistic, since it ignored the failure-repair behaviour of the systems. At the other extreme, pure dependability analysis tended to be conservative, since performance considerations were not taken into account. Thus, it was essential that methods for the combined evaluation of performance and dependability be developed: enter performability analysis!
The above network connection is given to two separate analysts to optimise. One performs a separate dependability (more specifically reliability in this case, a subset of dependability) & performance evaluation, while the other carries out a performability evaluation.
The specification of the network is such that the nodes are highly reliable, and the central node has three back-ups. The problem areas are the links between the nodes.
The first analyst (performance/dependability) carries out his dependability study and determines that the network is not well-connected, hence undependable. He recommends that additional links be put in, as shown in figure 7(a), to make it sufficiently dependable. He then carries out the performance evaluation to optimise this now dependable network. His studies provide the following figures for degradation in the event of failure.
The second analyst, as he is doing a performability evaluation, studies the dependability and performance criteria concurrently before making his recommendations. His figures for the degradation are similar, but provide a much clearer picture in this case:
His recommendation therefore is to upgrade link 3 to a more highly reliable connection, and to provide a single back-up route for link 2, via the new highly reliable link 3. This is shown in figure 7(b).
Although this is just a first-draft assessment on their parts, it becomes obvious from the two different sets of figures that the first method is more costly and less efficient than the second, performability, evaluation. In analysing the two criteria separately, the first evaluator has been led astray from the correct implementation, and has cost the network's owners more. This example demonstrates how different techniques of evaluating the same criteria, performance & dependability, can lead to disparate solutions, and indeed that in such cases performability analysis is the better method.
This efficient compromise is the ultimate attribute of performability. Consider the spacecraft control system from the first paragraph. In order to make the spacecraft ultra-dependable, it could be built with 10 back-up CPUs. This alternative would be extremely costly. Instead, a performability evaluation provides the designers with the most efficient combination of performance and dependability needed to complete the mission.
The following sections focus on performability modelling, particularly the Markov Reward Model. Then a few simplification and solution methods for these models are mentioned, as well as some tools used to build and solve performability models.
The models of the degradable systems in performability analysis come under two general headings: Cyclic and Acyclic[1]. A cyclic model is one that describes a system that can undergo repairs, e.g. banking computer systems or telephone networks. Acyclic models represent nonrepairable systems (no opportunity for repairs), e.g. fault-tolerant satellite systems or spacecraft control systems[3]. Performability modelling was first applied to acyclic models, and only later to cyclic ones. The greater complexity of cyclic models makes the solutions harder to find.
There are several types of model that can be used for performability analysis, of which we shall focus on one main type. There is the Petri Net method, which includes Stochastic Petri Nets (SPN), Generalised SPN (GSPN) and Extended SPN (ESPN)[2]. An SPN-based modelling technique called Stochastic Activity Networks (SAN) was also developed. It is, however, the Markov performability model that is of interest to us, specifically the Markov reward model (MRM).
Degradable systems are probabilistic in their behaviour over time, therefore stochastic processes are used to model them. Stochastic processes are "mathematical models of systems which vary in time in a random manner". The problem is that although the model is required to represent the time-dependent behaviour of the system, time cannot be considered in the representation of the system. The Markov process, a type of stochastic process, allows the consideration of time in a very controlled manner, thus overcoming the apparent obstacle. The Markov property is such that "given the present state, the past states have no influence on the future". A Markov process describes the structural changes in the system as faults and/or repairs occur. For the cases typically modelled, there must be bounds on the state space size. For example, a fault-tolerant computer system can only occupy so many different states...not an infinite number. So, in general, Markov chains are used, which are "discrete parameter Markov processes whose state space is finite or countably infinite".
There are some assumptions in our model to take note of. It is assumed that no more than one processor can fail at a time; there are no simultaneous failures of CPUs. This is described by the transition arrows (only one possible transition to and from a state). Another assumption is that the buffers are ultra-reliable, so buffer failure is not modelled, although such a failure might result in a complete system breakdown. There are no limits on buffer capacity; this eliminates the possibility of job loss that would occur with a full buffer. It is also assumed that all our processors are identical, so that if, for example, one CPU fails, only a single state represents the three possible cases (either processor 1, 2 or 3 failing, with the other two up). This lumping method reduces the state space of the problem. More details on the state space explosion problem, and methods of dealing with it, are mentioned later.
Any one processor may fail, thus causing the system to degrade. This degradable system is fault-tolerant because the work can still be shared out among the remaining operational one(s), should one (or two) CPUs fail. Thus the system has several different stages between fully functional and completely failed. The system can still do work in less than fully operational status, although at a reduced performance level. This provides the perfect opportunity for performability evaluation of the system.
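As a concrete sketch, the behaviour of this three-CPU system can be simulated as a continuous-time Markov chain. The failure and repair rates below are hypothetical values chosen purely for illustration (the paper gives no numbers), and states are numbered 1 (all CPUs up) through 4 (all failed):

```python
import random

# Hypothetical failure/repair rates -- illustrative only, not from the paper.
LAMBDA = 0.01   # failure rate of each working CPU
MU = 0.1        # repair rate (one repair at a time)

def simulate_path(t_end, seed=0):
    """Simulate the behaviour model: state 1 = all three CPUs up, ...,
    state 4 = all failed. Returns the list of (time, state) points visited."""
    rng = random.Random(seed)
    t, state = 0.0, 1
    path = [(t, state)]
    while True:
        fail = (4 - state) * LAMBDA          # any working CPU may fail
        repair = MU if state > 1 else 0.0    # repair possible once degraded
        t += rng.expovariate(fail + repair)  # exponential holding time
        if t >= t_end:
            return path
        # exactly one CPU fails or is repaired at a time (no simultaneous events)
        state += 1 if rng.random() < fail / (fail + repair) else -1
        path.append((t, state))
```

Note that every transition moves to an adjacent state only, matching the assumption above that CPUs cannot fail simultaneously.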
The MRM is built up of two distinct models: the behaviour model and the reward structure[1].
The behaviour model, funnily enough, describes the possible behaviour of the system. A degradable system, depending on the faults that occur, can be in different states at different times. If all the processors were running that would be one state; should one fail, it would be another state, and so on. The states, and the transition links between them, representing failures and repairs, make up the behaviour model.
Each different state, representing a different number of processors 'up', is capable of a different amount of work. In other words, each has a certain performance level associated with it. The amount of performance achievable has a certain reward related to it. This reward rate quantifies the ability of the system to perform. If the system goes to a state with a higher reward, a higher performance level is reached. The set of these rewards, associated with the individual states, makes up the reward structure.
Together, the behaviour model and the reward structure describe the MRM:
Regarding figure 2, you see that there are four states describing the system. These are:
There are two possible ways of assigning rewards in the reward structure. The structure can associate reward rates with state occupancies, or associate reward impulses with state transitions. These two techniques can be implemented in the model together or in mutual exclusion. In the latter case a fixed reward is associated with each transition, to be awarded at the moment of the transition's occurrence. It is, however, the former technique that is adopted in our example, whereby the reward is assigned per unit time (a rate). It can be interpreted as the rate at which reward is accumulated in each state: the longer the system remains in a state, the more reward it accumulates; the higher the reward rate, the more reward per unit time.
The rates described in figure 2 are summed up below. The highest reward rate (R0) is associated with state 1, and the lowest (R3) with state 4:
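In code, this rate-based reward structure is simply a mapping from states to rates. The numeric values below are hypothetical (the figure only fixes the ordering R0 > R1 > R2 > R3); here the rate is taken to be the number of working CPUs:

```python
# Hypothetical reward rates for the four states of figure 2; only the
# ordering R0 > R1 > R2 > R3 is given, so rate = number of working CPUs.
REWARD = {1: 3.0,   # state 1: all three CPUs up   -> R0
          2: 2.0,   # state 2: one CPU failed      -> R1
          3: 1.0,   # state 3: two CPUs failed     -> R2
          4: 0.0}   # state 4: all CPUs failed     -> R3

def reward_rate(state):
    """X(t): the reward rate earned while the system occupies `state`."""
    return REWARD[state]
```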
The finite-state stochastic process Z(t) represents the evolution (path) of the system in time[4]. It is the state of the system at any time t. This is shown in figure 3. In this case the path it describes goes from state 1->2->1->2->3->4->3->2->1. As can be seen from the differing lengths of the horizontal lines in Z(t), the holding time in each state is random.
X(t) defines the reward rate of the system at time t, as shown in figure 3. In this case the rate goes from 3->2->3->2->1->0->1->2->3. This follows directly from the state changes shown in the plot of Z(t).
Y(t) is the crucial variable, entirely dependent on the previous two graphs. It is the accumulated reward up to time t, which is the area under the X(t) curve. The higher the reward rate in X(t), the steeper the slope of the Y(t) curve, which implies more work done per unit time. With this the performability of the system is determined. By interpreting rewards as performance levels, the distribution of accumulated reward characterises systems that evolve through states with different performance levels (degradable systems).
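The relationship between the three graphs can be made concrete with a short sketch: given a piecewise-constant path Z(t) as a list of (time, state) transition points, Y(t) is just the sum of rate × holding-time over the visited states. The reward rates below are hypothetical (taken as the number of working CPUs):

```python
# Hypothetical reward rates: state 1 (all up) earns 3 per unit time, state 4 earns 0.
REWARD = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}

def accumulated_reward(path, t):
    """Y(t): the area under the reward-rate curve X up to time t.
    `path` lists the (time, state) transition points of Z, as in figure 3."""
    y = 0.0
    for (t0, s), (t1, _) in zip(path, path[1:] + [(t, None)]):
        if t0 >= t:
            break
        y += REWARD[s] * (min(t1, t) - t0)   # rate * time spent in state s
    return y

# A path in the style of figure 3: start fully up, degrade, recover, degrade.
path = [(0.0, 1), (2.0, 2), (3.0, 1), (5.0, 2)]
```

For this path, two time units at rate 3 give Y(2) = 6, and each later segment adds its own rate × duration, exactly as the area under X(t).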
J.F.Meyer, from whom the concept of performability evaluation originated, formally described the performability of a system as "the probability that the system reaches an accomplishment level y over a utilization interval (0,t)"[6]. Looking at the graph of accumulated reward Y(t) in figure 3, y can be interpreted as a value on the y-axis. We see that a certain time t' is needed before a certain reward level (height of the curve) is reached.
(Note: the PDF, probability distribution function, is different from the 'pdf', probability density function.) The distribution of accumulated reward by time t, evaluated at x, is denoted by:

y(x,t) = Prob[Y(t)<=x]
The PDF in figure 4 is a fairly average system probability distribution. To the uninitiated, it will not be particularly meaningful. The further figures will clarify the idea slightly.
A PDF graph literally shows how the probabilities of accomplishing the work are distributed. The PDF is the integral of the pdf (probability density function), so is a direct translation of it: the higher the density, the steeper the slope of the PDF.
It stands to reason then that a system that operates for the majority of time at fully operational status will have a higher density of probabilities nearer to the maximum accomplishment level. (R0*t...the furthest right of the PDF, which represents the greatest possible accumulated reward).
If the system fails quite frequently, however, and generally runs at a lower performance level it would tend to look more like this :
So, using the formula given earlier:

y(x,t) = Prob[Y(t)<=x]

it becomes clear that in the latter case, as one progresses further from the origin, the probability of accomplishing at most a certain level of work is quite high, i.e. there is a greater chance of not accomplishing a specified amount of work in the latter case than in the former.
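To make this definition concrete, the performability distribution y(x,t) = Prob[Y(t) <= x] can be estimated by straightforward Monte Carlo simulation of the four-state model. This is only an illustrative sketch with hypothetical failure, repair and reward rates; the solution methods mentioned later are largely analytic:

```python
import random

LAMBDA, MU = 0.01, 0.1                      # hypothetical failure/repair rates
REWARD = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}   # hypothetical reward rates

def sample_Y(t_end, rng):
    """Draw one sample of the accumulated reward Y(t_end)."""
    t, state, y = 0.0, 1, 0.0
    while True:
        fail = (4 - state) * LAMBDA
        repair = MU if state > 1 else 0.0
        hold = rng.expovariate(fail + repair)
        if t + hold >= t_end:
            return y + REWARD[state] * (t_end - t)   # partial final segment
        y += REWARD[state] * hold
        t += hold
        state += 1 if rng.random() < fail / (fail + repair) else -1

def performability(x, t_end, n=2000, seed=1):
    """Estimate y(x, t_end) = Prob[Y(t_end) <= x] from n sample paths."""
    rng = random.Random(seed)
    return sum(sample_Y(t_end, rng) <= x for _ in range(n)) / n
```

Sweeping x from 0 to the maximum accumulated reward (here 3 × t) traces out exactly the kind of PDF curve shown in the figures: a reliable system piles its probability mass near the right-hand end, a failure-prone one nearer the origin.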
The ideal case is a failure free system that operates at maximum capacity all the time. In this case, there is zero probability of accomplishing anything less than the maximum amount of work, and 100% probability of doing that. This distribution is described by the impulse shape shown in figure 6; this is the ideal shape that optimisation techniques strive to attain. Figure 6 illustrates how the different evaluation techniques might compare to each other. As can be seen, performability provides the best means to find the most efficient use of resources that ensure mission completion (considering time factors).
In order to lessen the complexity of a model, as well as to aid in its solution, simplification is often needed. State space explosion is a problem in many performability models. A model may contain millions of states, because the number of states grows exponentially with the number of components that can fail[2]. If a model's state space is very large, it is difficult to solve (particularly by numerical techniques) and costly to construct (use of computing resources, man hours, etc.). One method used to simplify models is the reduction of model size. Two such techniques are the elimination of states that have a low probability of occurring, and the lumping of equivalent states, as mentioned in the previous section. Another method of simplification is decomposition. An example of decomposition is the splitting of a model into specific groups of states that are occupied frequently and groups that are occupied rarely (time-scale distinction)[2]. These methods, among others, allow for much simpler (and cheaper) analysis of performability models. In fact, some models can't even be solved without simplification.
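The lumping idea from the three-CPU example can be sketched directly: if the CPUs are identical, a detailed state recording each CPU's individual up/down status is equivalent to one that only counts failures, shrinking 2^3 = 8 states to 4:

```python
from itertools import product

# Detailed state: one up/down flag per CPU -> 2**3 = 8 states.
detailed_states = list(product([True, False], repeat=3))

def lump(state):
    """Map a detailed state to its equivalence class: the number of failed CPUs."""
    return sum(not up for up in state)

# With identical CPUs only the count matters -> 4 lumped states (0..3 failed).
lumped = sorted({lump(s) for s in detailed_states})
```

With n identical components the reduction is from 2**n detailed states to n+1 lumped ones, which is exactly why lumping is so effective against state space explosion.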
Just to name a few, randomisation/uniformisation, partial differential equations, and transformation are techniques used to solve performability models. These methods, among others, are used to solve Markovian models. In different modelling cases, Stochastic Petri Nets, Stochastic Activity Networks (SANs) and Generalised Petri Nets may be solved by simulation[2]. However, depending on model composition, parts of the model may be solved by analytic means, other parts by simulation. Approximate analytic methods are often used for their simplicity as compared to exact means. Similarly, accuracy in simulation is proportional to solution time, so the evaluators must decide on the accuracy required[1]. Accordingly, many different algorithms have been created, although they may be constrained in their applicability. The nature of the solution methods obviously depends on the type of performability model to be solved. In Markov Reward Modelling, as in the example in the previous section, the solution usually consists of solving for the probability distribution function (PDF) of accumulated reward. As for modelling tools, one is METASAN, which uses Stochastic Activity Networks (SANs) to create and evaluate performability models. A second tool is METFAC, which is written in Fortran 77 and makes use of Markov processes to create the models[1] (see appendix for more information on METASAN and METFAC). As this is simply a mention of the use of modelling tools, see Shezard Jeffry Careem's article, An Introduction to Performability and Some Modeling Tools... (http://www-students.doc.ic.ac.uk:80/~sjc2/article1.html) for more information on modelling tools.
The use of performability evaluations has become widespread as a result of the proliferation of degradable multi-processor systems. As these systems become more complex, there is increased demand for performability, the most accurate technique for understanding and predicting the behaviour of these machines. The crucial questions that need to be answered are: "How long can the system be expected to work without interruption?"; "How much work can the system be expected to accomplish before a failure?"; "What is the probability that the system operates above a certain level of efficiency during an observation period?".
Currently, these questions are applied heavily to degradable systems in the computing and communications realm. However, systems designers in many fields will soon consider performability as a cost-effective method of evaluation[1]. As an example of the varied applications of this concept, one solution technique (double Laplace-transform inversion) used in performability today was derived by a mathematician, Prem S. Puri, in the 1960s in his investigation of the lethal effect of injecting bacteria into an animal[5]. One wonders if the developers of performability anticipated how widely their ideas would be applied. Regardless, performability will continue to be applied in ever more diverse applications, such as flexible manufacturing systems and vehicle-highway systems, as the tools used to apply performability evaluations become cheaper, more powerful, and easier to use[1].