Use-Related Computer Failures: Heuristic Evaluation and its application in Modern Aviation
by Hugh Jackson and Ben Tristem

Contents

  1. Abstract
  2. Introduction
    1. Challenges and Goals of HCI
  3. Usability Inspection - spotting the problems
    1. Usability Testing
    2. Heuristic Evaluation
    3. The Heuristics
    4. Alternating between Heuristic Evaluation and Usability Testing
  4. Is it possible to measure good design?
    1. Measurement Criteria
  5. Aircraft Design Methods
    1. Disorientation Issues
    2. Role of Heuristic Design
    3. Consideration of Risk
  6. Some Case Studies
    1. Poor Instrumentation?
    2. Reasonable Human Error?
  7. Conclusion
    1. Enhanced Heuristics - Extending Heuristic Evaluation
Appendix 1 - Definitions
Appendix 2 - The Evolution of Usability Engineering in Organizations
Appendix 3 - References 
Abstract


Introduction

Computers should be designed for the needs and capabilities of the end user. Users should not have to think about the intricacies of how to use a computer; they should simply be able to use the system efficiently and effectively, without distraction from the user interface.

Thus the field of Human Computer Interaction (HCI) was developed - a discipline concerned with the design, evaluation, and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them.

Challenges and Goals of HCI

The main problem involved with systems design is how to keep abreast of all the changes in technology. It is important for developers to ensure that their designs offer good HCI as well as harnessing the potential functionality of any new technology. However, it is crucial that increased functionality should not be used as an excuse for poor design.

The whole point of HCI is to develop or improve the safety, utility, effectiveness, efficiency, and usability of computer systems. This can be achieved by understanding the factors that determine how people operate computer technology. Ultimately we should be trying to make radical changes to the system, not the people. People should not have to adapt to use a computer; rather, the computer should adapt to the user's behaviour.

The aim is to develop tools to help designers ensure that computer systems are suitable for the human activities that they are being used for and which ensure that efficient, effective and safe interaction is achieved.

Is it possible to measure good design?

It is important for designers to try to quantify their design implementations. This, however, is not an easy task. In theory, testing should be done at every stage of development so that the design requirements can be adhered to and any usability faults can be spotted early on. Ideally, quantifiable empirical methods that judge a system's design would enable us to build more usable computers. Usability metrics are one way of doing this. Dix et al [19] did not consider them totally dependable, however - "The problem with usability metrics is that they rely on measurements of very specific user actions in very specific situations. When the designer knows what the actions and situations will be, then she/he can set goals for measured observations." In other words it is sometimes easy for designers to concentrate on the metrics and not get the design right. The metrics, therefore, must cover a large number of measurement criteria.

Measurement Criteria

Whiteside et al [14] developed a set of criteria that they considered capable of measuring the design quality with some accuracy:
  • time to complete a task
  • percent of task completed
  • percent of task completed per unit time
  • ratio of successes to failures
  • time spent in errors
  • percent or numbers of errors
  • percent or number of competitors better than it
  • number of commands used
  • frequency of help and documentation use
  • percent of favourable/unfavourable user comments
  • number of repetitions of failed commands
  • number of runs of successes and of failures
  • number of times interface misleads the user
  • number of good and bad features recalled by user
  • number of available commands not invoked
  • number of regressive behaviours
  • number of users preferring your system
  • number of times users need to work around a problem
  • number of times the user is disrupted from a work task
  • number of times user loses control of the system
  • number of times user expresses frustration or satisfaction.

Such measurements could be carried out during a usability test.
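
As a rough illustration of how a few of the criteria above might be computed from usability test observations, the following sketch assumes a simple per-attempt record. The `TaskAttempt` layout, the field names and the sample figures are all invented for this example and do not come from Whiteside et al.

```python
from dataclasses import dataclass


@dataclass
class TaskAttempt:
    """One user's attempt at one scenario task (hypothetical record layout)."""
    seconds_total: float      # time from start to finish or abandonment
    seconds_in_errors: float  # time spent recovering from errors
    errors: int               # number of errors observed
    help_lookups: int         # times help or documentation was consulted
    completed: bool           # whether the task was finished


def summarise(attempts):
    """Compute a handful of the listed criteria for one test session."""
    n = len(attempts)
    successes = sum(a.completed for a in attempts)
    failures = n - successes
    total_time = sum(a.seconds_total for a in attempts)
    time_in_errors = sum(a.seconds_in_errors for a in attempts)
    return {
        "mean_time_s": total_time / n,
        "percent_attempts_completed": 100.0 * successes / n,
        # The ratio is undefined when nobody failed, so report None then.
        "success_failure_ratio": successes / failures if failures else None,
        "percent_time_in_errors": 100.0 * time_in_errors / total_time,
        "errors_total": sum(a.errors for a in attempts),
        "help_lookups_total": sum(a.help_lookups for a in attempts),
    }


# Invented observations, for illustration only.
session = [
    TaskAttempt(120.0, 30.0, 2, 1, True),
    TaskAttempt(200.0, 80.0, 5, 3, False),
    TaskAttempt(90.0, 0.0, 0, 0, True),
    TaskAttempt(150.0, 40.0, 1, 0, True),
]
report = summarise(session)
```

Numbers like these only mean something relative to a target set in advance or to a competing design, which is the limitation Dix et al point out.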

    Usability Inspection - spotting the problems

    Usability inspection is the generic name for a set of methods that are all based on having evaluators inspect a user interface trying to find usability problems in the design. There are various inspection methods, but two in particular are of interest to us.

    Usability Testing

    During a usability test, a test user (or sometimes a pair of users) is asked to use the system to perform a set of tasks. Rather than being presented separately, the tasks are embedded in an appropriate scenario (for example, "You see a job advertised in the paper and you want to write an e-mail to the company to apply for it.") which the user must carry out. Scenarios make test sessions more natural and help subjects to concentrate on the current task. Subjects are encouraged to think aloud while performing tasks.

    A usability test can be performed during the development of a product, revealing problems which may lead to re-design of some features. It can also be used as a comparison test: usability of a product is compared against competitors' products. Depending on the nature of the test and on the variety of users, 2-8 test users are usually enough. [5]

    Heuristic Evaluation

    Heuristic evaluation is a systematic inspection of a user interface design for usability. The goal of heuristic evaluation is to find the usability problems in a user interface design so that they can be attended to as part of an iterative design process. Evaluation is performed by user interface specialists (evaluators), who analyse the visible part of the user interface applying checklists of general heuristics, and their knowledge of common usability principles and problems.

    Heuristic evaluation can be done in the early phases of design, because it can be based on paper mock-ups and prototypes as well as on working software. It cannot replace usability tests with real users, but it is a fairly quick and inexpensive method, by which the most significant usability problems can be found. Some of the problems revealed in heuristic evaluation might never turn up during short usability tests. There may also be problems discovered in usability tests, which would have been hard to find by heuristic evaluators. The latter can be partly explained by realizing that instead of really using the system the evaluators are often able to only look at screen mock-ups.

    Evaluation is most effective when it is run by a researcher who is specialized in both usability and the application domain. If no such person can be found, a good alternative is a usability specialist who is not a domain expert. Some results can be achieved even when the analysis is done by a system designer who is not a usability specialist.

    The output from using the heuristic evaluation method is a list of usability problems in the interface, with references to those usability principles that were violated by the design in the opinion of the evaluator. It is not sufficient for evaluators to simply say that they do not like something; they should explain why they do not like it with reference to the heuristics or to other usability results. The evaluators should try to be as specific as possible and should list each usability problem separately.

    No matter how experienced an evaluator is, a single evaluator can find only some of the usability problems in the user interface of the system. According to Nielsen [6], single evaluators find only approximately 35 % of the problems, three evaluators 60 % and five 75 %. Different evaluators find different problems.
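
Because different evaluators find different problems, their individual lists are normally merged into one. The sketch below shows one way this could be done, recording each problem together with the heuristic it is said to violate. The `Finding` record, the evaluator names and the sample problems are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One usability problem reported by an evaluator (hypothetical layout)."""
    problem: str    # short description of the problem
    heuristic: str  # the usability principle the evaluator says is violated


def merge_findings(reports):
    """Union the evaluators' lists so each distinct problem appears once."""
    merged = set()
    for findings in reports.values():
        merged.update(findings)
    return merged


# Invented findings, for illustration only.
reports = {
    "evaluator_a": {
        Finding("No progress indicator while saving", "Visibility of system status"),
        Finding("Cancel button missing from wizard", "User control and freedom"),
    },
    "evaluator_b": {
        Finding("Cancel button missing from wizard", "User control and freedom"),
        Finding("Error shown as numeric code only",
                "Help users recognize, diagnose, and recover from errors"),
    },
    "evaluator_c": {
        Finding("'Quit' and 'Exit' used interchangeably", "Consistency and standards"),
    },
}

all_problems = merge_findings(reports)
# Share of the merged list that each evaluator found on their own.
coverage = {name: len(found) / len(all_problems) for name, found in reports.items()}
```

Note that `coverage` here is measured against the merged list, not against the true number of problems in the interface, which is unknown in practice. It therefore overstates what each evaluator found.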

    The Heuristics

    These guidelines were originally developed by Nielsen and Molich [1, 2], then updated by Nielsen [3, 4] after a study of a database of 249 usability problems. He considered them capable of explaining nearly all usability problems encountered in HCI.
    1. Visibility of system status - The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
    2. Match between system and the real world - The system should speak the users' language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
    3. User control and freedom - Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
    4. Consistency and standards - Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.
    5. Error prevention - Even better than good error messages is a careful design which prevents a problem from occurring in the first place.
    6. Recognition rather than recall - Make objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.
    7. Flexibility and efficiency of use - Accelerators (shortcuts) -- unseen by the novice user -- may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
    8. Aesthetic and minimalist design - Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
    9. Help users recognize, diagnose, and recover from errors - Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.
    10. Help and documentation - Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large.

    Alternating between Heuristic Evaluation and Usability Testing

    There are two major reasons for alternating between heuristic evaluation and user testing. Firstly, a heuristic evaluation pass can eliminate a number of usability problems without the need to "waste users," who sometimes can be difficult to find and schedule in large numbers. Secondly, these two categories of usability assessment methods have been shown to find fairly distinct sets of usability problems; therefore, they supplement each other rather than lead to repetitive findings [7, 8, 9].

    Aircraft Design Methods

    Disorientation Issues

    A pilot's priority is to fly the aircraft at all times. When looking inside the cockpit at a Multi Function Display even for a few seconds, it is very easy to become disorientated. A continuous but gentle rolling or pitching action can be impossible to feel without visual cues. There is a real danger that when the pilot next looks up, the aircraft will be in a compromising (perhaps inverted) position!

    Incorporated into the new European joint fighter aircraft (EF2000) is a system in which a small digital indicator can be seen even when looking inside the cockpit:

    [Figure: one of the three EF2000 multi function displays]

    Seen here is one of the three multi function displays (currently displaying fuel information). At the bottom right of the display, we can see a blue and orange artificial horizon. This allows the pilot to maintain his current attitude (i.e. level of bank and pitch).

    The buttons around the outside of the display allow selection of various display modes, and are lit to indicate current selection. Along the top of the display, in white, is information concerning current speed, direction and altitude. 

    Role of Heuristic Design

    Considered below are seven of Nielsen's heuristics with respect to the Eurofighter MFD design:

    In conclusion, the Eurofighter closely obeys Nielsen's design heuristics. This encouragingly supports their validity, as they were not used in the design of this aircraft's HCI. One can see how useful a technique heuristic evaluation can be.

    Consideration of Risk

    Recent incidents and accidents involving highly automated commercial transport aircraft have raised concerns about the overall safety effects of advanced autopilots, flight path management systems and other cockpit automation. While several recent studies have attempted to address some of these automation issues, until now no one has systematically identified all issues that exist about cockpit automation. The US Federal Aviation Administration has funded a team comprised of Oregon State University, Research Integrations, Inc., and Honeywell, Inc. to look into this further.

    It might be useful to summarize the current state of affairs. There have been three major accidents involving Airbus aircraft in the last year: an A320 ran off the end of the runway in Warsaw in September 1993, killing two people and injuring many more; the crew of an Aeroflot Airbus A310 lost control during cruise flight, which led to the death of everyone on board; and a China Airlines A300 recently crashed tail-first (!) on landing at Nagoya, killing almost all on board.

    The A300 and A310 aircraft have 'conventional' control, that is, physical control of the aircraft is transmitted by mechanical or hydraulic means to most of the flight control surfaces. The normal flight control of the Airbus A320, A321, A330 and A340 aircraft, in contrast, is achieved by computer, to which the pilots' side stick movements are one set of inputs. This is commonly known as 'fly-by-wire'. Fly-by-wire aircraft have been in regular use by the military for over 20 years, but the A320, introduced in the late 1980s, is the first commercial 'fly-by-wire' airliner. Pilots have extremely limited direct physical control of A320/21/30/40 aircraft should the flight control computers be unavailable, a situation which is anticipated not to occur during the lifetime of the fleet.

    The first flight of the Boeing 777 took place on Sunday 12 June, 1994.  The B777 is Boeing's first 'fly-by-wire' commercial transport. The B777 is a significantly different design from the A320, and it would be very surprising if there were to be any accidents attributable to features common to A320/21/30/40 and B777 aircraft which are not also common features of conventional aircraft such as the B737.

    Airbus claims its design philosophy is evolutionary, that is, the systems are not designed from scratch, but introduced gradually into the company's designs after success in previous designs. Nevertheless, there are steps, such as that to 'fly-by-wire' in the A320, which should be considered more significant than others. See the article by J.P. Potocki de Montalk [20].

    A useful and readable reference for those interested in A320 accidents is Peter Mellor's paper 'CAD: Computer-Aided Disaster!' which contains a description of the design of the A320 Electrical Flight Control System, and detailed commentary on all A320 accidents to date. A version of this will appear in High Integrity Systems journal.

    Apart from the flight control on A320/321/330/340s and B777s, there are potentially risky computer-based systems on almost all modern transport aircraft, of which maybe the most important are the autopilot/Flight-Director and the FADEC (Full-Authority Digital Engine Control). All commercial aircraft have autopilots of various degrees of sophistication (and most have Flight Directors, which provide passive guidance rather than active control), and these may be suspect in certain incidents (e.g. the Collins autopilots on B757 and B767 aircraft). Many modern aircraft also have FADEC, which has occasionally come under investigation, but so far there have been no occasions on which it has been considered the primary cause of an accident or incident.

    Human factors are very important. A task force has recently been convened to study incidents of 'controlled flight into terrain', in which an airworthy aircraft under the crew's control is unintentionally flown into the ground [21]. In these accidents the physical performance of the aeroplane is generally not a factor, but they may nevertheless be computer-related, since guidance and air traffic control rely on computers to various degrees. Dr Michael Bagshaw, head of aviation medical services at British Airways, wonders, "Are we perhaps reaching the limit to pilots' mental processing capacity?"

    Some Case Studies

    Poor Instrumentation?

    A British Midland Airways Boeing 737 crashed on the M1, one kilometre short of the runway at East Midlands Airport, on 8 January 1989.

    During its climb from Heathrow, the aircraft (registration G-OBME) experienced engine trouble, with symptoms of vibration and cabin smoke. Finding nothing abnormal about the indications of either engine, the crew made an assumption (involving the air conditioning system) that the fault lay in the right engine and shut it down. The symptoms of the engine trouble reduced, and they believed they had made the correct decision. In fact, the right engine was fine; it was the left engine that was severely malfunctioning! A matter of seconds prior to landing, the left engine caught fire and failed. The aircraft made a crash landing, killing about one third of the passengers on board.

    It was clear from the Flight Data Recorder that the engine indications for the faulty engine were seriously abnormal for at least 3 minutes. Even though the crew were checking these intently, they failed to assimilate the correct information under the pressure of the situation. A digital Electronic Instrument System (EIS) was in use for the engine readings. Whether or not the crew would have noticed such abnormal readings on conventional electromechanical instruments is a matter for conjecture, but they would probably have been more recognisable. While the introduction of the Electronic Instrument System represented progress in terms of reliability and maintenance costs, the investigators believed it could be a retrograde step in information presentation.

    The most obvious difference between conventional electromechanical engine instruments and the EIS is that full-radius mechanical pointers have been replaced by short, light-emitting diode pointers moving around the outside of their scales. Much less conspicuous than mechanical pointers, they are less able to give the comparative information provided by the strong visual cue of parallel mechanical needles. Furthermore, the EIS had been introduced to this aircraft without thorough evaluation of its efficiency in passing information. Neither pilot had operated a simulator equipped with these instruments before they flew the real plane! This lack of user testing prior to application in such a safety critical system is unacceptable, and steps are being taken to prevent such occurrences in the future.

    Reasonable Human Error?

    An aircraft, for which we do not have details, crashed into the ground whilst landing in nil visibility.

    An aircraft fitted with an automated landing system was performing a standard automatic night approach through cloud when it struck the ground, killing all on board. The pilots noticed that the altitude readings were abnormal for the stage of approach but decided to trust the computer that had so many times landed them safely in the past!

    The automatic landing system requires that the descent is specified by entering either the slope of descent (in degrees) or the rate of descent (in thousands of feet per minute). A single instrument is used to set and display this value, whether it be slope or rate, and can be seen below in simplified form:

    If the position of the marking arrow on the right hand side is not registered correctly, then a descent slope of three degrees could be mistaken for a 3000 foot per minute descent. In this incident, the wrong selection had been inadvertently made, directly causing the crash.

    Is it reasonable to conclude that the sole cause of the accident was human error? The official investigation concluded that the mistake made represented a reasonable level of human error. The design of the Human Computer Interface was ultimately blamed. This is an excellent example of the modern approach of compressing different information onto universal displays going fatally wrong! We must accept that human error will occur, and instead take steps in the design and usability testing stages to reduce the risk generated by this error!
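
To give a sense of how far apart the two readings of the same digit are, the following sketch converts a three degree slope into a rate of descent and compares it with 3000 feet per minute. The ground speed of 140 knots is an assumed, typical approach figure and is not taken from the accident report. The function name is likewise hypothetical.

```python
import math

FEET_PER_NAUTICAL_MILE = 6076.12


def descent_rate_fpm(ground_speed_knots, slope_degrees):
    """Vertical speed in feet per minute implied by a descent slope."""
    horizontal_fpm = ground_speed_knots * FEET_PER_NAUTICAL_MILE / 60.0
    return horizontal_fpm * math.tan(math.radians(slope_degrees))


# Assumed approach ground speed; the source does not give one.
ground_speed = 140.0

intended = descent_rate_fpm(ground_speed, 3.0)  # "3" read as degrees of slope
selected = 3.0 * 1000.0                         # "3" read as thousands of ft/min
ratio = selected / intended

print(f"3 degree slope at {ground_speed:.0f} kt: {intended:.0f} ft/min")
print(f"'3' in rate mode: {selected:.0f} ft/min ({ratio:.1f} times the intended rate)")
```

Under this assumption the intended slope corresponds to a descent of roughly 740 feet per minute, so the rate-mode reading of the same digit commands a descent about four times as fast.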

    Conclusion

    Enhanced Heuristics - Extending Heuristic Evaluation

    Muller et al [12] extended Nielsen's heuristics because they were unhappy with two things.

    Firstly, they could not see how a system could be properly evaluated if users were not included in the inspection: since usability relates to the users' own work, users are likely to know more about it than the developers or the human factors workers. They therefore recognised users as work domain experts, and included them in the panel of experts they asked to serve as inspectors.

    Secondly, they considered Nielsen's theory to be incomplete. To quote "Floyd [13] analyzed two complementary approaches to software engineering that she called product-oriented and process-oriented. The product-oriented paradigm is focused on the computer artifact itself. Questions of adequacy, testing, validation, and so on are posed and answered within the context of a requirements/design/implementation software development process. By contrast, the process-oriented paradigm is focused on the human work process (or human life process) that the computer artifact is intended to support. Questions of adequacy, testing, validation, and so on are posed and answered within the broader context of the users' work or life setting. Floyd asserted that both paradigms were important, and that the problem in much software engineering was to get them in the proper balance."

    They therefore developed an extended version of heuristic evaluation which they called participatory heuristic evaluation. They thought the original form of heuristic evaluation was too weighted toward the product-oriented paradigm. Their extensions were to be more balanced between product orientation and process orientation.

    They updated Nielsen's heuristics as follows:

    They also added 4 further heuristics:

    Their evaluation revealed 247 usability problems, resulting in 89 recommendations to the development team, of which the team accepted 87 percent and implemented 72 percent. Each problem or recommendation was scored by the human factors member of the team as being related to one or more of the 13 heuristics. One or more of the heuristics from the original set accounted uniquely for 33 percent of problems and 31 percent of recommendations, without any contributions from the new heuristics. By contrast, one or more of the new heuristics accounted uniquely for 15 percent of problems and 10 percent of recommendations, without any contributions from Nielsen's set. 52 percent of the problems and 59 percent of the recommendations appeared to be based on common contributions from both of the sets of heuristics -- that is, they appeared to be based on at least one heuristic from the original set and at least one heuristic from the new heuristics. Hence their results show that the new heuristics made a significant difference.

    Ultimately, we should recognize the inherent limitation of usability engineering (the design of systems using usability inspection methods), that is, it provides a means of satisfying specifications and not necessarily usability. If developers have usability issues at heart, as they should in aircraft, then this will not be a problem, but this might not always be the case.

    Appendix 1 - Definitions

    User Interface: An input language for the user, an output language for the machine, and a protocol for any interaction

    Usability: The effectiveness, efficiency and satisfaction with which specified users can achieve specified goals in particular environments. [11]

    Effectiveness: The accuracy and completeness with which specified users can achieve specified goals in particular environments. [11]

    Efficiency: The resources expended in relation to the accuracy and completeness of goals achieved. [11]

    Satisfaction: The comfort and acceptability of the work system to its users and other people affected by its use.

    Functionality: The capabilities of the system

    System: The entire environment i.e. both human operator and computer

    Usability Metric: A measure of usability which indicates the complexity, understandability, testability, description and intricacy of design

    Appendix 2

    The Evolution of Usability Engineering in Organizations [10]

    1. Usability does not matter. The main focus is to wring every last bit of performance from the iron. This is the attitude leading to the world-famous error message, "beep."
    2. Usability is important, but good interfaces can surely be designed by the regular development staff as part of their general system design. This attitude is symbolized by the famous statement made by King Frederik VI of Denmark on February 26, 1835: "We alone know what serves the true welfare and benefit of the State and People." At this stage, no attempt is made at user testing or at acquiring staff with usability expertise.
    3. The desire to have the interface blessed by the magic wand of a usability engineer. Developers recognize that they may not know everything about usability, so they call in a usability specialist to look over their design and comment on it. The involvement of the usability specialist is often too late to do much good in the project, and the usability specialist often has to provide advice on the interface without the benefit of access to real users.
    4. GUI panic strikes, causing a sudden desire to learn about user interface issues. Currently, many companies are in this stage as they are moving from character-based user interfaces to graphical user interfaces and realize the need to bring in usability specialists to advise on graphical user interfaces from the start. Some usability specialists resent this attitude and maintain that it is more important to provide an appropriate interface for the task than to blindly go with a graphical interface without prior task analysis. Even so, GUI panic is an opportunity for usability specialists to get involved in interface design at an earlier stage than the traditional last-minute blessing of a design that cannot be changed much.
    5. Discount usability engineering sporadically used. Typically, some projects use a few discount usability methods (like user testing or heuristic evaluation), though the methods are often used too late in the development lifecycle to do maximum good. Projects that do use usability methods often differ from others in having managers who have experienced the benefit of usability methods on earlier projects. Thus, usability acts as a kind of virus, infecting progressively more projects as more people experience its benefits.
    6. Discount usability engineering systematically used. At some point in time, most projects involve some simple usability methods, and some projects even use usability methods in the early stages of system development. Scenarios and cheap prototyping techniques seem to be very effective weapons for guerrilla HCI in this stage.
    7. Usability group and/or usability lab founded. Many companies decide to expand to a deluxe usability approach after having experienced the benefits of discount usability engineering. Currently, the building of usability laboratories [Nielsen 1994a] is quite popular as is the formation of dedicated groups of usability specialists.
    8. Usability permeates lifecycle. The final stage is rarely reached since even companies with usability groups and usability labs normally do not have enough usability resources to employ all the methods one could wish for at all the stages of the development lifecycle. However, there are some, often important, projects that have usability plans defined as part of their early project planning and where usability methods are used throughout the development lifecycle.

    Appendix 3 - References

    [1] Molich, R., and Nielsen, J. (1990). Improving a human-computer dialogue, Communications of the ACM 33, 3 (March), 338-348.

    [2] Nielsen, J., and Molich, R. (1990). Heuristic evaluation of user interfaces, Proc. ACM CHI'90 Conf. (Seattle, WA, 1-5 April), 249-256.

    [3] Nielsen, J. (1994a). Enhancing the explanatory power of usability heuristics. Proc. ACM CHI'94 Conf. (Boston, MA, April 24-28), 152-158.

    [4] Nielsen, J. (1994b). Heuristic evaluation. In Nielsen, J., and Mack, R.L. (Eds.), Usability Inspection Methods. John Wiley & Sons, New York, NY.

    [5] The Usability Research Group at Helsinki University of Technology

    [6] Nielsen, J. (1992c). Finding usability problems through heuristic evaluation. Proc. ACM CHI'92 (Monterey, CA, 3-7 May), 373-380.

    [7] Desurvire, H. W., Kondziela, J. M., and Atwood, M. E. 1992. What is gained and lost when using evaluation methods other than empirical testing. In People and Computers VII, edited by Monk, A., Diaper, D., and Harrison, M. D., 89-102. Cambridge: Cambridge University Press. A shorter version of this paper is available in the Digest of Short Talks  presented at CHI'92 (Monterey, CA, May 7): 125-126.

    [8] Jeffries, R., Miller, J. R., Wharton, C., and Uyeda, K. M. 1991. User interface evaluation in the real world: A comparison of four techniques. Proceedings ACM CHI'91 Conference (New Orleans, LA, April 28-May 2): 119-124.

    [9] Karat, C., Campbell, R. L., and Fiegel, T. 1992. Comparison of empirical testing and walkthrough methods in user interface evaluation. Proceedings ACM CHI'92 Conference (Monterey, CA, May 3-7): 397-404.

    [10] Nielsen, J. (1994). Guerrilla HCI: Using Discount Usability Engineering to Penetrate the Intimidation Barrier

    [11] International ISO standard 9241, "Ergonomic requirements for office work with visual display terminals"

    [12] Participatory Heuristic Evaluation: Process-Oriented Extensions to Discount Usability - Michael J. Muller, Anne McClard, Brigham Bell, Scott Dooley, Lori Meiskey, Judith A. Meskill, Randall Sparks, and Donna Tellam. Proc. ACM CHI'97 Conf.

    [13] Floyd, C. (1987). Outline of a paradigm change in software engineering. In G. Bjerknes, P. Ehn, and M.Kyng, (Eds.), Computers and democracy: A Scandinavian challenge. Brookfield VT: Gower.

    [14] John Whiteside, John Bennett, and Karen Holtzblatt. Usability engineering: Our experience and evolution. In Martin Helander, editor, Handbook of Human-Computer Interaction. North Holland, Amsterdam, 1988.

    [15] Gould, J. D., and Lewis, C. H. (1985). Designing for usability: Key principles and what designers think. Communications of the ACM 28, 3 (March), 300-311.

    [16] Job, M. (1989). Air Disaster vol. 2

    [17] Digital Image Design - picture of EF2000 flight simulator

    [18] Computer-Related Incidents with Commercial Aircraft - compiled by Peter Ladkin

    [19] Alan Dix, Janet Finlay, Gregory Abowd, Russell Beale, Human-Computer Interaction, Prentice Hall, 1993, p 170

    [20] J.P. Potocki de Montalk. Head of Airbus Cockpit/Avionic Engineering at Airbus. Article in Microprocessors and Microsystems.

    [21] The Economist, June 4-10 1994, p92