Workshop on
Human Error and
Systems Development
19th-22nd March 1997, Glasgow University, Scotland
Editor: Chris Johnson, Glasgow Accident Analysis Group, Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ. http://www.dcs.gla.ac.uk/research/gaag
GAAG TR-97-2
Contents
T. Gerdsmeier, P. Ladkin, K. Loer,
Analysing the Cali Accident With a WB-Graph
L. Love and C. Johnson,
Accident Fault Trees
P.R. Croll, C. Chambers and M. Bowell,
A Study of Incidents Involving Programmable Electronic Systems
C. Burns, C. Johnson and M. Thomas,
Accident Analysis and Action Logic
Iain Carrick,
Incorporating Human Factors into Safety Systems In Nuclear Reactors
T.W. van der Schaaf,
Prevention and Recovery of Errors in System Software
S. Dekker, B. Fields, P. Wright, M. Harrison,
Human Error Recontextualised
Marie-Odile Bes,
Analysis of a Human-Error In A Dynamic Environment: The Case of Air Traffic Control
Heather Allen and Marcey Abate,
Development of Aviation Safety Decision Support Systems
S. Viller, J. Bowers and T. Rodden,
Human Factors in Requirements Engineering
L. Emmet, R. Bloomfield, S. Viller, J. Bowers,
PERE: Process Improvement through the Integration of Mechanistic and
Human Factors Analyses
N. Maiden, S. Minocha, M. Ryan, K. Hutchings, K. Manning,
A Co-operative Scenario-Based Approach to the Acquisition and Validation
of Systems Requirements
P. Sherman, W.E. Hines, R.L. Helmreich,
The Risks of Automation: Some Lessons From Aviation and Implications for
Training and Design
F. Vanderhaegen and C. Iani,
Human Error Analysis to Guide System Design
N. Leveson, J. Reese, S. Koga, L.D. Pinnel, S.D. Sandys
Analysing Requirements Specifications for Mode Confusion Errors
P. Palanque and R. Bastide,
Embedding Modelling of Errors in Specifications
Analysing the Cali Accident With a WB-Graph
Thorsten Gerdsmeier, Peter Ladkin and Karsten Loer,
RVS, Technische Fakultät, Universität Bielefeld, Germany.
{thorsten | ladkin | karlo}@rvs.uni-bielefeld.de
This work is dedicated to the memory of Paris Kanellakis, who died with his family in this accident
We analyse the Cali accident to American Airlines Flight 965, a Boeing 757, on 20 Dec 1995. We take the events and states from the accident report, construct a WB-graph (`Why?...Because...'-graph) of the 59 events and states in both textual and graphical form, and compare this representation favorably with the Findings section of the original report. We conclude that the WB-graph is a useful structure for representing explanations of accidents.
THE ACCIDENT TO N651AA NEAR CALI ON 20 DEC 1995
The accident aircraft, an American Airlines Boeing 757-223, hit mountainous terrain while attempting to perform a GPWS escape manoeuvre, about 10 miles east of where it was supposed to be on the instrument arrival path to Cali Runway 19. Approaching from the north, the crew had been expecting to use Runway 1, the same asphalt but the reciprocal direction, which would require flying past the airport and turning back, the usual procedure. They were offered, and accepted, a `straight-in' arrival and approach to Rwy 19, giving them less time and therefore requiring an expedited descent. The crew were not familiar with the ROZO One arrival they were given, became confused over the clearance, and spent time trying to program the FMC (Flight Management Computer) to fly the clearance they thought they had been given. A confusion over two navigation beacons in the area with the same identifier and frequency led to the aircraft turning left away from the arrival path, a departure not noticed by the crew for 90 seconds. When they noticed, they chose to fly `inbound heading', that is, parallel to their cleared path. However, they had not arrested the descent and were in mountainous terrain. Continued descent took them into a mountain, and the GPWS (Ground Proximity Warning System) sounded. The escape manoeuvre was executed imprecisely, with the speedbrakes left out, as the aircraft flew to impact. The US National Transportation Safety Board believes that had the manoeuvre been executed precisely, the aircraft could possibly have cleared the terrain.
The aircraft should never have been so far off course, so low. The accident has been of great interest to aviation human factors experts. It was the first fatal accident to a B757 in 13 years of exemplary service.
We analyse the Cali accident sequence, using the system states and events noted in the accident report (1). We employ the WB-graph method as used in (2) (earlier called the causal hypergraph method in (3), (4)).
Our causal analysis compares interestingly with the statements of probable cause and contributing factors in the report.
THE CALI REPORT
The report concludes (p57):
3.2 Probable Cause
Aeronautica Civil determines that the probable causes of this accident were:
1. The flightcrew's failure to adequately plan and execute the approach to runway 19 at SKCL and their inadequate use of automation.
2. Failure of the flightcrew to discontinue the approach into Cali, despite numerous cues alerting them of the inadvisability of continuing the approach.
3. The lack of situational awareness of the flightcrew regarding vertical navigation, proximity to terrain, and the relative location of critical radio aids.
4. Failure of the flightcrew to revert to basic radio navigation at the time when the FMS-assisted navigation became confusing and demanded an excessive workload in a critical phase of the flight.
3.3 Contributing Factors
Contributing to the cause of the accident were:
1. The flightcrew's ongoing efforts to expedite their approach and landing in order to avoid potential delays.
2. The flightcrew's execution of the GPWS escape maneuver while the speedbrakes remained deployed.
3. FMS logic that dropped all intermediate fixes from the display(s) in the event of execution of a direct routing.
4. FMS-generated navigational information that used a different naming convention from that published in navigational charts.
It is interesting to note that the probable causes are all stated as failures or a lack, that is, an absence of some (needed) action or competence. These are descriptions of persisting state. However, the accident itself is an event and, however one draws the line between states and events, the causal history of an event cannot normally consist of that one event alone.
An event is normally explained by the values of system state variables along with certain prior events. We may therefore suspect that the statement of probable cause in the report is logically inadequate because it is (at the least) incomplete. This suspicion may be substantiated by observing that all four `probable causes' would have been true even if the aircraft had successfully executed the GPWS escape manoeuvre and landed safely later at Cali, or had the faulty left turn away from the cleared airspace not been executed. A set of probable causes that allows the possibility that the accident would not occur is necessarily incomplete as a causal explanation.
In contrast to the four probable causes and four contributing factors of the report, the WB-graph contains 55 causally-relevant events and states mentioned in the report. The statement of probable causes and contributing factors is not intended to represent all causally-relevant events and states. However, we know of no generally-accepted logic-based methodology for discriminating `important' causally-necessary factors from `less important' causally-necessary factors. All share the logical property that, had they not occurred, the accident would not have happened. We believe it aids understanding to display all such causally-necessary factors and their logical interrelations. The WB-graph, in both its graphical and its textual forms, does that.
THE CALI WB-GRAPH AS FORMATTED TEXT
We use an ontology of (partial system) states and events as described in (5). The sequence of events and states used in the graph are those mentioned in the Cali accident report, with one exception. As discussed in (6), the cockpit voice recorder transcript shows that the crew asked for confirmation of a clearance that it was impossible to fly. The controller said `Affirmative', thus (falsely) confirming a clearance he knew to be confused and impossible to fly. The report mentions that the controller felt he was not able to explain to the pilots that they were confused. This was attributed to `cultural differences'. The NTSB recommendations (7) also mention fluency training in Aviation English for non-native speakers. As argued in (6), we take the semantics of ATC/pilot English literally, call the affirmation a mistake, and explain this event by the officially-suggested `cultural differences' and `lack of fluency' situations. Readers who disagree should be able easily to make the necessary modification to the graph.
We have found that a path-notation for the causally-relevant states and events is useful. We denote each explanandum as a sequence of digits, e.g., [1.2.1.1]. The explanans for [1.2.1.1] is subsequently written as a bulleted list: [-.1], <-.2>, etc., representing the conjunction of all the reasons why event [1.2.1.1] occurred. This is formatted in the form
[1.2.1.1] /\ [-.1]
/\ <-.2>
(We use the doubled symbol `-.' for readability, although one symbol would logically have sufficed.) These explanans nodes for [1.2.1.1] inherit the names [1.2.1.1.1] and <1.2.1.1.2>, respectively. As in (2), we use [...] to denote events, and <...> to denote true state predicates. State predicates are qualified with since (to denote an event after which these predicates remain true) and/or until (to denote an event before which the predicate has remained true), following the suggestion in (5). Two nodes are classified as both events and states, for reasons to be explained below. These nodes are <[1.2]> and <[1.2.1.3.1]>. We pretty-print the nodes according to the lengths of their names. We believe the notation becomes self-evident upon reading. Pilot behavioral failures are classified according to the sixfold classification scheme reproduced here as Appendix 1.
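The naming scheme is entirely mechanical. The following small sketch (our illustration only; the function names are not part of the WB-method) shows how explanans names are inherited from the explanandum, and how events and state predicates are written:

    # Explanans nodes inherit their names from the explanandum: the i-th reason
    # for node 1.2.1.1 is named 1.2.1.1.i.  Events are written [ ... ], true
    # state predicates < ... >, and event-states (see below) combine both, <[ ... ]>.

    def child_name(parent: str, i: int) -> str:
        """Name of the i-th explanans node of the node named `parent'."""
        return parent + "." + str(i)

    def render(name: str, is_event: bool) -> str:
        """Render a node name in the textual WB-graph notation."""
        return "[" + name + "]" if is_event else "<" + name + ">"

    assert child_name("1.2.1.1", 1) == "1.2.1.1.1"
    assert render("1.2.1.1.1", True) == "[1.2.1.1.1]"
    assert render("1.2.1.1.2", False) == "<1.2.1.1.2>"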
WHY BECAUSE DESCRIPTION
[1] AC impacts mountain
/\ <-.1> GPWS manoeuvre failed: since [1.1.1]
/\ <[-.2]> AC in mountainous terrain: since [1.2.1.2]
<1.1> /\ [-.1] GPWS manoeuvre initiated
/\ <-.2> AC did not exhibit optimal climb performance
/\ [-.3] AC very close to mountains @ [1.1.1]
[1.1.1] [-.1] GPWS warning sounds
[1.1.1.1] [-.1] AC dangerously close to terrain
[1.1.1.1.1] /\ [1.2.1.1]
/\ [1.2.1.3]
/\ <1.2.2.2>
<1.1.2> /\ <-.1> AC speedbrakes are extended: since [1.2.2.1.2]
/\ <-.2> AC performs non-optimal pitch manoeuvre
<1.1.2.1> /\ <-.1> CRW didn't retract speedbrakes according to procedure
(Action failure)
/\ <1.2.2.2.1>
<1.1.2.1.1> <-.1> CRW unaware of extended speedbrakes (Awareness failure)
<1.1.2.1.1.1> <-.1> CD displays speedbrakes-extended
<1.1.2.2> <-.1> PF doesn't hold optimal steady pitch attitude
(Action failure)
<1.1.3> /\ <1.2.1>
/\ <1.2.2>
<1.2> /\ <-.1> AC on wrong course/position (2D-planar): since [1.2.1.3.1.1]
/\ <-.2> AC flying too low for cleared airspace (3rdD): since [1.2.1.3.1.1]
<1.2.1> /\ [-.1] CRW turned to "inbound heading" at [1.2.1.3]
(Decision Failure:
Reasoning Failure)
/\ <-.2> CRW without situational awareness: since [1.2.1.3.1.1]
(Perception Failure)
/\ [-.3] AC arrived at (false) Position B: end of left turn
<1.2.1.2> /\ <-.1> CRW unfamiliar with ROZO One Arrival and Rwy 19 Approach
/\ <-.2> CRW high workload
/\ <-.3> CRW used procedural shortcuts
/\ [-.4] CRW request for confirmation of false clearance
twice confirmed by ATC
/\ [-.5] FMC erases intermediate waypoints @ [1.2.1.3.1.1]
<1.2.1.2.2> /\ <-.1> CRW must expedite arrival
/\ <-.2> lack of external visual reference
/\ <1.2.1.2.1>
<1.2.1.2.2.1> /\ <-.1> lack of time for executing arrival procedure
/\ [1.2.2.1.1]
<1.2.1.2.2.2> /\ <-.1> arrival takes place at night
/\ <-.2> few lighted areas on ground
[1.2.1.2.4] [-.1] ATC misuse of Aviation English
[1.2.1.2.4.1] /\ [-.1] discourse under cultural dependencies
/\ <-.2> ATC lack of fluency in English
/\ <-.3> ATC lack of knowledge of AC position
<1.2.1.2.4.1.2> <-.1> Colombian ATC Use-of-English
training/certification
<1.2.1.2.4.1.3> <-.1> no ATC radar coverage
[1.2.1.2.5] /\ <-.1> FMC design
/\ [1.2.1.3.1.1]
[1.2.1.3] /\ <[-.1]> AC left turn from true course for 90 seconds:
since [1.2.1.3.1.1]
/\ <-.2> CRW didn't notice left turn:
since [1.2.1.3.1.1]; until [1.2.1.3]
<[1.2.1.3.1]> /\ [-.1] PNF gives 'R' to FMC
/\ <-.2> FMC-database uses `R' to denote ROMEO
/\ <-.3> CRW didn't realize <1.2.1.3.1.2>: since [1.2.1.3.1.1]
(Perception Failure)
/\ <-.4> PNF didn't correctly verify FMS-entry: since [1.2.1.3.1.1]
(Action Failure)
[1.2.1.3.1.1] /\ <-.1> CRW believes 'R' denotes 'ROZO' in FMC
(Awareness Failure)
/\ [-.2] CRW decides to fly direct 'ROZO'
<1.2.1.3.1.1.1> <-.1> ID `R' and FREQ for ROZO on the approach plate correspond
with an FMC database entry
<-.2> ID/FREQ combination usually suffice to identify
uniquely an NDB within range
<1.2.1.3.1.1.1.1> <1.2.1.3.1.2.1> ARINC 424 Specification
<1.2.1.3.1.1.2> <1.2.1.2.1> CRW unfamiliar with ROZO One Arrival
and Rwy 19 Approach
<1.2.1.3.1.2> /\ <-.1> ARINC 424 Specification
/\ <-.2> Jeppesen FMC-database design
<1.2.1.3.1.3> /\ <-.1> FMC-displayed ID and FREQ valid for ROZO
/\ <-.2> CRW didn't perceive FMC-displayed Lat/Long
(Awareness Failure)
<1.2.1.3.1.3.1> <-.1> ROZO and ROMEO have same ID `R' and FREQ
<1.2.1.3.1.3.1.1> <-.1> Colombian government decision
<1.2.1.3.1.3.2> /\ <-.1> FMC display figures small
/\ <-.2> CRW not trained to check Lat/Long
/\ <1.2.1.2.2>
<1.2.1.3.1.4> /\ <1.2.1.3.1.3.1>
/\ <1.2.1.3.1.3.2>
<1.2.2> /\ [-.1] AC starts expedited descent from FL230
/\ <-.2> AC expedited-descent continuous: until [1.1.1]
[1.2.2.1] [-.1] CRW decision to accept Rwy 19 Approach
<1.2.2.2> /\ [-.1] CRW extends speedbrakes
/\ <-.2> CRW failed to arrest descent: until [1.1.1]
(Action Failure)
<1.2.2.2.2> <1.2.1.2> CRW without situational awareness: since [1.2.1.3.1.1]
Glossary:
AC Aircraft
ARINC ARINC, Inc.
ATC Air Traffic Control
CD Cockpit Display
Course Two-dimensional straight-line ground track
CRW Crew
FLxyz Flight Level xyz = Altitude at which altimeter reads
xyz00ft @ barometric setting 29.92"=1013hPa
FMC Flight Management Computer
FREQ (Navaid) radio Frequency
GPWS Ground Proximity Warning System
Heading Magnetic compass direction along which course is flown
ID (Navaid) Identifier (sequence of symbols)
Jeppesen Jeppesen-Sanderson, Inc.
Lat/Long Latitude and Longitude Values
Navaid Navigation Aid (radio beacon)
NDB Non-Directional Beacon (a navaid)
PF Pilot Flying
PNF Pilot Not Flying
ROMEO NDB near Bogota
ROZO NDB near Cali
Rwy xy Runway with heading xy0 degrees magnetic (to nearest 10 degrees)
We constructed the textual form as above and, while checking the construction, noticed that certain causal factors were missing: namely, [1.1.1.1.1] (AC dangerously close to mountain) had no causal forebears. We noticed this discrepancy by singling out the source nodes in the WB-graph; these nodes represent causal factors with no causal forebears. Intuitively [1.1.1.1.1] should have forebears, since the aircraft was in the three-dimensional position it was in (physically) because of persisting course (2D) and altitude (1D) states, which were in turn consequences of certain command actions. A persistent course state is a consequence of (i) a particular heading flown (ii) from a given position; a persistent descent state was commanded at a particular point. Hence we looked for these events/states.
Looking over the textual form again, we found that the causal forebears of [1.1.1.1.1] were already present in the graph. These reasons are: (course) [1.2.1.3], that the aircraft was at Position B; from whence [1.2.1.1] the crew turned to "inbound heading"; while <1.2.2.2> continuing their descent. We modified the textual description to include these three reasons for [1.1.1.1.1].
We also realised that reasons for <1.2.1.3.1.1.1> CRW believes 'R' denotes 'ROZO' in FMC database were given in the report and in the NTSB recommendations, respectively, but had not yet been included in the textual graph: namely <1.2.1.3.1.1.1.1> ID `R' and FREQ for ROZO on the approach plate correspond with an FMC database entry, and <1.2.1.3.1.1.1.2> ID/FREQ combination usually suffice to identify uniquely an NDB within range. <1.2.1.3.1.1.1.2> has a reason already in the textual graph, namely <1.2.1.3.1.2.1> ARINC 424 Specification.
We had already drawn the WB-graph (below), so we simply added the links, even though two links cross existing links, without attempting to make the graph planar (since we had one crossing link to begin with).
This experience confirmed our supposition that the textual form with path-numbering and pretty-printing is much easier to construct and check thoroughly than the graphical form of the WB-graph, in particular to check the correctness of the `Why...Because...' assertions themselves in terms of the counterfactual semantics; but that properties of the graphical form single out certain kinds of mistakes, such as source nodes (which represent the `original causes', as noted below) which should nevertheless have causal forebears.
We concluded that the textual and graphical forms are complementary, that they are both needed for checking, and that therefore our method should involve always constructing both.
AUTOMATED WB-GRAPH CONSTRUCTION AND CHECKING
As noted above, we are aware of the possibilities of error when generating a WB-graph by hand. The first author then wrote the graph in DATR, a pure inheritance language developed for phonological analysis in computational linguistics. A DATR theory (program) is a set of nodes, with defined attributes and values; queries (requests for values of attributes) are processed by evaluating the attributes. Attribute values may be aliased to an attribute of another node, and there are defaults for evaluation.
Each state/event in the WB-graph was written as a DATR node, with value being the description of the state/event. Attributes are the reasons (corresponding to the indented bulleted list by the node name in the textual form; and in the graph itself the arrows of which this node is head), and also the nodes for which this node is a reason (occurrences of the node name in a bulleted reason-list in the textual form; and in the graph the arrows of which this node is tail). The whole forms a simple DATR `theory' (8).
The DATR theory is thus written using only local information about each node: its value (the description), ancestors (immediate causal factors) and offspring (nodes of which it is an immediate causal factor). We take it as a principle that all event nodes must have at least one causal factor which is also an event, although states may have factors which are all states, or a mixture of states and events (see many papers in (9)). A DATR interpreter was used to run the following simple checks:
• does every node have at least one causal factor which is an event?
• is every node classified as exclusively either an event or a state?
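Purely as an illustration of the two checks above (the checking itself was done with a DATR interpreter over the theory described in (8)), they might be expressed over a simple node table as follows. The table layout, the toy graph fragment, and the decision to count an event-state as an event when it appears as a factor are assumptions of this sketch, not the DATR encoding itself.

    # A WB-graph fragment as a table: path-number -> (markers under which the
    # node was written anywhere in the textual form, immediate causal factors).
    # Toy fragment only, not the full Cali graph.
    GRAPH = {
        "1":     ({"event"},          ["1.1", "1.2"]),
        "1.1":   ({"state"},          ["1.1.1", "1.1.2", "1.1.3"]),
        "1.2":   ({"event", "state"}, ["1.2.1", "1.2.2"]),  # an event-state
        "1.1.1": ({"event"},          []),
        "1.1.2": ({"state"},          []),
        "1.1.3": ({"event"},          []),
        "1.2.1": ({"state"},          []),
        "1.2.2": ({"state"},          []),
    }

    def inconsistently_classified(graph):
        """Nodes written under more than one marker: either classification
        mistakes to be corrected by hand, or candidate event-states."""
        return [p for p, (kinds, _) in graph.items() if len(kinds) != 1]

    def events_without_event_factor(graph):
        """Non-source event nodes none of whose causal factors is an event.
        Event-states are counted as events here (one reading of the text)."""
        return [p for p, (kinds, reasons) in graph.items()
                if kinds == {"event"} and reasons
                and not any("event" in graph[r][0] for r in reasons)]

    print(inconsistently_classified(GRAPH))    # ['1.2']
    print(events_without_event_factor(GRAPH))  # []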
We found four event nodes whose causal factors were all states, one state node which was written mistakenly once as an event, and one event node which was mistakenly written as a state. This consistency condition has global consequences. If an event is once mistakenly written as a state, then all causal factors in its history need only be states; whereas in fact the event must have at least one event as factor, and that event must have at least one, and so forth. When the mistake is found, and the `state' rewritten as an event, a consistency check must be made on the entire history to make sure that each collection of factors for an event contains at least one event. Thus an error in miswriting an event node as a state, or an event node which has only states as causal factors, requires a consistency review on the entire subgraph `backwards' from this node. Such errors are therefore expensive.
The fact that all three of us had overlooked these simple and obvious inconsistencies in the `carefully checked' textual version, and the cost in time of correcting them, established firmly for us the value of using such automated help in generating the WB-graph. We recommend that DATR be used according to the method of (8) when generating WB-graphs of comparable or larger size.
EVENT/STATE AMBIGUITY
The intuitive semantics of the division into events and states is that an event represents an action, a once-only state change, and a state represents a persisting condition. At the `level of granularity' at which reasons are considered in accident reports such as Cali, it may sometimes be difficult to tell if a condition should be classified as an event or a state. This may have consequences for the application of the consistency condition in the last section.
For example, consider event [1], the accident event. Its causal factors are two states, thus superficially violating the consistency condition. The second factor, <[1.2]>, AC in mountainous terrain, is a state as described; but what in fact caused the impact is that the aircraft was in the position it was with the flight path that it had, and this flight path intersected with the mountain. Having a particular position at a particular time may be regarded as an event, since it is more-or-less instantaneous; but it is expressed logically as a state predicate - it is not an action. The AC flight path, which is an AC state predicate, along with the position-time event-state will ensure that, in the absence of other intervening events, other predictable position event-states will occur in the future.
Some causal factors such as <[1.2]> thus represent imprecise features of the flight which at this level of granularity may be classified as an event or a state. This affects application of the consistency condition above. We have thus chosen provisionally to classify them as event-states, denoted with the combined symbols, e.g., <[1.2]>, and to apply the consistency condition formally as for a pure state (that is, not at all).
There are precisely two such nodes in the Cali WB-graph:
<[1.2]> AC in mountainous terrain: since [1.2.1.2]
<[1.2.1.3.1]> AC left turn from true course for 90 seconds: since [1.2.1.3.1.1]
To emphasise that it is the `level of granularity' at which the reasons are expressed which engenders this event-state ambiguity, rather than any fundamental problem with the ontology or our method, we note that <[1.2]> is very closely related to [1.1.3], AC very close to mountains @ [1.1.1]. These position nodes are obviously not independent.
In the case of the Cali accident report, event-state ambiguity only occurs with position/flight path factors. One can extrapolate and suppose that this will happen with other accident explanations also. Thus we recommend that all position/flight path factors in accidents be examined to see whether they should be classified as event-states, as pure events, or as pure states. We do not know if there are other such specific features of accident explanations which require resolution as event-states.
It is intuitively obvious that a more detailed ontological analysis of flight path/position dynamics will obviate the need for event-states. Introduction of the relevant mathematics of dynamics, however, would in our opinion be `overdoing it' at this level of granularity: one does not need to know the precise physics in order to know that being too close to the mountains was a causal factor. But we feel it would be preferable to bring the dynamical theory and the ontology we use into a closer relation with practice: we are not satisfied with event-state introduction because
• it is intuitive: we give no guidance on when to classify factors as event-states, other than to say that it may be expected with position/flight path parameters;
• introducing an event-state relaxes the applicable consistency check on that event-state. We believe that in formal methods in general, the more thorough the consistency checking that can be applied, the better; we thus dislike weakening consistency checks, on principle.
Event-state introduction represents a feature of the WB-method which we wish to develop further, with a view to closer analysis and eventual elimination of the need for event-states in a WB-graph.
THE WB-GRAPH
Some attempt was made to construct a planar graph. There are two crossings in this WB-graph. We concluded that at least one was necessary (by cases, trying to eliminate it), so felt that two did not greatly lessen legibility. We attempted to make the graph planar by using the algorithm below. The algorithm involves calculating the `relative shape' of certain tree-subgraphs, and `laying them out'. This procedure is not exact, because aesthetic, readability and size criteria come into play, and these criteria may well be in conflict. Such a conflict can only be resolved on a case-by-case basis by prioritising the criteria. We indicate the relative-shape and layout techniques we have found useful, with the understanding that they can be, but need not be, followed.
The question might arise: why not use one of the planar-graph algorithms already in the literature? We have found that WB-graphs have roughly the form of a tree-with-links. In the present case, the number of link nodes is roughly a quarter of the number of nodes, and the number of links roughly half the number of link nodes. In other words, the links are relatively independent, and the crucial events and states have relatively independent causes. We are handling a real example; our technique is relatively simple to grasp, and it sufficed. Furthermore, we could use it `by hand' while preserving many of the characteristics of the layout; and even with corrections we only had two crossings, which remain completely readable. So we didn't see the need to use a more mathematically precise algorithm.
The graph construction proceeds as follows.
• The textual form is constructed as a tree, as above;
• Links to other nodes occur as leaves, as above;
• The graph containing just the links (the dashed edges in the WB-graph below) is drawn, the nodes labelled with their path-numbers;
• nodes labelled with the greatest common subsequences of the node labels are added (recursively);
• cycles were noted, and for each cycle a list of links from nodes exterior to the cycle to nodes interior to the cycle (and vice versa) was made;
• an attempt was made to planarise the graph by `inverting' cycles: writing the clockwise node-sequence as an anti-clockwise node sequence instead, whereupon some links from exterior to interior can become purely exterior links (and vice versa);
• `leaving enough space' to draw in the rest of the graph (which consists of pure trees), and drawing it in.
The `leaving enough space' technique consists in the following. The rest of the graph consists in a pure tree structure. Nodes in the link graph therefore have trees `hanging off' them. The relative shape of each `hanging' tree is calculated, and a decision is made (arbitrarily, that is, for aesthetic reasons) whether to `hang' the tree off the exterior or interior of any cycle in which the node partakes. The relative shapes of all tree structures in the interior of each cycle are laid out without overlapping, and the cycle is drawn outside this layout.
The relative shape calculation technique is as follows. Let one (space) unit be the length of a normal link between two nodes. A tree with n nodes of depth less than or equal to int(log2(n)) may be drawn in a triangle of height int(log2(n)) and base int(log2(n)) + 1, where int(x) denotes the greatest integer less than or equal to x. This is roughly the shape of the full binary tree with n nodes: it is our experience that any roughly `bushy' tree with n nodes can be fitted into such a shape without significantly affecting readability (not all nodes with a common parent will appear on the same level, so some links will need to be stretched). A `linear' tree with n nodes is roughly the shape of a chain of length n, that is, a rectangle with width one node and length n nodes (the link to the parent in the link graph is included in the shape). Such a rectangle can of course be `kinked'.
The `layout' algorithm consists in taking the `chains' and `bushes' and arranging them without overlapping as desired, kinking the chains as need be.
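As a sketch only (the function name and the dictionary form of the result are ours, and deciding which hanging trees count as `bushy' and which as `linear' remains a judgement call), the relative-shape rule of thumb can be written down as follows:

    import math

    def relative_shape(n_nodes, is_chain):
        """Relative shape of a `hanging' tree, in node-length units.  A roughly
        bushy tree with n nodes fits a triangle of height int(log2(n)) and base
        int(log2(n)) + 1; a linear tree is a 1-by-n chain (which may be kinked)."""
        if is_chain:
            return {"kind": "chain", "width": 1, "length": n_nodes}
        height = int(math.log2(n_nodes)) if n_nodes > 1 else 0
        return {"kind": "triangle", "height": height, "base": height + 1}

    print(relative_shape(7, is_chain=False))  # a 7-node bush fits a 2-by-3 triangle
    print(relative_shape(5, is_chain=True))   # a 5-node chain occupies a 1-by-5 strip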
There are 59 nodes in the WB-graph for the Cali accident, with 14 `links' between 22 nodes (as may be easily seen from the textual form). The `link graph' includes a few more nodes, but remains roughly half the size of the complete graph. The relative-shape algorithm is therefore easy to use, since the trees to be `hung' are relatively small. The final graph we drew has two crossings, one of them due to the two late corrections mentioned above. We did not think there was much point in trying to redraw the graph to see if we could eliminate the `new' crossing. Figure 1 presents the results of using this algorithm for the Cali accident sequence, with corrections.
Figure 1: WB-Graph for the Cali Accident
SOURCE NODES IN THE WB-GRAPH
In principle, source nodes (nodes with only outgoing edges, no incoming edges) represent reasons for the accident which have no further reasons lying behind them. They should thus represent the original reasons for the accident. Since the semantics of the WB-graph is that the node at the tail of an edge represents a necessary causal factor for the node at the head of the edge, source nodes represent necessary causal factors for the accident (`necessary causal factor' is a transitive binary relation, as noted in (10)) which themselves have no necessary causal factors mentioned in the report. Logically, therefore, the report regards these as contingencies, that is, events or states which need not have occurred, but whose conjunction was sufficient to ensure that the accident happened. These `original causes' are as follows. (Notice that being an `original cause' does not imply temporal priority - some original causes occur late in the accident event sequence.)
<1.1.2.2.1> PF doesn't hold optimal steady pitch attitude in GPWS manoeuvre
(Action failure)
<1.1.2.1.1.1> CRW unaware of extended speedbrakes in GPWS manoeuvre
[1.2.2.2.1] CRW extends speedbrakes for descent
[1.2.2.1.1] CRW decision to accept Rwy 19 Approach
<1.2.1.3.2> CRW didn't notice left turn caused by FMC:
since [1.2.1.3.1.1]; until [1.2.1.3]
<1.2.1.3.1.2.1> ARINC 424 Specification
<1.2.1.3.1.2.2> Jeppesen FMC-database design
<1.2.1.3.1.3.2.1> FMC display figures small
<1.2.1.3.1.3.2.2> CRW not trained to check Lat/Long on FMC
<1.2.1.3.1.3.1.1.1> Colombian government decision on beacon ID/FREQ
<1.2.1.3.1.1.1.1> ID `R' and FREQ for ROZO on the approach
plate correspond with an FMC database entry
<1.2.1.3.1.1.1.2> ID/FREQ combination usually suffice to identify
uniquely an NDB within range.
<1.2.1.2.5.1> FMC design
[1.2.1.1] CRW turned to "inbound heading" at [1.2.1.3] (Decision Failure:
Reasoning Failure)
<1.2.1.2.1> CRW unfamiliar with ROZO One Arrival and Rwy 19 Approach
<1.2.1.2.3> CRW used procedural shortcuts
<1.2.1.2.4.1.1> cultural dependencies in ATC/CRW discourse
<1.2.1.2.4.1.2.1> Colombian ATC Use-of-English training/certification
<1.2.1.2.4.1.3.1> no ATC radar coverage in Cali area
<1.2.1.2.2.1.1> lack of time for executing ROZO One arrival procedure
<1.2.1.2.2.2.1> arrival takes place at night
<1.2.1.2.2.2.2> few lighted areas on ground to provide visual reference
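Mechanically, this list can be read off a node table of the kind sketched earlier: the source nodes are exactly those nodes which appear as reasons for other nodes but are given no reasons of their own. A minimal sketch over a table mapping each path-number to its immediate causal factors (toy data only, not the actual tool):

    def source_nodes(graph):
        """Source nodes: nodes with no incoming edges, i.e. nodes for which the
        report offers no further reasons.  graph maps each path-number to the
        list of path-numbers of its immediate causal factors."""
        return sorted(p for p, reasons in graph.items() if not reasons)

    # Toy fragment only; run over the full Cali table this reproduces the
    # `original causes' listed above.
    graph = {
        "1": ["1.1", "1.2"],
        "1.1": ["1.1.1"],
        "1.2": ["1.2.1"],
        "1.1.1": [],
        "1.2.1": [],
    }
    print(source_nodes(graph))  # ['1.1.1', '1.2.1']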
CRITIQUE
We find that the source nodes correspond pretty closely to what intuitively one could take as `original causes' of the accident. This is not to say that the actions mentioned in this list were all unwarranted. For example, extending the speedbrakes and leaving them out was necessary to get down fast. But this in combination with the course change led to a fatal excursion out of protected airspace. One may observe that three of these source nodes, namely, the Colombian government decision on beacon ID/FREQ, the cultural dependencies in ATC/CRW discourse, and the Colombian ATC Use-of-English training/certification, were emphasised in the NTSB recommendations but not in the final report.
This list could be used as follows. Procedures could be developed that avoid this fatal combination of circumstances. Exactly which procedures is a matter for expert judgement. For example, maybe airlines should not fly into Cali at night (expert judgement would not in fact be likely to draw this conclusion from this accident alone, given the other combination of factors). Some of these circumstances are already legislated against (being unfamiliar with the approach; accepting an approach which one lacks the time to execute adequately; being unaware of extended speedbrakes). Avoidance is a matter for enhanced training, as is the non-optimal pitch profile flown in the GPWS manoeuvre. Enhanced ATC English and discourse training is also indicated. Pilot procedural modification is indicated: pilots should check Lat/Long. Technical modification is indicated: Enhanced GPWS, for example; maybe more perspicuous indication of Lat/Long on the FMC; maybe more perspicuous database notational standards; maybe modification of the ROZO ID/FREQ; maybe radar coverage in the Cali area. On the other hand, the NTSB pointed out that Cali is the only location they found worldwide in which the ID/FREQ combination does not suffice uniquely to identify an NDB within a radio reception area (7), which would point towards modification of the ROZO ID/FREQ as being an appropriate response. Many of these issues are explicitly addressed in the report recommendations and the NTSB recommendations. This gives us confidence that our formal approach is consistent with the judgement of experts, while enhancing the ability to check for completeness and consistency of an accident explanation.
We are somewhat concerned that an intuitively important component of the crew's cognitive state, namely
<1.2.1.3.1.1.1> CRW believes 'R' denotes 'ROZO' in FMC
does not occur in the `original cause' list. Our concern arises because, while one may inquire what encouraged them to hold this false belief (namely, that ID/FREQ usually suffices for unique identification, and that the ID/FREQ combination they chose corresponds with what is on the paper Approach plate), and while these reasons more-or-less completely explain the false belief, they do not vindicate the pilot behavior. They could have cross-checked better (the Lat/Long; being extra-aware whether the aircraft was turning away from course; gaining altitude while completing the cross-check). This case points out an important caveat.
It is important to realise that, in our formal approach, even if an event or state has reasons in the WB-graph, not all the reasons may be included. The WB-graph method does not (yet) incorporate a method for identifying the important causal events and state predicates. It takes those which have been identified by the experts. Although reasons for the crew's mistaken belief about 'R' denoting ROZO are given, some are missed out (as noted above). Procedurally, this mistaken belief could have been avoided by appropriate checking, and the checking that might be deemed appropriate might extend beyond the Lat/Long checking. In fact, the report observes that procedure dictated that a go-around should have been performed in this situation. We have not included a comparison with procedure in this particular application of the WB-graph method. Such a comparison is necessary, and a technique for performing it is used in the application of the WB-graph method to the `Oops' incident in (2).
One may further observe that paying attention to the source nodes alone might cause one to miss the wood for the trees. For example, responding directly to the night flight or lack of lighted objects on the ground, one might prohibit night flights into Cali, or light up the surrounding countryside. But these measures would not help during extensive cloudy weather. The crucial component here is reduced visibility, interior node <1.2.1.2.2.2> in the WB-graph. A more careful method for evaluating causal components of the accident, then, would also look upstream from the source nodes to identify more general themes, such as lack of visual reference, that could be addressed by legislation or training or other means.
We may conclude that identifying source nodes is an important component of accident analysis using the WB-graph, but that techniques for comparison with procedure are also needed; and that source nodes, while being the true `original causes', may themselves describe circumstances too specific to dictate appropriate avoidance responses: one must check also further up the WB-graph for the most appropriate description of circumstances for which to formulate an avoidance response.
DISCRIMINATING THE `SIGNIFICANT' EVENTS
The WB-graph method as presented here does not incorporate any mechanism to indicate the relative weight attached to events and states. However, in order reasonably to assess the accident, such weights must be given, as demonstrated clearly to us by Barry Strauch:
[....] the [WB-graph] methodology does not appear to give enough weight to how the crew's action in taking the controller's offer to land on 19 constrained their subsequent actions. [....] not all decisions are equal at the time they are made, [....] each decision alters the subsequent environment, but that while most alterations are relatively benign, some are not. In this accident, this particular decision altered the environment to what became the accident scenario. (11)
Intuitively, this decision led directly to the crew's high workload, and also, because of their unfamiliarity with the arrival and approach, to their loss of situational awareness, communication confusion, and lack of attention to indicators of their situation: in short, to most subsequent causal factors.
Strauch notes that, for example, because of this decision, the workload was such that the crew failed to look at the EHSI (electronic horizontal situation indicator), which clearly and continually indicated the event-state <[1.2.1.3.1]>, the continual left turn towards ROMEO, on a moving-map style display. The EHSI is one of two largish electronic displays in front of both pilots (the other displays physical flight parameters). Strauch notes that, because of this display,
[....] the interpretation of the effects of [the execution of [1.2.1.3.1.1] ] should have required almost no cognitive effort. This is, in fact, one of the substantial advances of "glass cockpit" aircraft over older ones. As a result, regardless of the considerable effort required to verify R through the lat/long coordinates in the CDU, the EHSI presentation of the projected flight path displayed the turn. (11)
Thus the high workload was such that even very easy cognitive tasks were significantly impaired, which is not indicated in state <1.2.1.3.2>, the relevant Awareness Failure. The point of accident analysis is to determine what changes could be made in the future to avoid similar events and situations. The force of Strauch's point is that the crew were not just highly-loaded, but in `cognitive overload'. Since they were in cognitive overload, modifying training requirements to emphasise, for example, paying more careful attention to the EHSI, would not help avoid a repeat; probably this could not cognitively have been accomplished - who knows? The appropriate action is to emphasise decision-making methods that avoid the crew putting themselves in a situation of cognitive overload, and that get them out of such situations quickly if they feel themselves entering one. A basis for determining this difference in prophylactic action must be given by any complete accident analysis method. The WB-graph method thus requires a means of identifying such significant events as the acceptance of the ROZO One-Rwy 19 arrival and approach, as Strauch suggests. How may we do this?
The decision to accept the ROZO One-Rwy 19 arrival and approach altered the goal of all subsequent actions, and thus in many cases those actions themselves. In the formal ontology used in the WB-graph method, a behavior is a sequence of exactly interleaved states and events: state-event-state-event-... and so forth. Some of these behaviors fulfil the requirements (to land safely and normally on Rwy 19) and some of them (the accident sequence, for example) do not. At the point at which the ROZO One-Rwy 19 was accepted, the past behavior of the aircraft formed a definite, finite state-event-state-event-... sequence, which would be completed by one of a large number (possibly infinite, depending on how deep into the analysis one goes) of possible future behaviors. At this point, then, the state-event-state-event sequence looks like a sequence in the past and a tree in the future. (This is the semantics of, for example, the temporal logic CTL used in verification of concurrent algorithms, and a similar structure to the many-worlds interpretation of quantum mechanics.)
Immediately before the ROZO One-Rwy 19 acceptance, the future behaviors satisfying the goal of the flight all consist in a safe landing on Rwy 1; immediately after the acceptance, they all consist in a safe landing on Rwy 19. Not only are these sets of future behaviors disjoint (there is no behavior which belongs to both future trees), but they are radically disjoint - most of the events and states occurring along a future branch of the Rwy 1-tree would not occur along a future branch of the Rwy 19-tree. This radical disjointness property is precisely that which formally corresponds to `altering the environment' of the flight, in Strauch's words. The formal problem is then to find some logically and computationally sufficient means of assessing actions for the determination of `radical disjointness' of their future trees. We do not solve this problem in this paper.
A COMPARISON WITH THE CALI CONCLUSIONS
We tabulate and compare the conclusions of the Cali report with the WB-graph analysis. The conclusions of the Cali report may be found in Appendix 2.
Findings 2, 13-14, 16-18 are outside the scope of the WB-graph. They concern the general procedural environment in which aviation is conducted, whereas the WB-graph concerns itself only with the immediate actions and states in the time interval during which the accident sequence occurs. Finding 1, on the other hand, consists of the pro forma statement that the pilots were trained and properly qualified, conjoined with a statement that they suffered no behavioral or physiological impairment. The latter conjunct is in the domain of reference of the WB-graph - it states that a condition pertained which, had it not pertained, could have helped explain the accident (and thus altered the form of the WB-graph). As far as the WB-graph is concerned, it is of the same form as the assertion that all aircraft systems, indeed the aircraft itself, worked as designed and intended. This assertion is a state predicate which remains true for the entire accident sequence, and clearly has causal consequences: had the systems malfunctioned somehow, the WB-graph would have looked different. However, when using the Lewis semantics for counterfactuals to evaluate the edges in the WB-graph, the `nearest possible worlds in which ... is not true' are always those in which the systems functioned normally and the pilots suffered no impairment. We choose not to complicate the representation of the WB-graph by including these `environmental assertions', but that should not be taken to imply that we do not consider them causally relevant.
Finding 6 is a consequence of the crew's regulatory-procedural environment. It is a general requirement on flight crew in the US, Western Europe and other ICAO countries that the report judges was not adhered to in the Cali incident. However, it is not directly causal, like the violation of other procedural requirements, and does not appear explicitly in the graph, while nevertheless being related to <1.2.1.2.3>, that the crew used procedural shortcuts. It is, of course, important for explaining an accident that certain normative requirements were not adhered to, because the purpose of explaining an accident is to determine what may be changed in the future to prevent a repetition of similar incidents. If regulations were broken, that indicates that appropriate regulatory safeguards were already in place. We have suggested a method for identifying and including conflicts with normative requirements in (2), but don't apply it here, partly because the method is not yet fully developed and we feel that the Cali case is a more complex application which we prefer to address later.
The correspondence of the report's other findings, 3-5, 7-12 and 15, with items in the WB-graph is as follows:
Report Finding WB-graph entry
3 [1.2.2.1.1]
4 <1.2.1.2.2.1>
5 <1.2.1.2.2.1.1>
7 <1.2.1.2.1.2>, <1.2.1.3.1.1.1>, <1.2.1.3.1.2.2>, <1.2.1.3.1.3.1>
8 <1.2.1.3.1.3>, <1.2.1.3.1.1.1>
9 [1.2.1.3.1.1], <1.2.1.3.1>, <1.2.1.3.1.4>, <1.2.1.2.3>
10 <1.2.1.3.1>, [1.2.1.1]
11 <1.2.2.2>
12 <1.1.2.1.1.1>
15 [1.2.1.2.4]
We remark that Finding 15 and [1.2.1.2.4] correspond - because they are in contradiction! We noted above, however, that the causes of [1.2.1.2.4] are explicitly addressed by the NTSB Recommendations (7). We conclude that the report and the NTSB Recommendations do not concur on this event or its causal factors, and we chose to follow the NTSB view, as explained earlier and as argued by the second author (prior to the NTSB Recommendations) in (6).
With nearly sixty nodes, only sixteen of which correspond to the report's findings (of which there are only 10 pertinent to causally-relevant states or events in the domain of the graph), we conclude that the WB-graph yields a more thorough classification of the causally-relevant findings of the Cali accident investigation commission than the Findings Section (3.1) of the report.
CONCLUSIONS
We have analysed the causal explanatory relations between the events and states listed in the Cali accident report (1). We identified 59 causally-relevant and -necessary factors, and constructed the WB (`explanatory') relation between them. We represented the result in textual, then graphical form. We found it easier to construct the WB-graph in this fashion.
We found that the list of `source nodes', a conjunction of necessary and sufficient causes for the accident that were themselves regarded as contingent, is a fairly accurate indication of the causes, but should be used as guidance, and not uncritically, in formulating statements of cause and contributory factor. The WB-method does not yet include any method for weighing the relative importance of causal factors, so may not be used alone for distinguishing probable cause from contributory factor, or for assessing the comparative global significance of actions such as the decision to accept the ROZO One-Rwy 19 arrival and approach.
The WB-graph method is based on application of a rigorous logical criterion of explanation applied to the events and states identified by domain experts as crucial to the accident. The result is a data structure, the WB-graph, expressed in two forms, textual and graphical, each with their own analytic advantages. Both structures are manageable, as we have demonstrated on a real example. However, automated help, such as that provided by implementation in DATR, is highly recommended, both to avoid local errors and to save the resources required to determine and correct their global consequences. Furthermore, the graph represents 59 states and events noted by the Cali accident investigation commission and the NTSB as being causally-relevant. In contrast, the report's Findings section lists only 16 of these (roughly a quarter), corresponding to 10 explicit findings.
We believe our results demonstrate the usefulness of the WB-method in event analysis.
Acknowledgements
We are very grateful to Barry Strauch, Chief of Human Factors at the US National Transportation Safety Board, for his detailed and insightful commentary on the first version of this paper, which is particularly visible in the section Discriminating `Significant' Events. The paper has been much improved thereby.
We also thank the referees of the Human Error and Systems Development Workshop in Glasgow, 19-22 March 1997, where this paper was given, for their helpful comments.
References
(1): Aeronautica Civil of The Republic of Colombia Aircraft Accident Report: Controlled Flight Into Terrain, American Airlines Flight 965, Boeing 757-223, N651AA, Near Cali, Colombia, December 20, 1995. Santafe de Bogota, D.C.-Colombia. Also available at http://www.rvs.uni-bielefeld.de.
(2): E. A. Palmer and P. B. Ladkin, Analysing an `Oops' Incident, in preparation, to be available at http://www.rvs.uni-bielefeld.de
(3): P. B. Ladkin, The X-31 and A320 Warsaw Crashes: Whodunnit?, Technical Report 96-08, RVS Group, Faculty of Technology, University of Bielefeld, available at http://www.rvs.uni-bielefeld.de, January 1996.
(4): P. B. Ladkin, Reasons and Causes, Technical Report 96-09, RVS Group, Faculty of Technology, University of Bielefeld, available at http://www.rvs.uni-bielefeld.de, January 1996.
(5): P. B. Ladkin, Explaining Failure With Tense Logic, Technical Report 96-13, RVS Group, Faculty of Technology, University of Bielefeld, available at http://www.rvs.uni-bielefeld.de, September 1996.
(6): D. Gibbon and P. B. Ladkin, Comments on Confusing Conversation at Cali, Technical Report 96-10, RVS Group, Faculty of Technology, University of Bielefeld, available at http://www.rvs.uni-bielefeld.de, February 1996.
(7): US National Transportation Safety Board, Safety Recommendation (including A-96-90 through A-96-106), October 16, 1996. Also available at http://www.rvs.uni-bielefeld.de.
(8): T. Gerdsmeier, A Tool for Building and Analysing WB-Graphs, Technical Report RVS-RR-97-02, RVS Group, Faculty of Technology, University of Bielefeld, available at http://www.rvs.uni-bielefeld.de, February 1997.
(9): Ernest Sosa and Michael Tooley, eds., Causation, Oxford Readings in Philosophy Series, Oxford University Press, 1993.
(10): David Lewis, Causation, Journal of Philosophy 70, 1973, 556-567. Also in (9), 193-204.
(11): B. Strauch, private communication, January 1997.
Appendix 1: Analysis of Pilot Behavior
To elucidate the pilots' actions, we use an extended information-processing model, in which for a given system state, a pilot's interaction with the system is considered to form a sequence:
perception-attention-reasoning-decision-intention-action
This sequence reads as follows.
• perception: An annunciation of the system state is presented to the pilot;
• attention: the pilot notices the annunciation;
• reasoning: figures out what are the possible actions to take;
• decision: decides on an action;
• intention: forms the intention to carry it through;
• action: and finally carries it out.
At least such a fine-grained decomposition of pilot behavior is needed for incident narratives. Failures can occur and have occurred at any stage in this sequence. Examples are:
• During the A330 flight test accident in Toulouse in 1994, an annunciation of the autopilot mode change was not displayed to the pilots, because the angle of attack of the aircraft was higher than 25°. This was cited as a contributing factor in the DGA report.
• In the incident we analyse in this report, the pilot flying (PF) failed to notice that the altitude capture mode was no longer armed.
• In the B757 accident off Puerto Plata, Dominican Republic in 1996, the captain chose to switch on the center autopilot, after concluding that his air data was faulty. The center autopilot obtains its air data from the captain's air data system.
• During the B757 accident off Lima, Peru in 1996, the pilots had lost all effective air data, presumably related to the fact that the left-side static ports were covered with masking tape which had not been removed as the aircraft was returned to service after cleaning. During the incident, the pilot asked for altitude data from Lima Tower, who reported indicating 9,000ft. The PF's AI was apparently reading similarly. He took a calculated risk to begin a descent, and impacted the ocean since his true altitude was a few feet above sea level. His AI read 9,500ft on impact.
• In GLOC (G-induced loss of consciousness) incidents, pilots who regain consciousness are reportedly unable to form the intention to recover an aircraft obviously heading for ground impact.
• The test pilot of the A330 let the departure from normal flight develop, presumably to obtain test data, and initiated recovery too late to avoid ground impact; in the B757 Puerto Plata accident, the crew were unable to take effective action during stick-shaker warnings, allowed the aircraft to stall, and could not recover the stall.
Appendix 2:
Section 3. Conclusions (from (1))
3. 1 Findings
1. The pilots were trained and properly certified to conduct the flight. Neither was experiencing behavioral or physiological impairment at the time of the accident.
2. American Airlines provided training in flying in South America that provided flightcrews with adequate information regarding the hazards unique to operating there.
3. The AA965 flightcrew accepted the offer by the Cali approach controller to land on runway 19 at SKCL.
4. The flightcrew expressed concern about possible delays and accepted an offer to expedite their approach into Cali.
5. The flightcrew had insufficient time to prepare for the approach to runway 19 before beginning the approach.
6. The flightcrew failed to discontinue the approach despite their confusion regarding elements of the approach and numerous cues indicating the inadvisability of continuing the approach.
7. Numerous important differences existed between the display of identical navigation data on approach charts and on FMS-generated displays, despite the fact that the same supplier provided AA with the navigational data.
8. The AA965 flightcrew was not informed or aware of the fact that the "R" identifier that appeared on the approach (Rozo) did not correspond to the "R" identifier (Romeo) that they entered and executed as an FMS command.
9. One of the AA965 pilots selected a direct course to the Romeo NDB believing that it was the Rozo NDB, and upon executing the selection in the FMS permitted a turn of the airplane towards Romeo, without having verified that it was the correct selection and without having first obtained approval of the other pilot, contrary to AA's procedures.
10. The incorrect FMS entry led to the airplane departing the inbound course to Cali and turning it towards the City of Bogota. The subsequent turn to intercept the extended centerline of runway 19 led to the turn towards high terrain.
11. The descent was continuous from FL 230 until the crash.
12. Neither pilot recognized that the speedbrakes were extended during the GPWS escape maneuver, due to the lack of clues available to alert them about the extended condition.
13. Considering the remote, mountainous terrain, the search and rescue response was timely and effective.
14. Although five passengers initially survived, this is considered a non survivable accident due to the destruction of the cabin.
15. The Cali approach controller followed applicable ICAO and Colombian air traffic control rules and did not contribute to the cause of the accident.
16. The FAA did not conduct the oversight of AA flightcrews operating into South America according to the provisions of ICAO document 8335, parts 9.4 and 9.6.33.
17. AA training policies do not include provision for keeping pilots' flight training records, which indicate any details of pilot performance.
18. AA includes the GPWS escape maneuver under section 13 of the Flight Instrument Chapter of the Boeing 757 Flight Operations Manual and Boeing Commercial Airplane Group has placed the description of this maneuver in the Non Normal Procedures section of their Flight Operations Manual.
3.2 Probable Cause
Aeronautica Civil determines that the probable causes of this accident were:
1. The flightcrew's failure to adequately plan and execute the approach to runway 19 at SKCL and their inadequate use of automation.
2. Failure of the flightcrew to discontinue the approach into Cali, despite numerous cues alerting them of the inadvisability of continuing the approach.
3. The lack of situational awareness of the flightcrew regarding vertical navigation, proximity to terrain, and the relative location of critical radio aids.
4. Failure of the flightcrew to revert to basic radio navigation at the time when the FMS-assisted navigation became confusing and demanded an excessive workload in a critical phase of the flight.
3.3 Contributing Factors
Contributing to the cause of the accident were:
1. The flightcrew's ongoing efforts to expedite their approach and landing in order to avoid potential delays.
2. The flightcrew's execution of the GPWS escape maneuver while the speedbrakes remained deployed.
3. FMS logic that dropped all intermediate fixes from the display(s) in the event of execution of a direct routing.
4. FMS-generated navigational information that used a different naming convention from that published in navigational charts.
Accident Fault Trees
Lorna Love and Chris Johnson
Glasgow Accident Analysis Group,
Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK.
http://www.dcs.gla.ac.uk/~{love,johnson}
Computers are increasingly being embedded within safety systems. As a result, a number of accidents have been caused by complex interactions between operator 'error' and system 'failure'. Accident reports help to ensure that these 'failures' do not threaten other applications. Unfortunately, a number of usability problems limit the effectiveness of these documents. Each section is, typically, drafted by a different expert; forensic scientists follow metallurgists, human factors experts follow meteorologists. In consequence, it can be difficult for readers to form a coherent account of an accident. This paper argues that fault trees can be used to present a clear and concise overview of major failures. Unfortunately, fault trees have a number of limitations. For instance, they do not represent time. This is significant because temporal properties have a profound impact upon the course of human-computer interaction. Similarly, they do not represent the criticality or severity of a failure. We have, therefore, extended the fault tree notation to represent traces of interaction during major failures. The resulting Accident Fault Tree (AFT) diagrams can be used in conjunction with an official accident report to better visualise the course of an accident. The Clapham Junction railway disaster is used to illustrate our argument.
1. INTRODUCTION
Accident reports are intended to ensure that human 'error' and systems 'failures' do not threaten the safety of other applications. Unfortunately, these documents suffer from a range of usability problems (Johnson, McCarthy and Wright, 1995). Each section of the report is, typically, compiled by experts from different domains; systems engineering reports follow metallurgical analyses, software engineering reports follow the findings of structural engineers; human factors enquiries follow meteorological reports. This structure can prevent readers from gaining a coherent overview of the way that hardware and software 'failures' exacerbate operator 'errors' during major accidents (Norman, 1990). The following pages argue that graphical fault trees can be used to avoid these limitations. Readers can use these diagrams to gain an overview of an accident without becoming 'bogged down' in the mass of contextual detail that must be presented in the official report. These structures increase the accessibility and salience that Green (1991) and Gilmore (1991) identify as being important cognitive dimensions for notations which are intended to represent interactive systems.
2. THE CASE STUDY
The Clapham Junction railway accident report (Department of Transport, 1989) is used to illustrate our argument. On the morning of Monday the 12th of September, 1988, a wiring error led to a series of faults in the signalling system just south of Clapham Junction railway station in London. A crowded commuter train ran into the rear of a stationary train. The impact of this collision forced the first train to veer to its right and strike a third, oncoming train. Five hundred people were injured, thirty-five of them fatally and sixty-nine seriously. This accident provides a suitable case study because it typifies the ways in which human interaction with the underlying safety applications can cause or exacerbate system 'failures' (Reason, 1990). In this accident, human 'error' and organisational 'failure' led to a wiring error in the signalling system. This error, in turn, provided drivers with false indications about the state of the railway network.
3. ALTERNATIVE APPROACHES
A number of alternative techniques might be recruited to describe the interaction between human 'error' and system 'failures' in accident reports.
3.1 Petri Nets
Figure 1 shows how a Petri net can represent the events leading up to the Clapham railway accident. The filled-in circles represent tokens. These 'mark' places, the unfilled circles, which represent assertions about the state of the system. In this diagram, a place is marked to indicate that Mr Hemmingway introduced a hardware 'fault' by leaving two wires connected at full on fuse R12-107. If all of the places leading to a transition, denoted by the rectangles, are marked then that transition can fire. In this example, the transition labelled 'The five drivers preceding the collision train do not realise that the irregularity of the signals they have passed was due to a signalling failure' can fire. All of the output places from this transition will then be marked. This would then mark the place denoting the fact that the five drivers preceding the collision train did not report a signalling failure.
Figure 1: Petri net representing the events leading up to the Clapham accident
There are a number of limitations that complicate the application of Petri nets to analyse accidents that involve interactive systems. In particular, they do not capture temporal information. Various modifications have been applied to the classic model. Levi and Agrawala (1990) use 'time augmented' Petri nets to introduce the concept of 'proving safety in the presence of time'. Unfortunately, even if someone can understand the complex firings of a 'time augmented' Petri net, they may not be able to comprehend the underlying mathematical formulae that must be used if diagrams, such as Figure 1, are to be used to analyse human 'error' and system 'failure' (Palanque and Bastide, 1995).
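To make the marking and firing rule concrete, the following minimal Python sketch models a fragment of Figure 1. It is an illustration only: the place and transition names are our own abbreviations of the Clapham events, and a practical analysis would use a dedicated Petri net tool.

    # Minimal Petri net sketch: places hold tokens; a transition fires only
    # when every input place is marked, consuming those tokens and marking
    # every output place.

    class PetriNet:
        def __init__(self):
            self.marking = {}          # place name -> token count
            self.transitions = {}      # name -> (input places, output places)

        def add_place(self, name, tokens=0):
            self.marking[name] = tokens

        def add_transition(self, name, inputs, outputs):
            self.transitions[name] = (inputs, outputs)

        def enabled(self, name):
            inputs, _ = self.transitions[name]
            return all(self.marking[p] > 0 for p in inputs)

        def fire(self, name):
            inputs, outputs = self.transitions[name]
            assert self.enabled(name), "transition not enabled"
            for p in inputs:
                self.marking[p] -= 1
            for p in outputs:
                self.marking[p] += 1

    # Illustrative fragment of the Clapham net (names are our own abbreviations).
    net = PetriNet()
    net.add_place("wires_left_connected_on_fuse_R12_107", tokens=1)
    net.add_place("drivers_unaware_of_signal_failure", tokens=1)
    net.add_place("signal_failure_not_reported")
    net.add_transition("drivers_do_not_realise_failure",
                       inputs=["wires_left_connected_on_fuse_R12_107",
                               "drivers_unaware_of_signal_failure"],
                       outputs=["signal_failure_not_reported"])

    if net.enabled("drivers_do_not_realise_failure"):
        net.fire("drivers_do_not_realise_failure")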
3.2 Cause-Consequence Diagrams
Cause-Consequence Analysis was developed by Nielsen in the 1970s. The causes of a critical event are determined using a top-down search strategy. The consequences that could result from the critical event are then worked out using a forward search technique. Gates describe the relations between causal events. Figure 2 shows a Cause-Consequence diagram for the hardware problem that led to the system failure in the Clapham accident. Mr Hemmingway's concentration was interrupted as he worked on fuse R12-107. It can be argued that such diagrams illustrate the consequences of such problems in a more tractable format than the many pages of natural language description that are presented in most accident reports.
Figure 2: Cause-Consequence diagram of one aspect of the Clapham accident.
In Cause-Consequence analysis, separate diagrams are required for each critical event. Unfortunately, in an accident there may be dozens of contributory factors and so several diagrams will be required. For instance, in the Clapham accident, other diagrams would be required to represent the causes and consequences of bad working practices, limits on safety budgets and the events on the day of the accident. Such characteristics frustrate the application of these diagrams to represent and reason about the complex interaction between human and system failure during major accidents.
3.3 Fault Trees
Fault-trees provide a relatively simple graphical notation based around circuit diagrams. For example, Figure 3 presents the syntax recommended by the U.S. Nuclear Regulatory Commission's 'Fault Tree Handbook' (Vesely, Goldberg, Roberts and Haasl, 1981).
Figure 3: Fault tree components
Fault trees are, typically, used pre hoc to analyse potential errors in a design. They have not been widely used to support post hoc accident analysis. They do, however, offer considerable benefits for this purpose. The leaves of the tree can be used to represent the initial causes of the accident (Leplat, 1987). The symbols in Figure 3 can be used to represent the ways in which those causes combine. For example, the combination of operator mistakes and hardware/software failures might be represented using an AND gate. Conversely, a lack of evidence about user behaviour or system performance might be represented using the OR/XOR gates. Basic events can be used to represent the phenotypical failures that lead to an accident (Hollnagel, 1993). Intermediate events can represent the operator 'mistakes' that frequently exacerbate system failures. An undeveloped event is a fault event that is not developed further, either because it is of insufficient consequence or because information is unavailable. This provides a means of increasing the salience of information in the notation (Gilmore, 1991). Less salient events need not be developed to greater levels of detail.
There are a range of important differences that distinguish the use of accident fault trees from their more conventional application. Fault trees are constructed from events and gates. However, many accidents are caused because an event did not take place (Reason, 1990). These errors of omission, rather than errors of commission, typify a large number of operator 'failures'. Figure 4 illustrates the way in which fault-trees can be used to represent these errors of omission; Mr Hemmingway failed to perform a wire count, and Mr Hemmingway's boss failed to perform an independent wire count.
Figure 4: An example of a fault tree representing part of the Clapham accident
Further differences between conventional fault trees and accident fault trees arise from the semantics of the gates that are used to construct the diagrams. Conventionally, the output from an AND gate is true if and only if all of its inputs are true. Accidents cannot be analysed in this way. For example, Figure 4 shows that the hardware error was the result of six events. In a 'traditional' fault tree the error would have been prevented if interface designers or systems engineers had stopped any one of these events from happening. In accident analysis, however, there is no means of knowing if an accident would actually have been avoided in this way. Most accident reports do not distinguish between necessary and sufficient conditions. An accident may still have occurred even if only one or two of the initiating events occurred. In this context, therefore, an AND gate represents the fact that an accident report cites a number of initiating events as contributing to the output event. No inferences can be made about the outcome of an AND gate if any of the initiating events do not hold.
The output of an OR gate is true if and only if at least one of its inputs is true. An OR gate can be used in an accident fault tree to represent a lack of evidence. Evidence can be removed accidentally or deliberately from an accident scene. Alternatively, evidence may be missing because the person holding the information died in the accident. For example, in the Clapham accident, we do not know if Driver Rolls actually noticed the irregularity of the signals he passed. The output of an XOR (exclusive OR) gate is true if and only if exactly one of its inputs is true. XOR gates are useful in accident fault trees when we know that an intermediate event was caused by one of two events, but not both. Figure 5 shows how an OR gate can be used to represent two reasons why Driver Rolls reduced his speed: either he was concerned about the behaviour of the signalling system or he saw the train ahead of him brake. It also illustrates the use of an XOR gate. There was no testing plan for the signalling system in this area because either a key official ignored his responsibilities or he was not aware that he was responsible for this task.
Figure 5: Illustration of the use of OR and XOR gates in the context of an accident
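One way of making these revised gate semantics precise is to evaluate gates over three values, so that nothing is inferred when the report leaves an input unestablished. The Python sketch below is our own illustration of that reading and is not part of the AFT notation itself.

    # Three-valued gate evaluation for accident fault trees: True, False,
    # or None (unknown / not established by the report).

    def aft_and(inputs):
        # The report cites all inputs as contributing; if any input is unknown
        # or does not hold, no inference is made about the output.
        if all(v is True for v in inputs):
            return True
        return None

    def aft_or(inputs):
        # At least one of the inputs held; used where evidence is missing.
        if any(v is True for v in inputs):
            return True
        if all(v is False for v in inputs):
            return False
        return None

    def aft_xor(inputs):
        # Exactly one of the inputs held, but we may not know which.
        known_true = [v for v in inputs if v is True]
        if len(known_true) == 1 and all(v is False for v in inputs if v is not True):
            return True
        return None

    # Example: two candidate reasons why Driver Rolls reduced his speed.
    print(aft_or([None, True]))    # True  - the braking train ahead was observed
    print(aft_and([True, None]))   # None  - no inference without all inputs established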
4. ACCIDENT FAULT TREES (AFT diagrams)
The previous section identified some differences between the conventional application of fault trees to the design of safety-critical systems and their use in accident analysis for interactive systems. These differences could be supported by relatively simple changes to the interpretation of the notation. This section builds on the previous work by proposing a number of syntactic extensions.
4.1 Introducing Page References
Previous sections have argued that fault-trees provide a complementary notation which can be used in conjunction with conventional accident reports. The results of an initial usability test with accident analysts indicated that the standard notation did not support cross-referencing between the tree and the original document. Figure 6, therefore, shows how the events in a fault tree can be annotated with paragraph numbers. Each number refers to the paragraph of the accident report from which the information in the node is taken. At first sight, this may appear to be a trivial change. However, it is important to emphasise that a fault tree represents an abstraction of the events that are recorded in an official report. As such, it emphasises some aspects of an accident while abstracting away from others. It is, therefore, vital that other members of investigation teams can challenge the sequences of events as they are recorded in any fault-tree. By requiring supporting references, analysts are forced to justify their interpretation of critical events in the interaction between a system and its operator (Johnson, 1996).
Figure 6: Grounding AFT Diagrams In A Report
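Within a tool, the cross-reference can simply be carried as an attribute of each event node. The following sketch is illustrative; the field names are ours, and the paragraph reference shown is a placeholder rather than a citation of the Clapham report.

    # Event node annotated with the paragraph of the official report that
    # supports it, so the abstraction can be challenged against the source.

    from dataclasses import dataclass

    @dataclass
    class AFTEvent:
        description: str
        report_para: str        # reference into the official report
        gate: str = None        # "AND", "OR", "XOR" or None for a basic event
        children: tuple = ()

    wire_count = AFTEvent("Mr Hemmingway failed to perform a wire count",
                          report_para="(illustrative placeholder reference)")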
4.2 Representing Post-Accident Sequences
Fault trees typically stop at the 'undesired event'. In accident reports, events after the accident are important too. These typically include the operator actions that are taken to mitigate the effects of system failure. For example, the ambulance service played an important part in the Clapham accident. Although their actions did not cause the accident, they contributed to the saving of lives and reduced the consequences of the accident. It is, therefore, important to extend fault trees to include post-accident events. Figure 7 illustrates this approach. The rooted AFT explicitly frames the accident: branches spread out both above and below it, with the accident at the centre of the tree. The roots below the centre represent the factors influencing the accident; the leaves above the centre specify the actions taken and the subsequent events following the accident.
Figure 7: Extract from the Clapham fault tree showing after-accident events
4.3 Introducing Time
Temporal properties can have a profound impact upon the course of human-computer interaction. Delays in system responses can lead to frustration and error. Conversely, rapid feedback from monitoring applications can stretch an operator's ability to filter information during critical tasks (Johnson, 1996). Figure 8 illustrates the PRIORITY-AND gate that has been proposed by the U.S. Nuclear Regulatory Commission to capture temporal properties of interaction (Vesely, Goldberg, Roberts and Haasl, 1981). Sequential constraints are shown inside an ellipse drawn to the right of the gate. The gate event is not true unless the ordering is followed.
Figure 8: The PRIORITY-AND gate.
Unfortunately, there are a number of limitations with this approach. In particular, real time is not supported. This is significant because precise timings can have a critical impact upon an operator's ability to respond to a critical incident. We have, therefore, extended the fault tree notation to include real time. It is important to note, however, that it is not always possible or desirable to associate an exact time with all of the events leading to an accident. For instance, Figure 9 only provides approximate timings. Given the limited evidence in the aftermath of an accident, it is unlikely that operators will be able to recall the exact second in which they did or did not respond to a system failure.
Figure 9: Extract from Clapham fault tree illustrating relative time orderings
A limitation with the approach shown in Figure 9 is that it does not account for the inconsistencies that may arise in any accident reporting process. Experience in applying AFT diagrams has shown that witnesses may frequently report different timings for key operator 'errors' or system 'failures'. In order to address such uncertainty, Figure 10 illustrates an annotation technique that we have used to explain potential contradictions in a timing analysis. This technique has proved particularly useful as it provides a focus for the detailed investigation of the timing evidence that is presented in a conventional accident report.
Figure 10: Extract from Clapham fault tree illustrating conflicting timings
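Approximate and conflicting timings can be recorded by attaching a time interval to each event together with the values offered by individual witnesses. The sketch below shows one way of doing this bookkeeping; all of the times are invented and serve only to illustrate the data structure.

    # Attach an approximate time interval to each event, plus any conflicting
    # timings reported by different witnesses. All times here are invented
    # purely to illustrate the data structure.

    from dataclasses import dataclass, field

    @dataclass
    class TimedEvent:
        description: str
        earliest: str                 # "HH:MM", approximate lower bound
        latest: str                   # "HH:MM", approximate upper bound
        witness_timings: dict = field(default_factory=dict)

        def conflicting(self):
            # A reported timing conflicts if it falls outside the accepted interval.
            return {w: t for w, t in self.witness_timings.items()
                    if not (self.earliest <= t <= self.latest)}

    ev = TimedEvent("Driver reduces speed (illustrative event)",
                    earliest="08:10", latest="08:12",
                    witness_timings={"witness A": "08:11", "witness B": "08:20"})
    print(ev.conflicting())   # {'witness B': '08:20'}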
4.4 Introducing Criticality
Many existing fault-trees fail to represent the criticality of an event. This is surprising because different faults carry different consequences for the continued operation of an interactive system. For example, keystroke errors may only have a marginal impact whilst more deep-seated mode confusion can have catastrophic consequences. Figure 11 illustrates a graphical extension to the fault-tree notation that can be used to represent criticality. A negligible failure leads to a loss of function that has no effect on the system. A marginal failure degrades the system but will not cause the system to be unavailable. A critical failure completely degrades system performance. A catastrophic fault produces severe consequences that can involve injuries or fatalities. It should be noted that we are currently evaluating a range of alternative presentation formats for these symbols.
Figure 11: Weighted fault tree nodes
Figure 12 illustrates the application of this extension. The failure of the signalling system was a catastrophic event. The failure of the five preceding drivers to report the irregularity of the signals was a marginal 'error'. Such reports could not have prevented the accident if it had happened to the first train. It is important to emphasise that the categorisation is a subjective assessment. What is important is not whether the reader agrees with our particular assessment, but that the diagram makes the categorisation explicit. Too often these assessments are left as implicit judgements within the natural language of an accident report. As a result, accidents have occurred because companies and regulatory organisations have disagreed about the criticality of the events described in conventional documents (Johnson, 1996).
Figure 12: Extract from Clapham AFT illustrating weighting properties
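If AFT nodes are held as data, the severity weighting reduces to a small enumerated attribute on each node. The encoding below is one possible sketch of the four categories described above.

    from enum import Enum

    class Criticality(Enum):
        NEGLIGIBLE = 1    # loss of function with no effect on the system
        MARGINAL = 2      # degrades the system without making it unavailable
        CRITICAL = 3      # completely degrades system performance
        CATASTROPHIC = 4  # severe consequences, possibly injuries or fatalities

    signal_failure = {"description": "Failure of the signalling system",
                      "criticality": Criticality.CATASTROPHIC}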
6. FURTHER WORK AND CONCLUSIONS
An increasing reliance upon computer-controlled safety systems has led to a number of accidents which were caused by a complex interaction between operator 'error' and system 'failure'. Accident reports help to ensure that these 'failures' do not threaten other applications. This paper has argued that fault trees can be used to support natural language accident reports. They provide an overview of the human factors 'errors' and system 'failures' that contribute to major accidents. Unfortunately, existing approaches do not capture the temporal information that can have a profound impact upon system operators. They do not capture the importance that particular failures have for the course of an accident. They only represent contributory causes and not post-accident events. We have, therefore, introduced an extended fault tree notation that avoids all of these limitations.
Much work remains to be done. Brevity has prevented us from providing empirical evidence that AFTs improve the usability of existing accident reports. We have, however, conducted a range of evaluations (Love, 1997). Initial results from these trials indicate that our extended notation can improve both the speed of access to specific material about an accident and can improve the overall comprehension of accident investigations. It is important to emphasise that the evaluation of AFTs is a non-trivial task. Accident analysts have little time to spare for experimental investigations. There are further methodological problems. For instance, it is difficult to recreate the many diverse contexts of use that characterise the application of accident reports. Finally, there are many reasons why evaluations should focus upon the long term effects of improved documentation rather than the short-term changes that are assessed using conventional evaluation procedures from the field of HCI. It may be many weeks after reading a report that engineers need to cross-reference a fact in it (Johnson, 1996). Further work intends to build upon research into the psychology of programming to determine whether it is possible to test for these long term effects through the improvement of documentation. For instance, Green has argued that structure maps can be used to analyse the cognitive dimensions of complementary notations (Green, 1991). This approach has not previously been applied to the graphical and textual notations that have been developed to represent human 'error' and system 'failure' during major accidents.
Brevity has also prevented a detailed discussion of tool support for AFT diagrams. We are developing a number of browsers that use the graphical representations to index into the pages of conventional accident reports. Many questions remain to be answered. In particular, it is unclear whether such tools can support multiple views of an accident without hiding the overall flow of events leading to major failures. Human factors analysts typically focus upon different areas of a tree than systems engineers. It is difficult to support such alternative perspectives and at the same time clearly show the interaction between systems 'failure' and operator 'error'. One possible solution would be to exploit the pseudo-3D modelling techniques provided by VRML.
ACKNOWLEDGEMENTS
Thanks go to members of the Glasgow Accident Analysis Group and the Glasgow Interactive Systems Group. This work is supported by UK Engineering and Physical Sciences Research Council Grant No. GR/K55042.
REFERENCES
Department of Transport. Investigation into the Clapham Junction Railway Accident. Her Majesty's Stationery Office. London, United Kingdom, 1989.
D.J. Gilmore, Visibility: A Dimensional Analysis, In D. Diaper and N. Hammond, People and Computers VI, Cambridge University Press, Cambridge, 317-329, 1991.
T.R.G. Green, Describing Information Artefacts with Cognitive Dimensions and Structure Maps. In D. Diaper and N. Hammond, People and Computers VI, Cambridge University Press, Cambridge, 297-315, 1991.
E. Hollnagel, The Phenotype Of Erroneous Actions, International Journal Of Man-Machine Studies, 39:1-32, 1993.
C.W. Johnson, Documenting The Design Of Safety-Critical User Interfaces, Interacting With Computers, (8)3:221-239, 1996.
C.W. Johnson, J. C. McCarthy and P.C. Wright. Using a Formal Language to Support Natural Language in Accident Reports. In Ergonomics(38):6, 1265 - 1283, 1995.
L. Love, Assessing the Usability of Accident Fault Trees, in press, 1997.
J. Leplat. Accidents and Incidents Production: Methods of Analysis. In J. Rasmussen, K. Duncan and J. Leplat (eds.), New Technology and Human Error. John Wiley and Sons Ltd, 1987.
S. Levi and A. Agrawala. Real Time System Design . McGraw-Hill International Editions, 1990.
D.A. Norman, The 'Problem' With Automation : Inappropriate Feedback And Interaction Not 'Over-automation'. In D.E. Broadbent, J. Reason and A. Baddeley, Human Factors In Hazardous Situations, 137-145, Clarendon Press, Oxford, United Kingdom, 1990.
P. Palanque and R. Bastide. Formal Specification and Verification of CSCW Using The Interactive Co-operative Object Formalism. In M.A.R. Kirby, A.J. Dix and J.E. Finlay (eds.), People and Computers X, 213-232, Cambridge University Press, Cambridge, 1995.
J. Reason, Human Error, Cambridge University Press, Cambridge, United Kingdom, 1990.
W. E. Vesely, F. F. Goldberg, N. H. Roberts, D. F. Haasl. Fault Tree Handbook. U.S. Nuclear Regulatory Commission, 1981.
A Study of Incidents Involving Electrical/Electronic/Programmable Electronic Safety-Related Systems
P.R. Croll*, C. Chambers*, M. Bowell**
*Correct Systems Research Group, Department of Computer Science,
The University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP.
e-mail: {p.croll, c.chambers}@dcs.shef.ac.uk
** Control and Instrumentation Section, Engineering Control Science Group,
Health and Safety Laboratory, Broad Lane, Sheffield, S3 7HQ.
e-mail: mark.bowell@hsl.gov.uk
This paper presents a study of 21 incidents in small manufacturing enterprises involving electrical/electronic/programmable electronic (E/E/PE) safety-related systems, typically implemented using the Programmable Electronic Systems (PES) commonly used in manufacturing industry for safety-related systems. These incidents were originally investigated by the Health and Safety Laboratory (HSL). The aim of this study is to highlight the causes of these incidents and to find common solutions to those causes. A fault schema suitable for the classification of incidents of this nature is proposed, based on what are deemed to be the major contributory causes of those incidents. The results of the incident study are then presented: for each incident, the identified faults are labelled according to the fault schema and are also denoted as primary, secondary or incidental, to aid the analysis of each incident. Examples of particular causes are given to indicate how and why a particular incident was classified in that way. The most prominent faults are discussed further, with the goals of highlighting the categories of faults most prominent in E/E/PE safety-related systems and of pinpointing the areas on which future work on incident prevention should focus. Finally, mitigation techniques are suggested which could form part of an E/E/PE safety-related system development methodology suitable for small manufacturing enterprises.
INTRODUCTION
A study has been conducted of 21 incidents that involved electrical/electronic/programmable electronic (E/E/PE) systems, typically incorporating programmable logic controllers (PLCs), performing safety-related functions. The incidents were originally investigated by the Health and Safety Laboratory (HSL), the research agency of the UK government's Health and Safety Executive. The aim of this study was to highlight the causes of these incidents and find common solutions to those causes. HSL staff have indicated that the range and type of the incidents studied here typify reported incidents in the manufacturing industry involving E/E/PE safety-related systems. Problems found in this study are similar to those found in other industrial studies reported in the literature [HSE 95, NEU 95].
The incidents reported in this study are necessarily anonymous for legal reasons. This work is based on the incident reports produced by investigating officers working for HSL. These reports include all the information considered pertinent by the investigating officers to the incident under investigation, although information regarding some aspects of an investigation is not always available after the incident: eyewitness accounts, data regarding commissioning tests, or even blocks of PLC program code may be unavailable, or may have been modified or 'inadvertently' deleted before they can be examined. In several of the incidents considered here the information is incomplete, and much of the work of the investigating officer has therefore been to identify all of the probable scenarios that could have led to the incident. Each report also includes specific recommendations for future safe operation with respect to the particular incident investigated.
CLASSIFICATION METHOD AND DEFINITIONS
Classification of incidents involving control systems is often made in terms of what is deemed to be the major contributory causes of those incidents; this is useful for highlighting areas of concern in the development of safety-related systems. Two classification methods were considered for this study:
a) allocating errors and failures to the phase of the safety life cycle (for example, as in the draft IEC 1508 standard) where the root cause(s) of the incident originated, as in [HSE 95];
b) grouping directly by cause, without referring explicitly to a life cycle phase, as in Neumann's book [NEU 95].
The first method has the disadvantage of being less intuitive than the second method, with confusion arising when faults originate in one life cycle phase but can be detected in another. Consequently, a modified version of the second method was used, where the incidents investigated dictated the fault classification method (see table 1). This is not a generic fault classification schema, because only a relatively small number of incidents were analysed. It could form the basis of a generic classification but this would need to be validated through more, possibly larger, studies.
Table 1: Classification of incidents and suggested mitigation methods.

Requirements
  a) Inadequate system definition
  b) Inadequate safety requirements specification
  Example mitigation methods: structured requirements capture; use of hazard and risk analysis techniques, e.g. CHAZOP.

Hardware
  a) Design
  b) Random failure
  c) Interface or communications malfunction
  Example mitigation methods: considered and well executed installation and commissioning plan; maintenance and inspection quality procedures; following the safety life cycle; use of fault tolerant designs.

Software
  a) Design
  b) Coding error
  c) CPU bug
  Example mitigation methods: rigorous use of structured software design methods; application of well thought out test cases.

System use
  a) Unsafe system use and operation
  b) Inadvertent mistake
  c) Deliberate bypass of correct operating procedures
  Example mitigation methods: safety culture; tamper-proof backup systems; fail-safe system design; strict enforcement of safe operational procedures via software where possible.

Maintenance
  a) Corrective maintenance
  b) Adaptive maintenance
  c) Perfective maintenance
  Example mitigation methods: total quality management, including adaptation of relevant safety standards, guidelines and maintenance procedures.

Environment
  a) Extreme operating conditions
  Example mitigation methods: use of hazard identification techniques early in the design.

Below is an explanation of the fault classes, as derived from the incident study.
Requirements:
a) Inadequate system definition: Faults due to an apparent lack of understanding of the system, how the system interacts with its environment and/or with its operators.
b) Inadequate safety requirements specification: faults due to an inadequate regard for functional safety requirements, for safety integrity requirements, or in some incidents both; in particular, a lack of safety features such as hardwired trips and strategically placed emergency stop buttons.
Hardware:
a) Design faults: faults due to hardware design or implementation, especially problems concerning the layout or positioning of, for example, machine guards, emergency stop buttons, or sensors carrying out safety-related functions.
b) Random faults: Faults caused by deterioration of hardware components that have failed at random times, for example, due to wear and tear or variations in manufacturing quality.
c) Interface or communications faults: faults resulting from problems with the communications interface between the computer system and hardware components such as actuators and sensors (these failures could be due to hardware or software).
Software:
a) Design faults: faults due to the software design and/or method of implementation, including logical or arithmetic errors, race conditions or imprecision.
b) Coding errors: faults due to syntactic or typographical errors.
c) CPU bug: a logical fault in the design or manufacture of the CPU.
System Use:
a) Unsafe use and operation: the (possibly unintentional) failure of operators to follow safe operating procedures and practices (given that such procedures are in place), or the operation of equipment in a manner not intended by the equipment supplier.
b) Inadvertent mistake: the inadvertent or accidental incorrect operation of equipment.
c) Deliberate bypass of correct operating procedures: hazardous conditions arising from operators or maintenance personnel deliberately bypassing safe operating procedures in the course of their work.
Maintenance:
a) Corrective faults: introduced while attempting to remove existing errors from the system.
b) Adaptive faults: introduced while modifying the system to satisfy new requirements.
c) Perfective faults: introduced while attempting to find better ways of implementing existing functions.
Environmental faults:
a) Extreme operating conditions: faults caused by the equipment under control (EUC) being situated in close proximity to a source of extreme temperature, such as a furnace, without sufficient thermal insulation.
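Once written down, the schema can be applied mechanically to new incident data. The Python sketch below encodes Table 1 as a lookup structure and classifies an invented example incident; none of the marks correspond to the HSL incidents analysed in this paper.

    # The fault schema of Table 1 as a lookup structure, plus a record for
    # classifying one incident. The example incident is invented.

    FAULT_SCHEMA = {
        "Requirements": {"a": "Inadequate system definition",
                         "b": "Inadequate safety requirements specification"},
        "Hardware":     {"a": "Design", "b": "Random failure",
                         "c": "Interface or communications malfunction"},
        "Software":     {"a": "Design", "b": "Coding error", "c": "CPU bug"},
        "System use":   {"a": "Unsafe system use and operation",
                         "b": "Inadvertent mistake",
                         "c": "Deliberate bypass of correct operating procedures"},
        "Maintenance":  {"a": "Corrective", "b": "Adaptive", "c": "Perfective"},
        "Environment":  {"a": "Extreme operating conditions"},
    }

    # Each identified fault is marked as the primary (immediate) cause, a
    # secondary (contributory) cause, or an incidental fault.
    example_incident = {
        ("Requirements", "b"): "secondary",
        ("Hardware", "a"): "primary",
        ("System use", "c"): "incidental",
    }

    for (category, sub), role in example_incident.items():
        print(f"{role:10s} {category}: {FAULT_SCHEMA[category][sub]}")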
RESULTS
The causes of the 21 incidents in this study are shown in table 2. Where there is more than one apparent cause of an incident, the actual (primary) cause is denoted by "n" in table 2. Secondary causes that contributed significantly to the incident ("contributory causes") are denoted by "•"; although they were not the final cause of the incident, if any one of them had been mitigated the incident probably would not have happened. Incidental faults, found in some systems but not apparently responsible for the incident, are denoted by "O".

It can be argued that all of the individual causes of incidents involving computer controlled systems could have been mitigated given the incorporation of sufficient safety features in the system design, e.g. software and hardware redundancy (fault tolerance), fail-safe design methods and independent hard-wired trips. Hence all incidents can be said to have been caused by safety requirements omissions. However, in the case of the incidents studied here, only "obvious" omissions are recorded as such.
Figure 1 shows incident causes by category, as a percentage of the total number of primary and secondary causes. The fact that the divisor is the number of causes, rather than the number of incidents, might seem to give incidents with a large number of causes a disproportionate effect on the overall percentages. In fact, if causes are weighted to compensate for this, the difference in the percentage results is no more than 2% for any one cause.
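The weighting check described above is straightforward to reproduce: percentages can be computed either over all cited causes or with each incident's causes down-weighted so that every incident contributes equally. The figures in the sketch below are invented and serve only to show the calculation.

    # Compare per-category percentages computed over all causes with a
    # version in which each incident's causes are weighted to sum to 1.
    # The incident data here are invented, not the 21 HSL incidents.

    from collections import Counter

    incidents = [
        ["Requirements", "Hardware"],
        ["Requirements", "System use", "Maintenance"],
        ["Hardware"],
    ]

    unweighted = Counter(c for causes in incidents for c in causes)
    total = sum(unweighted.values())

    weighted = Counter()
    for causes in incidents:
        for c in causes:
            weighted[c] += 1.0 / len(causes)   # each incident contributes 1 in total

    for category in unweighted:
        pct = 100.0 * unweighted[category] / total
        wpct = 100.0 * weighted[category] / len(incidents)
        print(f"{category:12s} {pct:5.1f}%  vs  {wpct:5.1f}% (weighted)")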
ANALYSIS AND EXPLANATION OF RESULTS
Requirements stage
40% of the faults contributing to the incidents (i.e. primary and secondary faults) in this study were in this category. Many of the incidents were caused by a lack of understanding of the systems involved, and in several cases by a total lack of safety considerations. In 13 of the 21 incidents studied, deficiencies in the safety requirements capture stage contributed directly to the incident. For example, in more than one incident single-channel safety systems were in place, and several of these were easily bypassed by operators, leaving them with no protection. One of the more common deficiencies encountered was an inadequate level of sensing: in one incident, for example, access to a guillotine blade from either side of the machine was not detected by sensors, and in another it was possible for the operator to work under a light curtain from the front of the machine. In these cases a hazard analysis should have highlighted the potential hazard; indeed, machines of this nature have been in regular use for some time. Independent safety-related protection systems are beneficial since they are less dependent on complete requirements knowledge.
Hardware faults
The faults noted here (26% of the total contributory faults) consisted mainly of hardware design and random faults. In the case of hardware design faults, typical causes of incidents were badly positioned guards, or sensors partially obscured by other physical components of the system. In one particular incident, power to a computer controlled lift was removed when the lift was on the fifth floor to allow corrective maintenance. Unfortunately, the on-board rechargeable battery failed causing the loss of positional data stored in battery backed ram, so that upon power up of the lift,
and in the absence of actual positional data, the lift was assumed to be on the ground floor. Hence, when a command was received to travel up several floors, the lift overran the top floor and only stopped because the counter-weight hit its safety end stop, causing a secondary safety system to be activated. Basic hazard analysis techniques should have highlighted these problems. Hardware faults can be mitigated by the use of fault tolerant design methods, and software can be used to detect and warn the operator about certain common types of hardware faults or put the system into a safe state.

Table 2: Apparent causes of incidents investigated by HSL. For each of the 21 incidents, the table records which of the fault sub-categories of Table 1 (Requirements a-b; Hardware a-c; Software a-c; System use a-c; Maintenance a-c; Environment a) were identified.
Key: n = primary (immediate) cause; • = secondary cause; O = incidental fault; ? = possible unconfirmed primary cause.
Software faults
Only 11% of the contributory faults were caused by software faults. In some cases errors in program layout caused control variables to be updated at an incorrect time in the control cycle, causing an incorrect PLC output. In one particular incident a "bug" in the CPU was suspected of causing a guard to fail, but the investigator could not be certain; indeed, this conclusion was reached by eliminating all other reasonable possibilities. Clearly in such cases there are many possible explanations for this type of fault, including the possibility of it being a transient fault, and in the absence of sufficient evidence one has to make an informed guess as to the exact cause. In another incident, mistakes in timing calculations, and in the algorithms used to stop a machine safely, contributed to the incident. One particular incident involved code written for a PLC where some blocks of the code were written in relay ladder logic while other blocks were written in instruction lists. In both cases the code was poorly commented and consequently the functionality of the program was difficult to understand. In this study there were no incidents caused in full or part by apparent compiler errors.
Software faults can be reduced by using higher level programming languages (for example those specified in IEC 1131-3 [IEC 93]), rather than the very common relay ladder logic and/or instruction list, both of which are known to have deficiencies [LEW 96].
The low percentage of software faults indicates that the design and coding of software is a comparatively minor problem for computer controlled plant. However, the amount of software code used to control individual items of manufacturing equipment and associated safety systems is relatively small in the cases investigated, and so the software is considerably easier to verify compared with, say, that needed for the Darlington nuclear generating station in Canada (2,000 lines of code for the emergency shutdown software [LEV 95, STO 96, PAR 91]).
Maintenance
Only 6% of contributory faults were caused by problems associated with maintenance. In one incident, for example, a new safety guard was added after the equipment had been in use for some time, and the operation of the new guard did not match the machine it was fitted to. Another case involved various safety features being disabled to allow easy access to equipment to carry out corrective maintenance. In another incident, the brakes on a machine were mis-adjusted during corrective maintenance; this turned out to be the main cause of the incident.
Note that this category does not include faults which arose out of a lack of maintenance; these are classified as hardware design and random faults. As an example of the multiple causes of incidents, in the lift hardware fault described above, had the rechargeable battery been checked as part of the maintenance schedule before power-up the incident would have been avoided, but this is not classified as a maintenance fault.
Environmental
None of the incidents looked at here were directly caused by positioning any part of the equipment under control too close to a source of extreme temperature or any other potential environmental hazard.
System Use
System use comprised 17% of all contributory causes. The main problem here was the failure of equipment operators to follow correct or safe operating procedures, or the deliberate bypassing of safety features, with the general aim of making a particular machine more productive or easier to operate. Safety features were also often disabled to allow maintenance staff access to the machine.
The issue of system (mis)use is recognised in IEC 1508 [IEC 95], which contains the following requirement (where EUC is the equipment under control):
"The hazards and hazardous events of the EUC and its control system shall be identified under all reasonably foreseeable circumstances (including fault conditions and reasonably foreseeable misuse). This shall include all relevant human factor issues, and shall give particular attention to abnormal or infrequent modes of operation of the EUC."
Figure 2 shows the number of occurrences per sub-category of just the primary and contributory causes, making it easier to identify the most important distinct fault categories.
DISCUSSION
Nearly all incidents have several causes, as found here and also in [BRA 94], so it is more beneficial to examine general problem areas that appear to be the most significant contributors rather than any one cause in isolation.
We must also consider the reasons behind errors as well as the form they take. For example, sometimes the design of a system encouraged maintenance to take place in an unsafe manner, such as maintenance personnel being able to work on a machine with the power still switched on. In one case a machine operator saw a way of increasing output by bypassing the limited safety features present on the machine. Therefore, it is not always realistic to blame the incidents on operator error, because we then do not consider what allowed the operator to take such shortcuts. This should have been covered by appropriate hazard analysis.
Figure 2: Number of occurrences per sub-category of primary and contributory causes.

Industry is likely to benefit from the adoption of a thorough, systematic, risk-based approach to systems design and development, engendered by a general company-wide safety culture. A risk-based design methodology would ideally consider human aspects, hardware failures and maintenance problems, and would have had the effect of avoiding many of the incidents investigated in this study. However, there appears to be a general lack of experience in hazard analysis techniques, suggesting inappropriate levels of competence among both the project managers and the engineers responsible for systems design.
Consequently, small manufacturing enterprises need to be educated in the need for risk assessment and hazard analysis. There also needs to be a mechanism for ensuring adherence to relevant standards and guidelines. Appropriate standards and guidelines must exist in the first place. They must be sufficiently approachable in terms of size and applicability, but also sufficiently flexible to allow for continuing technological development. The next draft of IEC 1508, to be released in Spring 1997, goes some way towards providing the necessary guidance, but is hampered by its generality across all industrial sectors. Application sector standards based on IEC 1508 are needed to improve the situation but are unlikely to be available before the turn of the century.
CONCLUSIONS
Although this study considered only a small sample (21 incidents), and as such the results are not as statistically sound as those of a larger study, they nevertheless provide useful evidence of areas in real-life industrial practice that need addressing. The majority of incidents reported here could have been avoided had an appropriate level of attention to detail and expertise been brought to bear at the requirements and design stages. Safe operating practice is also a significant factor.
As previously discussed, the following tasks could have mitigated many of the problems encountered in this study:
Hazard analysis techniques (such as HAZOP or failure mode and effects analysis), performed at the requirements, design and maintenance stages;
Incorporation of safety features including fail-safe and fault tolerant designs, and external risk reduction facilities such as independent safety interlocks;
System modelling and simulation in the requirements and design stages, to give engineers an insight into the operation of the machine and its immediate environment;
Regular monitoring and review of operational activities, to help prevent faults occurring due to lack of regular maintenance or changes in operating practice;
Safe operating procedures and practices set-up before full-scale operation commences, with operators trained in their use and importance.
REFERENCES
[BRA 94] J. Brazendale and R. Bell. 'Safety-related control and protection systems: Standards Update', Computing and Control Engineering Journal, Vol. 5, No. 1, pp. 6-12, Feb 1994.
[HSE 95] UK Health and Safety Executive. Out of control, Her Majesty's Stationery Office, 1995.
[IEC 93] IEC. Standard IEC 1131: Programmable controllers - part 3: programming languages, International Electrotechnical Commission, Geneva, 1993. (In Europe EN 61131-3).
[IEC 95] IEC. Draft standard 1508: Functional Safety of electrical/electronic/ programmable electronic safety-related systems, International Electrotechnical Commission, Geneva.
[LEW 96] R.W. Lewis. 'How can IEC 1131-3 improve the quality of Industrial Control Software', IEE UKACC Int. Conf. on CONTROL '96, pp. 1389-1393, Sept 1996.
[LEV 95] N.G. Leveson. Safeware: system safety and computers, Addison Wesley, 1995.
[NEU 95] P.G. Neumann. Computer related risks, Addison Wesley, 1995.
[PAR 91] D.L. Parnas, G.J.K. Asmis and J. Madey. 'Assessment of Safety-Critical Software in Nuclear Power Plants', Nuclear Safety, Vol. 32, No. 2, pp. 189-198, 1991.
[STO 96] N. Storey. Safety-critical computer systems, Addison-Wesley, 1996.
Agents and Actions:
Structuring Human Factors Accounts of Major Accidents
Colin Burns, Chris Johnson, and Muffy Thomas
Glasgow Accident Analysis Group,
Dept. of Computing Science, University of Glasgow,
EMail: {burnsc, johnson, muffy}@dcs.gla.ac.uk
WWW: http://www.dcs.gla.ac.uk/~{burnsc, johnson, muffy}
Conventional accident reports suffer from a number of usability problems. In particular, it can be difficult for readers to identify the ways in which operator ‘errors’ exacerbate system ‘failures’. Critical traces of interaction are often lost in a mass of background detail. These problems can be avoided by constructing abstract models of major accidents. Analysts can use mathematical abstractions to focus in upon critical events in the lead up to a failure. Unfortunately, previous specification techniques for human computer interaction cannot easily be used in this way. Notations such as Z, CSP and VDM provide no means of distinguishing those operators that actively influenced the course of an accident from other users who were less involved. Similarly, these notations provide no means of explicitly representing the critical events that alter the course of human-machine interaction. In contrast, this paper describes how a Sorted First Order Action Logic (SFOAL) can be used to represent and reason about the human contribution to major accidents.
1 Introduction
Hollnagel (1993), Rasmussen (1989) and Reason (1990) have all identified human ‘error’ and operator mismanagement as critical factors in the lead up to major accidents. Unfortunately, it can be difficult to assess the impact of such problems from the many hundreds of pages that are presented in natural language accident reports. There are further problems (Johnson, McCarthy and Wright, 1995). The natural language used in accident reports can be ambiguous; this makes it difficult for manufacturers to probe behind terms such as ‘high workload’ to identify potential solutions to past failures. The sheer scale of natural language accident reports may also mask factual errors and inconsistencies; this weakens the credibility of the overall reporting process. Above all the mass of systems engineering, metallurgical, thermodynamical and other detail can obscure the ‘human factors of failure’. In contrast, this paper argues that mathematically based notations can be used to represent and reason about the human contribution to major accidents, as documented in official reports.
1.1 Why Formal Methods?
A number of authors have applied formal methods to support interface development. Harrison and Thimbleby have recruited an algebraic notation to specify high-level requirements for interactive systems (Harrison and Thimbleby, 1989). Dix has developed a range of algebraic tools for analysing high level properties of multi-user systems (Dix, 1991). Other authors have extended the application of these techniques to model the human ‘errors’ and system ‘failures’ that are described in accident reports. Johnson, McCarthy and Wright have used Petri nets in this way (Johnson, McCarthy and Wright, 1995). Thomas has used first order logic (Thomas, 1994). Telford has applied Lamport's Temporal Logic of Actions (Johnson and Telford, 1996). None of these approaches provides explicit means of representing and reasoning about the agents, the human users and operators, that contribute to major failures.
Figure 1: Petri Net Representation of the Kegworth Accident (Johnson, McCarthy and Wright, 1995).
For example, analysts must trace through the many different places and transitions in figure 1 to identify all of the agents who were involved in the Kegworth accident. This problem has also been recognised by others (e.g. the work on interactors at York (Harrison and Duke, 1995)). In this paper, we employ a notation which also avoids this limitation. Sorted First Order Action Logic (SFOAL) is a modal logic introduced by Khosla (1988). SFOAL forces analysts and investigators to explicitly consider the agents that contribute to an accident. It also provides syntactic structures that can be used to represent the actions that they perform.
1.2 The London King's Cross Case Study
The official report into the King's Cross Underground Fire is used to illustrate the argument that is presented in this paper (Fennell, 1988). This case study is appropriate because the Kings Cross accident typifies the complex interaction between human and system ‘failures’ that lead to most major accidents (Hollnagel, 1993). It, therefore, provides a realistic testing ground for our techniques.
On the 18th November, 1987, a fire broke out on escalator 4 of the Piccadilly line in King's Cross Underground Station, London. It left 31 people dead and many seriously injured. Despite smoking having been banned within the Underground for a number of years, poor policing of this policy had led to passengers continuing to smoke freely within the premises. On this occasion, a lit match was dropped on the escalator 4 where, due to the crabbing movement of the escalator, it slipped between the skirting board and the step treads and landed on the running tracks of the escalator. Years of incomplete maintenance had resulted in an accumulated layer of grease and detritus which the match ignited. Once lit, the fire spread upwards to the escalator wall. A series of communications failures involving both automated and manual systems led to considerable delays and a lack of coordination in the response to this initial incident. As a result, the fire was not effectively tackled and smoke and flames engulfed the tube lines ticket hall.
2 Identifying Critical Properties
The official accident report into the Kings Cross fire contains a mass of contextual information that ‘sets the scene’ for the accident itself. For example, Chapter 7 is devoted to the history of escalators on the Underground. The construction of a higher level model of an accident forces analysts to strip out such detail and focus upon the critical properties that directly led to an accident. Systems engineering observations might be formalised in this way. Analysts might choose to represent the crabbing motion of the escalator where the fire started, as described on page 15 of the report:
"Gaps were observed between the treads on the Piccadilly Line escalator 4 at King's Cross. They were caused by the crabbing movement of the escalator."
This can be formalised as follows:
crabbing_motion (esc_4) (1)
Human factors observations can be represented in a similar manner. At one point in the accident P.C. Bebbington attempted to use communications equipment to call his headquarters:
call (pc_bebbington, headquarters) (2)
A failure to use computer-controlled and manual safety equipment can also be represented using logical negation. This enables analysts to represent both errors of commission and omission (Reason, 1990). For example, the water fog system (a sprinkler system fitted beside the running tracks of the escalator) was not activated and this was cited as a critical factor in the development of the fire:
¬ activate (P, water_fog_system) (3)
This formalisation is intended to produce a clearer representation of the accident. It is important to note, however, that the task of identifying what is relevant belongs to accident investigators and not to the developers of formal modelling techniques. This is similar to the way in which accident investigators employ mathematical models to simulate physical processes, such as combustion.
A specific contribution of our work is to extend the applications of modelling techniques to operator interaction with both manual and computer-controlled systems.
3 Distinguishing Between Objects
The use of first order logic in the previous sections suffers from a number of limitations. In particular, it does not distinguish between the different types of objects and agents that may be involved in an accident. For example, the Kings Cross report refers to people, such as P.C. Bebbington and Leading Railman Wood. It refers to escalators, such as escalator number 4 where the fire began. It refers to safety systems and automated applications, such as the water-fog system and Closed Circuit Television (CCTV) facilities. The properties of these objects differ. For example, it would make little sense to apply the call predicate (2) to an escalator. Using SFOAL's type system, all objects are classified as belonging to a particular type. For example, esc_4 represents the escalator where the fire started, water_fog_system is a safety mechanism:
esc_4: Escalator (4)
water_fog_system: Safety_mechanism (5)
Types can then be associated with functions and predicates such as those introduced in the previous section. For example, if an analyst wished to refer to the running track of an escalator this could be specified using the following function:
running_track: Escalator → Track (6)
This function can only be applied to objects of type Escalator; it cannot be applied to objects of any other type. Similarly, human factors engineers might want to define a function between safety systems and the individuals who were responsible for operating them. This operating_officer function can only return a person; it cannot return any other type of object.
operating_officer: Safety_mechanism → Person (7)
At first sight, this technique may seem to offer minimal benefits. However, an important stage in any investigation is the identification of those mechanisms that should have preserved the safety of an application. Similarly, it is vital to identify those individuals who were responsible for both maintaining and operating safety systems. It is a non-trivial task to identify such critical responsibilities from the many hundreds of pages that are contained in most accident reports.
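The value of the typing discipline can be illustrated with a short sketch. In the Haskell fragment below, which is our own illustration under invented naming assumptions (the Track type and the officer returned are not taken from the report), distinct types ensure that running_track can only be applied to escalators and that operating_officer can only return a person; an ill-typed application, such as asking for the running track of the water fog system, is rejected by the type checker.

newtype Escalator       = Escalator String       deriving Show
newtype SafetyMechanism = SafetyMechanism String deriving Show
newtype Person          = Person String          deriving Show
newtype Track           = Track String           deriving Show   -- assumed codomain for running_track

esc4 :: Escalator
esc4 = Escalator "esc_4"                              -- clause (4)

waterFogSystem :: SafetyMechanism
waterFogSystem = SafetyMechanism "water_fog_system"   -- clause (5)

runningTrack :: Escalator -> Track                    -- clause (6)
runningTrack (Escalator e) = Track (e ++ "_running_track")

operatingOfficer :: SafetyMechanism -> Person         -- clause (7)
operatingOfficer _ = Person "unspecified_operating_officer"   -- placeholder value, not from the report

-- runningTrack waterFogSystem would be a compile-time type error.

main :: IO ()
main = print (runningTrack esc4)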
4 Identifying Key Personnel
The previous clause, (7), introduced a type called Person. The intention was to distinguish operators from the other objects that must be included in any model of an accident. Identifying the individuals, or agents, who contributed to, or responded to, a failure is a critical stage in any accident enquiry. However, this can be surprisingly difficult with natural language accounts. SFOAL, therefore, provides an agent type, Agt, to represent ‘non-passive objects’ that interact with and affect their environment. The use of the term ‘agents’ instead of ‘users’ reflects the origins of SFOAL in the specification and verification of interactive software. In accident analysis, computer systems and other automated mechanisms might also be viewed as agents, if they are capable of affecting the environment. For example, an automated water fog system capable of detecting fire and activating itself could be considered as an agent, as opposed to the manual system (5), which cannot affect the environment without user intervention.
The following clause identifies some of the agents from the King's Cross accident. The headquarters agent demonstrates that abstract representations can also be made of collections of individuals within the system (i.e. the individuals at the Transport Police Headquarters).
Constants
mr_squire, relief_station_insp_hayes, booking_clerk_newman,
met_line_station_insp_dhanpersaud, picc_line_controller_hanson,
area_manager_archer, picc_line_acting_traffic_manager_weston,
pc_bebbington, pc_balfe, headquarters : Agt (8)
Again, the discipline of explicitly identifying those users that are involved in an accident provides a focus for any analysis. A particularly important point here is that the different chapters of an accident report are often drafted by different teams of investigators. In consequence, it can be difficult to trace the ways in which any one individual's actions have effects that propagate through the various sections of the report. Agent definitions, such as those given above, encourage some agreement over which personnel might have interacted with the safety mechanisms (and any other systems) during major accidents.
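One possible executable rendering of clause (8) is sketched below: the agents named in the report become the constructors of a closed Haskell type, so that any analysis necessarily ranges over an explicit, agreed set of personnel. The rendering is ours; the constant names follow clause (8).

-- The agents of clause (8) as a closed type; Headquarters is the collective agent.
data Agent
  = MrSquire
  | ReliefStationInspHayes
  | BookingClerkNewman
  | MetLineStationInspDhanpersaud
  | PiccLineControllerHanson
  | AreaManagerArcher
  | PiccLineActingTrafficManagerWeston
  | PCBebbington
  | PCBalfe
  | Headquarters
  deriving (Show, Eq, Enum, Bounded)

allAgents :: [Agent]
allAgents = [minBound .. maxBound]

main :: IO ()
main = mapM_ print allAgents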
5 Identifying Critical Actions
Through object and agent types, SFOAL provides means of explicitly identifying the components, systems and users that are involved in accidents. We have not, however, considered the exact nature of operator intervention during an accident. For example, we have not provided means of representing the actions that activate the safety mechanisms. Fortunately, SFOAL provides the predefined type Act to describe these actions:
Action Names:
report_fire: Agt → Act
call: Agt → Act
move_to: Location → Act
activate: Safety_mechanism → Act
stop_ticket: Agt → Act
close_gate: Gate → Act (9)
The explicit representation of critical actions helps to clarify the course of human-system interaction that led to the failure. These critical events are often buried within the mass of natural language that is used by most accident reports.
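The action names of clause (9) can be given a similar executable reading. In the sketch below, which is our own illustration, each constructor carries the argument type that the corresponding SFOAL action expects; the Fire argument to report_fire is an assumption on our part, inferred from clause (11), and the Agent type is abbreviated for the example.

data Agent = PCBebbington | BookingClerkNewman | Headquarters
  deriving (Show, Eq)
newtype Location        = Location String        deriving (Show, Eq)
newtype SafetyMechanism = SafetyMechanism String deriving (Show, Eq)
newtype Gate            = Gate String            deriving (Show, Eq)
data Hazard             = Fire                   deriving (Show, Eq)  -- assumed, from clause (11)

data Action
  = ReportFire Agent Hazard        -- report_fire
  | Call Agent                     -- call
  | MoveTo Location                -- move_to
  | Activate SafetyMechanism       -- activate
  | StopTicket Agent               -- stop_ticket
  | CloseGate Gate                 -- close_gate
  deriving (Show, Eq)

main :: IO ()
main = print (Call Headquarters)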
6 Linking Agents to their Actions
The descriptions of agents and actions provides only limited benefits unless analysts can explicitly link users to the actions that they perform. This can be achieved using a modal connective. Formally, this constrains the possible worlds, or scenarios of interaction, that might hold after an agent has performed a particular action. For example, the following clause specifies that if some agent fails to activate the water fog system then the behaviour of the system changes so that fire_safety_inactive is true. Such statements are important not only because they model key factors which define the accident behaviour, but also because, from a human factors view, they describe an operator's failure to respond (for whatever reason) to the demands of the scenario.
¬[A, activate (water_fog_system)] fire_safety_inactive (10)
It can also be extremely important to link specific agents to particular actions. For instance, Mr Squire was the first person to report the fire to a member of staff:
[mr_squire, report_fire (booking_clerk_newman, fire)]correct_fire_notification (11)
An important point here is that the use of the modal connective, [ ], helps to distinguish the users' actions from other relations in a specification. Such distinctions are often lost in other approaches to the formal modelling of interactive systems (Johnson, McCarthy and Wright, 1995). A final advantage is that the modal connective can be used to introduce the notion of change over time because it describes properties that hold after an action has been performed. This is a significant benefit because it can be used to capture traces of interaction between systems and their operators without forcing analysts to specify the exact real-time interval of all actions. Given limited evidence in the aftermath of an accident it is, typically, impossible to reconstruct complete timelines for all operator actions. Elsewhere we describe how real-time might be gradually introduced into interval notations such as SFOAL (Johnson, 1997).
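One executable reading of the modal connective, offered here only as our own sketch, interprets [agent, action] P as 'P holds in the state reached after the agent performs the action'. The state record and the transition function perform below are invented stand-ins for the accident model; clause (11) then becomes a property that can be checked against that model.

data State = State { fireSafetyInactive      :: Bool
                   , correctFireNotification :: Bool }
  deriving Show

data Agent  = MrSquire | BookingClerkNewman      deriving (Show, Eq)
data Action = Activate String | ReportFire Agent deriving (Show, Eq)

-- An assumed transition function standing in for the accident model.
perform :: Agent -> Action -> State -> State
perform _ (ReportFire _) s = s { correctFireNotification = True }
perform _ (Activate _)   s = s { fireSafetyInactive = False }

-- Clause (11): after Mr Squire reports the fire to Booking Clerk Newman,
-- correct_fire_notification holds in the resulting state.
clause11 :: State -> Bool
clause11 s = correctFireNotification
               (perform MrSquire (ReportFire BookingClerkNewman) s)

main :: IO ()
main = print (clause11 (State True False))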
7 Forming Sequences of Interaction
It is not possible to place most of the interactions which occur in accidents into a strict sequence. In the King's Cross accident, virtually the only events with precise timings are communications with the emergency services and the occurrence of the flashover: the digital clock in the ticket hall was stopped by the heat at 19:45; all communications with the emergency services were automatically logged. However, the manual logging of telephone calls between London Underground staff proved to be untrustworthy as it was discovered that a number of the clocks used were inaccurate.
Sequencing of the actions documented in the report is problematic due to the varying accuracy of the timing information: for some actions there is little or no information about the time interval in which they were performed; for other actions, this information is presented. The size of these intervals is not uniform: some are precise (i.e. the ‘major incident’ emergency message was recorded at 19:45:58); most are fairly vague, to the nearest minute or worse. The modal connective can be used to represent the effects of actions performed at both known and unknown time intervals.
7.1 Known Time Intervals
When timings of actions in the King's Cross report are given, they are generally given to the nearest minute. Several actions may occur in one minute and there is no means to determine the relative sequential ordering of these actions. If the performed actions do not interfere with the effects of other actions in the time interval, they can be modelled as occurring in parallel. For example, the following actions occurred at 19:41:
"19:41
Booking Clerk Newman told by P.C. Balfe to stop selling tickets. One of the sets of Bostwick gates at the stairs to the perimeter subway from the tube lines ticket hall was closed by an unidentified police officer.
Piccadilly Line Controller Hanson alerted Area Manager Archer at
Finsbury Park." (page 51)
These can be formalised, using the parallel composition operator ‘||’ to represent that there is no further information about the sequence of the events at 19:41:
[pc_balfe, stop_ticket (booking_clerk_newman) ||
PC, close_gate (perim_subway_gate) ||
picc_line_controller_hanson, call (area_manager_archer)]
booking_clerk_newman (stopped) ∧ perim_subway_gate (closed) ∧
area_manager_archer (alerted) (13)
If, on the other hand, performing one of the actions in the time interval does interfere with the effects of other actions, then these actions may be represented as a sequence. For example:
"19:40
Mr Hanson telephoned Piccadilly Line Acting Traffic Manager Weston, who telephoned Metropolitan Line Station Inspector Dhanpersaud." (page 51)
This can be represented using the SFOAL sequential composition operator ‘;’:
[picc_line_controller_hanson, call (picc_line_acting_traffic_manager_weston);
picc_line_acting_traffic_manager_weston, call (met_line_station_insp_dhanpersaud)]
picc_line_acting_traffic_manager_weston (alerted) ∧
met_line_station_insp_dhanpersaud (alerted) (14)
Clauses such as (13) and (14) are extremely important as they clarify, as far as possible, the relative ordering of human-human, human-computer and, more broadly, human-system interaction during the course of an accident.
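The reading of parallel composition in clause (13) can also be checked mechanically: because the three actions update disjoint parts of the state, every interleaving yields the same post-state. The sketch below illustrates this under our own choice of state fields; it is not derived from the report itself.

import Data.List (permutations)

data State = State { ticketsStopped  :: Bool
                   , perimGateClosed :: Bool
                   , archerAlerted   :: Bool }
  deriving (Show, Eq)

type Update = State -> State

stopTicket, closeGate, callArcher :: Update
stopTicket s = s { ticketsStopped = True }
closeGate  s = s { perimGateClosed = True }
callArcher s = s { archerAlerted = True }

-- The updates touch disjoint fields, so every ordering of the three
-- actions produces the same post-state, as clause (13) intends.
clause13Holds :: State -> Bool
clause13Holds s =
  all (== foldr ($) s acts) [ foldr ($) s p | p <- permutations acts ]
  where acts = [stopTicket, closeGate, callArcher]

main :: IO ()
main = print (clause13Holds (State False False False))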
7.2 Unknown Time Intervals
Given the incomplete information available in reports, it is likely that the size of the time intervals between some of the interactions in a sequence is unknown. As such intervals could be longer than other represented time intervals, each action must be described as being performed at a discrete time interval. As an example, because there was no aerial system installed in the underground, P.C. Bebbington had to leave King's Cross Underground station in order to radio his headquarters and alert them of the fire. The described effect of the call action, in this case, relies on P.C. Bebbington's being located outside, which in turn relies on his having moved there. This exemplifies the way in which the effects of an action can depend upon previously performed actions. P.C. Bebbington's actions occur in a distinct sequence and the call was automatically logged by the Police Logistical Operational Database (PLOD) at 19:33. However, there is little information about the size of the time intervals in which these actions were performed. Formally, this can be stated as follows:
[pc_bebbington, move_to (surface)][pc_bebbington, call (headquarters)]
headquarters (alerted) (15)
This clause states that after the agent pc_bebbington performs the action move_to (surface) followed by the action call (headquarters), the behaviour of the system changes so that headquarters (alerted) is true. Accidents generally result from complex sequences of events (Leveson, 1995). Sequenced action descriptions, such as clause (15), are thus vital to modelling and reasoning about accidents as they describe the effects of an action within a given context. Sequenced action descriptions also provide valuable insights into why certain behaviour occurred in the system, as well as when. In the above example, the formal description implies that we can only deduce that the Police Headquarters has been alerted if P.C. Bebbington was on the surface when he called them.
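The dependency captured by clause (15) can be made executable in the same spirit. In the sketch below, which uses an invented state record, the call only alerts headquarters if the preceding move_to has placed P.C. Bebbington on the surface, and the sequential composition ';' becomes ordinary function composition.

data Location = Underground | Surface deriving (Show, Eq)

data State = State { bebbingtonAt        :: Location
                   , headquartersAlerted :: Bool }
  deriving (Show, Eq)

moveToSurface :: State -> State
moveToSurface s = s { bebbingtonAt = Surface }

callHeadquarters :: State -> State
callHeadquarters s
  | bebbingtonAt s == Surface = s { headquartersAlerted = True }
  | otherwise                 = s          -- no aerial system below ground

-- Sequential composition ';' of clause (15) as function composition.
clause15 :: State -> State
clause15 = callHeadquarters . moveToSurface

main :: IO ()
main = print (headquartersAlerted (clause15 (State Underground False)))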
8 Conclusion
The complexity of accidents is a major factor behind the usability problems identified with accident reports. A number of mathematical abstractions of accidents have been constructed which abstract away from much of the detail and resolve ambiguities in the text of the report. This paper has applied a formal notation, SFOAL, to this problem. This approach encourages the analyst to focus on the fundamental objects of interest in the behaviour of the accident: the users and their interactions with safety systems. The linking of these two within the modal connective permits explicit definition of operator characteristics and of human-machine interaction. The various sequencing operators, both within and outside the modal connective, aid the modelling of behavioural timelines.
Currently, the models we are building are focussed on the human machine interactions that take place during major accidents. One future goal is to build more general descriptions of the system, so that we can reason not just about the actual behaviour observed in the accident, but also about other potentially hazardous traces of interaction. Accident report recommendations invariably focus on ‘social factors’ of the system such as the control hierarchy, legislation, and company policies (Toft and Reynolds, 1994). We propose to examine a Deontic Logic based on SFOAL, which expresses notions of permission and obligation (Khosla, 1988), as a means to represent such structures.
Bibliography
Dix, A.J. Formal Methods for Interactive Systems. Academic Press, 1991.
Fennell, D. Investigation into the King's Cross Underground Fire. Her Majesty's Stationery Office, London, 1988.
Harrison, M.D. and D.J. Duke. A review of formalisms for describing interactive behaviour. In IEEE Workshop on Software Engineering and Human Computer Interaction, volume 896 of Lecture Notes in Computer Science. Springer-Verlag, 1995.
Harrison, M.D. and Thimbleby, H.W., editors. Formal Methods In Human Computer Interaction. Cambridge University Press, 1989.
Hollnagel, E. The phenotype of erroneous actions. International Journal Of Man-Machine Studies, 39:1-32, 1993.
Johnson, C.W. Reasoning about human error and system failure for accident analysis. In Howard, S., Hammond, J., and Lindgaard, G., editors, Interact’97. Chapman and Hall, 1997.
Johnson, C.W., McCarthy, J.C., and Wright, P.C. Using a formal language to support natural language in accident reports. Ergonomics, 38(6):1265-1283, 1995.
Johnson, C.W. and Telford, A.J. Extending the applications of formal methods to analyse human error and system failure during accident investigations. Software Engineering Journal, 11(6):355-365, 1996.
Khosla, S. System Specification: A Deontic Approach. PhD thesis, Imperial College of Science and Technology, University of London, 1988.
Leveson, N. Safeware: System Safety and Computers. Addison-Wesley, 1995.
Rasmussen, J. Coping safely with complex systems. Technical Report Risø-M-2769, The Risø National Laboratory, 1989.
Reason, J. Human Error. Cambridge University Press, 1990.
Thomas, M. A proof of incorrectness using the LP theorem prover: the editing problem in the Therac-25. High Integrity Systems, 1(1):35-49, 1994.
Toft, B. and Reynolds, S. Learning from Disasters. Butterworth-Heinemann Ltd, 1994.
Incorporating Human Factors into Safety Systems in
Scottish Nuclear Reactors.
Iain Carrick,
Operational Safety Engineer,
Health, Safety and Licensing Division, Scottish Nuclear
1. Introduction, Scope and Historical Review.
Scottish Nuclear is Scotland's Nuclear Power Generating Company. We produce over 50% of Scotland's total electricity requirements. The Company owns and operates the Advanced Gas-Cooled Reactor (AGR) Nuclear Power Stations at Hunterston on the Firth of Clyde and at Torness on the East Coast of Scotland. The first AGR unit at Hunterston was commissioned in 1976, hence that plant has been in commercial operation for over twenty years. Torness was commissioned nine years ago. The Magnox Station, Hunterston A, enjoyed an excellent operating record from 1964 until its closure, on economic grounds in 1990.
The safety of nuclear reactors ultimately depends on the human beings who design, construct, operate and maintain the plant. Scottish Nuclear has always taken human factors into account in assessing the safety of its Stations. This paper reviews the practical measures taken to incorporate the operators' role in maximising the overall reactor safety achieved with respect to both normal and fault conditions. This approach has significantly shaped the development of both hardware and organisational systems.
The British Advanced Gas-Cooled Reactors (AGRs) were constructed in accordance with well developed and conservative Design Safety Guidelines. Prior to construction, detailed Safety Cases were assessed in great depth both by the Safety Division of the operating company and by the Government's independent Nuclear Installations Inspectorate (NII). The Designs incorporated the basic nuclear safety philosophy of Defence-in-Depth, illustrated in figure 1 (all figures are included at the end of this paper), and wherever practical, systems were built to be fail-safe. Furthermore, the large thermal inertia of the AGR tends to produce a slow thermal response, which is tolerant of both plant and human induced faults.
Although all practicable steps are taken to assure normal, safe operation, it is assumed that things can go wrong. A wide range of potential fault sequences are studied and barriers put in place to prevent the sequences developing to the point of serious safety consequence. Although there may exist a very large spectrum of fault event sequences, this study is practical because a fundamental set of bounding, or "Design Basis", faults is selected for detailed study. Sensitivity studies are then conducted to demonstrate that there are no "cliff edges" just beyond the "feasible" bounded set.
Thanks to the quality of design, construction and operation, Scottish Nuclear have not experienced any incidents resulting in significant radiological hazard. The highest classification assigned under the International Nuclear Event Scale has been 1, defined as an Anomaly, outwith the authorised operating regime. Indeed, there have been no incidents above level 2 at any of Britain's civil nuclear power stations. Level 2 means an Incident with potential safety consequences on-site, but having no off-site consequences.
Generally, the analysis of large scale, technological accidents reveals a variety of causes and a sequence of events which could have been safely aborted if any one of a number of safety barriers had operated properly. The barriers, or levels of Defence-in-Depth, which are in place to prevent and limit accident sequences at Scottish Nuclear Stations include :-
Hardware Barriers.
• Alarms to warn of abnormal conditions and system failures,
• Automatic shutdown on departure from the normal, safe operating window,
as measured by a range of parameters (Several can detect each fault.),
• Automatic implementation of post-shutdown cooling,
• Diverse, segregated and redundant provision of safety critical functions.
Procedural Barriers.
• Safe Operating procedures,
• Regular, frequent checks on safety systems,
• Post-maintenance testing by different personnel,
• Graded application of Quality Assurance,
• Operator Training on both normal and fault conditions
using a full-scale, Control Room Simulator,
• Face-to-face briefings prior to commencement of safety work.
In addition to such barriers, Scottish Nuclear have established a range of working practices which are intended to employ human skills in achieving the highest levels of safety and performance. These formalised practices do indeed constitute safety systems and go beyond hardware or procedures into the realm of Organisational Safety Culture. Their objective is to avoid the circumstances which might give rise to an accident sequence; to be pro-active in promoting safety rather than reactive in responding to faults or threats which have arisen.
This paper focuses mainly on these human factors initiatives whilst discussing some of the evolution of other safety systems which have benefited from the Company's long-standing concern with the operators' contribution to safety. Four examples are briefly reviewed here.
From the outset, our AGRs were provided with on-site, real-time, full-scope simulators of the Central Control Room and its interfaces with the reactors, generating plant, electrical and common services. This included manual controls, automatic control loops, alarms, automatic protection functions and indications. The simulators enable realistic familiarisation and refresher training to be based on a wide range of simulated plant faults and transient scenarios. Hence the reactor operator's response to the first sign of trouble is clearly defined and well rehearsed.
Improvements in the ergonomics of Central Control Room (CCR) design are clearly apparent when the spacious, computer-based Torness CCR is compared with the Hunterston B design, some 10 years previous. Only urgent safety alarms appear on facias in Torness, as virtually all alarms are computer processed. The computer alarms are grouped but can readily be expanded into more detail by the desk operator to aid diagnosis. Shift operations engineers were involved in specifying optimal control desk layouts and display formats.
Alongside the mid-life plant refurbishment and review of the safety case at Hunterston B Power Station, it was decided to review and, where necessary, improve procedures and safety culture with respect to the best international practices. This led to the development of 90 Principles, based on international standards. The topics addressed included :- Management, Operations, Maintenance, Technical Support, Training, Radiation Protection, Chemistry and Emergency Preparedness. Subsequently, Station engineers specified how they complied with these principles, identifying any shortfall which had to be rectified.
Once a reactor has been operating for a few years the effectiveness of the various barriers and safety systems are reviewed on a rolling basis. Fundamental to the review of system effectiveness is information on weaknesses which have been noticed through routine tests and minor events.
2. Safety Improvements, Incorporating Human Factors.
The International Atomic Energy Agency (IAEA) currently recognises that -
"Significant risk reductions and gains in the safety and performance of operating nuclear power plants have been achieved since the issue of INSAG-3 in 1988. This has been possible due to the objective of attaining operational excellence through
- a more comprehensive treatment of Safety Culture and Defence-in-Depth,
- application of lessons learned from nuclear plant operational experience,
analyses, tests and other research,
- introduction of self-assessments at all levels of the organisation,
- more explicit treatment and use of PSA."
The achievement of World Class standards of Excellence in Safety and Operational Performance has been a key objective for Scottish Nuclear since its inception in 1990. Our strategy for achieving this matches well the above observation by the IAEA.
2.1 Range of Human Factors Initiatives at both Stations.
2.1.1 Operational Experience Feedback Engineers were appointed at Torness and Hunterston in 1990. At the same time, managerial-level Site Incident Panels were established at both Stations, to meet monthly to consider remedial measures and the lessons to be learned from undesired events. The more significant matters raised at the Station Panels are reported to the Company Safety Supervisory Board.
A subsequent initiative has been the introduction of "Near Miss" and "Discrepancy in Safety Culture" report forms at Torness and Hunterston respectively. Any member of staff may detail an event which, under different circumstances, could have led to injury or plant damage. The Safety Engineer reviews all the forms and may conduct a follow-up interview. This is intended to catch potential problems at an early stage before they become an actual problem for the Station.
Working level, Loss Control Panels have been created at both Stations to give practical consideration to such minor events and industrial safety near misses. At both Stations "Blame Tolerant" reporting systems have been established which encourage all personnel to provide information on near misses and plant events.
2.1.2 An important step in consolidating the right attitudes across the whole workforce was the adoption of a Company Safety Management System. Managers and engineers were trained in the application of the system and several were assigned personal responsibility for improving the measured performance in specific areas. The initial focus on training supervisors and engineers was intended to motivate and enable them to set a better example and to demand higher standards of safety. Supervisors now hold regular safety discussions with their work group.
2.1.3 Virtually all Station staff have been involved in a safety culture seminar. Various groupings have been tried, including a Management Team, all staff in one department, a shift crew, and a mixture of industrial staff, safety representatives and supervisors. In most cases the format has involved "audience participation" after the basic concepts have been discussed. This might be in the form of analysing a couple of relevant events or reviewing work procedures in order to make them more user friendly. Safety awareness was enhanced throughout the work force, using team briefings, posters, displays and videos. These imparted a thorough knowledge of the safety implications of their tasks and established a STAR practice of Stop-Think-Act-Review. Interdepartmental teams were set up to identify means of improving performance and safety.
2.1.4 Directors and senior managers make time to see plant conditions and working practices at first hand. A need to improve conditions and written procedures was identified, leading to the provision of additional resources. Safety Representatives and engineers / middle management also conduct regular safety inspections. The standard of housekeeping achieved is a small element in "Gainshare", performance bonus payments to staff.
2.1.5 Scottish Nuclear strongly support the UK Nuclear Utilities' Peer Evaluation Programme. About twenty-five senior engineers from each Station have benefited from evaluator training, followed by in-depth observation and analysis of practices at other UK Stations. Two independent Evaluations have been conducted at both Scottish Nuclear Stations. A few engineers and managers have also taken part in similar evaluations in the USA, France, Japan and China under the auspices of the USA Institute of Nuclear Power Operators (INPO) and the World Association of Nuclear Operators (WANO).
2.1.6 Hunterston management have adopted a common action tracking system for all safety related actions. They now hold a weekly management meeting, attended by the Company's independent, nuclear safety Site Inspector, which has safety issues first on the agenda and where the Plant Manager actively chases actions to completion.
2.1.7 Local, shop-floor Safety Representatives (fulfilling functions identified in the Health and Safety at Work Act) are now invited to attend the top-level Safety Supervisory Board (SSB) meetings and to report back to their colleagues on how safety issues are considered. The SSB is chaired by a non-executive Director and both the Chief Executive and the Director of Health, Safety & Licensing attend.
2.1.8 Occupational Safety courses have been run by the Industrial Safety Department for all engineers and supervisors at both Stations. They emphasise the roles of these personnel in ensuring safe working practices and a safe environment, compliant with safety regulations.
2.1.9 Scottish Nuclear supports a range of nuclear safety research through a nuclear industry management committee. This has very wide scope, including Human Factors R&D. Current work includes human reliability in maintenance and how best to discern human factors in events.
2.2 Quality Assurance.
When Scottish Nuclear was created in 1990, the Company set out to establish a fully integrated Quality System. This is designed to ensure that the right things are consistently done correctly. Human factors principles are embedded in this system, which has led to Certification of Scottish Nuclear under ISO9001 and BS 5882. Rigorous definitions of personnel and functional responsibilities, coupled with clear specification of all important interfaces contribute to quality, safety and efficiency.
It is established practice in SN that any safety-significant procedures are independently verified prior to final implementation. This principle of checking by a second person is carried through into the actual conduct of key tasks, which normally include "hold points". Examples include the Panels which are constituted at both Stations to review proposals to change Operating and Maintenance Instructions. These panels include representation from relevant Station Departments and the off-site Health, Safety and Licensing Division.
Proposed plant modifications are classified according to their potential impact on radiological safety. Written justification, to an established format and standard, is required for independent checking and authorisation. For the most safety significant proposals this entails formal nuclear safety assessment by the headquarters Health, Safety and Licensing Division, followed by consideration at the Company's Nuclear Safety Committee (which includes authoritative external members) and finally, assessment by the Nuclear Installations Inspectorate.
Both Stations have recently conducted Management Reviews of their QA systems. These have revealed similarities between the types of non-conformance arising in audits and the root causes of undesirable plant events. For several years, timely clearance of non-conformances has been used as a performance indicator, reported to the SSB.
At Hunterston, in addition to the usual identification of non-conformances against the QA Standard sections, any non-conformance that resembled a precursor to a plant event (about half) was considered to be a potential Causal Factor and assessed against the Operational Feedback root cause criteria. Figure 2 compares the profiles of actual root causes with the potential revealed by QA audits. It may be concluded that the main QA system implementation problems are in the areas of Personnel & Work Practices and of Written Procedures & Documents, and further that these QA problems are actually causing plant events. Various initiatives are underway at both Stations (eg. Stop-Think-Act-Review) to encourage staff to use procedures more effectively.
Both Stations have introduced methods of reporting document deficiencies, accessible to all staff. About 180 requests are raised per annum at each Station. Unfortunately, the resources deployed and priorities assumed at both Stations have not yet achieved the rapid turnaround required to convince all staff of the efficacy of the system.
Partly in response to the perceived difficulties associated with procedures and work practices, surveillance audits have been introduced at Hunterston in addition to traditional audits. These constitute "a reality check" to identify real weaknesses in working practices and to target recommendations for tangible improvements. Five topics per annum are selected for this surveillance on the basis of managerial concern and events. It is intended to implement similar audits at Torness next year.
A surveillance evaluation is conducted by a small team comprising, for example, a technical expert, the Feedback engineer and the Quality engineer. This expert team studies the procedures for the key tasks, observes them being conducted and discusses them with the staff involved before recommending improvements. Alternatively, the working level "Loss Control Group" may be asked to investigate the problem and make recommendations.
2.3 Simulator Training.
Over the years the simulators have been validated against the various plant transients which have been experienced. Although not of the same severity addressed in fault studies, this experience gives confidence that the initial stages of faults are well modelled and that the plant response to controlling actions is valid. Furthermore, the simulator response has been compared with a range of fault transient safety analyses.
Whenever a significant new plant problem is experienced the simulator engineers can model this behaviour and within days Control room engineers may gain simulated experience of the abnormal plant behaviour. The Training section are provided with reports of all significant reactor faults at their Station and relevant generic faults from other Stations through the Feedback Process. Applicable faults will be modelled on the simulators and incorporated in future shift training sessions. Relevant feedback has also been utilised from aircraft operational problems.
The simulators are also used by reactor operating engineers to verify both new and modified, normal and fault, operational instructions. Even beyond-design-basis faults and the associated guidelines have been invoked on the simulators.
All shift control room engineers receive mandatory initial and refresher training on the simulators. The quality and frequency of their training is closely monitored by the Company. Because of the proximity of on-site simulators, they are very heavily utilised, limited more by the number of trainee man hours available. The amount of simulator training per man has increased substantially in the last few years. Figure 4 shows the growth in refresher training. At Hunterston it has recently been necessary to devote more time to initial training on account of staff turnover.
The control room engineers exercise as two or three man teams, with their shift colleagues. All training is reviewed with the trainees, including the CCR supervisor as an integral part of the team. Their views are fed back to both the Training Engineer and the Shift Principal Engineer / Manager. The Shift Principal has overall responsibility for and takes an active interest in the simulator training of his staff.
The simulators can also be used as direct training aids to demonstrate plant performance characteristics, such as control loops. Real time graphs of related parameters may be produced to reinforce operators' grasp of the underlying plant behaviour.
2.4 Symptom Based Emergency Response Guidelines.
Operating Instructions have been developed to guide the operator's response to recognised families of anticipated faults. Normal practice is to provide fully automatic protection systems for faults which require action in less than 30 minutes, although the Operator may be able to contribute to safety within this time scale.
To cater for a wider range of hypothetical, more severe faults, guidelines were developed to help operators respond effectively to symptoms, without knowing in detail the root causes of a fault. These guidelines would also enable the operator to respond effectively to any fault sequences which had not been addressed in the event-based, prescriptive Instructions.
Figure 5 illustrates how these Symptom-Based, Emergency Response Guidelines (SBERGs) are related to the range of fundamental safety functions, potentially affected by significant faults. It can be seen that the description of the basic safety functions is very broad, such as reactor heat removal. Each is related to a simply measured parameter, which will show whether the safety function is being adequately met. Additional monitoring parameters are identified. The system is designed to focus the operator's attention on the primary safety issues he should address, in order of priority.
SBERGs 1 to 8 are considered responses to faults which impair various fundamental safety functions.
The order of the SBERGs generally reflects the timescale for effective operator response. If two or more of the fundamental safety functions are challenged, the highest order SBERG (lowest number) is entered initially.
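The entry rule can be stated compactly: of the SBERGs associated with the challenged safety functions, the one with the lowest number is entered first. The following sketch, written in Haskell purely for illustration, captures that rule; the assignment of safety functions to SBERG numbers shown here is invented and is not taken from Scottish Nuclear documentation.

type SbergNumber = Int

-- Invented example: two fundamental safety functions currently challenged.
challengedSbergs :: [(String, SbergNumber)]
challengedSbergs = [ ("reactor heat removal", 2)
                   , ("reactivity control",   1) ]

-- Enter the highest order (lowest numbered) SBERG among those challenged.
entrySberg :: [(String, SbergNumber)] -> Maybe SbergNumber
entrySberg [] = Nothing
entrySberg cs = Just (minimum (map snd cs))

main :: IO ()
main = print (entrySberg challengedSbergs)   -- Just 1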
Figure 6 summarises the structure of the guidance on reactor heat removal. Simple checks are prompted to ensure that the suggested course of action is having the desired effect. If an action does not produce the required improvement in the time specified further remedial actions are proposed. The purposes of the fault analyses and Instructions/Guidelines are to:
(a) demonstrate the robustness of the system design to faults of any kind,
(b) identify the need for automatic protection and its associated reliability,
(c) specify operator actions to reduce the consequences of any fault that may arise,
(d) improve the reliability with which the operator may be expected to take effective actions.
2.5 Probabilistic Safety Assessment & Human Reliability Assessment.
The safety role of the operator has been well documented within each Plant Safety Case. Following the Chernobyl accident in 1986, this was re-examined in the context of every recognised potential fault. It was confirmed that no urgent safety demands would be made on the operator.
As part of the recent Periodic Safety Review at Hunterston a new probabilistic safety assessment has been developed and this includes human reliability assessments. Indeed, the recent Periodic safety review for Hunterston 'B' provides a powerful focus for developing the scope of both qualitative and quantitative human factors safety assessment for reactor fault conditions. This has included the results of research work which developed a framework for quantifying operator recovery over longer time scales (5-10hrs). This allowed the beneficial effect of the AGR reactor thermal inertia to be claimed in a justifiable manner, and improved recovery procedures to be developed. Tangible design improvements associated with the Periodic Review include tertiary feed systems, diverse tripping and diverse shutdown.
The Torness Periodic Review will similarly develop both the plant systems modelling and the operator interaction with these essential systems under fault conditions. The recovery actions over longer time scales modelled for both stations will be based on the existing analysis supporting the SBERGs.
3. Selected Systems explained in more depth.
3.1 Operational Experience Feedback.
For any feedback system to be effective it must be accurate and meaningful. Only by investigating the fundamental causes of undesirable events or near misses can adequate preventive measures be taken and valuable lessons learned. The results of some of our trend and pattern analyses, at both Hunterston and Torness, indicate that the causes of "run of the mill" plant events (which of themselves have no safety impact but which might be precursors of safety significant events) are the same types that contribute to more significant events.
It can also be seen in figure 3 that human factors form the largest group of causal factors. Indeed, similar causal factors are reported from events at nuclear stations throughout the UK. We should therefore attend to the root causes of such events in order to avoid the potential consequences of more significant events. In support of this objective both Stations operate an enlightened "blame tolerant policy". This means that, provided an individual has not wilfully disregarded personnel safety nor deliberately breached mandatory procedures, no disciplinary action will be taken if he reports the problem. The feedback sections at each Station typically consider around 200 relatively minor events per annum.
As part of an R&D initiative to understand better the human aspects of events, Torness is currently experimenting with totally confidential reporting to an independent, professional third party. This exercise brings wider psychological expertise from outside Scottish Nuclear to bear on this problem.
Although the feedback process addresses both equipment and human factors, we will concentrate here on the latter, the thought processes that lead people to take inappropriate actions. Scottish Nuclear use the most relevant components of the Human Performance Enhancement System (HPES) developed by the Institute of Nuclear Power Operators. HPES Root Cause Analysis requires trained evaluators and can be time consuming.
Around forty supervising staff have been trained in Root Cause Analysis at each Scottish Nuclear Station. The training was in the form of workshops using real events. Torness have subsequently evolved relatively short Root Cause Workshops for events with significant human factors. All persons involved in the event participate under the facilitation of the Feedback Engineer.
The approach of tackling root causes revealed through "run of the mill" plant events or near misses sounds good in theory but in practice it is quite challenging to achieve for three basic reasons :
i) People are usually reluctant to report that they nearly made a significant error and often do not appreciate the potential benefits of reporting minor events.
ii) Individual events with no safety impact can easily be written off as insignificant one-off occurrences not meriting corrective actions.
iii) Most times you have to dig to find root causes and it can be difficult to justify an extensive use of resources investigating something that can appear, to management and other colleagues, to have no real safety impact.
To provide management with both an overview and a convincing grasp of the important issues, event Trend and Pattern Analysis reports are presented annually to the Incident Panels. These reports process and display all events which occurred at the Station over the last few years according to severity, plant systems, direct causes and root causes. Recommendations are made to address areas of weakness identified by the analysis. This enables management to focus resources where real safety and other benefits can be gained.
An innovative and effective approach is employed at Hunterston in the form of a Human Factors Workshop for Managers, facilitated by a senior manager assisted by a team which includes representatives from Operational Experience Feedback, Industrial Safety, Quality Assurance and Engineering. Each of these individuals presented examples of events in their fields and supported them with information from trending analysis work. Ownership of the issues was then established within the workshop environment by inviting each member of the management team, in turn, to chair a discussion session. The purpose of each session was to understand the human factors underlying the events and to develop appropriate corrective actions.
Eleven topics were addressed in several sessions leading to management focus on primary concerns and a large number of fruitful suggestions. These were then worked into an action plan with each improvement area being allocated to a manager. For example, actions to address roles and attitudes were derived from concerns arising in several sessions of the workshop. Most of the eight action plans to emerge from the workshop involve a cross section of personnel in their resolution.
In order to support the management team the facilitators have run additional Human Factors Workshops with supervisory staff to involve them in the improvement process. These were based upon the areas of improvement defined by management and drew upon the same actual events used during the management workshop.
Both Stations issue a quarterly Feedback Newsletter circulated to all supervisors and middle management which gives a brief description of all events which have happened on the site plus a summary of any particularly relevant events from other sites. Likewise, the Industrial Safety Engineer circulates an Occupational Health & Safety quarterly report which details near misses, accidents to contractors and areas of potential concern.
3.2 Scottish Nuclear's Safety Management System.
Following the vesting of Scottish Nuclear, the Company decided that a more balanced approach should be given to nuclear, radiological and industrial safety matters. A dedicated Industrial Safety Engineer was appointed at each station and it was decided to adopt the International Safety Rating System (ISRS) in the form of a Company Safety Management System, after considering several established safety management systems. Two key features of the system are:-
(1) A systematic safety management programme designed to improve performance.
(2) Regular monitoring of performance using structured auditing and comparison against recognised standards.
Over 300 UK companies and many more world-wide use the ISRS as an independent safety yard-stick. It embraces not only industrial safety but a much wider range of safety-related aspects of the company's operations. Figure 7 lists the numerous aspects of the ISRS employed by Scottish Nuclear. According to the range of elements addressed and the standards achieved a company is assigned a rating from 1 to 10.
A safety management programme to progress from level 1 in 1992, to level 5 by 1994 and level 8 by 1997 was produced at each station. The key elements of the ISRS system were assigned to individual managers and supervisors to programme and control. Hence responsibility was rolled down the structure to ensure that targets set would be decided by the personnel who had the task of achieving these goals.
The regular audits, which form the second key feature of the safety management system, not only demonstrate the effectiveness of the programme but also provide guidance on how to improve it. Should the answer to an audit question be 'no', then by instituting an appropriate practice or procedure, the effectiveness of the safety programme will be improved. The method by which the procedure or practice is developed into the work pattern is the responsibility of the Element Manager and associated Section Team. Hence a new culture was developed of utilising the implementer of a work practice to draft and agree the method by which the work would be carried out.
Figure 8 shows that both SN Stations have achieved the planned improvements as measured by rigorous external audits (DNV). Both Stations have set about increasing the range of ISRS elements which are actively managed. The latest additions are Group and Personal Communications, clearly beneficial to personnel performance and error reduction. They provide positive feedback to all personnel and reinforce the attitude of safety first. The improvements are reflected in other, independent measures and manifest in a range of effective, self-improvement initiatives.
3.3 Peer Evaluation.
In 1991, nuclear operating companies in the United Kingdom set up a Peer Evaluation Function which is currently based at Magnox Electric's Headquarters in Gloucestershire. The initiative was based on developments at the USA Institute of Nuclear Power Operators (INPO). The operating standards and culture of a nuclear Station are observed in detail and considered against international performance objectives and safety criteria by peers selected from similar plants in the United Kingdom. There is a rolling three year programme for reviews which are carried out by a 15 strong team who are on-site for a two week period reviewing station procedures and performance. The topics addressed include :- Management, Operations, Maintenance, Technical Support, Training, Radiation Protection, Chemistry, Emergency Preparedness and Industrial Safety.
It is a novel experience for staff at a long established plant to find their work methods being scrutinised by peers who are familiar with the tasks being undertaken and have the expertise to judge the quality of the performance. In fact, Hunterston Power Station received an excellent report from their first Peer Review Team. Part of this was due to the fact that the Management Team at Hunterston Power Station had supported the concept of Peer Review to a very high degree. Torness underwent Peer Evaluation in October 1992 and November 1995.
In 1992, a senior manager was seconded for two years to the Peer Evaluation Team including a spell at INPO Headquarters in Atlanta, Georgia. Hunterston provided 25 engineers over a period from 1992 to 1994 to cover reviews at other stations. The normal resource anticipated to support the initiative would have been 12 engineers taken over the period. Many techniques employed on the station were modified following adoption of ideas and practices noted at other locations and imported to current work methods.
The influence of techniques learned from INPO or Peer Evaluations led to the creation of "Principles" handbooks in several Departments. These documents were prepared to ensure that each individual within all line Departments fully understood how their efforts could aid the achievement of excellence at the Station. The contents of the documents were developed to provide first line supervisors, support staff, planners and craftsmen with management's expectation for carrying out the daily work programme. These documents were intended to serve as a bridge between knowledge (that already possessed by professional staff) and procedural requirements.
A recent development is the conduct of Company Self Evaluation of operational practices between the 3 yearly external Peer Evaluations. This was introduced at Torness in August 1996 and focused on the weaknesses identified by the last external Peer Evaluation. The Hunterston plant Manager led the evaluation team which comprised staff from both Stations. The results of this Self-Evaluation were made widely known in a Station newsletter. Over 65% of the areas identified as needing improvement had either been dealt with or were well advanced.
4. Assessing the Effectiveness of these Initiatives.
4.1 Outcome of IAEA OSART & Revisit.
In April 1994 an international team of some 15, highly experienced, nuclear engineers and technical scientists, led by a senior representative of the International Atomic Energy Agency (IAEA), conducted a routine, three week, operational safety assessment (OSART) of Hunterston B. The mission results showed that Hunterston was performing well. Some of the strengths noted included :-
• An experienced, dedicated and enthusiastic management team and staff.
• A number of programmes promoting improved performance, efficiency,
safety and continuing upgrading of plant equipment and material conditions,
• Good quality training facilities and technically qualified instructors,
• An effective incident panel for reviewing plant events,
• Good facilities and equipment for radiation protection and emergency preparedness.
The OSART team also made a number of proposals for management's consideration to improve various plant activities. As usual, a brief revisit took place in October 1995. This confirmed that the Station had made sufficient progress against virtually all actions arising from the OSART. These showed Hunterston in a very good light compared with numerous OSARTs conducted by the IAEA throughout the world.
Several of the Scottish Nuclear practices, described in the main text above, have recently been recognised by the International Atomic Energy Agency as key attributes of a good safety culture. These practices will be commended in a forthcoming IAEA report for application at nuclear plants world-wide.
4.2 Industrial Safety.
During 1996 Scottish Nuclear received Gold Awards from the Royal Society for the Prevention of Accidents (RoSPA) for each of its three locations, for the third year in a row. It also received the Sector Award for the best performing electricity industry company in the field of Industrial Safety. According to the latest figures issued by the Electricity Association (up to the end of 1996), Scottish Nuclear had the lowest Accident Frequency Rate of the UK electricity generating companies. Figure 9 shows the improvement in Scottish Nuclear's accident frequency rate over the last 5 years.
4.3 Performance Indicators.
The current ISRS, safety management rating of 8 for both Stations puts Scottish Nuclear in the top 10% of the numerous international, "high-tech" companies which utilise this scale. The ISRS reflects a wide range of the Company's activities, covering both industrial and radiological safety.
Hunterston has displayed a significant improvement in several safety performance indicators, including Accident Frequency Rate (AFR), Radiation Doses, and Implementation of Incident Panel Actions. Similarly, at Torness there has been an improvement in AFR, Simulator Training of CCR staff and numbers of incomplete safety-related modifications. Commercially both Stations have substantially improved output over the 7 year operating life of Scottish Nuclear.
5. Conclusions.
5.1 The Scottish Nuclear reactor systems and operating philosophy only place demands on the reactor operators which can readily be achieved within the required time scales. Nevertheless, operators have a powerful safety function which is reinforced through several practical organisational systems, described in this paper.
5.2 Improvement in safety culture should herald improvements in safety performance. Hence, a strong Safety Culture is fundamental to ongoing safety improvement in Scottish Nuclear and constitutes a significant component of Defence-in-Depth.
5.3 It is practical to raise safety culture on a mature station by managerial initiatives even where there is no obvious imperative for radical change. There has to be very clear commitment to the improvements at all levels of management and as many staff as possible should be given an opportunity to contribute and participate in the change process.
5.4 The Peer Evaluation process is an excellent medium for sharing experience and good practices.
Prevention And Recovery Of Errors In Software Systems
Tjerk W. van der Schaaf
Safety Management Group,
Eindhoven University of Technology
P.O. Box 513 - Pav. U-8, 5600 MB EINDHOVEN, The Netherlands
E-mail: TSC@TM.TUE.NL
The presentation of this paper was supported by a Human Capital and Mobility Network on Human Error Prevention.
In this paper the first phase of the PRESS project (Prevention and Recovery of Errors in Software Systems) is described: a feasibility test to apply industrial risk management and incident analysis techniques to the software domain. After introducing the PRESS project, these techniques are briefly outlined. A case study is presented in which IT-related problems reported to a helpdesk were analyzed for their root causes using the PRISMA methodology (Prevention and Recovery Information System for Monitoring and Analysis). Finally an ongoing project on learning from field testing embedded software products is described.
1. The PRESS project: Prevention and Recovery of Errors in Software Systems
In this project the Eindhoven Safety Management Group tries to apply risk management and incident-analysis techniques developed in industry to the software domain. More specifically, the focus is on reporting, analyzing and controlling "human error" during the phases of developing, testing, and using complex software.
Our approach is not only to identify failure causes and then implement preventive measures; we also seek insight into the so-called recovery factors (leading to timely detection, diagnosis and correction of failures), in order to build these into the software production system.
The following goals are distinguished in the PRESS project:
- a quantitative insight into the nature and frequency of different types of human error during software development;
- a balanced approach to improving system development by using both prevention and recovery possibilities;
- development of testing procedures specifically directed at different types of human error;
- a blueprint for an effective and efficient software-incident reporting system, using customer complaints as input;
- validation of predictive methods (like auditing checklists) for software reliability by comparing their results with the actual incident causes.
2. PRISMA: learning from incidents
The analytical basis of the PRESS project is the PRISMA methodology: The Prevention and Recovery Information System for Monitoring and Analysis. PRISMA is a tool capable of being used continuously and systematically to monitor, analyse and interpret incidents and process deviations (van der Schaaf, 1996). Originally developed to manage human error in the chemical process industry, it is now being applied in the steel industry, energy production and in hospitals. The initial focus on safety consequences has been extended to provide an integral approach that is able to manage all adverse consequences (safety, quality, environment and reliability), based on the assumption that a common set of causal factors is responsible for these various issues.
The main goal is to build a quantitative database of incidents and process deviations, from which conclusions may be drawn to suggest optimal counter measures. These counter measures can assist not only in the prevention of errors and faults, but also in promoting recovery factors and ensuring timely corrective action. PRISMA uses both actual but rare accidents and the abundantly available near misses to accomplish this.
The PRISMA approach consists of the following main components, which will be discussed briefly:
1. The Causal Tree incident description method.
2. The Eindhoven Classification Model (ECM) of System Failure.
3. The Classification/Action Matrix.
Causal trees (van Vuuren & van der Schaaf, 1995), derived from fault trees, are very useful for presenting the critical activities and decisions which occur during the development of an incident. These activities and decisions are presented in chronological order, and the tree shows how they are logically related to each other (figure 1). When using causal trees it becomes clear that an incident is the result of a combination of many technical, organisational and human causes. The ‘root causes’, found at the bottom of the causal tree, are the main product of the first phase of PRISMA, and constitute the inputs to the second phase: classification of failure root causes.
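To make the structure of a causal tree concrete, the following is a minimal Python sketch (our own illustration, not part of the PRISMA tooling); the incident and its causes are invented for the example, and the root causes are simply the leaves of the tree.

    class Node:
        # A causal-tree node: an event, activity or decision in an incident.
        def __init__(self, description, causes=()):
            self.description = description
            self.causes = list(causes)  # antecedents that explain this node

        def root_causes(self):
            # Root causes are the leaves at the bottom of the causal tree.
            if not self.causes:
                return [self]
            found = []
            for cause in self.causes:
                found.extend(cause.root_causes())
            return found

    # Hypothetical IT incident, for illustration only.
    incident = Node("Application unavailable to regional office", causes=[
        Node("Server restarted during office hours", causes=[
            Node("No procedure for scheduling restarts"),         # organisational
            Node("Operator unaware users were still logged in"),  # human
        ]),
        Node("Fail-over server not configured"),                  # technical
    ])

    for root_cause in incident.root_causes():
        print(root_cause.description)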
To classify technical, organisational and human root causes, a model is needed. The Eindhoven Classification Model (figure 2) was originally developed to classify root causes of safety related incidents in the chemical process industry. The ECM focuses on three types of causes separately and in a pre-defined order. First technical problems are considered, by looking at the design of the equipment used, construction problems, or unexplainable material defects.
The second step focuses on contributing factors at an organisational level, such as the quality of procedures, or the priorities of management. Only after looking at possible technical and organisational problems are human causes considered. This order is chosen to counteract the sometimes strong bias within companies to start and stop analysis at the level of the operator as the end-user, and leave the technical and organisational context of any mishap unquestioned.
The human section of the model is based on the SRK-model by Rasmussen (1976). Rasmussen developed a basic model of human error based on three levels of behaviour: skill-, rule-, and knowledge-based behaviour (S-B, R-B, K-B). This SRK-model has been operationalised to describe operator errors in process control tasks by combining it with characteristic task elements which, as a whole, cover the entire spectrum of operator sub-tasks.
The last category (‘unclassifiable’) is reserved for those contributing factors which cannot be placed in any of the above-mentioned categories.
In order to develop an actual tool for risk management it does not suffice to stop at the analysis stage of failure classification. These classification results must be translated into proposals for effective preventive and corrective actions. To fulfil this purpose a Classification/Action matrix was developed, incorporating the theoretical foundations of the ECM.
In the matrix (figure 3) the most preferred action, in terms of expected effectiveness, for each classification category is indicated by an ‘X’. The ‘no!’ entries in the last column refer to particularly ineffective management actions, which are nonetheless often encountered in practice.
Figure 3: The Classification/Action Matrix. Rows (action classes): Equipment, Procedures, Information & communication, Training, Motivation. Columns (classification categories): TE, TC, (TM), OP, (OM), HK1, HK2, HR1-HR6, HS1, HS2. An ‘X’ marks the most effective action for each category; ‘no!’ marks particularly ineffective actions.
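In effect, the matrix is a lookup from classification category to preferred (and contraindicated) action class. The sketch below shows that structure in Python; the particular cell assignments are illustrative placeholders only, not the actual cells of Figure 3.

    # Action classes (the matrix rows) are taken from Figure 3; the mappings
    # below are hypothetical illustrations of the lookup, NOT the real cells.
    PREFERRED_ACTION = {
        "TE": "Equipment",                     # placeholder entry
        "OP": "Procedures",                    # placeholder entry
        "HK1": "Information & communication",  # placeholder entry
        "HR4": "Procedures",                   # placeholder entry
        "HS1": "Equipment",                    # placeholder entry
    }
    INEFFECTIVE_ACTIONS = {
        "HS1": {"Motivation"},                 # placeholder for a 'no!' cell
    }

    def suggest_actions(root_cause_codes):
        # Tally preferred action classes over a set of classified root causes.
        tally = {}
        for code in root_cause_codes:
            action = PREFERRED_ACTION.get(code)
            if action is not None:
                tally[action] = tally.get(action, 0) + 1
        return tally

    print(suggest_actions(["OP", "OP", "HK1", "HS1"]))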
3. NMMS: implementing an incident reporting system
PRISMA as an analytical tool can serve its purpose only within a comprehensive system which, on the input side, feeds a constant stream of incident reports into PRISMA and, on the output side, is able to generate, implement and evaluate effective and efficient corrective measures. The Near Miss Management System (NMMS), derived from a variety of experiences in industry and transportation (Van der Schaaf, Lucas & Hale, 1991), provides a seven-module framework or checklist to design and evaluate incident reporting systems (a minimal code sketch of this checklist follows the list):
1. Detection: reporting of the occurrence of near misses/incidents by employees.
Question: how to motivate this (self-)reporting activity?
2. Selection: "interesting" reports (those with high feedback value) must be selected for further analysis.
Questions: which selection criteria? Which decision methods?
3. Description: a detailed structure incorporating all relevant components (system characteristics, technical faults, errors, recoveries, etc.) and their (chrono-)logical relationships.
Questions: how detailed, and which stopping rule? Which (tree-like) technique, and which type of database?
4. Classification: components must be classified according to a system model comprising the technical, organizational and human aspects.
Questions: classification of all components or only of the "root" causes? Which model is best suited?
5. Computation: facilities for statistical analysis of the data resulting from 4, and for manipulating the structures of 3 for sensitivity analyses and simulation.
6. Interpretation: periodic translation of results into structural measures (general factors) and ad hoc measures (specific/unique factors).
7. Evaluation: following up the effectiveness of implemented measures: feedback to 1, but also using other, independent measures of "safety performance".
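As a minimal sketch (our own, with invented field names and scores), the seven NMMS modules can be treated as a simple checklist against which an existing reporting system is scored:

    # The seven NMMS modules (Van der Schaaf, Lucas & Hale, 1991) as a checklist.
    # The scoring scheme and the example scores are illustrative, not from the paper.
    NMMS_MODULES = [
        "Detection",       # are near misses/incidents reported by employees?
        "Selection",       # are high-feedback reports selected for analysis?
        "Description",     # are components and their relationships captured?
        "Classification",  # are components classified with a system model?
        "Computation",     # are statistics and sensitivity analyses possible?
        "Interpretation",  # are results turned into structural/ad hoc measures?
        "Evaluation",      # is the effectiveness of measures followed up?
    ]

    def audit(system_scores):
        # Report which modules of a reporting system are missing or weak
        # (0 = absent, 1 = partial, 2 = adequate).
        for module in NMMS_MODULES:
            score = system_scores.get(module, 0)
            if score < 2:
                print(module + ": needs attention (score " + str(score) + ")")

    # Hypothetical evaluation of a field-test process, for illustration only.
    audit({"Detection": 2, "Selection": 1, "Description": 0})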
The implementation aspects of a near miss management system should not be underestimated (they are probably comparable to those of implementing a successful Total Quality Programme). Three essential aspects are: top-level management commitment, unbiased reporting by employees, and support for middle management (e.g. safety officers) who are responsible for describing and analyzing the reported events (Van der Schaaf, Lucas & Hale, 1991):
Management commitment is vital to ensure that organizational learning from near misses is the reporting system's only function: at the very least, a voluntarily reported near miss should never have any negative repercussions for those reporting it;
Unbiased reporting may be motivated by training all employees to recognize near miss situations, by showing them exactly what is being done with the reports they hand in, and by giving them fast and frequent feedback on the results;
Support for the safety staff is necessary so that they can fully appreciate the cognitive background of the human error model, and so that the reported events are described, classified and interpreted in an objective and uniform way.
PRISMA and the NMMS framework could be used to support two mechanisms for learning:
- internally, by focusing on reported errors during the R&D phase;
- externally, by collecting user problems once the product is on the market.
4. A feasibility study of PRISMA
A case study has been carried out to test the industry-based PRISMA methodology in the software domain. It was carried out at the helpdesk of the Dutch company A: regional and local offices report their problems with IT applications (e.g. non-availability) to this national helpdesk.
An evaluation of their current incident databases using the PRISMA classification model revealed an incomplete and strongly technology-biased identification of incident causes: a subsequent "reference database", built from carefully conducted Critical Incident Interviews on recent software problems, showed that many "technical" causes in fact had an organizational or human root cause.
Case Study
Schaftenaar (1996) describes a feasibility study within the context of a two year project sponsored by the Dutch government which aims to gain insight into the quality of service in the IT sector, specifically concerning the use and control of IT products.
A main issue within the project is the Service Level Agreement. In this document the supplier and the customer agree on a specific level of service to be provided by the supplier. If the supplier does not meet the agreement, there could be direct financial consequences. One can imagine that both customer and supplier are eager to know the actual level of service provided by the supplier. In order to evaluate this, a large amount of data is needed concerning the use and control of the supplied IT products. In company A, a partner was found who possessed a large amount of such data.
Problem Definition
The data mentioned earlier concerning the use and control of IT products is an output of the control process implemented at A. This process is based on the Information Technology Infrastructure Library (ITIL), which is the current standard on control in the IT sector. An important part of service is ensuring that any flaws in the supplied products are dealt with quickly and effectively. The processes concerned with this part of service are Help Desk Management and Problem Management.
Help Desk Management is used to guarantee the continuity of service, while Problem Management is used to improve the level of service in the future. In other words, Help Desk Management deals with "incidents", i.e. deviations from the standard system function, while Problem Management deals with "problems", the "root" causes of incidents.
Problem Management and Help Desk Management are complementary processes: both must be used to ensure good quality service. Because problems are never directly visible, information about them must be gathered via incidents. However, this data may be so diverse that real insight may not be possible; to gain that insight the data must be classified. The object of this assignment was to assess what data was being collected, how it was registered and classified, and how (or if) it was analyzed.
Current Database
A first step in the assignment was to examine the available data in the database. It was not possible to get a clear picture of the types of incidents A is dealing with, mainly because only roughly half of all telephone calls to the helpdesk were coded and the validity of the database appeared to be low. This low validity came to light during a check of the incident codes used: random calls were examined to see whether the free-text description of the incident matched the incident code. Verification was often not possible because of the vagueness of the free-text description; in the cases where verification was possible, more than 30% of the calls turned out to be incorrectly coded.
Table 1. The ECM-IT
Error Code | Descriptive Label | Example
OP | Organisational Procedures | No procedure for restart after serious incident
OM | Organisational Management priorities | Fast introduction of application is more important than adequate testing
HK1 | Human Knowledge 1 (system status) | User is not aware that the programme is saving a file and removes the disc
HK2 | Human Knowledge 2 (goal) | Trying to solve problems when the goal is to solve incidents
HR1 | Human Rule Based 1 (license) | User tries to repair PC, but is not qualified
HR2 | Human Rule Based 2 (permit) | Changing network parameters without permission from controller
HR3 | Human Rule Based 3 (co-ordination) | Not informing colleague about changing printer set-up
HR4 | Human Rule Based 4 (checks) | Not checking whether the printer is plugged in
HR5 | Human Rule Based 5 (planning) | Using ‘print screen’ key instead of print function
HR6 | Human Rule Based 6 (tools/information) | Sending colour print to a monochrome printer
HS1 | Human Skill Based 1 (controlled movement) | Making typing error on keyboard
HS2 | Human Skill Based 2 (whole body movement) | Throwing coffee on keyboard
TEH | Technical Engineering Hardware | Capacity of data-lines is too small
TES | Technical Engineering Software | Design error in application
TCH | Technical Construction Hardware | Hard disc is installed incorrectly
TCS | Technical Construction Software | Programming error
TMH | Technical Materials Hardware | No more printer toner
TMS | Technical Materials Software | Printer driver is missing
X | Unclassifiable | Random peak in application use
An examination of the Problem module of the information system showed that this module was hardly being used: in a population of 30,000 incidents only about a hundred problems were identified. Since the original categories used in the information system are not based on a model, it was not possible to gain insight into the kinds of problems identified; therefore the PRISMA model of failure classification was used. The original model was transformed for application in the IT domain. This meant that a distinction between hardware and software was introduced. Also, the order of the main categories was changed to counteract the existing strong bias towards seeing only the technical aspects of incidents (see Figure 6). The final model, the ECM-IT, is shown in Table 1.
Using this model it was possible to gain insight into the types of problems in the database. The problems that were registered were all of a technical nature (see Figure 6). This seemed an unlikely situation, and more research was necessary to further investigate the validity of the data.
Critical Incident Database
Since the data in the existing database was unreliable, another source of data was found: Critical Incident Interviews were used to obtain information about incidents directly from users.
Critical Incident Interview Technique
The Critical Incident Interview Technique (CII) was developed to collect data on human behaviour in a particular work situation. The technique enables a researcher to register, in a systematic manner, events which influence the objective of a task positively or negatively. Such events or incidents are often known only to the person or persons directly involved, and these individuals are asked to report them. A CII is then used to collect more detailed and often sensitive information following receipt of the confidential report.
Causal Tree Method
A causal tree is constructed for each incident gathered through reports and CIIs. A causal tree gives a detailed description of the sequence of events culminating in an incident. The endpoints of the tree are called "root causes". These root causes must be classified according to the chosen classification model. Each incident is analyzed to produce a set of classified causal elements, as opposed to nominating only one element as the "main cause" of the incident. A fictitious example is given in figure 4.
Figure 4. Fictitious causal tree.
Eindhoven Classification Model
Root causes found in the analysis were classified using the ECM-IT (figure 5). Classification of a root cause identified in the causal tree involves moving from the top of the model through the categories until an accurate category to describe the root cause is found. This process is repeated until every root cause in the causal tree has been assigned to a single category.
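A minimal Python sketch of this top-down, first-match walk is given below. The category order follows Table 1 (organisational and human categories before technical ones); the `matches` predicate is an invented stand-in for the analyst's judgement at each step.

    # ECM-IT main categories in the order used for classification (see Table 1).
    ECMIT_ORDER = [
        "OP", "OM",                                   # organisational
        "HK1", "HK2",                                 # human, knowledge based
        "HR1", "HR2", "HR3", "HR4", "HR5", "HR6",     # human, rule based
        "HS1", "HS2",                                 # human, skill based
        "TEH", "TES", "TCH", "TCS", "TMH", "TMS",     # technical
    ]

    def classify(root_cause, matches):
        # Walk the model top-down; the first category that fits is assigned.
        # `matches(root_cause, code)` stands in for the analyst's judgement.
        for code in ECMIT_ORDER:
            if matches(root_cause, code):
                return code
        return "X"  # unclassifiable

    # Hypothetical judgement function, for illustration only.
    print(classify("no restart procedure", lambda rc, code: code == "OP"))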
Several CIIs were conducted at A. In this way 117 root causes from a total of 17 incidents were gathered and classified. Table 2 shows the results of the analysis of the root causes.
Table 2: Number of root causes per main category.
Types of root causes | # in database
Technical | 39
Organisational | 61
Human | 13
Unclassifiable (X) | 4
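The counts in Table 2 are simply the classified ECM-IT codes rolled up into their main categories; a minimal sketch of that roll-up, with an invented list of codes, is:

    # Roll classified ECM-IT codes up into the main categories of Table 2.
    def main_category(code):
        if code.startswith("O"):
            return "Organisational"
        if code.startswith("H"):
            return "Human"
        if code.startswith("T"):
            return "Technical"
        return "Unclassifiable (X)"

    def tally(codes):
        counts = {}
        for code in codes:
            category = main_category(code)
            counts[category] = counts.get(category, 0) + 1
        return counts

    # Hypothetical list of classified root causes, for illustration only.
    print(tally(["OP", "OM", "OP", "HR4", "TES", "X"]))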
A comparison was made between the types of root causes found in the CIIs and the root causes found in the Problem Management module of the original database (see figure 6). The results are striking. Not only were more root causes found in 17 PRISMA analyses than in the regular analysis of 30,000 incidents, but the types of root causes were also very different. It became clear that not only technical factors but also organizational factors contributed significantly to failures. Other research has found a similar situation (Van Vuuren & Van der Schaaf, 1995). It has been suggested that organizational factors are for the most part "latent" (Reason, 1990): these latent errors are not revealed until other errors occur which make them visible. This time delay often makes organizational factors much more difficult to detect.
Figure 6. Comparison of regular analysis results (left chart) with results of PRISMA analysis (right chart).
Finally, not only failure factors but also recovery factors were examined. Recovery factors proved to be very limited in this case study: IT systems tend to be opaque, obstructing detection and therefore recovery.
5. Follow-up
One might consider the above case study as primarily focusing on external learning from errors, using the problems of end-users as triggers.
An ongoing second case study looks at (the absence of) internal learning mechanisms in the embedded-software R&D department of a major manufacturer of retail petroleum systems, with the following preliminary results:
- the department's use of "field tests", which examine the performance of newly developed or adjusted software products under field conditions, was successfully evaluated by comparing it with the seven-module NMMS framework originally developed by our Eindhoven group for the chemical process industry;
- the malfunctioning of this field-test practice could be explained, and an alternative "learning system" was proposed.
References
Rasmussen, J. (1976) Outlines of a hybrid model of the process operator. In T.B. Sheridan & G.H. Johannsen (eds.), Monitoring Behavior and Supervisory Control. New York: Plenum Press.
Reason, J. T. (1990) Human Error. New York: Cambridge University Press.
Schaaf, T.W. van der (1992) Near miss reporting in the chemical process industry. PhD thesis, Eindhoven University of Technology.
Schaaf, T.W. van der (1996) PRISMA: A risk management tool based on incident analysis. Proceedings of the International Conference and Workshop on Process Safety Management and Inherently Safer Processes (pp. 245-251). October 8-11, 1996, Orlando, Florida.
Schaaf, T.W. van der, Lucas, D.A. and Hale, A.R. (1991) (eds.) Near Miss Reporting as a Safety Tool. Oxford: Butterworth-Heinemann.
Schaftenaar, L. (1996) Classification as a tool for help desk processes. MSc. Report. Faculty of Technology Management, Eindhoven University of Technology.
Human Error Recontextualised
S. Dekker, B. Fields, P. Wright
British Aerospace Dependable Computing System Centre
and Human Computer Interaction Group
Department of Computer Science, University of York, York, YO1 5DD, UK
1. Introduction
1.1. Problem definition
The design of interactive systems for use in complex dynamic domains often leads to new technology that is introduced into evolving and ongoing operational practice. It could be said that system builders design this new technology primarily for data availability, not necessarily for operability. Operability is taken to be the responsibility of the operator: it is (s)he who must find the right data at the right time. On the basis of their knowledge of the domain and the practitioner population, designers make (and have to make) assumptions about how a human is going to perform in relation to their system: how they are going to find the right data at the right time and act correctly in response. Such assumptions are sometimes more and sometimes less well informed. For example, a designer may assume that the practitioner will know where to look in the display layout, given a certain system state, or that training will take care of a particular knowledge requirement.
Once systems are fielded, however, it often turns out that some of these operability assumptions are unwarranted. Practitioners may in fact not know where to look, or they may not call the right piece of knowledge to mind in that context. The problem is that current human error assessment techniques do not systematically support a designer in evaluating the validity of assumptions about human performance.
This problem is perhaps illustrated by the huge gulf between the terms used to describe human performance issues before, during and after a design is fielded. For instance, in the design phase many human error techniques may analyse task descriptions for "errors of omission". Yet the practitioner caught up in an evolving problem situation may indicate that "he couldn’t keep track of what the system was doing". Similarly, analysts of mishaps post-hoc will rarely use the terms of predictive techniques and speak instead about such things as "a lack of situation awareness".
What these differences in terminology point to is that current human error evaluation techniques may fall short in how they capture the operational context and the practitioner’s cognition that lies behind the creation of interaction failures. We look at these shortfalls in turn below.
1.2. Shortfalls in current techniques
The cognitive work done by a human in complex domains and the errors occurring within it are fundamentally context-bound. Human error does not occur in a vacuum but is determined in part, and enabled largely, by the operational context in which it occurs. The way in which many techniques attempt to take into account this context is to treat it as an afterthought — for example by using performance shaping factors as independent variables to adjust some error likelihood. But context is what practitioners rely on to impose meaning on the information they receive; it is what guides them in the formulation and revision of goals and intentions; it is what determines and constrains the knowledge and attention that must be brought to bear for successful problem solving.
To say something about human error, or human-machine interaction failure, with more predictive significance, then, we would need to recreate the operational reality as it would exist for a practitioner caught up in a concrete evolving problem solving situation, for example in terms of the goal trade-offs a practitioner needs to make and the attentional and knowledge demands placed on him or her in situ. A characterisation of this operational reality must include in detail the resources afforded by the human-computer interface, but few techniques explicitly analyse for example the complexity of system moding as a factor contributing to erroneous human assessments.
Even if context were employed in a much richer way, the description of error forms in many techniques would not be able to deal with this enrichment. Many still describe error at a behavioural level, which cannot capture the way in which, say, intention formation might become erroneous in the light of uncertainty and poor system feedback. In other words, there is a lack of cognitive depth when it comes to describing error forms in most human error assessment techniques. Especially given that many complex systems have seen a shift from manual control to much more supervisory and cognitive ways of being managed, the behavioural description of, for example, an "error of commission" fails to capture the ways in which interactional failures between human and machine occur, how they persist, and how they cause trouble for overall system integrity.
1.3. The proposal in this paper
Existing human error techniques actually do appear to serve other goals of a design organisation. For instance, they can provide quantification of error which might be needed to construct a safety case, and they afford easy integration with existing task and workload analysis tools. This is why in this paper we explore requirements for a complementary approach to assessing human error possibility. The aim is a "recontextualisation" of human error by taking into account both the cognitive precursors to erroneous behaviour and the operational context in which human performance problems occur.
This orientation to human error identification has a number of implications for the design practice. Most importantly, such an approach can be used not as a generator of error probabilities at the end of a design life cycle, but instead early on in a design process when there are still opportunities to inform re-design of an envisioned system. Also, the term ‘human error’ could best be replaced by the term ‘interaction failure’, in order to capture the joint contributions of the human and the machine to operational difficulties.
1.4. Structure of this paper
If interaction failure is context and cognition dependent, then the identification of potential interaction failures during system design should be context and cognition sensitive. In what follows, we try to deal with both. We first explain how scenario-based assessment can be an approach to evaluating a design in a way that supports the recontextualisation of interaction failure. By putting a design in a concrete scenario, we can expose the kinds of assumptions that system builders might have made about operability and human performance in relation to their design. We then examine ways in which we can systematically use scenarios to help in this exposition, and look at ways of characterising the kinds of cognitive demands and potential for interaction failure that flow from using a particular design in context. The concepts are illustrated throughout the paper with examples from a design project looking at the flight deck of a hypothetical commercial airliner.
2. Scenario-based design and evaluation
2.1. Scenarios for understanding context
As argued above, the details of an operational environment contribute in a fundamental way to the demands made on practitioners and the resources provided to them. An understanding of realistic, concrete scenarios therefore provides a valuable analytical tool in identifying demand-resource mismatches and the potential for failure. Scenarios provide an insight into what kinds of operational demands are imposed on the real-life use of a proposed design, and how these give rise to additional cognitive demands for practitioners caught up in them.
In what might be termed "traditional" approaches to human error assessment, particularly task-based assessment and design, the emphasis is on creating abstractions of the way in which a system will be used in the field. Such abstract descriptions of work attempt to generalise over a number of (or even all) actual situations of use of the system. Descriptions of tasks typically focus on the overt actions carried out by a practitioner (Shepherd 1989) or in the knowledge structures possessed by practitioners that allow the actions to take place (Johnson and Johnson 1991). Development then involves a process of turning the abstract representations into concrete forms, either by designing systems to support tasks and satisfy requirements, or by making predictions about the system's operation.
The scenario-based approach, on the other hand, argues that the design process should take the specific and concrete, rather than the general and abstract, as its primary input (Greenbaum and Kyng 1991, Kyng 1995). The justification for this view is twofold. First, concrete scenarios encode the contextual details that shape the way situations unfold in practice, details that are absent in context-free abstractions like task analysis; the effect is to expand the analysis beyond the "head and hands" scope of task analysis to include many of the contextual factors that shape the way action unfolds in real situations. Second, by becoming directly involved in the design process through scenario construction and definition, practitioners may be better able to bring to bear their skills, domain expertise, tacit knowledge and so on.
Historically, the concept of scenario has been used in two quite distinct senses, and in reviewing the literature on scenario-based design, Kuutti (1995) identifies work in both these areas. One view is that the role of scenarios is to capture the richness of the contextual and situational factors upon which human action is contingent. An alternative view is that scenarios can be used to represent system-centred episodes of interaction (e.g., Hsia et al. 1994). In the work described here, the aim is to capture these two aspects by describing both a "situation" in which action takes place and the "action" itself. It is in this respect that our scenarios will go beyond the "timeline analysis" (e.g., Day and Hook 1996) often employed in the design and assessment of aircraft flight decks and in other aviation-related domains, where actions are recorded with only a shallow representation of the context in which they occur. In addition to the description of sequences of activity used for timeline analysis, the situational dimension of scenarios offers greater contextual breadth and cognitive depth and may be targeted at issues other than mission success, such as interactional breakdown.
2.2. Selecting scenarios for interaction failure analysis
We turn our attention now to the practical issues of how to generate or elicit scenarios, and how to decide whether the scenarios generated provide the kind of coverage of the problem space that is needed for the design activity at hand. The contextual perspective provided by scenario-based design has been recognised as important, and much attention has been paid to how a focus on concrete instances of usage can alter the nature of the design process (see, for instance, Carroll 1995). However, much of this work has concentrated on the use and elicitation of scenarios, and by and large has not focussed on the question of how designers should select particular scenarios for use in the design process. In other words, which, of all the possible work situations, should be selected for further elaboration and detailed examination? To help in this respect, we aim to look both at generic interactional problems that might occur, and at how these general problems become "instantiated" in reports of practitioners’ experiences.
The phenomena of interest: the interactional problems that we want to avoid
We have, as one starting point, the proposed choices about a design and technology. This already points to some scenarios as being more interesting than others. For example, if a flight deck is being re-designed and new and different technology inserted, we may wish to home in on scenarios in which the new devices or new technology on the flight deck figure highly, or in which work in the new system will be organised in a radically different way as a result of technology change. Technological proposals give one focus for our analysis, but a look at the problems that arise with the use of technology can give more specific guidance for scenario selection.
We said earlier that there is a large gulf between the ways in which human error is talked about during the design phase, and how it is characterised during and after system operations. In order to bridge that gulf, then, we need to look at the operational world and extract from it the kinds of generic human-machine problems that form the phenomena of interest — i.e. that form the kinds of problems we want to avoid in the design we are evaluating.
When we look at many modern complex, dynamic domains, we see, for instance, problems with practitioners getting lost in display page architectures; problems with practitioners’ mode awareness and misassessments of automation status and behaviour; and difficulties in dealing with interacting goals and conflicting evidence during anomaly response. Although particular domains will experience problems that are specific to that domain, these are generic human-machine problems found across a wide range of operational contexts and can be used to guide us further. These phenomena point to the sort of ingredients that we are looking for in a scenario: interleaving parallel tasks and multiple interacting goals; shifting status and behaviour of automation; multiple uses of one display, etc. The theoretical model of Section 3 aims at providing a systematic basis for identifying such problems in the context of a particular scenario.
The most obvious way to help instantiate the phenomena of interest is to make use of expert practitioners, or, for that matter operational experience from other sources (e.g., incident reports). This allows us to constrain (and to evaluate for relevance) the kinds of scenarios we might be thinking about.
Instantiating the phenomena of interest: Operational experience.
An important source of insight in the construction of scenario descriptions will come from observation of the fielded system or of currently extant similar systems. An important part will be played here by current practitioners, who will help to identify situations in which breakdown and failure is either particularly likely or particularly consequential (cf. the Future Workshops of Kyng 1995). Another source of information is the historical record of incidents where breakdown occurred, such as is found in accident reports and confidential incident reporting systems.
In aiming for "coverage" or "completeness" with respect to operational experience, one might attempt to collect together a set of scenario descriptions that contain the situations that practitioners identify as being particularly challenging or hard. Note, however, that, in contrast to scenario based design, the aim here is not to identify scenarios that capture and highlight the requirements for a system, but to identify ways in which a system design might turn out to not fully support the cognitive activities that a practitioner needs to engage in. One way in which the coverage provided by a collection of experience-based scenarios might be compromised is if the process of generating scenarios from experiences is unduly biased, for example by hindsight (Woods et al. 1994). The way in which such biases might be avoided is to rely on the phenomena of interest above for guidance about which situations may be problematic, rather than relying solely on the expertise of practitioners.
Case study scenarios
The machinery is now in place to allow us to elaborate our case study scenarios in more depth. In order to be able to decide on the content of a scenario, we can direct our attention towards each of the three areas above.
In our hypothetical example, interviewing air crew currently flying an existing airliner may reveal situations that involve the flight deck crew in a number of activities that are perceived by them to be both difficult and critical. Two examples of such situations that we will return to several times in this paper are a severe bird strike and a complex hydraulic systems failure. From an experience point of view, the stories told and accounts given by practitioners about how such problem situations unfold will be of particular interest. Coupling this with proposals to alter the technological support for the pilots’ activities (e.g., to provide electronic displays showing procedures to be carried out, supplementing the "old-fashioned" paper checklists) indicates that a number of cognitive demands will be made of the joint human-machine system in the new configuration.
Together, the two scenarios will provide an important coverage of the problem space. While they do not involve interaction with all the devices on the flight deck, or exercise the full range of behaviours that any one device may exhibit, the scenarios do cover a wide range of problems that a complex cognitive system may encounter. The severe bird strike scenario, for example, is a situation where time pressure, the autonomous behaviour of the environment, and the need to engage in and manage a number of concurrent activities are significant factors. In the hydraulics scenario, on the other hand, time pressure and concurrency are less important, but the need to carry out complex reasoning and inference, and co-ordinate a number of types of knowledge from different sources is crucial to a successful outcome. For each of these scenarios, we now describe some of the most important contextual factors, as well as the actual actions that the practitioner carries out.
Scenario 1: Severe bird strike
Context: The first scenario takes place when our aircraft is flying at low altitude during the climb shortly after takeoff, with two crew on the flight deck, the captain being the handling pilot (see Fischer et al. (1995) for a discussion of why the different priorities of captains and first officers might be significant). A severe bird strike occurs in the two starboard-side engines (numbers 3 and 4) of our four-engine aircraft. The result is a fire in the number 3 engine and a failure of number 4, both conditions resulting in warnings being presented to the pilot. Along with the warnings are procedures for dealing with the fire, the failure and several secondary failure conditions (e.g., generator failures) that arise as a result of the engines being unserviceable. These procedures are presented to the pilot via the electronic procedures display format (the detailed design of which is one part of the current design activity), and it is also highly likely that both pilots will remember many of the emergency drills, as they form a part of the pilots’ basic training.
Actions: The sequence of actions carried out in this situation can now be described. We focus on the actions of the pilot not flying, as these will tend to be direct responses to the emergency situation. The pilot flying, on the other hand, will primarily be engaged with flying the aircraft, though will often check and confirm the co-pilot’s actions. Table 1 shows some of the system’s behaviour along with the corresponding actions carried out by the pilot, and the order in which they occur. The right-hand column records some of the information and cues that are available to assist the pilot in deciding what actions to take. In this case the resources available will be the "drills", or pre-determined plans, for carrying out responses to the warnings. In section 3 we will return to the concept of information that acts as a resource for deciding how to act, and explore it in some more detail.
System behaviour | Pilot actions | Information available
Engine 3 fire | |
Engine 4 failure | |
Flap 0 | |
Generator 3 failure | Adjust rudder trim |
Generator 4 failure | |
| Select ENG information page |
| HP cock 3 off | Eng 3 fire drill
| LP cock 3 shut | Eng 3 fire drill
| Fire extinguisher 3: shot 1 | Eng 3 fire drill
| HP cock 4 off | Eng 4 failure drill
| LP cock 4 shut | Eng 4 failure drill
| HP air supply, right side off | Eng 3 fire drill, Eng 4 failure drill
| Start switch: windmill | Eng 3 fire drill, Eng 4 failure drill
Engine 3 fire still in progress | Fire extinguisher 3: shot 2 | Eng 3 fire drill
| Generator, eng. 3: off | Eng 3 fire drill
| Generator, eng. 4: off | Eng 4 failure drill
| Busbars: COUPLE | Eng 3 fire drill, Eng 4 failure drill
| Select ELEC information page |
| Monitor voltages & frequencies | Gen. 3, 4 fail drill
| Transformer (right): COUPLE | Gen. 3, 4 fail drill
| CPU fault lights: Check | Gen. 3, 4 fail drill
| Check services lost | Gen. 3, 4 fail drill
Table 1: Actions in the engine failure and fire scenario.
There are a few important points to note. On the face of it, the actions might seem to be the result of routine, rote following of written procedures. However, closer inspection reveals a more complex pattern of behaviour.
• The generator failures are "secondary" failures, caused by the "primary" engine problems. It is the recognition of this fact that allows the pilot to defer the actions associated with the generator failures until after the most pressing engine related actions have been carried out.
• The procedures are not followed for each warning in sequence: the actions from the two engine related procedures are interleaved, and certain of the actions are common to both procedures but need only be carried out once.
• Some negotiation between the two pilots is required about shutting the engines down. This goes beyond a simple confirmation by one pilot that the other’s actions are correct, and may result in the shutdown actions being delayed (because, though damaged, an engine may still be producing thrust that is essential at low altitude until the remaining engines achieve full power).
• This negotiation between pilots, and any other cross-checking that might serve to improve reliability, can only be hindered by the fact that the pilot flying will be fully engaged in keeping the aircraft airborne and gaining altitude.
Scenario 2: Complex hydraulics system failure
Context: The second scenario involves the crew on the same flight deck as before in diagnosing complex problems in real-time. The hydraulics system consists of three independent hydraulics reservoirs (A, B, and C) connected to three sets of aircraft control surfaces (Rudder, Elevators, and Ailerons) via two types of actuators (primary and secondary). The pilot can select which hydraulics reservoir is used to drive each of the control surfaces. A hydraulics page of the electronic information display shows the quantity and pressure of fluid in each of the reservoirs, and observations of changes in these values can help the pilots in making inferences about the presence of leaks in the systems.
In this scenario, two leaks have occurred in the primary (A) actuator of an elevator and in the secondary (B and C) actuator of an aileron. The objective of the pilot in this situation is twofold: (i) to set the hydraulics controls such that no fluid is leaking and control of the aircraft is maintained (i.e., to ensure process integrity) and (ii) to determine as much information as possible about where in the system leaks are occurring. A number of pre-computed procedures (recorded in the "Quick Reference Handbook" and displayed on the electronic procedures display format) cover failures in the hydraulics systems, and will be recalled by the pilot. However, as we shall see, the procedures are not adequate in complex failure situations such as this one, and in order to achieve the goals, the pilot may rely on instrument readings, knowledge of the history of the episode, and a body of detailed knowledge about how the underlying system works and how its failure modes are manifested in observable values.
Actions: In the second scenario, the sequence of actions begins in much the same way as in the previous situation: a system-generated warning occurs and the pilot responds by following procedures. After a fairly short time, however, it becomes clear that the problem is more complex than the pre-defined procedures have accounted for. At this point the pilot must make inferences about the causes of the observed behaviour in order to select further actions. The sequence of actions that occur is recorded in Table 2, along with the system behaviour and the information that allows the pilot to decide which actions to perform.
System behaviour | Pilot action | Information available
Hydraulics A warning | |
| Select HYDRAULICS information page |
| Monitor hydraulics level values |
| Switch all to B | A failure drill
A leak stops | |
B leak starts | |
| Switch all to C | B failure drill
B leak stops | |
C leak starts | Infer: must be a primary leak and a secondary leak; don’t know which |
| Switch elevator to A | No drill
A leak resumes | Infer: leak in elevator primary |
C leak continues | Infer: leak in rudder or aileron secondary |
| Switch elevator to C |
| Switch rudder to A |
A leak stops | |
C leak continues | Infer: no leak in rudder secondary |
| Switch aileron to A |
No leaks | |
Table 2: Actions in the hydraulics diagnosis scenario
It is clear that the actions in this situation unfold in a very different way from the bird strike situation. Indications of the status of the underlying system are very important, as is the process by which the pilot draws conclusions from them, based on a model of how the hydraulics system is interconnected, and precisely what the various display indications mean in terms of this model.
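To make that model concrete, the following is a minimal Python simulation of the hydraulics scenario as we read it: reservoir A drives a surface's primary actuator and B or C its secondary actuator, and a surface leaks fluid from the selected reservoir if the actuator in use is the faulty one. The names and structure are our own rendering of the scenario description, not the project's actual system model.

    # The two faults in the scenario: elevator primary and aileron secondary.
    LEAKS = {"elevator": {"primary"}, "aileron": {"secondary"}}

    def actuator(reservoir):
        # Reservoir A drives the primary actuator, B or C the secondary one.
        return "primary" if reservoir == "A" else "secondary"

    def leaking_reservoirs(selection):
        # Which reservoirs lose fluid under a given surface-to-reservoir selection.
        return {reservoir for surface, reservoir in selection.items()
                if actuator(reservoir) in LEAKS.get(surface, set())}

    # Replay the switch sequence of Table 2.
    steps = [
        {"rudder": "A", "elevator": "A", "aileron": "A"},  # initial: A leaks
        {"rudder": "B", "elevator": "B", "aileron": "B"},  # switch all to B
        {"rudder": "C", "elevator": "C", "aileron": "C"},  # switch all to C
        {"rudder": "C", "elevator": "A", "aileron": "C"},  # elevator back to A
        {"rudder": "A", "elevator": "C", "aileron": "C"},  # elevator to C, rudder to A
        {"rudder": "A", "elevator": "C", "aileron": "A"},  # aileron to A: no leaks
    ]
    for selection in steps:
        print(selection, "->", leaking_reservoirs(selection) or "no leaks")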
Such observations about how the interaction episode unfolds can be quite enlightening about where the critical points in a scenario or design are, and where the breakdowns might arise. However, in the next section, we describe a theoretical standpoint from which to conduct an analysis of scenarios in a more systematic manner. This framework will then be employed in an illustration of one way in which scenarios may be analysed.
3. Information resources and cognitive demands
To critically evaluate scenarios and identify the potential problems for the cognitive system, we take as our starting point the demand-resource mismatch view of human error outlined by Woods (1990). The aim is to trace down areas, not just in the design but in the operation of a design, that carry the highest potential for human-machine interaction failures. Our starting point is formed by the interactional resources provided to a practitioner by the design. We then place the design in an operational context (the scenario) and through this we attempt to reveal the kinds of cognitive demands that the design resource does not cater for, or, for that matter, the extra cognitive demands that the design in its very configuration actually creates in that context.
3.1. The information available: Resources
Recent work in human-computer interaction and computer-mediated collaborative work has emphasised the value of viewing cognition as a distributed phenomenon (Halverson 1994; Hutchins, 1995). Within this view, action is seen as guided by the use of information resources that are distributed between machine and humans (Wright et al., 1996; Fields et al. 1997). Note that the use of the term "resources" in this paper is similar to that adopted by Suchman (1987), and others working in the "situated action" tradition, to refer to a mechanism used in the construction of behaviour; this is in contrast to the use of "resource" to refer to some limited capability that is divided out among a number of competing activities (e.g., see Wickens 1992).
The kinds of information that can serve as a resource for deciding how to act include a pre-determined "plan" for achieving some goal, information about the effect certain actions will have on the system and about the organisation, behaviour and current status of the system itself, historical information about how the scenario has unfolded so far, and so on. In a cognitive system, such information resources may be represented "internally" by the human or may be represented "externally" in the interface or system. External resources may be represented explicitly (e.g., a written procedure is an explicit external representation of a plan) or implicitly (e.g., an interface constraint can be an implicit representation of a plan). The existence of information resources in the head of a practitioner or in the environment is, on the one hand, the means by which action may take place, but on the other, a source of demands on the human-machine cognitive system, which must maintain and co-ordinate the resources it needs.
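A minimal sketch of this resource taxonomy as a Python data structure (the field names and example instances are our own, purely for illustration):

    from dataclasses import dataclass

    @dataclass
    class Resource:
        # An information resource that guides action.
        kind: str            # e.g. "plan", "action effect", "system state", "history"
        location: str        # "internal" (in the head) or "external" (in the interface)
        representation: str  # "explicit" (e.g. a written procedure) or
                             # "implicit" (e.g. an interface constraint)
        description: str

    # Examples drawn from the discussion in this section.
    written_procedure = Resource("plan", "external", "explicit",
                                 "Engine fire drill on the procedures display")
    urgency_knowledge = Resource("plan", "internal", "implicit",
                                 "Pilot's knowledge of which drill actions are urgent")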
Such a viewpoint on interaction reduces the emphasis on task analysis at the behavioural level and emphasises instead the information context that supports decisions about action. Such a resource-based view of interaction is potentially valuable for a consideration of interaction failure.
As suggested in section 2, one of the starting points for our analysis is a proposed or envisioned design that provides some of the resources a practitioner may use to deal with the domain problems in a particular scenario. We have also argued that (implicit) assumptions about human performance and operability are attached to those technological choices and design proposals. The aim is to expose those assumptions. We do this by identifying the resources that are not in the designed artefact but in the head of the operator, and by showing how these resources in the head create demands of a cognitive nature.
3.2. Resources, demands and mismatches
As we said in the previous section, one of the cornerstones of our approach is the concept of the information resources that exist in the interface or operational environment, or are assumed to exist in the heads of practitioners, and upon which the formation of intentions about which courses of action to take is contingent.
When information resources play this role, a number of requirements on the human in the cognitive system will arise from the precise form in which the resources are represented, and the way they are used to meet the requirements and objectives of the work domain, in the context of the scenario of interest. These emergent requirements, to manage and co-ordinate resources and use them to take actions, are referred to here as demands that the cognitive system faces. The fact that demands arise in a scenario is not necessarily problematic — the human may be perfectly able to cope with them — indeed, demands are a natural and unavoidable consequence of using resources. However, there are times when the use of resources gives rise to residual, or extra, demands, or times when the demands themselves will cause problems. In these cases we might loosely refer to a mismatch between the resources in a system and the demands associated with using them. The ultimate aim of our approach is to provide a way of identifying the potential for mismatches that could lead to interaction breakdown and error, and to guide the search for design measures to reduce the risks.
We now look at an example to illustrate these concepts and then move on to a more systematic look at the kinds of demands that follow from resource usage.
3.3. Identifying cognitive demands: an example
In order to illustrate the points made above, we return again to the case study scenarios, and look at some of the demands that arise in a situation where the pilot is following pre-determined procedures for dealing with emergencies or other anomalous conditions. An automated, electronic display gives procedural guidance, or external representations of plans in resource terms, when warnings occur. This representation contains not only the actions of the plan and the order in which they are to be carried out (i.e., the plan itself), but also an indication of which of the actions have been done (i.e., a "marker" showing what to do next). The display is capable of showing only a single procedure, corresponding to a single warning, at any one time, and in cases where a number of warnings are concurrently "active", the pilot may select which of the several available procedures is to be displayed. Considering how these resources are used in a particular operational situation will allow us to articulate concerns about the demands that arise.
For example, in a situation where a single warning condition occurs, a single procedure will be activated, and the demands of using that resource will include the requirement that the pilot sees, understands, and correctly interprets the procedure. If, in contrast, several warnings occur, resulting in several procedures being "active", some additional demands fall to the pilot. For a start, the pilot must be aware that several procedures are available and that switching between them is a possibility. Furthermore, there is often an operational requirement that certain "urgent" actions in a procedure be carried out as soon as possible, while other "clean-up" actions may be deferred until later. For the pilot to meet this operational requirement by switching between the several procedures, completing all the "urgent" actions before moving on to the "clean-up" phase, a number of demands emerge.
One of these demands in particular arises from the need for the pilot to know which of the actions from a given procedure are urgent, and which perform clean-up functions and can be deferred until later (and this may in turn depend on the pilot having knowledge of the current situation and system status). This distinction is not represented in the computerised procedure. In other words, meeting the operational requirement of carrying out the urgent actions first (i.e., the scenario), together with the resources provided by the procedures display, creates the need — or demand — for additional resources in the pilot’s head. This may lead to a mismatch between the requirements of the problem scenario and the set of resources possessed by an actual pilot — a situation that could manifest itself as an interaction failure (for example, as an error of omission or mis-ordering of some of the urgent actions).
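The mismatch can be made concrete with a small Python sketch of the procedures display as described above (one procedure visible at a time, a marker for the next action), in which the urgent/clean-up distinction has to be supplied from the pilot's head. The drill names, the actions listed and the urgency set are invented for illustration.

    class ProcedureDisplay:
        # External resource: shows one active procedure and a 'next action' marker.
        def __init__(self, procedures):
            self.procedures = procedures                    # name -> ordered actions
            self.marker = {name: 0 for name in procedures}  # next-action marker
            self.shown = next(iter(procedures))             # one procedure visible

        def select(self, name):
            self.shown = name

        def next_action(self):
            actions = self.procedures[self.shown]
            index = self.marker[self.shown]
            return actions[index] if index < len(actions) else None

        def do_next(self):
            self.marker[self.shown] += 1

    # Hypothetical drills; note the display does NOT say which actions are urgent.
    display = ProcedureDisplay({
        "ENG 3 FIRE": ["HP cock 3 off", "Fire extinguisher 3: shot 1", "Generator 3 off"],
        "ENG 4 FAILURE": ["HP cock 4 off", "Generator 4 off"],
    })

    # Internal resource the design assumes: the pilot's knowledge of urgency.
    URGENT = {"HP cock 3 off", "Fire extinguisher 3: shot 1", "HP cock 4 off"}

    # To do all urgent actions first, the pilot must remember to switch procedures
    # and must recall URGENT unaided: the demand the display leaves unsupported.
    for name in ["ENG 3 FIRE", "ENG 4 FAILURE"]:
        display.select(name)
        while display.next_action() in URGENT:
            display.do_next()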
What this example shows is that attached to the provision of a designed resource are (mostly implicit) assumptions about what the practitioner is going to know, where he or she will look, and what he or she will decide to do first based on other evidence or competing pressures. Of course, some might argue (and certainly designers will) that design is not the only resource: a practitioner’s experience and training is just as valid a resource for dealing with domain problems. While this is undoubtedly true, our approach endeavours to expose a designer’s assumptions about the demands placed on practitioners and their ability to cope with such demands. In other words, with external, designed resources come assumptions on the part of the designers about users’ internal resources and their ability to carry out the cognitive work of managing internal resources and co-ordinating internal and external representations.
The other issue raised by this example is that the introduction of new technology (such as a computerised procedure display system) may undermine the cognitive strategies by which experienced practitioners successfully avoided data overload in the past (Woods, 1989). For instance, in work related to this project it was observed that, when using paper procedures, practitioners would wedge their thumb between the pages that listed the relevant procedures. The thumb turned out to be not only a navigation aid ("how do I get to the right procedure") but also a memory aid ("the clean-up actions of this procedure still need to be done"). The impossibility of using those "old" strategies in the new system may lead to additional cognitive demands, relating to memory burdens and navigation between procedures.
Before returning to our case study scenarios with these analytic concepts, we can look in a little more detail at the nature of the demands that arise when practitioners use resources to meet problem domain requirements.
Categorising demands
How can we characterise the kinds of demands that flow from the operability assumptions that we have identified with the help of scenarios? Drawing on analyses of a wide range of complex systems failures, Woods et al. (1994) describe three aspects of cognitive demands, concerned with knowledge, attention and strategy. These aspects of demand have particular significance when considered in the light of the kinds of resources, resource representations and patterns of resource usage that give rise to them. Here are some ways in which we could consider them in the light of our framework:
knowledge aspect — if a resource is internal, then a knowledge demand is simply that the person knows the resource and the situations in which it is applicable; if the resource is externally represented, then a demand is that the user possesses the knowledge needed to interpret the external representation;
attentional aspect — for an internal resource, the attentional demands relate to the need for the person to "activate" the resource at the right time. For an external resource, on the other hand, an attentional demand is that the person recognise the resource’s salience and direct their attention towards it at the appropriate time.
strategic aspect — there are typically many ways to make use of the available resources to meet the operational objectives of a scenario, and a strategic demand is to decide how. For example, if a plan resource is available, then a strategic aspect of demand is the need to decide whether to follow the plan or to adopt a different approach and use different resources (these three aspects are sketched in code below).
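A minimal sketch of these three aspects, keyed on whether a resource is internal or external; the wording of the demand descriptions is our own paraphrase of the list above.

    def demands_for(resource_location):
        # Knowledge, attentional and strategic demands for a single resource.
        # `resource_location` is "internal" (in the head) or "external" (in the world).
        if resource_location == "internal":
            return {
                "knowledge": "know the resource and the situations in which it applies",
                "attentional": "activate the resource at the right time",
                "strategic": "decide whether to use this resource or another approach",
            }
        return {
            "knowledge": "possess the knowledge needed to interpret the representation",
            "attentional": "recognise the resource's salience and attend to it in time",
            "strategic": "decide whether to follow it or adopt a different approach",
        }

    print(demands_for("external")["attentional"])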
We are currently looking at ways in which we can help practitioners or other participants in the exploration of scenarios see for themselves what kinds of demands their operability assumptions might create. What stresses the importance of taking these aspects of cognitive demands into account during design is that they invariably turn up in post-mortems of mishaps with fielded designs, albeit in various guises. For example "lack of mode awareness" may refer to a situation where the attentional demand of keeping track of automatic mode transitions was insufficiently supported by system feedback.
Note that the logical conclusion to the arguments above is not that "resources in the world are good" and "resources in the head lead to demands and breakdown"; on the contrary, all resources have demands associated with their usage. What the theoretical framework aims to lay out instead is a systematic exposition of the human performance, or operability, assumptions attached to design decisions. Once these assumptions, and the cognitive demands associated with them, have been identified, they can be used in an evaluative way, for example by taking them back to designers and practitioners in order to reflect on their relevance and seriousness.
3.4. Using scenarios to expose operability assumptions
Our approach is to use the scenario descriptions as the basis of a range of inspection and analysis techniques that aim to identify possible demand problems of the kind described above. Three classes of demand and interaction failure identification techniques are based on walkthroughs, models and simulations.
Walkthroughs of the scenario can be carried out in conjunction with experts or practitioners. The participants will be asked to critique the action of the scenario in a way that will help to identify possible pitfalls and alternative (and possibly erroneous) ways in which the scenario could unfold (for instance, see the retrospective analysis of Dekker (1996), or the use of scenarios by Bødker (1991)). This approach is reliant on the expertise and mix of the practitioners, and their ability to articulate tacit knowledge and to envisage how scenarios will unfold "for real".
Model-based approaches such as THEA (Wright et al. 1994) or a number of Human Reliability Analysis techniques (Hollnagel 1994) use a model to systematically guide an inspection or analysis of the scenario, looking for places where breakdowns might occur and have a significant impact. This is typically carried out by analysts or designers, rather than practitioners, and will be reliant, among other things, on the "veracity" of the models used. The example below illustrates how a model based on the theoretical framework expressed in Section 3 can help to systematically highlight some of the issues in one of our example scenarios.
Simulations can be used in a variety of ways to probe a range of questions about where the problems in the use of a system might lie. The crucial factor here is the fidelity of the simulation as a whole, and this can vary in a number of dimensions: with respect to the users (from undergraduate students to real airline pilots), the technology (from desktop mock-ups to full simulation facilities), and the context in which the action takes place (from part-task simulations and simple laboratory experiments to situations where multiple concurrent tasks are active).
The aim of each of these is very similar: to help understand how a situation "for real" might differ from the scenario as written down, and in particular where interaction failures and breakdowns are most likely to occur. Each method has its own strengths and weaknesses for identifying potential interaction failures. The point is not that any one approach is better than the others, but that together they provide complementary, though qualitatively different, converging sources of evidence about the breakdowns to which a system will tend to be susceptible. This can be of vital importance in knowing how to improve the design of an artefact in order to address safety-related concerns. For the purposes of illustration, we show what the results of only one of these — a model-based inspection — might look like.
4. A prototype approach to analysis
In this section we illustrate how an analytic approach aimed at uncovering some error problems might work. The approach aims to identify some of the demand-resource problems and the errors that they could lead to. The approach involves looking at each item in the scenario in turn and asking
• what information resources are required in order for the item to take place;
• whether these are provided in the interface (or the wider environment) or are assumed to be in the pilot’s head;
• what demands arise from the use of these resources in order to carry out the scenario item;
• what types of error could plausibly arise from a failure to meet these demands; and
• what effect could such errors have on the system as a whole.
The reason for looking at how demand problems may manifest themselves as concrete errors and system effects is to provide one means of assessing the severity or criticality of particular demand-resource mismatch problems (and therefore a way of assessing where best to expend re-design effort). The results of the analysis can be recorded in a table under the following headings:
Scenario item | Information resources | Emergent demands | Possible failure | Possible effects on the system
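To make the shape of such an analysis record concrete, the following minimal Python sketch (our illustration, not part of the original method) shows how one row of the table might be held as a simple record; the class and field names are assumptions, and the example row is transcribed from Table 3 below.

from dataclasses import dataclass
from typing import List

@dataclass
class ScenarioItemAnalysis:
    # One row of the demand-resource analysis table.
    scenario_item: str
    information_resources: List[str]  # resources and where they live (interface or head)
    emergent_demands: List[str]       # knowledge, attentional and strategic demands
    possible_failures: List[str]      # plausible errors if the demands are not met
    possible_effects: List[str]       # consequences for the system as a whole

# Example row, taken from the bird strike analysis in Table 3.
engine3 = ScenarioItemAnalysis(
    scenario_item="Engine 3 urgent actions",
    information_resources=["Procedure (interface)", "Warning (interface)"],
    emergent_demands=["Attentional demand to notice warning and procedure",
                      "May conflict with goal to maintain thrust",
                      "Time pressure"],
    possible_failures=["May delay close while resolving conflict",
                       "May lose thrust goal and close too early"],
    possible_effects=["May fail to extinguish fire", "May lose thrust"],
)

print(engine3.scenario_item, "->", "; ".join(engine3.possible_effects))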
Case study — analysing interaction failure in the bird strike scenario
It is now possible to show the application of a model to one of the scenarios — the severe bird strike — to see how it can be used to probe the potential for breakdowns and failures in the scenario. Rather than attempt to apply the analysis to "scenario items" at the level of individual actions, we group the behaviour recorded in Table 1 into a number of larger action groups. In addition to making the analysis more manageable, these actions relate more directly to the various goals of the pilot in this situation. One consequence of this grouping, however, is that we no longer have a means of discussing failures that occur at the lower levels of the pilot’s interactions with the system. This seems quite justified, as the low level interactions are typically not the point at which demand-resource conflicts emerge in this situation. The requirements associated with deciding precisely which operations to perform are well resourced (through the Quick Reference Handbook, the electronic procedures display formats, and knowledge in the pilot’s head), and conflicts between goals tend not to emerge at the level of individual interactions, but at the level of the higher level action groupings shown in Table 3.
Scenario item | Information resources | Emergent demands | Possible failure | Possible effects on the system
Engine 3 urgent actions | Procedure; warning (both in the interface) | Attentional demand to notice warning and procedure; may conflict with goal to maintain thrust; time pressure is a factor here | May delay close while resolving conflict; may lose thrust goal and close too early | May fail to extinguish fire; may lose thrust
Engine 4 urgent actions | Procedure; warning (both in the interface); knowledge of action criticality | Need knowledge of action criticality and use it appropriately | May omit or delay engine 4 close task | Engine 4 problems could develop into fire
Clean-up engine 3 | Procedure (in the interface); knowledge that engine fire procedure not complete | Knowledge demand: must recall fact that procedure is incomplete | May omit or delay and continue with engine 4 or generator tasks | Marginal effects on safety
Clean-up engine 4 | Procedure (in the interface); knowledge that engine failure procedure not complete | Knowledge demand: must recall fact that procedure is incomplete | May omit or delay and continue with generator tasks | Marginal effects on safety
Manage generator problems | Procedure; warning (both in the interface) | Knowledge demand: must recall that generator problems are less critical than engine clean-up | May fixate on warnings and carry out this activity early | Marginal effects on safety
Table 3: Model-based analysis of the actions of the bird strike scenario
This analysis highlights a number of aspects of the interaction that might turn out to be problematic when the system is used in practice. The overriding factor that is evident in this analysis is that the "management" of the overall interaction might cause problems, due to the paucity of externally represented resources to support correct switching between the groups of actions, and the additional demand that this places on the pilot in terms of the possession and management of internal resources, and their co-ordination with external resources.
5. Conclusions and limitations of the framework
This paper has presented a framework for the assessment of potential human-machine interaction failures. The approach supports the recontextualisation of human error by taking into account both the details of the operational environment and the cognitive drivers behind the creation of interaction breakdown. The use of scenarios, combined with proposed designs, has been presented as a way to support this recontextualisation of human error.
This approach would help a design organisation avoid the trap of having to do human error analysis at the late stages of the design process, where the evaluation becomes narrowly focused on justifying that the system could work, and where opportunities for informing re-design are no longer available. Instead, the approach outlined in this paper suggests a strategy of "falsification" of a designer’s assumptions about human performance. By applying proposed designs in scenarios, it helps us discover the impact of the new design on the overall human-machine problem solving ensemble and exposes (implicit) assumptions about the operability of the design. Thus, a method based on this framework would allow us to:
• evaluate the potential for human-machine interaction breakdown early on in a design process;
• expose assumptions about human performance in early versions of the design;
• capture the operational context and cognition behind the creation of error;
• inform re-design when system specifications are still fluid.
A method based on the current framework would have certain shortcomings. These relate to the time-paced and event-driven nature of typical complex, dynamic domains. If, for instance, hardware or other system failures propagate through the system over time, changing evidence about the problem comes in all the time, forcing re-assessments of hypotheses, re-evaluation of intentions, and the need to tap into shifting resources on the part of the human practitioners. How situation assessment may go sour over time as a result of this cannot yet be sufficiently captured by the current framework.
References
Bødker, S. (1991) Through the interface: A human activity approach to user interface design. Lawrence Erlbaum Associates Inc.
Carroll, J. (Ed.) (1995). Scenario-Based Design: Envisioning Work and Technology in System Development. J. Wiley and Sons.
Day, P. and M. Hook (1996). PUMA: A description of the PUMA method and toolset for modelling air traffic control workload. Technical report, Roke Manor Research Ltd., Romsey, UK.
Dekker, S. (1996) The complexity of management by exception: Investigating cognitive demands and practitioner coping strategies in an envisioned air traffic world. PhD Thesis. Ohio State University.
Fields, B., Wright, P.C. and Harrison, M.D. (1997) Objectives strategies and resources as design drivers. To appear in Interact 97: Sydney Australia.
Fischer, U., Orasanu, J. and Wich, M. (1995). Expert pilots' perceptions of problem situations. In 8th International Symposium on Aviation Psychology, Ohio State University, Columbus, Ohio.
Greenbaum, J. and M. Kyng (1991). Introduction: Situated design. Chapter 1 of J. Greenbaum and M. Kyng (Eds.), Design at Work: Cooperative Design of Computer Systems, pp. 1-24. Lawrence Erlbaum Associates Inc.
Halverson, C.A. (1994) Distributed Cognition as a Theoretical Framework for HCI: Don't Throw the Baby Out With the Bathwater -- The Importance of the Cursor in Air Traffic Control. Department of Cognitive Science, University of California, San Diego, Report 9403.
Hollnagel, E. (1994) Human Reliability Analysis — Context and Control. Academic Press.
Hutchins, E. (1995). Cognition in the wild. Boston, MA: MIT Press.
Hsia, P., Samuel, J., Gao, J., Kung, D., Toyashima, Y., and Chen, C. (1994, March). A formal approach to scenario analysis. IEEE Software, 33-41.
Johnson, H. and Johnson, P. (1991) Task Knowledge Structures: Psychological basis and integration into system design. Acta Psychologica 78.
Kuutti, K. (1995). Work process: Scenarios as a preliminary vocabulary. Chapter 1 of Carroll (1995), pp. 19-36.
Kyng, M. (1995) Creating Contexts for Design. Chapter 4 of (Carroll 1995), pages 85-107.
Shepherd, A. (1989) Analysis and training in information technology tasks. Chapter 1 of Diaper, D. (Ed) Task Analysis for Human-Computer Interaction. Ellis Horwood.
Suchman, L. (1987) Plans and Situated Actions: The problem of Human-Machine Communication. Cambridge University Press.
Wickens, C.D. (1992) Engineering Psychology and Human Performance. 2nd edition, HarperCollins.
Woods, D (1989) Modelling and Predicting Human Error. Chapter 19 of Elkind, J.I., Card, S.K., Hochberg, J. and Huey, B.M. (1989) (Eds) Human Performance Models for Computer Aided Engineering. Washington DC: National Academy Press.
Woods, D. (1990) Risk and Human Performance: Measuring the Potential for Disaster. Reliability Engineering and System Safety 29: 387–405.
Woods, D., Johannesen, L., Cook, R., and Sarter, N. (1994) Behind Human Error: Cognitive Systems, Computers and Hindsight. CSERIAC State-of-the-Art-Report SOAR-04-01.
Wright, P.C., Fields, B., and Harrison, M.D. (1994) THEA: Techniques for Human Error Assessment. Dependable Computing Systems Centre Technical Report TR/94/16, University of York.
Wright, P.C., Fields, B., and Harrison, M.D. (1996) Distributed information resources: A new approach to interaction modelling. In Green et al. (Eds) Proceedings ECCE8: European Conference on Cognitive Ergonomics, EACE.
Analysis Of A Human Error In A Dynamic Environment:
The Case Of Air Traffic Control.
Marie-Odile Bès
Université Lille 3 et Univ. de Valenciennes et H-C
Tel: 27 14 12 34, ext. 4004
Email: bes@univ-valenciennes.fr
The constant increase in air traffic and the possibilities offered by computerisation and automation lead to the consideration of new tools supporting the operator’s activity. This joint project between C.E.N.A., L.A.M.I.H. and Percotec consisted in evaluating various implementations of a principle of distribution of workload between air traffic controllers and an expert system, called dynamic allocation of tasks. Our work focuses on the analysis of human errors intervening during these simulations. We present the study of a case of human error in order to highlight the specific difficulties encountered by operators in managing the temporal aspects of a dynamic environment, and relate these to Reason's (1990) Generic Error Modelling System, which deals with the question of human error. In conclusion we bring out some parts of this model that should be deepened in order to encompass the temporal characteristics of dynamic environments.
1. INTRODUCTION:
Reason's (1990) work on human error in process control, following Rasmussen's (1986) contribution, provides a rich framework for the study of human errors, so we recapitulate those contributions; indeed, the study of human error has proved useful in revealing the needs of operators and the demands they have to meet.
In the first part we present the theoretical framework for the study of human error, then the experiment, then some basic notions of air traffic control and the case study, followed by the conclusion.
2 Human Error
2.1 Definitions
We present several definitions of human error, putting forward the problems they deal with, and conclude with the definition we will use.
In the framework of system reliability, Leplat (1985) states that "a human error is produced when a human behaviour or its effect on a system exceeds a limit of acceptability". In this definition the main point is the variation of performance of the human-machine system: this variation should not go beyond an accepted limit or norm. The problem here is the existence of a norm to which we can compare the performance of the human-machine system; using a norm depends on the choice of a reference point, which in particular should be relevant to the operator’s activity.
This definition also leads us to take into account that observed behaviours with similar performance scores can result from varying cognitive processing by operators when they are confronted with complex and dynamic environments. When human error is tackled in such a way, what is captured is a deviation of performance from a norm, not the link between the modes of production of errors and a model of activity which can account for the observed performance.
Rasmussen (1990) stresses that, in the study of human error, if one considers the variety of actions included in the class of human errors and wants to understand them, it is necessary to categorise human errors on the basis of the mechanism at their origin; this is what Reason (1990) undertook. This author defines human error as "all the occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to the intervention of some chance agency". Thus only intentional actions can be considered as human errors; here intention is understood as the desired end state of actions.
The definition that we will retain comprises both dimensions described above: a human error will be a deviation from a norm, and a deviation from the preliminary intention to reach this norm. In our work this norm is the minimum separation limit between aircraft; the preliminary intentions are extracted from the concurrent verbalisations.
2.2 The Generic error modelling system
Reason (1990) differentiates between error types according to the level of cognitive activity described by Rasmussen (1986). At the skill-based level, slips and lapses are deviations of actions from the preliminary intentions due either to a failure in execution or in storage. At the rule-based level, mistakes refer to the inappropriate use of diagnostic rules; at the knowledge-based level, mistakes refer to the correct execution of intentions which are inadequate to reach the goal sought.
The various types of errors (slips, lapses and mistakes), distinguished according to the cognitive level of activity, do not, contrary to what one might believe, determine the form of the errors observed; in fact errors usually take the form of the mistaken application of a frequently used routine procedure. This results from the predominance of two cognitive biases, frequency gambling and similarity matching, which are induced by the mechanisms of recollection of information.
Errors thus take most often the following forms:
i: at the skill-based level: it is usually the intrusion of a routine procedure during the execution of a procedure requiring sustained attention from the operator, combined with mental preoccupation with either an external or an internal element. Interruptions can also be the source of omissions.
ii: at the rule-based level: it is either the misuse of a good rule (the more a response behaviour was activated in the past, the more readily it will come to mind), or the application of a bad rule, where the rule itself is faulty.
iii: at the knowledge-based level: the difficulties come from bounded rationality and from the incompleteness of the representation the operator has of the situation, as well as from the type of problem dealt with, static or dynamic.
The existence of a delay between the formation of an intention to act and the execution of the action can induce a loss of the primary intention, which is called a prospective memory failure. Prospective memory designates the capacity to remind oneself of the actions that will have to be carried out and the moment of their execution; Reason (1990, p. 107) stresses that prospective memory errors are among the most common forms of human fallibility in everyday life. Whereas slips and lapses depend on the failure of attentional checks (most often the omission of a check), prospective memory failures depend on the reactivation of primary intentions: unless these are periodically refreshed by attentional checks in the interim, they will probably become overlaid by the other demands placed on the conscious workspace. In both cases, the mode of control of attention is feed-forward, emanating from stored knowledge structures (motor programs, schemata or rules), and the performance is based on a very flexible and efficient dynamic world model.
3 PLATFORM AND EXPERIMENTAL SETTING
The dynamic allocation of tasks has been implemented in a computer platform (SPECTRA V.2), which was tested for three weeks at Reims’ en-route air traffic control centre (C.N.R.A. EST). The aspects investigated are the effects of three experimental conditions on performance, human errors and human-machine cooperation: a. without tools; b. with the expert system’s support, the allocation of tasks being chosen by the planning controller; c. with the expert system’s support, the allocation of tasks being chosen by a task scheduler (the planning controller can modify the allocation if the radar controller requires it). The simulation platform comprises five screens: a radar screen and an electronic strip board each for the radar controller and the planning controller, plus a planning screen dedicated to the planning controller so that he can manage the task allocation.
Instead of interacting with the pilots through the radio frequency, the controller uses a mouse and clicks on elements of the interface, which then provide interactive menus enabling transmission of orders to the pilot, either on the radar screen (i.e. on the icon of a plane) or on the strip board (i.e. on the electronic strip). Special attention was paid to the realism and fidelity of the situation: the geographic sector used was Reims’ Centre, the traffic used respected the rules of navigation in this sector (routes, levels, types of flight etc.) and included the visual alarm existing on a real work-station, called the safety net (SFNT). The safety net is triggered when converging planes are under the minimum separation limits (8 NM, about 15 km, horizontally); the icons of the planes involved turn red and start flashing on the radar screen.
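The basic separation check behind such an alarm can be sketched as follows; this is our own minimal illustration, assuming plane positions expressed as (x, y) coordinates in nautical miles, and it ignores the convergence and prediction logic a real safety net would also use.

import math

MIN_SEPARATION_NM = 8.0  # horizontal minimum used in the experiment (about 15 km)

def horizontal_separation(p1, p2):
    # Horizontal distance in NM between two planes given as (x, y) positions in NM.
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def safety_net_triggered(p1, p2):
    # True when two planes are below the minimum horizontal separation.
    return horizontal_separation(p1, p2) < MIN_SEPARATION_NM

# Example: two aircraft 6 NM apart would set off the alarm.
print(safety_net_triggered((0.0, 0.0), (6.0, 0.0)))  # True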
Qualified controllers practised with the platform for two hours, after a presentation of all the elements of the interface and tools. The practice session included three scenarios (45 min each): first a scenario with a low workload, then two long scenarios with heavy workload.
Three pairs of controllers (one planning and one executive) completed three sessions of 45 min, one for each experimental setting (once a controller was assigned to a role he remained in that role for the three sessions). Each session was followed by an auto-confrontation and an interview-based questionnaire (for each controller separately) on various aspects of the interface, tools, errors committed and human-machine cooperation (the radar controller did the auto-confrontation first and then the interview). The sessions and the auto-confrontations were audio-recorded, and the computer record of each session was saved, enabling not only the auto-confrontation via a replay but also the coding of the controllers’ actions and system states.
The data collected (spontaneous verbalisations, controllers’ actions on the system, system states and auto-confrontations) were integrated into a single protocol. Since the radar controller is the executive controller, with responsibility for the flights in his sector, we have focused on his activity. We present below an analysis of an error committed under the first condition described above (a. without tools); indeed, it is necessary to know controllers’ errors when working by themselves before dealing with the effects of the various experimental conditions.
4 Notions of Air Traffic Control:
Before turning to the case study we describe some aspects of controllers' activity (the distribution of tasks, the procedures used and the prescribed objectives) that allow the analysis of the data of interest for the case presented.
En-route air traffic controllers work in pairs. The radar controller (R.C., also called the executive controller) has as his main tasks ensuring communications with pilots and conflict resolution inside the geographical sector he is in charge of. The planning controller (P.C.) synchronises activity with adjacent sectors and is in charge of conflict detection for planes entering the sector; he signals conflicts to the R.C. by underlining the identifications (e.g. BA1997) of the planes involved in a specific colour. The R.C. is then in charge of the conflict resolution.
The time spent in the sector by a flight comprises three key stages: entrance, crossing, transfer to the following sector.
Regarding the entrance, after the pilot has called ("APPEL(identification)" in our coding scheme), the (R.C.) can take charge of the flight ("ACT-STRIP(identification, assume)"); the time between those two events can vary, for instance if the (R.C.) is busy with other tasks.
The (R.C.) must then integrate the flight into the rest of the traffic and deliver an initial clearance (the flight level authorised in the sector); this entails an analysis of the route followed by the flight in order to detect conflicting aircraft. During the crossing the (R.C.) delivers deviation orders, either a change in heading ("ACT_TRAJ(identification, cap, value)") or in level ("ACT_TRAJ(identification, CFL, value)"). If there are several intersections on the route of a plane the (R.C.) will have to take them into account (requiring several conflict detections), so he may have more than one action on a single plane.
Finally when the plane approaches the boundary of the sector the (R.C.) transfers the flight to the next sector (in our coding scheme "ACT-STRIP(identification, TRANS_SEC)").
It is thus the (R.C.) who is in charge of the decisions regarding the flights crossing his sector. Because of this prevalence, and given the focus of our work on human error, we concentrate primarily on the activities of the radar controller.
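The coded events introduced above ("APPEL(...)", "ACT-STRIP(...)", "ACT_TRAJ(...)") form a simple textual protocol. The following minimal sketch is ours, purely for illustration (the actual coding tools used in the study are not described in the paper), and assumes the entries are recorded as plain strings.

import re

# Assumed string form of the coding scheme entries, e.g. "APPEL(DLH795)",
# "ACT-STRIP(DLH795, assume)" or "ACT_TRAJ(DLH795, cap, 10)".
ENTRY_RE = re.compile(r"^(?P<action>[A-Z_\-]+)\((?P<args>[^)]*)\)$")

def parse_entry(entry):
    # Split a coded event into its action name and argument list.
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError("unrecognised coding entry: " + entry)
    args = [a.strip() for a in match.group("args").split(",") if a.strip()]
    return match.group("action"), args

# Example: the key stages of a flight's passage through the sector.
for coded in ["APPEL(DLH795)",
              "ACT-STRIP(DLH795, assume)",
              "ACT_TRAJ(DLH795, cap, 10)",
              "ACT-STRIP(DLH795, TRANS_SEC)"]:
    print(parse_entry(coded))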
To understand and evaluate the radar controller's activity, we said earlier that an accepted norm has to be chosen. The minimal distances of separation are used (to delimit our corpus of errors) but do not suffice to describe the controller's activity; the procedures and objectives taught to controllers in their training at Reims Centre will therefore be used as a reference model of the prescribed tasks (cf. Leplat, 1985), as opposed to the activity observed via concurrent verbalisations. This reference to prescribed tasks will help us first to understand observed activity and second to evaluate deviations from the prescribed procedures and objectives.
Two objectives are of interest for the sequence we will detail in the case study:
i. the (R.C.) must give priority to answering pilots' calls.
ii. The (R.C.) must be able at any moment to tell what the next message he will have to send, or will receive, will be.
The first objective means that the (R.C.) should be able to answer any pilot call immediately, whilst the second means that he should know at any moment what will happen in his sector, so as to be able to cope with it and ensure that the situation remains secure.
5 CASE STUDY
The analysis is presented in two parts: first the whole sequence of conflict resolution between DLH795 and SLR2744, then the resolution placed in its context. This error was chosen because it meets two criteria: i. the error constitutes a potential threat to safety; ii. the radar controller's actions played a critical role in the occurrence of the incident.
5.1 CONFLICT RESOLUTION BETWEEN SLR2744 AND DLH795:
The following points are notable (cf. Fig.1):
i. the (R.C.) identifies a problem between DLH795 and SLR2744 early, before either of the planes involved has called: 20:58:46 "OK, I have a conflict. 3.5 at Epinal."
ii. The planning controller underlines the identification of each plane 5 min before the alarm is triggered (21:01:52 "CO MARQ IND").
iii. The pilots' calls for the two planes happen less than 5 min before the alarm is triggered; there is thus only a short time span left for the (R.C.) to act. The speed is 481 knots (8 NM/min); this means DLH795's call takes place 4 min 43 s, or 43 NM, before reaching the minimum separation, and SLR2744's call 2 min 14 s, or 17 NM, before minimum separation is reached.
iv. The call of DLH795 is followed by integration into the sector, but no verbalisation about it is uttered. The (R.C.) does several other tasks in succession. He has not recalled what he identified earlier and he will act 2 min 16 s later. We consider this first delay an omission of the prior intention.
v. Then the (R.C.) comes back to the conflict: 21:04:00 "Sobelair (SLR2744) Bruxelles. Speeds? Identical, they cut." Since the distances to the crossing point and the speeds are equal he decides to give an identical deviation to each plane. This is the type of resolution chosen here, but his analysis is interrupted.
vi. He comes back to it at 21:04:49 and finally gives the first deviation order to DLH795 (18 NM before minimum separation).
vii. Finally SLR2744 calls at 21:05:10; the (R.C.) integrates it one minute later and then deviates it (9 NM before minimum separation). This constitutes a second delay.
The (R.C.) has detected the conflict well ahead of the alarm and has chosen a resolution which is acceptable, yet the omission of the prior intention and the accumulation of several delays, whilst the situation was urgent, led to the violation of minimum separation.
Going back to the two objectives stated above, the first objective, regarding answers to pilots' calls, is met for DLH795 but not for SLR2744 (cf. vii.). Regarding the second objective, the (R.C.) is not able to give a deviation to DLH795 soon after accepting it in the sector, but only 2 min 13 s after it was accepted (cf. iv., v., vi.). The delays detailed above are linked to the demands put on the (R.C.) by other pending tasks he had to carry out, interspersed with the conflict resolution between DLH795 and SLR2744. We present below the whole conflict resolution in context.
5.2 ANALYSIS OF THE CONTEXT:
We present the sequence of actions carried out in parallel with the conflict resolution, in order to present the external conditions contributing to the incident described above. The verbalisations run from 21:02:30 until 21:06:40, that is, during the conflict resolution of DLH795 and SLR2744 (from when DLH795 enters the sector until the alarm is triggered).
(1) DLH795 is integrated into the sector without verbalisation.
(2) 21:02:46 "So, LTU 3.30 2.30 (LTU123). It must be sent. Who will obstruct? The Malef (MAH558). I'll have a crossing and I won't be able to descend it (LTU123). I'll put it at 2.9." As soon as the LTU123 is integrated in the sector the (R.C.) detects a problem with MAH558.
(3) 21:03:09 "O.K. TWE (TWE347) has called me, Hello maintain 3.30 I'll call you again." The (R.C.) maintains the flight at its entrance level in order to come back to it later.
21:03:25 "Right, kilo mike (IT17KM) and KLM (KLM629), they cut." The (R.C.) detects a conflict but no action is carried out.
(4) 21:03:40 "The Luft' (DLH5028), I should have put it there." The (R.C.) notices the position of the strip and transfers it to the following sector.
21:04:00 "So, Sobelair (SLR2744), Bruxelles. Speeds? Identical. They cut." Here the (R.C.) carries out part of the conflict resolution between DLH795 and SLR2744.
(7) 21:04:24 "What is this? Ah yes, it has to go down (IT401QG), it's late. To 120, it's a Mercure, it goes down really well." The (R.C.) is interrupted in his reflection and has to answer IT401QG, which has to be given a descent order.
(9) 21:04:49 "OK. if I need a direct route it (SLR2744) will go this way, to FLORA. Ah but no I don't have it. On the other hand this one yes I have it (DHL795), let’s go, I am going to give him 10 (cap 10 left, DLH795). This one (SLR2744) I don't have it, did it call me ? No, not yet." the (R.C.) has taken up again the conflict resolution of DLH795 and SLR2744 in order to finish it and give a deviation to DLH795.
(10) 21:04:56 "The BRY (BRY657) has called, 3.9 it stays at 3.9." the (R.C.) maintains the BRY657 at its entrance level in order to come back to it later.
(11) 21:05:14 "TWE (TWE347) I have to shoot it." the (R.C.) starts the descent of TWE347.
In the four following extracts the controller deals with a conflict between IT17KM and KLM629:
21:05:05 "At 3.10 they cut, Kilo Mike (IT17KM) where is it going ?"
21:05:24 "Where is it (IT17KM) ? Ah it's going to Boulogne."
21:05:31 "I am going to deviate him (IT17KM) from the KLM (KLM629) as soon as I have it on the frequency, maybe I have him (IT17KM) by the way ?"
21:05:40 "No I don’t' have him (IT17KM) on the frequency yet."
(13) 21:05:46 "Ah, this one has called (SLR2744)."
(16) 21:05:56 "I can give him (SLR2744) a small deviation." the (R.C.) gives a deviation order to the SLR2744.
(17) 21:06:02 "Portos (TAP345) 2.8. Let's go." after obtaining the required separation for a pending conflict the controller gives a new clearance to the TAP345.
21:06:09 "2.8 mmmh, I have to check the other one at 3.30, the Scandi' (SAS858)." the (R.C.) makes a check on an interfering plane.
(18) 21:06:14 "There is the AFR347 which is calling, a 747, he wants 3.5 and goes to Abbeville, here 3.5, they cut. I'll put it at 3.10. Here, is it ahead? Scandi’ goes up at 2.7."
The (R.C.) detects a conflict between AFR347 and SAS858.
(19) 21:06:45 "I haven't cleared it (AFR347) yet, 2.6 Hello. Since it does not climb well I will have a problem." Then he notices that he is late and delivers the clearance.
Finally the MAH558 is integrated into the sector without verbalisation.
During this sequence of activity we would like to underline the importance of pilots’ calls and integration tasks, which frequently interrupt the treatment of the conflict between DLH795 and SLR2744. All the interfering tasks make it difficult for the controller to meet the two objectives described above, and they play an important role in the occurrence of this incident.
6. CONCLUSION:
The analysis of this incident enables us to describe two sets of factors, internal and external, leading to the incident. We first recapitulate the internal factors, that is, factors depending on the (R.C.)'s activity, and then the external factors, which depend on the context of the resolution.
We have highlighted several internal factors:
i. the absence of action on DLH795 after its integration is an omission of the prior intention; we can relate it to a prospective memory error, the failure to remember to carry out intended actions at the appropriate moment, as described by Reason (1990).
ii. the delays accumulated in answering pilot’s call and in implementing the deviation actions.
As regards external factors, the numerous other pending tasks (pilots’ calls, integration tasks, transfers to the next sector, etc.), interspersed with the analysis of the conflict resolution and arising whilst the conflict was underway, induce several delays, first during the analysis of the conflict and second in the implementation of actions. The high tempo of activity required of the (R.C.) makes it difficult for him to stay ahead of the situation in order to remain in control of it.
Regarding the choice of conflict resolution (deviating both planes), we said earlier that it was not an error as regards the conflict itself, but it proved insufficient if we take into consideration the whole context and the high workload; indeed, it would have been more economical to deviate only the first plane entering the sector (DLH795). The (R.C.) did not take this into account, nor the fact that the pilots’ calls were delayed, thus putting himself in a situation where he had little time to act and many other tasks calling for his attention. This could be described in Reason’s framework as the misapplication of a good rule, resulting from the predominance of the similarity-matching bias applied to a new situation; here it is the management of the temporal constraints of the situation which is the problem.
Following this case study we can emphasise several dimensions which have consequences for the prevention of human-machine system failure:
i. the existence of delays in the implementation of intended actions in the management of dynamic situations under heavy workload may increase the frequency of prospective memory errors.
ii. thus even when an operator has identified a problem, there is no guarantee that he will be able to meet the time constraints of the situation in a dynamic environment.
iii. the frequent interruptions of activity in order to attend to other pending tasks endanger the controller’s activity. They can compromise not only the analysis of the conflict and its supervision (as seen in the case presented) but could also hinder the detection of the conflict.
Specific support offered in the management of workload and time constraints could be of great help to the operators in order to overcome the difficulties described above.
We would like to thank Prof. Millot for his interest in and support of this work.
REFERENCES:
Leplat, J. (1985) Erreur humaine, fiabilité humaine dans le travail. Paris, Colin.
Rasmussen, J. (1986) Information processing and human-machine interaction. Amsterdam, North-Holland.
Rasmussen, J. (1990) Event analysis and the problem of causality. In Rasmussen, J. et al. (Eds) Distributed Decision Making: Cognitive Models for Cooperative Work. L.E.A.
Reason, J. (1990) Human error. C.U.P., Cambridge.
Van Daele, A. (1988) L'écran de visualisation ou la communication verbale ? Analyse comparative de leur utilisation par des opérateurs de salle de contrôle en sidérurgie. Le Travail Humain, 51, 1.
Other view of the trajectories of the planes (cf. Fig. 3).
Surveillance Work Process Analyses: A Necessary Step in the Development of Aviation Safety Decision Support Systems
Heather W. Allen and Marcey L. Abate
Statistics and Human Factors Department
Sandia National Laboratories, Albuquerque, NM 87185
INTRODUCTION
Recent events in the aviation industry have focused attention on the United States Federal Aviation Administration’s (FAA’s) Flight Standards Services, which is responsible for overseeing aviation operational and safety activities. Particular attention has been paid to the management and control of methods and data associated with surveillance activities performed by FAA aviation safety inspectors. One of the ongoing efforts responding to this increased scrutiny is the Aviation Safety Risk Analysis Technical Support (ASRATS) effort, begun in 1995 at Sandia National Laboratories. A goal of the ASRATS effort is to assist the FAA with methods that foster an efficient, data-based organization in which aviation safety inspectors perform surveillance activities prioritized by safety criticality.
In pursuit of this goal, ASRATS work has focused on understanding the processes and systems related to planning, performing, and recording aviation surveillance activities. Currently, results of surveillance activities, which are often used in planning future activities, are recorded by manual input into multiple FAA databases. In addition, FAA aviation safety inspectors often have access to selective databases maintained by other Federal agencies, foreign governments, airlines, and aircraft manufacturers when planning surveillance activities and conducting risk analyses. Given the complex array of customers and suppliers of the FAA databases, as well as the number of externally accessible databases, the need has been recognized for computerized decision support so that the demand on the aviation safety inspector’s decision-making capabilities is maintained within limits consistent with the human’s information processing capabilities. Such systems are currently being developed for certificate holder risk assessment, for the identification of emerging safety trends, and for targeting the allocation of FAA aviation safety inspection resources toward areas of greater risk. However, the need for, and development of, decision support systems leads to consideration of both the allocation of function and the form of communication between the human and the computer. For this reason, recent ASRATS research has concentrated on defining the tasks that inspectors must perform, and using these task definitions to drive the integration of improved data analysis and presentation methods into the analytic software systems being developed as decision support tools for FAA inspectors. Thus, this paper is written to demonstrate the importance of task focus as a necessary step in the design and development of an aviation decision support system.
If task focus, and ultimately work process analysis, is a necessary step in the design and development of an aviation decision support system, where does this focus fit into the design process? Generally, task focus, which leads to the design and implementation of a system’s operational requirements, is integral to all phases of the system design and development process.
During the proposal phase of system development, ideas for a new system, based on a real data-driven need, are typically generated. The ideas generated often stem from a performance analysis of the existing system, or way of doing business if a system does not exist. This analysis may include the results of a combination of interviews, task analyses, fault isolation or root cause analysis, and brainstorming. The results of these activities typically provide a hierarchy of possibilities ranging from, perhaps,
no system is suited for supporting the work to be done,
through
one decision support tool might support the work to be done,
to
multiple system tools are best suited for supporting the work to be done.
The feasibility phase might include a cost/benefit analysis, a taskflow practicality study, and a determination of whether simple manual solutions are available that might make a proposed system unnecessary. In essence, it is during this phase of system development that checks are performed to ensure that a proposed system is practical and worthwhile. The consequence of failing to address the feasibility of a proposed system is typically an expensive system that is impractical, unnecessary, or not worthwhile.
The definition phase is a critical phase from an interface usability standpoint. It is during the definition phase that data on the existing environment is typically gathered. This data includes user characteristics, taskflow, the environment, and perceived problems with the current environment or taskflow. From this data gathering effort, a high-level taskflow design is developed. This design describes system boundaries and high-level functions. Thus, the taskflow, resulting from a task analysis, is intended to match the work to be done with the people who will do it. This process involves identifying tasks, organizing those tasks into a flowchart, and identifying manageable work modules with clearly marked beginning and end points. A work module is a set of tasks that a user accomplishes as part or all of his or her job. It is a basic unit of work. Usually, one or more work modules are combined to form a job. Having systematically derived work modules assists in the design of interfaces and the preparation of facilitator materials, such as instructions, performance aids, and training.
Failure to spend adequate resources in the definition phase often results in a system that does not fit the user or the environment, and has an awkward or confusing system model. It is imperative that the system be designed to both reflect and mitigate the work to be done.
During the preliminary design phase the screenflow architecture is designed to match or reflect the taskflow design. This architecture is often, and most appropriately, referred to as the conceptual model – the foundation for the interface. It is during this phase that the design of the flow of screens and the database structure occurs. The consequence of failing to adequately prepare the screenflow architecture is an interface where the user must jump around the system to get work done.
The detailed design phase results in screen layouts and error handling design. It is during this phase that screen design standardization occurs, as does error message standardization and development, and protocol simulation testing. Failure to adequately complete this phase results in screens that are hard to understand and use.
During the implementation phase, human factors activities typically include determining the best overall user support strategy, and the preparation of user support products. While preparation of the support products is stressed at this phase, these materials tend to evolve over all phases of system development, beginning with the definition phase, and ending with the performance review phase. The consequences of failing to perform these activities will be impractical or unusable documents, training or job aids.
Last, the conversion phase marks the introduction of the system. It is during this phase that the system is fielded to the user population. Failure to properly execute this activity may result in an awkward and expensive system installation, with lingering bad feelings.
Keeping in mind the system design and development phases discussed, when an analysis of an existing system or process is conducted, the methodology used will dictate, to some extent, the outcome and potential utility of the results. Our intent in this effort is to understand the FAA airworthiness (which includes maintenance and avionics) and operations surveillance inspection activities such that we can visually represent, and validate with supporting multi-year inspection data, the as-is process. Further, with the results of the as-is process serving as its foundation, we can then develop a re-engineered (re-eng) model that will serve to improve the efficiency and effectiveness of the process(es) and lead to the creation of a safety-critical work process model in which inspection activities are prioritized by criticality of failure. This paper describes our efforts to date in developing the as-is and re-eng models.
Given this preface, what methodology might best serve the needs identified? Review of the literature (e.g., Wilson, Dell, & Anderson, 1993; Juran & Gryna, 1988) indicates that the use of traditional task analysis, defined as the "study of what an operator (or team of operators) is required to do, in terms of actions and/or cognitive processes, to achieve a system goal" (Kirwan & Ainsworth, 1992, p. 1) is most appropriate in performing this type of analysis. This methodology aids in collecting information, organizing it, and using it to make design decisions. In essence, the use of task analysis provides the analyst with a blueprint of human involvement in a system, building a detailed picture of that system from the human perspective.
Review of the literature also indicated that root cause analysis, one of the many tools often used to support Total Quality Management (TQM) efforts, can be used in two ways: reactively (problem-solving), to identify catalyst(s) contributing to problems, and prevent problems from recurring by correcting or eliminating the catalyst(s); and proactively (predictive), to examine current operations and help identify areas and activities that can be improved. Perhaps one could surmise that the fullest potential of the root cause analysis is to use the methodology proactively, rather than reactively.
Focusing on root cause analysis, several techniques or methodologies that may be used include: Change Analysis, Barrier Analysis, Event and Causal Factors Analysis, and Tree Diagram. After considering each of the analyses possible, the Tree Diagram approach was adopted.
The Tree Diagram is a graphical display of an event or activity that logically describes each of the event’s contributing factors. Tree diagrams are useful in helping visualize and analyze more complex systems or problem situations. Tree diagrams can be used either reactively or proactively. When used reactively to investigate accidents or incidents, they are referred to as fault or root cause trees. When used proactively to systematically analyze, plan, and organize processes or activities, they are referred to as positive trees.
Root cause analysis allows maximum use of other analysis techniques, e.g., Pareto analysis. Pareto analysis involves ranking problems, or contributing factors, by their significance and then concentrating on the more important. The ranking criteria may be tailored to the organization or the problem area itself, and generally include cost, risk or probability weighting, or severity or priority level assignments. Other criteria might include worker safety, environmental considerations, legal or regulatory compliance, or potential impact on goal achievement or mission completion. Regardless of the weighting or other applied criteria, the pattern that usually emerges is one of a smaller percentage of the total number of problems being the more important: in essence, the 80-20 rule, where 20% of any organization’s problems may cause 80% of their troubles.
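As a small illustration of the kind of Pareto ranking described above (our own sketch; the factor names and counts below are hypothetical, and real values would come from the surveillance data and the chosen weighting criteria):

# Hypothetical counts of findings per contributing factor.
findings = {
    "Incomplete maintenance records": 120,
    "Procedure not followed": 85,
    "Training records out of date": 30,
    "Placard missing": 10,
    "Other": 5,
}

total = sum(findings.values())
ranked = sorted(findings.items(), key=lambda kv: kv[1], reverse=True)

# Rank factors by count and report the cumulative share each accounts for.
cumulative = 0
for factor, count in ranked:
    cumulative += count
    print("%-32s %4d  cumulative %5.1f%%" % (factor, count, 100.0 * cumulative / total))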
METHODOLOGY
As-Is
Given the foregoing, it seemed prudent to adopt a combined task/root cause analysis approach to facilitate the analysis of airworthiness and operations surveillance inspection activities. The result of this approach is a review of the as-is state of current surveillance inspection activities. No attempt is made to judge, on the basis of the data presented, the appropriateness or correctness of the activity processes. Rather, this analysis focused on the documented FAA Orders followed by the inspectors, and supporting data.
The work process analysis included compilation of specific information from the airworthiness and operations inspection activities, identified in the Airworthiness Inspectors Handbook (FAA Order 8300.10) and the Air Transportation Operations Inspector’s Handbook (FAA Order 8400.10), review of pertinent Federal Aviation Regulations (FARs), and a cross check with other previously completed efforts. Decisions were made to limit the analysis to Federal Aviation Regulation (FAR) Part 121 (i.e., large air operators such as Delta and American Airlines) operators to provide focus to the document.
Figure 1 shows the 30 airworthiness activities described in the handbook, often discussed synonymously as different inspections. Only inspections defined as airworthiness surveillance activities were analyzed and are reflected in the figure as shaded boxes included in the as-is analysis. Those activities that were modified as a result of the re-eng process are reflected by the wider border. Figure 2 shows the 16 operations surveillance activities. All activity boxes are shaded indicating that all were included in the as-is operations analysis. The activities highlighted by the wider borders reflect those that were modified as a result of the re-eng process. The or connector shown on Figures 1 and 2 implies that when performing airworthiness surveillance activities, any one or more of the activities might be chosen.
Figure 1: Airworthiness Surveillance Activities
Figure 2: Operations Surveillance Activities
After amassing pertinent data, work process analyses were performed for each identified inspection activity. The results of these analyses were figures representing all tasks performed, including each element inspected, for each activity, and supporting matrices showing all inspection elements across activities. Following the completion of these activities, Performance Tracking and Reporting System (PTRS, the primary database where inspectors enter data resulting from surveillance activities) data analyses were performed to validate findings of the work process models by quantitatively evaluating the actual recorded work and comparing the results to the tasks outlined in the models. Data from the previous four fiscal years (1993-1996) were included in these analyses to ensure consistent findings.
Re-Eng
The transition from the as-is to the re-eng work processes required a comparison across inspection activities within groups, as well as between groups. The comparisons provided the data required to eliminate select inspection element redundancies. This work was followed by a tabulation of as-is and re-eng inspection elements, and an analysis of keywords by inspection element. Keywords are used by inspectors to annotate, for tracking and analysis purposes, the essence of each comment entered as a result of an inspection.
The results of this methodology, from a work process model perspective, are a presentation of a strawman to show what could potentially be done with existing work processes to make those processes more effective, efficient, and analyzable. Further, the approach taken is the least invasive and, as a result, least disruptive to the existing process. The methodology used for the re-eng phase ensured that:
• all as-is elements accounted for in the previous effort were maintained in the re-eng effort, that is, no elements that existed prior to this effort have been eliminated as a result of this effort;
• redundant elements were eliminated where deemed appropriate;
• no elements were added to activities if they were not a part of that activity prior to this effort; and
• a traceable, defensible path exists between the handbooks referenced, the PTRS database, the as-is work process models generated, and the re-eng results reported here.
RESULTS
Work Process
As-Is
For each of the activities shaded in Figures 1 and 2, task analyses were performed, matrices of inspection elements (i.e., specific items that are observed during an inspection) across activities developed to facilitate inspection element comparisons, and flowcharts generated.
When analyzing the results of the task analyses and resulting inspection element matrices, logical groupings of activities emerged, based on types of inspection items or elements included in that activity. Figures 3 and 4 show such groupings for the airworthiness surveillance activities.
Figure 3: Logical Groupings of As-Is Airworthiness Inspection Activities
The remaining grouping of airworthiness surveillance activities appeared to have distinct inspection elements:
Figure 4: Remaining As-Is Airworthiness Inspection Activities
Figures 5 and 6 (see legend below) present the resulting as-is work process model for the Airworthiness Ramp and Cockpit En Route Inspections. Visually, one can confirm the commonality of many inspection elements across activities. This type of information presentation, along with the matrices developed, provided a straightforward means of comparing and contrasting the inspection elements across activities, as well as clearly showing the similarities and differences, in basic process flow, across activities.
Figure 5: As-Is: Airworthiness Ramp Inspection
Figure 6: As-Is: Airworthiness Cockpit En Route
In fact, when analyzing the Airworthiness Ramp Inspection elements, one finds that of the 165 elements included in the Ramp Inspection, 100% of those elements are in common with those included in the Cockpit En Route Inspection, 94% are in common with those found in the Main Base, Sub Base, and Line Station Inspections, and 54% are in common with those found in the Cabin En Route Inspection (see Figure 7). In other words, 51% (84) of the Ramp Inspection elements overlap with exactly five other inspection activities, 43% (71) of the elements overlap with exactly four other inspection activities, and 6% (10) of the elements overlap with exactly one other inspection activity.
Figure 7: As-Is Airworthiness Ramp Inspection Element Analysis
Given the as-is airworthiness and operations work process results, there are several issues to consider in moving to a re-eng work process. First, there appear to be areas of redundancy that could be more effectively structured. Second, the overall information, in terms of the number of activities, the number of corresponding inspection elements, the relevant FARs, etc., far exceeds what typical humans are able to manage. By streamlining the process, human performance should improve.
Re-Eng
Transitioning from the as-is to the re-eng work processes required a comparison across inspection activities within groups, as well as between groups. The comparisons performed provided the data required to eliminate select inspection element redundancies. The results of this effort are shown in Figures 8 (re-eng Airworthiness Ramp Inspection) and 9 (re-eng Airworthiness Cockpit En Route Inspection). Inspection elements have not actually been lost; rather, they have been maintained in the activity for which they seem best suited. The as-is Airworthiness Ramp Inspection (Figure 5) included exterior inspection elements as well as interior inspection elements. The interior inspection elements included in the Ramp Inspection were also included in the as-is Cabin En Route Inspection. Therefore, all elements relevant to the Cabin En Route Inspection were maintained in that activity, while select elements pertaining to the inspection of the cargo compartment that were common to both activities, but appeared more suited for inclusion in the Ramp Inspection, were maintained with the Ramp Inspection. Similarly, elements pertinent to the cockpit that were redundant between the Cockpit En Route Inspection (see Figure 6) and the Ramp Inspection were allocated to only the Cockpit En Route Inspection, thus reducing the number of elements in the Ramp Inspection.
Now, instead of performing, for example, a Cockpit En Route Inspection with 203 inspection elements covering areas interior (including cockpit and cabin areas) and exterior to the aircraft, the inspector would perform a Cockpit En Route Inspection with 55 elements. He or she might then elect to perform a Ramp Inspection with 108 elements and/or a Cabin En Route Inspection with 117 elements.
Figure 8: Re-Eng: Airworthiness Ramp Inspection
Figure 9: Re-Eng: Airworthiness Cockpit En Route
Looking across activities, Table 1 presents the work process analysis results for the re-eng airworthiness and operations surveillance activities: the original number of inspection elements (as-is), the re-engineered number of inspection elements (re-eng), and the total (and percent) element reduction achieved by the re-engineering effort.
Table 1: Overall Inspection Element Results

                        Airworthiness    Operations
As-Is                   1696             1363
Re-Eng                  829              782
Element Reduction       867              581
Percent Reduction       51%              43%
All elements described in the as-is work process are retained in the re-engineered work process. Redundancies are eliminated by retaining the element in the most appropriate inspection, and eliminating it from the other inspections. Overall, for the airworthiness inspection activities, merely removing redundant inspection elements results in a reduction of elements across the re-engineered activities from 1324 to 457, for a total reduction of 867 elements (65%). When looking across all airworthiness inspection activities, regardless of whether a particular activity was re-engineered or not, the total as-is number of inspection elements was 1696, with the re-eng being 829, for an element reduction of 867 (51%). Similarly, for the operations inspection activities, removing redundant inspection elements results in a reduction of elements across the re-engineered activities from 987 to 406, for a total reduction of 581 (59%) elements. Across all operations inspection activities, regardless of whether a particular activity was re-engineered or not, the total as-is number of inspection elements was 1363, with the re-eng being 782, for an element reduction of 581 (43%).
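Because the same absolute reduction is reported against two different baselines, the percentages above are easy to misread. The following minimal Python sketch simply reproduces the arithmetic quoted in the text; the function and variable names are ours and do not correspond to any FAA tool.

def reduction(before, after):
    """Return the absolute element reduction and the (rounded) percent reduction."""
    removed = before - after
    return removed, round(100.0 * removed / before)

# Airworthiness: re-engineered activities only, then all activities.
print(reduction(1324, 457))   # (867, 65)
print(reduction(1696, 829))   # (867, 51)
# Operations: re-engineered activities only, then all activities.
print(reduction(987, 406))    # (581, 59)
print(reduction(1363, 782))   # (581, 43)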
This re-engineered process serves to better define and constrain a particular inspection activity into a more manageable set of elements, enhances traceability of data to the problem areas, and may facilitate the definition of an appropriate set of keywords as a means of comment coding. It is strictly a presentation of a strawman to show what could potentially be done with existing work processes to make those processes more effective, efficient, and analyzable. Further, the approach taken is the least invasive and, as a result, least disruptive to the existing process. There has been no attempt to add items that previously were not there. There has been no attempt to completely delete items already existing. And there have been no modifications made to add uniformity across activities. The results of this effort are a set of re-eng work process models that can be traced back to the as-is models, and data analyses that result from entries to the PTRS database.
Next, we investigated the process by which PTRS is used to record data in relationship to the elements and tasks as described in the work process analyses.
DATA ANALYSIS
Validation of Work Process Findings
To validate the observation of redundant elements across activities, we turned to surveillance data as recorded in PTRS. Specifically, PTRS data analyses were performed to validate the findings of the work process models by quantitatively evaluating the actual recorded work and comparing the results to the tasks outlined in the models. Data from the previous four fiscal years (1993-1996) were included in these analyses to ensure consistent findings. For those activities whose inspection elements overlap, as shown in the work process diagrams, the recorded data also tended to overlap in keyword coding.
In particular, within the first airworthiness grouping, the keywords manuals currency, procedures/methods/systems, programs, parts/materials, equipment/tools, other, and equipment/furnishings were frequently recorded. This tendency is depicted in Figure 10, which shows, for the airworthiness activities in Group 1 (Figure 3), the percentage of FY96 records coded with certain keywords by each activity. In the figure, each pattern represents a different keyword, as specified in the figure legend. Note the similarity of patterns across the activities, indicating that data is coded in a redundant manner.
Figure 10: Redundancy of Keywords Across Group 1 Airworthiness Activities
Similarly, the keywords procedures/methods/systems and inspection systems are frequently recorded for the activities in the second airworthiness grouping, which contains Spot Inspections and the Monitoring of Continuous Airworthiness Maintenance Programs. For the remaining group of airworthiness surveillance activities, which appear to be unique in the elements being inspected, the data were not redundant in keyword coding.
In the same way, the groupings based on element redundancy for the operations surveillance activities were supported by the data analyses. The first operations grouping, which contains the Cabin En Route, Cockpit En Route, and Ramp Inspections, is commonly coded primarily with the keywords manuals currency, manuals revisions/system, procedures, and MEL/CDL. Likewise, the Line Check Inspections and Pilot-in-Command Operating Experience Observations, also in this first group, frequently share the keywords ability/proficiency and qualifications/currency. However, the second grouping, which includes Ramp and Operator Trip Records, although overlapping in specified task items and elements, is not typically coded with redundant keywords. This may indicate a tendency to perform more frequently those items which are not redundant. Finally, those operations surveillance activities identified as unique in task items and elements were found to be typically unique in the keywords used to code problem areas.
In summary, the surveillance data as recorded serves to validate the findings of the work process analysis. However, previous data analyses have revealed that keyword usage follows the Pareto principle. That is, there are a "vital few" keywords that comprise the majority of the data. Specifically, it has been shown that approximately 15% of the keywords constitute 60% of the observed data. Because so few keywords are actually used, it is not surprising that, when broken down by activity number, it is common to find the same keyword attached to many different activities. These findings may demonstrate either repetition in associated items and elements between activities or the inability of the current coding system to uniquely identify findings between activities. In either event, if the data collection system is to be used for measuring performance, it is important that the analyzability of the data be well understood. For this purpose, further studies were performed to investigate the process by which PTRS is used to record data in relation to the elements and tasks described in both the as-is and re-eng surveillance analyses.
Data Collection and Recording
Accepted safety science methodology states that the purpose of measurement is to represent the characteristics of observations by symbols that are related to each other in the same way that the observed objects, events, or properties are related. This implies that, in relation to the elements and tasks performed as part of the surveillance work process, a measurement system should be in place that represents the performance and findings of those tasks in a structure similar to that in which they are performed. That is, there should exist a well-defined correspondence between the tasks performed and the data that is recorded. Without such a clearly defined relationship, it is not reasonable to expect consistent data recording practices because individual interpretation becomes a necessity. While individual interpretation and creativity are important aspects of problem solving and identification, individual interpretation applied to a measurement system limits statistical inference because the precision of any analysis is a function of the rules under which the data were assigned. As Tarrants (1980) states in The Measurement of Safety Performance, "the ideal criterion of safety performance should permit statistical inference techniques to be applied since, like most other measurable quantities dealing with human behavior, safety performance will necessarily be subject to statistical variation".
Given that a well-defined correspondence between the tasks performed and the data recorded is desirable, an analysis was performed to evaluate the ability of the PTRS system to record data which conforms to the elements and tasks performed. We began by considering the current method used to record surveillance findings in PTRS. Typically, tasks and elements of an activity are performed and observations are recorded by choosing an appropriate keyword. Obviously, consistent data recording would require that, when recording findings from equivalent tasks, different inspectors, or even one inspector on multiple occasions, would identify the same keyword as appropriate for similar situations. To assess whether this is feasible, the correspondence between inspection elements and keywords was investigated. Each element and task as described in the handbook was searched for a word or phrase that corresponded to a particular PTRS keyword. The liberty of inferring the meaning of words or phrases, and their correspondence to existing keywords, was not taken unless the correspondence was made explicitly in the handbook.
Examination of the work processes as given in the handbooks revealed that most of the work processes are specified in a hierarchical manner, beginning with a general task and ending with a specific inspection element to observe. To establish as detailed a link as possible between the work performed and the keywords, an attempt was always made to assign a keyword at the finest or most specific level of the particular work process. For example, part of an operations Ramp Inspection is to observe the crewmembers - preflight - carry-on-baggage. In this case, the keyword that was linked to this task was carry-on-bags, not preflight. This exercise revealed that it is not possible to link elements to keywords for 14% of the airworthiness or 13% of the operations activity elements because the elements did not contain a word or phrase that corresponded to an existing keyword. After all possible connections had been identified, only 37% of the keywords were used as links to airworthiness activity elements, and only 52% of the keywords were used as links to operations activity elements. This is consistent with the previously noted tendency for the keywords, in practice, to adhere to the Pareto principle.
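As a rough illustration of this linking exercise (not the actual procedure, which was performed against the handbook text), the following Python sketch links invented element descriptions to a small invented keyword list by simple phrase matching, preferring the most specific match in line with the finest-level rule described above.

# Element descriptions and the keyword list below are invented for illustration only.
elements = {
    "ramp-23": "observe crewmembers preflight carry-on-bags stowage",
    "ramp-24": "check manuals currency at the gate",
    "ramp-25": "verify ground servicing of the aircraft",
}
keywords = ["carry-on-bags", "manuals currency", "preflight", "MEL/CDL"]

def link(element_text, keywords):
    """Return the longest (most specific) keyword whose phrase occurs in the element text."""
    text = element_text.lower().replace("-", " ")
    matches = [k for k in keywords if k.lower().replace("-", " ") in text]
    return max(matches, key=len) if matches else None

links = {eid: link(text, keywords) for eid, text in elements.items()}
unlinked = sum(1 for k in links.values() if k is None)
print(links)                                   # ramp-25 has no corresponding keyword
print(unlinked, "of", len(elements), "elements cannot be linked")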
For those keywords that were connected to an element, it was common for the same keyword to be linked to many different elements and across various activities. This is not surprising given that the work process analyses showed that different activities often share the same or similar tasks. The consequence is that a keyword by itself does not clearly correspond to the work being performed by the inspectors. Without such a clearly defined connection, it is not reasonable to expect consistent data recording practices across inspectors because individual interpretation is necessary. Due to elimination of redundancy, it was expected that the re-engineering effort could be used to improve the clarity of the correspondence between elements and keywords.
The metric chosen to test for an improvement was the reduction in links, defined as the percent difference in the number of activities to which a keyword was linked between the as-is and the re-eng models. For example, the keyword crewmember knowledge was linked to eight operations activities in the as-is model. However, in the re-engineered model, crewmember knowledge was linked to only three activities. Thus the reduction in links would be (8 − 3)/8 × 100% ≈ 63%. This reduction in the number of operations activities to which personnel-related keywords are linked is shown as an example in Figure 11. The elimination of element redundancy reflected in the re-engineering resulted in a 40% average reduction in the number of operations activities, and a 27% average reduction in the number of airworthiness activities, to which keywords were linked. Given these results, the re-engineered model does appear to have a positive influence on defining the correspondence between the current PTRS system for recording data and the work that is being performed. As previously discussed, clarification of this relationship translates to improved data quality at the most fundamental work process level, and may also lead to greater consistency in data recording practices and an improved ability to measure performance. In summary, the preceding demonstrates the importance of considering the effect of work processes on data collection and recording. Failure to take such considerations into account may result in the development of a system with unreliable inputs.
Figure 11: Reduction in Links between Operations Activities and Personnel Keywords
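The reduction-in-links metric itself is simple enough to state in a few lines of Python; this sketch just applies it to the crewmember knowledge example quoted above, and the function name is ours.

def reduction_in_links(links_as_is, links_re_eng):
    """Percent difference in the number of activities to which a keyword is linked."""
    return 100.0 * (links_as_is - links_re_eng) / links_as_is

# Keyword crewmember knowledge: eight operations activities as-is, three after re-engineering.
print(reduction_in_links(8, 3))   # 62.5, reported in the text as approximately 63%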
Measuring Performance
To assess an operator’s safety performance, performance measures are often constructed. Because PTRS is the primary mechanism by which the characteristics of FAA surveillance observations are measured, most implemented performance measures are constructed using PTRS data. However, under the current as-is model, redundancy in elements implies that performance measure calculations may be complicated by the necessity of combining data from several different activities. Drawing data from several sources may result in either the inclusion of extraneous data or the omission of relevant data from the performance measure calculations. The following describes how recording data according to the re-engineered allocation of surveillance activity elements and tasks could reduce both the unwanted inclusion and the omission of data in performance measures.
Assuming that the elements and tasks were performed according to the re-engineered process, studies were conducted which showed that several types of performance measures may experience an increase in the amount of data available for calculation, due to the inclusion of previously omitted data. That is, the re-eng models, with more focused activities, clearly define all possible sources of relevant data, thereby reducing the likelihood that data is omitted from performance measure calculations. In some circumstances, such as the absence of constant data collection errors, an increase in data may also increase the reliability of the performance measure values.
Analyses also showed that some performance measures may not only experience an increase in data but may also eliminate extraneous data due to the re-engineering effort. That is, under the as-is model, performance measures may have included some extraneous data due to the associated activities including extraneous tasks and elements. In these cases, performance measures may be more likely to measure what they were intended to measure, thus increasing the validity.
As an example of how re-engineering may increase the reliability and validity of performance measures, consider the example depicted in Figure 12. Suppose one wished to construct a metric summarizing cabin information. Under the as-is model, this may require gathering data from Ramp, Cockpit, Main Base, Sub Base, Line Station, and Cabin Inspections because all contain tasks which assess the cabin. Because of these multiple sources of cabin data, the possibility exists to inadvertently omit relevant data (by failing to include data sources), and also to include irrelevant data (by lack of clarity in data recording). However, under the re-eng model, all elements and tasks pertinent to the cabin have been included only in the Cabin En Route Inspection. Thus, with a more focused activity, summarizing cabin information would require only one data source. This potentially could reduce both the unwanted omission and inclusion of data, increasing both the reliability and validity of a metric summarizing cabin information.
Figure 12: Measuring Cabin Performance Under the As-is and Re-Eng Models
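The following hypothetical Python fragment illustrates the point of Figure 12: the PTRS-style records and their keyword values are invented, but the contrast in how a cabin metric would gather its data holds.

# Invented sample of PTRS-style records; field names and values are illustrative only.
records = [
    {"activity": "Ramp", "keyword": "equipment/furnishings"},
    {"activity": "Cockpit En Route", "keyword": "procedures/methods/systems"},
    {"activity": "Cabin En Route", "keyword": "equipment/furnishings"},
    {"activity": "Main Base", "keyword": "equipment/furnishings"},
]

# As-is: cabin elements are spread over six activities, so a cabin metric must
# pull from all of them, risking omitted sources and extraneous records.
AS_IS_CABIN_SOURCES = {"Ramp", "Cockpit En Route", "Main Base",
                       "Sub Base", "Line Station", "Cabin En Route"}
as_is_cabin = [r for r in records if r["activity"] in AS_IS_CABIN_SOURCES]

# Re-eng: all cabin elements live only in the Cabin En Route Inspection.
re_eng_cabin = [r for r in records if r["activity"] == "Cabin En Route"]

print(len(as_is_cabin), "records drawn from multiple activities (as-is)")
print(len(re_eng_cabin), "records drawn from a single activity (re-eng)")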
In summary, it is possible to perform simple data analyses which serve to validate and assess the effect of re-engineering. Consideration of the re-eng model demonstrated improvements in data collection, recording, and analysis capabilities, and provides justification for implementing changes at the most fundamental work process level. This work also demonstrates that failure to consider the underlying work processes in system development may result in inefficient data collection and recording procedures, and in a lack of validity and reliability in subsequent performance measures.
CONCLUSIONS
Several conclusions can be drawn from the results of this effort. We will first consider the reallocation of redundant inspection elements realized in the re-engineering phase of the analysis, followed by the importance of this type of work in the design of decision support systems.
Overall, for the airworthiness inspection activities, merely removing redundant inspection elements results in a reduction of elements across those activities re-engineered from 1324 to 457, for a total reduction of 867 (65%) elements. When looking across all airworthiness inspection activities, regardless of whether a particular activity was re-engineered or not, the total as-is number of inspection elements was 1696, with the re-eng being 829, for an element reduction of 867 (51%).
Similarly, for the operations inspection activities, removing redundant inspection elements results in a reduction of elements across those activities re-engineered from 987 to 406, for a total reduction of 581 (59%) elements. Across all operations inspection activities, regardless of whether a particular activity was re-engineered or not, the total as-is number of inspection elements was 1363, with the re-eng being 782, for an element reduction of 581 (43%).
Assuming that the elements and tasks were performed according to the re-engineered process, we show that several performance measures may experience an increase in the amount of data available for calculation due to the inclusion of previously omitted data. In some circumstances, an increase in data may result in the reliability of the performance measure values also increasing. We also show that other performance measures may not only experience an increase in data, but may also have extraneous data eliminated due to the re-engineering effort. In these cases, the performance measures may be more likely to measure what they were intended to measure, thus increasing the validity.
Another analysis was performed to evaluate the ability of the PTRS system to record data which conforms to the inspection elements given in the as-is and re-eng models. Specifically, the opportunities to define a correspondence between keywords and inspection elements using common words and phrases was assessed. It was shown that it is not possible to link elements to keywords for 14% of the airworthiness or 13% of the operations activity elements because the elements did not contain a word or phrase that corresponded to an existing keyword. After all possible connections had been identified, only 37% of the keywords were used as links to airworthiness activity elements, and only 52% of the keywords were used as links to operations activity elements. For those keywords that were connected to an element, it was common for the same keyword to be linked to many different elements and across various activities. The consequence is that a keyword by itself does not clearly correspond to the work being performed by the inspectors. Without such a clearly defined connection, it is not reasonable to expect consistent data recording practices across inspectors because individual interpretation is necessary. Due to elimination of redundancy, it was expected that the re-engineering effort could be used to improve the clarity of the links between elements and keywords. In fact, the elimination of element redundancy reflected in the re-engineering, resulted in a 40% average reduction in the number of operations activities, and a 27% average reduction in the number of airworthiness activities, to which keywords were linked.
Results such as those achieved with the airworthiness and operations surveillance activity re-engineering are clearly significant when considering the number of activities for which each inspector is responsible. This type of reduction, achieved through the removal of redundancy, should result in more focused individual inspection activities, a more manageable scope of inspection elements, a clear demarcation between activities, and an improvement in the overall data recording and analysis capability.
The re-eng approach reported in this document is strictly a strawman, presented to show what could potentially be done with existing work processes to make those processes more effective, efficient, and analyzable. Further, the approach taken is the least invasive and, as a result, least disruptive to the existing process. There has been no attempt to add items that were not previously there. There has been no attempt to delete existing items entirely. And there have been no modifications made to add uniformity across activities. The results of this effort are a set of re-eng work process models that can be traced back to the as-is models, and data analyses that result from entries to the PTRS database.
The implications of the re-eng results for improved data quality are clear. We have identified some root causes of problems with data correctness, currency, completeness, and consistency originating with improper work process definitions. The problem is compounded through the data recording by the use of keywords which do not link directly to activities. These problems are data driven, requiring solutions that are derived from analyzing fundamental processes and information, such as those shown by the re-eng effort. Attempting to mitigate problems originating with improper work process definitions by the application of technology may not result in measurable improvements.
As a follow-on activity, these results could serve as the foundation for a next-generation (next-gen) safety-critical process model that ensures that inspection activities are tasks directly related to safety criticality, designed so that the potential for human error is reduced and successful performance is achieved, as reflected by a prioritization of inspection tasks by criticality of failure.
What ramifications do these results have for the design of a decision support system? Clearly, the decision support system should be designed to reflect the work to be done, that is, the tasks that the inspectors perform. These work process models should be key in the development of the system’s operational requirements. Following operational requirement definition, functional requirements should be developed to support the operational requirements. Further, the interface design should reflect the work process models: the underlying conceptual model and the look-and-feel of the system should mirror the work to be done. Finally, the impact of work processes on data collection, recording, and analysis capability should be considered. If the quality of the data is poor, its inputs to safety-related decisions will not be reliable, and the utility of the system may be threatened. In essence, all aspects of the design and development process should use the results of an effort such as this as their foundation.
References
Air Transportation Operations Inspector’s Handbook, Order 8400.10, Department of Transportation, Federal Aviation Administration.
Airworthiness Inspector’s Handbook, Order 8300.10, Department of Transportation, Federal Aviation Administration.
Juran, J. M., & Gryna, F. M. (1988). Juran’s Quality Control Handbook (4th edition). New York: McGraw-Hill.
Kirwan, B. & Ainsworth, L. K. (1992). A Guide to Task Analysis. Bristol, PA: Taylor & Francis.
Program Tracking and Reporting Subsystem (PTRS) Procedures Manual (PPM), Draft Copy, September 1995, US Department of Transportation, Federal Aviation Administration, Flight Standards Service, Washington, D.C. 20591.
Tarrants, W. E. (1980). The Measurement of Safety Performance. Garland STPM Press, New York.
Wilson, P. F., Dell, L. D., & Anderson, G. F. (1993). Root Cause Analysis: A Tool for Total Quality Management. Milwaukee, WI: ASQC Quality Press.
Human Factors in Requirements Engineering
Stephen Viller1, John Bowers2, Tom Rodden3
viller@comp.lancs.ac.uk, bowers@hera.psy.man.ac.uk, tom@comp.lancs.ac.uk
1,3Computing Department, Lancaster University, Lancaster LA1 4YR. Fax: +44 1524 593608
2Department of Psychology, University of Manchester, Manchester M13 9PL. Fax: +44 161 275 2588
Introduction
Work in the field of human error has typically focused on operators of safety-critical equipment, such as nuclear power plant controllers, and on the design of the human-machine interfaces in such settings. Limited consideration has been given to wider system issues. Similarly, researchers and practitioners in the field of Dependable Systems are concerned with the design of computer-based systems which are intended to be operated in situations where the consequences of failure are potentially catastrophic. For example, the failure of a safety-critical system may cause great harm to people, property, or the environment. The work reported in this paper is motivated by the need to ‘push back’ these concerns with the operation and design of dependable systems to the process by which they are developed.
Errors in the Requirements Engineering (RE) process are widely considered to be the hardest to discover. Consequently, they tend to remain undetected for the longest time, require the greatest amount of re-work, and are the most expensive to rectify of all errors in systems development. Whilst efforts to detect and rectify errors in RE and the whole of the development process are a necessity, the nature and cost of errors in requirements make a strategy of avoidance rather than detection a more attractive prospect. The benefits of such an approach are primarily that the amount of rework can be reduced to a minimum, along with related savings in cost and time to completion of the system.
There is a broadening consensus regarding the nature of RE as a social, as well as technical, process involving a variety of stakeholders engaging in diverse activities throughout [3, 5, 9, 16]. Many of the specific details of the process followed for a given product will often depend upon the nature of the product itself, the application domain, similarities and differences to existing products developed by the organization, and so on. When these variations are combined with an often intense production pressure to release products on time, the importance of human skill and judgement in managing the contingencies, and human flexibility and artfulness in making RE processes work (sometimes in spite of the methods followed [1, 20]) becomes readily apparent.
It is our contention that much needs to be learned from the human sciences to inform the development of future safe systems and to develop a more holistic approach to requirements engineering for dependable systems. This work is part of a wider approach to process improvement for RE processes, and has been conducted as part of the ESPRIT REAIMS project.
This paper presents a broad review of literature from several human sciences that is relevant to understanding those errors in RE processes which are attributable to human activity. This activity can be considered in terms of individuals working in isolation, as participants in social groups, and as members of organizations. The following sections consider research from these three perspectives and how it relates to the RE process.
Errors in individual work
The largest body of research on ‘human error’ [18] has its roots in cognitive psychology and cognitive understandings of people’s interaction with technology. Work in this area has typically focused on workplace settings such as nuclear power plant control rooms and on operational risks and operator errors in such environments. Rather less work on human error specifically concerns the use of computer-based systems, and there is even less devoted to the process of their development.
A major distinction to arise from this work is between different ‘levels’ of cognitive activity: e.g. skill-based, rule-based, and knowledge-based in Rasmussen’s formulation [17]. These in turn lead to a number of error classes: skill-based slips and lapses, rule-based and knowledge-based mistakes. Skill-based slips and lapses happen during routine, familiar work, which requires little attention in order to be achieved. In RE, this would be typified by mundane activities, involving everyday skills (e.g. typing, reading, filing, etc.). Rule-based mistakes are related to errors in the plan of action when working in previously encountered situations. They can result from the application of ‘bad’ rules, or the misapplication of ‘good’ rules. In RE, the application of generic solutions can be prone to this type of error. Knowledge-based mistakes arise when working in novel situations, where no existing rule or plan can be applied and attempts are made to apply analogous rules which have worked in similar situations. This describes a great deal of RE work where either there is no previous system which is relevant to the current development, or where the personnel involved are inexperienced in the domain of application.
Violations
The distinction between violation and error has been debated, but it hinges on the intentional disobedience of a rule or plan. Many such actions are violations in name only, because people will often disobey a bad rule in order to fix it. They can be classified in a similar manner to errors, according to whether they take place at a skill-, rule-, or knowledge-based level. Violations frequently occur in RE as short-cuts are taken in order to meet deadlines or engineers artfully present their work in project reviews or reckon with other constraints and contingencies [1, 20].
Group process losses
There is a vast and diverse literature in the field of social psychology concerned with the effects of working in social and group settings on collective and individual performance. Space restrictions here prevent any more than a brief mention of a few well documented phenomena. Social facilitation [13] refers to the change in an individual’s performance when others are present observing. Work on performance in interacting groups has studied the relationship between the nature of the task, the relative performance of individuals in the group, and of the group as a whole [22, 23], leading to recommendations for the best strategy to encourage for different tasks. Group leadership is often cited as a very important factor in whether a team succeeds or fails [6]. Another body of work is concerned with the influence that team members can have on conformity and consensus due to their perceived status in terms of seniority or expertise [24]. The study of minority influence is concerned with the extent to which minority opinion in a group can sway the decisions that the group as a whole takes [12]. Finally, investigations into the nature of group decision making have examined phenomena such as group polarization [7] and groupthink [8]. To the extent that RE is typically a group or team activity, these considerations are potentially relevant to its assessment for dependability.
Organizational safety
Empirical evidence already exists to demonstrate that organizational issues are important in RE [4, 11]. There also exist a number of sociological studies of organizations which have much to say on safe and reliable operations [16, 25]. Reason [18] uses the term latent organizational failures to describe the concept that failures of an organizational nature can remain dormant until triggered by unsafe acts (active failures) in combination with inadequate defences, thus leading to an accident. Perrow [15] classifies systems according to their coupling, which may vary from loose to tight, and interactions which may vary from linear to complex. When these two dimensions are considered together, it is possible to make recommendations for the organizational style best suited to cope with potential accidents in different industries. According to Perrow, for the tightly coupled, complex interactions combination, there is an inherent contradiction in the organizational style required, which leads to the conclusion that in such situations accidents should be considered to be normal, since they are inevitable. This classification has been supported and updated more recently to include the effect of computer control [14]. ‘Normal Accident Theory’ has been compared elsewhere with the work of a number of researchers who are more optimistic about the possibility of organizations operating safely in hazardous situations [21]. This work—sometimes grouped under the term ‘High Reliability Theory’—has led to a number of recommendations for good practice in organizations in high risk domains [10, 19]. These recommendations aim to improve the likelihood of an organization operating in a reliable manner, and include:
• organization leadership should prioritise safety;
• high levels of redundancy should exist in personnel and technology;
• decentralized authority, continuous training, and strong organizational culture of safety should be encouraged; and
• organizational learning should take place through trial-and-error, simulation, and imagination.
Summary
In summary, there is a vast amount of literature from a broad spectrum of disciplines which is relevant to human reliability in processes, and therefore to the dependability of the RE process. The work reported here was conducted as the initial stage of the development of a process improvement method for RE processes, especially for the development of dependable systems. This method, called PERE [2] (Process Evaluation for Requirements Engineering) has been developed as part of the ESPRIT-funded REAIMS (Requirements Engineering Adaptation and Improvement Strategies for Safety and Dependability) Project.
The point we wish to stress here is that as software systems become more pervasive and the associated issues of safety and dependability become more critical, we must consider a broader interpretation of dependability. This broadening of dependability essentially requires us to incorporate the existing work within the human sciences on errors. Much of this work will require some examination and interpretation and this in turn provides a considerable research challenge for Requirements Engineering.
References
1. Anderson, R., Button, G. and Sharrock, W., Supporting the design process within an organisational context. In Proceedings of ECSCW’93 (Milan, Italy, 1993) Kluwer, pp. 47-59.
2. Bloomfield, R., Bowers, J., Emmet, L. and Viller, S., PERE: Evaluation and Improvement of Dependable Processes. In Safecomp 96—The 15th International Conference on Computer Safety, Reliability and Security (Vienna, 1996) Springer Verlag.
3. Bowers, J. and Pycock, J., Talking through design: requirements and resistance in cooperative prototyping. In Proceedings of CHI’94 (Boston, MA, 1994) ACM Press, pp. 299-305.
4. Emam, K.E. and Madhavji, N.H., A field study of requirements engineering practices in information systems development. In Proceedings of RE’95 (York, UK, 1995) IEEE Computer Society Press, pp. 68-80.
5. Goguen, J.A., Social issues in requirements engineering. In Proceedings of RE’93 (San Diego, CA, 1993) IEEE, pp. 194-195.
6. Hemphill, J.K., Why people attempt to lead. In Leadership and Interpersonal Behaviour Petrullo, L. and Bass, B.M., Eds., Holt, Rinehart & Winston, New York, 1961, pp.
7. Isenberg, D.J., Group polarization: a critical review and meta-analysis. Journal of Personality and Social Psychology 50, (1986) pp. 1141-1151.
8. Janis, I.L., Victims of Groupthink. Houghton Mifflin, Boston, MA, 1972.
9. Jirotka, M. and Goguen, J.A., Eds., Requirements Engineering: Social and Technical Issues. Academic Press, London, 1994.
10. La Porte, T.R. and Consolini, P.M., Working in practice but not in theory: theoretical challenges of ‘high reliability organizations’. Journal of Public Administration Research and Theory 1, 1 (1991) pp. 19-47.
11. Lubars, M., Potts, C. and Richter, C., A review of the state of the practice in requirements modelling. In Proceedings of RE’93 (San Diego, CA, 1993) IEEE Computer Society Press, pp. 2-14.
12. Maass, A. and Clark, R.D., Hidden impact of minorities—15 years of minority influence research. Psychological Bulletin 95, 3 (1984) pp. 428-450.
13. Manstead, A.S.R. and Semin, G.R., Social facilitation effects: mere enhancement of dominant responses? British Journal of Social and Clinical Psychology 19, (1980) pp. 119-136.
14. Mellor, P., CAD: Computer-aided disaster! High Integrity Systems Journal 1, 2 (1994) pp. 101-156.
15. Perrow, C., Normal Accidents. Basic Books, New York, 1984.
16. Quintas, P., Ed., Social Dimensions of System Engineering: People, Processes, Policies and Software Development. Ellis Horwood, London, 1993.
17. Rasmussen, J., Skills, rules, knowledge; signals, signs and symbols; and other distinctions in human performance models. IEEE Transactions on Systems, Man and Cybernetics SMC-13, 3 (1983) pp. 257-266.
18. Reason, J., Human Error. Cambridge University Press, Cambridge, UK, 1990.
19. Roberts, K.H., New challenges in organizational research: high reliability organizations. Industrial Crisis Quarterly 3, 2 (1989) pp. 111-125.
20. Rodden, T., King, V., Hughes, J. and Sommerville, I., Process modelling and development practice. In Proceedings of the Third European Workshop on Software Process Technology, EWSPT’94 (1994) Berlin: Springer-Verlag, pp. 59-64.
21. Sagan, S.D., The Limits of Safety: Organizations, Accidents, and Nuclear Weapons. Princeton University Press, Princeton, NJ, 1993.
22. Steiner, I.D., Group Processes and Productivity. Academic Press, New York, 1972.
23. Steiner, I.D., Task-performing groups. In Contemporary Topics in Social Psychology Thibaut, J.W., Spence, J.T. and Carson, R.C., Eds., General Learning Press, Morristown, NJ, 1976.
24. Van Avermaet, E., Social influence in small groups. In Introduction to Social Psychology Hewstone, M., Stroebe, W., Codol, J.-P. and Stephenson, G.M., Eds., Basil Blackwell, Oxford, 1988, pp. 350-380.
25. Westrum, R., Technologies and Society: The Shaping of People and Things. Wadsworth Publishing Company, Belmont, CA, 1991.
PERE: Process Improvement through the Integration of Mechanistic and
Human Factors Analyses.
Luke Emmet, Robin Bloomfield, Stephen Viller*, John Bowers
Adelard; Department of Psychology (bowers@psy.man.ac.uk); *Computing Department
1 Introduction
The importance of human factors in the development process is widely recognised. Analysis of many incidents and accidents points to human error not just in the operational stage, but also in the earlier stages of systems development such as the requirements engineering and design phases. These problems are further compounded in the context of safety-critical systems, where getting the requirements or design wrong may have disastrous consequences. Within the safety-critical community there have been a number of approaches to improving systems. These approaches concentrate either on the product (FTA, ETA, Hazops, etc.) or on the development process (e.g. Bootstrap, SEI CMM, ISO 9000). Both these classes of approach take a mechanistic view, by, for example, focusing on information flows, architecture, causality and so on. As a consequence, they tend to be weak in dealing with vulnerabilities arising specifically from human factors.
To address the need for a more human-centred approach, within the REAIMS project we have developed PERE (Process Evaluation in Requirements Engineering), which aims to integrate both mechanistic and more human-centred improvement approaches to development processes. PERE (see also [1]) was originally developed to identify vulnerabilities in the Requirements Engineering (RE) process, although many, if not all, of the techniques developed apply equally well to any reasonably well defined development process.
Improving the development process
Generally, system engineers and project managers are not widely aware of human factors techniques or of how to apply them. There is a need to make this human factors "knowledge" more widely available within the dependable systems community. PERE aims to do this by providing techniques which integrate well with traditional mechanistic approaches.
Development process losses due to human factors
There is a wealth of human factors literature, e.g. [2], that bears on the design and execution of development processes, as well as human factors techniques that can be used to aid analysis (e.g. ethnographic techniques, task analysis, video analysis and so on). Development process losses may be broadly characterised in terms of the source of the error:
• individual/cognitive—A large amount of research into "human error" [2] has emerged from cognitive approaches to the understanding and modeling of individual failures. This work has generated important distinctions such as that between skill-based, rule-based and knowledge-based "levels" of cognitive activity. Other individual process failures can arise due to procedural violations if the procedures are overly prescriptive, poorly defined or do not support the processes actually followed.
• group/social—Social psychological research has concentrated on the effects of working in social and group settings. This wide and diverse body of research (e.g. [3]) includes work on phenomena such as social facilitation, group performance, group leadership, conformity and consensus, the effects of minority opinion, group polarisation and groupthink. Given the social context of systems development, errors arising from the social nature of work can be an important source of process losses.
• organisational/cultural—It is increasingly recognised that the organisational context and safety-culture surrounding a process is a further determinant of the safety of that process. For example, latent organisational failures may lie dormant until some active trigger event coupled with insufficient defences precipitates an accident.
A more comprehensive literature review of the Human Factors in Requirements Engineering may be found in the companion extended abstract submitted to this workshop [4].
2 PERE
PERE is an integrated process improvement approach that combines two complementary viewpoints (see Figure 1) onto the development process under analysis and uses a weaknesses-defenses approach to identifying and mitigating process weaknesses. PERE makes no commitment to a particular mechanistic view, thereby allowing a range of analytical techniques.
Figure 1 Overview of PERE
Method description
The underlying idea is that both a mechanistic and human factors analysis of the process can be carried out in an integrated and complementary manner.
• Mechanistic viewpoint—this viewpoint was developed from attempts to generalise Hazops and Sneak Circuit Analysis, along with a structuring approach taken from object-oriented analysis. A process model is built that describes the process components, interconnections and working material, and classifies the process components into basic component classes (a minimal illustrative sketch of such a model appears after this list). This model is then systematically analysed for generic and specific weaknesses associated with the component classes and their attributes.
• Human factors viewpoint—this viewpoint takes as its input the process model from the mechanistic viewpoint, and conducts a systematic analysis of those components, interconnections and working materials that are considered to be particularly vulnerable to weaknesses due to human factors. The analyst is guided through the analysis by a series of structured questions that scope and focus the human factors analysis. The weaknesses and protections are further elaborated in the PERE Human Factors Checklist, which contains a summary of relevant, non-controversial human factors material and pointers out into the literature.
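As a purely illustrative Python sketch (the component classes and weakness lists below are invented for the example and are not PERE's own checklists), the kind of process model and class-driven weakness pass that the mechanistic viewpoint works over might be rendered as:

from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    component_class: str               # e.g. "human" or "document" (illustrative classes)
    working_material: list = field(default_factory=list)

@dataclass
class Interconnection:
    source: str
    target: str
    carries: str                       # working material flowing between components

# Invented generic weaknesses per component class, standing in for the real checklists.
GENERIC_WEAKNESSES = {
    "human": ["slips and lapses on routine steps", "rule misapplication under time pressure"],
    "document": ["out-of-date content", "ambiguous wording"],
}

def mechanistic_pass(components):
    """List candidate weaknesses to examine for each component, keyed by its class."""
    return {c.name: GENERIC_WEAKNESSES.get(c.component_class, []) for c in components}

model = [Component("Requirements author", "human", ["requirements draft"]),
         Component("Requirements document", "document")]
links = [Interconnection("Requirements author", "Requirements document", "requirements draft")]
print(mechanistic_pass(model))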
Advantages of PERE
• Practical, systematic approach—PERE provides practical guidance through the analysis by means of decision trees and checklists which guide and scope the analysis. This makes the analysis feasible in that the analyst does not have to look for every human factors weakness within every process component.
• Sensitive to actual process improvement needs—Real industrial processes are inherently socio-technical processes in which people work with technology. Any proposed improvement approach must therefore consider both aspects.
• Knowledge dissemination—Some of the human factors "expertise" is provided by the PERE method, which provides a means of "giving away" important human factors knowledge across a wider base of users.
3 Application experience
Full documentation for PERE exists. Application experience to date has primarily been focused on improving development processes within the REAIMS community. The industrial applications of PERE have included a wide variety of different processes within the safety-critical domain:
• A Software Specification process—PERE was used to analyse the use of the B method in developing safety critical software.
• The Standards Process—The standards development process is a ripe candidate for process improvement (some international standards may take up to 10 years per document). We found that a standards process that focused on document production at the expense of consensus building could result in bad or late standards.
• Corporate Memory process—PERE was used to aid the development of another REAIMS module, MERE (Managing Experience in Requirements Engineering), which is a process for supporting learning from experience within an organisation across related products and processes.
4 Further work
Although the mechanistic analysis is currently conducted in a qualitative manner by the PERE analyst, we are investigating, in conjunction with Edinburgh University, support for the mechanistic analysis through the use of formalism and quantitative modelling.
In particular, we are looking at modeling the process using a process algebra (e.g. PEPA [5]—Performance Evaluation Process Algebra), which allows processes to be modeled algebraically, and the compositional behaviour of the process as a whole to be inferred from the connections and synchronisations of its constituent subprocesses.
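As a rough indication of the flavour of such a model (the component names, action names and rates here are purely illustrative and are not drawn from any PERE analysis), a two-component PEPA fragment in which an author and a reviewer must synchronise on a shared review action could be written as:

\begin{align*}
\mathit{Author} &\overset{\mathit{def}}{=} (\mathit{draft}, r_1).(\mathit{review}, r_2).\mathit{Author}\\
\mathit{Reviewer} &\overset{\mathit{def}}{=} (\mathit{review}, r_3).\mathit{Reviewer}\\
\mathit{Process} &\overset{\mathit{def}}{=} \mathit{Author} \bowtie_{\{\mathit{review}\}} \mathit{Reviewer}
\end{align*}

The shared review action can only proceed when both components are ready to perform it, which is what allows the behaviour of the process as a whole to be derived compositionally from the synchronisation of its parts.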
[1] Bloomfield, R., Bowers, J., Emmet, L. & Viller, S. "PERE: Evaluation and Improvement of Dependable Processes" in E. Shoitsch (ed) SAFECOMP96—The 15th International Conference on Computer Safety, Reliability and Security Vienna, Springer Verlag 1996.
[2] Reason, J., Human Error. Cambridge University Press, Cambridge, UK, 1990.
[3] Steiner, I.D., Task-performing groups. In: Contemporary Topics in Social Psychology Thibaut, J.W., Spence, J.T. and Carson, R.C., Ed., General Learning Press, Morristown, NJ, 1976.
[4] Viller, S., Bowers, J., Rodden, T. "Human Factors in Requirements Engineering" extended abstract submitted to "Workshop on Human Error and Systems Development" Glasgow Accident Analysis Group, Glasgow 1997
[5] Hillston, J. "Compositional Markovian Modelling Using a Process Algebra" in Proceedings of the Second International Workshop on Numerical Solution of Markov Chains held in Raleigh, North Carolina, January 1995. These proceedings are published by Kluwer Academic Press, with the title "Computations with Markov Chains"
A Co-operative Scenario based Approach to Acquisition and Validation of
System Requirements: How Exceptions can help!
Neil Maiden1, Shailey Minocha1, Michele Ryan1, Keith Hutchings2 and Keith Manning1
1Centre for HCI Design, City University, Northampton Square, London EC1V 0HB
2Philips Research Laboratories, Cross Oak Lane, Redhill, Surrey RH1 5HA
Tel: +44-171-477 8412
Fax: +44-171-477 8859
E-Mail: [N.A.M.Maiden, S.Minocha]@city.ac.uk
Scenario-based requirements analysis is an inquiry-based collaborative process which enables requirements engineers and other stakeholders to acquire, elaborate and validate system requirements. A scenario, in most situations, describes the normative or expected system behaviour during the interactions between the proposed system and its environment. To account for non-normative or undesired system behaviour, it is vital to predict and explore the existence or occurrence of ‘exceptions’ in a scenario. Identification of exceptions, and the inclusion of additional requirements to prevent their occurrence or mitigate their effects, yields robust and fault-tolerant design solutions. In this paper, we outline the architecture of a toolkit for semi-automatic generation of scenarios. The toolkit is co-operative in the sense that it aids a requirements engineer in the systematic generation and use of scenarios. The toolkit provides domain knowledge during requirements acquisition and validation of normative system behaviour. It also provides systematic guidance to the requirements engineer to scope the contents of a scenario. Furthermore, we have identified three kinds of exceptions: generic, permutation and problem exceptions, and have derived complex taxonomies of problem exceptions. We propose to populate the toolkit with lists of meaningful and relevant ‘what-if’ questions corresponding to the taxonomies of generic, permutation and problem exceptions. The requirements engineer can choose exceptions and include them in the generated scenarios to explore the correctness and completeness of requirements. In addition, the taxonomies of problem exceptions can also serve as checklists and help a requirements engineer to predict non-normative system behaviour in a scenario.
1. Introduction
Scenarios, in our context, are descriptions of required interactions between a desired system and its environment. Scenario-based requirements engineering helps requirements engineers and other stakeholders develop a shared understanding of the system’s functionality. Scenarios, derived from a description of the system’s and stakeholder’s goals, capture the system’s expected or normative behaviour. However, to ensure robust and flexible design solutions, it is essential to investigate the occurrence of ‘exceptions’ in the system and its environment. The exceptions are sources of non-normative or exceptional system behaviour as they prevent the system from delivering the required service.
The ESPRIT 21903 CREWS (Co-operative Requirements Engineering With Scenarios) long-term research project proposes the use of scenarios for both requirements acquisition and validation. Furthermore, it identifies the presence and occurrence of exceptions and emphasises the importance of exploring these exceptions during scenario analysis to ensure correct and complete requirements. First, to help requirements engineers generate a limited set of salient scenarios, this paper describes the architecture of a toolkit for semi-automatic generation of scenarios. Next, we have identified three basic types of exceptions: generic, permutation and problem. The generic and permutation exceptions are the exceptions that arise in the basic event-action-sequence of a scenario or a combination of scenarios. Problem exceptions arise due to the interactions of a software system with its social, operational and organisational environments and provide additional knowledge with which to explore the non-normative behaviour. Furthermore, we have derived complex taxonomies of problem exceptions. We propose to populate the toolkit with lists of ‘what-if’ questions corresponding to these taxonomies of generic, permutation and problem exceptions. The requirements engineer can select the relevant exceptions in the toolkit to include them in generated scenarios in order to guide the process of scenario-analysis with other stakeholders. In addition, a requirements engineer can use the taxonomies of problem exceptions as checklists during scenario analysis or in any other technique of requirements analysis.
The architectural design and computational mechanisms of the CREWS toolkit build on results from the earlier ESPRIT 6353 ‘NATURE’ basic research action (Jarke et al. 1993). NATURE identified a large set of problem domain templates or abstractions, or Object System Models (OSMs), which we discuss later on. Each OSM encapsulates the knowledge of normative system behaviour of all application domains which are instances of that OSM or problem domain template. A scenario of an application domain derived from a NATURE OSM thus has normative information content. The identification of exceptions and their inclusion in a scenario to explore the non-normative system behaviour, as proposed in CREWS, contributes to the non-normative content of a scenario.
First, in Section 2, we discuss several definitions of scenarios and explore their role in requirements engineering. In Section 3, we discuss the various types of exceptions which are useful during scenario analysis. We then discuss problem exceptions and present a classification of them. We present a subset of the six taxonomies of problem exceptions that we have derived as answers to the question ‘what can go wrong?’ across the six dimensions of the classification framework. The toolkit’s architecture and the process of scenario generation, along with an example, are detailed in Section 4. Finally, we outline directions for future research in Section 5.
2. Scenarios and Scenario-based Requirements Engineering
Scenarios have been found to be useful in many disciplines. This is interesting to us given the multi-disciplinary nature of requirements engineering. Examples, scenes, narrative descriptions of contexts, mock-ups and prototypes are all different terms for scenarios in the areas of human computer interaction (HCI), requirements engineering and information systems. In HCI, a scenario is often defined as a detailed description of a usage context which helps the designer to explore ideas, consider the appropriateness of the design and user support, and other aspects of the environment (Carroll 1995). Scenarios, in the area of information systems, have been defined as partial descriptions of system and environment behaviour arising in restricted situations (Benner et al. 1992). In the context of requirements engineering (Hsia et al. 1994), scenarios have been defined as ‘possible ways to use the system to accomplish some function the user desires’. A similar definition is given by Potts et al. (1994): ‘particular cases of how the system is to be used. More specifically, a scenario is a description of one or more end-to-end transactions involving the required system and its environment’. Scenario analysis helps to evaluate design alternatives, validate designs, notice ambiguities in system requirements, and uncover missing features or inconsistencies. In object-oriented analysis (Jacobson et al. 1992), scenarios have been defined as ‘use-cases’, a use-case being a sequence of transactions between an ‘actor’, who is outside the system, and the system. The use-case approach focuses on the description of the system’s interactions with its environment.
We can determine the basic characteristics of a scenario from these other disciplines. A scenario is, in essence, a description of required interactions between a system to be built and its environment to achieve some purpose. It can be seen as a behavioural requirement, albeit one which is external to the system to be built, and does not relate to its internal state changes that are unforeseen to the environment. Advantages of such scenarios are numerous. They can be used to anchor communication and negotiation amongst requirements engineers and other stakeholders for acquiring and clarifying requirements. A typical scenario analysis session involves a walkthrough by the requirements engineers and stakeholders to validate the task description or functionality simulated in a scenario.
A scenario captures a ‘basic course of events’. It represents a normative usage situation of a system, typically a normal sequence of tasks that a user performs to achieve a desired goal. Alternatives or variants to this basic course of events, and errors that can occur, are described as ‘alternative courses’ (Jacobson et al. 1992). Thus every scenario may have an alternative course: a non-normative state (condition) or a non-normative event (behaviour) may occur in the scenario. The non-normativeness of a usage context or a scenario implies an inappropriate, undesirable or unsafe state or behaviour of a system. Each non-normative state or event, critical or non-critical, is an effect or consequence of one or more underlying causes in the system or its surrounding environment. Each cause of inappropriate system performance may itself be composed of two or more necessary conditions or exceptions.
Exceptions must be explored during requirements analysis as this can help in clarifying and elaborating requirements, and in identifying additional or missing requirements for robust design alternatives. The new requirements or constraints that arise to eliminate the exceptions, or to mitigate their effects on system performance, should be included in the system specifications. This helps achieve completeness of requirements, as the specifications would then contain the desired system goals as well as the constraints within which the system may operate while achieving those goals. Such constraints arise from quality considerations (including safety), user interface guidelines, data-input limitations (data-entry validations), and performance considerations (such as system response times).
Despite the extensive use of scenarios in requirements acquisition (Jacobson et al. 1992) and validation (Sutcliffe 1997), it has been reported (Gough et al. 1995) that generating a useful set of scenarios is tedious and time-consuming. Few guidelines exist to aid the definition of the structure and contents of a scenario. A requirements engineer must have considerable understanding and knowledge of the problem domain and of the scope of scenario analysis to generate scenarios efficiently. Also, without methodical guidance, it is difficult to detect non-normative behaviour, or the presence and effect of exceptions, in a scenario during the inquiry process of scenario analysis.
It is our aim, in this paper, to demonstrate how the toolkit proposed in CREWS uses NATURE’s OSMs to provide systematic guidance to the requirements engineer in generating scenarios detailing normative system behaviour. The identification of categories of exceptions and the proposed taxonomies of problem exceptions, developed as part of the research work in CREWS, further help the requirements engineer to append exceptions to the scenarios generated through the toolkit.
3. Types of Exceptions and Taxonomies of Problem Exceptions
Each scenario describes one or more threads of ‘normative’ behaviour of a software system and consists of agents (human or machine), actions with start and end events, stative pre- and post-conditions on actions, objects, their states and state transitions, and a goal state. Two types of exception can be identified using these basic semantics of a scenario: generic exceptions and permutation exceptions. The third category we have identified is problem exceptions: exceptions that arise from the software system’s interaction with the external environment, that is, with the social environment (humans and their interactions), with other software or hardware systems in a distributed environment, or with the organisational environment, including business processes and goals. The identification of problem exceptions gives an integrated exploration of the exceptions that can arise in the environments around the software system.
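As a minimal sketch of how these basic scenario semantics might be represented (the class and field names are our own illustration, not the CREWS toolkit’s actual data model), consider:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Action:
    """One step in a scenario chain, bracketed by start and end events."""
    name: str
    agent: str                         # human or machine agent performing the action
    start_event: Optional[str] = None  # event that triggers the action
    end_event: Optional[str] = None    # event that marks its completion
    preconditions: List[str] = field(default_factory=list)   # stative pre-conditions
    postconditions: List[str] = field(default_factory=list)  # stative post-conditions

@dataclass
class Scenario:
    """A single thread of normative behaviour: agents, key object, actions and goal."""
    name: str
    agents: List[str]
    key_object: str
    actions: List[Action]
    goal_state: str
```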
Generic exceptions relate to the basic components of a scenario simulating a behavioural requirement. A sample set of what-if questions listing the generic exceptions that explore non-normativeness in the event-action sequence of a scenario is: what if an action is not started by a start-event?; what if an action is not completed, that is, it has no end-event?; what if an action does not result in a state transition of the key object?; what if the stative pre-conditions are not satisfied?; what if the stative post-conditions are not satisfied?; what if the goal state is not achieved?
When different scenarios are combined or linked to one another (permutations of scenarios), several exceptions can arise in the mappings between the basic components of the scenarios, that is, actions, agents, key objects, events, states, etc. These exceptions are termed permutation exceptions. They can be identified, for example, by analysing the temporal semantics of two scenarios, that is, by comparing the event-action sequences of the two chains in time. A representative set of what-if questions to guide the identification of such exceptions is: what if an event that should precede another event happens later?; what if two agents perform the same action at the same time?; what if the start-events of actions in two scenario chains are the same and happen at the same time?; what if the state transition of the same key object takes place at the same time involving two different agents?
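The following is a sketch of how these what-if questions could be instantiated mechanically over the scenario structures sketched above; the checks are illustrative and are not the toolkit’s actual inquiry mechanism:

```python
def generic_exception_questions(s: Scenario) -> list:
    """Instantiate the generic what-if questions over one scenario chain."""
    qs = []
    for a in s.actions:
        qs.append(f"What if '{a.name}' is not started by its start-event '{a.start_event}'?")
        qs.append(f"What if '{a.name}' is not completed, i.e. has no end-event?")
        qs.append(f"What if '{a.name}' does not cause a state transition of '{s.key_object}'?")
        qs.append(f"What if the pre-conditions {a.preconditions} of '{a.name}' do not hold?")
        qs.append(f"What if the post-conditions {a.postconditions} of '{a.name}' do not hold?")
    qs.append(f"What if the goal state '{s.goal_state}' is not achieved?")
    return qs

def permutation_exception_questions(s1: Scenario, s2: Scenario) -> list:
    """Instantiate a few permutation what-if questions over a pair of linked chains."""
    qs = []
    for a1 in s1.actions:
        for a2 in s2.actions:
            if a1.start_event and a1.start_event == a2.start_event:
                qs.append(f"What if '{a1.name}' and '{a2.name}' are both triggered by "
                          f"'{a1.start_event}' at the same time?")
            if a1.name == a2.name and a1.agent != a2.agent:
                qs.append(f"What if '{a1.agent}' and '{a2.agent}' perform '{a1.name}' "
                          f"at the same time?")
    return qs
```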
A problem exception is a state (condition) or an event that is necessary but not sufficient for the occurrence of an undesired or non-normative behaviour of the system. It can exist within a system, occur during the execution of the system, or exist in the environment of the system. It is ‘something wrong’ in the system (including the operator or ‘user’) or the environment. It could be a hardware component fault, a design fault in hardware or software, a software fault, an operator (human) or organisational condition or action, or an undesirable feature of the human-computer interface.
If a problem exception is regarded as being sufficient for the occurrence of an undesired behaviour, it is said to be the cause. In a complex system, it is difficult to identify a single cause (or a single problem exception) for its malfunctioning or unplanned behaviour. This is because the existence of one or more problem exceptions gives rise to the occurrence of another set of problem exceptions by propagation and chain reaction through the system interfaces, generating an undesired state of the system.
Earlier work (Hsi and Potts 1995) concentrated on identifying obstacles during scenario analysis, where obstacles are conditions that can result in the non-achievement of the goal state in a scenario. We, in contrast, give problem exceptions a broader scope: they include missing data-entry validation checks, non-adherence to user interface guidelines, hardware or software faults, design flaws or errors, hazards, obstacles, critical incidents, etc., leading to system failures, mishaps, loss events, accidents, etc. We now propose a classification of problem exceptions. The classification is general in the sense that, in populating it, we plan to cover the whole spectrum of software systems and their environments - from safety-critical systems such as avionics, nuclear power plants and process control environments, to non-safety-critical office systems. This is an interesting research challenge we have undertaken, and success is by no means guaranteed.
Classification Framework of Problem Exceptions
There are several taxonomies of problem exceptions in the literature in the areas of cognitive science and engineering, safety engineering, and usability engineering. It is difficult to categorise their domains along orthogonal axes as some of the studies overlap, especially the taxonomies of human error and human mental models (Rasmussen and Vicente 1989), (Norman 1988), (Reason 1990), (Hollnagel 1993). We are attempting to bring these diverse themes of work together in a single classification framework. Our further aim is to apply the taxonomies of problem exceptions derived under this general classification scheme to the area of requirements engineering through scenarios.
Any scenario, which represents a sequence of actions, can involve human agent(s) (H) and machine agent(s) (M). A human agent may also communicate with another human agent, and a machine may pass information to another machine in a distributed system. The minimum set of interaction patterns involving H and M is therefore: H, M, HM, HH and MM. Each interaction pattern in this set can give rise to inappropriate or undesired system performance. Based on this set of possible interactions, we identify five types of exceptions: Human exceptions, Machine exceptions, exceptions that arise due to Human-Machine Interaction, exceptions that arise due to Machine-Machine Communication, and exceptions that arise due to Human-Human Communication. Apart from these five categories, we have identified a sixth type, Organisation exceptions, that is, exceptions that arise due to organisational structure or social conditions.
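A minimal sketch of how this six-way classification might be encoded, assuming the interaction-pattern labels above; the enumeration names and values are illustrative only:

```python
from enum import Enum

class ExceptionType(Enum):
    """The six dimensions of the classification framework."""
    HUMAN = "H"              # human exceptions
    MACHINE = "M"            # machine (technical) exceptions
    HUMAN_MACHINE = "HM"     # human-machine interaction exceptions
    HUMAN_HUMAN = "HH"       # human-human communication exceptions
    MACHINE_MACHINE = "MM"   # machine-machine communication exceptions
    ORGANISATION = "ORG"     # organisational structure or social conditions

# The five agent interaction patterns map directly onto the first five types;
# organisation exceptions sit outside the interaction patterns.
INTERACTION_TO_TYPE = {
    "H": ExceptionType.HUMAN,
    "M": ExceptionType.MACHINE,
    "HM": ExceptionType.HUMAN_MACHINE,
    "HH": ExceptionType.HUMAN_HUMAN,
    "MM": ExceptionType.MACHINE_MACHINE,
}
```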
A human agent, as an integral component in the social environment of the software system, takes decisions, performs actions, etc. During this interaction, a deviation from the normal behaviour of a human agent may result in inappropriate system performance. The causes of deviations in a human agent’s actions or behaviour are called human exceptions. These exceptions have been termed the causes of human errors in cognitive engineering (Reason 1987), (Reason 1990), (Norman 1988) and human factors engineering (Sutcliffe and Rugg 1994). For example, human exceptions may arise due to insufficient knowledge, memory lapses, an incorrect mental model, etc. When the failure of a machine due to conditions such as power failure, hanging, or disconnection from the network gives rise to inappropriate system performance, such technical causes are called machine exceptions or technical exceptions. These are sometimes considered the causes of technical failures in the literature (Leveson 1995). Design errors, non-adherence to user interface guidelines, etc. may give rise to situations in which a human agent is unable to take a decision or diagnose a problem while interacting with a machine, resulting in undesired system performance. Such exceptions are termed human-machine interaction mismatches (Nielsen 1993), (Rasmussen et al. 1994), (Hollnagel and Kirwan 1996). They may arise due to usability problems, poor feedback mechanisms, or inadequate error-recovery mechanisms. When a human agent is unable to communicate appropriately or as desired with another human agent, whether through speech, documents, or any other medium, and this causes unacceptable system performance, such causal factors are called human-human communication exceptions (Reason 1990). Exceptions due to human-human communication may arise from communication mismatches between peers as a result of unclear task allocation or lack of co-ordination at management level. The importance of verbal communication between controllers in the London Ambulance Service’s control room was overlooked in the infamous computer-aided despatch system disaster, and was one reason for the system’s ultimate failure (Dowell and Finkelstein 1996). A situation may also arise in which a machine is unable to communicate correctly with another machine in a network or distributed system; this is an exception of the machine-machine communication type.
Causal Relations of Problem Exceptions
The presence of one exception may give rise to another and so on, triggering a chain of non-normative events leading to undesirable or inappropriate system performance. We illustrate these causal relations of problem exceptions leading to unplanned behaviour through an example.
Consider the case where an operator of a process control system is unable to optimally control certain parameters (human exception) because s/he does not have the access rights for some functions or information of the machine (equipment), is not sufficiently trained to perform the allocated job, or is working in stressful conditions. These reasons for an operator’s inability to perform an allocated responsibility actually reflect the role allocation structure and planning of the organisation (organisation exception). The human error caused by the cumulative effect of the human exception (source) and the organisation exception (trigger) may lead to the malfunctioning of the control system (machine exception) and, ultimately, cause a system breakdown. In this case, the sequence of events can be represented as:
Human Exception (source) → Organisation Exception (trigger) → Human Error → Machine Exception → System Breakdown.
The aim of this example is not to demonstrate a generic path for the causal relations of problem exceptions, but to illustrate the interaction between the different types of problem exception. The example carries two important messages. First, if a problem exception is known, the requirements engineer can explore its effect(s) on system behaviour by exploiting these causal relations to simulate the causal chain of events in the normative task-flow of a scenario. For example, when power failure is identified as a problem exception, the requirements engineer will explore everything that can be caused by a power failure in the scenario of the application domain in question. Following this causal explanation, and having determined the possible effects of the power failure and the severity of these effects, the requirements engineer will add requirements to the scenario description to eliminate the occurrence of the power failure or to diminish its effect on the system’s environment. In a safety-critical system, a significant additional requirement to guard against power failure would be the inclusion of uninterruptible power supplies in the system specifications. In the context of a book-lending library, by contrast, this problem exception may be tackled by providing paper forms or other mechanisms with which library staff members can manually issue books and fine borrowers, and borrowers can reserve books.
Alternatively, if the consequences are known from previous histories of undesirable system behaviour, the requirements engineer can start from the consequences and determine the cause(s), or problem exceptions, by following the causal path of events backwards. These techniques of forward and backward search, as illustrated here, can be integrated into the method of scenario analysis. This approach to causal analysis is very similar to hazard analysis techniques such as HAZOP in safety engineering (Leveson 1995). We are currently developing a method to incorporate causal analysis within scenario-based requirements engineering.
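A minimal sketch of forward and backward causal search over a small causal graph restating the process-control example; the graph edges and labels are illustrative assumptions, not a generic model:

```python
# The edges below restate the process-control example as a small causal graph.
CAUSES = {
    "human exception (source): operator cannot control key parameters": ["human error"],
    "organisation exception (trigger): poor role allocation and training": ["human error"],
    "human error": ["machine exception: control system malfunction"],
    "machine exception: control system malfunction": ["system breakdown"],
}

def forward_search(cause: str) -> list:
    """Given a cause, enumerate the consequences reachable along the causal path."""
    found, frontier = [], list(CAUSES.get(cause, []))
    while frontier:
        effect = frontier.pop(0)
        if effect not in found:
            found.append(effect)
            frontier.extend(CAUSES.get(effect, []))
    return found

def backward_search(consequence: str) -> list:
    """Given a consequence, enumerate the candidate causes behind it."""
    found = []
    for cause, effects in CAUSES.items():
        if consequence in effects and cause not in found:
            found.append(cause)
            found.extend(c for c in backward_search(cause) if c not in found)
    return found

print(forward_search("human error"))        # machine exception ..., system breakdown
print(backward_search("system breakdown"))  # machine exception, human error, both exceptions
```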
Taxonomies - Populating the Classification Framework
We have derived a set of taxonomies as answers to the ‘what can go wrong?’ question along the six dimensions of the classification framework, taking into account the taxonomies available in the literature. We present a sample of these taxonomies in Table 1; a more detailed and complete set is beyond the scope of this paper. To supplement and validate the derived taxonomies, we are conducting field studies using knowledge elicitation techniques (Maiden and Rugg 1996) to gather information from experienced requirements engineers, system designers, and end-users. The taxonomies are to be populated in the toolkit. The requirements engineer can select the relevant exceptions in the generated scenarios to guide the inquiry process of scenario analysis. The exceptions will enable the requirements engineer to ask the ‘right’ questions of other stakeholders to either predict or investigate any unplanned system behaviour. Additionally, these taxonomies will serve as checklists to guide thinking and stimulate the thought process, helping to uncover ‘new’ requirements, clarify known requirements, and detect incompleteness or ambiguity in requirements.
Exception Type | Category | Sources of Exceptions
Human | Physiological | Work Environment - noise, lighting, work timings, shift arrangements, temperature, ventilation; Stress - reactions to stress; Attention capacity - over-attention or inattention, perceptual confusion; Adaptation - reaction to changes in system and environment; Mental Load - tiredness, stress
Human | Anatomical | Physical Health - disability, sickness or injury, poor physical co-ordination, fatigue
Human | Cognitive | Mental Model of the system - incorrect mental model, incomplete task knowledge; Causal Reasoning - delayed feedback from the system, perceptive power for the consequences; Diagnostic Capability - dependence on diagnostic support from the system, task knowledge
Human | Psychological | Morale - management policies and attitudes; Motivation - boredom due to repetitive tasks; Disturbance - environmental distractions due to noise, lighting, work-place set-up, etc.
Machine | Hardware / peripheral equipment | Power supply - failure or fluctuations; Peripheral devices / instruments - faulty or inaccessible; Communication - faulty network connectivity, transmission line failures; Work environment - inadequate or unavailable support staff
Human-Machine Interaction | Screen layout | Widget layout - improper choice of or unsuitable icons, user interface controls, cues or metaphors for the task and user; Information Presentation - overload and poor spread of information, inconsistent or wrong choice of colours, colour combinations and fonts, non-conformance to human factors guidelines, missing data-entry validations, improper dialogue design and navigational flow, slow system response times in information retrieval, unsuitable decision aids for the task and user; Design guidelines - non-conformance to platform-dependent GUI guidelines and, where applicable, internationalisation requirements
Human-Machine Interaction | Error handling | Feedback - non-indicative warning messages, alerting techniques such as alarms, flashing and reverse video, delayed response times; Error recovery mechanisms - absent or slow
Human-Machine Interaction | Input / output devices | Keyboards, pointing devices, sound, monitors, display panels, indicators, etc. - faulty or unsuitable
Table 1: A Sample Set of Problem Exceptions
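As an illustration of how a fragment of Table 1 might be held as a machine-readable checklist in the toolkit (the data structure and the selected entries are our own sketch, not the toolkit’s internal representation):

```python
# An illustrative fragment of Table 1 held as a nested dictionary:
# exception type -> category -> sources of exceptions. Contents are a subset only.
PROBLEM_EXCEPTION_TAXONOMY = {
    "Human": {
        "Physiological": ["work environment (noise, lighting, shifts)", "stress",
                          "attention capacity", "mental load"],
        "Cognitive": ["incorrect mental model", "incomplete task knowledge",
                      "delayed feedback from the system"],
    },
    "Machine": {
        "Hardware / peripheral equipment": ["power supply failure", "faulty peripheral devices",
                                            "transmission line failures"],
    },
    "Human-Machine Interaction": {
        "Error handling": ["non-indicative warning messages", "absent or slow error recovery"],
    },
}

def checklist(exception_type: str) -> list:
    """Flatten one branch of the taxonomy into a checklist for scenario analysis."""
    branch = PROBLEM_EXCEPTION_TAXONOMY.get(exception_type, {})
    return [f"{category}: {source}" for category, sources in branch.items() for source in sources]

print(checklist("Human-Machine Interaction"))
```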
4. CREWS Toolkit
As a part of the ESPRIT 6353 ‘NATURE’ basic research action (Jarke et al. 1993), a large set of problem domain templates or abstractions, known as Object System Models (OSMs), has been identified to provide domain-specific guidance to requirements engineers. Each model describes the fundamental behaviour, structure, goals, objects, agents, constraints, and functions shared by all instances of one problem domain category in requirements engineering. These models are similar to analysis patterns (Coad et al. 1995), problem frames (Jackson 1995), or clichés (Reubenstein and Waters 1991). However, NATURE has produced the first extensive categorisation of requirements engineering problem domains from domain analysis, case studies, software engineering books, etc., in the form of over 200 OSMs with 13 top-level OSMs held in a hierarchical object-oriented deductive database. The 13 top-level OSMs are resource returning, resource supplying, resource usage, item composition, item decomposition, resource allocation, logistics, object sensing, object messaging, agent-object control, domain simulation, workpiece manipulation and object reading. As an example, car rental, video hiring or book-lending libraries are applications that belong to the problem domain of resource hiring, which is a specialisation of the resource returning OSM. The OSMs proposed in NATURE have been validated through empirical studies (e.g. Maiden et al. 1995), and tools have been constructed to use them in requirements structuring (Maiden and Sutcliffe 1993), critiquing (Maiden and Sutcliffe 1994) and communication, and requirements reuse (Maiden and Sutcliffe 1992).
In CREWS, it is proposed to use the OSMs to provide guidance for scenario-based requirements acquisition and validation and, in particular, as the basis for the automatic generation of core or initial scenarios. Generation identifies permutations of OSM features to produce a set of possible scenarios. The fundamental components of both OSMs and scenarios are agents, events, objects, states and state transitions (Potts et al. 1994). These can be manipulated, as a set, to determine different permutations, or scenarios, for a problem domain. Each individual permutation is called a scenario chain and is, in essence, a single thread of behaviour in the software system. It is described using agents, events, objects, states, actions and state transitions, all of which are part of the semantics of an OSM. The permutations can be extended with exceptions to define unforeseen situations and events in problem domains. Furthermore, features in OSMs are interconnected, enabling the imposition of useful constraints on scenario generation. A computational mechanism has been designed to generate scenarios in this manner; it is being implemented in the CREWS toolkit along with the facility to append exceptions to the scenarios.
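A minimal sketch of the permutation idea, assuming hypothetical feature sets and a single interconnection constraint; it is not the CREWS generation mechanism itself:

```python
from itertools import product

def generate_scenario_chains(osm_features: dict, allowed, max_chains: int = 10) -> list:
    """Enumerate candidate scenario chains as constrained combinations of OSM features."""
    chains = []
    for agent, action, obj in product(osm_features["agents"],
                                      osm_features["actions"],
                                      osm_features["objects"]):
        if allowed(agent, action, obj):
            chains.append({"agent": agent, "action": action, "object": obj})
        if len(chains) >= max_chains:
            break
    return chains

# Hypothetical resource-hiring features for a book-lending library, plus one
# interconnection constraint that prunes impossible permutations.
library_features = {"agents": ["lender", "borrower", "computer system"],
                    "actions": ["request loan", "issue", "return"],
                    "objects": ["book"]}
no_machine_requests = lambda agent, action, obj: not (agent == "computer system"
                                                      and action == "request loan")
print(generate_scenario_chains(library_features, no_machine_requests, max_chains=5))
```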
We are currently developing a throw-away prototype of the CREWS toolkit. We plan to conduct tests in industry using a scenario-based requirements elicitation technique (Sutcliffe 1997) to determine additional requirements for the toolkit and to scope its structure, content and forms of presentation. In addition, we will conduct usability studies of the look-and-feel of the user interface and the navigational flow of the toolkit. The prototype is currently being reviewed internally in our centre as part of its iterative development. We aim to report the results of our internal reviews and ‘real user’ testing of the prototype in the near future.
We now illustrate the process of scenario generation using NATURE’s OSMs and present screen layouts from the toolkit prototype to demonstrate it. A detailed treatment of the scenario generation mechanism and the toolkit’s architecture is available in (Maiden 1996). In this paper, we focus on the facility for adding exceptions to the scenarios through the toolkit to analyse non-normative system behaviour, both in terms of the event-action analysis within a scenario and in terms of its interactions with the external environment. The screen layouts in Figures 4-11 are from the prototype of the toolkit.
Example
Consider a book-lending library as an example application domain. In the toolkit, the application domain facts are first captured through an interactive dialogue with the requirements engineer (the ‘user’ of our toolkit). The requirements engineer enters the details of agents, events, actions, etc. through this dialogue to provide domain-specific information. These inputs are matched against the stored OSMs in the database; the retrieval yields three OSMs: Resource Hiring, Resource Repairing and Object Sensing. Each OSM has scenario chains in the database. The requirements engineer selects the Resource Hiring OSM. There are four core or initial scenarios in this application domain: Resource-Loan, Resource-Return, Resource-Reserve and Resource-Unreserve, which are retrieved and shown on the display when the user selects the OSM (Figure 4). The requirements engineer then has the option to choose one or more scenario chains to which to add exceptions, to set parameters for scenario generation such as constraining the number of scenarios to be generated, and to specify the agent interaction patterns. The agent interaction patterns map human or machine agents to the agents in the scenario chain. Agent types and patterns of interaction are critical for scenario generation: object system models include abstractions of agents but say little about them, because agent types and interactions are not facts which discriminate between categories of problem domain.
Figure 4 Retrieval of OSMs and Scenario Chains
Let us consider the initial scenario chain Resource-Loan. In natural language this initial scenario reads ‘lender lends a resource to the borrower’. An example: in the library domain, the resource is a book and a borrower requests the issue of a book from the lender, who is a library staff member. (Here, we are considering a library where the borrower brings the book to the library desk and the library staff member interacts with the computer system (machine) to issue the book to the borrower.) On selecting a scenario chain, the requirements engineer has the flexibility to choose the generic exceptions, the permutation exceptions, or the problem exceptions to include in a scenario chain or combination of scenario chains, to enter the parameters for scenario generation, or to map the agent interaction patterns.
Suppose the requirements engineer chooses the option to map the agent interaction patterns. Lender, Borrower and Other Agent are the agents in the initial scenario chain. They would be mapped to human and machine agents as follows (Figure 5):
Borrower: Human agent (in the real world, a student or staff member in a university library);
Lender: Human agent (librarian);
Other Agent: Machine agent (Computer system).
Figure 5 Choosing Agent Interaction Patterns
Next, say, the requirements engineer chooses to add generic exceptions by selecting from the list of what-if questions in the toolkit (Figure 6).
Figure 6 Choosing Generic Exceptions
The requirements engineer can select two or more chains to add permutation exceptions to a combination of scenario chains, that is, to permutations of scenario chains (Figure 7). There is flexibility to add permutation exceptions to permutations of the same scenario chain or to permutations of different scenario chains which have a related and dependent event-action sequence.
Figure 7 Choosing Permutation Exceptions
Figure 8 shows a sample of the taxonomies of problem exceptions from which the requirements engineer can choose exceptions to include in the scenarios.
Figure 8 Choosing Problem Exceptions
The number and content of the generated scenarios are constrained by the parameters entered by the requirements engineer (Figure 9). The content also depends upon the requirements engineer’s choice of what-if questions for the exceptions. The scenarios are generated once the requirements engineer initiates the generation process and are presented in the form of a list (Figure 10). The toolkit will provide an option for the user to view any individual scenario as a sequence diagram or as a structured description in terms of its constituents: basic semantics and appended exceptions.
Figure 9 Choosing Permutation Options for the Scenario Generation Mechanism
Figure 10 Generated Scenarios
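To make the walkthrough concrete, the following sketch (with hypothetical names and values) shows how the Figure 5 agent mapping and the chosen exceptions for the Resource-Loan chain might be assembled before generation:

```python
# Hypothetical assembly of the Resource-Loan walkthrough: the agent mapping of
# Figure 5 and a few selected exceptions, as they might be handed to the generator.
resource_loan_request = {
    "scenario_chain": "Resource-Loan",
    "agent_mapping": {"Borrower": "human (library user)",
                      "Lender": "human (library staff member)",
                      "Other Agent": "machine (computer system)"},
    "generic_exceptions": ["action 'issue book' has no end-event",
                           "goal state 'book on loan' is not achieved"],
    "problem_exceptions": ["Machine: power supply failure",
                           "Human-Machine Interaction: missing data-entry validation"],
    "generation_parameters": {"max_scenarios": 8},
}
print(resource_loan_request["agent_mapping"])
```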
5. Future Research Work
One of our immediate research goals is to identify taxonomies of problem exceptions for application domains which are instances of the 13 top-level OSMs of the general classification framework. We intend to perform empirical studies and to use knowledge elicitation techniques (Maiden and Rugg 1996) to assemble and validate the taxonomies in such application domains. We also propose to suggest corresponding generic requirements or guidelines to the requirements engineer for these derived taxonomies. The toolkit would then be populated with these application-domain-specific problem exceptions and their generic requirements. For example, a generic requirement to cater for an exception due to human-machine interaction could be ‘design for error tolerance’. This means: (a) make errors observable, and (b) provide error-recovery mechanisms. A requirements engineer would map this generic requirement into actual requirements in a scenario for effective human-machine interaction: (a) provide feedback by alarms and warning displays, and (b) provide reverse (compensating) actions. The generic requirements corresponding to application-domain problem exceptions can thus aid the requirements engineer in identifying new and complete requirements.
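A sketch of the mapping just described, from a problem exception to a generic requirement and on to concrete requirements; the keys and entries below are illustrative assumptions rather than the toolkit’s actual contents:

```python
# Illustrative lookup from a problem exception to a generic requirement and on to
# concrete requirements.
GENERIC_REQUIREMENTS = {
    "human-machine interaction: operator cannot detect own error": {
        "generic": "Design for error tolerance",
        "concrete": ["make errors observable (alarms, warning displays)",
                     "provide reverse (compensating) error-recovery actions"],
    },
    "machine: power supply failure": {
        "generic": "Maintain essential service under power loss",
        "concrete": ["uninterruptible power supply for safety-critical operation",
                     "manual paper-form fallback for a book-lending library"],
    },
}

def concrete_requirements(problem_exception: str) -> list:
    """Return the concrete requirements suggested for a known problem exception."""
    entry = GENERIC_REQUIREMENTS.get(problem_exception)
    return entry["concrete"] if entry else []
```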
We also plan to propose taxonomies by populating the classification framework across several other dimensions, such as cause-effect (consequence), severity-likelihood, or failure-mode and effect analyses of problem exceptions. This may involve incorporating some of the taxonomies existing in the literature: requirements engineering (Fields et al. 1995): logic errors in the software due to incorrect requirements; safety engineering (Hollnagel 1993), (Leveson 1995): studies of safety-critical systems, hazard analysis, accident analysis, task and human error analysis, etc.; usability and human factors engineering (Nielsen 1993) and cognitive engineering (Norman 1988), (Rasmussen and Vicente 1989), (Reason 1990), (Rasmussen et al. 1994): ecological interface design, models of human error, task models, human-task mismatches, diagnostic support, decision support and identification of decision requirements, etc.
In addition, we are currently developing a method for scenario analysis which would complement traditional task analysis techniques. This method would involve causal analysis (Rasmussen 1991), (Rasmussen et al. 1994), as a part of scenario analysis, to explore the occurrence of problem exceptions. It would involve forward and/or backward search methods of causal analysis. The forward search approach would identify or predict problem exceptions by following the causal path along the flow of events in a task, that is, given the causes, determine the consequences. The backward search approach would investigate previous histories of undesirable performance or failures and identify the causal factors or problem exceptions, that is, determine the causes from the effects. To further systematise and elaborate the method, we are also looking into other accident and hazard analysis techniques: Fault Tree Analysis, used in the aerospace, electronics and nuclear industries (Leveson 1995); HAZOP analysis (Leveson 1995), a hazard analysis technique used in the chemical process industry; and the automation of HAZOP and its application to software requirements specification through Deviation Analysis (Reese 1995). A comprehensive approach to scenario analysis guided by our proposed method will help in determining missing requirements or possible flaws in design due to incomplete requirements, which can contribute to the likelihood of inappropriate system performance.
References
Benner K.M., Feather S., Johnson W.L. and Zorman L.A. (1992) ‘Utilising Scenarios in the Software Development Process’, IFIP WG 8.1 Working Conference on Information Systems Development Process, 117-134.
Carroll J.M. (1995) ‘The scenario perspective on System Development’, in Scenario-based Design: Envisioning work and Technology in System Development, Ed. J. M. Carroll.
Carroll J.M., Mack R.L., Robertson S.P. and Rosson M.B. (1994) ‘Binding objects to scenarios of use’, International Journal of Human-Computer Studies, 41, 243-276.
Coad P., North D. and Mayfield M. (1995) ‘Object Models: Strategies, Patterns and Applications’, Englewood Cliffs, Prentice Hall.
Dowell J. and Finkelstein A.C.W. (1996) ‘A Comedy of Errors: the London Ambulance Case Study’, Proceedings 8th International Workshop on Software Specification and Design, IEEE Computer Society Press, 2-4.
Fields, R.E., Wright, P.C., and Harrison, M.D. (1995) ‘A Task Centred Approach to Analysing Human Error Tolerance Requirements’, Proceedings 2nd IEEE Symposium on Requirements Engineering, IEEE Computer Society Press, 18-26.
Gough P.A., Fodemski F.T., Higgins S.A. and Ray S.J. (1995) ‘Scenarios - an Industrial Case Study and Hypermedia Enhancements’, Proceedings 2nd IEEE Symposium on Requirements Engineering, IEEE Computer Society Press, 10-17.
Hollnagel E. (1993) ‘Human Reliability Analysis Context and Control’, Academic Press.
Hollnagel E. and Kirwan B. (1996) ‘Practical Insights from Studies of Operator Diagnosis’, Proceedings 8th European Conference on Cognitive Ergonomics, EACE, 133-137.
Hsi I. and Potts C. (1995) ‘Towards Integrating Rationalistic and Ecological Design Methods for Interactive Systems’, Georgia Institute of Technology, Graphics, Visualisation and Usability Centre Technical Report, 1-15.
Hsia P., Samuel J., Gao J., Kung, D., Toyoshima, Y. and Chen, C. (1994) ‘Formal Approach to Scenario Analysis’, IEEE Software, 11, 33-41.
Jacobson I., Christerson M., Jonsson P., and Overgaard G. (1992) ‘Object-Oriented Software Engineering: A Use-Case Driven Approach’, Addison-Wesley.
Jackson M. (1995) ‘Software Requirements and Specifications’, ACM Press/Addison-Wesley.
Jarke M., Bubenko Y., Rolland C., Sutcliffe A.G. and Vassiliou Y. (1993) ‘Theories Underlying Requirements Engineering: An Overview of NATURE at Genesis’, Proceedings 1st IEEE Symposium on Requirements Engineering, IEEE Computer Society Press, 19-31.
Leveson N.G. (1995) ‘Safeware: System Safety and Computers’, Addison-Wesley Publishing Co.
Lewycky, P. (1987) ‘Notes towards understanding of accident causes’, Hazard Prevention, 6-8.
Maiden N.A.M. (1996) ‘Scenario-based requirements acquisition and validation’, submitted to Journal of Automated Software Engineering.
Maiden N.A.M. and Sutcliffe A.G. (1996) ‘Analogical Retrieval in Reuse-Oriented Requirements Engineering’, Software Engineering Journal, 11, 281-292.
Maiden N.A.M. and Rugg, G. (1996) ‘ACRE: Selecting methods for Requirements Acquisition’, Software Engineering Journal, 11, 183-192.
Maiden N.A.M., Mistry P. and Sutcliffe A.G. (1995) ‘How People categorise Requirements for Reuse: a Natural Approach’, Proceedings 2nd IEEE Symposium on Requirements Engineering, IEEE Computer Society, 148-155.
Maiden N.A.M. and Sutcliffe A.G. (1994) ‘Requirements Critiquing Using Domain Abstractions’, Proceedings IEEE Conference on Requirements Engineering, IEEE Computer Society Press, 184-193.
Maiden N.A.M. and Sutcliffe A.G. (1993) ‘Requirements Engineering by Example: An Empirical Study’, Proceedings IEEE Symposium on Requirements Engineering, IEEE Computer Society, 104-112.
Maiden N.A.M. and Sutcliffe A.G. (1992) ‘Exploiting Reusable Specifications Through Analogy’, Communications of the ACM, 34, 55-64.
Nielsen, J. (1993) ‘Usability Engineering’, Academic Press, New York.
Potts C., Takahashi K. and Anton A.I. (1994) ‘Inquiry-Based Requirements Analysis’, IEEE Software, 11, 21-32.
Norman D.A. (1988) ‘The Psychology of Everyday Things’, Basic Books, New York.
Rasmussen J. (1991) ‘Event analysis and the problem of causality’, in J. Rasmussen, B. Brehmer and J. Leplat, editors. Distributed Decision Making: Cognitive Models for Co-operative Work, John Wiley & Sons, New York, 247-256.
Rasmussen J., Pejtersen A.M. and Goodstein L.P. (1994) ‘Cognitive Systems Engineering’, John Wiley & Sons, Inc.
Rasmussen J. and Vicente K.J. (1989) ‘Coping with Human Errors through System Design: Implications for Ecological Interface Design’, Int. J. Man-Machine Studies, 31, 517-534.
Reason J. (1990) ‘Human Error’, Cambridge University Press.
Reason J. (1987) ‘A Preliminary Classification of Mistakes’, in J. Rasmussen, K. Duncan, and J. Leplat, editors. New Technology and Human Error, John Wiley & Sons, New York, 45-52.
Reese J.D. (1995) ‘Software Deviation Analysis’, Ph.D. thesis, University of California, Irvine, California.
Reubenstein H.B. and Waters R.C. (1991) ‘The Requirements Apprentice: Automated Assistance for Requirements Acquisition’, IEEE Transactions on Software Engineering, 17, 226-240.
Sutcliffe A.G. and Rugg G. (1994) ‘A taxonomy of error types for failure analysis and risk assessment’, Technical Report no. HCID/94/17, Centre for HCI Design, City University, London.
Sutcliffe A.G. (1997) ‘A Technique Combination Approach to Requirements Engineering’, Proceedings IEEE International Symposium on Requirements Engineering, 65-74.
The Risks of Automation:
Some Lessons from Aviation, and Implications for Training and Design
Paul J. Sherman, William E. Hines, and Robert L. Helmreich
The University of Texas at Austin
In December 1995, the crew of an American Airlines Boeing 757 approaching Cali, Colombia, attempted to route the aircraft toward their destination by entering into the flight management computer (FMC) a substring of the code for a Cali navigational beacon. The computer’s database of navigational beacons contained two very similar codes, one for the beacon near Cali and one for a beacon at the Bogota airport. Presented by the FMC with a list of nearby beacons matching the entered code, the crew initiated an overlearned behavior, selecting the computer’s first presented alternative: an overwhelming majority of the time, the first presented alternative is the crew’s intended choice. Unfortunately, this time the FMC had presented the Bogota beacon first. The flight management computer dutifully began to turn the aircraft toward Bogota. The input error went undetected. Shortly after this, the aircraft crashed into the side of a mountain, killing all on board.
Automation In Aviation: A Partially-fulfilled Promise
Automation, defined as the replacement of a human function, either manual or cognitive, with a machine function (Wiener, Chidester, Kanki, Palmer, Curry & Gregorich, 1991), has been deployed in complex, team-oriented processes such as aviation, industrial, maritime, and surgical endeavors with the intent of preventing human error, aiding team members in the accomplishment of tasks, and increasing efficiency (Drury, 1996; Lee & Sanquist, 1996; Meshkati, 1996). Since the widespread application of automation in aviation beginning in the early 1980s, the overall accident rate has decreased. However, this overall decrease masks a trend of incidents and accidents wholly or partly attributable to air crews’ interaction with automated systems (Billings, 1997). In several instances, automated systems have behaved in accordance with their design specifications but were used at inappropriate times. In other cases, operators have failed to detect system malfunctions, or the automation has failed to inform crews of them (Billings, 1997).
Despite the best intentions of aircraft and software designers, automation cannot completely eliminate human error, and may in some cases exacerbate it (Wiener, 1993b). For example, when it becomes necessary for an operator to disengage automation and carry out a process manually, as is often the case in aviation, the transition from automated to manual control can add to operators’ workload in an already task-saturated situation, increasing the likelihood of errors occurring. In a series of laboratory studies using college students and professional pilots performing flight tasks in low-fidelity simulators, Parasuraman and colleagues demonstrated that automation use can lead to complacency in monitoring and lessened awareness of automation failure (Parasuraman, Mouloua, Molloy, & Hilburn, 1993), especially when it is very reliable (Parasuraman, Molloy, & Singh, 1993), and when it is used for extended periods (Hilburn, Molloy, Wong, & Parasuraman, 1993). Additionally, if automated control is used often and the manual tasks are complex, operators’ manual skills may atrophy between periods of manual control (Bainbridge, 1983). Research in aviation has shown that almost 90% of pilots flying one type of highly automated commercial aircraft reported manually flying for a portion of the flight, in order to preserve basic flying proficiency (Wiener, 1989; see also McClumpha, James, Green, & Belyavin, 1991). To quote from a captain with nearly 20 years of experience in commercial flying:
I personally have always clicked off the automation in order to keep my personal flying skills sharp. This practice, until recently, has been in violation of the company written procedures and training philosophy… [Among other pilots] I have observed lowered basic skills, including roughness on the controls, lowered situational awareness, and falling behind the airplane when hand flying (Sherman, 1997).
The difficulties inherent to operator-automation interaction become more challenging when automation is utilized in a team-based (as opposed to an individual-based) operation. As automation supplants teams of operators performing multiple tasks, the intangible benefits stemming from the interaction of humans in a complex system (i.e., the opportunity for richer, more flexible information transfer) are reduced (Danaher, 1980), leaving opportunity for the commission of errors. As research and real-world incidents and accidents show, an automated system cannot always prevent operators from using it incorrectly or providing it with incorrect input (Wiener, 1993a; Billings, 1997). As a result, if operators do not verbalize and confirm their inputs to an automated system, then ‘setup’ errors can beget performance errors.
An Expanded View of Automation Roles and Responsibilities
Study of these issues in aviation has led researchers and industry groups to consider flight deck automation as a part of the air crew team (Air Line Pilots Association [ALPA], 1996; Billings, 1997; Woods, 1996). In this view, automation is an "electronic crew member" that performs group process functions and interacts with live crew members (Helmreich, 1987, p. 70). This understanding of automation strongly suggests that the manner in which air crews work with their "electronic peer" (Helmreich, Chidester, Foushee, Gregorich & Wilhelm, 1990, p. 13) has become similar to the manner in which crews work with each other--in order to prevent the commission of error, crews must assess and verify the performance of automation, and ensure that it is not committing an error or creating the conditions for the occurrence of error. Considered in this way, it becomes obvious that a well-designed system should also fulfill its team roles. Ideally, an automated system should keep operators informed of its present and future actions and notify the operators in the event of any and all abnormalities (Air Line Pilots Association [ALPA], 1996; Billings, 1997; Federal Aviation Administration [FAA], 1996). However, automated systems, unlike human team members, are not always capable of informing team members when they are executing an action at an inappropriate time, committing an error, or creating conditions for commission of error.
Fortunately, global principles of aircraft automation training and design influenced by these and other observations have been formulated to guide the future application of automation. Human-centered or crew-centered automation (ALPA, 1996; Billings, 1997), is a set of recommendations asserting that automation should be designed and utilized to support operators in successfully accomplishing a process, complementing human abilities and compensating for human limitations, and remaining subordinate to operators. Accordingly, automation should be designed and used taking into account the overall combined performance of the operators and the automated system, instead of merely optimizing the performance of isolated portions of the total system (ALPA, 1996). Recalling the previously-mentioned problems that are associated with use of automation, it is evident that applying these principles to the design and use of systems may be an effective way to avoid and/or mitigate the commission of error in future endeavors.
It is argued here that the design and use of automated systems should be guided by the specific principles of crew-centered automation and the broad, systems view of error management, a view of team-based performance that acknowledges the ubiquity of human error and promotes the development of team-based strategies to avoid, trap, and mitigate the effect of errors (Merritt & Helmreich, in press; Reason, 1990; Wiener, 1993a). To this end, we provide some empirical justification for the application of crew-centered automation and error management principles by presenting data from ongoing studies of commercial pilots’ attitudes toward use of automation, as well as measures of team performance in highly automated aircraft during actual line operations.
Making the Case: Data from Pilots and Expert Observers
In an ongoing study of air crew performance in actual line operations, our research group, the University of Texas at Austin Aerospace Crew Research Project, has gathered team performance data on more than 2,600 flight segments, in four major U.S. airlines and numerous fleets. Almost half of these were observations of crews in highly automated aircraft (‘highly automated’ refers to an aircraft that has a flight management system, or FMS, capable of both lateral and vertical navigation). Expert observers trained in evaluating air crews’ human factors and technical skills gathered ratings of overall team performance as well as of specific aspects of it. A team of researchers from the University of Texas gathered these data in each airline; pilots from each airline gathered data in their respective airlines only. The specific aspects of performance that were rated include concepts such as team formation and maintenance, leadership and followership, and communication and coordination of intent and plans, as well as six items tapping crews’ use of automation (i.e., whether crews communicate changes to automated system parameters, remain vigilant for automation failures, and use different automation capabilities at appropriate times). Crews were rated on a four-point scale, with 1 = "unacceptable", 2 = "minimally acceptable", 3 = "standard", and 4 = "exemplary". For most items, ratings can be assigned across four phases of flight (predeparture, takeoff/climb, cruise, and approach/landing).
Sub-optimal team performance was observed in all aircraft types, and at all air carriers. Analyses across all carriers show that approximately 30% of crews from automated aircraft earned one or more below-standard ratings (i.e., a rating of 1 or 2) on some aspect of automation use at some point during the flight (Helmreich & Hines, 1997). This percentage ranged from 18% to 35% across organizations. Further examination reveals that crews typically received below-standard ratings because they failed to inform one another of changes to automated system parameters (e.g. entering a waypoint or altering a descent profile without verbalizing the action), used automation at inappropriate times, or did not remain sufficiently aware of what actions the FMS was initiating (Helmreich, Hines, & Wilhelm, 1996; Wilhelm, Hines, & Helmreich, 1996).
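As an illustration of how such a figure could be computed from the four-point ratings described above (the crews and ratings below are invented toy values, not the study’s data):

```python
# Toy illustration only: flag a crew as below standard if any automation-use item
# in any flight phase was rated 1 or 2 on the four-point scale.
def below_standard(crew_ratings: dict) -> bool:
    """crew_ratings maps (automation_item, flight_phase) -> rating on the 1-4 scale."""
    return any(rating <= 2 for rating in crew_ratings.values())

toy_crews = [
    {("verbalises FMS changes", "cruise"): 3, ("vigilant for automation failures", "approach"): 2},
    {("verbalises FMS changes", "cruise"): 4, ("vigilant for automation failures", "approach"): 3},
]
share = sum(below_standard(c) for c in toy_crews) / len(toy_crews)
print(f"{share:.0%} of the toy crews received a below-standard automation rating")
```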
These data clearly suggest that a fairly large minority of pilots either are not fully aware of or discount the risks of using automation in a dynamic environment. This assertion is supported by associated survey data. Our research group has administered the Flight Management Attitudes Questionnaire (FMAQ; Merritt, Helmreich, Wilhelm, & Sherman, 1996), a cross-cultural survey tool that measures flight crews’ attitudes toward leadership and followership, crew coordination, interpersonal communication, and stress, in over 30 airlines across 20 nations. In a study of 5,879 pilots across 12 nations that described attitudes toward and preference for automation, it was found that 4% to 21% of pilots, depending upon their nation of origin, did not endorse ensuring that other crew members acknowledge their FMC entries. Approximately 10% to 50% of pilots did not endorse increased cross-monitoring of crew actions on the automated flight deck. Furthermore, only 36% to 66% of pilots (again depending upon their nation of origin) endorsed the avoidance of high-risk activities such as attempting to ‘reprogram’ (pilots’ term for re-entering data) an approach during high workload situations. Finally, a considerable number of pilots (ranging from 24% to 50% across nations) agreed that there are modes and features of the FMS that they do not completely understand (Sherman, Helmreich, & Merritt, in press).
Applying the Data
The data presented here should not be taken solely as a criticism of air crews; rather, they should be viewed as a critique of training for the use of automated systems and of the principles governing systems design, as well as of air crews’ use of automation. Pilots’ knowledge, skills, and habits are to a large extent the products of the training that they have received. If automated flight deck training does not explicitly teach that error is ubiquitous, that automation should not be blindly trusted, and that automation should be used when it maximizes the synergy between automation and operators, then pilots cannot be held solely responsible for some of the deleterious outcomes of automation use.
Ideally, when pilots transition to an automated flight deck, they should be trained not only in how to use automated systems; they should also be trained in when to use (and when to not use) automation. Furthermore, this training should be informed by the most current empirically derived information regarding human-automation interaction. The data reviewed above suggest specific areas that could be addressed in training.
Several air carriers (including some of those contributing data to the studies described above) do have some form of automation use ‘philosophy’ that addresses some of the issues described above, and many of these carriers attempt to disseminate this information in air crew training. Judging from the data, the organizations have met with limited success in transferring their philosophies into actual practice. As of this writing, the most comprehensive effort to implement an empirically informed philosophy of automation use is underway at American Airlines (AAL, 1996). We encourage these efforts, and hope that other organizations will pursue a similar course of action.
These data also have implications for the design of automated systems. Judging from the universality of the performance and attitude issues described above, it seems that the automated systems currently in use in aviation are not completely successful in optimizing the performance of both the operators and the systems they control. Automated systems designers, too, have a responsibility to ensure that the design principles they employ take into account both the weaknesses and the strengths of human performance, and attempt to optimize the performance of both the automation and the operators. As Wiener and Curry (1980) observe, more attention must be paid to the effects automation can have on users’ behavior. Fortunately, human factors research over the past decade has led to the development of empirically informed guidelines for the design and deployment of automation in dynamic environments. Space considerations preclude full descriptions of their derivation and content here, but a summary of their requirements is provided. In short, automation of a function should do the following (Billings, 1997):
• Let operators retain sufficient command of the operation
• Allow the operators to remain sufficiently involved (i.e., play an active role) such that they can recognize and ameliorate deviations from intended conduct of an operation
• Keep the operators appropriately informed regarding automated system behavior
• Allow the system behavior to be sufficiently predictable, and
• Guard against error by monitoring the operators.
To these, we would add the requirement that automation should also guard against error by aiding crew cross-monitoring tasks to the greatest extent possible. This requirement would make explicit the need for automated systems to contribute positively to the maintenance of team performance.
In sum, before an automated solution is applied to a task, systems designers should follow the recommendations of ALPA (1996) and Billings (1997, p. 245), and ask themselves the following questions: Why is this function being automated? Will automating the new function improve the system capabilities or flight crew [situation] awareness? Would not doing so improve the operators’ involvement, information, or ability to remain in command?
Conclusion: What Is To Be Done?
The data amassed thus far show that automation does not completely eliminate error (and may in fact contribute to its occurrence), and strongly suggest that the design and deployment of future automated systems should be accomplished with guidance from the principles of crew-centered automation and error management. However, if design and deployment are to be truly informed by these principles, guidelines specific to each design and deployment situation must be promulgated. In order to accomplish this, further study of operators, automation, and their interaction must be carried out by designers, trainers, and user groups. Ideally, multiple methods should be used to converge upon issues specific to the domain in which automation is deployed.
These tasks are obviously beyond the scope of the present work, and indeed probably beyond the scope of any one research group’s mission. We hope, however, that in providing an empirical justification for the application of these principles in one domain where automation is used in a dynamic environment, we will have helped demonstrate the necessity for a more informed approach to the automation of tasks.
References
Air Line Pilots Association (1996, September). Automated cockpits and pilot expectations: A guide for manufacturers and operators. Herndon, VA: Author.
American Airlines (1996, October). Advanced Aircraft Maneuvering Program. Dallas, TX: Author.
Bainbridge, L. (1983). Ironies of automation. Automatica, 19, 775-779.
Billings, C.E. (1997). Aviation automation: The search for a human-centered approach. Mahwah, NJ: Lawrence Erlbaum Associates.
Danaher, J.W. (1980). Human error in ATC systems operations. Human Factors, 22, 535-545.
Drury, C.G. (1996). Automation in quality control and maintenance. In Parasuraman, R. & Mouloua, M. (Eds.), Automation and Human Performance (pp. 407-426). Mahwah, NJ: Lawrence Erlbaum Associates.
Federal Aviation Administration (1996). FAA Human Factors Team report on the interfaces between flightcrews and modern flight deck systems. Washington, DC: Author.
Helmreich, R.L. (1987). Flight crew behaviour. Social Behaviour, 2, 63-72.
Helmreich, R.L., Chidester, T.R., Foushee, H.C., Gregorich, S., & Wilhelm, J.A. (1990). How effective is cockpit resource management training? Flight Safety Digest, 9(5), 1-17.
Hilburn, B., Molloy, R., Wong, D., & Parasuraman, R. (1993). Operator versus computer control of adaptive automation. In R.S. Jensen and D. Neumeister (Eds.), Proceedings of the Seventh International Symposium on Aviation Psychology (pp. 161-166). Columbus, Ohio: Ohio State University.
Lee, J.D., and Sanquist, T.F. (1996). Maritime automation. In Parasuraman, R. & Mouloua, M. (Eds.), Automation and Human Performance (pp. 365-384). Mahwah, NJ: Lawrence Erlbaum Associates.
McClumpha, A.J., James, M., Green, R.G., & Belyavin, A.J. (1991). Pilots' attitudes to cockpit automation. Proceedings of the Human Factors Society 35th Annual Meeting, 107-111.
Merritt, A.C., & Helmreich, R.L. (in press). CRM: I hate it, what is it? (Error, stress, and culture). Proceedings of the Orient Airlines Association Air Safety Seminar, Jakarta, Indonesia, April 23, 1996.
Merritt, A.C., Helmreich, R.L., Wilhelm, J.A., & Sherman, P.J. (1996). Flight Management Attitudes Questionnaire 2.0 (International) and 2.1 (USA/Anglo). (Technical Report 96-04). Austin, TX: University of Texas.
Meshkati, N. (1996). Organizational and safety factors in automated oil and gas pipeline systems. In Parasuraman, R. & Mouloua, M. (Eds.), Automation and Human Performance (pp. 427-448). Mahwah, NJ: Lawrence Erlbaum Associates.
Parasuraman, R., Mouloua, M., Molloy, R., & Hilburn, B. (1993). Adaptive function allocation reduces performance costs of static automation. In R.S. Jensen & D. Neumeister (Eds.), Proceedings of the Seventh International Symposium on Aviation Psychology (pp. 178-181). Columbus, Ohio: Ohio State University.
Parasuraman, R., Molloy, R., & Singh, I. (1993). Performance consequences of automation-induced "complacency". The International Journal of Aviation Psychology, 3(1), 1-23.
Reason, J. (1990). Human error. New York: Cambridge University Press.
Sherman, P.J. (1997). Aircrews’ evaluations of flight deck automation training and use: Measuring and ameliorating threats to safety. Unpublished doctoral dissertation. The University of Texas at Austin.
Wiener, E.L. (1989). Human factors of advanced technology ("glass cockpit") transport aircraft. (NASA Contractor Report 177528). NASA-Ames Research Center, Moffett Field, CA.
Wiener, E.L. (1993a). Intervention strategies for the management of human error. (NASA Contractor Report 4547). NASA-Ames Research Center, Moffett Field, CA.
Wiener, E.L. (1993b). Crew coordination and training in the advanced technology cockpit. In E.L. Wiener, B.G. Kanki, & R.L. Helmreich (Eds.), Cockpit resource management (pp. 199-229). San Diego: Academic Press.
Wiener, E.L., Chidester, T.R., Kanki, B.G., Palmer, E.A., Curry, R.E., & Gregorich, S.E. (1991). The impact of cockpit automation on crew coordination and communication: Overview, LOFT evaluations, error severity, and questionnaire data (NASA Contractor Report 177587). NASA-Ames Research Center, Moffett Field, CA.
Wiener E.L., & Curry, R.E. (1980). Flight deck automation: Promises and problems. Ergonomics, 23, 995-1011.
Woods, D.D. (1996). Decomposing automation: Apparent simplicity, real complexity. In Parasuraman, R. & Mouloua, M. (Eds.), Automation and Human Performance (pp. 3-18). Mahwah, NJ: Lawrence Erlbaum Associates.
Human Error Analysis To Guide System Design
Frédéric Vanderhaegen and Cristina Iani
Laboratoire de Mécanique et d'Automatique Industrielles et Humaines, URA CNRS 1775
Université de Valenciennes et du Hainaut-Cambrésis
B.P. 311 - Le Mont Houy - 59304 VALENCIENNES Cedex - FRANCE
E-mail: {vanderhaegen, iani}@univ-valenciennes.fr
The paper reviews aspects of system development. It focuses on human error analysis, which supports both off-line error prevention, by specifying the future system in detail, and on-line error prevention, by proposing human support tools. A model of unreliability is proposed to describe both human and machine dysfunctions. It is applied to a railway system, for which accident analysis identifies critical and sensitive areas of unreliability. This identification aims at defining practical solutions for designing more reliable systems.
In complex systems, human behaviour and human error play a major role in many accidents. Therefore, to improve safety and reliability, all human-related factors that may influence error occurrence have to be an integral part of risk assessment, system design and realization. Considering human factors such as perception limits (which influence information gathering and processing), decision making, stress and workload may help to develop measures that prevent incidents and limit error consequences. However, the safety analysis must also integrate characteristics of human behaviour, such as its variability and adaptability to the environment, and critical elements, such as the information flow, that characterize the interaction between different system components.
Moreover, in the design process, safety analysis has to treat both human and technical components as not only reliable but also fallible. A priori safety analysis methods therefore aim at optimizing system dependability and defining the safest allocation of roles between human and machine. Even though they also aim at identifying and eliminating parameters that might cause human error, they should additionally be used to design human support tools for decision, action or recovery. Indeed, human error studies for system design generally stop at the ergonomics level, without taking into account the continuing risk that humans will repeat errors despite ergonomic design. The resulting analyses then have to be used in different ways:
• For off-line error prevention: training programmes, awareness programmes, and the specification phase of system development.
• For on-line error prevention: error avoidance centered support tools, task distribution centered support tools, and error recovery centered support tools.
The paper focuses on the value of human error study and analysis for orienting the design process of future human-machine systems. It therefore presents an off-line error prevention approach for system specification which aims at proposing on-line error prevention support tools. It describes the different system development steps and proposes a model of unreliability which can help to identify critical and sensitive areas of an unreliable system and to address them with both off-line and on-line preventive solutions. The case of a railway system illustrates the use of this model.
Development of a system
The main objective of system development is to design a dependable system. After defining dependability, a methodology for system development, and more precisely for system specification, is proposed. The paper focuses on off-line error prevention approaches which specify on-line error prevention support tools.
System dependability
Dependability is defined as the ability of a system component to satisfy one or several required functions under specific conditions (Villemeur, 1988). A dependable system is then a system which meets the objectives for which it was designed while respecting both the required safety and productivity levels. This definition implies that the system runs without accidents or incidents that could compromise safety or productivity. Dependability is generally characterized by four distinct features (Villemeur, 1988; Laprie, 1995):
• Safety, to avoid critical or catastrophic consequences for humans, machines and the environment. It is the ability of a system component to avoid this kind of event in given conditions.
• Availability, related to the fact that a system has to be ready for use. It is the ability of a system component to realize a required function, in given conditions, at a given time or during a given interval of time.
• Reliability, to maintain a continuous service. It is the ability of a system component to realize a required function in given conditions and during a given interval of time (a minimal quantification sketch for reliability and availability follows this list).
• Maintainability, related to the repair and evolution of system components. It is the ability of a system component to be maintained or repaired during a given interval of time in order to be able to realize a required function.
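Two of these attributes are routinely quantified under a constant failure-rate assumption. The following minimal sketch (our own illustration, not taken from the paper; the numerical figures are hypothetical) shows the standard formulas R(t) = exp(-lambda t) and A = MTBF / (MTBF + MTTR) in Python.

# Minimal sketch (our illustration, not from the paper): quantifying
# reliability and availability under a constant failure-rate assumption.
import math

def reliability(failure_rate_per_hour, mission_hours):
    # R(t) = exp(-lambda * t): probability of no failure over the mission.
    return math.exp(-failure_rate_per_hour * mission_hours)

def availability(mtbf_hours, mttr_hours):
    # Steady-state availability = MTBF / (MTBF + MTTR).
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures for a signalling component.
print("R(8 h) =", round(reliability(1e-4, 8), 4))
print("A      =", round(availability(5000, 4), 5))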
General aspects for system development
A dependable system can be designed following different development models: the cascade model, the V model, the spiral model or other adapted models (Boehm, 1988). As an example, the V model decomposes system development into two main interactive phases with retroactive steps, Figure 1 (Jaulent, 1990):
• The top-down phase, which includes a preliminary step for the system requirements, the specification step for system objectives, the step for reporting all the required specifications, the system design, and the system realization.
• The bottom-up phase, which includes system integration on site, system validation and a test phase, an operational phase for production, and a possible specification of procedures to be followed at the end of the system life cycle.
Dependability-related tasks are required during the whole system life cycle. Before specification reports are produced, preliminary analyses of safety, dangers and/or risks have to determine the safety objectives, safety demands, system safety functions and safety procedures. The specification phase takes into account each phase of the system life cycle.
Figure 1. System life-cycle phases
Specification reports are therefore composed of, for example, a system architecture report, recommendation reports for design and realization, test reports, operational control and maintenance reports to be used during the operational phase, and dismantling or demolition reports for the end of the operational phase. Nevertheless, these development models do not explicitly integrate human operators, although operators actively participate in the real-time operation of the system.
Human centered specification
When the specification phase concerns an existing system to be refined, a human centered specification has to take into account not only the process characteristics but also the activity of the human operators who control and supervise it, Figure 2. All interactive and retroactive steps are oriented by the initial system objectives to be achieved, e.g. conditions of system dependability, system ergonomics, and human and machine roles.
Figure 2. Experimental approach for human centered specification
The system analysis studies the present human-machine system and integrates the new system goals. It consists of analysing both the process and the human activity, and has to take into account different contexts, including both normal and abnormal function. System analysis thus makes it possible both to identify the prescribed tasks, from the process analysis, and to model the actual tasks performed by the human operators, from the analysis of their activity. Differences between prescribed tasks and actual tasks provide the first requirements for the design step. Activity analysis can be based on both objective data (e.g. ocular activity analysis, postural activity analysis, communications analysis) and subjective data (e.g. verbalization methods, classification methods) collected during both normal and abnormal activities.
The task analysis aims at determining what information is needed for process supervision and at identifying the action sequences to be carried out by the human operators to solve problems. It therefore provides temporal, functional and informational characteristics for each task, and a task classification: tasks that may be entirely automated, tasks that cannot be executed by an automated tool, and tasks that can be shared.
System modeling consists of modeling the characteristics of the technical and human components. It includes process models and human models in both normal and abnormal function. System modeling can be refined by simulation or, when possible, by observations in the field following well-defined methodological protocols. Together, the task analysis and the system modeling, which give the abilities and limits of humans and machines, provide recommendations for designing an experimental system.
The experimental design process uses the conclusions of the previous steps to characterize the human operator's role in the control and supervisory loop, the level of automation and the human-machine cooperation modes. Before the final human-machine system is realized, the last design step consists of specifying the dialogue interfaces.
After the designed system has been built and experimental protocols to study it have been defined, analysis of the results will validate or modify the proposed design. The evaluation can include an assessment of global performance, i.e. the difference between the real production of the pilot system and the expected production. Moreover, ergonomic criteria can explain some user problems, using methods that help to reconstitute unobservable human mental activity; these are based on subjective evaluation methods and on measurable parameters that assess, for example, the human workload. An appropriate experimental protocol defines the evaluation contexts and criteria to be taken into account during the experiments. Finally, an analysis of the recorded subjective and/or objective data on the tasks actually performed permits the validation or refinement of the specifications of the proposed human-machine system.
This approach to system design must take human error into account not only to propose ergonomic design solutions that limit human error risks, but also to design on-line error prevention supports.
On-line error prevention support
Such support tools can be designed to intervene at various levels, Figure 3 (Vanderhaegen, 1995; Vanderhaegen, Telle and Moray, 1996):
• Before action. This decision oriented support system provides advice to human operators who act alone on the process. This passive preventive approach aims at avoiding error and can be useful to support alarm filtering or diagnosis.
• On the process directly. This action oriented support system requires an organization in which the assistance tool and the human operator are at the same decisional level. Control and supervisory tasks can then be distributed between human and computer in order to relieve the human operator in overload situations. The corresponding task sharing policy is either subjective, in a manual allocation, or objective, in an automated allocation, related to the task characteristics or to the human behavior, in order to obtain an optimal regulation of system performance. This proactive approach aims both at reducing the risk of human error and at regulating workload.
• After action. This recovery support gives the human operator feedback about any erroneous actions and adds advice on intervention and fault management. It also permits active intervention in the process, such as over-ruling commands or emergency stops. This approach to fault management thus acts as an intelligent watchdog which filters human actions according to evaluation criteria such as a list of possible erroneous actions or a cognitive model (a minimal watchdog sketch is given after Figure 3).
Figure 3. On-line error prevention support tools
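As an illustration of the third level, the following minimal sketch (our own, not the authors' tool; the state names, commands and advice strings are hypothetical) shows how an intelligent watchdog might filter operator commands against a list of possible erroneous actions in the current process state.

# Minimal sketch (an assumption of ours, not the authors' tool): an
# "intelligent watchdog" that checks each operator command against a
# list of (state, command) pairs known to be erroneous, and either
# blocks the command or accepts it. All names are hypothetical.
ERRONEOUS_ACTIONS = {
    ("track_occupied", "authorise_entry"): "track already occupied: entry should be denied",
    ("overspeed", "increase_power"): "speed above limit: power should be reduced first",
}

def watchdog(process_state, command):
    # Return (allowed, feedback) for a proposed operator action.
    advice = ERRONEOUS_ACTIONS.get((process_state, command))
    if advice is not None:
        return False, "Command blocked: " + advice
    return True, "Command accepted."

print(watchdog("track_occupied", "authorise_entry"))
print(watchdog("track_clear", "authorise_entry"))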
Unreliability based study
Error study can be approached from the negative point of view, i.e. in terms of the unreliability of humans and machines. Starting from a definition of the concept of error, methods to evaluate human reliability are presented; they can be used to build human error models. Nevertheless, they are statistics-based methods which can hardly be applied to predict probable human behaviour. Therefore, a model of unreliability is proposed to study malfunctions. It is a descriptive and causal model which includes the interactions between humans and machines.
Error concept
Human reliability is defined as the probability that a human operator (1) performs the required tasks correctly in the required conditions and (2) does not perform tasks which may degrade system performance (Miller and Swain, 1987). Human reliability study therefore requires the study of human error in relation to the dependability features of a system.
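This probabilistic definition underlies the basic human error probability (HEP) estimate used by the quantitative methods discussed later in the paper. A minimal sketch, with hypothetical counts of our own, is given below.

# Minimal sketch (hypothetical counts): the basic human error probability
# estimate used by quantitative human reliability methods.
def human_error_probability(errors_observed, opportunities):
    # HEP = observed errors / opportunities for error.
    return errors_observed / opportunities

hep = human_error_probability(errors_observed=3, opportunities=1200)
print("HEP =", round(hep, 4), "; P(correct performance) =", round(1 - hep, 4))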
From a process point of view, an error is a deviation of an actual action from that required by the process (Leplat, 1990). This deviation results from the combination of dynamic human abilities and system requirements (Rasmussen, 1988). One should note, however, that when speaking of a human agent the approach is different: "The notions of intention and error are inseparable" (Reason, 1990).
This means that only humans can make errors. But if one considers the case of intelligent systems, the notions of intention and error may be introduced as well: an intelligent system has a set of possible actions for a single input, and may select a solution and examine its effects by entering a "what if" mode (Edwards, 1991). Among error taxonomies, different points of view can be found in the human factors literature (Norman, 1988; Rasmussen, 1986; Reason, 1990; Hollnagel, 1991; Masson and De Keyser, 1995; Van der Schaaf, 1995; Frese, 1996).
A taxonomy of particular interest is Reason's, which classifies errors according to the failures which generate them (Reason, 1990):
• Lapse: memory failure (omitting planned items, place-losing, forgetting intentions).
• Slip: attentional failure (intrusion, omission, reversal, misordering, mistiming).
• Mistake: rule-based mistakes (misapplication of a good rule, application of a bad rule); knowledge-based mistakes (many variable forms).
• Violation: routine violations; exceptional violations; acts of sabotage.
From a technical point of view, a failure is declared when an entity loses the ability to perform its required function (Villemeur, 1988). From the human factors point of view, on the other hand, it is difficult to find a generic definition of failure, but it is generally taken to be the set of behaviours which generate unreliable human actions. In addition, a failure can be described in terms of its features, i.e. its causes and modes (a minimal data-structure sketch follows this list):
• Failure causes: principal (the physical entity is out of order); secondary (propagated from the failure of other entities); of command (an inappropriate command generated by the control system). Note that inappropriate commands are not always erroneous actions.
• Failure modes: domain (output value, or inadequate triggering); detection (failure of supervisory agents); consequence (level of system tolerance).
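The following minimal sketch (illustrative only; the field names and the example record are our own, not the authors') shows how a failure could be recorded in terms of the cause and mode features just listed, so that accident data can be tabulated.

# Minimal sketch (illustrative only): recording a failure in terms of the
# cause and mode features listed above. Field names and the example
# record are our own, not the authors'.
from dataclasses import dataclass
from enum import Enum

class FailureCause(Enum):
    PRINCIPAL = "physical entity out of order"
    SECONDARY = "propagated from the failure of another entity"
    COMMAND = "inappropriate command from the control system"

@dataclass
class FailureRecord:
    entity: str         # which human or technical component failed
    cause: FailureCause
    domain: str         # e.g. output value or inadequate triggering
    detection: str      # how (or whether) supervisory agents detected it
    consequence: str    # level of system tolerance to the failure

example = FailureRecord(
    entity="yard status display",                  # hypothetical example
    cause=FailureCause.PRINCIPAL,
    domain="frozen output value",
    detection="not detected by supervisory agents",
    consequence="operators acted on stale information",
)
print(example)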
Error modeling
For a design feasibility study based on an a priori dependability study intended to specify on-line error prevention support tools, two approaches, based on feedback from field studies or simulations, can be considered:
• Approaches related to normal function models. An abnormal function is declared when behaviour leaves the limits of the normal function models, with respect to thresholds (a minimal detection sketch follows this list).
• Approaches related to abnormal function models. These models represent the characteristics of abnormal situations, e.g. the characteristics of possible failures, of the causes or consequences of those failures, and of the procedures which will return the system to a normal state.
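For the first approach, the sketch below (our own, with hypothetical nominal values, thresholds and speeds) shows how an abnormal function can be declared when an observed variable leaves the envelope predicted by a normal function model.

# Minimal sketch (our own, hypothetical values): an abnormal function is
# declared when the observed value leaves the envelope predicted by a
# normal function model, here a nominal value with a tolerance threshold.
def abnormal(observed, nominal, tolerance):
    # Deviation from the normal model beyond the threshold -> abnormal.
    return abs(observed - nominal) > tolerance

for speed in (12.0, 27.0):      # hypothetical yard speeds
    status = "abnormal" if abnormal(speed, nominal=15.0, tolerance=10.0) else "normal"
    print(speed, status)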
The second approach is presently based on two kinds of a priori analyses: machine centered methods and human centered methods (Swain and Guttmann, 1983; Embrey, 1986; Villemeur, 1988). The machine centered methods, which can be adapted for human error studies, include:
• Inductive methods, i.e. from the causes of faults to their consequences for the process. These include FMECA (Failure Mode, Effects and Criticality Analysis), HAZOP (HAZard and OPerability study), PRA (Preliminary Risk Analysis) and the Consequence Tree method. They help to characterize system deviations, failure modes, dangerous situations or unacceptable event sequences.
• Deductive methods, i.e. from consequences to causes. These include the Fault Tree method, used to assess the probability of error (a minimal fault-tree evaluation sketch follows this list), and the Markov Network method, used to define safety parameters.
• Combined methods such as MDCC (Méthode du Diagramme Causes-Conséquences), which combines the Fault Tree method with the Consequence Tree method.
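As an illustration of the deductive, fault-tree style of analysis, the following minimal sketch (our own; the basic events, probabilities and tree structure are hypothetical, and basic events are assumed independent) evaluates the probability of a top event through AND and OR gates.

# Minimal sketch (our own; hypothetical events and probabilities,
# independence assumed): probability of a fault-tree top event.
def and_gate(probabilities):
    # All inputs must occur: product of independent probabilities.
    p = 1.0
    for q in probabilities:
        p *= q
    return p

def or_gate(probabilities):
    # At least one input occurs: complement of none occurring.
    p_none = 1.0
    for q in probabilities:
        p_none *= (1.0 - q)
    return 1.0 - p_none

# Hypothetical top event: "train enters an occupied track" requires a
# wrong authorisation AND a missed warning; the missed warning itself
# has two possible causes.
p_missed_warning = or_gate([1e-2,   # radio volume too low
                            5e-3])  # distraction in the cab
p_top = and_gate([1e-3,             # wrong authorisation
                  p_missed_warning])
print("P(top event) =", p_top)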
Different human oriented approaches to reliability study can also be used:
• Quantitative and predictive methods related to the assessment of human error rates and of their effects on the system, such as THERP (Technique for Human Error Rate Prediction) or TESEO (Tecnica Empirica Stima Errori Operatori).
• Descriptive methods related to human behavior analysis, task analysis, error analysis, or human-machine interaction analysis, such as Critical Incident Analysis, OAET (Operator Action Event Tree) or approaches based on human error taxonomies.
• Combined methods such as HEART (Human Error Assessment and Reduction Technique) or SHERPA (Systematic Human Error Reduction and Prediction Approach).
These methods are either machine centered or human centered, whereas the reliability of a system involves both humans and machines. Even though machine centered methods can be adapted for human behaviour assessment, they are difficult to apply because they require data on human behaviour that cannot be tested in the way technical components can. Moreover, human centered methods are not homogeneous, i.e. the results they give can differ between methods (Reason, 1990). It therefore seems that error study should not rest on statistics alone. The model of unreliability developed below is both a descriptive and a causal model which aims at orienting the specification of on-line prevention support tools. The model consists in analysing unreliability instead of making a statistical assessment of reliability. It is based on three component principles: failure, "errorgenous" information processing and action.
Three-component principles for a model of unreliable agent
Reliability is an entity's ability to realise one or more required functions in specified conditions. It is useful to define the different kinds of terms related to unreliability characteristics for both human and non-human agents, Figure 4:
• Acquisition related failure. This is often the first cause of reduced reliability, because it generates an "errorgenous" environment. In the unreliability model, failure is considered at the same level for both human and machine. For example, attentional failures or dazzle are comparable to a physical sensor failure: both human and physical failures are principal causes and have the same consequence, i.e. a lack of information.
• "Errorgenous" information processing. It is a failure of information processing and concerns both erroneous processing of information and processing of erroneous information. For example, because of lack of attention, sensory failures, display failures, the provided information cannot correspond to the real state of process. In such a way, it is possible to generate an unreliable situation for the process without doing erroneous action related to perceived information, but without doing a required action related to the real state of the process. On the other hand, an insufficient knowledge can generate an erroneous processing of perceived information.
Figure 4. Model of unreliable agent
• Correct action or erroneous action. One must distinguish the action, its consequences for the process, and the output information which is generated. With regard to the agent who performs an action, the action is correct when it corresponds to the required action given the input information, whatever the quality of that information (i.e. false or true with regard to the system state); otherwise the action is erroneous. Thus, when there is an erroneous action, there is a failure of the intelligent agent (i.e. human or machine). On the other hand, an action that is correct for an agent can be erroneous with regard to the process and its environment, i.e. with regard to the consequences of the action. Therefore, during error analysis, a reference point is needed to describe an action as correct or erroneous.
Relations between components
The three components of the model are linked. For instance, the consequence of an acquisition related failure can involve "errorgenous" information processing or an erroneous action, an unrequired action can be caused by "errorgenous" information processing, and an erroneous action can generate an acquisition related failure.
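The following minimal sketch (our reading of the model, not the authors' notation) represents an unreliable agent carrying the three components and the propagation links between them; the example entries anticipate the railway case analysed later in the paper.

# Minimal sketch (our reading of the model, not the authors' notation):
# an agent carries the three unreliability components and a list of
# propagation links between them; links across agents would pass through
# a human-machine interface.
from dataclasses import dataclass, field

@dataclass
class UnreliableAgent:
    name: str
    acquisition_failures: list = field(default_factory=list)
    errorgenous_processing: list = field(default_factory=list)
    actions: list = field(default_factory=list)   # correct or erroneous
    links: list = field(default_factory=list)     # (source, target, description)

    def add_link(self, source, target, description):
        # Record a propagation such as failure -> processing or processing -> action.
        self.links.append((source, target, description))

yard_master = UnreliableAgent("yard master")
yard_master.errorgenous_processing.append("inadequate situation awareness")
yard_master.actions.append("authorised two trains on the same track")
yard_master.add_link("processing", "action",
                     "poor situation awareness led to a conflicting authorisation")
print(yard_master)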
These interactions between components can be internal or external. Internal interactions are related to the physical, psychological and cognitive state of a human operator. External interactions concern the links with other human agents, with technical components and with the external environment.
Interactions between human agents take place through verbal communication (e.g. directly or using a communication tool such as a telephone or radio) or through non-verbal communication (e.g. manual transfer or communication via a network). A model of unreliability therefore has to be applied to both humans and machines.
Moreover, decision, action or recovery support tools may be considered as intelligent components which can have intentions and can therefore make errors. The three component principles of unreliability can thus also be applied to those tools. In such a case, interactions between human agents and non-human agents can appear, and the corresponding human-machine interface has to be included in the model of unreliability. The relations between the unreliable components of the human-machine system can then be as follows:
• The consequence of an acquisition related failure involves the agent's own "errorgenous" information processing, but can also act on the "errorgenous" information processing of another agent via the human-machine interfaces. Moreover, a failure can appear to have no consequence on the agent's own "errorgenous" information processing and yet generate a failure in the other agent (e.g. a display failure does not affect the computer, but does mislead the human agent's information processing).
• The causes of "errorgenous" information processing can act on another agent's behavior through the human-machine interfaces, as a failure consequence (e.g. a human operator inputs a bad value and the non-human agent performs an unrequired action).
• Correct actions and erroneous actions (with regard to input information) can generate failures for both human and machine (e.g. a human operator who involuntarily switches off the remote electronic surveillance can cause a control system failure).
Example of the application of the model of unreliability
The case of a railway system illustrates the use of this model of unreliability. Firstly, a functional model of the overall Canadian system is proposed and an accident is analyzed. The model of unreliability is then applied.
Functional analysis of the railway system
The Structured Analysis Design Technique (IGL, 1989) is a good way to identify system functions during the system analysis. In such a representation, the general railway system function is to realize a transport mission by means of the railway system and competent staff, according to the specification of the demand to transport passengers or goods by train, Figure 5. The inputs to this function are the real needs expressed by customers and the feedback when missions are realized, used to update the planning of the rail transport services. A mission consists of carrying passengers or goods from a departure point to an arrival point.
Figure 5. General railway system function
The accident considered is a collision which occurred on 7 June 1994 at Garneau Yard in the Lac-Saint-Jean Subdivision (Canada); it is described in a report of the Canadian Transportation Safety Board which also documents the investigation into the occurrence (TSB, 1994). A functional description was made of the Canadian railway system in which the accident occurred; the corresponding analysis focuses on the function of controlling the traffic in the yard area which, in the railway company considered, is normally carried out by a single human operator, the yard master.
The Garneau yard is a marshalling and maintenance yard. The work of the yard master includes supervising the switching operations of yard crews, issuing instructions to incoming and outgoing trains, transmitting and receiving remote information from the rail traffic controller (RTC) located in Montreal, providing appropriate information to the car department and locomotive shop personnel of the yard and, when necessary, providing information to track maintenance personnel concerning the movements of trains in the Garneau Yard area. He makes and receives telephone calls and uses a computer, a fax machine, a photocopy machine and a railway radio with channels providing access to the RTC and to train and yard crews within radio range in the yard. His office is located in a tower on top of the yard office building, at Mile 0.0 of the Lac-Saint-Jean subdivision, from which he can see the ongoing yard activities in and approaching Garneau Yard.
A yard master is on duty 24 hours a day except Saturday and Sunday, and during the work period there are three shift changes, at 07:30, 15:30 and 23:30. Before each change, the yard master who has finished his duty has to transfer responsibility to the following yard master.
The general function performed by this operator can be described as controlling the traffic in the yard area, and it is realised by carrying out the following sub-functions, Figure 6:
• To transmit and receive remote information from the rail traffic controller located in Montreal;
• To receive local information from the yard area;
• To integrate this information;
• To implement plans and issue information to incoming and outgoing trains;
• To control switching and train movements;
• To modify plans according to the situation development;
• To transfer responsibility to the following yard master.
The yard master receives all situation elements from communications or by viewing the scene. He has to integrate this information using his experience and knowledge of how aspects of the situation work together and influence each other, and to project it into the future so as to make and modify plans as tasks are completed and new situations arise. There is no status board or job aid to help the operator remember or confirm the situations or plans, and no procedure seems to be used in the communication exchanges between different operators. He also has to perform additional administrative tasks, resulting in a considerable workload.
Figure 6. Decomposition of the general function into different sub-functions
Accident description and event reconstruction
The locomotive for train No. 418 was travelling northward on the main track to enter the yard extension track and couple to train 418, which was standing on track No. S-253. Train No. 421 arrived at Garneau Yard from Montreal and was yarded on track No. S-263. The crew uncoupled the locomotive from the train and operated it northward on the yard extension track to the main track switch at Mile 1.78. The locomotive entered the main track and operated southward towards the Garneau Yard locomotive shop. On a six-degree curve at Mile 1.5 the two locomotives collided.
On the day of the occurrence, the day yard master began transferring responsibility to the afternoon yard master using a computer-generated status report which had been printed 35 minutes before the actual change of shift. During the transfer both yard masters reviewed the status report and discussed plans for train movement and control. There was no documentation of plans or changes occurring after the printing of the status report, nor was there a procedure to ensure that the incoming yard master understood the situation and the plans at the time of the transfer. At the time of the transfer, the day yard master verbally communicated the positions of the trains.
The situation was as follows:
• Train 421 was being yarded on track No. S-263 and its locomotive had to return to the locomotive shop via the main track.
• The locomotive for train 411 had to couple to its train on track No. S-262.
• The locomotive for train 418 had to proceed through the yard track No. 260 to couple to the train on track No. S-253.
After the second yard master started his duty, the situation changed: he authorised a requested change to the instructions given by the first yard master to the conductor of train 418, which resulted in trains 418 and 421 using the main track in opposite directions, Figure 7.
Figure 7. Accident description
Application of the model of unreliability
In the system considered, information has to be received, integrated and transferred from one human agent, the yard master, to several other agents, the train and yard operators, in order to achieve the general objectives of the system. There is no way to verify that all important information has been transferred and correctly understood. At the same time, the yard master has to rely only on himself to remember all the information necessary to implement plans and to transfer orders to the train crews. The causal model of unreliability can be applied to explain the development of the events which caused the collision.
The analysis of unreliability is made for one sub-function: the control of the movements of trains, Figure 8. The model is duplicated in order to represent the interaction between two different human agents. The left box of the figure is the yard master, whereas the right box is the trainman of train 418 and, more generally, the crew of this train. For each model of unreliable agent, the unreliable components which caused the collision were identified; a number indicates each unreliable behavior.
The outgoing yard master transferred the information to the incoming one, but this transfer occurred in a distracting environment and all information was transferred only verbally. As a consequence of the lack of support tools, the second yard master did not develop adequate situation awareness (1) and allowed two trains to use the same track in opposite directions (2), without realizing that a conflicting situation would arise. His action generated "errorgenous" information processing in the crew of train 418 (3): because of the permission given by the yard master, the train crew members were sure that there was no other train on the main track and they operated (8) at a speed that prevented them from avoiding the collision.
Figure 8. Application of the causal model of unreliability
This errorgenous information processing was also reinforced by an erroneous action of the crew of train 418: the radio in the locomotive cab always has to be set at a volume that ensures continuous monitoring, but it was set at a low volume (4). This action generated an acquisition related failure: the train crew members did not hear the warning given by the crew of train 411 to train 421 about train 418 approaching on the same track (5). Trains 418 and 421 were operating within cautionary limits, that is, portions of the main track within which speed has to permit stopping within one-half the range of vision of any equipment on a track unit. They relied on the yard master for permission to use the main track within those limits when they should have been complying with the rules governing the operation of trains within them; this is a violation of the rules (6). Moreover, the trainman of train 418 realised that train 421 was approaching on the same track only when it was too late to avoid the collision, because instead of watching the track he was trying to understand how to operate his new portable radio. He had been meant to test it before the duty, but the instructions had not been available earlier; this is an omission (7) consequent on organizational problems. All these elements together had as their consequence the collision of train 418 and train 421.
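The numbered elements and their propagation can also be written down explicitly. The minimal sketch below (our own reading of the links described above; the cause-effect mapping is an interpretation, not the authors' figure) traces the chain of unreliable components that ended in the collision.

# Minimal sketch (our reading of the numbered elements above; the
# cause-effect links are an interpretation, not the authors' figure).
ELEMENTS = {
    0: "collision between trains 418 and 421",
    1: "yard master: inadequate situation awareness",
    2: "yard master: two trains allowed on the main track in opposite directions",
    3: "train 418 crew: believed the main track was clear (errorgenous processing)",
    4: "train 418 crew: cab radio set at a low volume",
    5: "train 418 crew: warning from the crew of train 411 not heard",
    6: "train 418 crew: cautionary-limit rules violated",
    7: "train 418 trainman: lookout omitted while handling the new portable radio",
    8: "train 418 crew: speed too high to stop in time",
}
CAUSES = {0: [2, 5, 7, 8], 2: [1], 3: [2, 4], 5: [4], 8: [3, 6]}

def trace(effect, indent=0):
    # Print an effect and, recursively, the elements that contributed to it.
    print("  " * indent + ELEMENTS[effect])
    for cause in CAUSES.get(effect, []):
        trace(cause, indent + 1)

trace(0)   # unfold the chain that ended in the collision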
The system described in this paper presents a very low degree of automation, and many of the problems which led to the final accident could have been avoided by introducing support tools for the operator, such as aids to remember the situation and plans and to develop adequate situation awareness, and procedures governing the information exchange. The model, however, can be applied to different systems, from highly automated systems to less automated ones.
Conclusion
The paper has focused on human error analysis which aims at orienting both off-line and on-line prevention of human error. The first part presented off-line error prevention approaches: a system development method and a human centered specification phase that designs on-line support tools. Those tools aim at preventing human error by providing the human operator with help for decision, action or recovery. In relation to system analysis and modeling, i.e. process and activity analyses and modeling, the second part discussed a reliability-based approach: different abnormal function models were developed in order to model human errors. Most human reliability assessment methods aim at reducing the probability of error occurrence. Nevertheless, they are off-line prevention approaches which do not take into account the impact of on-line prevention support tools on the future designed system. Moreover, their results are not homogeneous between methods. A different approach was therefore presented; it is not based on statistics and starts from the consideration that both human agents and machines are unreliable. This model was applied to a case taken from a railway system.
The purpose of the model is to provide a structural description of the system which takes into account dynamic aspects such as the information flow and the allocation of roles between human operators and machines, to consider the different levels of interaction which may characterize a complex system, and to localize where problems may occur and how they may propagate throughout the system. A functional analysis of the system identifies the functions which have to be carried out to ensure the achievement of the system goals, who performs these functions and what tools are available. The resulting functional model can be completed by the structural model of unreliability, which identifies the unreliable components of a given function.
The perspective of this study is the integration of a more developed model of unreliability into the system specification. A field study focused on the observation of railway operators' activity and the analysis of other railway accidents may allow the development of a case-based unreliability approach aimed at identifying the weak points of the considered system. The model of unreliability will then be able to isolate possible error producing conditions, human-machine interaction problems, and the causes and consequences of failures, errorgenous information processing and actions. Moreover, it will suggest solutions to improve reliability. For example, in relation to the human centered specification phase, a simulation study based on the generation of accident scenarios and the analysis of the experimental results might evaluate the impact of on-line error prevention support tools on system reliability.
Acknowledgement
This project was supported by the European Community through the "Human Capital and Mobility" Network on "Prevention of Error in Systems for Energy Production and Process Industry".
References
Boehm, B.W. (1988) 'A spiral model of software development and enhancement' IEEE Computer, 21, 61-72
Edwards, J. L. (1991) 'Intelligent Dialogue in Air Traffic Control Systems' in Wise, J.A., Hopkin, V.D. and Smith, M.L. (eds.) Automation and System Issues in Air Traffic Control NATO ASI Series F73, 137-151
Embrey, D.E. (1986) SHERPA: A Systematic Approach for Assessing and Reducing Human Error in Process Plants Human Reliability Associated Ltd
Frese, M. (1995) 'Error management in training: conceptual and empirical results' in Gagnara, S., Zucchermaglio, C.E. and Stucky, S. (eds.) Organizational Learning and Technological Change Springer Publ. Co., 112-124
Hollnagel, E. (1991) 'The phenotype of erroneous actions: implications for HCI design' in Weir, G.R.S., Alty, J.L. (eds.) Human-computer interaction and complex systems London Academic Press, 73-121
Jaulent, P. (1990) Génie logiciel : les méthodes A. Colin
Laprie, J.C. (1995) Guide de la sûreté de fonctionnement Cépaduès
Leplat, J. and de Terssac, G. (1990) Les facteurs humains de la fiabilité dans les systèmes complexes - Economie et gestion Octares
Masson, M. and De Keyser, V. (1992) 'Human error: lesson learned from a field study for the specification of an intelligent error prevention system' in Kumar, S (ed.), Advances in Industrial Ergonomics and Safety IV, Taylor and Francis, 1085-1092
Miller, D. P. and Swain, A. D. (1987) 'Human error and human reliability' in Salvendy, G. (ed.), Handbook of Human Factors John Wiley & Sons, 219-250
Norman, D. A. (1988) 'Categorisation of Action Slips' Psychological Review 88, 1-15
Rasmussen, J. (1986) Information Processing and Human-Machine Interaction North-Holland
Rasmussen, J. (1988) 'The role of error in organizing behavior' Ergonomics 33, 1185-1199
Reason, J. (1990) Human Error, Cambridge University Press
Swain, A.D. and Guttmann, H.E. (1983) Handbook of Reliability Analysis with Emphasis on Nuclear Plant Applications Technical Report NUREG/CR-1278, Nuclear Regulatory Commission, Washington, DC
Telle, B., Vanderhaegen, F. and Moray, N. (1996) 'Railway System Design in the Light of Human and Machine Unreliability' IEEE Transactions On Systems, Man and Cybernetics (Beijing, China, 14-17 October)
TSB (1994) Transportation Safety Board, Canada, Report number R94Q0029
van der Schaaf, T. W. (1995) 'Human recovery of errors in man-machine systems' 6th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design and Evaluation of Man-Machine Systems (Boston, USA, 27-29 June, 91-96)
Vanderhaegen, F. (1995) 'Human-machine organization study: the case of the air traffic control' 6th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design and Evaluation of Man-Machine Systems (Boston, USA, 27-29 June, 615-620)
Vanderhaegen, F., Telle, B. and Moray, N. (1996) 'Error based design to improve human-machine system reliability' Computational Engineering in Systems Applications CESA'96 IMACS Multiconference (Lille, France, July 9-12 1996, 165-170)
Villemeur, A. (1988) Sûreté de fonctionnement des systèmes industriels Eyrolles
Analysing Requirements Specifications for Mode Confusion Errors
N. Leveson, J. Reese, S. Koga, L.D. Pinnel, S.D. Sandys
University of Washington
This paper was presented in LaTeX format and still needs to be converted into html by the editor.
Embedding Modelling of Errors in Specifications
P. Palanque and R. Bastide,
Université de Toulouse I
This paper still needs to be converted into html by the editor.