The Evaluation Of User Interface Notations
Chris Johnson and Jarle Gjøsæther,
Department of Computer Science,
University of Glasgow,
Glasgow, G12 8QJ.
Scotland.
Email: johnson@dcs.gla.ac.uk
http://www.dcs.gla.ac.uk/~johnson
Tel: +44 0141 330 6053
ABSTRACT
Over the last decade a wide range of graphical, tabular and textual notations have been proposed to support the design of human-computer interfaces. These notations are intended to strip away the clutter of implementation details that frequently obscure interaction properties. Unfortunately, relatively little work has been done to evaluate the usability of these notations for 'real-world' interfaces. We have, therefore, conducted an empirical evaluation of the User Action Notation (UAN), State Transition Networks (STN) and temporal logic 'in the wild'. By this we mean that our subjects were drawn from realistic samples of users and designers. We also presented our subjects with realistic descriptions of two user interfaces. This avoids a weakness of previous investigations that have used 'toy examples'. The results of our investigation show a strong preference amongst our subjects for the use of natural language descriptions. More surprisingly, our results also suggest a link between the frequency of comprehension errors and positive attitude statements towards particular notations. In other words, our subjects made most errors with the notations that they liked the best. This suggests that while graphical notations, such as state transition networks, have a strong intuitive appeal they may also create significant problems for real-world development tasks.
KEYWORDS:
Interface design notations; UAN; STN; Temporal Logic.
1. INTRODUCTION
A vast array of notations are now available for the development of human-computer interfaces (Gray and Johnson, 1995). These range from purely graphical languages, such as State Transition Networks (Green, 1985), through tabular notations, such as the User Action Notation (Hix and Hartson, 1993), to textual formalisms, including grammars and logics (Johnson, 1993). These notations have had a relatively low impact upon the development of mass-market computer systems. They are, however, playing an increasingly important role in the design of human-computer interfaces to safety-critical applications (Johnson and Harrison, 1994). The additional complexity and the high consequences of 'failure' are such that design notations are being exploited by government agencies, such as NASA (1989) and the UK Ministry of Defence (1991), as well as by corporations, including IBM and Mitsubishi (Jack, 1992).
The commercial application of interface design notations is complicated by the difficulty of selecting an appropriate language. Previous authors have focused upon technical comparisons, such as consistency and completeness (Johnson and Harrison, 1994). While these issues are important, they are perhaps less significant than the usability of the notations themselves. Green emphasises this point in his work on cognitive dimensions (Green, 1989). He assesses the usability of a notation against a number of generic criteria. For instance, the 'viscosity' of a language indicates how easy it is to change a design. This work has, typically, been driven by the qualitative assessments of researchers (for example, see Buckingham-Shum, 1991). In contrast, we want our assessments to come direct from the designers and domain experts who must use the notations (Johnson, 1995, 1995a).
2. STRUCTURE OF THE PAPER
Section 1 has introduced the argument that is presented in this paper. Section 3 goes on to describe the methodological problems that frustrate the use of laboratory-based studies, think aloud techniques and questionnaires as means of evaluating interface design notations. It is argued that, in the absence of any 'ideal' evaluation technique, it is nevertheless important that we attempt to validate our work on the specification and verification of interactive systems. Section 4 presents some of the problems that arise when recruiting users for such an evaluation. It is argued that we are facing a Catch-22 situation. It is difficult to produce statistically significant results because not enough industrial designers are exploiting formal and semi-formal notations. This makes it difficult to identify a reasonable population of users to support any evaluation. In turn, we will not develop a reasonable population of users until we can evaluate and demonstrate the benefits of interface notations. Section 5 introduces the UAN, state transition and temporal logic notations that were used in our study. Section 6 then presents the window manager example that was used to evaluate the usability of these formalisms. Section 7, in contrast, introduces the safety-critical gas turbine interface that was used in our second study of industrial interface designers. Section 8 describes the comprehension questions that were used in our two trials. It also presents the more qualitative attitude statements that were used to identify more general responses to the UAN, temporal logic and state transition specifications. Section 9 presents the results of the evaluation while Section 10 discusses the reasons for our findings. Finally, Section 11 presents the conclusions that can be drawn from this research and suggests directions for further work.
3. THE PROBLEMS OF EVALUATION
Our aim was to assess the usability of interface notations for 'real-world' design tasks (Johnson, McCarthy and Wright, 1995). This raised a number of methodological questions. For instance, much of the previous work into the usability of programming languages has focused upon subjects drawn from University courses (Scholtz and Wiedenbeck, 1990). This is far from ideal because undergraduates may have little experience of the commercial pressures that are often cited as reasons why interface design notations are not used in the 'real-world'.
3.1 Laboratory Evaluations
The need to recruit a realistic user-population led us to focus our evaluation on industrial designers. This raised a number of further problems. In particular, it made it difficult for us to exploit the 'laboratory-based' evaluation techniques that have traditionally been used in other areas of human-computer interaction. Asking commercial designers to leave their normal work to conduct a tightly controlled experiment under laboratory conditions may not produce results that will be replicated under the pressures of everyday life (Campion, 1989).
Few employers are willing to release their staff for the period of time that would be required in order to complete a thorough laboratory-based evaluation of interface design notations. Significant amounts of training and practice are necessary if individuals are to reach basic levels of proficiency. A further problem is that interface designers come from a range of backgrounds. Some have considerable expertise with mathematical notations. Others are 'self-taught' and have no experience with such notations. This makes it difficult to 'factor out' expertise. In order to ensure that the users have identical training we would ideally want to put them all through the same introductory course. The results of our previous investigations into the industrial application of formal methods have shown that there is little incentive for designers who are already familiar with interface notations to sit through such a course (Johnson, 1996).
Further problems arise because the term 'usability' is too vague to produce statistically significant results. This ambiguity can be reduced. For example, we might design a series of experiments to identify the total time that it takes to learn different notations. Alternatively, we might analyse the frequency and type of errors that are made when reading design documents. Unfortunately, such approaches raise further methodological problems. In particular, those errors that might be observed under controlled laboratory conditions may have little relationship to the mistakes that people actually make when using an interface design notation 'in the wild'. Similarly, the time taken to learn an interface notation will clearly be affected by the competing tasks and deadlines that characterise the working lives of interface developers.
3.2 Think Aloud Techniques
Think aloud evaluation techniques, such as those developed by Wright and Monk (1991), offer an alternative to the tightly controlled conditions of laboratory-based evaluation techniques. This approach requires that users complete a number of 'realistic' tasks under normal working conditions. They are asked to talk about the critical incidents and decisions that occur while they complete these tasks.
This approach overcomes some of the problems associated with the replication of laboratory results in people's working environments. The evaluation of a notation can take place in a realistic setting. The intention is to provide qualitative feedback. The focus is not upon the production of statistically significant results. It is, therefore, perfectly possible for an evaluation to be interrupted by the distractions that frequently occur during everyday life: the telephone may ring; colleagues may ask questions and so on. Such normal events are, typically, factored out in laboratory-based investigations where interruptions would distort the results obtained for errors or learning times.
Unfortunately, think aloud techniques do not entirely resolve the methodological challenges that are posed by interface design notations. For instance, the task of thinking aloud can affect the results that are produced by this approach. Designers are forced to reconsider their actions during the process of vocalisation. This period of introspection may actually reduce the number of errors that would otherwise have been made. In consequence, the critical incidents that are noted during the analysis might be very different from those that would occur 'in the wild' (Wright and Monk, 1991).
We were concerned to analyse the usability of interface design notations against two different classes of application. This decision was motivated by the argument that safety-critical applications may pose totally different challenges from mass-market word processors or windowing systems. This raised further problems for the use of think-aloud evaluation techniques. It is extremely difficult to gain access to interface designers for safety-critical systems. Those designers that we could interview were developing user interfaces for oil rig workers on production platforms off the Norwegian and United States' coasts. We could not, therefore, directly observe the use of our chosen notations in the manner advocated for think aloud techniques.
3.3 Questionnaires And Surveys
Questionnaires avoid many of the weaknesses of laboratory-based evaluations and think aloud techniques. They do not take designers out of their normal context of work, nor do they require that evaluators enter the designers' working environment to observe their everyday tasks with a novel notation. Instead, they are low cost and can be easily distributed with minimal overheads for the companies involved.
Unfortunately, the use of questionnaires also limits the scope of our investigation. In laboratory-based approaches, evaluators can tightly control a user's working environment. In think aloud techniques, evaluators can observe users working on specified tasks. In surveys, users are free to respond to questions at any time in their daily life. They may fill in the form during a break, while waiting for a program to compile or while working on another design task. These other activities can interfere with the results that are obtained from our study.
The caveats raised above emphasise the point that there are no ideal techniques for the evaluation of interface design notations. Laboratory-based techniques filter out real-world influences. Think-aloud techniques require close access to designers who are already familiar with a notation. Surveys are unconstrained and cannot easily be controlled. With this in mind, we are exploring a range of evaluation techniques that combine elements of all of these approaches (Johnson, 1996). This work is still in its early stages. In anticipation of the results of this work, existing evaluation techniques must be exploited to analyse the usability of interface design notations. It is important that such investigations are attempted even if the evaluation tools are far from ideal. Without such evidence there is no means of validating current research into the specification and verification of interactive systems.
4. THE USERS
As mentioned, it is difficult to find industrial designers with the time and motivation to support the evaluation of interface notations. We were only able to recruit six people who were prepared to help us study the usability of these formalisms. Their expertise with interface design notations ranged from none at all up to three years' experience with the application of formal and semi-formal specification languages. The group included recent graduates and experienced interface designers from an oil production company. This range of backgrounds further complicated the problem of producing reliable results. It is difficult to draw direct comparisons between users with such different levels of expertise and skill. The situation was even worse. Those designers that did have previous experience were not all skilled in the same notations. Some users were familiar with software engineering formalisms but not with user interface notations. Others claimed to have high levels of expertise in both areas. Additional problems are created because it is difficult to assess whether claimed levels of skill are actually reflected in an individual's ability to use a particular notation to design a human-computer interface.
Given the limited sample size and the varied backgrounds of our participants, it is difficult to produce statistically significant results. Unlike many laboratory studies of small-scale user interfaces or well constrained psychological phenomena, it is simply not possible to factor out all of the potential biases and external factors that might influence our results. One solution to the problems listed above would be to recruit more users to perform the survey. It can be argued that, as more and more people use interface specification notations, it will become easier to recruit users for our evaluations. Unfortunately, we are in a Catch-22 situation. Many people will not exploit formal and semi-formal notations until they are convinced of their utility. We cannot, however, gain evidence of the utility of notations until we have a large enough user population of industrial designers.
5. THE NOTATIONS
UAN, STN and temporal logic were chosen for our evaluation because they respectively represent tabular, graphical and textual approaches to interface design. The following paragraphs describe the basic components of each of these notations.
5.1 User Action Notation
Hix and Hartson's (1993) User Action Notation categorises user interface events according to the agent that initiated them. In this context, the agent is assumed to be either the user or the computer. Figure 1 illustrates the resulting tables. The user's action of moving the mouse pointer to the button is represented in the interface by the cursor tracking to that button. The action of pressing the button down is represented in the interface by the button being highlighted. Finally, the user action of releasing the button is mirrored by the button being de-highlighted in the interface and by the corresponding action being executed in the application. The sequence of events flows from left to right and then from top to bottom. The structure of the table, therefore, not only provides important cues about the agent that is engaged in a particular activity but it also represents temporal information.
Figure 1: Simple Example Of The User Action Notation

We are interested in evaluating this notation because it provides a relatively simple syntax with what is claimed to be an intuitive structuring. In particular, the use of columns to organise user and system functions helps the reader to keep track of their various activities. Both perform visible actions and hypothesised, invisible internal actions. These internal actions for the system include operations on the dialogue state, application actions and so on. Those for the user include changes to short and long term memory, planning and other cognitive actions.
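For readers who are unfamiliar with the notation, the following plain-text sketch approximates the kind of table shown in Figure 1. The symbols follow Hix and Hartson's conventions as we understand them (~[button] moves the cursor into the button's context, Mv and M^ denote depressing and releasing the mouse button, and ! denotes highlighting); the layout is our approximation rather than a reproduction of the original figure:

    TASK: select a button
    USER ACTIONS    INTERFACE FEEDBACK           CONNECTION TO COMPUTATION
    ~[button]       cursor tracks to the button
    Mv              button!  (highlighted)
    M^              button-! (de-highlighted)    execute the associated action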
5.2 State Transition Networks

State transition networks provide a graphical notation for the development of user interfaces (Olsen, 1985). The state of interaction is represented by a circle. The state of the system changes when particular conditions are satisfied. In Figure 2, this would occur if the condition 'User presses mouse' were fulfilled. If the mouse were pressed then the action 'highlight button' would be executed. The state of the system would change and the button would now be highlighted. The complexity of this simple textual explanation indicates the attraction of graphical representations, such as State Transition Networks! Figure 2 also represents the hierarchical nature of state transition networks. The box to the right of the diagram represents another state transition network that may, in turn, contain further sub-networks.
Figure 2: Simple Example Of State Transition Networks
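To make the structure of such a network concrete, the following minimal Python sketch encodes the behaviour described above as a dictionary mapping states to condition/(action, next state) pairs. The state names and the second transition are our illustration; only the 'User presses mouse' transition is taken from the description of Figure 2:

    # A minimal sketch of a state transition network. States map
    # triggering conditions to (action, next state) pairs.
    stn = {
        "idle": {
            "user presses mouse": ("highlight button", "button highlighted"),
        },
        "button highlighted": {
            "user releases mouse": ("execute action", "idle"),
        },
    }

    def fire(state, condition):
        """Follow the transition labelled by `condition` from `state`."""
        action, next_state = stn[state][condition]
        print(f"{state} --[{condition} / {action}]--> {next_state}")
        return next_state

    state = fire("idle", "user presses mouse")
    state = fire(state, "user releases mouse")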
We are interested in State Transition Networks, rather than other graphical notations such as Petri Nets (Bastide and Palanque, 1990) or State Charts (Harel, 1988), for entirely pragmatic reasons. During initial tests it was found that State Transition Diagrams could be more quickly understood from written instructions than other available formalisms. This was significant as our evaluation was to be based around a series of paper-based questionnaires. Again, our findings re-iterate the close relationship between evaluation tools and the scope of any validation exercise for interface design notations. It is difficult to validate the claims that are made for many of the more sophisticated tools and techniques because they require significant training and practice. Many designers will not meet the costs of such training and practice until the approaches have been thoroughly validated.
5.3 Temporal Logic
Temporal logic is a purely textual notation (Johnson, 1993). The following example states that an action is executed if, in the present interval, the user moves the mouse; eventually they select a button; and in the interval after that selection the system executes the action associated with the button. The ○ symbol is read as 'next' and ◇ is read as 'eventually':

    execute_action ⇐
        user(moves_mouse),
        ◇(user(selects_button),
            ○(system(executes_action))).
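To give these operators an operational reading, the following minimal Python sketch evaluates 'next' and 'eventually' over a finite sequence of intervals. The trace, the event names and the encoding are our illustration, not part of the study materials:

    # '○' (next) and '◇' (eventually) over a finite trace of intervals,
    # where each interval is modelled as a set of events.
    def occurs(event):
        return lambda trace, i: event in trace[i]

    def next_(p):        # ○p holds at i if p holds at interval i + 1
        return lambda trace, i: i + 1 < len(trace) and p(trace, i + 1)

    def eventually(p):   # ◇p holds at i if p holds at some j >= i
        return lambda trace, i: any(p(trace, j) for j in range(i, len(trace)))

    def formula(trace, i):
        selects_then_executes = lambda t, j: (
            occurs("user:selects_button")(t, j)
            and next_(occurs("system:executes_action"))(t, j))
        return (occurs("user:moves_mouse")(trace, i)
                and eventually(selects_then_executes)(trace, i))

    trace = [{"user:moves_mouse"}, {"user:selects_button"},
             {"system:executes_action"}]
    print(formula(trace, 0))  # True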
It may seem paradoxical to include temporal logic when we excluded Petri Nets because our users found them difficult to understand; the logic notation poses even greater challenges than the networks of the graphical formalism. Our decision is justified by a concern to compare a relatively intuitive graphical notation with a more 'demanding' textual formalism. We are also interested in evaluating temporal logic because timing properties can have a profound impact upon the course of interaction (Johnson, 1995a). For example, numerous studies have shown that variable delays in system responses lead to frustration and error (Kuhmann, 1989). It is important, therefore, that designers can represent and reason about the impact of these problems. Temporal logic provides a textual representation of the timing properties that are represented using the graphical and spatial cues of UAN and state transition networks. We are also concerned to evaluate this notation because we have limited qualitative evidence to suggest that there are a number of usability problems with the logic (Gray and Johnson, 1995). It can be difficult for designers to exploit such notations; training in discrete mathematics is required in order to understand the formal underpinnings of the language. It can also be extremely difficult to construct the complex chains of interaction that arise between a system and its user.
5.4 The Difficulties Of Comparing Notations
Our intention was to ask designers and users to provide qualitative usability assessments of UAN, state transition networks and temporal logic. Such analyses have only previously been produced from the introspection and insight of academics and researchers in human-computer interaction (Gray and Johnson, 1995). The examples presented in the previous paragraphs illustrate some of the problems of performing such an evaluation of different interface description techniques. For example, each of the notations encodes slightly different design information. UAN explicitly represents the agents involved in interaction. This information may be included in STN and temporal logic but it is not explicitly part of the notation. Similarly, temporal logic offers a wide range of facilities for dealing with time. These can be introduced into UAN and STN but again they are not basic components of the notations (Gray and Johnson, 1995). Such differences make it difficult to determine whether the various descriptions of an interface are equivalent. For instance, if we chose to represent an interface to a time-critical system, such as a process control application, then we might expect temporal logic to perform well in our qualitative evaluation. If we described interaction with an agent based interface then UAN might out-perform the other notations. In order to reduce this bias, we chose to evaluate the notations against two different scenarios. The first involved the design of a graphical user interface to a window manager. The second involved human-computer interaction with a turbine on an oil-rig.
6. TRIAL 1: THE WINDOW MANAGER
A window manager was chosen because it typifies many mass-market applications. Our intention was to provide a detailed example; Figure 3 illustrates the STN from this evaluation. Our subjects were asked to use this description to answer a series of questions about the interface. There then followed a number of qualitative questions about the usability of the notation. This process was repeated for each of the remaining notations.
Figure 3: STN For Opening An Application In The Window Manager
This experimental design created a number of problems. The design of the user interfaces had to be different enough from the existing systems so that designers had to use the descriptions. Questions about qualitative preferences between notations might also have been biased if each of the notations had been introduced in the same order. Similarly, learning effects might bias answers about the interface design. Information gleaned from a description using one notation might have been used to answer questions about another description. In order to combat these effects we had to produce three sets of descriptions and three sets of behavioural questions for each of the notations. The ordering of the material was randomised.
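As an illustration of the kind of counterbalancing involved, the following minimal Python sketch assigns each of six subjects one of the six possible orderings of the three notations. The assignment scheme is our illustration; the materials were randomised but the exact procedure is not reproduced here:

    # A minimal sketch of counterbalancing presentation order across subjects.
    import random
    from itertools import permutations

    notations = ["UAN", "STN", "Temporal Logic"]
    orderings = list(permutations(notations))  # 3! = 6 possible orderings
    random.shuffle(orderings)
    for subject, order in zip(range(1, 7), orderings):
        print(f"Subject {subject}: {' -> '.join(order)}")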
7. TRIAL 2: THE TURBINE CONTROL SYSTEM
The human-computer interface for a turbine control system was chosen because it typifies the safety-critical applications that are increasingly being developed using abstract notations such as temporal logic. As in the previous trial, we analysed the reactions of 'real' users and designers to each of the interface design notations. In this case our subjects were drawn from a Norwegian oil production company. Figure 4 provides an example of the UAN description that was presented to our subjects. At first sight, this might seem impossibly complicated for designers and users with little expertise in interface notations. It should be remembered, however, that the subjects had a high degree of domain expertise and, as the results show, they were able to interpret descriptions at this level of detail.
Figure 4: UAN For Operator Selection In The Turbine Control System
Each of the case studies was given to a different group of subjects. In both trials we endeavoured to use interface designers and users who might actually be involved in the development of such an application. In our first case study, this involved subjects from a range of backgrounds and educational experience. In the second case study, the specialist nature of the user interface created a much more homogeneous user population: oil-rig engineers with experience in turbine control.
8. THE QUESTIONS
We wanted to gain evidence about designer and user reaction to interface description notations. In order to do this we were concerned to ensure that the subjects made some attempt to understand the notations in each condition. A number of questions were, therefore, asked about the behaviour of each interface. The following table presents the questions that were asked about the STN shown in Figure 3.
The design of these comprehension questions raised a number of interesting problems. Early versions of the questionnaire provided 'Don't know' in addition to the 'Yes' and 'No' options. An initial survey revealed a strong tendency amongst the subjects to avoid commitment by selecting the 'Don't Know' option. This is an interesting phenomenon for further research. The decision was taken, however, to force commitment to a binary decision. This creates the risk that the subjects would rely on chance by guessing the answers. Fortunately, our results do not suggest that such tactics were being used.
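One way to assess the risk of guessing is to ask how likely a given score would be under chance responding. The following minimal sketch computes this for a hypothetical question count of ten; the study did not rely on this particular calculation:

    # Probability of k or more correct answers out of n binary questions
    # under pure guessing (p = 0.5). The question count is hypothetical.
    from math import comb

    def p_at_least(k, n, p=0.5):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(f"{p_at_least(8, 10):.3f}")  # ~0.055: 8+/10 correct is unlikely by chance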
A second set of questions provided qualitative evidence about users' and designers' reactions to the interface notations. The same questions were asked in both case studies and for all of the conditions within each test. Subjects were asked to tick the boxes next to those statements that they agreed with.
As mentioned in the introduction, the decision to conduct our analysis 'in the wild' imposed a number of constraints upon our evaluation. The users and designers in the second trial were 'off-shore' and could only be reached by radio or fax. We, therefore, had no means of recording completion times for the survey. Subjects in all of the conditions were asked to take as much time as they liked. They were, however, asked to record how long they spent on the comprehension questions associated with each notation. Each trial was performed by six subjects.
9. RESULTS OF THE EVALUATION
9.1 Time Consumption
Chart 1 illustrates the average total time required to complete the comprehension questions in each of the test conditions. These figures do not include the time taken to read and understand the 'familiarisation' material. It is important to emphasise that these timings were reported by the subjects and were not monitored by the investigators.
Chart 1: Average Total Time Required To Answer The Comprehension Questions
In the window manager condition, the standard deviation for both the STN and the UAN conditions was 2.0. In the former case, this was caused by one of the subjects taking substantially longer than the mode of 5 minutes. In the UAN condition this was caused by one user taking substantially longer than the modal value of 10 minutes. In the temporal logic condition, the standard deviation was 0 as all of the users took longer to complete the tasks than the allotted 10 minute interval. The standard deviation for the STN and the UAN in the Turbine Controller conditions was 2.6. For temporal logic, the standard deviation was 2.0. This lower figure was due to only one user completing the task within the allotted ten minutes.
Before presenting further results, it is worth mentioning that the figures reported in Chart 1 help to illustrate additional problems that must be addressed during the evaluation of interface notations. The standard deviations seem high in relation to the absolute time values reported. These results are due to the strong effects that arose when many users in the trial failed to complete the task within the allotted interval. This raises serious problems. If we simplify the questions to reduce the time taken then we will be forced to evaluate our techniques against toy systems and trivial examples. If we expand the amount of time required to complete an evaluation then we increase the burdens and demands upon our scarce user population.
9.2 Score For Behavioural Questions
Chart 2 indicates the percentage of correct answers to the comprehension questions for each of the interface description languages.
As mentioned in previous sections, these questions elicited yes or no responses to a series of questions about the potential interfaces.

Chart 2: Percentage Of Comprehension Questions Answered Correctly
In the Window Manager trial the standard deviation for all three notations was 17.9. This homogeneity reflects, in part, the effects of previous skills and expertise that have been addressed in the earlier sections of this paper. Those with little experience of formal notations scored well below the mean. Those with considerable skills in the use of interface design notations scored consistently above the average in each case. In the Turbine Controller trial, the standard deviation for STN was 19.7, for temporal logic it was 22 and for UAN it was 26.6. This last result is rather surprising as it reflects a genuine spread of results from 20% correct up to 100% correct. Even so, the similarity in the standard deviations again re-iterates the problems of evaluating interface notations. Greater confidence in the range of scores might have been obtained by asking a larger number of questions. This would have emphasised the differences between novices and experts; it might also have emphasised the differences between each of the notations. Unfortunately, the results from Chart 1 already raise serious questions about the amount of time that is required to conduct any evaluation of interface notations. A related issue is the question of fatigue, which might begin to have a serious effect if designers were asked to apply particular notations to answer a large number of comprehension questions.
9.3 Overall Attitude Statements
Chart 3 records the summation of the attitude statements for each of the notations in the two interfaces. The scores were obtained by adding one to a total for each of the positive attitude statements that were agreed with. One was subtracted if a positive statement was denied. This total was then divided by the number of subjects in each test. In this way the most positive score would have been six, zero would have indicated a neutral attitude and minus six represents the strongest negative reaction to a notation.
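As a concrete reading of this scoring rule, the following minimal Python sketch computes the group score from hypothetical response data:

    # +1 for each positive attitude statement agreed with, -1 for each
    # denied, averaged over subjects. The responses below are hypothetical.
    def attitude_score(responses):
        """responses: one list of six booleans (agreed/denied) per subject."""
        total = sum(1 if agreed else -1
                    for subject in responses
                    for agreed in subject)
        return total / len(responses)

    subjects = [[True] * 6,                               # fully positive: +6
                [True, True, True, False, False, False],  # neutral: 0
                [False] * 6]                              # fully negative: -6
    print(attitude_score(subjects))  # 0.0, on a scale from -6.0 to +6.0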
Chart 3: Overall Summation Of Scores For Attitude Statements

The standard deviation for the qualitative assessment in the Window Manager case was 1.6 for STN, 3.0 for temporal logic and 2.7 for UAN. In the Turbine Controller case, the standard deviation was 3.7 for STN, 1.5 for temporal logic and 3.7 for UAN. As before, these results can only be interpreted with reference to the methodological barriers that frustrate the evaluation of user interface notations. For example, the high variance for temporal logic in the first case study might be compared with the relatively low variance in the turbine controller. This might suggest that users were undecided about the limitations of the notation for designing a Window Manager but were convinced that it was of little benefit for a Turbine Controller. Alternatively, these results may simply reflect the atypical views of one or two individuals in the sample that we selected. In order to be convincing these results must be replicated. This leads us back to the previous argument that conventional evaluation techniques cannot be used to provide this additional evidence. The low numbers of commercial practitioners and their varying levels of expertise create a pressing need to find alternative means of validating interface specification techniques.
9.4 Detailed Attitude Analysis
Chart 4 presents the summation of scores for the individual attitude statements that were asked about each notation in both of the test interfaces. The absence of a bar in the chart indicates a neutral score of zero. The (1) indicates the window manager trial, the (2) refers to the turbine controller.
Chart 4: Individual Summation Of Scores For Attitude Statements
10. DISCUSSION
The results shown in Chart 4 indicate a preference amongst our subjects for the use of natural language rather than UAN, temporal logic or STN. The only exception to this is a slight preference for STN over natural language in the window manager trial. This positive reaction might have been due to the highly opportunistic design for interaction with this interface (Olsen, 1985). The preference for natural language in the other conditions might have occurred because our examples were too easy. Interface notations are often argued to offer the greatest benefits for complex design tasks (Johnson and Harrison, 1994). This analysis is supported by the high average score for all of the comprehension questions. It is, however, difficult to believe given the complexity of the UAN and STN shown in Figures 3 and 4. Alternatively, this negative reaction may indicate that designers and users 'in the wild' remain unconvinced about the benefits of interface description techniques. This is perhaps unsurprising for the window manager trial. The use of abstract notations has made relatively little progress in this area. It is more surprising in the case of the turbine controller where issues of safety and complexity have already led to the use of formal notations by the engineers in our evaluation.
We see a strong negative reaction towards temporal logic in both case studies. This is confirmed by an analysis of variance for the average attitude scores. The relatively low variance for the negative reaction towards temporal logic in the turbine trial indicates a strong consensus on this point:
                         STN     TL      UAN
    Window Manager       13.3    43.3    37.3
    Turbine Controller   70      11.3    67.3
The negative reaction to temporal logic is perhaps explained by the average time taken to answer the comprehension questions. Both UAN and temporal logic took significantly longer than STN in the window manager example. This, in turn, may explain the high positive reaction to STN in the qualitative assessments mentioned above. Our investigation, therefore, confirms previous observations about the high intuitive appeal of graphical notations. More interestingly, our evaluation shows that in spite of this strong preference our subjects actually made significantly more mistakes in the turbine comprehension exercise with STN than with temporal logic. Current work is attempting to replicate this result. If correct, this analysis has profound consequences for the development of human-computer interfaces. The usability problems and additional time that are required in order to understand temporal logic descriptions may be outweighed by the benefits of increased accuracy.
11. CONCLUSIONS
This paper has argued that more attention must be paid to the usability of interface notations if they are to be applied in the 'real-world'. In order to support this argument, we conducted an evaluation of UAN, STN and temporal logic. Our intention was to ask designers and users to provide qualitative usability assessments of interface design notations. Such analyses have previously been performed by academics and researchers in human-computer interaction.
The user interfaces to a window manager and a turbine control system were described using UAN, STN and temporal logic. These descriptions were then presented to the designers and users of these applications. They were asked to answer a series of comprehension questions to test their understanding of the design. They were also asked some qualitative questions about the use of the notation. The results showed that all subjects achieved high levels of comprehension for both trials and all notations. In spite of this, our subjects expressed a strong preference for natural language descriptions.
The most surprising finding from our study was that the highest number of comprehension errors were observed for notations with the most positive attitude statements. Temporal logic gained significantly better results for the turbine interface than either UAN or STN. This was in spite of the fact that it also provoked the most negative reaction.
Much further research remains to be done. In particular, the work described in this paper has exposed the inadequacies of existing evaluation techniques for the validation of interface design notations. Laboratory studies tend to ignore the everyday pressures of designers' working environments. Think aloud techniques can be difficult to apply with novice designers who are still learning novel notations. Questionnaires provide only limited evidence about the long term effectiveness of a design technique. This is a critical problem. Unless we can validate our techniques, it will be difficult to persuade designers and companies to invest in the specification and verification of interactive systems (Johnson, 1995).
ACKNOWLEDGEMENTS
Thanks are due to Jarle Gjøsæther who helped to design and run the experiments that are described in this paper. Thanks are also due to Phil Gray who helped with the application of UAN. Gilbert Cockton helped with our use of the STN notation. Steve McGowan, Paddy O'Donnell and Steve Draper also helped with the comparative analysis. This work is supported by UK EPSRC grant number GR/K55042 and by JCI Grant number SPG-9201233.
REFERENCES
Bastide, R. and Palanque, P. Petri Nets With Objects For The Design, Validation And Prototyping Of User-Driven Interfaces. In Diaper, D., Gilmore, D., Cockton, G. and Shackel, B. (eds.) Interact'90. 625-631. Elsevier Science, 1990.
Buckingham-Shum, S. Cognitive Dimensions Of Design Rationale. In D. Diaper and N. Hammond (eds.), People And Computers VI: Proceedings Of HCI'91, Cambridge University Press, Cambridge, 1991.
Campion, J., Interface The Laboratory With The Real World. In Long, J. and Whitefield, A. (eds.) Cognitive Ergonomics And HCI. 35-65. Cambridge University Press, Cambridge, 1989.
Gray, P. and Johnson, C.W. A Critical Analysis Of Interface Specification Notations. In Palanque, P. and Bastide, R. (eds.), Design, Specification And Verification Of Interactive Systems '95, 113-133, Springer Verlag, Wien, 1995.
Green, M. Design Notations And User Interface Management Systems. In G.E. Pfaff (ed.), User Interface Management Systems, 89-107, Springer-Verlag, Berlin, 1985.
Green, T. Cognitive Dimensions of Notations. In A. Sutcliffe and L. Macaulay (eds.), People And Computers V: Proceedings Of HCI'89, 443-460, Cambridge University Press, Cambridge, 1989.
Harel, D. On Visual Formalisms. Communications of the ACM, (31)5:514-530, 1988.
Hix, D. and Hartson, H.R. Developing User Interfaces, John Wiley and Sons, New York, 1993.
Jack, A. It's Hard To Explain But Z Is Much Clearer Than English, The Financial Times, 22, 12 April 1992.
Johnson, C.W. A Probabilistic Logic For The Development of Safety-Critical Interactive Systems, International Journal Of Man-Machine Studies, (39)2:333-351, 1993.
Johnson, C.W. The Economics Of Interface Design. In K. Nordby, P.H. Helmersen, D. Gilmore and S.A. Arnesen (eds.), Human Computer Interaction - Proceedings Of IFIP Interact '95, 19-25, Chapman and Hall, New York, 1995.
Johnson, C.W. The Challenge of Time. In Palanque, P. and Bastide, R. (eds.), Design, Specification And Verification Of Interactive Systems '95, 345-357, Springer Verlag, Wien, 1995a.
Johnson, C.W. Literate Specification. Software Engineering Journal (to appear in 1996).
Johnson, C.W. and Harrison, M.D. Software Engineering For Human Computer Interaction, SIGCHI Bulletin, (26)2:46-48, 1994.
Johnson, C.W., McCarthy, J., and Wright, P.C. Using A Formal Language To Support Natural Language In Accident Reports, Ergonomics, (38)6:1265-1283, 1995.
National Aeronautic and Space Administration, Advanced Orbiting Systems - Architectural Specification For The CCSDS Secretariat, Washington DC, 1989.
Olsen, D.R. Presentational Syntactic And Semantic Components Of Interactive Dialogue Specification. In G.E. Pfaff (ed.), User Interface Management Systems, 125-133, Springer-Verlag, Berlin, 1985.
Scholtz, J. and Wiedenbeck, S., Learning To Program In Another Language. In Diaper, D., Gilmore, D., Cockton, G. and Shackel, B. (eds.), INTERACT'90. Elsevier, North Holland, 1990.
UK Ministry of Defence, Requirements for the Procurement of Safety Critical Software, London, MOD DEF-STAN 00-55, 1991.
Wright, P.C. and Monk, A.F. A Cost Effective Evaluation Method For Use By Designers. International Journal of Man-Machine Studies, (35)6:891-912, 1991.