Much of the software used in business today takes the form of complex graphical user interfaces (GUIs). Complex GUIs allow users to perform many tasks simultaneously while maintaining the context of the rest of their work; however, such systems are often mouse- and keyboard-intensive, which can make them difficult or even impossible to use for many people, including those with physical disabilities. Voice interfaces can provide an accessible solution for physically disabled users, if steps are taken to address inherent usability problems, such as user efficiency and ambiguity handling. Additionally, voice interfaces may increase the efficiency of performing certain tasks.
Substantial resources have been expended to develop web-based applications that provide portable, platform-independent front ends to complex business applications using, for example, the hypertext markup language (HTML) and/or JavaScript™.
Because software applications have typically been developed with only the visual presentation in mind, little attention has been given to details that would facilitate the development of voice interfaces.
In most computer or data processing systems, user interaction is provided using only a video display, a keyboard, and a mouse. Additional input and output peripherals are sometimes used, such as printers, plotters, light pens, touch screens, and bar code scanners; however, the vast majority of computer interaction occurs with only the video display, keyboard, and mouse. Thus, primary human-computer interaction is provided through visual display and mechanical actuation. In contrast, a significant proportion of human interaction is verbal. It is desirable to facilitate verbal human-computer interaction to increase access for disabled users and to increase the efficiency of user interfaces.
Various technologies have been developed to provide some form of verbal human-computer interaction, ranging from simple text-to-speech voice synthesis applications to more complex dictation and command-and-control applications. The various types of verbal human-computer interaction applications may be described by two factors: (1) the presence or absence of a visual component; and (2) the extent to which the underlying application and interaction are changed when a voice interface is added.
Many research organizations building verbal human-computer interaction systems focus on the second factor: creating new interaction styles that may be used in conjunction with or in lieu of a visual display. For example, various organizations have created the following systems: CommandTalk; ATIS; TOOT; and ELVIS. Each of these systems focuses on providing improved models for verbal human-computer interaction, fundamentally changing the interaction style. For example, CommandTalk maintains a traditional GUI, while fundamentally changing the interaction style to improve usability. ATIS, an air travel information system, maintains a traditional visual component by presenting answers to user queries in a visual tabular format; however, ATIS modifies conventional interaction styles, moving from a database query interface to a natural language query interface. Similarly, TOOT, a train schedule information system, attempts to present tabular data to users; however, TOOT provides the tabular data by voice, eliminating the need for a visual component. Finally, the ELVIS system for accessing email messages by voice has been tested with several interaction styles, which differ from the visual interaction to varying degrees. The system-initiative style makes use of many of the same command names found in the visual interface, while the mixed-initiative style significantly changes conventional interactions.
Many commercial systems tend to maintain conventional interaction styles with varying degrees of visual components. Windows access tools such as ViaVoice™ and SUITEKeys mirror the keyboard/mouse interaction to a greater degree than any of the dialogue systems mentioned above. SUITEKeys even goes so far as to directly mimic the interaction of moving the mouse by hand and pressing individual keys. Similarly, many telephony applications mimic the keypad interaction directly with interactions that take the form of “press or say 1.”
Enormous resources have been used to develop business applications requiring complex GUIs to present large quantities of information, display complicated interactions within the information, and manage the complexity of maximizing user capability, configuration, and control. Existing applications provide limited support for controlling an application using voice. Some existing systems allow dictation or limited access commands; however, there is a need for systems and techniques to increase the extent of verbal human-computer interaction in conventional and legacy applications to provide increased accessibility for disabled users and increased efficiency of interaction for all users.
Voice-enabled interfaces generally present at least two modes: a data entry mode (in which a user enters data such as dictated text) and a navigation mode (in which the user navigates between, to, or from elements of the interface(s)). For some elements of the voice-enabled interfaces, such as single-select elements having only one possibility for data entry (e.g., a zip code field), transitions from the data entry mode to the navigation mode are easily determined (e.g., with a zip code field, the transition may automatically occur when all five digits are received). However, when an element is an “open interaction element” (OIE), such as a free-form text dictation field, it may be difficult to determine, in a manner that is convenient and reliable for the user, when the user has completed entries into such a field.
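The distinction between the two modes can be illustrated with a minimal Python sketch. It is not an implementation described in this document: the names `Mode`, `Field`, `VoiceSession`, `select`, and `hear` are hypothetical, and recognized speech is assumed to arrive as plain string tokens. The sketch shows only why a fixed-length field such as a zip code can return to the navigation mode automatically, while an open interaction element provides no comparable signal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    NAVIGATION = "navigation"
    DATA_ENTRY = "data_entry"


@dataclass
class Field:
    """Hypothetical interface element that accepts spoken input."""
    name: str
    # Number of characters after which entry is complete.
    # None marks an open interaction element (e.g., free-form dictation).
    fixed_length: Optional[int] = None
    value: str = ""


class VoiceSession:
    """Tracks the current mode and the element receiving data, if any."""

    def __init__(self) -> None:
        self.mode = Mode.NAVIGATION
        self.active: Optional[Field] = None

    def select(self, field: Field) -> None:
        # Navigating to an element switches to the data entry mode.
        self.active = field
        self.mode = Mode.DATA_ENTRY

    def hear(self, token: str) -> None:
        # Tokens heard outside the data entry mode are ignored in this sketch.
        if self.mode is not Mode.DATA_ENTRY or self.active is None:
            return
        field = self.active
        if field.fixed_length is None:
            # Open interaction element: the input itself never indicates
            # that the user is finished, so the mode does not change.
            field.value = (field.value + " " + token).strip()
            return
        field.value += token
        if len(field.value) >= field.fixed_length:
            # Fixed-length element: the transition is easily determined.
            self.mode = Mode.NAVIGATION
            self.active = None
```

In this sketch, selecting a field created as `Field("zip", fixed_length=5)` and speaking five digits returns the session to the navigation mode, whereas a field created as `Field("notes")` stays in the data entry mode regardless of how many words are dictated, which is the difficulty described above.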