Many computer programs have graphical user interfaces (GUIs), which allow users to interact with the programs using images rather than text commands. A GUI represents information and actions by displaying graphical icons and visual indicators on a screen. A user interacts with the GUI through manipulation of the graphical elements using a mouse, trackball, joystick, touch screen or other input device. For example, a user may control the location of a cursor displayed on the screen by moving a mouse. The user may indicate a choice by moving the cursor to an icon and pressing a button on the mouse (“clicking”). Similarly, the user may move the cursor to a location within a rectangle (“text box”) displayed on the screen, click to select the text box and then enter text into the box by typing on a keyboard. GUIs employ metaphors, such as buttons, check boxes, radio buttons, text boxes, pull-down lists (drop-down boxes), etc. (collectively referred to as “controls”), to facilitate human-computer interaction. GUIs are widely used in personal computers, embedded systems, such as automatic teller machines (ATMs) and point-of-sale terminals, hand-held devices, such as mobile telephones and MP3 players, and the like.
Although GUIs represent a significant advance over purely text-based user interfaces, GUIs nevertheless suffer from some shortcomings. For example, a user must be able and willing to manipulate a mouse or other pointing device and, in most cases, enter text by typing on a keyboard. People who do not have use of their hands, either because of a physical disability or because their hands are dedicated to performing some other task, cannot use GUIs.
Automatic speech recognition (“ASR” or “SR”) technology converts spoken words into text. For example, Dragon NaturallySpeaking speech recognition software from Nuance Communications, Inc., Burlington, Mass., may be used to dictate text into a word processing document or into a text box of a GUI, once the text box has been selected. In some cases, voice commands may be used to navigate among controls in a GUI. Two conventional approaches are available for speech-enabling a software application.
In one approach, a speech recognition application parses a displayed GUI at run time and attempts to identify controls, such as buttons and text boxes. The speech recognition application enables the user to navigate among the identified controls by uttering simple commands such as “Next field” and “Previous field.” In some cases, the speech recognition application can identify text displayed near the controls. For example, if the GUI is displayed by a browser rendering a web page, the speech recognition application may parse HTML code of the web page. Text that is displayed adjacent a control may be assumed to be a label for the control. The user is then able to navigate to a control by uttering “Go to xxx,” where “xxx” is the assumed label for the control. Once a control has been navigated to, the user may activate the control and/or use the control, such as by dictating into a text box.
However, correctly identifying displayed text as labels is error prone and sometimes impossible. For example, not all GUI controls have nearby text, and any nearby text may not actually correspond to the control and may not, therefore, provide a suitable navigation label for the control.
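The label heuristic described above, and the ways it fails, can be illustrated with a minimal sketch. The sketch assumes that a control is labeled by the closest preceding text in the HTML source, which is only one possible reading of “adjacent”; the class `LabelGuesser`, the function `guess_labels` and the sample markup are hypothetical names chosen for illustration and are not taken from any particular speech recognition product.

```python
from html.parser import HTMLParser

# Hypothetical sample page: one well-labeled control, one control with no
# nearby text, and one control preceded by unrelated text.
SAMPLE_HTML = """
<form>
  Patient name: <input id="name">
  <input id="dob">
  <p>Fields marked * are required</p>
  <input id="phone">
</form>
"""


class LabelGuesser(HTMLParser):
    """Pairs each <input> with the closest preceding, not yet used, text."""

    def __init__(self):
        super().__init__()
        self.last_text = None  # most recent non-blank text not yet claimed
        self.controls = []     # list of (control id, guessed label or None)

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Drop a trailing colon so "Patient name:" becomes "Patient name".
            self.last_text = text.rstrip(":")

    def handle_starttag(self, tag, attrs):
        # Assumption: only <input> elements are treated as controls here.
        if tag == "input":
            control_id = dict(attrs).get("id")
            self.controls.append((control_id, self.last_text))
            self.last_text = None  # a piece of text labels at most one control


def guess_labels(html):
    """Returns (control id, guessed label) pairs in document order."""
    parser = LabelGuesser()
    parser.feed(html)
    parser.close()
    return parser.controls


if __name__ == "__main__":
    for control_id, label in guess_labels(SAMPLE_HTML):
        print(control_id, "->", label)
```

In this sketch, the control `name` receives a usable label, the control `dob` receives no label because no text precedes it, and the control `phone` is assigned the unrelated text “Fields marked * are required,” which would make a poor navigation label.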
Advantageously, this approach to speech-enabling a GUI requires no effort on the part of a GUI developer. However, if suitable label text cannot be identified, the navigation capabilities provided by this approach are limited to simple navigation commands, i.e., “next” and “previous.” Furthermore, some GUIs include multiple pages or tabs, the contents of only one of which are displayed at a time. Such complex GUIs are often found in electronic medical record/electronic health record (EMR/EHR) and many other complex application programs. Unfortunately, using the first speech-enabling approach, it is not possible to navigate to, or activate, a control located on a page or tab other than the currently-displayed page or tab.
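The navigation limits of this first approach can be sketched as follows. The sketch assumes that the speech recognition application knows only the controls on the currently-displayed tab, and that “Next field” and “Previous field” wrap around at the ends of the list; the names `Control` and `Navigator`, the tab names and the wrap-around behavior are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Control:
    label: Optional[str]  # guessed label, or None if no label text was found
    tab: str              # name of the page or tab that contains the control


class Navigator:
    """Handles simple voice navigation over the controls that are on screen."""

    def __init__(self, controls: List[Control], visible_tab: str):
        # Only controls on the displayed tab can be found by parsing the screen.
        self.visible = [c for c in controls if c.tab == visible_tab]
        self.index = 0

    def handle(self, utterance: str) -> Optional[Control]:
        """Returns the newly selected control, or None if the command fails."""
        if not self.visible:
            return None
        command = utterance.strip().lower()
        if command == "next field":
            self.index = (self.index + 1) % len(self.visible)
        elif command == "previous field":
            self.index = (self.index - 1) % len(self.visible)
        elif command.startswith("go to "):
            wanted = command[len("go to "):]
            for i, control in enumerate(self.visible):
                if control.label and control.label.lower() == wanted:
                    self.index = i
                    break
            else:
                return None  # no visible control carries that label
        else:
            return None  # unrecognized command
        return self.visible[self.index]


if __name__ == "__main__":
    controls = [
        Control("Name", "Demographics"),
        Control(None, "Demographics"),
        Control("Dosage", "Medications"),
    ]
    nav = Navigator(controls, visible_tab="Demographics")
    print(nav.handle("Go to name"))    # reachable by label
    print(nav.handle("Next field"))    # unlabeled control, reachable only by stepping
    print(nav.handle("Go to dosage"))  # None: the control is on a hidden tab
```

In this sketch, the unlabeled control can be reached only by stepping with “Next field” or “Previous field,” and the control on the hidden “Medications” tab cannot be reached at all.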
Another approach to speech-enabling a GUI provides richer and more complete speech navigation. However, this approach requires specifically designing the GUI for speech recognition. A developer who is skilled in voice user interface (VUI) design constructs a “dialog” that guides a user through voice interactions with the application. Thus, the speech recognition capabilities and programming are deeply integrated in the GUI. Furthermore, the language used to specify the voice interactions is distinct from the language used to describe the GUI. Although this approach can lead to a rich and sophisticated VUI, the development process is long and expensive, and it requires a skilled developer.
Application developers are, therefore, faced with a technical problem of speech-enabling complex GUIs, such as multi-page or multi-tab GUIs, without constructing speech dialogs when the GUIs are designed (“GUI design time”).