Many business applications provide graphical user interfaces (GUIs) through which users perform a number of operations. Typically, GUIs are mouse- and keyboard-intensive, which can make them difficult or even impossible to use for many people, including those with physical disabilities. One type of interface that avoids a mouse or keyboard is a speech interface. A speech interface allows audio input of commands to communicate with applications, and can be used by anyone who wishes to speak to their system, such as mobile users with inadequately sized keyboards and pointing devices.
One of the main challenges for a speech interface is specifying the desired target of an audio input, especially in a GUI where multiple selectable objects, such as windows, text fields, and icons, can have the same label or name. In these situations, it is important for both the computer system and the user to know the current focus when an audio input is issued, to help the system resolve possible ambiguities and to help the user keep track of what they are doing.
One type of ambiguity is called “target ambiguity,” where the target of a user's action is ambiguous and must be resolved. In a physical interaction involving a mouse or other pointing device, users specify the target of their actions directly by clicking on the selectable object of interest. Target ambiguities caused by selectable objects that have the same name are resolved through the physical interaction. With audio input, users do not have a way to resolve target ambiguities physically; therefore, the target ambiguity must be handled in some other way.
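As a rough illustration of target ambiguity, the sketch below matches a spoken label against the selectable objects on a screen. The `SelectableObject` type, the `find_targets` helper, and the sample screen contents are hypothetical names introduced here for illustration only; they do not describe any particular speech interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectableObject:
    """Hypothetical stand-in for a GUI element such as a window, text field, or icon."""
    object_id: int
    label: str
    kind: str


def find_targets(spoken_label, objects):
    """Return every object whose label matches the spoken label, ignoring case."""
    wanted = spoken_label.strip().lower()
    return [obj for obj in objects if obj.label.lower() == wanted]


# Assumed example screen: two text fields share the label "Name".
screen = [
    SelectableObject(1, "Name", "text field"),
    SelectableObject(2, "Name", "text field"),
    SelectableObject(3, "Submit", "button"),
]

matches = find_targets("name", screen)

# More than one match means the audio input alone cannot identify the target.
is_ambiguous = len(matches) > 1
```

With a mouse, the click itself would pick one of the two "Name" fields. Here the label returns both, so some further step is needed to choose between them.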
Traditional speech interfaces for GUIs typically emulate the keyboard and mouse directly using spoken equivalents; however, they are slow to operate and often take considerably longer to select an object than conventional mouse or keyboard techniques. Conventional speech interfaces also lack public acceptance because they do not control the interface accurately.
Other traditional speech interfaces for GUIs combine audio input with alternative pointing devices such as head- or eye-tracking technologies. However, conventional alternative pointing devices require calibration and expensive equipment, making them difficult to set up and use on computers shared by multiple people.
Still other traditional speech interfaces provide object selection solely by audio input. These speech interfaces explore a current window or current screen area to find selectable objects that match the audio input. One limitation is that they explore only the current screen area and do not search the other screen areas for a match. Instead, additional audio inputs are required to look for matches in screen areas other than the current one. These additional audio inputs decrease the efficiency of conventional speech interfaces.
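The scope limitation described above can be sketched as follows. This is a minimal illustration under assumed data: the `areas` mapping, the area names, and the `match_in_area` helper are hypothetical and are not taken from any existing speech interface.

```python
def match_in_area(spoken_label, areas, area_name):
    """Return the labels in one screen area that match the spoken label, ignoring case."""
    wanted = spoken_label.strip().lower()
    return [label for label in areas.get(area_name, []) if label.lower() == wanted]


# Assumed screen with two areas; the "toolbar" area is taken to be the current one.
areas = {
    "toolbar": ["Save", "Print"],
    "form": ["Name", "Address", "Save"],
}

# Searching only the current area finds nothing for "address".
current_only = match_in_area("address", areas, "toolbar")

# Finding the object requires looking in the other area as well, which in the
# interfaces described above costs the user an additional audio input.
all_areas = {name: match_in_area("address", areas, name) for name in areas}
```

In this sketch the object labeled "Address" exists on screen but lies outside the current area, so a search restricted to the current area returns no match.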
Another limitation of these traditional speech interfaces involves their capability to resolve target ambiguity. These interfaces mark selectable objects matching the audio input with opaque icons for subsequent selection. However, placing an opaque icon adjacent to the selectable object often pushes screen elements out of place and distorts the screen layout. Furthermore, overlaying the selectable object with an opaque icon often obscures the underlying text and graphics. Finally, displaying an opaque icon for every object that matches the audio input often clutters the screen. Thus, traditional speech interfaces often fail to maintain the integrity of the screen layout and the view of the text and graphics of the selectable objects when resolving target ambiguities.