The present invention generally pertains to user interaction with a computing system. More specifically, the present invention pertains to interactions implemented in the context of a speech recognition system.
There is a trend toward implementation of a natural user interface (NUI) as the next generation of user interface. Much attention has been paid to the improvement of related speech recognition technology. However, additional challenges lie in addressing the usability of such an interface, particularly in the context of a visual interface associated with an application or desktop that supports user interactions implemented in the context of a speech recognition system. There remains a need for a system that enables a user to utilize speech recognition to select any visually indicated control (menu, button, hyperlink, text field, etc.) on a display screen in an efficient and controlled manner, and in a way that is not dependent on traditional physically initiated interaction.
One way to configure a speech recognition input system is to enable a user to select an item displayed on a screen by saying the name of that item. For example, saying “File” will open the file menu, and saying “OKAY” will initiate the action associated with an OK button. One way to implement such functionality is to configure an application to use Accessible Interfaces to programmatically expose GUI controls to assistive technologies, such as speech recognition. For example, the file menu would be exposed by the application declaring that there is an item called “File” located at coordinates 30, 10 to 70, 35. The speech recognition software then reads which GUI controls are available, and uses this information to construct a list (i.e., a grammar) of statements the user is expected to speak. For example, the list might contain “File”, “Edit”, “View” (menu item buttons), as well as “Open”, “Bold”, “Bullets” (tool bar buttons), or “scroll up”, “scroll down” (scroll bar items), and/or other items. When a user speaks a listed item (e.g., “bold”), the speech recognition software calls the application via the Accessible Interfaces to ‘click’ the appropriate item (e.g., click the “bold” button), thereby initiating an appropriate response.
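The grammar-building flow described above can be sketched as follows. This is only an illustrative sketch, not the interface of any actual accessibility API: the `Control` record, `build_grammar`, and `handle_utterance` are hypothetical names, and the recognizer is assumed to deliver the spoken phrase as plain text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Control:
    """Hypothetical record for a GUI control exposed by an application."""
    name: str                          # programmatic name, may be empty
    bounds: Tuple[int, int, int, int]  # (left, top, right, bottom)
    invoke: Callable[[], str]          # stands in for the 'click' call


def build_grammar(controls: List[Control]) -> Dict[str, Control]:
    """Build the list of expected phrases from the exposed controls.

    Controls without a programmatic name cannot be added, so they
    remain unreachable by voice.
    """
    grammar: Dict[str, Control] = {}
    for control in controls:
        if control.name:
            grammar[control.name.lower()] = control
    return grammar


def handle_utterance(grammar: Dict[str, Control], utterance: str) -> Optional[str]:
    """Invoke the control whose name matches the utterance, if any."""
    control = grammar.get(utterance.strip().lower())
    if control is None:
        return None  # phrase is not in the grammar; nothing is clicked
    return control.invoke()


# Assumed example data; the "File" coordinates follow the text above.
controls = [
    Control("File", (30, 10, 70, 35), lambda: "opened File menu"),
    Control("Bold", (80, 40, 100, 60), lambda: "clicked Bold button"),
    Control("", (10, 100, 200, 120), lambda: "focused text box"),
]
grammar = build_grammar(controls)
```

In this sketch, `handle_utterance(grammar, "bold")` reaches the Bold button, whereas the unnamed text box never enters the grammar and therefore cannot be selected by name.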
The described functionality works well in many cases as an efficient way to manipulate GUI controls by voice, but it breaks down under certain scenarios. In one such scenario, an application omits a programmatic name for a control (e.g., a name for a particular text box is omitted). The user then has no way to identify that control and therefore cannot use voice to manipulate it.
Another scenario that can present a challenge arises when the name the user expects to say does not match the actual programmatic name of the control. Under these circumstances, the user will not typically speak the name of the control and thus will not be able to manipulate it. For example, if an image in an Internet browser has a name of “Flag” yet the image depicts an emblem with the text “House of Windsor”, the user will naturally try to say “House of Windsor” and never know that they were meant to say “Flag”. In another example, if a button in a media player application is programmatically called “WnEqBtn1” yet the text on the label reads “Equalizer settings,” the user will never know to say “WnEqBtn1”.
Yet another scenario presenting a challenge arises when an application depicts something graphically yet the user does not know what it is called (even if it is named ‘appropriately’). In this case, the user will not know how to speak the name of the control to manipulate it. For example, a round button with an arrow therein on an operating system task bar may be called “Show Hidden Icons” but all the user can see is the graphical representation. In another example, an icon on a toolbar graphically indicates a functionality of drawing a border around a table. The majority of users may not know that the icon is called “Outside Border”, even if they do know that it is the button they need to press to draw a border around something.
In all the scenarios discussed above, even users dependent on speech recognition can use voice-enabled keyboard emulation (“Computer Press Tab Tab Tab Enter”) or mouse simulation (“Mousegrid 1 3 4 7 Click”) to select the item they want to manipulate. Such methods enable the user to solve each of the challenges associated with the described scenarios, though at a cost of (a) a significant decrease in efficiency and (b) an increased chance of error due to a limited capacity for precision.
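The efficiency cost of mouse simulation can be illustrated with a small sketch. It assumes, purely for illustration, that a “Mousegrid” command divides the current region into a 3×3 grid numbered 1 to 9 row by row and that each spoken digit narrows the region to one cell; the function name `mousegrid_target` and the screen size are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def mousegrid_target(screen: Rect, cells: List[int]) -> Tuple[float, float]:
    """Return the pointer position reached after speaking each cell number.

    Assumption: each number selects one cell of a 3x3 grid laid over the
    current region, numbered 1..9 from the top-left, row by row.
    """
    left, top, right, bottom = screen
    for cell in cells:
        if not 1 <= cell <= 9:
            raise ValueError(f"cell must be in 1..9, got {cell}")
        row, col = divmod(cell - 1, 3)
        width = (right - left) / 3
        height = (bottom - top) / 3
        left, top = left + col * width, top + row * height
        right, bottom = left + width, top + height
    # The click lands at the center of the final, smallest region.
    return ((left + right) / 2, (top + bottom) / 2)
```

Under these assumptions, each spoken digit shrinks the region by a factor of three per axis, so a small control on a large display takes several digits to reach, and a single misrecognized digit sends the pointer to the wrong area.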
One way to overcome the described challenges associated with speech recognition selection would be to draw a static set of numbers over everything on a display. The user could then simply speak the number that corresponds to the item desired for selection. A disadvantage associated with this approach is that it is not uncommon for an application display to incorporate so many numbers that a user cannot clearly determine which number corresponds to an item they want to select. One solution for the crowded numbering problem is to incorporate multiple layers of numbering (e.g., choose word processing application, then choose toolbar area, then choose toolbar, then choose a button on the toolbar, wherein each selection incorporates identification from a limited set of numbers). The layers enable a user to step through a GUI to the item they wish to select. This method of selection is relatively inefficient, and attempts to reduce the number of layers can require solving problems having great mathematical complexity.
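The trade-off between the two numbering schemes can be sketched as follows. This is a minimal illustration under stated assumptions: the GUI is modeled as a hypothetical tree of `Node` objects, `flat_numbering` stands for the static overlay, and `select_by_layers` stands for the layered scheme; none of these names come from an actual system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical GUI element; leaves are the selectable controls."""
    label: str
    children: List["Node"] = field(default_factory=list)


def flat_numbering(root: Node) -> Dict[int, str]:
    """Static overlay: give every selectable control its own number."""
    leaves: List[str] = []

    def walk(node: Node) -> None:
        if not node.children:
            leaves.append(node.label)
        for child in node.children:
            walk(child)

    walk(root)
    return {i + 1: label for i, label in enumerate(leaves)}


def select_by_layers(root: Node, numbers: List[int]) -> str:
    """Layered scheme: each spoken number picks one child of the current layer."""
    node = root
    for number in numbers:
        if not 1 <= number <= len(node.children):
            raise ValueError(f"no item numbered {number} in '{node.label}'")
        node = node.children[number - 1]
    return node.label


def example_tree() -> Node:
    """Assumed example layout mirroring the layers named in the text."""
    toolbar = Node("toolbar", [Node("Bold"), Node("Italic")])
    toolbar_area = Node("toolbar area", [toolbar])
    word_processor = Node("word processing application", [toolbar_area])
    browser = Node("browser", [Node("Back")])
    return Node("desktop", [word_processor, browser])
```

With this example tree, the flat overlay needs only one spoken number per control but must display a number for every control at once, while the layered scheme keeps each set of numbers small at the cost of four spoken selections (`[1, 1, 1, 2]`) to reach the Italic button.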