Various automatic speech recognition (ASR) systems exist for recognizing speech to create transcripts of such speech and to control software applications. For example, one common use of ASR systems is to enable users to dictate text to be inserted into a word processing document and to control user interface (UI) elements of the word processing application (such as windows, menus, and dialog boxes). For instance, when using an ASR system in connection with a word processing application, it may be possible for the user to use a voice command to cause a “Font” dialog box to be displayed and then to use other voice commands to enter a font name into the dialog box and to click on an “OK” button within the dialog box to cause the desired font to be applied to selected text and/or to be applied to text subsequently typed by the user.
As is clear from even such a simple example, the target application (e.g., word processor) may have a variety of application states, such as a “text entry” state in which mouse and keyboard input provided by the user is interpreted as text to be inserted into the currently-open document and as commands for controlling the window containing the currently-open document, and a “font dialog box” state in which mouse and keyboard input provided by the user is interpreted as commands for controlling user interface elements of the “Font” dialog box and as text to be entered into text fields of the “Font” dialog box. Any ASR system that interacts with such an application must be capable of interacting correctly with the application based on the current state of the application.
One typical way to coordinate an ASR system with the state of the target application (e.g., word processor) is to tightly integrate the ASR system with the target application. For example, the target application may be designed or modified to be aware of the ASR system's speech recognition engine, to appropriately configure the speech recognition engine for use in various application states, and to interpret the speech recognition results appropriately in such application states. A word processing application, for instance, may be designed to configure the speech recognition engine to use a first particular language model when the word processing application is in a “text entry” state (such as a general English language model), and to configure the speech recognition engine to use a second particular language model when the word processing application is in a “font dialog box” state (such as a “font dialog box” language model which is limited to recognizing only the names of fonts currently installed on the target computer).
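The tight integration described above can be illustrated with a minimal sketch, in which the target application itself selects a language model for the speech recognition engine each time the application changes state. All names in the sketch (AppState, LanguageModel, SpeechEngine, WordProcessor, and their methods) are hypothetical and are assumptions made for illustration only; they do not correspond to any particular speech recognition product, and the recognize method accepts candidate text as a stand-in for real audio decoding.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import FrozenSet, Iterable, Optional


class AppState(Enum):
    """Hypothetical application states of the word processor."""
    TEXT_ENTRY = auto()
    FONT_DIALOG = auto()


@dataclass(frozen=True)
class LanguageModel:
    """Toy language model: a name plus an optional closed vocabulary."""
    name: str
    vocabulary: Optional[FrozenSet[str]] = None  # None means unrestricted

    def accepts(self, phrase: str) -> bool:
        return self.vocabulary is None or phrase.lower() in self.vocabulary


class SpeechEngine:
    """Hypothetical engine; 'recognize' takes text in place of audio."""

    def __init__(self) -> None:
        self.language_model: Optional[LanguageModel] = None

    def set_language_model(self, model: LanguageModel) -> None:
        self.language_model = model

    def recognize(self, phrase: str) -> Optional[str]:
        # Return the phrase only if the active model allows it.
        if self.language_model and self.language_model.accepts(phrase):
            return phrase
        return None


class WordProcessor:
    """Target application that is aware of, and configures, the engine."""

    def __init__(self, engine: SpeechEngine, installed_fonts: Iterable[str]) -> None:
        self.engine = engine
        fonts = frozenset(f.lower() for f in installed_fonts)
        # One language model per application state, chosen by the application.
        self.models = {
            AppState.TEXT_ENTRY: LanguageModel("general-english"),
            AppState.FONT_DIALOG: LanguageModel("font-dialog", fonts),
        }
        self.enter_state(AppState.TEXT_ENTRY)

    def enter_state(self, state: AppState) -> None:
        # Reconfigure the engine whenever the application state changes.
        self.state = state
        self.engine.set_language_model(self.models[state])


if __name__ == "__main__":
    engine = SpeechEngine()
    app = WordProcessor(engine, ["Arial", "Courier New"])
    app.enter_state(AppState.FONT_DIALOG)
    print(engine.recognize("Courier New"))
    print(engine.recognize("hello world"))
```

In this sketch the mapping from application state to language model lives inside the application, which is why the approach depends on the application having been designed or modified with the engine in mind.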
One benefit of such tight integration between the speech recognition engine and the target application is that it can increase speech recognition accuracy by enabling the speech recognition engine to use an appropriate language model and to otherwise be appropriately configured for each state of the target application. Another benefit of such tight integration is that the configuration of the speech recognition engine and the interpretation of the speech recognition engine's results do not have to be directly tied to visual features of the application state, such as the size, location, or text labels of buttons and other user interface elements.
A significant disadvantage of such tight integration, however, is that it requires the speech recognition engine and the target application to be designed or modified in advance to be integrated with each other in this way. Such integration, therefore, not only can require significant manual effort to tailor the speech recognition engine to interact appropriately with the various states of the target application, but also may require access to the source code or other internals of the target application that may not be available to the designer of the speech recognition engine. Furthermore, when relying on tight integration, the speech recognition engine is usable only with target applications with which it has been designed to tightly integrate. As a result, such a speech recognition engine will not be capable of achieving the same benefits when used in connection with target applications with which it has not been specifically designed to integrate, and may not even be capable of interacting correctly with modified versions of the same target application.
In general, it is desirable for automatic speech recognition systems to be usable with a wide variety of target applications, such as word processors, web browsers, email clients, and database applications, with high recognition accuracy in all of the various states of such target applications. Yet, as the discussion above illustrates, attempting to achieve such interoperability between automatic speech recognition systems and target applications through tight integration of the two can be tedious, time-consuming, and, in many cases, impractical. Various other approaches for enabling automatic speech recognition systems to interoperate with a wide variety of target applications in their various states have their own drawbacks. What is needed, therefore, are improved techniques for enabling automatic speech recognition systems to interoperate with a wide variety of target applications in the various states of such applications easily and with high recognition accuracy.