The present invention relates to computerized speech recognition. More particularly, the present invention relates to an apparatus and methods to improve the manner in which speech recognition systems react to recognition errors and/or ambiguity.
Speech recognition is a technology that has a number of useful applications that allow people to interface with computing systems using their voices. These applications include: allowing a user to dictate text into a document; allowing a user to issue commands to one or more computer programs via speech; improving automated telephony systems; and many other applications. Such systems are useful in large centralized-server applications, such as computerized telephony processing systems; in user interaction with desktop computing products; and even in improved interaction with, and control of, mobile computing devices.
Speech recognition is known and is being actively researched as perhaps the future of human interaction with computing devices. While speech recognition technology has progressed rapidly, it has not been perfected. Speech recognition requires substantial computing resources and has not achieved 100% recognition accuracy. This is partly due to inherent ambiguities in human language, and also due, in part, to varying domains over which user speech may be applied.
Current desktop speech recognition systems typically listen for up to three classes of speech. The first class is free form dictation where the recognized text is simply inserted into the document that currently has focus. An example of dictation might be, “John, have you received the report that I sent you yesterday?” The second class of speech is commands in the form of simple names of menus or buttons. Examples of this class of speech include “File,” “Edit,” “View,” “OK,” et cetera. When a command word is recognized, the item it represents will be selected or “clicked” by voice (i.e., the File menu would open when “File” is recognized). The third class is commands in the form of verb-plus-object command pairs. Examples of this class of speech include: “Delete report,” “Click OK,” and “Start Calculator.” The “Start Calculator” command, when properly recognized, will launch the application called Calculator.
By listening for all three classes, the system does not require the user to indicate, before speaking, whether they want to enter text by voice or give a command by voice. The speech recognition system determines this automatically. Thus, if a user utters “Delete Sentence,” the current sentence will be deleted. Additionally, if the user says, “This is a test,” the words “This is a test” will be inserted into the current document. While this intuitive approach vastly simplifies the user experience, it is not without limitation. Specifically, when a user intends to give a verb-plus-object command, and either the verb or the object is misrecognized or the recognition otherwise fails, the attempted command will be treated as dictation and inserted into the document.
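The three-class behavior described above, and the way a failed command falls through to dictation, can be sketched roughly as follows. This is an illustrative Python sketch only, not the implementation of any actual recognizer: the vocabularies, the `Session` class, and the `handle` function are hypothetical names introduced for illustration, and a real system would weigh recognition hypotheses rather than match strings.

```python
from dataclasses import dataclass, field

# Hypothetical vocabularies, chosen only to mirror the examples in the text.
MENU_COMMANDS = {"file", "edit", "view", "ok"}
VERBS = {"delete", "click", "start"}
OBJECTS = {"report", "ok", "calculator", "sentence"}


@dataclass
class Session:
    """Hypothetical stand-in for the focused document and the UI actions taken."""
    document: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def handle(recognized: str, session: Session) -> str:
    """Route recognized text to one of the three classes of speech."""
    words = recognized.lower().strip(".,!?").split()

    # Class 2: a simple menu or button name selects that item.
    if len(words) == 1 and words[0] in MENU_COMMANDS:
        session.actions.append(("select", words[0]))
        return "menu"

    # Class 3: a verb-plus-object pair is executed as a command.
    if len(words) == 2 and words[0] in VERBS and words[1] in OBJECTS:
        session.actions.append((words[0], words[1]))
        return "verb-object"

    # Class 1 (fallback): anything else is treated as dictation and
    # inserted into the document, including misrecognized commands.
    session.document.append(recognized)
    return "dictation"


session = Session()
handle("Delete sentence", session)   # executed as a command
handle("This is a test", session)    # inserted as dictation
handle("Delete sentience", session)  # misrecognized command, inserted as text
```

In this sketch, the misrecognized utterance “Delete sentience” matches no command and is therefore appended to the document as if it were dictation, which is the limitation the paragraph above describes.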
The erroneous insertion of an attempted verb-plus-object command into a document creates a compound error situation. Specifically, the user must now undo the erroneously inserted text, and then re-speak their command. The fact that the user has to follow more than one step when the verb-plus-object command is misrecognized is what turns the misrecognition error into a “compound error.” Compound errors quickly frustrate a user and can easily color the user's impression of speech recognition. Thus, a speech recognition system that could reduce or even eliminate such errors would improve users' experience with speech recognition in general.