Speech recognizers that are used in telephony and command-and-control applications do not recognize all possible words. Rather, they recognize a relatively small vocabulary. This vocabulary is expressed to a speech recognition engine in the form of a grammar. Grammars are normally expressed in a syntax such as Backus-Naur Form (BNF) or Java Speech Grammar Format (JSGF). An example grammar is shown below in a syntax similar to JSGF:

    public <Command> = <Meeting> | <Email>
    <Meeting> = Set up a meeting with <Friend> on Monday
    <Friend> = Frank | Fred
    <Email> = Read my email
This grammar describes 3 sentences:

    (1) Set up a meeting with Fred on Monday.
    (2) Set up a meeting with Frank on Monday.
    (3) Read my email.
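The expansion of such a grammar into the sentences it accepts can be sketched programmatically. The fragment below is illustrative only; the GRAMMAR table and the expand helper are hypothetical and are not part of any recognizer's API.

```python
# Hypothetical sketch: expand the example grammar's rules into the full
# set of sentences it accepts. Rule names in angle brackets are
# nonterminals; anything else is a terminal word sequence.

GRAMMAR = {
    "<Command>": [["<Meeting>"], ["<Email>"]],
    "<Meeting>": [["Set up a meeting with", "<Friend>", "on Monday."]],
    "<Friend>": [["Frank"], ["Fred"]],
    "<Email>": [["Read my email."]],
}

def expand(symbol):
    """Recursively expand a symbol into every sentence it can produce."""
    if symbol not in GRAMMAR:          # terminal: yields itself
        return [symbol]
    sentences = []
    for alternative in GRAMMAR[symbol]:
        combos = [""]
        for token in alternative:       # cross-product of each token's expansions
            combos = [(c + " " + o).strip()
                      for c in combos for o in expand(token)]
        sentences.extend(combos)
    return sentences

print(expand("<Command>"))  # the three sentences listed above
```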
The speech recognition engine (recognizer) uses DSP algorithms to match what the user says to elements in the currently active recognition grammar. (Details regarding the process by which a recognizer performs this match are known to those skilled in the art, and therefore are not provided. Moreover, understanding of this part of the process is not necessary for an understanding of the present invention.)
Speech recognition by computer is far from perfect. Though word recognition rates are generally better than 90%, failures still occur quite often, particularly over sequences of words. For example, assuming a 95% word accuracy rate (or a 5% word error rate), the chance of an error in recognizing sentence number (1) is 100% − (95%)^8 ≈ 34%, as depicted in FIG. 1. As the grammars involved become more complex, the recognition rates suffer. Moreover, recognition failures are both cumbersome and frustrating for the user, accounting for a large part of overall dissatisfaction with speech interfaces.
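The compounding arithmetic above can be checked directly; the 8-word sentence and the 95% per-word accuracy are taken from the example:

```python
# Compounding of per-word accuracy over an 8-word sentence
# (values taken from the example in the text).
word_accuracy = 0.95
words = 8  # "Set up a meeting with Fred on Monday."

sentence_accuracy = word_accuracy ** words   # about 0.66
sentence_error = 1.0 - sentence_accuracy     # about 0.34
print(f"{sentence_error:.0%}")               # prints 34%
```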
Many speech recognition engines support the concept of N-Best-style of voice recognition in response to voice recognition uncertainty. In this mode, the speech recognizer returns a list (up to N elements) that the user might have said, along with an indication of how confident the recognizer is of each potential match. The application software is then responsible for deciding which to use. For example, suppose that the currently active recognition grammar is the one described above. The user says “Set up a meeting with Fred on Monday”. There is some noise on the line, causing the recognizer to not be certain whether the user said “Fred” or “Frank”, but it matches everything else cleanly. The recognizer returns the sentences 1 and 2, indicating that they are equally likely. The application is now responsible for deciding how to respond.
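One common way an application consumes such a list can be sketched as follows. This is a recognizer-independent illustration only; the NBestHypothesis type and the confidence margin are assumptions, not any engine's actual API.

```python
# Illustrative sketch of N-best handling: accept the top hypothesis only
# when it clearly beats the runner-up; otherwise report ambiguity so the
# application can confirm with the user or re-prompt.

from dataclasses import dataclass

@dataclass
class NBestHypothesis:
    text: str
    confidence: float  # 0.0 .. 1.0, as reported by the recognizer

def choose(hypotheses, margin=0.10):
    """Return the clear winner, or None when the result is ambiguous."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    if len(ranked) == 1 or ranked[0].confidence - ranked[1].confidence >= margin:
        return ranked[0]
    return None  # ambiguous: the application must decide how to respond

n_best = [
    NBestHypothesis("Set up a meeting with Fred on Monday.", 0.62),
    NBestHypothesis("Set up a meeting with Frank on Monday.", 0.61),
]
print(choose(n_best))  # None: the two hypotheses are too close to call
```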
An application may confirm and allow the user the chance to correct information extracted from the user's speech. In many situations, this behavior will be required; in others it is optional or completely unnecessary.

Confirmation and correction may be explicit in a confirmation dialogue:
    U: Schedule a meeting with Fred.
    C: I think that you said Fred. Is this correct? Please say yes or no.
    U: Yes.
    C: For when shall I schedule the meeting?

Confirmation and correction may appear implicitly in other steps of the dialogue.
This style is harder to implement because the grammars can become large quickly.
    U: Schedule a meeting with Fred.
    C: For when shall I schedule a meeting with Frank?
    U: No, I said Fred.
    C: For when shall I schedule the meeting with Fred?

Confirmation and correction may be optional or not necessary at all. In the following example, there is no confirmation or correction; the user simply queries the system again. A confirmation in this example would have made the interface much more cumbersome for the user because recognition of the stock names is correct most of the time and the consequences of a recognition failure are very minor.
    U: What is Nortel at today?
    C: Ortel is at 197, up 3.
    U: What is Nortel at today?
    C: Nortel Networks is at 134, up 2.
Given the example grammar described in the introduction, the application must do one of the following:
    1. Query the user whether he said “Fred” or “Frank”.
    2. Cause a recognition failure and make the user repeat the command.
    3. Decide that the user is more likely to have said “Fred”.
    4. Decide that the user is more likely to have said “Frank”.
Option 1 requires the application to query the user, which in turn requires the addition of a state to the recognition dialogue. This creates additional design, debug, and maintenance cost and, from the user's viewpoint, makes the interface clumsier (more steps to get the same work done).
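The extra dialogue state that Option 1 entails can be sketched as follows. The state names and prompts here are hypothetical; the point is that the confirmation state exists only to resolve the recognizer's ambiguity.

```python
# Sketch of Option 1 as a small state machine. The CONFIRM state is the
# additional state the text describes; it adds a turn to every ambiguous
# interaction.

def dialogue(recognized, user_confirms):
    """Return the confirmed command text, or None if the user must repeat."""
    state = "CONFIRM"  # extra state needed only because of ambiguity
    while True:
        if state == "CONFIRM":
            print(f'C: Did you say "{recognized}"? Please say yes or no.')
            state = "DONE" if user_confirms else "REPROMPT"
        elif state == "REPROMPT":
            print("C: Please repeat your command.")
            return None
        else:  # DONE
            return recognized

print(dialogue("Set up a meeting with Fred on Monday.", user_confirms=True))
```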
Option 2 is an easier way out: no additional processing or dialogue definition is required, though the application must be designed to ask the user to repeat the command. This functionality will be required anyway, to handle true recognition failures (if, for example, the user says “What time is it?”, which is not included in the grammar in any form, or if there is significant noise on the line and the recognition process fails entirely). As with Option 1, however, the user is likely to experience frustration at having to repeat a command.
Options 3 and 4 require the application to guess about what the user said. Without some context information, there is no way for the application to choose between “Fred” and “Frank” if the speech recognizer is equally confident about the options.
Therefore, there remains a need to overcome the limitations of the above-described existing art, a need that is satisfied by the inventive structure and method described hereinafter.