1. Field of the Invention
The present invention relates to speech recognition systems and, more particularly, to disambiguating speech inputs provided to such a system.
2. Description of the Related Art
Speech recognition refers to the ability of a machine or program to convert user speech into a textual representation or string that can be easily manipulated by a computer. Once speech has been so converted, the information can be used in a variety of different ways. For example, speech recognition technology allows computers to respond to user speech commands in the context of command and control. In another example, speech recognition technology enables computers to take dictation.
Generally, a speech recognition system (SRS) performs an acoustic analysis upon a received speech input and generates information relating to the pronunciation of the speech input. This data, which provides a phonetic representation of the speech input, then can be compared with a vocabulary of recognizable words or a set of defined grammars to determine a match. A statistical language model also can be used to aid in the recognition process. The statistical language model provides context within which a potential recognition result can be evaluated. That is, given a string of one or more words derived from a user-spoken utterance, the statistical language model can indicate, with an associated probability, what the next word of the string is likely to be.
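By way of illustration only, the next-word prediction described above can be sketched with a simple bigram model. The function names, the toy training corpus, and the choice of a bigram model are assumptions made for this example; they are not drawn from any particular SRS, and practical language models are considerably more elaborate.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Return relative frequencies of the words seen after `previous`."""
    following = counts.get(previous.lower())
    if not following:
        # No context is available, so the model offers no guidance.
        return {}
    total = sum(following.values())
    return {word: count / total for word, count in following.items()}


# Hypothetical training phrases, chosen only for this sketch.
corpus = ["call john smith", "call john jones", "call mary smith"]
model = train_bigram(corpus)
probs = next_word_probabilities(model, "call")
print(probs)
```

In this sketch, "john" follows "call" in two of the three training phrases, so it receives twice the probability of "mary". A word that was never observed as a preceding word yields an empty result, which mirrors the situation in which no context is available.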
SRSs have achieved acceptable levels of accuracy with respect to recognition of phrases comprising a plurality of words. When phrases of words are evaluated, the constituent words usually are acoustically dissimilar and, thus, can be differentiated from one another. The use of a language model provides an additional means of disambiguating one word from another. In other cases, however, such as when recognizing individual words, and particularly proper nouns or individual characters, speech recognition tends to be less accurate. One reason for this is that generating a grammar of all difficult-to-recognize words, such as names, is very difficult, if not impossible. Also, when recognizing individual words, contextual models provide no additional insight because there are no neighboring words to supply context.
One proposed solution for recognizing these more difficult words has been to ask users to spell the word being provided as input. The user is asked to speak each letter or character of the intended word. Letter input, however, can be ambiguous due to the brevity of the utterance and the acoustic confusability of the letters. In English, for example, it is difficult to distinguish between the letters F and S. Other confusingly similar characters can include B, C, D, E, G, P, T, V, and Z. Further, as with the recognition of individual words, language models do not provide additional information for disambiguating individual letters.
In consequence, it becomes necessary to disambiguate the spelling input using other means. Typically, disambiguation is performed using a combination of N-best matching and querying of the user. The user is asked by the SRS whether a potential recognition result for each spoken letter is correct. For example, for each recognized letter, the user can be queried as follows: “Did you say E?”, “Did you say B?”, “Did you say D?”, etc., continuing down the N-best list of commonly confused letters associated with the potential recognition result until the user responds affirmatively. This continues until the entire word is spelled and recognized.
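For illustration, the conventional letter-by-letter confirmation described above can be sketched as follows. The function name, the callback signature, and the sample N-best lists are hypothetical and serve only to show how the number of queries accumulates; an actual SRS would pose each query through a speech or display interface.

```python
def confirm_spelling(n_best_lists, answer):
    """Walk each letter's N-best list, querying until the user says yes.

    n_best_lists: one list of candidate letters per spoken letter,
                  ordered from most to least likely.
    answer:       callable(position, candidate) -> bool, standing in for
                  the user's reply to "Did you say <candidate>?".
    Returns the confirmed letters and the total number of queries posed.
    """
    confirmed = []
    queries = 0
    for position, n_best in enumerate(n_best_lists):
        chosen = None
        for candidate in n_best:
            queries += 1
            if answer(position, candidate):
                chosen = candidate
                break
        # None marks a letter for which every candidate was rejected.
        confirmed.append(chosen)
    return confirmed, queries


# Simulated user who intends to spell "BED"; the N-best lists are
# assumed recognizer output for three acoustically similar letters.
intended = "BED"
n_best_lists = [["E", "B", "D"], ["E", "B", "D"], ["E", "B", "D"]]
letters, queries = confirm_spelling(
    n_best_lists, lambda position, candidate: intended[position] == candidate
)
print(letters, queries)  # ['B', 'E', 'D'] 6
```

Even for this three-letter word, six separate queries are required, which illustrates how the number of exchanges grows with word length and with the depth of each N-best list.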
This method of letter-by-letter, question-answer style disambiguation can be very tedious and time-consuming for users. It would be beneficial to have a technique for recognizing and/or verifying word input in a manner that overcomes the deficiencies described above.