ASR technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. Many telecommunications devices are equipped with ASR technology to detect the presence of discrete speech such as a spoken nametag or control vocabulary like numerals, keywords, or commands. For example, ASR can match a spoken command word with a corresponding command stored in memory of the telecommunication device to carry out some action, like dialing a telephone number. Also, an ASR system is typically programmed with a predefined acceptable vocabulary that the system expects to hear from a user at any given time, known as in-vocabulary speech. For example, during a voice dialing mode, the ASR system may expect to hear keypad vocabulary such as “Zero” through “Nine,” “Pound,” and “Star,” as well as ubiquitous command vocabulary such as “Help,” “Cancel,” and “Goodbye.”
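The in-vocabulary matching described above can be sketched as a simple table lookup. This is a minimal illustration, not any particular device's implementation: the vocabulary entries and action strings are illustrative assumptions.

```python
# Hedged sketch: map a recognized in-vocabulary word to an action, and
# flag anything outside the expected vocabulary. Table contents and
# action strings are illustrative assumptions.

IN_VOCABULARY = {
    # keypad vocabulary expected during a voice dialing mode
    "Zero": "digit 0",
    "One": "digit 1",
    "Pound": "digit #",
    "Star": "digit *",
    # ubiquitous command vocabulary
    "Help": "play help prompt",
    "Cancel": "cancel current entry",
    "Goodbye": "end session",
}

def handle_utterance(word):
    """Map a recognized word to a stored action; reject out-of-vocabulary input."""
    return IN_VOCABULARY.get(word, "reject: out-of-vocabulary")

print(handle_utterance("Pound"))   # digit #
print(handle_utterance("Hello"))   # reject: out-of-vocabulary
```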
One problem encountered with voice dialing, and speech recognition generally, is that ASR systems sometimes misrecognize a user's intended input speech. Such ASR misrecognition includes rejection, insertion, and substitution errors. A rejection error occurs when the ASR system fails to interpret a user's intended input utterance. An insertion error occurs when the ASR system interprets unintentional input, such as background noise or a user cough, as an intended user input utterance. A substitution error occurs when the ASR system mistakenly interprets a user's intended input utterance as a different input utterance.
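The three error types above can be distinguished by comparing what the user intended with what the system produced. The following is a minimal sketch of that taxonomy; the function name and string labels are illustrative assumptions, not part of any real ASR API.

```python
# Hedged sketch of the rejection / insertion / substitution taxonomy.
# `None` for `intended` models unintentional input (noise, a cough);
# `None` for `recognized` models the system rejecting the input.

def classify_error(intended, recognized):
    """Classify one recognition result against the user's intent."""
    if intended is not None and recognized is None:
        return "rejection"      # system failed to interpret intended speech
    if intended is None and recognized is not None:
        return "insertion"      # unintentional input taken as an utterance
    if intended is not None and recognized != intended:
        return "substitution"   # intended word mistaken for a different word
    return "correct"

print(classify_error("Pound", None))     # rejection
print(classify_error(None, "Help"))      # insertion
print(classify_error("Pound", "Help"))   # substitution
```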
More particularly, a substitution error is usually due to confusability between similar sounding words. For example, a substitution error sometimes occurs where the keypad word “Pound” is misinterpreted as the command word “Help.” As a result, the ASR system may process the incorrect word, or may repetitively ask the user to repeat the command. In either case, the user can become frustrated.
One solution to this problem is to allow a user to indicate to the ASR system, after the fact, that the user's utterance was misrecognized. Thereafter, the ASR system presents the user with a list of recently received words and allows the user to select those words that were misrecognized. Then, the selected words are input to a speech training process, which modifies acoustic models to improve future recognition accuracy.
Another solution to this problem is to allow a user to train an out-of-vocabulary word into an in-vocabulary lexicon using a keyboard and a microphone. The system converts the text of the word and the user's pronunciation of the word into a phonetic description to be added to the lexicon. Initially, two possible phonetic descriptions are generated: one is formed from the text of the word using a letter-to-speech system, and the other is formed by decoding a speech signal representing the user's pronunciation of the word. Both phonetic descriptions are scored based on their correspondence to the user's pronunciation, and the phonetic description with the highest score is then selected for entry into the lexicon.
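The selection step above can be sketched as picking the higher-scoring of two candidate phone sequences. This is a toy illustration under stated assumptions: the letter-to-speech and acoustic-decoding stages are represented only by their outputs, and the phone-overlap scorer is a stand-in for a real acoustic score against the user's recorded pronunciation.

```python
# Hedged sketch of two-candidate selection: one phonetic description
# derived from spelling, one decoded from the user's speech signal;
# the higher-scoring candidate enters the lexicon. The scorer here is
# a toy phone-overlap count, an illustrative assumption.

def select_phonetic_description(text_candidate, speech_candidate, score):
    """Return whichever candidate phone sequence scores higher."""
    return max([text_candidate, speech_candidate], key=score)

# Reference decoding of the user's pronunciation (toy ARPAbet-like phones).
reference = ["p", "aw", "n", "d"]
toy_score = lambda phones: sum(1 for p in phones if p in reference)

best = select_phonetic_description(
    ["p", "ow", "n", "d"],   # from letter-to-speech on the spelling
    ["p", "aw", "n", "d"],   # from decoding the user's speech signal
    toy_score,
)
print(best)  # ['p', 'aw', 'n', 'd']
```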
There are several drawbacks to the above-mentioned solutions. They involve time-consuming user feedback loops or user-initiated word training. Also, they may be particularly distracting to a user who is driving a vehicle. Moreover, although these solutions may increase recognition performance on future utterances, they do not improve recognition performance on a current utterance. Accordingly, the ASR system may time out and impair a current communication session. Thus, a better method is needed for reducing confusability between similar sounding words to improve recognition performance of a current utterance.