The invention relates to speech recognition and, more particularly, to apparatus and methods for identifying homophones among words in a speech recognition system.
It is generally very difficult to identify which words in an existing vocabulary of a speech recognition engine are or may be confusible with other words in the vocabulary. That is, when a user utters one word that the speech recognizer has been trained to decode, it is possible that the speech recognizer will output the wrong decoded word. This may happen for a variety of reasons, but one typical reason is that the word uttered by the speaker is acoustically similar to other words considered by the speech recognition engine. Such mistakes appear in the output of the recognizer, either as a misrecognized word or as a word dropped from the N-best list which, as is known, contains the top N hypotheses for the uttered word.
In addition, with the advent of large-vocabulary name recognition by speech (e.g., a voice telephone dialing application), the problem of resolving which particular spelling of a word was intended by the speaker, when many possible spellings exist within the vocabulary, has added to the difficulty. For example, the two spellings "Gonzalez" and "Gonsalez" result in similar but perhaps not identical baseforms, as shown below:
GONZALEZ | G AO N Z AO L EH Z
GONSALEZ | G AO N S AO L EH Z
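To make the similarity of these two baseforms concrete, the following is a minimal sketch that counts the phone-level differences between them using a plain Levenshtein distance. The function name `phone_edit_distance` and the use of unit edit costs are assumptions made for illustration; this is not the similarity measure of the invention, which is not described in this section.

```python
def phone_edit_distance(a, b):
    """Levenshtein distance between two phone sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, phone_a in enumerate(a, start=1):
        curr = [i]
        for j, phone_b in enumerate(b, start=1):
            substitution = 0 if phone_a == phone_b else 1
            curr.append(min(prev[j] + 1,                   # delete phone_a
                            curr[j - 1] + 1,               # insert phone_b
                            prev[j - 1] + substitution))   # match or substitute
        prev = curr
    return prev[-1]


# Baseforms taken from the example above, split into individual phones.
gonzalez = "G AO N Z AO L EH Z".split()
gonsalez = "G AO N S AO L EH Z".split()

distance = phone_edit_distance(gonzalez, gonsalez)
print(distance)  # the two baseforms differ in a single phone (Z vs. S)
```

A small distance relative to the baseform length suggests that the two spellings are likely to be confused by a recognizer, although a practical measure would presumably weight substitutions by acoustic similarity rather than treating all phones alike.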
Furthermore, many words result in the same baseforms, which are somewhat arbitrarily treated by the speech recognizer. This creates a problem that is often tackled by hand editing the entire vocabulary file, prior to any real-time decoding session, to attempt to remove such potential problems. However, this hand-editing method is not possible if large lists of names are to be automatically incorporated into the vocabulary of the speech recognizer.
This problem exists in other speech recognition areas and up to now has largely been corrected by using the manual approach or using the context to resolve the correct spelling. For example, the words "to", "two" and "too" are familiar examples of homonyms, i.e., words which have the same sound and/or spelling but have different meanings. The approach to detect which one of these words was actually meant when uttered by a speaker has traditionally been to use the context around the word. Some recognizers may even be capable of noting in advance that all three words have the same baseform, so that the distance from the spoken utterance to each of them will be the same, and may thus avoid scoring each of them separately.
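The check for words that share a baseform can be sketched as a simple grouping of vocabulary entries by their phone sequence. The lexicon entries, the phone strings for "to", "two", "too" and "ten", and the function name `group_by_baseform` are illustrative assumptions and are not taken from any particular recognizer.

```python
from collections import defaultdict


def group_by_baseform(lexicon):
    """Return the baseforms shared by more than one word, with their words."""
    groups = defaultdict(list)
    for word, baseform in lexicon:
        groups[tuple(baseform.split())].append(word)
    # Keep only baseforms that map to two or more spellings.
    return {phones: words for phones, words in groups.items() if len(words) > 1}


# Hypothetical lexicon of (word, baseform) pairs.
lexicon = [
    ("to", "T UW"),
    ("two", "T UW"),
    ("too", "T UW"),
    ("ten", "T EH N"),
]

shared = group_by_baseform(lexicon)
for phones, words in shared.items():
    print(" ".join(phones), "->", ", ".join(words))
```

Words in the same group need be scored acoustically only once, since they cannot be distinguished by sound alone; grouping by exact baseform, however, does not capture words that are merely acoustically similar.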
U.S. Pat. No. 4,468,756 to Chan discloses a method for processing a spoken language of words corresponding to individual, transcribable character codes of complex configuration which includes displaying a set of homonyms corresponding to a set of homonym set identifying codes. However, these homonyms and related codes are previously classified and stored in files in accordance with known rules of the particular spoken language (e.g., it is known that in Chinese, approximately 230 characters, among the approximately 2700 basic characters, are classified as homonyms). Then, whenever the spoken word corresponds to a word which was previously classified as a homonym, the method discloses using the code to access the homonym file and then displaying the known homonyms from that file. However, the Chan method is disadvantageously inflexible in that it is limited to the pre-stored classified homonyms. Therefore, among other deficiencies, the Chan method cannot perform real-time identification of words in a vocabulary that are acoustically similar to an uttered word and thus cannot display words that are not otherwise pre-classified and stored as homonyms.
Accordingly, it would be highly advantageous to provide methods and apparatus for substantially lowering the decoding error rate associated with a speech recognizer by providing an automatic real-time homophone identification facility for resolving the intended word in cooperation with the user without regard to known homophone rules of any particular spoken language. It would also be highly advantageous if the results of the homophone identification facility could be used in an off-line correction mode.
Further, it would be highly advantageous to use the output of the homophone identification facility to add homophones to the N-best list produced by the speech recognizer. The list could then be used for re-scoring, both acoustic and language model, or error correction in dictation applications.
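As a rough illustration of this last point, the sketch below appends known homophones of each hypothesis to an N-best list before any re-scoring takes place. The function name `augment_n_best`, the dictionary-based homophone table and the sample entries are assumptions made for the example; how the homophones themselves would be identified is not addressed here.

```python
def augment_n_best(n_best, homophones):
    """Append homophones of each hypothesis, preserving order, without duplicates."""
    augmented = list(n_best)
    seen = set(n_best)
    for word in n_best:
        for alternative in homophones.get(word, ()):
            if alternative not in seen:
                seen.add(alternative)
                augmented.append(alternative)
    return augmented


# Hypothetical homophone table and recognizer output.
homophones = {"GONZALEZ": ["GONSALEZ"], "TWO": ["TO", "TOO"]}
n_best = ["GONZALEZ", "TWO", "TO"]

augmented = augment_n_best(n_best, homophones)
print(augmented)
```

Original hypotheses keep their rank and added homophones follow them, so a subsequent acoustic or language model re-scoring pass would determine the final order.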