The invention relates to speech recognition and, more particularly, to apparatus and methods for identifying potential acoustic confusibility among words.
Many existing and proposed speech recognition-based systems, which employ speech signals as inputs, allow the user to customize the speech recognition vocabulary associated with the system. In particular, application developers using a speech recognition engine typically want to be able to expand the recognition vocabulary of the system. In the case where the speech recognition system is used to provide command or control functions to some other application (e.g., voice dialing, security access authorization, etc.), expanding the vocabulary effectively expands the set of command words and phrases which the user may employ to command and control the particular application. Unfortunately, while methods for such expansion are known, some expansion choices are more appropriate than others. Some choices are inherently poor because they are more prone to acoustic confusion. Acoustic confusion is the situation where a word or phrase uttered by a user is mis-recognized due to its acoustic similarity to another word or phrase in the speech recognition vocabulary. Minimization of acoustic confusion is especially important for command and control interfaces implemented with speech recognizers, which inherently have some non-zero, but finite, error rate.
Application developers typically do not wish to gain an in-depth understanding of the capabilities of the recognition engine when seeking to expand the command sets employed in their applications. Unfortunately, choosing an optimal vocabulary expansion often requires some experience with the capabilities of the recognition engine. Moreover, it is generally very difficult to determine which words are, or will be, confusible for a speech recognition engine.
With the advent of large vocabulary name recognition employing speech, the problem of resolving which particular spelling of a word was intended by the speaker, when many possible spellings exist within the vocabulary, has added to the difficulty. For example, the two words "waste" and "paste" may be poor choices for commands due to the potential confusion in decoding these similar-sounding uttered words. However, replacing the word "waste" with "erase", "delete", "eliminate", "cut", or even "trash" results in much better discriminant capabilities.
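The "waste"/"paste" example can be illustrated with a minimal sketch. It assumes a small hand-written pronunciation dictionary (the ARPAbet-style phone strings below are hypothetical) and uses plain edit distance over phone sequences as a crude stand-in for acoustic similarity. Neither the dictionary nor the distance measure is taken from this description; an actual recognizer would rely on its own acoustic models.

```python
from itertools import combinations

# Hypothetical toy pronunciation dictionary (ARPAbet-style phones).
# These entries are illustrative assumptions, not a real recognizer lexicon.
BASEFORMS = {
    "waste": ["W", "EY", "S", "T"],
    "paste": ["P", "EY", "S", "T"],
    "erase": ["IH", "R", "EY", "S"],
    "cut": ["K", "AH", "T"],
    "trash": ["T", "R", "AE", "SH"],
}


def phone_edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, phone_a in enumerate(a, start=1):
        curr = [i]
        for j, phone_b in enumerate(b, start=1):
            cost = 0 if phone_a == phone_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def flag_confusible(vocab, threshold=1):
    """Return (word1, word2, distance) for pairs at or below the threshold."""
    flagged = []
    for w1, w2 in combinations(sorted(vocab), 2):
        d = phone_edit_distance(vocab[w1], vocab[w2])
        if d <= threshold:
            flagged.append((w1, w2, d))
    return flagged


if __name__ == "__main__":
    for w1, w2, d in flag_confusible(BASEFORMS):
        print(f"possible confusion: {w1!r} vs {w2!r} (phone distance {d})")
```

Under these assumed pronunciations, only the pair "paste"/"waste" falls within the threshold, while "erase", "cut" and "trash" are each several phone edits away from "paste". Phone edit distance ignores the fact that some phones are acoustically closer than others, so it is only a rough proxy.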
Furthermore, many words result in the same baseforms which are somewhat arbitrarily treated by the speech recognizer, at least at the acoustic level. While language modeling and contexts can help in dictation and conversation tasks, in command and control decoding, acoustics are still one of the most important parameters. The problem of recognition inaccuracy due to acoustic confusion is often tackled by hand editing the speech recognition vocabulary file to remove such potential problems. However, this hand-editing method is not possible if large lists of commands and words are to be automatically incorporated by non-specialists (i.e., persons with little or no in-depth understanding of recognition engine operations and capabilities) into the vocabulary of the recognizer.
This problem exists in other speech recognition areas and, up to now, has been addressed either manually or by using the context in which the command or word is used in order to resolve the command or word. For example, the words "to", "two" and "too" are typical examples of confusible words. The traditional approach to determining which one of these words was actually meant when uttered by a speaker has been to use the context around the word. Some recognizers may even be capable of intelligently noting that the distance of the spoken speech to all of these words will be the same, and thus may avoid such extra scoring by first noting that all three may have the same baseform.
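The observation that "to", "two" and "too" may share a baseform can be sketched as a simple lookup that groups vocabulary words by pronunciation. The lexicon below and the function name group_by_baseform are hypothetical and serve only to illustrate the idea; they do not describe how any particular recognizer stores or compares baseforms.

```python
from collections import defaultdict

# Hypothetical lexicon mapping each word to an assumed phonetic baseform.
LEXICON = {
    "to": ("T", "UW"),
    "two": ("T", "UW"),
    "too": ("T", "UW"),
    "ten": ("T", "EH", "N"),
}


def group_by_baseform(lexicon):
    """Return only those baseforms that are shared by more than one word."""
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        groups[tuple(phones)].append(word)
    return {bf: sorted(words) for bf, words in groups.items() if len(words) > 1}


if __name__ == "__main__":
    for baseform, words in group_by_baseform(LEXICON).items():
        print(" ".join(baseform), "->", ", ".join(words))
```

Words that land in the same group cannot be separated at the acoustic level, so a system built this way would need to score the shared baseform only once and leave the choice among the words to context.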
Accordingly, it would be desirable to provide a method and apparatus for relieving the recognizer of performing acoustic confusibility checks and for informing users, such as, for example, application developers, of potential problems. The developer would then be able to decide to use a synonym, coerce the grammar, modify the interface (e.g., provide the capability for asking the user to confirm the command) or modify the set of options in that particular context (e.g., limit the active vocabulary to exclude the competing commands). Also, it would be quite valuable to employ a tool for automatically evaluating the effect of vocabulary expansion on the acoustic performance of the speech recognizer without the need to build a new vocabulary and perform recognition tests.