In the context of the invention, a word is understood to mean any linguistic entity of more or less restricted length, and thus may include brief sentences, personal and other names, and other items that warrant machine recognition when presented in the form of speech. In particular, the invention addresses the problem of finding an acoustic representation, hereinafter also called a transcription, of an unknown word as a sequence of sub-word units. This is done given only a few sample utterances of the unknown word(s) and an inventory of speaker-independent sub-word unit models.
A problem arises if a user wants to add one or more vocabulary words to a speaker-independent recognition system by training the system with only a few utterances of each new word. Speaker-independent recognition is used when the number of speakers envisaged to use a particular type of system is relatively large and/or the system is relatively inexpensive. A typical example would be a speech-actuated telephone device that normally recognizes the ten digits and a few standard terms, and which the user may train to additionally recognize names or other labels that pertain to frequently called telephone extensions.
Another example is a speaker-independent speech recognition system that has only a limited standard set of recognizable words, such as only twenty. Such a system should have been trained on many different speakers. The system may then have to be extended with extra words for which only a very limited number of training speakers, such as no more than three, is available, but for which the same recognition robustness is required as for the original set.
Still another example is grapheme-to-phoneme conversion, wherein a new word entered by keyboard is transcribed into an acoustic model. To improve reliability, the keyboard entry is then supplemented by acoustic entry of the same word. The parallel presentations again improve robustness, and under particular circumstances would also solve reliability problems due to orthographic errors, or due to the existence of two correct pronunciations of a single written word, each with a different meaning.
In particular, it is a requirement that the minimum necessary number of training utterances should remain low, such as no more than three, while nevertheless attaining reliable performance at later recognition. The problem is also generally restricted to systems that allow adding only a limited set of words, say, up to ten words. If the number of words added becomes too high, the transcriptions could yield confusable results. On the other hand, the set of standard words may be either small or large.