The present invention relates to speech recognition. In particular, the present invention relates to adding phonetic descriptions of words to the lexicon of a speech recognition system.
In speech recognition, human speech is converted into text. To perform this conversion, the speech recognition system identifies a most-likely sequence of acoustic units that could have produced the speech signal. To reduce the number of computations that must be performed, most systems limit this search to sequences of acoustic units that represent words in the language of interest.
The mapping between sequences of acoustic units and words is stored in a lexicon (sometimes referred to as a dictionary). Regardless of the size of the lexicon, some words in the speech signal will be outside of the lexicon. These out-of-vocabulary (OOV) words cannot be recognized by the speech recognition system because the system does not know they exist. Instead, the recognition system is forced to recognize other words in place of the out-of-vocabulary word, resulting in recognition errors.
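As a minimal sketch of the lexicon described above, the mapping can be pictured as a table from words to one or more sequences of acoustic units. The words, the phone symbols, and the names `LEXICON`, `pronunciations`, and `out_of_vocabulary` are illustrative assumptions, not taken from any particular recognition system.

```python
# Toy lexicon: each word maps to one or more phone sequences.
# Words and phone symbols here are illustrative assumptions.
LEXICON = {
    "speech": [["s", "p", "iy", "ch"]],
    "the": [["dh", "ah"], ["dh", "iy"]],
    "text": [["t", "eh", "k", "s", "t"]],
}


def pronunciations(word):
    """Return the known phone sequences for a word, or an empty list if it is OOV."""
    return LEXICON.get(word.lower(), [])


def out_of_vocabulary(words):
    """Return the words that have no entry in the lexicon."""
    return [w for w in words if not pronunciations(w)]
```

A recognizer restricted to this table has no path through its search space for a word returned by `out_of_vocabulary`, which is why such a word is replaced by in-vocabulary words during recognition.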
In the past, some speech recognition systems have provided a way for users to add words to the speech recognition lexicon. To add a word to a lexicon, the text of the word and a phonetic or acoustic description of its pronunciation must be provided to the speech recognition system, along with the likelihood of the word in context (the so-called language model).
Under some prior art systems, the pronunciation of a word is provided by a letter-to-speech (LTS) system that converts the letters of the word into phonetic symbols describing its pronunciation. The conversion from letters to phonetic symbols is performed based on rules associated with the particular language of interest.
Such LTS systems are only as good as the rules provided to them. In most LTS systems, these rules fail to properly pronounce entire classes of words, including words of foreign origin and complex acronyms. If the LTS rules fail to properly identify the pronunciation of a word, the speech recognition system will not be able to detect the word when it is later spoken by the user.
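The rule-based conversion and its failure mode can be illustrated with a small sketch. The rule table `RULES`, the phone symbols, and the greedy longest-match strategy in `letters_to_phones` are hypothetical simplifications chosen for illustration; real LTS rule sets are far larger and usually context-dependent.

```python
# Hypothetical toy rules mapping letter groups to phone symbols.
RULES = {
    "ch": ["ch"],
    "sh": ["sh"],
    "ee": ["iy"],
    "c": ["k"],
    "a": ["ae"],
    "e": ["eh"],
    "f": ["f"],
    "h": ["hh"],
    "p": ["p"],
    "s": ["s"],
    "t": ["t"],
}


def letters_to_phones(word):
    """Convert letters to phones by greedy longest-match over RULES."""
    word = word.lower()
    max_len = max(len(g) for g in RULES)
    phones = []
    i = 0
    while i < len(word):
        # Try the longest letter group first, then shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            raise KeyError(f"no rule covers {word[i]!r}")
    return phones
```

With these rules, `letters_to_phones("speech")` yields `["s", "p", "iy", "ch"]`, but `letters_to_phones("chef")` yields `["ch", "eh", "f"]`, whereas the word is ordinarily pronounced with an initial "sh" sound. A lexicon entry built from the second result would not match the word as the user actually speaks it.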
In other systems, the pronunciation of a word is provided by recording the user pronouncing the word. This recorded signal is then used as a template for the word. During recognition, the user's speech signal is compared directly against the template speech signal and, if the two are sufficiently similar, the new word is recognized.
Note that a template system requires a significant amount of storage for each new template. This is because the template must store the speech signal itself instead of a phonetic description of the speech signal. This not only requires more storage space but also requires a modified recognition process because most recognition systems utilize the phonetic description of words when performing speech recognition.
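The text above does not say how the similarity between a speech signal and a template is measured. One classic choice, assumed here purely for illustration, is dynamic time warping (DTW), which tolerates differences in speaking rate. The sketch below operates on short lists of numbers standing in for per-frame features; the function names and the threshold value are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def matches_template(utterance, template, threshold=1.0):
    """Assumed decision rule: accept when the DTW distance is within threshold."""
    return dtw_distance(utterance, template) <= threshold
```

Note that the template in this sketch is the stored feature sequence itself, not a phonetic description, which reflects the storage cost and the separate comparison path discussed above.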
A third possibility is closely related to out-of-vocabulary detection. Some systems use a network in which any phoneme may be followed by any other phoneme to recognize a new word, which may be composed of any sequence of phonemes. Usually a phoneme bigram or trigram model is used in the search process to improve both accuracy and speed. However, phoneme sequence recognition, even with a bigram or trigram model, is well known to be difficult, and phoneme accuracy is usually low.
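To make the role of a phoneme bigram concrete, the following sketch scores a phoneme sequence by summing the log probabilities of each phoneme given its predecessor. The tiny training set, the start symbol `<s>`, the add-one smoothing, and the function names are all assumptions made for illustration, not a description of any particular system.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count phoneme bigrams and their left contexts from example sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(seq, model):
    """Log probability of a phoneme sequence under the add-one-smoothed bigram."""
    context_counts, bigram_counts, vocab = model
    padded = ["<s>"] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total
```

Trained on a few sequences such as `["s", "p", "iy", "ch"]`, the model assigns a higher score to `["s", "p", "iy"]` than to the reversed sequence, so the search can favor plausible phoneme orders. The score constrains only adjacent phonemes, however, which is consistent with the low phoneme accuracy noted above.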
Thus, a system is needed for adding words to a speech recognition lexicon that provides a sequence of phonetic units for each added word while improving the identification of those phonetic units.