The present invention relates generally to speech recognition systems and, more particularly, to methods and apparatus for forming compound words for use in speech recognition systems.
It is well established that pronunciation variability is greater in spontaneous, conversational speech than in carefully read speech, where the uttered words are closer to their canonical representations, i.e., baseforms. Whereas most speech recognition systems have focused on the latter case, there is no standard solution for dealing with the variability present in the former case. One can argue that by increasing the vocabulary of alternate pronunciations of words, i.e., the acoustic vocabulary, most of the speech variability can be captured in the spontaneous case. However, an increase in the number of alternate pronunciations is typically followed by an increase in acoustic confusion between words, since different words can end up having close or even identical pronunciation variants. It should be understood that the phrase “acoustic confusion” is also referred to herein as “confusability” and refers to the propensity of a speech recognition system to confuse words due to pronunciation variants.
Consider the word “TO” which, when preceded by a word such as “GOING,” is often pronounced as the baseform AX. That is, instead of a user uttering the phrase “GOING TO,” the user may utter the phrase “GONNA,” which may have baseforms such as G AA N AX or G AO N AX. It is well known that words may have more than one baseform since a word may be pronounced a number of ways. For instance, a vowel in a word may be pronounced as a short vowel (e.g., “A” as AX) or a long vowel (e.g., “A” as AY). Another example of the word “TO” being pronounced as AX is when the phrase “WANT TO” is uttered as “WANNA” (W AA N AX or W AO N AX).
However, in the above two examples, merely adding the baseform AX to the vocabularies of the speech recognition system for the word “TO” would lead to confusion with the word “A,” for which baseform AX is the standard pronunciation.
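The collision described above can be checked mechanically. The following is a minimal sketch, assuming a toy lexicon held as a Python dictionary. The function name `find_homophones` and the lexicon layout are illustrative assumptions and not part of any described system; the only baseforms used are ones quoted in this description (T UW and AX).

```python
from collections import defaultdict


def find_homophones(lexicon):
    """Return the baseforms that are shared by more than one word.

    `lexicon` maps a word to a list of baseforms, each baseform being a
    space-separated phone string (a hypothetical layout for illustration).
    """
    words_by_baseform = defaultdict(set)
    for word, baseforms in lexicon.items():
        for baseform in baseforms:
            words_by_baseform[baseform].add(word)
    # Keep only baseforms claimed by two or more distinct words.
    return {bf: sorted(ws) for bf, ws in words_by_baseform.items() if len(ws) > 1}


# Toy lexicon before and after adding the reduced variant AX to "TO".
before = {"TO": ["T UW"], "A": ["AX"]}
after = {"TO": ["T UW", "AX"], "A": ["AX"]}

print(find_homophones(before))  # {}
print(find_homophones(after))   # {'AX': ['A', 'TO']}
```

With the added variant, the baseform AX no longer identifies a single word, which is the confusion that compound words are intended to avoid.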
On the other hand, most co-articulation effects, as the above two examples illustrate, arise at the boundary between adjacent words and can often be predicted based on the identity of these words. These co-articulation effects result in alterations of the last one or two phones of the first word and the first phone of the second word. These phones can undergo hard changes (e.g., substitutions or deletions) or soft changes, the latter being efficiently modeled by context-dependent phones.
The use of cross-word phonological rewriting rules was first proposed in E. P. Giachin et al., “Word Juncture Modeling Using Phonological Rules for HMM-based Continuous Speech Recognition,” Computer Speech and Language, 5:155-168, 1991, the disclosure of which is incorporated herein by reference, and provides a systematic way of taking into account co-articulation phenomena such as geminate or plosive deletion (e.g., “WENT TO” resulting in W EH N T UW), palatalization (e.g., “GOT YOU” resulting in G AO CH AX), etc.
Yet another known way of dealing with co-articulation effects at word boundaries is to merge specific pairs of words into single compound words or multi-words and to provide special co-articulated pronunciation variants for these new tokens. For instance, frequently occurring word pairs such as “KIND OF,” “LET ME” and “LET YOU” can be viewed as single words KIND-OF, LET-ME and LET-YOU, which are often pronounced K AY N D AX, L EH M IY and L EH CH AX, respectively. A major reason for merging frequently co-occurring words into compound words is to tie confusable words to other words. The resulting phone sequences will be longer and therefore more likely to be recognized correctly by the acoustic component of the speech recognition system. For instance, the word “AS” by itself is particularly confusable in spontaneous speech, but the sequence AS-SOON-AS is far less likely to be misrecognized.
As mentioned previously, indiscriminately adding more tokens (compound words) to the acoustic vocabulary and/or the language model will increase the confusability between words. The candidate pairs for compound words have to be chosen carefully in order to avoid this increase. Intuitively, such a pair has to meet several requirements:
1. The pair of words has to occur frequently in the training corpus. There is no gain in adding a pair with a low occurrence count (i.e., the number of times the word pair occurs in the training corpus) to the vocabulary since the chances of encountering that pair during the decoding of unseen data will be low. Besides, the compound word derived from this pair will contribute to the acoustic confusability of other words which are more probable according to the language model.
2. The words within the pair have to occur frequently together and more rarely in the context of other words. This requirement is necessary since one very frequent word a can be part of several different frequent pairs, e.g., (a, b_1), . . . , (b_{n+1}, a), . . . , (b_m, a). If all these pairs were to be added to the vocabulary, then the confusability between b_i and the pair (a, b_i) or (b_i, a) would be increased, especially if word a has a short phone sequence. This will result in insertions or deletions of the word a when incorrectly decoding the word b_i or the sequence b_i-a or a-b_i.
3. The words should ideally present co-articulation effects at the juncture, meaning that their continuous pronunciation should be different than when they are uttered in isolation. This requirement is not always compatible with the previous ones, meaning that the word pairs which have the strongest co-articulation effects do not necessarily occur very often nor do the individual words occur only together.
The use of compound words (or multi-words) was first suggested in M. Finke et al., “Speaking Mode Dependent Pronunciation Modeling in Large Vocabulary Conversational Speech Recognition,” Proceedings of Eurospeech '97, Rhodos, Greece, 1997 and M. Finke, “Flexible Transcription Alignment,” 1997 IEEE Workshop on Speech Recognition and Understanding, Santa Barbara, Calif., 1997, the disclosures of which are herein incorporated by reference. These articles propose two measures for finding candidate pairs. The first measure is language model oriented and consists of maximizing the mutual information between two words while decreasing the bigram perplexity of the augmented language model (by these tokens) on the training corpus. It is to be appreciated that perplexity measures the quality of the language model and may be represented as:

$$\mathrm{Perp} = e^{-\frac{1}{N}\log P(C)}, \qquad (1)$$
where C represents the training corpus, N represents the number of words in corpus C, and P(C) represents the probability of the corpus C according to the language model. A low perplexity value translates to a better language model. Bigram perplexity refers to the perplexity of a language model with bigram probabilities and may be represented as:

$$P(C) = \prod_{i=1}^{N} P(w_i \mid w_{i-1}), \qquad (2)$$
where C = w_1, . . . , w_N. Even though the mutual information between two words is a good measure with respect to the second requirement above, the minimization of the perplexity is in contradiction with the first (and most important) requirement. Indeed, according to the first requirement, frequent pairs are good candidates for compound words. The language model, built by adjoining these pairs to the existing vocabulary, will present a higher perplexity on the training data since the bigram count for every pair (which was high) has been taken out by merging the pair into a single (compound) word. In other words, the prediction power of the language model without compound words is stronger (or the perplexity lower) because it can often predict the second word given the first word of a frequently seen pair. It may be readily shown that the perplexity of the resulting language model increases whenever frequent pairs are added to the vocabulary.
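The effect described above may be illustrated numerically. The following is a minimal sketch, assuming maximum-likelihood (relative-frequency) bigram estimates computed on the training corpus itself, no smoothing, and a start symbol `<s>` standing in for the history of the first word. The function names, the start symbol and the toy corpus are illustrative assumptions rather than part of the cited articles.

```python
import math
from collections import Counter


def bigram_perplexity(tokens, start="<s>"):
    """Training-set perplexity per equations (1) and (2).

    Uses unsmoothed relative-frequency bigram estimates; the history of
    the first word is an assumed start symbol.
    """
    history = [start] + tokens[:-1]
    pair_counts = Counter(zip(history, tokens))
    history_counts = Counter(history)
    log_prob = sum(
        math.log(pair_counts[(h, w)] / history_counts[h])
        for h, w in zip(history, tokens)
    )
    return math.exp(-log_prob / len(tokens))


def merge_pair(tokens, first, second):
    """Replace each consecutive (first, second) with the token first-second."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
            merged.append(f"{first}-{second}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


corpus = ("I AM GOING TO GO WE ARE GOING TO STAY "
          "THEY ARE GOING TO LEAVE").split()
perp_before = bigram_perplexity(corpus)
perp_after = bigram_perplexity(merge_pair(corpus, "GOING", "TO"))
print(round(perp_before, 3), round(perp_after, 3))  # 1.246 1.316
```

In this toy corpus “TO” always follows “GOING,” so the bigram P(TO | GOING) contributes a factor of one to P(C). Merging the pair leaves log P(C) unchanged while reducing N from 15 to 12, and the perplexity therefore rises.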
The second measure that the Finke articles disclose is related to pronunciation modeling at word boundaries. The articles propose measuring the reduction in entropy between the co-articulated pronunciation variants of a hypothesized compound word and the variants of the individual words. The pair which gives the maximum reduction will be merged into a compound word and all the instances of that pair in the training data will be replaced by this new word.
The present invention provides for methods and apparatus for improving recognition performance in a speech recognition system by, for example, more accurately modeling pronunciation variability at boundaries between adjacent words in a continuously spoken utterance. Such improved modeling is accomplished according to the invention by introducing new language model measures and acoustic measures for determining whether candidate element sets (two or more individual elements) in the training corpus should form a compound element.
In a first aspect of the invention, a method of forming an augmented textual corpus associated with a speech recognition system includes computing a measure for an element set in a textual corpus for comparison to a threshold value, the measure being an average of a direct n-gram probability value and a reverse n-gram probability value. Then, the element set in the textual corpus is replaced with a compound element depending on a result of the comparison between the measure and the threshold value. For example, if the element set is a consecutive pair of words, the average of a direct bigram probability value and a reverse bigram probability value is computed. If the measure is not less than the threshold value, a compound word replaces the consecutive pair of words in the corpus.
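By way of illustration only, the first aspect may be sketched as follows for a consecutive pair of words. The sketch assumes relative-frequency estimates of the direct bigram probability P(second | first) and the reverse bigram probability P(first | second), and assumes that the average is a geometric mean; the summary above says only “average,” so an arithmetic mean would be a drop-in alternative. The function names, the toy corpus and the threshold value of 0.7 are hypothetical.

```python
import math
from collections import Counter


def lm_measure(tokens, first, second):
    """Average of the direct and reverse bigram probabilities of a pair.

    The geometric mean is an assumption made for this sketch.
    """
    unigrams = Counter(tokens)
    pair_count = sum(
        1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second)
    )
    if pair_count == 0:
        return 0.0
    direct = pair_count / unigrams[first]    # estimate of P(second | first)
    reverse = pair_count / unigrams[second]  # estimate of P(first | second)
    return math.sqrt(direct * reverse)


def augment_corpus(tokens, first, second, threshold):
    """Replace the pair with a compound token if the measure is not less
    than the threshold; otherwise return the corpus unchanged."""
    if lm_measure(tokens, first, second) < threshold:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
            out.append(f"{first}-{second}")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


corpus = "I AM GOING TO GO TO SCHOOL AND I AM GOING TO STAY".split()
threshold = 0.7  # hypothetical value chosen for the toy corpus
augmented = augment_corpus(corpus, "GOING", "TO", threshold)
unchanged = augment_corpus(corpus, "TO", "GO", threshold)
print(" ".join(augmented))
```

Here the pair (GOING, TO) scores above the assumed threshold and is merged into GOING-TO, whereas the pair (TO, GO) scores below it and is left as two words.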
The compound element may also be added to a language model vocabulary and an acoustic vocabulary associated with the speech recognition system. Further, the augmented training corpus may be used to recompute a language model and retrain an acoustic model, each associated with the system.
In a second aspect of the invention, a measure used in the comparison may be based on mutual information between elements in the set. In a third aspect of the invention, a measure may be based on a comparison of the number of times a co-articulated baseform for the set is preferred over a concatenation of non-co-articulated individual baseforms of the elements forming the set. In a fourth aspect of the invention, a measure may be based on a difference between an average phone recognition score for a particular compound element and a sum of respective average phone recognition scores of the elements of the set.
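As an illustration of the second aspect, a mutual information measure for a pair of words may be sketched as below. The sketch assumes the pointwise form log(P(first, second) / (P(first) P(second))) with relative-frequency estimates taken from the corpus; the exact form of the measure is not specified in this summary, and the function name and toy corpus are hypothetical. The third and fourth aspects depend on acoustic decoding results and are not sketched here.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens, first, second):
    """Pointwise mutual information of a consecutive word pair.

    Probabilities are relative frequencies: pair counts over the number of
    bigram positions, word counts over the number of tokens.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs consecutively
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log(p_pair / (p_first * p_second))


corpus = "KIND OF A KIND OF THING AS SOON AS A THING".split()
pmi_kind_of = pointwise_mutual_information(corpus, "KIND", "OF")
pmi_a_kind = pointwise_mutual_information(corpus, "A", "KIND")
print(pmi_kind_of > pmi_a_kind)  # True
```

A pair whose words occur mostly together, such as (KIND, OF) in the toy corpus, receives a higher value than a pair whose words also occur in other contexts, and such a value may then be compared to a threshold in the same manner as the measure of the first aspect.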
It is to be appreciated that a speech recognition system augmented by compound elements according to the invention produces a decoded sequence associated with the augmented textual corpus.
It is to be further appreciated that while the invention is described in detail below in terms of forming a compound word from a pair of consecutive words in a training corpus, the invention more generally provides for forming a compound element from two or more individual elements of the training corpus. An element may be a full word or one or more phones that are not necessarily full words.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.