1. Technical Field
The present application generally relates to automatic speech recognition and, in particular, to systems and methods for generating acoustic and language models for large vocabularies.
2. Description of the Related Art
In general, automatic speech recognition (“ASR”) systems operate with two kinds of vocabularies: an acoustic vocabulary and a language vocabulary. In a language vocabulary (or word vocabulary), words are represented with an ordinary textual alphabet. In an acoustic vocabulary, the spoken sounds of words are represented by an alphabet consisting of a set of phonemes. The words that comprise the acoustic vocabulary are referred to as baseforms. These baseforms can be generated either manually (which is typically done for relatively small vocabularies) or by utilizing spelling-to-sound mapping techniques (which are used for languages having well-defined pronunciation rules, such as the Russian language).
The vocabulary size of many languages is small enough such that conventional statistical language modeling methods (e.g., trigram models) may be efficiently utilized in real-time decoding speech recognition applications. For instance, more than 99% of the English language which is typically written and spoken may be represented by a relatively small language vocabulary (e.g., 35 k words) and acoustic vocabulary (e.g., 100 k words). Indeed, n-gram modeling of words (e.g., n=3) has been successfully utilized in recent years in speech recognition systems for vocabularies having up to 64,000 words (which generally require a training corpus of a few hundred million words).
On the other hand, word-based language models such as n-grams are inadequate for inflected languages having much larger vocabularies (e.g., several hundred thousand words or more). For example, the Russian language requires at least 400,000 word forms to represent more than 99% of the everyday spoken Russian language, and a vocabulary of several million words is needed to completely cover all of the possible word forms in the Russian language.
There are several problems associated with utilizing a word-based language model for a large vocabulary in a real-time ASR system. For example, with conventional n-gram modeling, a large vocabulary cannot be directly utilized as a basic 1-gram component in the n-gram language model due to the excessive time associated with accessing such data during decoding. Moreover, an extremely large corpus of data would be required to train the n-gram language models. Furthermore, a database of every word comprising such a large vocabulary is generally not available (e.g., for performing training) and is difficult to construct.
These problems are compounded by the fact that an acoustic vocabulary is significantly larger than the corresponding language vocabulary, since there can be several basic pronunciations of one word that give rise to multiple baseforms per word. Consequently, a large acoustic vocabulary significantly increases the acoustic processing time due to the large number of candidate baseforms which have to be verified before the ASR system can choose one or several baseforms which match a spoken utterance.
Furthermore, a large vocabulary may also be encountered with speech that relates to one or more technical fields having unique, specialized words (which are hereinafter referred to as out-of-vocabulary (“OOV”) words). For example, medical and law vocabularies must be utilized if real-time ASR decoding is to be performed in a courtroom during a medical malpractice trial. Accordingly, when faced with inflected or specialized languages, efficient and accurate real-time decoding requires decreasing the vocabulary size and processing time of the OOV words.
There is a need, therefore, for a method for generating a language model which allows a large, basic language vocabulary to be compressed to a manageable size such that the language can be efficiently modeled for real-time ASR applications. One such method for generating Slavic language models is disclosed in U.S. patent application Ser. No. 08/662,726 entitled “Statistical Language Model For Inflected Languages” by Kanevsky et al., which is commonly assigned to the present assignee and incorporated herein by reference. With this method, words in a training corpus are split into stems and endings (i.e., word components) and n-gram (e.g., trigram) statistics are generated for stems only, for endings only, and for stems and endings in their natural order, as well as statistical distributions of stems/endings. The resulting language model is based on the vocabulary of components and is a weighted sum of the different language models that are generated for each of these components. By using the components (e.g., stems and endings) as the working vocabulary (as opposed to using the vocabulary consisting of the “non-split” words), the size of the vocabulary may be reduced by an order of magnitude as compared to the vocabulary of (non-split) words. Consequently, a language model that is based on word components is more compact than a standard n-gram language model that is based on the whole (non-split) word-form vocabulary.
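The stem/ending splitting and component statistics described above can be illustrated with a minimal sketch. The ending list, the longest-suffix matching rule, and the names `ENDINGS`, `split_word`, and `trigram_counts` are assumptions made for illustration only; they are not the matching rules of the referenced application.

```python
from collections import Counter

# Hypothetical ending list, for illustration only.
ENDINGS = ("ing", "ed", "s")

def split_word(word):
    """Split a word into (stem, ending); the ending is "" if none matches."""
    # Try longer endings first so that the longest matching suffix wins.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], ending
    return word, ""

def trigram_counts(seq):
    """Count consecutive triples in a sequence of components."""
    return Counter(zip(seq, seq[1:], seq[2:]))

# Toy word-form vocabulary: 8 word forms.
words = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking"]
stems = {split_word(w)[0] for w in words}
endings = {split_word(w)[1] for w in words}

# Toy corpus: trigram statistics for stems only and for endings only.
corpus = ["walks", "talked", "walking", "talks"]
stem_stream = [split_word(w)[0] for w in corpus]
ending_stream = [split_word(w)[1] for w in corpus]
stem_trigrams = trigram_counts(stem_stream)
ending_trigrams = trigram_counts(ending_stream)
```

In this toy case the eight word forms reduce to two stems and four endings (counting the empty ending), which shows on a small scale why the component vocabulary can be much smaller than the word-form vocabulary.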
Nevertheless, the language model discussed above (which is derived from word components) requires consideration of six consecutive components (stem-ending-stem-ending-stem-ending) in order to fetch trigram probabilities of some of its components (stem-stem-stem or ending-ending-ending). The consideration of 6-tuple strings can be computationally expensive for real-time ASR decoding applications.
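The need for a six-component window can be seen in a short sketch. It assumes, for illustration, that every word contributes exactly one stem and one ending (possibly empty) to an interleaved component stream; the function name `stem_and_ending_trigrams` is hypothetical.

```python
def stem_and_ending_trigrams(components):
    """Extract (stem trigram, ending trigram) pairs from an interleaved
    stream [stem, ending, stem, ending, ...].

    Each pair requires a window of six consecutive components.
    """
    pairs = []
    # Advance two components (one word) at a time.
    for i in range(0, len(components) - 5, 2):
        window = components[i:i + 6]
        stems = tuple(window[0::2])    # positions 0, 2, 4
        endings = tuple(window[1::2])  # positions 1, 3, 5
        pairs.append((stems, endings))
    return pairs

stream = ["walk", "s", "talk", "ed", "walk", "ing"]
pairs = stem_and_ending_trigrams(stream)
```

A word-level trigram spans three units, whereas the same three-word context spans six units in the component stream, so each lookup must consider twice as many history positions.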
Another concern with the above approach is how the words can be split into stems and endings so as to sufficiently compress the size of the component vocabulary (as compared to the size of the vocabulary of non-split word forms). One method for splitting a vocabulary of word forms into stems and endings is to take a list of components (e.g., stems and endings) and then match each word form from the vocabulary with the list of components using a set of matching rules (such as described in the above-incorporated U.S. patent application Ser. No. 08/662,726). This approach, however, may not necessarily lead to the smallest total number of vocabulary components.
Another method for splitting word forms to produce a small word component vocabulary is the arithmetic-based method disclosed in U.S. Ser. No. 08/906,812 entitled “Apparatus and Method For Forming A Filtered Inflected Language Model For Automatic Speech Recognition” by Kanevsky et al., which is commonly assigned to the present assignee and incorporated herein by reference. With this arithmetic approach, word forms are mapped into word numbers which are then “split” into smaller numbers using modular arithmetic. The “split” numbers are used to represent corresponding vocabulary components. This method provides a significantly compressed representation of statistical data for distributions of n-tuples of words (i.e., n-grams). Although this approach provides efficient compression of the statistical data formed by the words comprising a large vocabulary, it does not provide for a method of reconstructing OOV words by verifying whether a word form generated by concatenating one or more “split” components is, e.g., a legal word. The reason for this is as follows. Word numbers from “split” component numbers are reconstructed using pure arithmetic means. If the “split components” (e.g., a pair of small numbers n1 and n2) are reconstructed into a word number N (e.g., the Nth word number in the vocabulary) which is smaller than or equal to the size of the vocabulary, a word can be matched to the reconstructed word number. In this manner, the Nth word in the vocabulary can be attached to the number pair n1 and n2. On the other hand, if the word number N is larger than the size of the vocabulary, N will not correspond to any word in the vocabulary and, consequently, no spelling can be attached to the reconstructed N (nor to the pair n1 and n2). Therefore, the above arithmetic compression method may not work properly with OOV words since there can be no match between OOV words and any word numbers.
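The reconstruction problem can be illustrated with a minimal sketch. The specific scheme shown here (quotient and remainder with respect to a fixed modulus) is an assumption chosen for illustration, not the scheme of the referenced application; the names `MODULUS`, `VOCABULARY`, `split_number`, `reconstruct`, and `lookup` are likewise hypothetical.

```python
# Hypothetical modulus and toy vocabulary, for illustration only.
MODULUS = 4
VOCABULARY = ["walk", "walks", "walked", "walking", "talk",
              "talks", "talked", "talking", "run", "runs"]  # words 1..10

def split_number(n):
    """Split word number N into a pair of smaller numbers (n1, n2)."""
    return divmod(n, MODULUS)

def reconstruct(n1, n2):
    """Rebuild the word number N from the pair (n1, n2) by arithmetic alone."""
    return n1 * MODULUS + n2

def lookup(n):
    """Return the Nth word (1-indexed), or None if N has no word attached."""
    if 1 <= n <= len(VOCABULARY):
        return VOCABULARY[n - 1]
    return None

# An in-vocabulary word number survives the round trip.
n1, n2 = split_number(7)
in_vocab = lookup(reconstruct(n1, n2))

# A pair that reconstructs to N larger than the vocabulary has no spelling.
out_of_vocab = lookup(reconstruct(3, 3))
```

Under these assumptions, the pair (1, 3) maps back to word number 7, while the pair (3, 3) yields N = 15, which exceeds the vocabulary size and therefore cannot be matched to any word.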
A method for splitting acoustic baseforms into acoustic stems and endings is also disclosed in the above patent application U.S. Ser. No. 08/906,812. In this approach, acoustic stems and endings for baseforms of split words are associated with the stems and endings of the split words. Since words have multiple baseforms, this procedure generally produces larger multiple sets of acoustic stems and endings for each language stem and ending. For example, if a given word is associated with two different baseforms, then a corresponding stem and ending of the word can each be associated with two stem and ending baseforms. This gives rise to four new baseform units for the acoustic component vocabulary. This is demonstrated by the following example. Assume the word WALKED has two baseforms: W AO K D and W O K E D (in some phonetic representation with phonemes W, AO, K, E, D). The stem WALK has two stem baseforms: W AO K and W O K; and the ending ED has two baseform endings: D and E D. From this example, it is apparent that although the number of language components is reduced, there can still exist a significant number of acoustic components. This raises the problem of efficient compression of the acoustic component vocabulary and the interfacing of acoustic and language component vocabularies, which has not heretofore been addressed.
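The WALKED example can be reproduced with a short sketch. It assumes, for illustration, that the split point of each baseform is given as the number of phonemes belonging to the stem; the function name `split_baseforms` and the `stem_len` parameter are hypothetical.

```python
def split_baseforms(baseforms, stem_len):
    """Split each baseform into a stem baseform and an ending baseform.

    `stem_len` is the assumed number of phonemes belonging to the stem.
    Returns the distinct stem baseforms and ending baseforms, in order.
    """
    stem_units, ending_units = [], []
    for baseform in baseforms:
        phonemes = baseform.split()
        stem = " ".join(phonemes[:stem_len])
        ending = " ".join(phonemes[stem_len:])
        if stem not in stem_units:
            stem_units.append(stem)
        if ending not in ending_units:
            ending_units.append(ending)
    return stem_units, ending_units

# The two baseforms of WALKED from the example above.
walked = ["W AO K D", "W O K E D"]
stem_units, ending_units = split_baseforms(walked, stem_len=3)
total_units = len(stem_units) + len(ending_units)
```

One language stem (WALK) and one language ending (ED) thus produce two acoustic stems and two acoustic endings, i.e., four acoustic component units for a single split word.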