The present invention relates to speech recognition and, more particularly, to methods and apparatus for forming an inflected language model component of an automatic speech recognizer (ASR) that produces a compact and efficiently accessible set of 2-gram and 3-gram language model probabilities.
It is known that a relatively small vocabulary, e.g., 35,000 words, can represent more than 99% of the everyday spoken English language. A different situation exists for inflected languages that use large vocabularies having several million different word forms. For example, the Russian language requires at least 400,000 word forms to represent more than 99% of the everyday spoken Russian language. Such a large vocabulary cannot be used directly as a basic 1-gram component in a standard n-gram language model module of a real-time ASR. In one approach to decreasing the size of a basic vocabulary employed in Slavic language models, words are split into stems and endings and the following sets of statistics for components are found: trigrams for stems only, trigrams for endings only, and distributions of stems/endings. Such an approach is disclosed in U.S. Ser. No. 08/662,726 (docket no. YO995-208) filed on Jun. 10, 1996, entitled "Statistical Language Model for Inflected Languages." The resulting language model is based on the vocabulary of components and is a weighted sum of different language models of these components. The size of the vocabulary of components (stems and endings) may be an order of magnitude less than the size of the vocabulary of (non-split) words. Therefore, the language model that is based on components is more compact than the standard n-gram language model that is based on the whole (non-split) word form vocabulary. Nevertheless, the conventional language model formed from components requires consideration of six consecutive components (stem-ending-stem-ending-stem-ending) in order to fetch trigram probabilities of some of its components (stem-stem-stem or ending-ending-ending). The consideration of 6-tuple strings can be computationally expensive for real-time applications in ASR.
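The six-component cost can be illustrated with a minimal sketch. It assumes words have already been split into (stem, ending) pairs and laid out as one interleaved stream; the function names and the transliterated toy words are hypothetical and are not taken from the cited application.

```python
def interleave(split_words):
    """Flatten [(stem, ending), ...] into the stream stem, ending, stem, ending, ..."""
    stream = []
    for stem, ending in split_words:
        stream.extend((stem, ending))
    return stream


def trigram_window(stream, word_index):
    """Return the six components spanning the three words that end at word_index."""
    if not 2 <= word_index < len(stream) // 2:
        raise ValueError("word_index must have two preceding words in the stream")
    start = 2 * (word_index - 2)
    return stream[start:start + 6]


# Toy, transliterated splits (assumed for illustration only).
split_words = [("krasiv", "aya"), ("nov", "aya"), ("knig", "a")]
stream = interleave(split_words)
window = trigram_window(stream, 2)

stems = window[0::2]    # 1st, 3rd and 5th components of the window
endings = window[1::2]  # 2nd, 4th and 6th components of the window
print(len(window), stems, endings)
```

In this layout, reading either the stem trigram or the ending trigram for one word position requires the whole six-component window, which is the overhead described above.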
Another related problem is how to split words into stems and endings so that the vocabulary of components is sufficiently smaller than the vocabulary of non-split word forms. It is known that, in order to split a vocabulary of word forms into stems and endings, one can take a list of components (stems and endings) and then match word forms from the vocabulary against the list of components using some matching rules.
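A minimal sketch of such list-based matching follows. The specific rule used here (prefer the longest listed stem whose remainder is a listed ending, with an empty ending allowed) is an assumption chosen for illustration; the text does not specify which matching rules are used. The function name and the transliterated toy lists are likewise hypothetical.

```python
def split_word(word, stems, endings):
    """Split word into (stem, ending) using the given component lists.

    Assumed rule: try the longest candidate stem first and accept the first
    cut where the stem is listed and the remainder is a listed ending.
    Returns None when no listed stem/ending pair covers the word.
    """
    for cut in range(len(word), 0, -1):
        stem, ending = word[:cut], word[cut:]
        if stem in stems and ending in endings:
            return stem, ending
    return None


# Toy component lists (hypothetical); "" stands for an empty ending.
stems = {"knig", "stol"}
endings = {"", "a", "u", "om"}

for word in ("knigu", "stolom", "stol", "okno"):
    print(word, split_word(word, stems, endings))
```

A word such as "okno" that matches no listed stem is returned as None, which corresponds to a word form that the component list fails to cover.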
In various countries, e.g., Slavic countries, there exist lists of components (stems and endings) that have been produced by groups of linguists (sometimes over several decades of work). These lists of components could be used to split the corresponding vocabularies of word forms into stems and endings and produce relevant language models of components as described in the Kanevsky et al. application cited above. However, these existing sources of components may not be satisfactory for practical applications of making a language model module in an ASR for several reasons. First, these sources of components may not cover all word forms in a particular vocabulary that is used in an ASR. Second, they may not provide a sufficient ratio of compression (i.e., the ratio of the number of word forms to the number of components). Third, such a ratio of compression is essentially a fixed number. As a result, the ratio cannot be easily varied as may sometimes be necessary for different practical applications that may involve vocabularies of different sizes (e.g., from several million word forms to hundreds of thousands of word forms). Lastly, it can be expensive to license a list of components (stems and endings) from its owner in order to produce and sell an ASR that employs such a list of components in its language models. On the other hand, producing one's own list of components for a language may require complicated algorithms and can be time consuming.
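The first two shortcomings, coverage and ratio of compression, can be measured as in the following sketch. It assumes a precomputed table mapping each covered word form to its (stem, ending) split, counts stems and endings as separate component types, and computes the ratio over covered word forms only; these choices, the function name, and the synthetic toy strings are assumptions for illustration.

```python
def coverage_and_ratio(vocabulary, split_table):
    """Return (coverage, compression ratio) for a vocabulary of word forms.

    coverage = covered word forms / all word forms
    ratio    = covered word forms / distinct components (stems plus endings)
    """
    covered = [word for word in vocabulary if word in split_table]
    components = set()
    for word in covered:
        stem, ending = split_table[word]
        components.add(("stem", stem))
        components.add(("ending", ending))
    coverage = len(covered) / len(vocabulary) if vocabulary else 0.0
    ratio = len(covered) / len(components) if components else 0.0
    return coverage, ratio


# Synthetic toy data (not real linguistic resources): 3 stems x 4 endings.
toy_stems = ["knig", "lamp", "stol"]
toy_endings = ["a", "u", "e", "oj"]
split_table = {s + e: (s, e) for s in toy_stems for e in toy_endings}
vocabulary = list(split_table) + ["okno"]  # one word form is not covered

coverage, ratio = coverage_and_ratio(vocabulary, split_table)
print(f"coverage={coverage:.3f} ratio={ratio:.3f}")
```

With a fixed, externally supplied split table, both numbers are determined by the table itself, which is why the ratio cannot be tuned to vocabularies of different sizes.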