Speech recognition is a process by which an unknown speech utterance ("input signal") is identified. Speech recognition typically involves a signal processing stage in which a plurality of word string hypotheses, i.e., possible word sequences consistent with the input signal, are proposed. The task is then to recognize or identify the "best" word string from that set of hypotheses. Speech recognition systems utilize a language model for this purpose.
Typical speech recognition systems may employ a quantitative language model. Quantitative models associate a "cost" with each hypothesis, selecting the lowest cost hypothesis as the recognized word string.
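By way of illustration only, the selection step of a quantitative model can be sketched as follows. The hypothesis strings, the cost values, and the function name `lowest_cost_hypothesis` are assumptions introduced for this sketch and do not come from any particular recognizer.

```python
def lowest_cost_hypothesis(costs):
    """Return the word string whose associated cost is smallest.

    `costs` maps each hypothesized word string to a numeric cost.
    """
    return min(costs, key=costs.get)


# Hypothetical hypotheses and costs, for illustration only.
hypotheses = {
    "recognize speech": 2.1,
    "wreck a nice beach": 5.7,
    "recognized peach": 4.3,
}

best = lowest_cost_hypothesis(hypotheses)
print(best)
```

How the costs themselves are computed depends on the language model; in a probabilistic model, for example, a cost may be taken as the negative logarithm of a probability, so that the lowest cost corresponds to the highest probability.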
One example of a quantitative model is a probabilistic language model. Probabilistic models assign probabilities to word strings and choose the string that has the highest probability of representing a given input signal. The probability calculation can be performed using a variety of methods. One such method, referred to as the N-gram model, specifies the probability of a word that is part of a string conditional on the previous N-1 words in the string. See, for example, Jelinek et al., "Principles of Lexical Language Modeling for Speech Recognition," Adv. Speech Signal Processing, pp. 651-699 (1992). This article, and all other articles mentioned in this specification, are incorporated herein by reference. The N-gram model is lexically sensitive in that the parameters of the model are associated with particular lexical items, i.e., words. This sensitivity allows the model to capture local distributional patterns that are idiosyncratic to particular words.
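As a minimal sketch of the N-gram approach, the following shows the case N = 2 (a bigram model), in which the probability of each word is conditioned on the single preceding word. The toy corpus, the boundary markers `<s>` and `</s>`, the function names, and the use of unsmoothed relative-frequency estimates are all assumptions made for illustration; practical systems generally apply smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def string_probability(words, context_counts, bigram_counts):
    """Multiply P(word | previous word) over the whole string."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never observed in the toy corpus
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


# Hypothetical three-sentence training corpus.
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]
contexts, bigrams = train_bigram(corpus)

p_seen = string_probability(["the", "dog", "sleeps"], contexts, bigrams)
p_unseen = string_probability(["the", "cat", "barks"], contexts, bigrams)
```

Every parameter in this sketch is a count tied to a specific word pair, which is the lexical sensitivity described above. The sketch also shows the limited context: no word more than one position back influences the estimate.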
A second method, referred to as stochastic context-free grammar, uses a tree-like data structure wherein words within an input signal appear as fringe nodes of a tree. Probabilities are assigned as the sum of probabilities of all tree derivations for which words in the candidate string appear as fringe nodes. See, for example, Jelinek et al., "Computation of the Probability of Initial Substring Generation by Stochastic Context-Free Grammars," Computational Linguistics, v. 17(3), pp. 315-324 (1991). In context-free grammars, structural properties are modeled, i.e., the probability that a phrase of a particular category, e.g., noun or verb phrases, can be decomposed into subphrases of specified categories.
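The summation over tree derivations can be illustrated with a chart-based computation commonly known as the inside algorithm. The sketch below assumes a grammar in Chomsky normal form (each rule rewrites a category either as two categories or as a single word); the toy rules, their probabilities, and all names are hypothetical and are not drawn from the cited article.

```python
from collections import defaultdict

# Hypothetical toy grammar. Keys are rules, values are rule probabilities.
BINARY = {
    ("S", "NP", "VP"): 1.0,   # S  -> NP VP
    ("NP", "Det", "N"): 1.0,  # NP -> Det N
    ("VP", "V", "NP"): 0.6,   # VP -> V NP
}
LEXICAL = {
    ("Det", "the"): 1.0,
    ("N", "dog"): 0.5,
    ("N", "cat"): 0.5,
    ("V", "sees"): 1.0,
    ("VP", "sleeps"): 0.4,
}


def string_probability(words, start="S"):
    """Sum the probabilities of all derivations of `words` from `start`."""
    n = len(words)
    # chart[(i, j)][cat] = probability that `cat` derives words[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for (cat, word), p in LEXICAL.items():
            if word == w:
                chart[(i, i + 1)][cat] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between subphrases
                for (parent, left, right), p in BINARY.items():
                    # "+=" accumulates over alternative derivations
                    chart[(i, j)][parent] += (
                        p * chart[(i, k)][left] * chart[(k, j)][right]
                    )
    return chart[(0, n)][start]


p_long = string_probability("the dog sees the cat".split())
p_short = string_probability("the dog sleeps".split())
p_bad = string_probability("dog the sleeps".split())
```

The parameters here are attached to phrase categories and their decompositions, with words entering only through the lexical rules. The toy grammar happens to admit a single derivation per string; with an ambiguous grammar the accumulation step would sum the contributions of every derivation.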
Both of the aforementioned methods for assessing probability suffer from disadvantages. The N-gram model, while lexically sensitive, fails to capture meaningful long-range associations between words. When grammar is ignored, useful information that can only be derived from grammatical relationships between words is lost. While a stochastic context-free grammar is sensitive to such grammatical relationships, it fails to capture associations between lexical items that reflect semantic information and that make one string much more likely than another. A language model that fails to consider both semantic and structural information inevitably suffers from a loss in accuracy.
The prior art probability models are typically compiled into one large state machine. The aforementioned drawback of the lexically-sensitive probability models is due, in part, to this structure. The machines usually implemented for speech recognition are typically limited to moving left to right through the word string hypotheses, processing word strings in a word-by-word manner. As a result, the long-range associations between words are lost.
Compiling stochastic context-free grammars, or, more properly, approximations of such grammars, into one large state machine does not limit the ability of those models to capture long-range associations. As previously discussed, such associations are captured due to the nature of the model. There is another drawback, however, related to the use of a single large state machine that affects both types of probability models. When the model is compiled into one large state machine, the complete lexicon or vocabulary of the language model must be contained therein. In the typical case of a software implementation, such state machines become too large for computers with limited RAM.
Thus, there is a need for a language model that possesses both lexical and structural sensitivity and that, when implemented in software, is compact enough to be installed on computers having limited RAM.