Language models have been used in many areas to predict an outcome based on previous data. With respect to speech recognition, products that recognize continuously spoken small vocabularies have been on the market for over a decade. However, a more important goal is to develop speech recognition systems capable of recognizing unrestricted continuous speech.
Certain automatic speech recognition devices, automatic language translation devices, and automatic spelling correction devices have been known to operate according to the model shown in Equation (1):

$$p(w \mid y) = \frac{p(w)\, p(y \mid w)}{p(y)} \qquad (1)$$
In this model, w is a word-series hypothesis representing a series of one or more words, for example English-language words. The term p(w) is the probability of occurrence of the word-series hypothesis. The variable y is an observed signal, and p(y) is the probability of occurrence of the observed signal. The term p(w|y) is the probability of occurrence of the word series w, given the occurrence of the observed signal y. The term p(y|w) is the probability of occurrence of the observed signal y, given the occurrence of the word series w.
For automatic speech recognition, y is an acoustic signal. See L. R. Bahl, et al. "A Maximum Likelihood Approach to Continuous Speech Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume PAMI-5, No. 2, March 1983, pages 179-190, herein incorporated by reference. For automatic language translation, y is a sequence of words in another language different from the language of the word-series hypothesis. See P. F. Brown, et al. "A Statistical Approach To Machine Translation." Computational Linguistics, Vol. 16, No. 2, June 1990, pages 79-85. For automatic spelling correction, y is a sequence of characters produced by a possibly imperfect typist. See E. Mays, et al. "Context Based Spelling Correction." Information Processing & Management, Vol. 27, No. 5, 1991, pages 517-522.
In all three applications, given a signal y, one seeks to determine the series of English words, w, which gave rise to the signal y. In general, many different word series can give rise to the same signal y. The model minimizes the probability of choosing an erroneous word series by selecting the word series w having the largest conditional probability given the observed signal y.
As shown in Equation (1), the conditional probability of the word series w given the observed signal y is the combination of three terms: (i) the probability of the word series w, multiplied by (ii) the probability that the observed signal y will be produced when the word-series w is intended, divided by (iii) the probability of observing the signal y.
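The selection rule described above can be illustrated with a minimal sketch. The two hypotheses and all probability values below are invented for illustration, and the sketch assumes that the listed hypotheses are the only candidates, so that p(y) can be obtained by summing p(w) p(y|w) over them. Because p(y) is the same for every hypothesis, it does not affect which word series is selected.

```python
# Toy illustration of the decision rule in Equation (1).
# All probabilities are invented; the hypothesis list is assumed exhaustive.

def posterior(prior, likelihood):
    """Return p(w | y) for each hypothesis w, given p(w) and p(y | w)."""
    # p(y) is obtained by summing p(w) * p(y | w) over all hypotheses.
    evidence = sum(prior[w] * likelihood[w] for w in prior)
    return {w: prior[w] * likelihood[w] / evidence for w in prior}

def best_hypothesis(prior, likelihood):
    """Select the w with the largest p(w) * p(y | w).

    Dividing by p(y) would not change the winner, so it is skipped here.
    """
    return max(prior, key=lambda w: prior[w] * likelihood[w])

# p(w): language model scores (hypothetical values).
prior = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
# p(y | w): acoustic model scores for one observed signal y (hypothetical values).
likelihood = {"recognize speech": 0.30, "wreck a nice beach": 0.35}

post = posterior(prior, likelihood)
best = best_hypothesis(prior, likelihood)
print(best, post[best])
```

In this example the second hypothesis has the higher acoustic score, but the language model term p(w) reverses the decision.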
In the case of automatic speech recognition, the probability of the acoustic signal y given the hypothesized word series w is estimated by using an acoustic model of the word series w. In automatic language translation, the probability of the sequence of words y in another language given the hypothesized English-translation word series w is estimated by using a translation model for the word series w. In automatic spelling correction, the probability of the sequence of characters y produced by a possibly imperfect typist given the hypothesized word series w is estimated by using a mistyping model for the word series w.
In these types of applications, the probability of the word series w can be modeled according to the equation:

$$p(w_1^k) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_k \mid w_1^{k-1}) \qquad (2)$$

where $w_1^k$ represents a series of words $w_1, w_2, \ldots, w_k$.
In the conditional probability $p(w_k \mid w_1^{k-1})$, the term $w_1^{k-1}$ is called the history or the predictor feature and represents the initial (k-1) words of the word series. Each word in the history is a predictor word. The term $w_k$ is called the predicted feature or the category feature.
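The decomposition of Equation (2) can be sketched as follows. The function name `series_log_prob`, the lookup table, and its probability values are hypothetical and serve only to show how each word is scored against its full history; logarithms are summed rather than multiplying probabilities directly, which is a common way to avoid numerical underflow on long word series.

```python
import math

def series_log_prob(words, cond_prob):
    """Sum log p(w_k | w_1 ... w_{k-1}) over the word series, per Equation (2)."""
    total = 0.0
    for k, word in enumerate(words):
        history = tuple(words[:k])  # the predictor feature: all preceding words
        total += math.log(cond_prob(word, history))
    return total

# Hypothetical conditional probabilities, keyed by (predicted word, history).
TABLE = {
    ("the", ()): 0.5,
    ("cat", ("the",)): 0.2,
    ("sat", ("the", "cat")): 0.1,
}

def toy_cond_prob(word, history):
    """Look up p(word | history) in the toy table above."""
    return TABLE[(word, history)]

log_p = series_log_prob(["the", "cat", "sat"], toy_cond_prob)
p = math.exp(log_p)  # product of the three conditional probabilities
print(p)
```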
The mechanism for estimating the conditional probabilities in Equation (2) is a language model. A language model estimates the conditional probabilities from limited training text (training data). The larger the training text, and the larger the number of parameters in the language model, the more accurate and precise are the predictions from the language model.
As stated above, a purpose of a language model is to assign probabilities to a word series, e.g., the probability that the word $w_3$ completes the trigram $w_1 w_2 w_3$, given that the bigram $w_1 w_2$ has just occurred.
A previously successful language model is a trigram model based upon deleted interpolation as described in Bahl, et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, pp. 179-190 (March 1983).
This model requires the storage of records that identify: (a) a trigram identification $w_1 w_2 w_3$ and its count $c(w_1 w_2 w_3)$; (b) a bigram identification $w_2 w_3$ and its count $c(w_2 w_3)$; and (c) a unigram identification $w_3$ and its count $c(w_3)$. The count of a given trigram is the number of occurrences of this given trigram in the training data. Significant redundancy exists in this model because a particular bigram is counted both as part of the trigram counts and as part of the bigram count, i.e., a given $w_2 w_3$ is counted twice, thereby increasing the amount of storage required. This redundancy is shown by Equation (3), which shows that the bigram count is the sum of the corresponding trigram counts:

$$c(w_2 w_3) = \sum_{w_1} c(w_1 w_2 w_3) \qquad (3)$$
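The redundancy expressed by Equation (3) can be checked on a toy corpus with the following sketch. The boundary symbol `<s>`, the function names, and the sample text are assumptions introduced for illustration. The training text is padded with two boundary symbols so that every word, including the first, has a two-word history; under that assumption the bigram counts can be recovered exactly by summing trigram counts over the first word.

```python
from collections import Counter

BOUNDARY = "<s>"  # hypothetical start-of-text symbol, not from the source

def count_ngrams(tokens):
    """Return trigram, bigram, and unigram counts keyed by their final word w3."""
    padded = [BOUNDARY, BOUNDARY] + list(tokens)
    tri = Counter(zip(padded, padded[1:], padded[2:]))  # c(w1 w2 w3)
    bi = Counter(zip(padded[1:], padded[2:]))           # c(w2 w3)
    uni = Counter(tokens)                               # c(w3)
    return tri, bi, uni

def bigrams_from_trigrams(tri):
    """Recover c(w2 w3) by summing c(w1 w2 w3) over w1, as in Equation (3)."""
    derived = Counter()
    for (_w1, w2, w3), count in tri.items():
        derived[(w2, w3)] += count
    return derived

tokens = "the cat sat on the mat the cat ran".split()
tri, bi, uni = count_ngrams(tokens)
derived_bi = bigrams_from_trigrams(tri)
print(derived_bi == bi)  # the separately stored bigram counts carry no new information
```

Since the stored bigram records can be derived from the trigram records in this way, storing both duplicates information.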
The probability assigned to the next word by this trigram model is shown in Equation (4):

$$p(w_3 \mid w_1 w_2) = \lambda_3 \frac{c(w_1 w_2 w_3)}{c(w_1 w_2)} + \lambda_2 \frac{c(w_2 w_3)}{c(w_2)} + \lambda_1 \frac{c(w_3)}{N} + \lambda_0 \frac{1}{V} \qquad (4)$$

where V is the vocabulary size in number of words, N is the size of the training data in number of words, and the $\lambda_i$ (i = 0, ..., 3) are the smoothing parameters. The smoothing parameters are the relative weights given to each quotient of Equation (4). They are estimated by using a portion of the training data. A percentage, e.g., five percent, of the training data is "held out", i.e., not used to train the language model. Instead, this held-out data is used to fine-tune the smoothing parameters, which are estimated by maximizing the likelihood of the held-out data. This procedure is more fully described in Bahl, et al., "A Maximum Likelihood Approach to Continuous Speech Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, pp. 179-190 (March 1983).
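The interpolation and the held-out estimation described above can be sketched as follows. The sketch assumes the conventional form of Equation (4), in which the weight with index 3 applies to the trigram quotient, index 2 to the bigram quotient, index 1 to the unigram quotient c(w3)/N, and index 0 to the uniform quotient 1/V. The re-estimation step is an expectation-maximization update, which is one common way to increase held-out likelihood and is used here as an assumption rather than as the exact procedure of the cited reference. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Collect trigram, bigram, and unigram counts from the training tokens."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    return tri, bi, uni

def components(w1, w2, w3, tri, bi, uni, n_words, vocab_size):
    """Return the four quotients, ordered to match weights 0 through 3."""
    def ratio(num, den):
        return num / den if den else 0.0  # an unseen history contributes nothing
    return (
        1.0 / vocab_size,                         # uniform quotient
        uni[w3] / n_words,                        # unigram quotient
        ratio(bi[(w2, w3)], uni[w2]),             # bigram quotient
        ratio(tri[(w1, w2, w3)], bi[(w1, w2)]),   # trigram quotient
    )

def interpolated_prob(lambdas, comps):
    """Weighted sum of the four quotients."""
    return sum(lam * c for lam, c in zip(lambdas, comps))

def heldout_log_likelihood(lambdas, heldout_comps):
    """Log-likelihood of the held-out trigrams under the given weights."""
    return sum(math.log(interpolated_prob(lambdas, c)) for c in heldout_comps)

def reestimate(lambdas, heldout_comps):
    """One EM update: each weight becomes its average share of the held-out probability."""
    expected = [0.0] * len(lambdas)
    for comps in heldout_comps:
        total = interpolated_prob(lambdas, comps)
        for i, (lam, c) in enumerate(zip(lambdas, comps)):
            expected[i] += lam * c / total
    norm = sum(expected)
    return [e / norm for e in expected]

train = "the cat sat on the mat the cat ran".split()
heldout = "the cat ran on the mat".split()  # toy stand-in for held-out text
tri, bi, uni = train_counts(train)
n_words, vocab_size = len(train), len(uni)

heldout_comps = [
    components(a, b, c, tri, bi, uni, n_words, vocab_size)
    for a, b, c in zip(heldout, heldout[1:], heldout[2:])
]

lambdas = [0.25, 0.25, 0.25, 0.25]  # initial weights, chosen arbitrarily
before = heldout_log_likelihood(lambdas, heldout_comps)
for _ in range(20):
    lambdas = reestimate(lambdas, heldout_comps)
after = heldout_log_likelihood(lambdas, heldout_comps)
print(lambdas, before, after)
```

The uniform quotient keeps every probability strictly positive, so the logarithm is always defined, including for held-out trigrams that never occurred in the training text.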
The storage requirements for trigram-based language models are dominated by the trigram record storage. A method and system that reduce the memory requirement of language models without a significant reduction in performance are needed.