The invention relates to computer speech recognition.
In computer speech recognition, the probability of occurrence of a hypothesized string $w$ of one or more words, given the occurrence of an acoustic processor output string $y$, may be given by
$$P(w \mid y) = \frac{P(y \mid w)\,P(w)}{P(y)} \qquad (1)$$
In Equation 1, the probability $P(y \mid w)$ of the acoustic processor output string $y$, given the utterance of hypothesized word string $w$, is estimated with an acoustic model of the hypothesized word string $w$. The probability $P(w)$ of occurrence of the hypothesized word string $w$ is estimated using a language model. Since the probability $P(y)$ of occurrence of the acoustic processor output string $y$ does not depend on the hypothesized word string $w$, it may be treated as a constant.
The use of Equation 1 to directly decode a complete acoustic processor output string $y$ is not feasible whenever the number of different hypothesized word strings $w$ is very large. For example, the number of different word strings $w$ of ten words which can be constructed from a 20,000 word vocabulary is $20{,}000^{10} = 1.024 \times 10^{43}$.
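The direct decoding described above can be illustrated with a minimal sketch. It enumerates every word string of a given length and keeps the one maximizing $P(y \mid w)\,P(w)$; the constant $P(y)$ is omitted because it does not change which string scores highest. The function names and the toy probability functions are hypothetical and stand in for a real acoustic model and language model.

```python
from itertools import product


def count_word_strings(vocab_size: int, length: int) -> int:
    """Number of distinct word strings of the given length."""
    return vocab_size ** length


def decode_exhaustive(vocab, length, acoustic_prob, language_prob):
    """Score every word string w by P(y|w) * P(w) and return the best one.

    P(y) is left out: it is the same for every w, so it cannot change
    the ranking of the hypotheses.
    """
    best, best_score = None, float("-inf")
    for w in product(vocab, repeat=length):
        score = acoustic_prob(w) * language_prob(w)
        if score > best_score:
            best, best_score = w, score
    return best, best_score


# Hypothetical stand-ins for an acoustic model and a language model.
def toy_acoustic_prob(w):
    return 0.9 if w == ("the", "cat") else 0.1


def toy_language_prob(w):
    return 0.5 if w[0] == "the" else 0.25


if __name__ == "__main__":
    print(count_word_strings(20_000, 10))
    print(decode_exhaustive(["the", "a", "cat"], 2,
                            toy_acoustic_prob, toy_language_prob))
```

With a three-word vocabulary the loop visits only nine strings. With the 20,000 word vocabulary and ten-word strings of the example above, the same loop would have to visit every one of the strings counted by `count_word_strings(20_000, 10)`.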
When the use of Equation 1 is not feasible, the amount of computation can be reduced by carrying out a left-to-right search starting at an initial state with single-word hypotheses, and searching successively longer word strings.
From Equation 1, the probability $P(w \mid y_1^i)$ of a hypothesized incomplete string $w$ of one or more words, given the occurrence of an initial subsequence $y_1^i$ of the acoustic processor output string $y$, may be given by: ##EQU2## where $y_1^i$ represents acoustic processor outputs $y_1$ through $y_i$. However, the value of $P(w \mid y_1^i)$ in Equation 2 decreases with lengthening acoustic processor output subsequence $y_1^i$, making it unsuitable for comparing subsequences of different lengths. Consequently, Equation 2 can be modified with a normalization factor to account for the different lengths of the acoustic processor output subsequences during the search through incomplete subsequences: ##EQU3## where $\alpha$ can be chosen by trial and error to adjust the average rate of growth of the match score along the most likely path through the model of $w$, and where $E(y_{i+1}^n \mid y_1^i)$ is an estimate of the expected cost of accounting for the remainder of the acoustic processor output sequence $y_{i+1}^n$ with some continuation word string $w'$ of the incomplete hypothesized word string $w$. (See Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983, pages 179-190.)
It is known that the pronunciation of a selected word may depend on the context in which the word is uttered. That is, the pronunciation of a selected word may depend on the prior word or words uttered before the selected word, and may also depend on the subsequent word or words uttered after the selected word. Therefore, a word may have several context-dependent acoustic models, each depending on the prior word or words uttered before the selected word and the subsequent word or words uttered after the selected word. Consequently, the selection of one of several acoustic models of a word will depend on the hypothesized context in which the word is uttered.
In generating a hypothesized string w of one or more words being uttered, words are added to a partial hypothesis one word at a time in the order of time in which they are uttered. After each single word is added, but before any further words are added, the probability of the partial hypothesis is determined according to Equation 1. Only the best scoring partial hypotheses are "extended" by adding words to the ends of the partial hypotheses.
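The word-by-word extension of partial hypotheses can be sketched as follows. This is a simplified illustration, not the search procedure of any particular recognizer: all hypotheses are extended in lockstep and only a fixed number of the best scoring ones survive each step. The function `extend_best_first`, the bigram table, and the scoring function `toy_score` are hypothetical; the toy score stands in for the score of a partial hypothesis.

```python
import heapq

# Hypothetical log scores for word pairs; unseen pairs get a low default.
TOY_BIGRAM_LOG_SCORE = {
    ("<s>", "the"): -0.2,
    ("the", "cat"): -0.3,
    ("cat", "sat"): -0.4,
}
DEFAULT_LOG_SCORE = -5.0


def toy_score(hypothesis):
    """Sum of pairwise log scores over the hypothesis, starting from <s>."""
    total, prev = 0.0, "<s>"
    for word in hypothesis:
        total += TOY_BIGRAM_LOG_SCORE.get((prev, word), DEFAULT_LOG_SCORE)
        prev = word
    return total


def extend_best_first(vocab, score_fn, max_words, keep):
    """Add one word at a time; extend only the `keep` best partial hypotheses."""
    survivors = [()]  # start from the empty hypothesis
    for _ in range(max_words):
        # Append every vocabulary word to the end of every surviving hypothesis.
        candidates = [hyp + (word,) for hyp in survivors for word in vocab]
        # Score each extended hypothesis and keep only the best ones.
        survivors = heapq.nlargest(keep, candidates, key=score_fn)
    return survivors


if __name__ == "__main__":
    for hyp in extend_best_first(["the", "cat", "sat"], toy_score, 3, 2):
        print(hyp, toy_score(hyp))
```

Each step here scores at most `keep` times the vocabulary size candidates, rather than every possible word string of the current length.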
Therefore, when a new word is added to a partial hypothesis, and when the probability of the extended partial hypothesis is determined according to Equation 1, the hypothesized prior word or words are known, but the hypothesized subsequent word or words are not known. Consequently, the acoustic model selected for the new word will be independent of the context of words following the new word.
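The consequence described above can be made concrete with a small sketch. It assumes a hypothetical inventory of context-dependent models keyed by prior word, word, and following word, with `None` marking an unspecified context, and a simple fall-back order. The inventory, the model names, and the fall-back order are illustrative assumptions only.

```python
from typing import Optional

# Hypothetical inventory: (prior word, word, following word) -> model name.
# None marks a context position that the model does not depend on.
MODELS = {
    ("want", "to", "go"): "to[want_go]",
    ("want", "to", "eat"): "to[want_eat]",
    ("want", "to", None): "to[want_*]",
    (None, "to", None): "to[*_*]",
}


def select_model(prior: Optional[str], word: str,
                 following: Optional[str] = None) -> str:
    """Pick the most specific model available for the hypothesized context."""
    for key in ((prior, word, following),
                (prior, word, None),
                (None, word, None)):
        if key in MODELS:
            return MODELS[key]
    raise KeyError(f"no acoustic model for word {word!r}")


if __name__ == "__main__":
    # While extending a partial hypothesis, only the prior word is known.
    print(select_model("want", "to"))
    # If the following word were known, a more specific model could be used.
    print(select_model("want", "to", "go"))
```

When the new word is appended, only the call without `following` is possible, so the model chosen cannot reflect the word that is actually uttered next.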