In various environments, a string of events may be known and a prediction of the next event is sought. For example, in a speech recognition environment, a string of words may be known and a prediction of the next word is sought.
One approach to determining the next word is to store, for each word in the vocabulary, a respective probability of being next based on frequency of occurrence. That is, a sample text is examined and the number of times each predefined sequence of words occurs in the sample text is counted. From the count for a given predefined sequence, a corresponding probability is readily computed. While this approach is useful where there is a small vocabulary of words and an extensive sample text covering the numerous predefined sequences, the approach is inadequate where data is sparse relative to the size of the vocabulary.
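The counting approach described above can be sketched as follows for two-word sequences. This is a minimal illustration, not the method of any cited reference: the function name `bigram_relative_frequencies` and the nine-word sample text are assumptions made for the example.

```python
from collections import Counter

def bigram_relative_frequencies(words):
    """Estimate P(next | previous) from counts of adjacent word pairs."""
    # Count each (previous, next) pair that occurs in the sample text.
    pair_counts = Counter(zip(words, words[1:]))
    # Count how often each word appears as the first element of a pair.
    history_counts = Counter(words[:-1])
    return {
        (prev, nxt): count / history_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }

# Hypothetical sample text, far smaller than any practical corpus.
sample = "the cat sat on the mat the cat ran".split()
probs = bigram_relative_frequencies(sample)

print(probs[("the", "cat")])           # relative frequency of "cat" after "the"
print(probs.get(("the", "dog"), 0.0))  # unseen pair: the estimate falls to zero
```

The second lookup shows the weakness noted above: a pair that never occurs in the sample text receives a zero estimate, however plausible it may be.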
In speech recognition systems which compute next word probability estimates, the available data is typically sparse. In this regard, it is observed that even a very large data collection will normally not include sufficient data from which the probabilities of infrequent word sequences--which may occur rarely or not at all--may be estimated. Hence, there is insufficient data to account for all possible next words.
The problem of sparse data is explained in the context of m-grams. An m-gram is a sequence of m events (in the present case, words). A sample text is examined to determine how often each m-gram occurs therein. An m-gram, it is noted, may be one word, two words, ..., or j words long. The larger the value for j, the more possible combinations of words there are. For a vocabulary of 5000 words, there would be $5000^2 = 25$ million two-word combinations (referred to as bi-grams). Also for 5000 words, there are $5000^3 = 125$ billion three-word combinations (referred to as tri-grams).
It is readily observed that the sample text required to permit each tri-gram to occur just once is impractically large. Moreover, in that different events must occur at different frequencies if the statistics are to be useful, the sample text must be considerably greater than $5000^3$. In addition, if the vocabulary is to be greater than 5000--for example, 20,000--the problem of sparse data is even more pronounced.
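The growth in the number of possible m-grams can be checked with a few lines of arithmetic. The helper name `possible_mgrams` is an assumption for illustration; the vocabulary sizes are the two mentioned in the text.

```python
def possible_mgrams(vocabulary_size, m):
    """Number of distinct ordered sequences of m words from the vocabulary."""
    return vocabulary_size ** m

for vocabulary_size in (5000, 20000):
    for m in (1, 2, 3):
        count = possible_mgrams(vocabulary_size, m)
        print(f"vocabulary={vocabulary_size:>6}  m={m}  possible m-grams={count:,}")
```

For the 20,000-word vocabulary the tri-gram count is 64 times that of the 5000-word case, since the vocabulary is four times larger and the count grows with its cube.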
To address the paucity of data, some prior techniques have linearly summed or combined the respective relative frequencies $f_L$ of m-grams of differing length $L$ for a subject word to estimate the probability of the subject word being the next word. That is, to estimate the probability of word $w_j$ following the ordered string of previous known words $w_1, w_2, \ldots, w_{j-1}$, the following expression has been suggested:

$$\mathrm{Prob}(w_j \mid w_1, \ldots, w_{j-1}) = a\,f_1(w_j) + b\,f_2(w_j \mid w_{j-1}) + c\,f_3(w_j \mid w_{j-2}, w_{j-1}) + \cdots$$

where a, b, and c represent weighting factors which may be included. When probabilities of m-grams of varying length are linearly combined, the relative weight or importance of each must be evaluated, a task which has been found to be difficult to achieve.
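A minimal sketch of such a linear combination, truncated at tri-grams, is given below. The function names, the sample text, and in particular the weights (0.2, 0.3, 0.5) are assumptions chosen only for illustration; as noted above, selecting suitable weights is the difficult part, and the sketch does not address it.

```python
from collections import Counter

def count_mgrams(words, m):
    """Count every run of m consecutive words in the sample text."""
    return Counter(tuple(words[i:i + m]) for i in range(len(words) - m + 1))

def relative_frequency(counts, word, history=()):
    """f(word | history): share of m-grams beginning with history that end in word."""
    history = tuple(history)
    table = counts[len(history) + 1]
    # Total occurrences of m-grams whose leading words match the history.
    denominator = sum(c for gram, c in table.items() if gram[:-1] == history)
    return table[history + (word,)] / denominator if denominator else 0.0

def interpolated_probability(counts, word, previous_words, weights):
    """a*f1(w) + b*f2(w | w_prev) + c*f3(w | w_prev2, w_prev).

    Assumes at least two previous words are supplied.
    """
    a, b, c = weights
    return (a * relative_frequency(counts, word)
            + b * relative_frequency(counts, word, previous_words[-1:])
            + c * relative_frequency(counts, word, previous_words[-2:]))

# Hypothetical sample text and illustrative weights (not taken from the source).
sample = "the cat sat on the mat the cat ran".split()
counts = {m: count_mgrams(sample, m) for m in (1, 2, 3)}
weights = (0.2, 0.3, 0.5)

p = interpolated_probability(counts, "cat", ["on", "the"], weights)
print(p)
```

In this sample the tri-gram "on the cat" never occurs, so the tri-gram term contributes nothing and the estimate rests entirely on the shorter m-grams, which is the motivation for combining lengths in the first place.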
Prior art which addresses the sparse data problem includes U.S. Pat. Nos. 4,538,234, 4,038,503, 4,489,435, 4,530,110 and the following articles: IBM Technical Disclosure Bulletin, vol. 27, Number 7B pp. 4521-3 (Nadas); IBM Technical Disclosure Bulletin, vol. 24, Number 11A pp. 5402-3 (Damerau); IBM Technical Disclosure Bulletin, vol. 24, Number 4 pp. 2038-41 (Bahl et al.); and IBM Technical Disclosure Bulletin, vol. 28, Number 6 pp. 2591-4 (Jelinek et al.).