Statistical language models (SLMs) estimate the probability of a text string as a string of natural language, and thus may be used with applications that output natural language text. For example, systems such as speech recognizers or machine translation systems generate alternative text outputs, and those outputs may be processed by statistical language models to compute probability values indicating which of them are the most natural. The more natural and human-like the piece of text is, the higher the probability that the statistical language model should assign to it.
The most widely used types of statistical language models are N-gram models, which estimate the probability of each word in a text string based on the N−1 preceding words of context. For example, the maximum likelihood estimate (MLE) N-gram model determines the probability of a word in a given context of N−1 preceding words as the ratio, in a training corpus, of the number of occurrences of that word in that context to the total number of occurrences of any word in the same context. However, this estimate assigns a probability of zero to any N-gram that is not observed in the training corpus, and thus the model performs poorly whenever it encounters, in actual usage, an N-gram that was not observed in training.
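By way of illustration only, the MLE ratio described above may be sketched as follows. The function name `mle_ngram_model`, the toy corpus, and the choice of N=2 are assumptions made for this example and are not taken from any particular implementation.

```python
from collections import Counter

def mle_ngram_model(tokens, n):
    """Build an MLE N-gram estimator from a list of training tokens."""
    ngram_counts = Counter()    # occurrences of each full N-gram
    context_counts = Counter()  # occurrences of each (N-1)-word context
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1

    def prob(word, context):
        context = tuple(context)
        total = context_counts[context]
        if total == 0:
            return 0.0  # context never observed in training
        # Ratio of (context, word) occurrences to all occurrences of context.
        return ngram_counts[context + (word,)] / total

    return prob

# Hypothetical toy corpus, used only to exercise the sketch.
corpus = "the cat sat on the mat the cat ran".split()
p = mle_ngram_model(corpus, 2)
p_seen = p("cat", ("the",))    # "the cat" occurs 2 times out of 3 uses of "the"
p_unseen = p("dog", ("the",))  # "the dog" never occurs, so the estimate is zero
```

In this sketch the unseen bigram receives a probability of exactly zero, which is the shortcoming that smoothing is intended to address.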
In order to overcome this problem, numerous smoothing methods have been employed. In general, these methods reduce the probabilities assigned to some or all observed N-grams, in order to provide non-zero probabilities for N-grams not observed in the training corpus.
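As one hedged example of this general idea, the sketch below applies a fixed discount to each observed bigram count and redistributes the removed probability mass through a unigram model. This is a simple interpolated absolute-discounting scheme chosen for illustration; the function name `discounted_bigram_model`, the discount value of 0.75, and the toy corpus are assumptions, and the sketch is not a description of any specific smoothing method discussed herein.

```python
from collections import Counter, defaultdict

def discounted_bigram_model(tokens, discount=0.75):
    """Bigram estimator with a fixed discount, interpolated with unigrams."""
    unigram = Counter(tokens)
    total = len(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()     # how often each word appears as a context
    followers = defaultdict(set)  # distinct words seen after each context
    for (prev, word), count in bigram.items():
        context_total[prev] += count
        followers[prev].add(word)

    def prob(word, prev):
        p_uni = unigram[word] / total
        c = context_total[prev]
        if c == 0:
            return p_uni  # unseen context: fall back to the unigram estimate
        # Observed bigrams give up `discount` counts each...
        discounted = max(bigram[(prev, word)] - discount, 0.0) / c
        # ...and the freed mass is spread according to the unigram model.
        backoff_weight = discount * len(followers[prev]) / c
        return discounted + backoff_weight * p_uni

    return prob

# Hypothetical toy corpus, used only to exercise the sketch.
corpus = "the cat sat on the mat the cat ran".split()
p_smooth = discounted_bigram_model(corpus)
seen = p_smooth("cat", "the")    # lower than the MLE value of 2/3
unseen = p_smooth("sat", "the")  # non-zero although "the sat" was never observed
```

Here the observed bigram "the cat" receives less than its MLE probability, and the difference is what allows the unobserved bigram "the sat" to receive a non-zero probability while the distribution over the vocabulary still sums to one.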
Kneser-Ney smoothing and its variants, well-known in the art, are generally recognized as the most effective smoothing methods for estimating N-gram language models. For example, Kneser-Ney smoothing and its variants provide very high quality results as measured by evaluating how well such models assign higher probabilities to randomly-selected human-generated text versus the probabilities assigned to mechanically-generated or randomly-generated text.
Smoothing methods operate by using a hierarchy of lower-order models (e.g., unigram, then bigram and so on) to smooth the highest-order N-gram model. In most smoothing methods, the lower-order N-gram models are recursively estimated in the same way as the highest-order model. However, in Kneser-Ney smoothing the lower-order models are estimated differently from the highest-order model. More particularly, Kneser-Ney smoothing is based upon using nonstandard N-gram (diversity) counts for the lower-order models.
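To illustrate what a diversity count is, the following minimal sketch computes, for each word, the number of distinct words that precede it in a training corpus, which is the kind of count a bigram Kneser-Ney model would use for its unigram-level model in place of the raw word frequency. The function name `continuation_counts` and the toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict

def continuation_counts(tokens):
    """For each word, count the distinct words that immediately precede it."""
    left_contexts = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts[word].add(prev)
    return {word: len(ctxs) for word, ctxs in left_contexts.items()}

# Hypothetical toy corpus, used only to exercise the sketch.
tokens = "san francisco and san francisco or the bay and the city".split()
raw = Counter(tokens)                    # standard occurrence counts
diversity = continuation_counts(tokens)  # nonstandard (diversity) counts

# "francisco" and "the" each occur twice, but "francisco" only ever
# follows "san", while "the" follows two different words.
francisco_counts = (raw["francisco"], diversity["francisco"])
the_counts = (raw["the"], diversity["the"])
```

In this sketch the two words have equal raw counts but different diversity counts, which shows how a lower-order model built from diversity counts can differ substantially from one built from ordinary frequencies.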
As a result of these nonstandard N-gram counts, Kneser-Ney smoothing is inappropriate or inconvenient for some types of applications, including coarse-to-fine speech recognition and machine translation applications that search using a sequence of lower-order to higher-order language models. In general, this is because the lower-order models used in Kneser-Ney smoothing are primarily directed towards estimating unobserved N-grams, and thus the lower-order models provide very poor estimates of the probabilities for N-grams that actually have been observed in the training corpus. Further, the nonstandard N-gram counts of Kneser-Ney smoothing cannot be computed efficiently for language models trained on very large corpora (e.g., on the order of forty billion words), such as when processing that amount of data depends on using a “backsorted trie” data structure.
In sum, Kneser-Ney smoothing provides very high-quality statistical language models. However, Kneser-Ney smoothing is not appropriate for use in certain applications. What is desirable is a smoothing technology for tokens (words, characters, symbols and so forth) that can be readily used with such applications, as well as other applications, while at the same time providing generally similar high-quality results.