Statistical language models (SLMs) estimate the probability of a text string occurring as a string of natural language, and thus may be used with applications that output natural language text. For example, systems such as speech recognizers or machine translation systems generate alternative text outputs, and those outputs may be processed by a statistical language model to compute probability values indicating which of them is the most natural.
In general, N-gram language models estimate the probability of a particular word in a given context of n−1 preceding words. In a typical implementation, the probability is based upon the number of times the word appears in that context in a training corpus, the number of times the n−1 word context appears in the training corpus, and an estimate of the probability of the word given a shorter context of n−2 preceding words. Because the model has to handle words that do not appear in a given context anywhere in the training data, smoothing is performed, which in general reduces the probabilities of words observed in a particular context so as to reserve some probability mass to assign to words that have not been observed in that context. Generally, the estimate of the probability of a word given the shorter context is multiplied by a constant, called the backoff weight, that depends on the longer context. The backoff weight is chosen so that the probabilities of all words, given the longer context, sum to 1.0, creating a valid conditional probability distribution.
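As a concrete illustration, the backoff scheme described above can be sketched for the bigram case. The absolute-discounting constant and the helper names below are illustrative assumptions, not the specific method of any particular system:

```python
from collections import defaultdict

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a corpus given as lists of tokens."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            unigrams[word] += 1
            if i > 0:
                bigrams[(sentence[i - 1], word)] += 1
    return dict(unigrams), dict(bigrams)

def backoff_prob(word, prev, unigrams, bigrams, total, discount=0.5):
    """P(word | prev): a discounted bigram estimate that backs off to the
    unigram estimate, scaled by a backoff weight so the distribution sums to 1."""
    p_unigram = unigrams.get(word, 0) / total
    context_count = unigrams.get(prev, 0)
    if context_count == 0:
        return p_unigram                      # unseen context: pure backoff
    pair_count = bigrams.get((prev, word), 0)
    if pair_count > 0:
        # Discounted higher-order estimate for a bigram seen in training.
        return (pair_count - discount) / context_count
    # Mass reserved by discounting the seen bigrams in this context...
    seen = [w for (p, w) in bigrams if p == prev]
    reserved = discount * len(seen) / context_count
    # ...is spread over unseen words in proportion to their unigram probability.
    unseen_mass = 1.0 - sum(unigrams[w] / total for w in seen)
    alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
    return alpha * p_unigram
```

For any observed context, the probabilities over the full vocabulary sum to 1.0; making that sum come out right is exactly the role of the backoff weight (`alpha` here).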
The more natural and human-like a piece of text is, the higher the probability that the statistical language model should assign to it. One way to measure the quality of a language model is to compare the probabilities it assigns to human-generated text with those it assigns to mechanically or randomly generated text. A standard metric for the probability a statistical language model assigns to a text string is perplexity. The perplexity of a text string under a language model is a positive real number that is smaller the more likely the string is according to the model; it can be understood as the inverse geometric mean of the per-word probabilities. Thus, a good language model will assign a lower perplexity to human-generated text than it assigns to mechanically or randomly generated text.
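In concrete terms, perplexity can be computed as the exponential of the average negative log-probability of the tokens. A minimal sketch, assuming the per-token conditional probabilities have already been obtained from the model:

```python
import math

def perplexity(token_probs):
    """Perplexity of a string given its per-token conditional probabilities:
    exp of the average negative log-probability, i.e. the inverse
    geometric mean of the probabilities."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

A model that assigns every token probability 0.5 yields perplexity 2, while a uniform probability of 1/V over a vocabulary of size V yields perplexity V, which is why lower perplexity indicates a better fit to the text.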
Another factor in evaluating statistical language models is their size. When training language models, higher- and higher-order N-gram models (at present sometimes ranging from 5-grams to 7-grams) and larger and larger corpora are being used, because doing so tends to increase model quality (i.e., result in lower perplexity on human-generated text). In general, training proceeds by building a lower-order model, using that lower-order model to smooth the next-higher-order model, and so on, until the final N-gram model is built.
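The order-by-order procedure can be sketched as follows. Representing each "model" as a bare count table is a simplifying assumption of this sketch; a real implementation would also smooth each order against the one below it:

```python
from collections import defaultdict

def train_order_by_order(corpus, max_order=3):
    """Build count tables from order 1 up to max_order; each lower-order
    table is retained so the next-higher order can be smoothed against it."""
    models = {}
    for order in range(1, max_order + 1):
        counts = defaultdict(int)
        for sentence in corpus:
            for i in range(len(sentence) - order + 1):
                counts[tuple(sentence[i:i + order])] += 1
        models[order] = dict(counts)
    return models
```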
However, higher-order N-gram training tends to result in language models that are so large that they are impractical and/or inconvenient to use in many scenarios. For example, when a statistical language model cannot be stored in the memory of a single server, complex and inefficient distributed storage schemes need to be used.
There are known pruning techniques that can reduce the size of a model, but only at the expense of significantly reducing its quality. What is desirable is a way to provide a reduced-size statistical language model whose quality is not significantly reduced (and may even be improved) relative to an un-pruned language model.
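The text does not specify which pruning techniques are meant. A common baseline is count-cutoff pruning, which drops N-grams seen fewer than a threshold number of times and relies on backoff to estimate the dropped entries; more refined methods, such as relative-entropy (Stolcke) pruning, instead drop the entries whose removal least changes the model's distribution. The cutoff baseline is sketched below:

```python
def count_cutoff_prune(ngram_counts, threshold=2):
    """Keep only N-grams observed at least `threshold` times; pruned
    entries are later estimated by backing off to lower-order models.
    This blunt baseline is the kind of pruning that can significantly
    hurt model quality, as noted above."""
    return {ngram: c for ngram, c in ngram_counts.items() if c >= threshold}
```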