The present invention relates to language models. In particular, the present invention relates to formats for storing language models.
Language models provide probabilities for sequences of words. Such models are trained by counting how often each sequence of words occurs in a set of training data. One problem with training language models in this way is that any sequence of words that is not observed in the training data receives a probability of zero in the language model, even though the sequence may occur in the language.
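The zero-probability problem can be illustrated with a minimal sketch of count-based training for a bigram model. The function name `train_bigram_mle` and the toy sentence are hypothetical and chosen only for illustration; the sketch assumes plain relative-frequency estimation with no smoothing.

```python
from collections import Counter

def train_bigram_mle(tokens):
    """Return a function giving the relative-frequency estimate P(word | prev)."""
    # Count each word in its role as a bigram context (every token but the last).
    context_counts = Counter(tokens[:-1])
    # Count each adjacent pair of words.
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0  # context never observed
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Toy training data (illustrative only).
tokens = "the cat sat on the mat".split()
prob = train_bigram_mle(tokens)

seen = prob("the", "cat")    # bigram observed in the training data
unseen = prob("the", "dog")  # plausible in the language, but never observed
```

Here `unseen` is zero simply because "the dog" does not appear in the toy data, which is the behavior that back-off and interpolation techniques are intended to correct.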
To overcome this problem, back-off modeling techniques have been developed. Under a back-off technique, if a sequence of n words is not found in the training data, the probability for the sequence of words is estimated using a probability for a sequence of n-1 words and a back-off weight. For example, if a trigram (w_{n-2} w_{n-1} w_n) is not observed in the training data, its probability is estimated using the probability of the bigram (w_{n-1} w_n) and a back-off weight associated with the context (w_{n-2} w_{n-1}).
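The back-off lookup described above can be sketched as a short recursive function. This is an illustrative sketch rather than a definitive implementation: the names `backoff_prob`, `probs`, and `backoff_weights` are hypothetical, the numeric values are made up, and the sketch assumes that a context with no stored back-off weight is given a weight of 1.0 and that an unseen unigram is given a probability of zero.

```python
def backoff_prob(ngram, probs, backoff_weights):
    """Estimate the probability of the last word of `ngram` given the preceding words.

    ngram:           tuple of words, e.g. ("cat", "sat", "on")
    probs:           dict mapping observed n-gram tuples to probabilities
    backoff_weights: dict mapping context tuples to back-off weights
    """
    if ngram in probs:
        return probs[ngram]  # the n-gram was observed; use its stored probability
    if len(ngram) == 1:
        return 0.0  # assumption: no probability mass is reserved for unknown words
    context = ngram[:-1]
    # Assumption: a context without a stored weight backs off with weight 1.0.
    weight = backoff_weights.get(context, 1.0)
    # Drop the oldest word and retry with the shorter (n-1)-gram.
    return weight * backoff_prob(ngram[1:], probs, backoff_weights)

# Illustrative values only.
probs = {("sat", "on"): 0.5, ("on",): 0.1}
backoff_weights = {("cat", "sat"): 0.4}

# The trigram is unseen, so the estimate is
# weight("cat sat") * P("on" | "sat") = 0.4 * 0.5.
p = backoff_prob(("cat", "sat", "on"), probs, backoff_weights)
```

If the bigram were also unseen, the same function would back off once more to the unigram probability, multiplying in the back-off weight of the bigram context along the way.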
N-gram language models that use back-off techniques are typically stored in a standard format referred to as the ARPA format. Because of the popularity of back-off language models, the ARPA format has become a recognized standard for transmitting language models. However, not all language models have back-off weights. In particular, deleted interpolation N-gram models do not have back-off weights because they use a different technique for handling the data sparseness problem associated with language models. As a result, deleted interpolation language models have not been stored in the standard ARPA format. Consequently, it has been difficult to integrate deleted interpolation language models into language systems that expect to receive the language model in the ARPA format.
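To show why the ARPA format presupposes back-off weights, the following sketch serializes a tiny model in the commonly used ARPA text layout: a `\data\` header with per-order counts, one section per n-gram order, and a closing `\end\` marker. Each entry holds a base-10 log probability, the words, and, for every order except the highest, a back-off weight. The function name `write_arpa`, the in-memory data layout, and all numeric values are hypothetical and chosen only for illustration.

```python
def write_arpa(ngrams_by_order):
    """Serialize a model to ARPA-style text.

    ngrams_by_order: list whose k-th element is a dict for (k+1)-grams,
    mapping a tuple of words to (log10_prob, backoff_weight_or_None).
    """
    lines = ["\\data\\"]
    # Header: number of entries for each n-gram order.
    for order, table in enumerate(ngrams_by_order, start=1):
        lines.append("ngram " + str(order) + "=" + str(len(table)))
    # One section per order, with one entry per line.
    for order, table in enumerate(ngrams_by_order, start=1):
        lines.append("")
        lines.append("\\" + str(order) + "-grams:")
        for words, (logp, bow) in sorted(table.items()):
            fields = ["%.4f" % logp, " ".join(words)]
            if bow is not None:
                # The back-off weight slot; highest-order entries omit it.
                fields.append("%.4f" % bow)
            lines.append("\t".join(fields))
    lines.append("")
    lines.append("\\end\\")
    return "\n".join(lines)

# Illustrative values only.
unigrams = {("the",): (-0.3010, -0.2218), ("cat",): (-0.3010, -0.3010)}
bigrams = {("the", "cat"): (-0.1249, None)}
text = write_arpa([unigrams, bigrams])
```

A consumer of such a file reads the trailing field of each lower-order entry as a back-off weight, which is why a model that has no back-off weights does not fit the layout directly.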