Language models provide probabilities for sequences of words and are a primary component in most modern speech and language applications. These models are generated from a set of training data by counting how often each sequence of n words occurs in that data (where n is a positive integer). Sequences of n words are referred to as n-grams, and n-grams are classified by the number of words they contain. For example, a unigram is a single word, a bigram is an ordered sequence of two words, a trigram is an ordered sequence of three words, and a 5-gram is an ordered sequence of five words. Because not all possible sequences of words will appear in the training data, back-off modeling techniques have been developed to assign estimated frequencies to sequences that do not appear.
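The counting and back-off ideas described above can be illustrated with a minimal sketch. It is not the method of any particular system: the function names `count_ngrams` and `backoff_score`, the toy sentence, and the fixed penalty of 0.4 are all assumptions chosen for illustration. The back-off rule used here is a simple fixed-penalty scheme that yields unnormalized scores rather than true probabilities; production models typically use a normalized scheme such as Katz back-off.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(ngram, counts, total_unigrams, penalty=0.4):
    """Score the last word of `ngram` given the preceding words.

    If the full n-gram was seen, return its relative frequency.
    Otherwise drop the earliest word, multiply by a fixed penalty
    (an assumed value), and retry with the shorter n-gram.
    The result is a score, not a normalized probability.
    """
    ngram = tuple(ngram)
    weight = 1.0
    while len(ngram) > 1:
        if counts[ngram] > 0:
            # A seen n-gram implies its context (prefix) was seen too.
            return weight * counts[ngram] / counts[ngram[:-1]]
        ngram = ngram[1:]      # back off to a shorter history
        weight *= penalty
    return weight * counts[ngram] / total_unigrams

# Toy training data (hypothetical).
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_n=3)

seen = backoff_score(("the", "cat"), counts, len(tokens))
unseen = backoff_score(("on", "the", "cat"), counts, len(tokens))
print(seen, unseen)
```

In this toy example the bigram "the cat" appears in the data, so it is scored directly from its count, while the trigram "on the cat" does not appear and is scored from the bigram "the cat" with the penalty applied. Note that the count table holds one entry per distinct n-gram of every order, which is why it grows quickly with the size of the training data.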
Many speech and language applications, in particular automatic speech recognition (ASR) and machine translation (MT), have evolved over the past decade and now offer high performance and usability. Despite extensive research on novel approaches, the standard back-off n-gram language model remains the model of choice in most applications due to its efficiency and reliability. Significant gains in performance are achieved by training language models on larger amounts of data. However, very large data sets (e.g., data sets including billions of words) pose a computational challenge, because billions of parameters must then be estimated and stored. Systems and methods are needed for reducing the memory requirements of language models without reducing model accuracy.