This specification relates to language models stored for digital language processing.
Language models are used to model a probability that a string of words in a given vocabulary will appear in a language. For example, language models are used in automatic speech recognition, machine translation, and optical character recognition applications. Modeling the probability for a string of words in the vocabulary is typically performed using a chain rule and calculating the probability of a given word, w, in a given string context, p(w|context), where the context is the words in the string preceding the given word, w.
In an n-gram language model, the words in the vocabulary are formed into n-grams. An n-gram is a sequence of n consecutive tokens, which in the present specification are typically words. An n-gram has an order, which is the number of words in the n-gram. For example, a 1-gram (or unigram) includes one word; a 2-gram (or bi-gram) includes two words.
A given n-gram can be described according to different portions of the n-gram. An n-gram can be described as a context and a future word, (context, w), where the context has a length n−1 and w represents the future word. For example, the 3-gram “the black sheep” can be described in terms of an n-gram context and a future word. The n-gram context includes all words of the n-gram preceding the last word of the n-gram. In the given example, “the black” is the context. The leftmost word in the context is referred to as the left word. The future word is the last word of the n-gram, which in the example is “sheep”. The n-gram can also be described with respect to a right context and a backed off context. The right context, also referred to as a “back-off n-gram”, includes all words of the n-gram following the first word of the n-gram, represented as an (n−1)-gram. In the example above, “black sheep” is the right context. Additionally, the backed off context is the context of the n-gram less the leftmost word in the context. In the example above, “black” is the backed off context.
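The decomposition described above can be illustrated with a short Python sketch. The names `NGramParts` and `split_ngram` are hypothetical and chosen only for illustration; representing an n-gram as a tuple of word strings is likewise an assumption, not a storage format taken from this specification.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class NGramParts(NamedTuple):
    """Hypothetical container for the portions of an n-gram named in the text."""
    context: Tuple[str, ...]             # all words preceding the last word
    future_word: str                     # the last word of the n-gram
    left_word: Optional[str]             # leftmost word of the context, if any
    right_context: Tuple[str, ...]       # all words following the first word
    backed_off_context: Tuple[str, ...]  # context less its leftmost word


def split_ngram(ngram: Sequence[str]) -> NGramParts:
    """Split an n-gram (a sequence of words) into its named portions."""
    words = tuple(ngram)
    if not words:
        raise ValueError("an n-gram needs at least one word")
    context = words[:-1]
    return NGramParts(
        context=context,
        future_word=words[-1],
        left_word=context[0] if context else None,
        right_context=words[1:],
        backed_off_context=context[1:],
    )


parts = split_ngram("the black sheep".split())
print(parts)
```

For the 3-gram “the black sheep”, the sketch yields the context (“the”, “black”), the future word “sheep”, the left word “the”, the right context (“black”, “sheep”), and the backed off context (“black”,), matching the example in the text.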
The probability according to the n-gram language model that a particular string will occur can be determined using the chain rule. The chain rule determines a probability of a string as a product of individual probabilities. Thus for a given string “e1, e2, . . . , ek”, the probability for the string, p(e1, e2, . . . ek), is equal to:
$$\prod_{i=1}^{k} p(e_i \mid e_1, \ldots, e_{i-1})$$
The n-gram language model can be limited to a particular maximum size n-gram, e.g., limited to 1-grams, 2-grams, 3-grams, etc. For example, for a given string “NASA officials say they hope,” where the maximum n-gram order is limited to 3-grams, the probability for the string can be determined as a product of conditional probabilities as follows: p(NASA officials say they hope)=p(NASA)×p(officials|NASA)×p(say|NASA officials)×p(they|officials say)×p(hope|say they). This can be generalized to:
$$p(e_1, \ldots, e_k) = \prod_{i=1}^{k} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})$$
where n is the order of the largest n-gram allowed in the language model.
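As a minimal sketch of the truncated chain rule, the following Python code lists the (word, context) factors for a string when each context is limited to the n−1 preceding words, and multiplies the corresponding conditional probabilities. The function names `chain_rule_factors` and `string_probability`, and the `cond_prob` callable, are hypothetical; how the conditional probabilities are actually obtained is left open here.

```python
from typing import Callable, List, Sequence, Tuple

Factor = Tuple[str, Tuple[str, ...]]  # (word, truncated context)


def chain_rule_factors(words: Sequence[str], n: int) -> List[Factor]:
    """Return one (word, context) pair per word, keeping at most n-1 context words."""
    if n < 1:
        raise ValueError("maximum n-gram order must be at least 1")
    factors = []
    for i, word in enumerate(words):
        start = max(0, i - (n - 1))  # drop words older than n-1 positions back
        factors.append((word, tuple(words[start:i])))
    return factors


def string_probability(
    words: Sequence[str],
    n: int,
    cond_prob: Callable[[str, Tuple[str, ...]], float],
) -> float:
    """Multiply p(word | context) over all factors of the string."""
    prob = 1.0
    for word, context in chain_rule_factors(words, n):
        prob *= cond_prob(word, context)
    return prob


factors = chain_rule_factors("NASA officials say they hope".split(), n=3)
for word, context in factors:
    print(f"p({word} | {' '.join(context)})")
```

With n = 3, the factors correspond to p(NASA), p(officials|NASA), p(say|NASA officials), p(they|officials say), and p(hope|say they), the same product given in the example above.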
Sentences are considered independently, bounded by sentence beginning and end markers, <s> and </s> respectively. The sentence independence constraint translates to predicting the first words in the sentence from the boundary symbols, but not from words in the previous sentence. In the example above, NASA is predicted using P(NASA|<s> </s>). The end of sentence marker is predicted as well, so that the model defines a proper probability distribution.
The conditional probabilities are generally determined empirically, according to relative frequencies in a corpus of text. In the example above, the probability of the word “say” given the context of “NASA officials” is given by:
$$p(\text{say} \mid \text{NASA officials}) = \frac{f(\text{NASA officials say})}{f(\text{NASA officials})},$$
where f(NASA officials say) is a frequency or a count of the occurrences of the string “NASA officials say” in the corpus. Conditional probabilities for strings within the maximum n-gram order in the n-gram language model correspond to the probability stored in the language model for the n-gram, e.g., p(say|NASA officials) is the conditional probability stored in the language model for the 3-gram entry “NASA officials say”. A back off weight can optionally be determined for n-grams having an order less than the maximum order. The back off weight (“BOW”) is a factor applied to estimate the probability for a particular n-gram when it is not found in the model.
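The relative-frequency estimate can be sketched in Python as follows. The three-sentence corpus is invented purely for illustration, and the helper names `count_ngrams` and `relative_frequency` are hypothetical. The sketch covers only the count ratio; it does not apply smoothing or back off weights.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple


def count_ngrams(sentences: Iterable[Sequence[str]], max_order: int) -> Counter:
    """Count every n-gram of order 1..max_order within each sentence."""
    counts: Counter = Counter()
    for sentence in sentences:
        for order in range(1, max_order + 1):
            for i in range(len(sentence) - order + 1):
                counts[tuple(sentence[i:i + order])] += 1
    return counts


def relative_frequency(
    counts: Counter, context: Sequence[str], word: str
) -> Optional[float]:
    """Return f(context word) / f(context), or None if the context is unseen."""
    context_key: Tuple[str, ...] = tuple(context)
    denominator = counts[context_key]
    if denominator == 0:
        return None  # unseen context; a full model would fall back to a shorter one
    return counts[context_key + (word,)] / denominator


# Toy corpus, assumed only for this example.
corpus = [
    "NASA officials say they hope".split(),
    "NASA officials say little".split(),
    "NASA officials announced plans".split(),
]
counts = count_ngrams(corpus, max_order=3)
p_say = relative_frequency(counts, ["NASA", "officials"], "say")
print(p_say)
```

In this toy corpus, “NASA officials” occurs three times and “NASA officials say” occurs twice, so the estimate for p(say|NASA officials) is 2/3.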