The present invention relates to language models. In particular, the present invention relates to compressing language models for efficient storage.
Language models are used in a variety of applications including automatic translation systems, speech recognition, handwriting recognition, and speech and text generation. In general, language models provide a likelihood of seeing a sequence of words in a language. One common language model is an n-gram language model that provides probabilities for observing n words in a sequence.
As n increases, the likelihood of observing particular sequences of words in a training text decreases. For example, it is much more likely that a sequence of two words, AB, will be found in a training text than a sequence of five words, ABCDE. When a particular sequence of words is not observed in the training text or is not observed often in the training text, the corresponding probability trained for the sequence will not provide an accurate reflection of the likelihood of the sequence in the language. For example, simply because a sequence of words such as “fly over Miami” does not appear in the training text does not mean that this sequence has zero probability of appearing in the English language.
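The sparseness effect described above can be illustrated with a minimal sketch that estimates n-gram probabilities from raw counts. The toy training text and the helper names `ngram_counts` and `mle_prob` are assumptions made for illustration only; they are not part of any system described here.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the training text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, ngram):
    """Relative-frequency estimate of the last word of ngram given the words before it."""
    n = len(ngram)
    numerator = ngram_counts(tokens, n)[tuple(ngram)]
    if n == 1:
        denominator = len(tokens)
    else:
        denominator = ngram_counts(tokens, n - 1)[tuple(ngram[:-1])]
    return numerator / denominator if denominator else 0.0

# Hypothetical toy training text.
training = "we fly over Boston and we fly over Denver".split()

p_seen = mle_prob(training, ("fly", "over", "Boston"))
p_unseen = mle_prob(training, ("fly", "over", "Miami"))
```

In this sketch the unseen trigram receives a count-based estimate of zero even though it is a plausible English sequence, which is the problem that back-off schemes address.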
To overcome this data sparseness problem, the prior art has developed a back-off scheme in which n-gram models for sequences of words that are not observed often are replaced with (n−1)-gram models multiplied by a back-off weight, α, that is dependent upon the preceding n−1 words before the current word. For example, if trigram models are used in which n=3, the back-off probability would be a bigram probability that provides the probability of the current word given a preceding word and the back-off weight α would be dependent on the preceding word and the second preceding word.
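A back-off lookup of this kind might be sketched as follows for the trigram case. The table contents are invented toy values, the function name `backoff_prob` is hypothetical, and treating a missing back-off weight as 1.0 is an assumption of this sketch rather than something stated above. Keys list the preceding word first and the second preceding word second, mirroring the notation P(wi|wi−1wi−2).

```python
# Toy tables (assumed values). w1 is the preceding word, w2 the second preceding word.
trigram_p = {("over", "fly", "Boston"): 0.2}                    # keyed (w1, w2, w)
bigram_p = {("over", "Miami"): 0.05, ("over", "Boston"): 0.1}   # keyed (w1, w)
unigram_p = {"Miami": 0.001, "Boston": 0.002, "over": 0.01, "fly": 0.005}
alpha2 = {("over", "fly"): 0.4}   # back-off weight for the context (w1, w2)
alpha1 = {"over": 0.5}            # back-off weight for the context w1

def backoff_prob(w, w1, w2):
    """Approximate P(w | w1 w2), backing off to shorter contexts when needed."""
    if (w1, w2, w) in trigram_p:
        return trigram_p[(w1, w2, w)]
    # Trigram not stored: use the bigram scaled by the trigram-context weight.
    a = alpha2.get((w1, w2), 1.0)   # assumption: missing weight treated as 1.0
    if (w1, w) in bigram_p:
        return a * bigram_p[(w1, w)]
    # Bigram not stored either: back off once more to the unigram.
    return a * alpha1.get(w1, 1.0) * unigram_p.get(w, 0.0)

p_miami = backoff_prob("Miami", "over", "fly")
```

Here "fly over Miami" has no stored trigram, so the sketch returns the bigram probability of "Miami" after "over" multiplied by the weight for the context "fly over".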
For such models, a model probability must be stored for each level of n-gram from 1 to N, where N is the longest sequence of words supported by the language model. In addition, a back-off weight must be stored for each context in each n-gram. For example, for a trigram model with back-off weights, the following parameters must be stored: (1) P(wi|wi−1wi−2), (2) P(wi|wi−1), (3) P(wi), (4) α(wi−1wi−2) and (5) α(wi−1).
To reduce the amount of storage space needed, one system of the prior art stored some of the probabilities and back-off weights together in the same record. This reduces the amount of space required to index the values. In particular, for two words wa and wb, the bigram probability P(wa|wb) would be stored with the back-off weight α(wawb). This can be done because the record can be uniquely identified by word identifiers wa and wb. Similarly, the probability P(wa) can be stored with back-off weight α(wa). Thus, a unigram probability is stored with a bigram back-off weight and a bigram probability is stored with a trigram back-off weight.
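The indexing saving from shared records can be sketched roughly as below. The `Record` type, the integer word identifiers, and the toy values are all hypothetical; the sketch only counts index entries and does not model any particular on-disk layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # hypothetical integer word identifiers (wa, wb)

@dataclass
class Record:
    prob: Optional[float]     # bigram probability P(wa|wb), if stored
    backoff: Optional[float]  # back-off weight alpha(wa wb), if stored

def index_entries_separate(probs: Dict[Key, float], alphas: Dict[Key, float]) -> int:
    """Index entries needed when probabilities and weights live in separate tables."""
    return len(probs) + len(alphas)

def combine(probs: Dict[Key, float], alphas: Dict[Key, float]) -> Dict[Key, Record]:
    """Store both values for a key in one record, so each key is indexed once."""
    keys = set(probs) | set(alphas)
    return {k: Record(probs.get(k), alphas.get(k)) for k in keys}

# Toy values (assumed).
probs = {(1, 2): 0.25, (1, 3): 0.10}
alphas = {(1, 2): 0.7, (1, 3): 0.9, (2, 3): 0.8}

records = combine(probs, alphas)
separate_count = index_entries_separate(probs, alphas)
combined_count = len(records)
```

With these toy tables the separate layout indexes five entries while the combined layout indexes three, because the pair of word identifiers serves as the key for both values.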
Recently, a new type of language model known as a predictive clustering language model, which is a variant of asymmetric clustering models, has been developed. In general, an asymmetric clustering model can be parameterized as follows with trigram approximation:
P(wi|h)=P(ci|ci−1ci−2)×P(wi|ci−1ci−2ci)  EQ. 1
where wi denotes a word, h denotes a context, and ci denotes the cluster that wi belongs to. Asymmetric models use different clusters for predicted and conditional words, respectively. As an extreme case, in the predictive model, each conditional word is regarded as a cluster, so the model can be formulated as:
P(wi|h)=P(ci|wi−1wi−2)×P(wi|wi−1wi−2ci)  EQ. 2
From Equation 2, it can be seen that the parameter space is larger than that of a word-based language model and there are two sub-models in the predictive clustering model. One is the cluster sub-model P(ci|wi−1wi−2) and the other is the word sub-model P(wi|wi−1wi−2ci). Each of these sub-models is backed off independently. This creates the following back-off approximations:
P(ci|wi−1wi−2)=P(ci|wi−1)×α(wi−1wi−2)  EQ. 3
P(ci|wi−1)=P(ci)×α(wi−1)  EQ. 4
P(wi|wi−1wi−2ci)=P(wi|wi−1ci)×α(wi−1wi−2ci)  EQ. 5
P(wi|wi−1ci)=P(wi|ci)×α(wi−1ci)  EQ. 6
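One way to picture how Equations 2 through 6 fit together is the following sketch, in which the cluster sub-model and the word sub-model each back off through their own tables before the two results are multiplied. All table contents, the cluster name "CITY", and the function names are assumptions made for illustration, and a missing back-off weight is treated as 1.0 purely for simplicity.

```python
# w1 is the preceding word, w2 the second preceding word, c the cluster of w.
cluster_of = {"Miami": "CITY", "Boston": "CITY"}   # hypothetical word-to-cluster map

# Cluster sub-model tables (Equations 3 and 4), toy values.
p_c_tri = {}                          # P(c | w1 w2), keyed (w1, w2, c)
p_c_bi = {("over", "CITY"): 0.3}      # P(c | w1),    keyed (w1, c)
p_c_uni = {"CITY": 0.05}              # P(c)
a_ww = {("over", "fly"): 0.5}         # alpha(w1 w2)
a_w = {"over": 0.4}                   # alpha(w1)

# Word sub-model tables (Equations 5 and 6), toy values.
p_w_tri = {}                          # P(w | w1 w2 c), keyed (w1, w2, c, w)
p_w_bi = {}                           # P(w | w1 c),    keyed (w1, c, w)
p_w_uni = {("CITY", "Miami"): 0.1}    # P(w | c),       keyed (c, w)
a_wwc = {("over", "fly", "CITY"): 0.8}   # alpha(w1 w2 c)
a_wc = {("over", "CITY"): 0.6}           # alpha(w1 c)

def cluster_prob(c, w1, w2):
    """P(c | w1 w2), backed off per Equations 3 and 4."""
    if (w1, w2, c) in p_c_tri:
        return p_c_tri[(w1, w2, c)]
    a = a_ww.get((w1, w2), 1.0)
    if (w1, c) in p_c_bi:
        return a * p_c_bi[(w1, c)]
    return a * a_w.get(w1, 1.0) * p_c_uni.get(c, 0.0)

def word_prob(w, c, w1, w2):
    """P(w | w1 w2 c), backed off per Equations 5 and 6."""
    if (w1, w2, c, w) in p_w_tri:
        return p_w_tri[(w1, w2, c, w)]
    a = a_wwc.get((w1, w2, c), 1.0)
    if (w1, c, w) in p_w_bi:
        return a * p_w_bi[(w1, c, w)]
    return a * a_wc.get((w1, c), 1.0) * p_w_uni.get((c, w), 0.0)

def predictive_prob(w, w1, w2):
    """Equation 2: the product of the cluster sub-model and the word sub-model."""
    c = cluster_of[w]
    return cluster_prob(c, w1, w2) * word_prob(w, c, w1, w2)

p = predictive_prob("Miami", "over", "fly")
```

In this toy run the cluster sub-model backs off once, to P(ci|wi−1), while the word sub-model backs off twice, to P(wi|ci), which shows that the two sub-models can stop at different back-off levels for the same word sequence.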
From Equations 3 and 4, it can be seen that there are five types of parameters in the cluster sub-model that need to be stored. These include (1) P(ci|wi−1wi−2), (2) P(ci|wi−1), (3) α(wi−1wi−2), (4) P(ci) and (5) α(wi−1). From Equations 5 and 6, there are also five types of parameters that must be stored for the word sub-model. These include: (1) P(wi|wi−1wi−2ci), (2) P(wi|wi−1ci), (3) α(wi−1wi−2ci), (4) P(wi|ci) and (5) α(wi−1ci).
It is not possible to combine these parameters in the same way in which the word-based model parameters were combined. Specifically, the bigram probability P(ci|wi−1) cannot be stored with the trigram back-off weight α(wi−1wi−2) because they require different indexing keys. Similarly, P(ci) cannot be stored with α(wi−1). In the word sub-model, back-off weight α(wi−1wi−2ci) and probability P(wi|wi−1wi−2ci) cannot be stored together nor can back-off weight α(wi−1ci) and P(wi|ci) because they have different indexing keys.
Traditionally, four trees have been used to store these parameters, with one tree for each sub-model used to store probability parameters and one tree for each sub-model used to store back-off weights. The deficiency of this approach is that each tree has its own separate indexing data structure. As a result, the indexing data structure is duplicated across the multiple tree structures, creating an overall model size that is much larger than that of the word-based model given the same training corpus. Thus, a system is needed to improve the efficiency of storing predictive clustering language models.