The present invention relates to language models. In particular, the present invention relates to training language models for specific domains.
Language models provide a measure of the likelihood of a series of words appearing in a string of text. Such models are used in speech recognition, character segmentation, and pinyin-to-character conversion to identify a most likely sequence of words given a lattice of possible sequences. For example, in speech recognition, a language model would identify the phrase “go to bed” as being more likely than the phonetically similar phrase “go too bed”.
Since most current techniques of language modeling are statistically based, a language model is usually trained on text that is similar to the text that the language model will decode. If the training data and the user data are too different, the probabilities used by the language model will not accurately reflect the likelihood of the text in the user data. In particular, if the user data is focused on a specific domain and that domain is not well represented in the training data, the language model probabilities will not be reliable for most of the text in the user data. This causes the language model to give preference to a sequence of words that does not accurately match the user data.
To overcome this problem, the prior art has built task-specific language models that are trained only on data from the task-specific domain. Such data, however, is always insufficient to train a reliable language model. During decoding, therefore, the task-specific model usually works with a general language model: both models provide probabilities for certain sequences of words, and the resulting probabilities are linearly combined by applying a weight to the task-specific language model. This technique was thought to shift the probability of the general language model toward the probability provided by the task-specific language model.
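The linear combination described above can be illustrated with a minimal Python sketch. The text says only that a weight is applied to the task-specific model. Giving the general model the complementary weight (1 - weight), so that the result remains a probability, is an assumption made here. The function name `interpolate` and its parameter names are hypothetical.

```python
def interpolate(p_general: float, p_task: float, weight: float) -> float:
    """Linearly combine the probabilities two language models assign to the same n-gram.

    `weight` is applied to the task-specific model. Assumption: the general
    model receives the complementary weight (1 - weight).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_task + (1.0 - weight) * p_general
```

With a weight of 0 the combined probability equals that of the general model, and with a weight of 1 it equals that of the task-specific model. Intermediate weights shift the result from the former toward the latter.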
To set the weights for the linear combination of probabilities, the prior art adjusted the weights to minimize perplexity, which is defined as:
$$PP = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_{i-1})} \qquad \text{EQ. 1}$$

where PP is the perplexity, N is the number of words in a test document, and P(w_i | w_{i−1}) is the probability of an n-gram (in this case the probability of the ith word given the word before the ith word, called a bigram probability). In general, the perplexity can be thought of as the geometric mean of the branching factor of the test document.
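As an illustration of EQ. 1 and of the weight-setting procedure it supports, the following Python sketch computes bigram perplexity and then chooses an interpolation weight by a simple grid search. Several details are assumptions not stated in the text: the logarithm is taken base 2 to match the base of the exponent, the first word is conditioned on a hypothetical start token `"<s>"`, the general model receives the weight (1 - weight), and grid search stands in for whatever optimization the prior art actually used. The names `perplexity` and `tune_weight` are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

# A bigram model is represented as a function (word, previous_word) -> probability.
BigramProb = Callable[[str, str], float]


def perplexity(words: Sequence[str], bigram_prob: BigramProb) -> float:
    """Compute PP = 2 ** (-(1/N) * sum(log2 P(w_i | w_{i-1}))) over a test document."""
    n = len(words)
    if n == 0:
        raise ValueError("test document must contain at least one word")
    log_sum = 0.0
    prev = "<s>"  # assumed start-of-text token for the first word's context
    for word in words:
        p = bigram_prob(word, prev)
        if p <= 0.0:
            return math.inf  # a zero-probability bigram makes perplexity unbounded
        log_sum += math.log2(p)
        prev = word
    return 2.0 ** (-log_sum / n)


def tune_weight(
    words: Sequence[str],
    p_general: BigramProb,
    p_task: BigramProb,
    steps: int = 10,
) -> Tuple[float, float]:
    """Return (weight, perplexity) for the grid weight that minimizes perplexity.

    Assumption: the combined probability is
    weight * p_task + (1 - weight) * p_general.
    """
    best_weight, best_pp = 0.0, math.inf
    for k in range(steps + 1):
        weight = k / steps

        def mixed(word: str, prev: str, w: float = weight) -> float:
            return w * p_task(word, prev) + (1.0 - w) * p_general(word, prev)

        pp = perplexity(words, mixed)
        if pp < best_pp:
            best_weight, best_pp = weight, pp
    return best_weight, best_pp
```

For a model that assigns every bigram a probability of 0.25, `perplexity` returns 4, which matches the reading of perplexity as an average branching factor: the model is, in effect, choosing among four equally likely words at each position.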
However, systems that combine the probabilities provided by a task-specific model and a general model have not provided a significant reduction in the error rate associated with task-specific data. One reason for this is that the perplexity does not correlate well with error rate when language model probabilities are linearly combined. Thus, a language model is needed that can perform well on task-specific words even when there is only a limited amount of task-specific training data available.