The present invention relates to automated language systems. More specifically, the present invention relates to language models in statistical language systems.
Automated language systems include speech recognition, handwriting recognition, speech production, grammar checking and machine translation.
Machine translation (MT) systems are systems that receive an input in one language (a “source” language), translate the input to a second language (a “target” language), and provide an output in the second language.
One example of an MT system uses logical forms (LFs) as an intermediate step in translation. An LF is a dependency graph that describes labeled dependencies among content words in a string. Under this system, a string in the source language is first analyzed with a natural language parser to produce a source LF. The source LF must then be converted into a target language LF. A database of mappings from source language LF pieces to target language LF pieces (along with other metadata, such as sizes of mappings and frequencies of mappings in some training sets) is used for this conversion. All mappings whose source language LF pieces are a sub-graph of the source LF are first retrieved. Typically, the source language LF piece of a single mapping does not cover the entire source LF. As a result, a set of mappings (possibly overlapping) must be selected and their target language LF pieces must be combined to form a complete target LF.
To identify the set of target language LF pieces, an MT system uses a greedy search algorithm to select a combination of mappings from the possible mappings whose source language LF pieces match the source LF. This greedy search begins by sorting the mappings by size, frequency, and other features that measure how well the source language LF pieces of the mappings match the source LF. The sorted list is then traversed in a top-down manner, and the first set of compatible mappings found that covers the source LF is chosen. This heuristic system, however, does not test all possible combinations of input mappings, but simply selects the first set of mappings that completely covers the source LF.
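The greedy selection described above can be sketched in Python as follows. This is an illustrative sketch only, not the system's actual implementation: the `Mapping` class, the `greedy_select` function, the reduction of LF pieces to sets of node labels, and the caller-supplied `compatible` predicate are all assumptions introduced here. The example data and the disjointness test used for compatibility are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical stand-in for one database mapping."""
    source_nodes: frozenset  # source LF nodes matched by the source LF piece
    target_piece: str        # placeholder for the target language LF piece
    frequency: int           # frequency of the mapping in training data


def greedy_select(source_lf_nodes, mappings, compatible):
    """Return the first compatible set of mappings that covers the source LF.

    Mappings are sorted by size, then frequency (largest first), and the
    sorted list is traversed once. No other combinations are tested.
    Returns None if the traversal ends without full coverage.
    """
    ranked = sorted(mappings,
                    key=lambda m: (len(m.source_nodes), m.frequency),
                    reverse=True)
    chosen, covered = [], set()
    for m in ranked:
        if not m.source_nodes <= source_lf_nodes:
            continue  # source piece does not match the source LF
        if m.source_nodes <= covered:
            continue  # contributes no new coverage
        if all(compatible(m, c) for c in chosen):
            chosen.append(m)
            covered |= m.source_nodes
        if covered == source_lf_nodes:
            return chosen
    return None


# Invented example: nodes of a source LF and four candidate mappings.
source_lf = frozenset({"hit", "John", "ball"})
candidates = [
    Mapping(frozenset({"John"}), "John piece", 10),
    Mapping(frozenset({"hit", "John"}), "hit-John piece", 3),
    Mapping(frozenset({"ball"}), "ball piece", 8),
    Mapping(frozenset({"hit", "ball"}), "hit-ball piece", 5),
]


def disjoint(a, b):
    # Assumption for this sketch: mappings are compatible if they do not overlap.
    return a.source_nodes.isdisjoint(b.source_nodes)


selected = greedy_select(source_lf, candidates, disjoint)
print([m.target_piece for m in selected])  # ['hit-ball piece', 'John piece']
```

Because the traversal stops at the first covering set, a combination such as the "hit-John" and "ball" pieces is never evaluated, which illustrates the limitation noted above.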
After the set of mappings is selected, the target language LF pieces of the mappings are combined in a manner consistent with the source LF to produce a target LF. Finally, running a natural language generation system on the target LF produces the target language output.
However, MT systems do not always employ logical forms or other parsed structures as intermediate representations. Nor do they necessarily use heuristic methods to resolve translation ambiguities. Some other MT systems try to predict the most likely target language string given an input string in the source language using statistical models. Such MT systems use traditional statistical frameworks and models, such as the noisy-channel framework, to decode and find the target sentence T that is the most probable translation for a given source sentence S. Maximizing this probability is represented by:
$$T = \underset{T'}{\arg\max}\; P(T' \mid S) \qquad \text{(Equation 1)}$$
where T′ ranges over sentences in the target language. By using Bayes' Rule, maximizing this probability can also be represented by:
$$T = \underset{T'}{\arg\max}\; P(S \mid T') \times P(T') \qquad \text{(Equation 2)}$$
where P(S|T′) is the probability of the source string S given a target language string T′ and P(T′) is the probability of the target language string T′. In string-based statistical MT (MT where no parsed intermediate representation is used), a target language model trained on monolingual target language data is used to compute an estimate of P(T′), and alignment models of varying complexity are used to compute an estimate of P(S|T′).
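The noisy-channel decision rule can be illustrated with a short Python sketch. It assumes a small, explicit list of candidate target strings, whereas a real decoder must search a far larger space. The function names, the French example, and all probability values are invented for illustration.

```python
import math


def decode(source, candidates, channel_logprob, lm_logprob):
    """Return the candidate T' that maximizes log P(S|T') + log P(T')."""
    return max(candidates,
               key=lambda t: channel_logprob(source, t) + lm_logprob(t))


# Made-up channel model P(S|T') and target language model P(T').
CHANNEL = {
    ("la maison", "the house"): 0.6,
    ("la maison", "house the"): 0.6,
    ("la maison", "the home"): 0.3,
}
LM = {"the house": 0.05, "house the": 0.0001, "the home": 0.02}


def channel_logprob(s, t):
    # Unseen pairs get a tiny floor probability (an assumption of this sketch).
    return math.log(CHANNEL.get((s, t), 1e-12))


def lm_logprob(t):
    return math.log(LM.get(t, 1e-12))


best = decode("la maison", list(LM), channel_logprob, lm_logprob)
print(best)  # the house
```

In this toy data the channel model scores "the house" and "house the" equally, so the target language model alone decides between them in favor of the fluent word order.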
There are a number of problems associated with conventional, string-based statistical MT systems. In particular, the search space (all possible strings in the target language) is quite large. Without restricting this search space, a practical MT system cannot be built because it takes too long to consider all possible translation strings. To address this, many systems use a simplifying assumption that the probabilities of the channel model and the target language model for an entire string can be determined as the product of probabilities of sub-strings within the string. This assumption is only valid as long as the dependencies in the strings and between the strings are limited to the local areas defined by the sub-strings. However, sometimes the best translation for a chunk of source language text is conditioned on elements of the source and target language strings that are relatively far away from the element to be predicted. Since the simplifying assumptions made in string-based statistical MT models are based in large part on string locality, sometimes the conditioning elements are far enough from the element to be predicted that they cannot be taken into account by the models.
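The simplifying assumption can be written out as follows. This is a sketch only: the segmentation of S and T′ into K aligned sub-strings, written here as s_k and t_k, is notation introduced for illustration and is not part of the description above.

```latex
P(S \mid T') \approx \prod_{k=1}^{K} P(s_k \mid t_k),
\qquad
P(T') \approx \prod_{k=1}^{K} P(t_k)
```

Each factor depends only on its own sub-strings, so a dependency that spans two different sub-strings cannot influence either product.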
For example, some string-based statistical MT systems use string n-gram models for their language model (LM). These n-gram models are simple to train, use and optimize. However, n-gram models have some limitations. Although a word can be accurately predicted from one or two of its immediate predecessors, a number of linguistic constructions place highly predictive words sufficiently far from the words they predict that they are excluded from the scope of the string n-gram model. Consider the following active and passive sentences:
1. John hit the ball.
2. The balls were hit by Lucy.
The following trigrams occur in these sentences with the indicated frequencies:
<P> <P> John      1      <P> <P> The       1
<P> John hit      1      <P> The balls     1
John hit the      1      the balls were    1
hit the ball      1      balls were hit    1
the ball <POST>   1      were hit by       1
hit by Lucy       1      by Lucy <POST>    1
where “<P>” is an imaginary token at the beginning of a sentence providing sentence-initial context, and “<POST>” is an imaginary token at the end of a sentence. It should be noted that each of these trigrams occurs only once, even though the event (the hitting of a ball) is the same in both cases.
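The trigram counts above can be reproduced with a few lines of Python. This sketch assumes whitespace tokenization with sentence-final punctuation already removed and casing left as typed. The function name `trigram_counts` is introduced here for illustration.

```python
from collections import Counter


def trigram_counts(sentences, pre="<P>", post="<POST>"):
    """Count word trigrams, padding each sentence with two <P> tokens and one <POST>."""
    counts = Counter()
    for sentence in sentences:
        tokens = [pre, pre] + sentence.split() + [post]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


counts = trigram_counts(["John hit the ball", "The balls were hit by Lucy"])

# Twelve distinct trigrams, each seen exactly once.
print(len(counts), set(counts.values()))  # 12 {1}

# No trigram is shared between the active and the passive sentence.
active = trigram_counts(["John hit the ball"])
passive = trigram_counts(["The balls were hit by Lucy"])
print(set(active) & set(passive))  # set()
```

The empty intersection shows the point made above: the string trigram model gathers no shared evidence from the two sentences, although both describe the same hitting event.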
In another statistical MT system, a syntax structure in the source language is mapped to a string in the target language. Syntax-based models have several advantages over string-based models. In one aspect, syntax-based models can reduce the magnitude of the sparse data problem by normalizing lemmas. In another aspect, syntax-based models can take the syntactic structure of the language into account. Therefore, events that depend on each other are often closer together in a syntax tree than they are in the surface string because the distance to a common parent can be shorter than the distance in the string.
However, even in a syntax-based model, drawbacks remain. The distance between interdependent words can still be too large to be captured by a local model. In addition, similar concepts are expressed by different structures (e.g., active vs. passive voice) and are therefore not modeled together. These drawbacks result in poor training of the model and poor translation performance.