Machine translation (MT) refers to the use of a machine, or more particularly a computer, to produce translations from one natural language to another with or without human intervention. Generally, the translation process includes two steps, namely decoding the meaning of text or speech in a source language and re-encoding the meaning in a target language. Behind this ostensibly simple process, however, are complex operations aimed at extracting and preserving meaning in light of semantic ambiguity, syntactic complexity, and vocabulary differences between languages, among other things. Numerous MT systems exist today that utilize a variety of approaches to produce translations.
Moreover, system combination for machine translation has emerged as a powerful method of combining the strengths of multiple MT systems and achieving results that surpass those of each individual system. Most state-of-the-art system-combination methods are based on constructing a confusion network (CN) from several input translation hypotheses (output of MT systems), and choosing the best output from the CN based on several scoring functions.
The general idea behind confusion-network-based system combination is to combine hypotheses in a representation in which each word position holds a set of candidate words, shown as a column (an alternative representation of a directed acyclic graph), as in the exemplary table below:
TABLE 1
she    bought    the    Jeep    ε
she    buys      the    SUV     ε
she    bought    the    SUV     Jeep
The final output is determined by choosing one word from each column, which can be a real word or the empty word “ε.” In the example above, eight distinct sequences of words can be generated including: “she bought the Jeep” and “she bought the SUV Jeep.” The choice is performed to maximize a scoring function using a set of features and a log-linear model.
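The column-by-column choice described above can be illustrated with a minimal Python sketch. It encodes TABLE 1 as a list of columns and enumerates every output obtainable by picking one word per column. The names `EPS`, `TABLE_1`, and `enumerate_outputs` are illustrative assumptions rather than part of any MT system's interface, and the scoring step is omitted.

```python
from itertools import product

EPS = "ε"  # marker for the empty word

# TABLE 1 stored column by column; each column holds one entry per hypothesis.
TABLE_1 = [
    ["she", "she", "she"],
    ["bought", "buys", "bought"],
    ["the", "the", "the"],
    ["Jeep", "SUV", "SUV"],
    [EPS, EPS, "Jeep"],
]


def enumerate_outputs(columns):
    """Return every distinct word sequence obtainable by picking one word per column."""
    choices = [sorted(set(column)) for column in columns]
    outputs = set()
    for path in product(*choices):
        # Empty words are dropped when the chosen path is written out.
        outputs.add(" ".join(word for word in path if word != EPS))
    return sorted(outputs)


if __name__ == "__main__":
    for sentence in enumerate_outputs(TABLE_1):
        print(sentence)
```

For TABLE 1 the enumeration yields eight sequences, which agrees with the count given in the text: two choices in each of the second, fourth, and fifth columns.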
A confusion network can be viewed as an ordered sequence of columns or, in other words, correspondence sets. Each word from each input hypothesis belongs to one correspondence set. Each correspondence set includes at most one word from each input hypothesis and contributes one of its words (including the possible empty word) to the final output, and final words are output in the order of the correspondence sets. To construct such a representation, two sub-problems need to be solved, namely the alignment problem and the ordering problem. The alignment problem pertains to arranging words from all input hypotheses into correspondence sets, and the ordering problem concerns ordering the correspondence sets. After constructing the CN, there is a third sub-problem, the lexical selection problem, wherein a determination is made as to which word to output from each correspondence set.
Conventionally, construction of the CN is performed as follows. First, a backbone hypothesis is selected, which determines the order of words in the final system output, and guides word-level alignments for construction of columns of possible words at each position. For example, assume that there are three hypotheses: “she bought the Jeep,” “she buys the SUV,” and “she bought the SUV Jeep,” and the second hypothesis, “she buys the SUV” is selected as the backbone. The other two hypotheses are aligned to the backbone such that these alignments are one-to-one, and empty words are inserted, if needed, to make one-to-one alignment possible. Words in the hypotheses are sorted by position of the backbone word they align to and the confusion network is determined, for example, as depicted in TABLE 1.
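The conventional construction can be sketched as follows, assuming the one-to-one alignments to the backbone have already been produced by some alignment model. Each hypothesis word is tagged with a hypothetical slot `(i, k)`: `k = 0` means the word aligns to backbone word `i`, and `k = 1` means it is inserted after backbone word `i`. The function name, the slot encoding, and the particular alignments (chosen so that the result reproduces TABLE 1) are assumptions made for illustration only.

```python
EPS = "ε"  # marker for the empty word


def build_confusion_network(hypotheses, alignments):
    """Arrange hypotheses into rows of a confusion network.

    hypotheses: list of token lists.
    alignments: for each hypothesis, one slot (i, k) per token, where i is the
    backbone position and k is 0 for an aligned word or 1 for a word inserted
    after backbone position i (assumed encoding, one insertion per position).
    """
    # Columns are ordered by backbone position, so the backbone fixes word order.
    slots = sorted({slot for alignment in alignments for slot in alignment})
    rows = []
    for tokens, alignment in zip(hypotheses, alignments):
        word_at = dict(zip(alignment, tokens))
        # A hypothesis with no word in a slot contributes the empty word.
        rows.append([word_at.get(slot, EPS) for slot in slots])
    return rows


hypotheses = [
    "she bought the Jeep".split(),
    "she buys the SUV".split(),  # selected as the backbone
    "she bought the SUV Jeep".split(),
]
alignments = [
    [(0, 0), (1, 0), (2, 0), (3, 0)],          # "Jeep" aligned to "SUV"
    [(0, 0), (1, 0), (2, 0), (3, 0)],          # backbone aligns to itself
    [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)],  # "Jeep" inserted after "SUV"
]
network = build_confusion_network(hypotheses, alignments)
```

With these assumed alignments, `network` holds the three rows of TABLE 1. The sketch takes the alignments as given; computing them is the hard part and is not attempted here.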
The quality of the backbone selection and of the alignments has a large impact on performance, because the word order is determined by the backbone, and the set of possible words at each position is determined by the alignment. Since the space of possible alignments is extremely large, approximate and heuristic techniques have been employed to derive them. In pair-wise alignment models, for example, each hypothesis is aligned to the backbone in turn, with separate processing to combine the multiple alignments. A major problem with such methods is that each hypothesis is aligned to the backbone independently, leading to sub-optimal behavior.
For example, suppose that a state-of-the-art word alignment model is employed for pairs of hypotheses as provided above. If the first hypothesis is aligned to the second hypothesis (the backbone), “Jeep” is likely to align to “SUV” because they express similar content. The third hypothesis is separately aligned to the backbone, and since the alignment is constrained to be one-to-one, “SUV” is aligned to “SUV” and “Jeep” to an empty word, which is inserted after “SUV.” The confusion network represented in TABLE 1 is a result of this process. An undesirable property of this CN is that two instances of the word “Jeep” are in separate columns, and thus they cannot vote to reinforce each other.
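The effect of the split votes can be seen by counting words per column. The sketch below uses plain majority counting as a simplified stand-in for the feature-based scoring mentioned earlier; `column_votes` and the row encoding of TABLE 1 are illustrative assumptions.

```python
from collections import Counter

EPS = "ε"  # marker for the empty word

# TABLE 1 stored row by row, one row per input hypothesis.
TABLE_1_ROWS = [
    ["she", "bought", "the", "Jeep", EPS],
    ["she", "buys", "the", "SUV", EPS],
    ["she", "bought", "the", "SUV", "Jeep"],
]


def column_votes(rows):
    """Count, for each column, how many hypotheses propose each word."""
    return [Counter(column) for column in zip(*rows)]


votes = column_votes(TABLE_1_ROWS)
# Votes for "Jeep" in the fourth and fifth columns, counted separately.
jeep_votes = [votes[3]["Jeep"], votes[4]["Jeep"]]
```

In this encoding “Jeep” receives one vote in each of the last two columns and is outvoted in both, by “SUV” in the fourth column and by the empty word in the fifth, even though two of the three hypotheses contain it.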
Incremental methods have been proposed to relax the independence assumption of pair-wise alignment. Such methods align hypotheses to a partially constructed CN in some order. For example, in such a method the third hypothesis is first aligned to the backbone followed by alignment of the first hypothesis. It is likely that the following CN will be produced as a result of this alignment:
TABLE 2
she    bought    the    ε      Jeep
she    buys      the    SUV    ε
she    bought    the    SUV    Jeep
Here, the two instances of “Jeep” are aligned. However, if the first hypothesis is aligned to the backbone first, the CN provided in TABLE 1 results. Further, if the desired output is “she bought the Jeep SUV,” the output cannot be generated from either confusion network because a re-ordering would be required with respect to the original input hypotheses.
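The re-ordering limitation can be checked directly. The sketch below assumes the row encodings of TABLE 1 and TABLE 2 shown in the code and a hypothetical helper, `can_generate`, which tests whether a target sentence is reachable by picking one word per column in column order.

```python
from itertools import product

EPS = "ε"  # marker for the empty word

TABLE_1_ROWS = [
    ["she", "bought", "the", "Jeep", EPS],
    ["she", "buys", "the", "SUV", EPS],
    ["she", "bought", "the", "SUV", "Jeep"],
]
TABLE_2_ROWS = [
    ["she", "bought", "the", EPS, "Jeep"],
    ["she", "buys", "the", "SUV", EPS],
    ["she", "bought", "the", "SUV", "Jeep"],
]


def can_generate(rows, target):
    """Return True if some one-word-per-column path spells out the target."""
    columns = [set(column) for column in zip(*rows)]
    target_tokens = target.split()
    for path in product(*columns):
        # Compare the path with empty words removed against the target tokens.
        if [word for word in path if word != EPS] == target_tokens:
            return True
    return False


desired = "she bought the Jeep SUV"
reachable = [can_generate(TABLE_1_ROWS, desired), can_generate(TABLE_2_ROWS, desired)]
```

Both entries of `reachable` are `False`: in each network every column that can supply “SUV” precedes or coincides with the only columns that can supply “Jeep” before it, so no path places “Jeep” ahead of “SUV”. By contrast, “she bought the SUV Jeep” is reachable from both networks.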