The following relates to the machine translation arts, statistical machine translation (SMT) arts, and related arts.
Phrase-based statistical machine translation (SMT) systems employ a database of relatively short source language-target language bi-phrases. The SMT system translates a source language text by identifying bi-phrases that include words or phrases of the source language text and optimizing the selection and arrangement of the bi-phrases to generate the target language translation. The optimization is performed with respect to various quality metrics, usually called features. Commonly, these features include local features calculated for individual bi-phrases and alignment features that characterize the arrangement of bi-phrases. An example of a local feature is a frequency metric indicating how often the bi-phrase occurred in a training corpus of source language-target language documents. In general, the higher the frequency in the training corpus, the more probable it is that the bi-phrase is a correct selection for the SMT translation. This probability can be formulated as two conditional probabilities: the probability of the target language phrase given the source language phrase, and vice versa.
Features characterizing the linguistic quality of the translation in the target language are typically formulated using a target language model (target LM), for example an n-gram model estimated over the target language portion of the training corpus. For n=3, as an example, the n-gram model provides a probability for each target language word given the two words that precede it, based on the frequency of occurrence of the corresponding three-word target language sequences in the corpus. Another feature typically used is a distortion feature, which is a metric of deviation of the target language translation from what would be obtained if the bi-phrases were ordered in accord with the ordering of the source language phrases in the source language text. (Colloquially, the distortion feature penalizes target language translations in which the ordering of target language words deviates strongly from the ordering of the corresponding source language words.)
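By way of illustration, a target language model feature of the kind described above can be sketched as a trigram model over target language words. This is a minimal sketch, not the model of any particular SMT system: the function names, the `<s>` and `</s>` padding tokens, and the add-alpha smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word histories over tokenized sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            tri[history + (padded[i],)] += 1
            bi[history] += 1
    return tri, bi

def trigram_log_prob(words, tri, bi, vocab_size, alpha=1.0):
    """Return log p(t) under an add-alpha smoothed trigram model.

    The smoothing scheme is an assumption of this sketch; it only
    prevents unseen trigrams from receiving zero probability.
    """
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        numerator = tri[history + (padded[i],)] + alpha
        denominator = bi[history] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total
```

A word sequence that follows the word order seen in the corpus receives a higher log-probability than the same words in scrambled order, which is the sense in which the feature measures linguistic quality.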
In constructing the translation, another constraint that may be applied is a source word consumption constraint. Typically, this constraint is designed to ensure that each source language word of the text to be translated is used (i.e. consumed) exactly once in generating the target language text. This constraint is difficult to apply because there is (in general) no one-to-one correspondence between source language words and target language words. For example, in translating French into English, the bi-phrase pairing “curieuse” with “quite bizarre” equates one source language word (“curieuse”) to two target language words (“quite bizarre”). The opposite situation can also arise, e.g. in performing English-to-French translation this same bi-phrase translates two source language words (“quite bizarre”) to a single target language word (“curieuse”).
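The source word consumption constraint can be illustrated with a short check. The sketch below assumes, purely for illustration, that each bi-phrase used in a candidate translation is recorded as a half-open span `(start, end)` of source word positions; the function name and this representation are hypothetical.

```python
def consumes_each_source_word_once(source_len, spans):
    """Return True if the spans cover every source position exactly once.

    spans: list of (start, end) half-open source spans, one per bi-phrase
    used in the candidate translation (an assumed representation).
    """
    counts = [0] * source_len
    for start, end in spans:
        if not 0 <= start < end <= source_len:
            return False  # span falls outside the source sentence
        for position in range(start, end):
            counts[position] += 1
    return all(count == 1 for count in counts)
```

Because the check is stated over source positions only, it is unaffected by how many target language words each bi-phrase produces.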
A typical formulation for phrase-based SMT employs a log-linear model of the form:
$$p(t, a \mid s) = \frac{1}{Z_s} \exp \sum_k \lambda_k \, h_k(s, a, t)$$

where the $h_k$ terms are features, that is, functions of the source string s, of the target string t, and of the alignment a, which is a representation of the sequence of biphrases that were used to build t from s. The order of the sequence is defined by reference to the target side, that is, a biphrase b precedes a biphrase b′ in the alignment if and only if the target side of b precedes the target side of b′ in the target language string t. The $\lambda_k$ terms are weights, and $Z_s$ is a normalization factor that guarantees that p(t, a|s) is a proper conditional probability distribution over the pairs (t, a).
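As a rough illustration of the log-linear form, the sketch below scores candidates by the weighted feature sum and normalizes over an explicit candidate list. The feature names and the dictionary representation are assumptions; in particular, the true normalization factor sums over all pairs (t, a) for the source string, whereas this sketch sums only over the finite list it is given.

```python
import math

def loglinear_score(weights, features):
    """Weighted sum of feature values: the exponent of the log-linear model."""
    return sum(weights[name] * value for name, value in features.items())

def normalize(candidates, weights):
    """Convert scores to probabilities over an explicit candidate list.

    Assumption: the normalizer is computed over the given candidates only,
    standing in for the sum over all (t, a) pairs.
    """
    scores = [loglinear_score(weights, features) for features in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    z = sum(math.exp(score - top) for score in scores)
    return [math.exp(score - top) / z for score in scores]
```

Since the normalization factor is the same for every candidate of a given source string, ranking candidates by probability is equivalent to ranking them by the unnormalized weighted sum.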
Local features are those features that are local to biphrases (or, said another way, can be computed based only on the biphrase). Some suitable local features include forward and reverse conditional probability features log p($\tilde{t}$|$\tilde{s}$) and log p($\tilde{s}$|$\tilde{t}$), where $\tilde{s}$ is the source side of the biphrase and $\tilde{t}$ is the target side. The values of these features for a biphrase are suitably estimated on the basis of statistics for that biphrase in a large bilingual training corpus. Another possible local feature is the so-called “phrase penalty” feature, which is equal to 1 for each biphrase in the alignment. Similarly, a “word penalty” feature may be employed which counts the number of words in $\tilde{t}$.
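The local features can be illustrated by relative-frequency estimation from bi-phrase counts. This is a sketch under stated assumptions: the counts in `EXAMPLE_COUNTS` are invented for the example, and unsmoothed relative frequency is only one possible estimator of the forward feature log p(target side | source side) and the reverse feature log p(source side | target side).

```python
import math
from collections import Counter

# Hypothetical extraction counts for (source side, target side) pairs.
EXAMPLE_COUNTS = Counter({
    ("curieuse", "quite bizarre"): 3,
    ("curieuse", "curious"): 1,
    ("étrange", "quite bizarre"): 3,
})

def local_features(pair_counts, source_side, target_side):
    """Compute local features for one bi-phrase from co-occurrence counts."""
    joint = pair_counts.get((source_side, target_side), 0)
    if joint == 0:
        raise ValueError("bi-phrase not observed in the counts")
    source_total = sum(c for (s, _), c in pair_counts.items() if s == source_side)
    target_total = sum(c for (_, t), c in pair_counts.items() if t == target_side)
    return {
        "forward": math.log(joint / source_total),   # log p(target | source)
        "reverse": math.log(joint / target_total),   # log p(source | target)
        "phrase_penalty": 1,                         # one per bi-phrase used
        "word_penalty": len(target_side.split()),    # words on the target side
    }
```

Every value depends only on the bi-phrase itself and on corpus statistics, which is what makes these features local.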
Global features depend on the order in which biphrases appear in the alignment, and cannot be computed based on a biphrase in isolation. One such feature is the language model feature log p(t), which computes the probability of the target string t associated with the translation candidate, typically according to an n-gram language model estimated over a large target language corpus. Another possible global feature is a distortion feature, which measures how much the sequence of biphrases of the candidate translation deviates from the “monotonic” order, namely the order that would be imposed if the target sides of the biphrases were sequenced in the same order as their source sides.
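One common way to quantify deviation from the monotonic order is to sum the source-side jumps between consecutive biphrases, taken in target order. The sketch below assumes that definition and a half-open `(start, end)` source span per biphrase; other distortion measures exist, and the function name is hypothetical.

```python
def distortion(spans_in_target_order):
    """Sum of source-side jump distances between consecutive biphrases.

    spans_in_target_order: (start, end) half-open source spans, listed in
    the order in which their target sides appear in the translation.
    A fully monotonic alignment yields 0.
    """
    previous_end = 0
    total = 0
    for start, end in spans_in_target_order:
        total += abs(start - previous_end)
        previous_end = end
    return total
```

The value cannot be computed from a single biphrase, since each term depends on where the preceding biphrase ended, which is why the feature is global rather than local.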
Design of a particular implementation of the log-linear model p(t, a|s) involves providing a bi-phrase database, selecting the set of features $h_k$, and training the model on a bilingual corpus to optimize the weights $\lambda_k$ and the normalization factor $Z_s$. A decoder then employs the trained model to find a pair (t, a) that maximizes the conditional probability p(t, a|s) for an input source string s, and outputs the corresponding target language translation. An example of a phrase-based SMT system employing this approach is the Moses statistical machine translation system (available at http://www.statmt.org/moses/ last accessed Dec. 21, 2012).
Existing decoders typically employ some variant of a heuristic left-to-right search, that is, they attempt to build a candidate translation (t, a) incrementally, from left to right, extending the current partial translation at each step with a new biphrase, and computing two scores: a score for the known elements of the partial translation so far, and a heuristic estimate of the remaining cost for completing the translation. One such approach uses a form of beam-search, in which several partial candidates are maintained in parallel, and candidates for which the current estimated likelihood p(t, a|s) is too low are pruned in favor of candidates that are more promising.
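The left-to-right beam search can be sketched as follows. This is a deliberately simplified illustration, not the decoder of Moses or of any other system: the phrase table format, the `beam_size` and `distortion_penalty` parameters, and the scoring (phrase log-probabilities minus a source-side jump penalty) are assumptions, and both the language model and the heuristic estimate of the remaining cost are omitted for brevity.

```python
import heapq

def beam_decode(source, phrase_table, beam_size=3, distortion_penalty=0.5):
    """Simplified stack-based beam search over bi-phrase sequences.

    source: list of source words.
    phrase_table: maps a tuple of source words to a list of
        (target string, log-probability) options (an assumed format).
    Returns (score, translation), or None if no full translation exists.
    """
    n = len(source)
    # stacks[k] holds partial translations covering exactly k source words.
    # A hypothesis is (score, coverage bitmask, previous span end, target words).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, 0, ()))
    for k in range(n):
        # Pruning step: only the beam_size best hypotheses are extended.
        beam = heapq.nlargest(beam_size, stacks[k], key=lambda h: h[0])
        for score, coverage, previous_end, target in beam:
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span_mask = ((1 << (j - i)) - 1) << i
                    if coverage & span_mask:
                        continue  # span overlaps already-consumed words
                    options = phrase_table.get(tuple(source[i:j]), [])
                    for target_side, log_prob in options:
                        jump = abs(i - previous_end)
                        new_score = score + log_prob - distortion_penalty * jump
                        stacks[k + (j - i)].append((
                            new_score,
                            coverage | span_mask,
                            j,
                            target + tuple(target_side.split()),
                        ))
    if not stacks[n]:
        return None
    best = max(stacks[n], key=lambda h: h[0])
    return best[0], " ".join(best[3])
```

Because each stack is truncated to `beam_size` hypotheses before extension, a partial translation that scores poorly early on is discarded even if it would have led to the best complete translation.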
These existing decoders have certain disadvantages. For example, because the search tree is pruned during the search to avoid combinatorial explosion, the solution that is found at the end of the search is typically suboptimal. Even if the solution is actually an optimal solution, there is no way to determine this. The suboptimality problem is heightened in the presence of a high-order target language model (LM), because such high-order models make it more difficult to “merge” states during the beam-search, and thus lead to larger state spaces that need to be maintained in memory. The left-to-right processing also leads to decisions taken by the decoder being dependent on a local context, and limits or prevents the use of global features computed from the whole translation candidate.