The exemplary embodiment relates to Machine Translation (MT) and finds particular application in estimating parameters for a machine translation scoring function when there is a lack of in-domain parallel training data.
Statistical Machine Translation (SMT) systems use a translation scoring function for scoring candidate translations of a source language text string, such as a sentence. Parameters of the scoring function are generally trained on a parallel development corpus containing pairs of source and target sentences which are assumed to be translations of each other, in at least the source to target direction. In a phrase-based system, the parameters serve as weights for features of the candidate translation, some of which are derived from a phrase table. The phrase table stores corpus statistics for a set of biphrases found in a parallel training corpus. These statistics include phrasal and lexical probabilities that represent the probability that a given source phrase (or its constituent words, in the case of lexical probability) in a biphrase is translated to the corresponding target phrase, or vice versa. In addition to translation model features that are based on such phrasal and lexical probabilities, the translation scoring function may also incorporate parameters of a language model, which focuses only on the target side probabilities of the translation, and parameters of a reordering model, which takes into account the extent to which the words of the translation are reordered when compared with the order of the aligned words of the source sentence.
For a new source sentence to be translated, the SMT scoring function is used to evaluate candidate translations formed by combining biphrases from the phrase table which cover the source sentence, where each source word is covered by no more than one biphrase. The respective corpus statistics of these biphrases are retrieved from the phrase table and used to compute the corresponding features of the scoring function, each of which aggregates the respective probabilities over the biphrases being used.
The scoring function features are weighted by the scoring function parameters in a log-linear combination to determine an optimal set of the biphrases, from which a translation is generated.
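The log-linear combination described above can be illustrated with a minimal sketch. The phrase table entries, probability values, feature names (`fwd`, `bwd`, `lm`), language model log-probabilities, and weights below are all hypothetical, chosen only for illustration; a practical system would use many more features and a decoder that searches over segmentations rather than a fixed list of candidates.

```python
import math

# Hypothetical phrase table: (source phrase, target phrase) -> phrasal probabilities.
PHRASE_TABLE = {
    ("la maison", "the house"): {"p_t_given_s": 0.8, "p_s_given_t": 0.7},
    ("la maison", "the home"): {"p_t_given_s": 0.2, "p_s_given_t": 0.6},
    ("bleue", "blue"): {"p_t_given_s": 0.9, "p_s_given_t": 0.85},
}


def features(biphrases, lm_logprob):
    """Aggregate log-probabilities over the biphrases used by one candidate."""
    fwd = sum(math.log(PHRASE_TABLE[b]["p_t_given_s"]) for b in biphrases)
    bwd = sum(math.log(PHRASE_TABLE[b]["p_s_given_t"]) for b in biphrases)
    return {"fwd": fwd, "bwd": bwd, "lm": lm_logprob}


def score(feats, weights):
    """Log-linear score: weighted sum of the log-domain feature values."""
    return sum(weights[name] * value for name, value in feats.items())


# Assumed weights (the scoring function parameters); normally these are tuned.
WEIGHTS = {"fwd": 1.0, "bwd": 0.5, "lm": 0.8}

# Two candidate translations of "la maison bleue", each built from biphrases
# that cover every source word exactly once.
candidates = {
    "the blue house": features(
        [("la maison", "the house"), ("bleue", "blue")], lm_logprob=-2.0
    ),
    "the blue home": features(
        [("la maison", "the home"), ("bleue", "blue")], lm_logprob=-3.5
    ),
}

best = max(candidates, key=lambda c: score(candidates[c], WEIGHTS))
print(best)
```

Changing the entries of `WEIGHTS` changes how much each feature contributes to the ranking of candidates.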
One problem which arises in machine translation is that the parameters of the scoring function, and the values of the features themselves, can vary from one domain to another. For example, in a given domain, one feature derived from the phrase table may be a more reliable predictor of translation quality than it is in other domains, and its parameter should therefore give a greater weight to the respective feature of the scoring function. The overall quality of translation is thus dependent, in part, on how suited the phrase table is to the domain of interest, but also on how well the weights of the log-linear combination of various translation features are optimized for that domain. Thus, there is considerable interest in generating machine translation systems that are adapted to the particular domain of the text to be translated.
Optimization of the parameters of the scoring function is thus an integral part of building a high quality translation system. One method which has been used for optimization is Minimum Error Rate Training (MERT) (Och, “Minimum Error Rate Training in Statistical Machine Translation,” Proc. 41st Annual Meeting of the ACL, pp. 160-167 (2003)). In the MERT approach, an optimal weight vector is computed by minimizing the error on a held-out parallel development set. Another approach uses the Margin Infused Relaxed Algorithm (MIRA) (Hasler, et al., “Margin Infused Relaxed Algorithm for Moses,” Prague Bulletin of Mathematical Linguistics, 96:69-78 (2011)). MIRA is an online learning algorithm which scales up to a large number of translation features. For both of these algorithms, a commonly used objective function for optimizing the weights is the BLEU score (Papineni, et al., “BLEU: a Method for Automatic Evaluation of Machine Translation,” Proc. 40th Annual Meeting of the ACL, pp. 311-318 (2002)).
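The idea of tuning weights against a held-out development set can be sketched as follows. This is not Och's line-search procedure or MIRA; it substitutes a plain grid search, and it assumes that each development sentence comes with an n-best list whose candidates already carry a quality value (standing in for a metric such as sentence-level BLEU computed against the reference). The feature values, quality values, and grid are hypothetical.

```python
import itertools

# Hypothetical development set: one n-best list per source sentence.
# Each candidate is (feature vector, quality), where quality stands in for a
# metric computed against the reference translation.
DEV_NBEST = [
    [((-1.0, -4.0), 0.9), ((-3.0, -1.0), 0.3)],
    [((-4.0, -2.0), 0.4), ((-2.0, -5.0), 0.8)],
    [((-1.5, -3.0), 0.7), ((-1.0, -6.0), 0.5)],
]


def model_score(feats, weights):
    """Log-linear model score of one candidate under the given weights."""
    return sum(w * f for w, f in zip(weights, feats))


def dev_quality(weights):
    """Average quality of the candidates the model ranks first."""
    total = 0.0
    for nbest in DEV_NBEST:
        _, quality = max(nbest, key=lambda c: model_score(c[0], weights))
        total += quality
    return total / len(DEV_NBEST)


def tune(grid=(0.25, 0.5, 0.75, 1.0)):
    """Exhaustive search over a small weight grid (a stand-in for MERT)."""
    return max(itertools.product(grid, repeat=2), key=dev_quality)


best_weights = tune()
print(best_weights, dev_quality(best_weights))
```

The sketch makes the dependence on parallel data explicit: the `quality` values can only be computed when reference translations are available for the development sentences.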
Since the parameters of the scoring function tend to be domain dependent, a parallel in-domain corpus, called a development corpus, is often used for estimating the parameters. However, such parallel corpora are not always available for a particular domain of interest and may be costly to produce, as they generally entail the use of human translators.
It has been suggested that an SMT system may be built by using a large target side monolingual corpus to obtain monolingual features and a bilingual dictionary to build a translation model (Klementiev, et al., “Toward statistical machine translation without parallel corpora,” Proc. 13th Conf. of the European Chapter of the Assoc. for Computational Linguistics, pp. 130-140 (April 2012)). However, that approach still used a parallel development set for tuning the parameter estimates using MERT.
It has also been suggested that the parameter estimates tuned for a more general domain can be applied to a more specific domain (Pecina, et al., “Simple and effective parameter tuning for domain adaptation of statistical machine translation,” COLING, pp. 2209-2224 (2012)). This is called cross-domain tuning. However, selection of a general domain and tuning of parameters are not straightforward.
There remains a need for a system and method for estimating parameters of a translation scoring function where only source text is available in the relevant domain.