The work leading to this invention has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755.
The exemplary embodiment relates to Statistical Machine Translation (SMT) and finds particular application in generating a model which is adapted for use in different domains.
Statistical Machine Translation systems use a model which is trained on a parallel corpus containing pairs of source and target sentences which are assumed to be translations of each other, in at least the source-to-target direction. In a phrase-based system, for example, the model stores corpus statistics that are derived for biphrases found in the parallel corpus. These statistics include phrasal and lexical probabilities that represent the probability that a given source phrase/word in the biphrase is translated to the corresponding target phrase/word. For a new source sentence to be translated, the SMT model is used to evaluate candidate biphrases which cover the source sentence, by using the respective corpus statistics to determine an optimal set of the biphrases. To do this, the SMT model employs a translation scoring function which incorporates the lexical and phrasal probabilities as features that are weighted by respective weights. The feature weights are learned on a development corpus of source and target sentences. The translation scoring function may also incorporate a language model as one of the features, which focuses on the target side probabilities of given sequences of words.
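By way of illustration only, a weighted translation scoring function of this kind may be sketched as follows. The sketch assumes a log-linear form, in which each feature is a log-probability multiplied by its weight; this is a common choice but is not stated above. The function names, feature names, weights, and probabilities are all hypothetical.

```python
import math

# Hypothetical feature weights. In practice these would be learned on a
# development corpus; the values here are illustrative only.
WEIGHTS = {"phrasal": 1.0, "lexical": 0.5, "language_model": 0.8}


def score(feature_probs, weights=WEIGHTS):
    """Return the weighted sum of log-probabilities for one candidate.

    Assumes a log-linear scoring function: sum_k weight_k * log(p_k).
    """
    return sum(weights[name] * math.log(prob)
               for name, prob in feature_probs.items())


def best_candidate(candidates, weights=WEIGHTS):
    """Return the candidate translation with the highest score."""
    return max(candidates, key=lambda c: score(candidates[c], weights))


# Hypothetical candidates for one source phrase, each with made-up
# phrasal, lexical, and language model probabilities.
CANDIDATES = {
    "the house": {"phrasal": 0.6, "lexical": 0.5, "language_model": 0.10},
    "the home": {"phrasal": 0.3, "lexical": 0.4, "language_model": 0.05},
}

if __name__ == "__main__":
    for text, feats in CANDIDATES.items():
        print(text, round(score(feats), 3))
    print("best:", best_candidate(CANDIDATES))
```

Because every score is a weighted sum, changing the weights can change which candidate is preferred without altering the underlying corpus statistics.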
The performance of SMT models tends to be very dependent on the congruence between the domain of the corpus that was used to train the model and the domain of the text which is to be translated. For example, SMT models can be trained on large collections of parliamentary proceeding translations, such as the Europarl corpus (see, Koehn, P., “Europarl: A Parallel Corpus for Statistical Machine Translation,” in MT Summit (2005), hereinafter, “Koehn 2005”). While achieving high translation scores (e.g., BLEU scores) on unseen test sentences from the same collection, the systems trained on such a corpus typically underperform when translating text from a different domain, such as news stories or lecture transcriptions.
This problem could be addressed by using a training corpus which focuses on the domain of interest. For example, a corpus of news articles may provide better training samples for machine translation of news articles. However, parallel corpora are generated by human translators, which is time consuming and expensive. Thus, for most domains, a large enough corpus of documents and their translations for training an SMT model is not generally available. Nevertheless, a small in-domain corpus of documents is often accessible.
One solution to this problem is to use the small in-domain corpus in combination with a large out-of-domain corpus to produce translation systems which combine the wide coverage of large out-of-domain corpora with the information on domain-specific translation correspondences contained in the in-domain data. Methods that bring in-domain and out-of-domain training corpora together in an SMT system tend to produce systems that perform better than systems employing exclusively one or the other kind of data. However, such methods also involve the danger of diluting the domain-specific translation correspondences contained in the in-domain corpus with irrelevant out-of-domain ones. Also, when bringing all the training data together, the result may be an incoherent translation model which, while offering wide coverage, does not perform particularly well on any kind of data.
One approach for addressing these issues is to track from which subset of the training data each translation option (e.g., a phrase-pair) was extracted. This information could be used to target the translation of the in-domain data. For example, Matsoukas, et al. (“Discriminative Corpus Weight Estimation for Machine Translation,” Proc. 2009 Conf. on Empirical Methods in Natural Language Processing, pp. 708-717, hereinafter, Matsoukas 2009) introduce sentence-level features which register, for each training sentence-pair, the training corpus collection of origin and the language genre (domain) to which it belongs. Using these features, a perceptron is trained to compute a weight for each sentence-pair, which is used to down-weight the impact during training of translation examples that are not helpful on the test-set domain. Chiang, et al. (“Two Easy Improvements to Lexical Weighting,” Proc. 49th ACL Meeting: Human Language Technologies, pp. 455-460 (2011), hereinafter, Chiang 2011) use similar collection and genre features to distinguish between training sentence-pairs and compute separate lexical translation smoothing features from the data falling under each collection and genre. Tuning on an in-domain development set allows the system to learn a preference for the lexical translation options found in the training examples which are similar in style and genre.
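The effect of per-sentence weights on translation statistics may be illustrated with the following minimal sketch. It is not the method of Matsoukas 2009: the perceptron that computes the weights is omitted, and the weights are simply given as fixed, hypothetical values. The sketch assumes that phrase translation probabilities are estimated by relative frequency, with each extracted phrase-pair counted in proportion to the weight of the sentence-pair it came from.

```python
from collections import defaultdict


def weighted_phrase_probs(extracted, sentence_weights):
    """Estimate p(target | source) from weighted phrase-pair counts.

    extracted: list of (sentence_id, source_phrase, target_phrase) tuples.
    sentence_weights: dict mapping sentence_id to a non-negative weight
        (hypothetical values; in the cited work a trained model computes
        a weight for each sentence-pair).
    """
    pair_mass = defaultdict(float)
    source_mass = defaultdict(float)
    for sent_id, src, tgt in extracted:
        w = sentence_weights[sent_id]
        pair_mass[(src, tgt)] += w
        source_mass[src] += w
    return {
        (src, tgt): mass / source_mass[src]
        for (src, tgt), mass in pair_mass.items()
        if source_mass[src] > 0
    }


# Hypothetical data: sentence 1 is treated as out-of-domain and is
# down-weighted to 0.2; sentences 0 and 2 keep full weight.
EXTRACTED = [
    (0, "bank", "banque"),
    (1, "bank", "rive"),
    (2, "bank", "banque"),
]
SENTENCE_WEIGHTS = {0: 1.0, 1: 0.2, 2: 1.0}

if __name__ == "__main__":
    probs = weighted_phrase_probs(EXTRACTED, SENTENCE_WEIGHTS)
    for pair, p in sorted(probs.items()):
        print(pair, round(p, 3))
```

With uniform weights, the estimate for ("bank", "banque") would be 2/3. Down-weighting sentence 1 shifts probability mass toward the translation seen in the sentences with full weight.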
However, while such systems can yield improvements, they lack flexibility in handling diverse domains.