The exemplary embodiment relates to statistical machine translation and finds particular application in connection with a translation system and method which considers potential ambiguity in the target domain.
Statistical machine translation (SMT) systems are widely used for translation of text from a source language to a target language. The systems often include a model that has been trained on parallel corpora, that is, on pairs of sentences in the source and target languages that are known or expected to be translations of each other. Large corpora are used for training the system to provide good translation performance. For example, text of the European Parliament, which is often translated into several languages, has been widely used as source and target corpora. Such corpora tend to be domain specific. Since the translation models in SMT rely heavily on the data they have been trained on, they may not provide as good performance outside that domain. Thus, for example, a model trained on the Europarl data is likely to provide weaker performance when translating texts in a medical or agricultural domain.
There are several reasons for this. First, some terms specific to the domain may be missing in the training corpus. Second, even if the correct translation is present, the wrong translation may be promoted by the translation model. Thus, the English-French translation (house, assemblee) is more probable (0.5) in the model trained on Europarl than (house, maison) (0.48). Third, the sentence structure of a new domain (e.g., in the case of patents or a travel guide) may be different from the style of the parallel corpus available (e.g., parliamentary speeches). The problem may thus include not only lexical translation adaptation, but also “structure” adaptation.
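The second reason can be illustrated with a minimal sketch. The two probabilities are those quoted above; the table layout and the function name `best_translation` are assumptions made for illustration and do not correspond to any particular SMT toolkit.

```python
# Toy lexical translation table. The two probabilities are the ones quoted
# in the text for a model trained on Europarl; the dictionary layout is a
# hypothetical simplification of a real phrase table.
phrase_table = {
    "house": {"assemblee": 0.50, "maison": 0.48},
}

def best_translation(source_word, table):
    """Return the target word with the highest translation probability."""
    candidates = table[source_word]
    return max(candidates, key=candidates.get)

# Without any in-domain information, the model prefers the parliamentary
# sense of "house", whatever the domain of the text being translated.
chosen = best_translation("house", phrase_table)
print(chosen)
```

Although the two candidates differ by only 0.02, a selection based on the out-of-domain probabilities alone always promotes the same candidate, which is the behavior that domain adaptation seeks to correct.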
While some of these problems could be addressed by having a large training corpus of in-domain parallel text and/or parallel in-domain terminology, this may not be available. The parallel resources are often rare and expensive to produce. Further, the cost of training translation systems for use in many different domains would be high. The domain adaptation of translation models is thus an issue in Statistical Machine Translation.
When a parallel in-domain corpus is available for training an SMT model, several approaches have been proposed for domain adaptation. In the instance weighting approach (see, G. Foster, C. Goutte, and R. Kuhn, “Discriminative instance weighting for domain adaptation in statistical machine translation,” in Proc. 2010 Conf. on Empirical Methods in Natural Language Processing, EMNLP '10, pages 451-459, Association for Computational Linguistics (ACL), 2010), the out-of-domain phrase-pairs are weighted according to their relevance to the target domain. The weighting scheme proposed in Foster is based on the parallel in-domain corpus. A framework similar to that proposed by J. Jiang and C. Zhai (“Instance weighting for domain adaptation in NLP,” in ACL 2007, pages 264-271, 2007) is proposed for adaptation using a source monolingual in-domain corpus. However, this has only been applied to Named Entity Recognition and PoS tagging tasks. While Foster's adaptation operates at the phrase level, other approaches operate at the feature level. Examples of these are mixture models (see, G. Foster and R. Kuhn, “Mixture-model adaptation for SMT,” in Proc. 2nd Workshop on Statistical Machine Translation, WMT'2007, pages 128-135, ACL 2007) and tuning on an in-domain development set (see, P. Pecina, A. Toral, A. Way, V. Papavassiliou, P. Prokopidis, and M. Giagkou, “Towards using web-crawled data for domain adaptation in statistical machine translation,” in Proc. 15th Annual Conf. of the European Assoc. for Machine Translation, pages 297-304, Leuven, Belgium, 2011).
In the case where there is no in-domain parallel data to train a dedicated translation model, one approach has been to create artificial parallel data. It has been suggested that pseudo in-domain data could be selected from a large available out-of-domain corpus (parallel and/or monolingual) using information retrieval, clustering or classification, or cross-entropy methods (see, A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, “Adaptation of the translation model for statistical machine translation based on information retrieval,” in Proc. 10th Conf. of the European Association for Machine Translation (EAMT), Budapest, May 2005; B. Zhao, M. Eck, and S. Vogel, “Language model adaptation for statistical machine translation with structured query models,” in Proc. 20th Intern'l Conf. on Computational Linguistics, COLING '04, ACL 2004; R. Haque, S. K. Naskar, J. V. Genabith, and A. Way, “Experiments on domain adaptation for English-Hindi SMT,” in Proc. of PACLIC 23: the 23rd Pacific Asia Conference on Language, Information and Computation, 2009; J. Xu, Y. Deng, Y. Gao, and H. Ney, “Domain dependent statistical machine translation,” in Machine Translation Summit, Copenhagen, Denmark, 2007; A. Axelrod, X. He, and J. Gao, “Domain adaptation via pseudo in-domain data selection,” in Proc. 2011 Conf. on Empirical Methods in Natural Language Processing, pp. 355-362, Edinburgh, Scotland, 2011). The “pseudo” in-domain corpus is then used in combination with an out-of-domain corpus for creating an adapted translation model. It has been suggested that a translation model trained on the thus-selected “pseudo” in-domain corpus (representing 1% of the whole corpus) might outperform a translation model trained on the whole corpus. This may be due to the lexical ambiguity problem existing in the whole corpus.
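By way of illustration only, cross-entropy based selection of pseudo in-domain data may be sketched as follows. This is a simplified sketch under stated assumptions, not the implementation of any of the cited works: it uses add-one smoothed unigram models rather than full n-gram language models, it trains the out-of-domain model on the pool itself, and the sentences and the names `train_unigram`, `cross_entropy`, and `select_pseudo_in_domain` are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count whitespace tokens; returns (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())

def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (bits) under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)

def select_pseudo_in_domain(pool, in_domain, keep):
    """Rank pool sentences by H_in(s) - H_out(s); keep the lowest-scoring ones."""
    in_model = train_unigram(in_domain)
    out_model = train_unigram(pool)  # assumption: the pool is the out-of-domain corpus
    # Shared vocabulary size, plus one slot reserved for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(out_model[0])) + 1
    ranked = sorted(
        pool,
        key=lambda s: cross_entropy(s, in_model, vocab_size)
        - cross_entropy(s, out_model, vocab_size),
    )
    return ranked[:keep]

# Hypothetical toy data: a small medical in-domain sample and a mixed pool.
in_domain = [
    "the patient received a dose of the drug",
    "the drug reduced the fever of the patient",
]
pool = [
    "the parliament adopted the resolution",
    "the patient was given the drug",
    "the house voted on the budget",
]
selected = select_pseudo_in_domain(pool, in_domain, keep=1)
print(selected)
```

A low score indicates a sentence that the in-domain model finds more predictable than the out-of-domain model does, so the retained sentences form the “pseudo” in-domain subset.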
These approaches, however, do not address the domain adaptation problem directly with an available in-domain monolingual corpus, but rather search for a way to create an artificial in-domain parallel corpus. These methods generally create an in-domain Language Model.
Another approach creates artificial in-domain parallel data by translating source/target monolingual data with a previously trained out-of-domain translation model (see, N. Bertoldi and M. Federico, “Domain adaptation for statistical machine translation with monolingual resources,” in Proc. 4th Workshop on Statistical Machine Translation, pages 182-189, ACL 2009). In that work, however, the in-domain and out-of-domain corpora came from similar domains (United Nations and Europarl), and thus may not be representative of many of the situations faced (e.g., parliamentary speeches vs. medical text).