The exemplary embodiment relates to machine translation and finds particular application in connection with a system and method for terminological adaptation of a statistical machine translation system based on phrasal context.
Statistical machine translation (SMT) systems are widely used for translation of text from a source language to a target language. Such systems often include a translation model that has been trained on parallel corpora, that is, on pairs of sentences in the source and target languages that are known or expected to be translations of each other. From these bi-sentences, phrase pairs are extracted, together with their translation probabilities. Large corpora are used for training the system to provide good translation performance. Such corpora tend to be domain specific, such as the Europarl corpus of translations of text of the European Parliament. Since the translation models in SMT rely heavily on the data they have been trained on, they may not perform as well outside that domain. For example, the French word “pastille” may have several translations, depending on the domain: in medicine, “une pastille” is translated as “a pill,” whereas in the nuclear domain, it is translated as “a pellet.”
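By way of illustration only, the dependence of phrase translation probabilities on the training domain can be sketched as follows. The sketch assumes a simple relative-frequency estimate of p(target | source), which is one common choice and is not stated above; the function name `phrase_table` and the toy phrase-pair counts are hypothetical and are not taken from any real corpus.

```python
from collections import Counter, defaultdict


def phrase_table(phrase_pairs):
    """Estimate p(target | source) by relative frequency (an assumed estimator)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return dict(table)


def best_translation(table, source_phrase):
    """Return the target phrase with the highest estimated probability."""
    candidates = table[source_phrase]
    return max(candidates, key=candidates.get)


# Hypothetical phrase pairs, as if extracted from two domain-specific corpora.
medical_pairs = [("une pastille", "a pill")] * 9 + [("une pastille", "a pellet")] * 1
nuclear_pairs = [("une pastille", "a pellet")] * 8 + [("une pastille", "a pill")] * 2

medical_table = phrase_table(medical_pairs)
nuclear_table = phrase_table(nuclear_pairs)

best_medical = best_translation(medical_table, "une pastille")
best_nuclear = best_translation(nuclear_table, "une pastille")
```

In this toy setting, the same source phrase receives a different preferred translation depending solely on which corpus supplied the phrase pairs, which is the behavior that makes an out-of-domain translation model unreliable.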
One solution is to adapt the SMT system to a domain of interest by adding lexical information to the translation model. For example, methods have been proposed to extract specific terminology from bilingual corpora (which are either parallel or comparable). These approaches aim to build dictionaries that avoid out-of-vocabulary (OOV) words by adding these words to the training corpus, by using the dictionary as a translation memory in addition to the translation model, or by using pre- or post-processing to avoid the OOV problem. See, Nizar Habash, “Four techniques for online handling of out-of-vocabulary words in Arabic-English statistical machine translation,” Proc. ACL 08, HLT Short papers, pp. 57-60 (2008), hereinafter, “Habash 2008”; Pratyush Banerjee, et al., “Domain adaptation in SMT of user-generated forum content guided by OOV word reduction: Normalization and/or supplementary data?” Proc. 16th Ann. Conf. of the European Assoc. for Mach. Translation, pp. 169-176 (2012); and Vassilina Nikoulina, et al., “Hybrid adaptation of named entity recognition for statistical machine translation,” 24th Intl Conf. on Computational Linguistics (COLING 2012), pp. 1-16 (2012). However, adding individual words without their phrasal contexts (that is, individual words as opposed to phrases containing these words) does not make effective use of the capabilities of phrase-based SMT. A system limited to unigram-unigram pairs performs much worse than one containing multigram phrase pairs.
Other methods of domain adaptation of SMT systems focus on training data selection. For example, information retrieval approaches have been used to extract the parts of the corpus that are the most relevant to a domain (Matthias Eck, et al., “Language model adaptation for statistical machine translation based on information retrieval,” Proc. Intl Conf. on Language Resources and Evaluation (LREC), pp. 327-330 (2004)). Cross-entropy has also been used to select the most relevant parts of the training data (Robert C. Moore, et al., “Intelligent selection of language model training data,” Proc. ACL (Short Papers), pp. 220-224 (2010), hereinafter, “Moore 2010”; Amittai Axelrod, et al., “Domain Adaptation via Pseudo In-Domain Data Selection,” Proc. Conf. on Empirical Methods in Natural Language Processing, pp. 355-362 (2011), hereinafter, “Axelrod 2011”). In another approach, training data is selected according to a specific terminology by selecting the bi-sentences that contain only the specific terminology, without modifying the training process (Raivis Skadiņš, “Application of online terminology services in statistical machine translation,” Proc. XIV Machine Translation Summit, pp. 281-286 (2013)). However, these approaches tend to result in loss of data. In Axelrod's approach, for example, the data selection is a hard selection, i.e., the method simply removes whatever is not considered to be in-domain. Using this kind of approach can create a large number of out-of-vocabulary (OOV) words, even when translating in-domain data. This is particularly problematic when there is a relatively small amount of training data for the selected language pair to start with.
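The data loss caused by hard selection can be illustrated with a minimal sketch of cross-entropy-based selection in the spirit of Moore 2010. The sketch is not the implementation of any cited work. It assumes add-one-smoothed unigram language models rather than the higher-order models used in practice, a hypothetical threshold of zero on the cross-entropy difference, and invented toy sentences; all function and variable names are illustrative.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words; returns (counts, total tokens) as a toy unigram model."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy in bits, with add-one smoothing (an assumption)."""
    counts, total = model
    words = sentence.split()
    log_prob = sum(math.log2((counts[w] + 1) / (total + vocab_size)) for w in words)
    return -log_prob / len(words)


def oov_words(sentence, training_sentences):
    """Words of `sentence` never seen in `training_sentences`."""
    vocab = {w for s in training_sentences for w in s.split()}
    return [w for w in sentence.split() if w not in vocab]


# Hypothetical toy data: a small in-domain sample and a larger general pool.
in_domain = ["the fuel pellet is sintered", "the reactor core holds fuel"]
general_pool = [
    "the parliament adopted the report",
    "the fuel pellet cladding is inspected",
    "the committee holds a debate",
    "the reactor vessel holds coolant",
]

in_model = train_unigram(in_domain)
general_model = train_unigram(general_pool)
vocab_size = len(set(in_model[0]) | set(general_model[0]))

THRESHOLD = 0.0  # hypothetical cut-off on H_in(s) - H_general(s)
selected = [
    s for s in general_pool
    if cross_entropy(s, in_model, vocab_size)
    - cross_entropy(s, general_model, vocab_size) < THRESHOLD
]

# Hard selection discards every sentence that is not selected, together
# with all of the vocabulary those sentences contained.
new_input = "the reactor vessel is inspected"
oov_full_pool = oov_words(new_input, general_pool)
oov_after_selection = oov_words(new_input, selected)
```

In this toy setting, a sentence that is plainly relevant to the domain is discarded because it scores above the threshold, so words that the full pool covered become OOV for an in-domain input once only the selected subset is used for training.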
There remains a need for a system and method for terminological adaptation of a machine translation system which addresses these problems.