The exemplary embodiment relates to phrase-based statistical machine translation (SMT) and finds particular application in connection with a system and method for generating a phrase table for a target domain where there is a lack of parallel data for generating the phrase table for the specific target domain.
Statistical machine translation systems use a translation scoring function for scoring candidate translations of a source language text string, such as a sentence. Parameters of the scoring function are generally trained on a parallel development corpus containing pairs of source and target sentences which are assumed to be translations of each other, in at least the source to target direction. In a phrase-based system, the parameters serve as weights for features of the candidate translation, some of which are derived from a phrase table. The phrase table stores corpus statistics for a set of biphrases found in a parallel training corpus. These statistics include phrasal and lexical probabilities that represent the probability that a given source phrase (or its constituent words, in the case of lexical probability) in a biphrase is translated to the corresponding target phrase, or vice versa. In addition to translation model features that are based on such phrasal and lexical probabilities, the translation scoring function may also incorporate parameters of a language model, which focuses only on the target side probabilities of the translation, and parameters of a reordering model, which takes into account the extent to which the words of the translation are reordered when compared with the order of the aligned words of the source sentence. For a new source sentence to be translated, the SMT scoring function is used to evaluate candidate translations formed by combining biphrases from the phrase table which cover the source sentence, where each source word is covered by no more than one biphrase. The respective corpus statistics of these biphrases are retrieved from the phrase table, and corresponding features of the scoring function, which aggregate the probabilities for each of the biphrases being used, are computed based thereon.
The scoring function features are weighted by the scoring function parameters in a log-linear combination to determine an optimal set of the biphrases, from which a translation is generated.
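By way of illustration only, the log-linear combination described above may be sketched as follows. The feature names, the weight values, and the helper functions are hypothetical and are not taken from any particular SMT system; the sketch assumes that each biphrase contributes a forward and a reverse phrasal probability and that a language model log-probability is supplied for each candidate.

```python
import math

# Hypothetical feature weights (the scoring function parameters).
WEIGHTS = {"phrase_fwd": 1.0, "phrase_rev": 0.5, "lm": 0.8}


def candidate_features(biphrases, lm_logprob):
    """Aggregate per-biphrase probabilities into log-domain features.

    biphrases: list of (p_target_given_source, p_source_given_target) pairs,
    one pair per biphrase used to cover the source sentence.
    """
    return {
        "phrase_fwd": sum(math.log(p_fwd) for p_fwd, _ in biphrases),
        "phrase_rev": sum(math.log(p_rev) for _, p_rev in biphrases),
        "lm": lm_logprob,
    }


def log_linear_score(features, weights=WEIGHTS):
    """Weighted sum of log-domain feature values for one candidate."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights=WEIGHTS):
    """Return the key of the candidate with the highest log-linear score."""
    return max(candidates, key=lambda k: log_linear_score(candidates[k], weights))


# Two illustrative candidates with made-up probabilities.
candidates = {
    "A": candidate_features([(0.6, 0.5), (0.4, 0.3)], lm_logprob=-4.0),
    "B": candidate_features([(0.2, 0.1), (0.3, 0.2)], lm_logprob=-3.5),
}
best = best_candidate(candidates)
```

In this sketch, candidate "A" is selected because its higher phrasal probabilities outweigh its slightly lower language model score under the assumed weights. A practical decoder would search over biphrase segmentations rather than enumerate a fixed set of candidates.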
One problem which arises in machine translation is that the values of the phrase table features, and also the parameters of the translation scoring function, can vary from one domain to another. The overall quality of translation is thus dependent, in part, on how well suited the phrase table is to the domain of interest. Accordingly, there is considerable interest in generating machine translation systems that are adapted to the particular domain of the text to be translated.
To provide broad coverage, SMT systems are often trained on a large corpus of documents which may not be well suited to the particular domain of interest. For example, generic SMT systems may be trained on the Europarl corpus of European Parliament proceedings, which may make it more likely that the word “bank” in English is translated in its financial sense, rather than as the bank of a river, which would be more appropriate for translations in the agricultural science domain. It is therefore often desirable to tailor a machine translation system to a specific domain of interest, which is known as domain adaptation. One approach for tailoring an MT system to a specific domain is to train a domain-adapted multi-model that combines a set of trained phrase tables from various domains. However, this approach requires parallel training data in the domain of interest, and in many cases there may be insufficient training data in the specific domain. For example, some source language documents may be made available, but since translations are costly to produce, corresponding target language documents may be unavailable. The approach is also computationally intensive if there is a large library of phrase tables of various domains from which to choose.
Several metrics have been used to compute similarity between domains, such as Cross Entropy (Rico Sennrich, “Perplexity minimization for translation model domain adaptation in statistical machine translation,” Proc. 13th Conf. of the European Chapter of the Association for Computational Linguistics (EACL '12), pp. 539-549 (2012), hereinafter, “Sennrich 2012”), but this method requires a parallel in-domain corpus. The metric Source LM perplexity can also be used as a measure to score and rank translation models. The Source LM perplexity measure requires only a mono-lingual corpus for computation of similarity with a source domain. However, it assumes the existence of a library of source language models (LMs).
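By way of illustration only, ranking a library of source language models by their perplexity on a mono-lingual in-domain sample may be sketched as follows. The sketch assumes unigram language models with add-one smoothing, which is a simplification chosen for brevity and is not the model used in the cited work; the function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(tokens):
    """Return (token counts, total token count) for a unigram model."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(lm, tokens, vocab_size):
    """Perplexity of `tokens` under an add-one smoothed unigram model."""
    counts, total = lm
    log_prob = 0.0
    for tok in tokens:
        # Counter returns 0 for unseen tokens, so smoothing covers them.
        log_prob += math.log((counts[tok] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(tokens))


def rank_models(library, in_domain_tokens):
    """Sort model names from lowest (most similar) to highest perplexity."""
    vocab = set(in_domain_tokens)
    for counts, _ in library.values():
        vocab.update(counts)
    size = len(vocab)
    return sorted(
        library, key=lambda name: perplexity(library[name], in_domain_tokens, size)
    )


# Toy library of source language models, one per domain (made-up data).
library = {
    "finance": train_unigram_lm("the bank raised interest rates on the loan".split()),
    "agriculture": train_unigram_lm(
        "the river bank soil was planted with crops".split()
    ),
}
in_domain = "crops grow in soil near the river".split()
ranking = rank_models(library, in_domain)
```

Here the agriculture model is ranked first because it assigns a higher probability, and hence a lower perplexity, to the in-domain sample. Only source-side text is needed for the ranking, but a language model must already exist for each entry in the library.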
There remains a need for a system and method for retrieving, using only a mono-lingual source corpus, a subset of phrase tables that are similar to the domain of interest, where the retrieved subset can be used to build a multi-model in a time-efficient manner.