The present exemplary embodiment is directed to the field of machine translation. It finds particular application in connection with the use of dynamic bi-phrases in phrase-based statistical machine translation systems.
Phrase-based statistical machine translation (SMT) systems employ a bi-phrase table or “dictionary” as a central resource. This is a probabilistic dictionary associating short sequences of words in two languages. The bi-phrase table is automatically extracted, at training time, from a large bilingual corpus of aligned source and target sentences. When translating from a source to a target language, the bi-phrase table is accessed to retrieve a set of bi-phrases, each of which includes a source phrase which matches part of a source sentence or other text string to be decoded, together with a corresponding target phrase. The retrieved bi-phrases are input to a scoring model, which outputs an optimal translation of the source sentence using a subset of the retrieved bi-phrases. Typically, the scoring model attempts to maximize a log-linear combination of features associated with the bi-phrases entering the combination.
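By way of illustration only, the retrieval and scoring steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular SMT system: the toy table, the feature names `p_fwd` and `p_rev`, the weights, and the helper functions `retrieve_biphrases` and `loglinear_score` are all hypothetical and chosen only for the example.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[str, float]
Match = Tuple[int, int, str, Features]

# Hypothetical toy bi-phrase table: source phrase -> [(target phrase, features)].
# The feature values are made-up probabilities, for illustration only.
BIPHRASE_TABLE: Dict[Tuple[str, ...], List[Tuple[str, Features]]] = {
    ("le",): [("the", {"p_fwd": 0.7, "p_rev": 0.6})],
    ("chat",): [("cat", {"p_fwd": 0.8, "p_rev": 0.9})],
    ("le", "chat"): [("the cat", {"p_fwd": 0.6, "p_rev": 0.7})],
}

# Hypothetical feature weights of the log-linear model.
WEIGHTS: Dict[str, float] = {"p_fwd": 1.0, "p_rev": 0.5}


def retrieve_biphrases(source_tokens: List[str],
                       table: Dict[Tuple[str, ...], List[Tuple[str, Features]]],
                       max_len: int = 3) -> List[Match]:
    """Collect every table entry whose source side matches a span of the sentence."""
    matches: List[Match] = []
    n = len(source_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            key = tuple(source_tokens[start:end])
            for target, feats in table.get(key, []):
                matches.append((start, end, target, feats))
    return matches


def loglinear_score(biphrases: List[Match], weights: Dict[str, float]) -> float:
    """Weighted sum of log feature values over the bi-phrases used in a translation."""
    return sum(weights[name] * math.log(value)
               for _, _, _, feats in biphrases
               for name, value in feats.items())


matches = retrieve_biphrases(["le", "chat"], BIPHRASE_TABLE)
# Two candidate derivations covering the whole sentence:
one_phrase = [m for m in matches if (m[0], m[1]) == (0, 2)]
two_phrases = [m for m in matches if m[1] - m[0] == 1]
best = max([one_phrase, two_phrases], key=lambda d: loglinear_score(d, WEIGHTS))
```

A real decoder searches over many more derivations and typically includes further features, such as a language model score and a reordering penalty, which are omitted here for brevity.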
Currently, phrase-based SMT systems rely on a bi-phrase table that is static, i.e., computed once at training time, together with the associated feature values to be used by the scoring model. A decoder uses a subset of the bi-phrases to generate a translation of an input source sentence into a target sentence. For example, at decoding time, the decoder is initialized with a sub-table of this static table consisting of those bi-phrases that are relevant for the translation of the specific source sentence. The static table poses a problem for the translation of sentences that include words which appear infrequently in the training corpus, or not at all. Often, such a word is handled by replacing it with a placeholder during decoding and then replacing the placeholder with the untranslated word in the translated target sentence. A static table also poses a problem where different systems of units are concerned. For example, while in France, prices for liquid volumes may be expressed in Euros or cents per liter, in the US, prices may be expressed in dollars per US pint or US gallon. When such quantities are carried over unconverted, readers of the translation have difficulty in placing the information in context. There are other instances where it would be advantageous to create a bi-phrase at translation time which differs from those in the static bi-phrase table.
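To make the units example concrete, the following sketch shows how a source/target phrase pair that cannot be stored in a static table might be computed at translation time. It is an illustrative assumption only: the regular expression, the function name `make_dynamic_biphrase`, and the output wording are hypothetical, and only the volume unit is converted. Currency conversion is deliberately left out, since it would require an exchange rate that is not given here.

```python
import re
from typing import Optional, Tuple

# One US liquid gallon is defined as exactly 3.785411784 liters.
LITERS_PER_US_GALLON = 3.785411784

# Hypothetical pattern for French price expressions such as "1,50 euros par litre".
PRICE_PER_LITER = re.compile(r"(\d+(?:,\d+)?) euros par litre")


def make_dynamic_biphrase(source_text: str) -> Optional[Tuple[str, str]]:
    """Return a (source phrase, target phrase) pair built at translation time,
    or None if the text contains no price-per-liter expression."""
    match = PRICE_PER_LITER.search(source_text)
    if match is None:
        return None
    # French uses a decimal comma; convert it before parsing the number.
    price_per_liter = float(match.group(1).replace(",", "."))
    price_per_gallon = price_per_liter * LITERS_PER_US_GALLON
    target = f"{price_per_gallon:.2f} euros per US gallon"
    return match.group(0), target


biphrase = make_dynamic_biphrase("Le lait coûte 1,50 euros par litre.")
```

Because the numeric value varies from sentence to sentence, a pair of this kind cannot be enumerated in advance in a table computed at training time.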