The present exemplary embodiment is directed to the field of machine translation. It finds particular application in the translation of non-contiguous bi-fragments of text.
A recent development in statistical machine translation has been the move from word-based models to phrase-based models. In traditional word-based statistical models, the atomic unit on which translation operates is the word. Phrase-based methods, by contrast, acknowledge the significant role played in language by multi-word expressions, and thus incorporate, in a statistical framework, the insight behind Example-Based Machine Translation. Example-Based Machine Translation seeks to exploit a number of knowledge resources, such as linguistics and statistics, and symbolic and numerical techniques, by integrating them into one framework. In this way, rule-based morphological, syntactic and/or semantic information is combined with knowledge extracted from bilingual texts, and the combined knowledge is then re-used in the translation process.
Many recent natural language translation methods operate on the basis of bi-fragments: these are pairs of equivalent fragments of text, one in the source language (the language in which a document to be translated is expressed) and one in the target language (the language into which the document is to be translated). Such methods are often collectively referred to as “phrase-based methods”. The bi-fragments on which they operate are harvested automatically from large collections of previously translated texts (“bilingual parallel corpora”), and stored in a database. When given a new segment of text to translate, these systems search the database to extract all relevant bi-fragments, i.e., items in the database whose source-language fragment matches some portion of the new input. The system then searches for a subset of these matching bi-fragments such that each word of the input text is covered by exactly one bi-fragment in the subset, and such that the combination of the target-language fragments produces a coherent translation.
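By way of illustration only, the database lookup and covering step described above can be sketched as follows. This is a minimal sketch under simplifying assumptions: the bi-fragments are contiguous, the target fragments are concatenated in source order, and no check of target-side coherence is made. The toy database `BI_FRAGMENTS` and the function names `matching_fragments` and `exact_covers` are hypothetical and are introduced here only for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy database: source fragment (tuple of words) -> target fragments.
BI_FRAGMENTS: Dict[Tuple[str, ...], List[str]] = {
    ("je",): ["I"],
    ("ne", "sais", "pas"): ["do not know"],
    ("sais",): ["know"],
    ("je", "ne", "sais", "pas"): ["I don't know"],
}


def matching_fragments(words: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, target) for every database entry whose
    source side matches the contiguous span words[start:end]."""
    matches = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            for target in BI_FRAGMENTS.get(tuple(words[start:end]), []):
                matches.append((start, end, target))
    return matches


def exact_covers(words: List[str]) -> List[str]:
    """Enumerate every subset of matches that covers each input word
    exactly once, joining the target fragments in source order."""
    matches = matching_fragments(words)
    results: List[str] = []

    def extend(pos: int, chosen: List[str]) -> None:
        if pos == len(words):  # every word is covered exactly once
            results.append(" ".join(chosen))
            return
        for start, end, target in matches:
            if start == pos:  # the fragment must begin at the first uncovered word
                extend(end, chosen + [target])

    extend(0, [])
    return results


print(exact_covers("je ne sais pas".split()))
```

For the input "je ne sais pas", the sketch finds two covers: one built from the fragments for "je" and "ne sais pas", and one that uses the single four-word fragment. The match for "sais" alone is retrieved from the database but cannot take part in any cover, because no entry covers "ne" or "pas" on its own.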
In general, phrase-based models proposed so far only deal with multi-word units that are sequences of contiguous words on both the source and the target side.
In many translation systems, the quality of the resulting translation is assessed by means of a statistical translation model, which estimates the probability of observing some target-language segment of text as the translation of the given source-language input. The translation problem then reduces to that of finding the combination of bi-fragments which produces the most probable translation. This is a complex task, because the number of possible translations typically grows exponentially with the size of the input, and so not all solutions can be examined in practice. Sub-optimal search procedures that rely on dynamic programming, A*-like beam search, or heuristic hill-climbing methods are therefore usually employed.
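As an illustration of one such sub-optimal procedure, a beam search over combinations of bi-fragments can be sketched as follows. This is a toy sketch, not the search procedure of any particular system: it assumes contiguous bi-fragments, left-to-right coverage of the input, and a model in which the probability of a translation is simply the product of per-fragment probabilities. The table `SCORED`, its probabilities, and the function `beam_translate` are hypothetical and serve only the example.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical scored database: source fragment -> [(target fragment, probability)].
SCORED: Dict[Tuple[str, ...], List[Tuple[str, float]]] = {
    ("je",): [("I", 0.9)],
    ("ne", "sais", "pas"): [("do not know", 0.6), ("don't know", 0.3)],
    ("je", "ne", "sais", "pas"): [("I have no idea", 0.2)],
}


def beam_translate(words: List[str], beam_width: int = 2) -> Optional[Tuple[str, float]]:
    """Return (translation, log probability) of the best hypothesis found,
    or None if the input cannot be fully covered."""
    n = len(words)
    # stacks[i] holds partial hypotheses covering the first i words,
    # each stored as (log probability, list of target fragments).
    stacks: List[List[Tuple[float, List[str]]]] = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, []))

    for pos in range(n):
        # Pruning: only the beam_width best hypotheses at this position
        # are expanded, so the search may miss the optimal solution.
        stacks[pos].sort(key=lambda hyp: hyp[0], reverse=True)
        for log_prob, output in stacks[pos][:beam_width]:
            for end in range(pos + 1, n + 1):
                for target, prob in SCORED.get(tuple(words[pos:end]), []):
                    stacks[end].append((log_prob + math.log(prob), output + [target]))

    if not stacks[n]:
        return None
    best_log_prob, best_output = max(stacks[n], key=lambda hyp: hyp[0])
    return " ".join(best_output), best_log_prob


print(beam_translate("je ne sais pas".split()))
```

Because each stack is pruned to `beam_width` hypotheses before expansion, the work performed grows with the beam width rather than with the full number of possible combinations, at the cost of possibly discarding a partial hypothesis that would have led to the most probable translation.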