Many techniques for the automatic translation of one natural language into another (Machine Translation, MT) are based on the use of a repository of existing bilingual texts, that is, texts together with their translations by humans into one or more other languages. Such techniques either learn or induce translation rules from these texts automatically, as in approaches such as Statistical MT (SMT), or treat them as apposite examples, fragments of which can be assembled into new translations (Example-Based MT or EBMT).
One approach to EBMT assembles a translation by first finding the single best-matching bilingual example, as in [US Patent Application 20060004560, Method and apparatus for translation based on a repository of existing translations] and [Sumita, 2003, in Recent Advances in Example-Based Machine Translation, M. Carl and A. Way (eds.), Kluwer Academic]. The input sentence is approximately matched against the source side of the example. The result of this matching is an alignment between input and example which includes sub-alignments between stretches which are identical (matched stretches) and sub-alignments between stretches which are not identical (unmatched stretches). The translations of the unmatched stretches in the target side of the example may then be replaced by the translations of the unmatched stretches in the input. The example thus acts as a template which is known to be well-formed and disambiguated, and which can serve as a substantial basis for the construction of the remainder of the translation. The success of this technique, however, depends on being able to find in the repository of existing translations the example whose source side is most similar to the input. The prior art assumes that similarity can be adequately defined in terms of the source language alone. However, there are many instances where similar expressions in the source language translate very differently into another language.
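The matching step described above can be illustrated with a minimal sketch. It is not the method of either cited reference: it assumes whitespace tokenisation, uses the generic sequence matcher from Python's standard `difflib` module as a stand-in for approximate matching, and operates on a toy two-entry English-German repository invented for illustration. The names `REPOSITORY`, `best_example` and `align` are hypothetical.

```python
from difflib import SequenceMatcher

# Toy repository of (source, target) pairs; illustrative data only.
REPOSITORY = [
    ("the meeting starts at midnight", "das Treffen beginnt um Mitternacht"),
    ("the train arrives at noon", "der Zug kommt um Mittag an"),
]

def best_example(input_tokens, repository):
    """Pick the example whose SOURCE side is most similar to the input.

    Similarity here is defined on the source language alone, which is
    exactly the assumption of the prior art discussed in the text.
    """
    def score(example):
        source_tokens = example[0].split()
        return SequenceMatcher(a=input_tokens, b=source_tokens,
                               autojunk=False).ratio()
    return max(repository, key=score)

def align(input_tokens, source_tokens):
    """Split the input/example pair into matched and unmatched stretches."""
    matcher = SequenceMatcher(a=input_tokens, b=source_tokens, autojunk=False)
    stretches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        kind = "matched" if tag == "equal" else "unmatched"
        stretches.append((kind, input_tokens[i1:i2], source_tokens[j1:j2]))
    return stretches

input_tokens = "the meeting starts at noon".split()
example = best_example(input_tokens, REPOSITORY)
alignment = align(input_tokens, example[0].split())
for kind, input_stretch, example_stretch in alignment:
    print(kind, input_stretch, example_stretch)
```

In this sketch the first example is selected, and the alignment consists of one matched stretch ("the meeting starts at") and one unmatched stretch ("noon" in the input against "midnight" in the example). Because the score is computed on source tokens only, nothing in it reflects how the competing stretches translate, which is the limitation noted above.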
The open literature contains many examples of EBMT systems which exploit thesaurus information in order to determine the similarity of input sentences and stored examples. [Sumita, 2003, op. cit.] shows that a monolingual thesaurus can be inadequate when matching input to examples, even when the input and the source side of the example are both syntactically and semantically close. This is because their translations can differ substantially in the target language. Sumita's solution is to refine the thesaurus or to add examples. However, the thesaurus remains essentially monolingual and is refined according to the particular examples in the database, rather than on the basis of word similarity in the target language.
Having found the best match, Sumita's method relies on a bilingual dictionary of unambiguous single-word translations to substitute for the unmatched parts. It cannot disambiguate dictionary entries, nor does it allow entries that contain several words or collocations.
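The consequence of such a dictionary can be shown with a short sketch. It is a hypothetical illustration of single-word substitution in general, not a reconstruction of Sumita's implementation; the dictionary contents, the function name `substitute` and the English-German data are assumptions made for the example.

```python
# Hypothetical dictionary: one source word maps to exactly one target word.
DICTIONARY = {"noon": "Mittag", "midnight": "Mitternacht"}

def substitute(target_tokens, example_word, input_word, dictionary):
    """Replace the translation of example_word with that of input_word.

    Returns None when either word has no single-word entry, or when the
    translation of example_word cannot be located in the target side.
    """
    old = dictionary.get(example_word)
    new = dictionary.get(input_word)
    if old is None or new is None or old not in target_tokens:
        return None
    return [new if token == old else token for token in target_tokens]

target = "das Treffen beginnt um Mitternacht".split()

# Single-word unmatched stretch: the substitution succeeds.
result = substitute(target, "midnight", "noon", DICTIONARY)
print(result)

# Multi-word unmatched stretch: no entry can exist, so nothing is produced.
failed = substitute(target, "midnight", "half past two", DICTIONARY)
print(failed)
```

Since each key maps to a single translation, a word with several senses can be given only one of them, and a stretch of several words such as "half past two" cannot be looked up at all.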
It is desirable to improve the matching of an input text against a repository of existing translations by detecting those elements in the input text and in the source side of a stored translation which, while superficially similar, have different translations. It is also desirable to use the translations of the unmatched stretches in the target side of the example to assist in determining the scope and sense of the translations of the unmatched stretches in the input.