1. Field of the Invention
The present invention relates to a method and apparatus for translating an input sequence of data items in a first format to an output sequence of data items in a second format. In particular, but not exclusively, the present invention relates to the translation of a sentence in a source language to a sentence in a target language.
2. Description of the Related Art
Various techniques are known within the field of Machine Translation, or Machine Aided Translation, that use a repository of existing translated material to assist or automate the production of translations. A Translation Memory (TM) system has a repository of source language sentences each paired with its associated target language sentence, and operates by locating in the repository a sentence that is very close in structure and content to an input sentence, with the associated target language sentence being presented to a translator for manual post-editing. An Example-Based Machine Translation (EBMT) system attempts fully automatic translation and operates by decomposing an input sentence into fragments, finding a translation for each fragment in the repository and then combining these fragmentary translations into a target sentence.
Translation memory systems are highly accurate but tend to have limited coverage. Differences between the input sentence and the retrieved sentences are typically limited to slight variations in word order, morphological form or spelling. Often no changes are made to the target side of the example pair; it is simply presented to the translator as the best matching sentence.
In more sophisticated TM systems, certain elements in the target example may be replaced by their ‘translations’. However, such elements are limited to “placeables”, as discussed in WO 99/57651. In this context, a placeable is an element such as a name or a number which does not require translation but can be copied or whose format can be simply adjusted to meet target language or locality standards.
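By way of illustration only, the replacement of a numeric placeable might be sketched as follows. This sketch is not taken from WO 99/57651; the function name `substitute_numbers`, the restriction to unformatted integers, and the example sentences are all assumptions made for the purpose of the example.

```python
import re

# Assumption: a "placeable" is simplified here to a run of digits.
NUMBER = re.compile(r"\d+")

def substitute_numbers(input_source, example_source, example_target):
    """Copy the numbers of the input sentence into the stored target sentence.

    Returns None when the input and the stored source sentence do not
    contain the same count of numbers, since no safe mapping exists then.
    Repeated identical numbers in the stored source are not handled.
    """
    new_numbers = NUMBER.findall(input_source)
    old_numbers = NUMBER.findall(example_source)
    if len(new_numbers) != len(old_numbers):
        return None
    # Map each number in the stored source to the number at the same
    # position in the input, then apply that mapping to the target side.
    mapping = dict(zip(old_numbers, new_numbers))
    return NUMBER.sub(lambda m: mapping.get(m.group(), m.group()), example_target)

print(substitute_numbers("Order 42 units", "Order 17 units", "Commandez 17 unités"))
```

The number is copied rather than translated, which is what distinguishes a placeable from an element that requires a context-dependent translation.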
EBMT systems have much wider coverage, but lower accuracy. This is because, like other techniques for fully automatic translation, they depend on the incorporation in the system of large quantities of linguistic or statistical knowledge, which is difficult to collect and encode in an exhaustive manner. Such knowledge is necessary in an EBMT system to enable the decomposition of an input sentence into coherent fragments and the subsequent combination of the translated fragments into a sentence which is well-formed according to the grammar of the target language.
The Machine Aided Translation systems mentioned above make use of well known techniques for indexing and matching of source language inputs against the source language side of examples in the repository, and alignment of the words between source and target language sides of examples.
Techniques for matching are disclosed in GR 1002453 “Intelligent device for retrieving multilingual texts”, which describes the use of edit distance, and U.S. Pat. No. 6,161,083 “Example-based translation method and system which calculates word similarity degrees, a priori probability, and transformation probability to determine the best example for translation”. Two-stage schemes are described in “Example-Based Machine Translation in the Pangloss System”, Brown, R. D., Proceedings of the 16th Coling, Copenhagen, 1996; U.S. 2003/0125928 “Method for retrieving similar sentence in translation aid system”; and U.S. 2004/0002849 “System and method for automatic retrieval of example sentences based upon weighted editing distance”. In these schemes, a first stage based on standard information retrieval techniques determines a small set of examples, which are then subjected, in a second stage, to a more expensive similarity computation based on edit distance or a similar measure. Other indexing techniques are disclosed in: U.S. Pat. No. 5,724,593 “Machine assisted translation tools”, which describes the use of character n-grams for indexing; and U.S. Pat. No. 6,473,729 “Word phrase translation using a phrase index”.
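A two-stage scheme of the general kind described above might be sketched as follows. This is a minimal illustration and not the method of any cited reference: the first stage is assumed to be a simple shared-word count standing in for an information retrieval index, the second stage is an unweighted word-level edit distance, and the names `edit_distance`, `retrieve` and `shortlist_size` are hypothetical.

```python
from collections import Counter

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]

def retrieve(query, examples, shortlist_size=2):
    """Return the (source, target) example closest to the query.

    Stage 1 (cheap): keep the examples sharing the most words with the query.
    Stage 2 (expensive): re-rank that shortlist by edit distance.
    """
    q = query.lower().split()
    q_counts = Counter(q)

    def overlap(source):
        return sum((q_counts & Counter(source.lower().split())).values())

    shortlist = sorted(examples, key=lambda ex: overlap(ex[0]), reverse=True)
    shortlist = shortlist[:shortlist_size]
    return min(shortlist, key=lambda ex: edit_distance(q, ex[0].lower().split()))

examples = [
    ("the printer is out of paper", "l'imprimante n'a plus de papier"),
    ("the printer is out of ink", "l'imprimante n'a plus d'encre"),
    ("close the paper tray", "fermez le bac à papier"),
]
print(retrieve("the printer is out of ink again", examples))
```

The point of the two stages is that the quadratic-time edit distance is computed only for the shortlist, not for every example in the repository.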
When one or more matching examples have been found, it is then necessary to determine their possible translations. If a complete example is matched, its translation is just its paired target language string. But if matching is only partial then it is necessary to determine which portions of the source language string are aligned with which portions of the target language string, with each matched portion in one language completely matching a corresponding matched portion in the other language, and each unmatched portion in one language not matching any portion in the other language at all.
Techniques for alignment of words and/or phrases in bilingual sentence pairs are widely described in the literature. U.S. Pat. No. 5,659,765 “Machine Translation System” describes an interface to allow a user to specify such alignments. U.S. Pat. No. 5,907,821 “Method of computer-based automatic extraction of translation pairs of words from a bilingual text” describes a statistical method based on co-occurrence frequencies. U.S. Pat. No. 6,345,244 “System, method, and product for dynamically aligning translations in a translation-memory system” describes a method based on features shared between words in translations. U.S. Pat. No. 6,598,015 “Context based computer-assisted language translation” describes the use of common format information between the pair. U.S. Pat. No. 6,535,842 “Automatic bilingual translation memory system” describes a hierarchical combination of alignments to produce alignments for phrases of all sizes. Alignment may take place during the processing of a given input sentence, or off-line, prior to the processing, as is usually the case. Alignment may also be a two-stage process with an off-line word alignment and on-line alignment of larger phrases as described in US 2004/0002848 “Example based machine translation system”.
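To illustrate what a co-occurrence-based word alignment can look like, the following sketch scores word pairs with the Dice coefficient computed over a small set of sentence pairs. The Dice coefficient is used here as one commonly known co-occurrence measure and is an assumption of this example; it is not asserted to be the measure used in U.S. Pat. No. 5,907,821. The function names and the toy corpus are likewise hypothetical.

```python
from collections import Counter
from itertools import product

def dice_scores(sentence_pairs):
    """Score every co-occurring (source word, target word) pair.

    Dice(s, t) = 2 * count(s and t in the same pair) / (count(s) + count(t)),
    where counts are numbers of sentence pairs containing the word.
    """
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        s_words, t_words = set(source.split()), set(target.split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        co_freq.update(product(s_words, t_words))
    return {(s, t): 2 * c / (src_freq[s] + tgt_freq[t])
            for (s, t), c in co_freq.items()}

def best_translation(word, scores):
    """Return the target word with the highest score for `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

corpus = [
    ("red car", "voiture rouge"),
    ("red house", "maison rouge"),
    ("blue car", "voiture bleue"),
]
scores = dice_scores(corpus)
print(best_translation("red", scores), best_translation("car", scores))
```

Note that the correct pairing is found although the word order differs between the two languages, because the score depends only on co-occurrence and not on position.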
EBMT systems are disclosed in: Sato and Nagao, “Towards Memory-Based Translation” in Proceedings of 13th Coling, Helsinki (1990); Maruyama and Watanabe, “Tree Cover Search Algorithm for EBMT” in Proceedings of 4th TMI, Montreal (1992); U.S. Pat. No. 6,161,083 “Example-based translation method and system which calculates word similarity degrees, a priori probability, and transformation probability to determine the best example for translation”; Brown, R. D., “Example-Based Machine Translation in the Pangloss System” in Proceedings of the 16th Coling, Copenhagen (1996); and US 2004/0002848, amongst others. These systems all use a matching phase and an alignment phase and, in distinction to TM systems, may determine several examples, each of which matches only a fragment of the input. They disclose various approaches to the problems of breaking a sentence into fragments, choosing a best translation of each fragment, and combining the translations of the fragments into a coherent target language text.
There are two main approaches to determining and combining fragments. Generally speaking, in the prior art that is concerned with EBMT between structurally dissimilar languages (i.e. languages with very different word orders) such as English and Japanese (see Sato and Nagao, referenced above; Maruyama and Watanabe; and U.S. Pat. No. 6,161,083), fragmentation and combination are based on a full syntax analysis and tree-structured alignments between the source and target sides of an example. In EBMT between languages with similar word order such as English and French (see the R. D. Brown paper referenced above) or English and Chinese (see US 2004/0002848), the translations of fragments may be combined according to the order in the source language.
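The second approach, combination according to source order, can be illustrated with a very small sketch. It assumes that each translated fragment is tagged with the position at which the corresponding source fragment starts; the function name and the data layout are hypothetical and are not drawn from the cited references.

```python
def combine_in_source_order(fragments):
    """Join fragment translations in the order of their source positions.

    `fragments` is a list of (source_start_index, translation) tuples,
    where source_start_index is the token position of the fragment in
    the input sentence.
    """
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return " ".join(translation for _, translation in ordered)

# Input: "the book is on the table", fragmented as
# "the book" (position 0), "is" (position 2), "on the table" (position 3).
fragments = [(3, "sur la table"), (0, "le livre"), (2, "est")]
print(combine_in_source_order(fragments))
```

Such a scheme is adequate only where source and target word orders largely coincide, which is why the first approach relies instead on syntactic analysis.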
As regards choosing the best translation of each fragment, this is normally assumed to be the alignment in the example that best matches that fragment. In Sato and Nagao (referenced above) and US 2004/0002848, the best example is determined on the basis of similarity between the input and the entire example containing the fragment. The paper by R. D. Brown (referenced above) discloses a method in which “the translation probability is simply the proportion of times each distinct alternative translation was encountered out of all successful alignments for a particular source-language phrase”.
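The proportion described in the quoted passage can be computed as in the following sketch. Only the ratio itself comes from the quotation; the function name, the input representation (one list entry per successful alignment of a given source-language phrase) and the example data are assumptions.

```python
from collections import Counter

def translation_probabilities(aligned_translations):
    """Relative frequency of each distinct translation of one source phrase.

    `aligned_translations` holds one entry per successful alignment of the
    phrase; an empty list yields an empty dictionary.
    """
    counts = Counter(aligned_translations)
    total = sum(counts.values())
    return {translation: n / total for translation, n in counts.items()}

# Hypothetical alignments found for the English word "bank".
print(translation_probabilities(["banque", "banque", "rive", "banque"]))
```

Because the estimate depends only on counts across the repository, it takes no account of the context in which the phrase occurs in the input sentence.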
It is desirable to improve the coverage of a Translation Memory system by extending the range of types of element that may differ between an input sentence and a stored example. It is desirable to allow an input sentence and a stored sentence to differ by any elements which may be substituted one for the other without changing the well-formedness of the sentences involved. If it is necessary to translate substitutable elements then it is desirable to provide a method of choosing between the alternative translations that such elements may have in different contexts. It is also desirable to provide a method in which the contextually correct translation of arbitrary substitutable elements may be determined without the need for extensive linguistic knowledge or deep linguistic analysis.