1. Field of the Invention
The invention generally relates to extracting translations from translated texts, and in particular to extracting sentence translations from translated documents.
2. Description of the Related Art
Translation memories require the alignment of the sentences in a source document with the sentences in a translated version of the same document. These sentence/translation pairs serve as a starting point for human translation when the same sentence appears in a new version of the document that has to be translated again. Alignment at the sentence level is also a prerequisite for the extraction of bilingual and multilingual lexical and terminological information from existing bilingual and multilingual documents.
Several techniques have been developed for identifying the translations of individual sentences in translated documents. These techniques are based either on sentence length criteria or on lexical information.
Length-based approaches are examples of knowledge-poor approaches, which ignore most of the available information except for the sentence length. These approaches have been successfully applied to documents of relatively high quality, such as translations of political and legal texts. While these algorithms are rather simple in structure and work quite fast, they are known to be sensitive to noise, for instance unreliable sentence segmentation due to OCR errors, or translations with long omissions. In particular, the length-based approaches do not work well when sentence boundaries cannot be determined with high reliability. Moreover, because these algorithms are based on straightforward dynamic programming techniques, their time and memory consumption grows with the product of the lengths of the given documents, that is, with the product of the numbers of units to align. Thus, when working on pairs of large documents, their time and memory requirements make them impractical unless very long documents are first manually split into shorter parts before they are given to the alignment algorithm.
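The dynamic programming scheme described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: the function names, the bead types, the penalty values in `BEAD_PENALTY`, and the simplified relative length-mismatch cost are all assumptions chosen for illustration. The sketch fills a table with one cell per pair of sentence positions, which is why time and memory grow with the product of the two document lengths.

```python
from typing import List, Optional, Tuple

# Hypothetical penalties per bead type (source sentences, target sentences).
# 1-1 is free; merges and omissions are penalized. The values are illustrative only.
BEAD_PENALTY = {(1, 1): 0.0, (2, 1): 0.5, (1, 2): 0.5, (1, 0): 1.0, (0, 1): 1.0}

Bead = Tuple[List[int], List[int]]


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative length mismatch in [0, 1]; 0 when both lengths are equal."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align_by_length(src: List[str], tgt: List[str]) -> List[Bead]:
    """Align sentences using only their character lengths."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i source and first j target
    # sentences. The table has (n + 1) * (m + 1) cells.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (di, dj)
    # Trace back from the final cell to recover the chosen beads.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Each returned bead pairs a list of source sentence indices with a list of target sentence indices. Because only lengths are compared, a single wrong sentence boundary shifts the lengths of the neighboring units, which is one way to see the sensitivity to segmentation noise noted above.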
Techniques based on lexical information offer higher quality and more robustness, but at the price of increased computational complexity. These techniques are knowledge-rich approaches that use lexical information to obtain better alignments and at the same time extract lexical information from the texts to be aligned. Most of these approaches increase the accuracy and robustness of the length-based approaches by taking into account some form of lexical information that is either built into the system (such as word similarities to exploit cognates and other invariant strings), acquired from external resources such as dictionaries, or extracted from the data being processed. The use of richer knowledge sources often comes at a considerable cost in efficiency: typical processing speeds are on the order of one second per sentence pair, which is not satisfactory for a system that is supposed to work on very large documents or document collections.
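As an illustration of built-in lexical information, the following sketch scores a sentence pair by the share of source tokens that have a cognate or invariant counterpart in the target sentence. The heuristic is hypothetical and deliberately crude: identical tokens (such as numbers) count as invariant strings, and two words count as cognate candidates when they share their first four characters. The names `tokenize`, `is_cognate`, `lexical_overlap`, and the `prefix_len` parameter are assumptions made for this example and do not describe any specific prior-art system.

```python
import re
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def is_cognate(a: str, b: str, prefix_len: int = 4) -> bool:
    """Hypothetical cognate test: identical tokens or a shared prefix."""
    if a == b:
        return True  # invariant strings such as numbers or names
    if any(ch.isdigit() for ch in a + b):
        return False  # tokens containing digits must match exactly
    return (
        len(a) >= prefix_len
        and len(b) >= prefix_len
        and a[:prefix_len] == b[:prefix_len]
    )


def lexical_overlap(src: str, tgt: str) -> float:
    """Fraction of source tokens with a cognate candidate in the target."""
    src_tokens, tgt_tokens = tokenize(src), tokenize(tgt)
    if not src_tokens or not tgt_tokens:
        return 0.0
    matched = sum(
        1 for s in src_tokens if any(is_cognate(s, t) for t in tgt_tokens)
    )
    return matched / len(src_tokens)
```

A score of this kind could be combined with a length-based cost to favor sentence pairs that share cognates. Note that every source token is compared against every target token, so even this simple measure adds work for each candidate sentence pair, in line with the efficiency cost of richer knowledge sources described above.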