The following relates to the information processing arts, natural language translation arts, document processing and storage arts, and related arts.
Translation of a natural language document from a source language to a target language is presently performed manually or in a semi-automated fashion. In a fully manual approach, a bilingual person who is reasonably fluent in both the source and target languages reads the document written in the source language, and generates (e.g., by typing, voice recognition, or the like) a corresponding electronic translated document that is written in the target language. The fully manual approach is tedious and expensive, especially if the source and/or target language is an uncommon language such that competent bilingual translators are a scarce commodity.
It has been found to be difficult to construct machine translation systems operating on first principles. Most natural languages are highly complex, including features such as idioms (semantic phrases that do not mean what they literally say, e.g. a “figure of speech”), collocations (e.g., specialized word combinations whose meaning is affected by the specific combination), synonyms having fine shades of meaning or subtle connotations, polysemy (words that have more than one possible meaning, with the “correct” meaning typically depending upon context), and so forth.
A tool that has been found to be useful for aiding human translation is the translation memory, which includes a database or storage that stores previously translated source language-target language text segment pairs. A source language text segment to be translated is compared with the translation memory contents to find an already-translated source language text segment that is identical with or similar to the source language text segment under consideration. When an exact or approximate match is found in the translation memory, the corresponding target language text segment is retrieved from the translation memory and presented to the human translator as a proposed translation, for example by inserting the proposed target language text segment into the target language text document being generated by the human translator.
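By way of illustration only, the exact-match portion of such a lookup can be sketched as follows. The class name `TranslationMemory`, its methods, the normalization step (case folding and whitespace collapsing), and the sample English-French pair are all hypothetical choices made for this sketch, and are not taken from any particular translation memory product.

```python
from typing import Dict, Optional


class TranslationMemory:
    """Illustrative store of source-language/target-language segment pairs."""

    def __init__(self) -> None:
        # Maps a normalized source segment to its stored translation.
        self._pairs: Dict[str, str] = {}

    @staticmethod
    def _normalize(segment: str) -> str:
        # Assumption: case and runs of whitespace are ignored when comparing.
        return " ".join(segment.casefold().split())

    def add(self, source: str, target: str) -> None:
        """Store a previously translated segment pair."""
        self._pairs[self._normalize(source)] = target

    def lookup_exact(self, segment: str) -> Optional[str]:
        """Return the stored translation for an identical segment, else None."""
        return self._pairs.get(self._normalize(segment))


tm = TranslationMemory()
tm.add("Press the start button.", "Appuyez sur le bouton de mise en marche.")

# An identical segment (up to case and spacing) yields a proposed translation.
proposal = tm.lookup_exact("press the  start button.")

# A segment that differs by even one word yields no proposal.
no_proposal = tm.lookup_exact("Press the stop button.")
```

In this sketch the retrieved string would be presented to the human translator as a proposal to accept or edit; it is not inserted as a final translation.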
One design parameter of a translation memory system relates to the exactness or fuzziness of the match. If an exact match is found, then it is likely (although not certain) that the human translator will accept the proposed target language text segment as a verbatim or near-verbatim translation. However, exact matches are typically infrequent, and so a translation memory system that requires exact matching tends to provide rather limited assistance to the human translator.
On the other hand, the translation memory system can be configured to accept a “fuzzy” match in which there are some differences between the text segment extracted from the document and a source language text segment stored in the translation memory. For example, words in the text segment extracted from the document may be missing from the stored source language text segment, words in the stored source language text segment may be missing from the extracted text segment, the same words may be ordered slightly differently in the two text segments, and so forth. By allowing some fuzziness in the match, the translation memory system generates more proposed translations and accordingly is more helpful to the human translator. However, as the match fuzziness increases, the likelihood that the human translator will reject the proposed target language text segment, or need to modify it substantially, also increases.
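As a rough illustration of this tradeoff, a fuzzy lookup with an adjustable threshold can be sketched as follows. The word-level similarity score computed with Python's standard `difflib.SequenceMatcher` is merely one possible measure, chosen here for brevity. The function names, the threshold values, and the sample segment pairs are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are equal."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def fuzzy_lookup(
    memory: List[Tuple[str, str]], segment: str, threshold: float
) -> Optional[Tuple[float, str, str]]:
    """Return (score, stored source, stored target) for the best stored
    segment whose score meets the threshold, or None if there is none."""
    best: Optional[Tuple[float, str, str]] = None
    for source, target in memory:
        score = similarity(segment, source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Hypothetical translation memory contents (English to French).
MEMORY = [
    ("Press the start button to begin.",
     "Appuyez sur le bouton de mise en marche pour commencer."),
    ("Close the cover.", "Fermez le couvercle."),
]

query = "Press the stop button to begin."

# A strict threshold rejects the near match, so no proposal is made.
strict = fuzzy_lookup(MEMORY, query, threshold=0.9)

# A looser threshold returns a proposal that the translator must still edit,
# since the stored translation refers to the start button, not the stop button.
loose = fuzzy_lookup(MEMORY, query, threshold=0.8)
```

Here the query differs from the first stored segment in one word out of six, giving a score of about 0.83. Lowering the threshold produces a proposal, but one that still requires human correction.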
In view of these considerations, it is generally considered useful to allow some fuzziness in the matching performed by the translation memory system. However, it is also understood that the fuzziness of the match usually leads to additional work by the human translator in order to correct the proposed (fuzzy) translation. Indeed, some commercial translation services use the fuzzy match level as a metric for estimating translation cost, with higher charges applying to translation jobs for which the translation memory yields less exact matches on average.
It would be useful to reduce the amount of human editing required to “fix” a fuzzy match. Heretofore, such reduction has been achieved by limiting the allowable fuzziness of the match. However, as already discussed, this “solution” causes the translation memory system to identify fewer matches and thus to provide less assistance to the human translator, forcing an undesirable tradeoff between the number of matches and the average amount of human editing per match.