The following relates to natural language translation. It finds particular application in conjunction with efficient re-use of information in computer assisted translation, and will be described with particular reference thereto. However, it is to be appreciated that the following is also amenable to other like applications.
Computer-assisted translation systems provide assistance in translating documents from a source language into a target language. Such systems provide a human translator with a convenient computer environment in which to produce target-language translations of source-language texts. In addition to multilingual text-processing facilities, a computer-assisted translation system typically includes functionalities such as easy access to electronic dictionaries and glossaries, as well as terminology management tools.
To increase the efficiency and accuracy of computer-assisted translation, a translation memory is provided in some systems. The translation memory stores paired source language-target language translation units. The pairs are generated based on previous translations. This stored information allows the previous translations to be re-used in subsequent translation tasks.
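By way of illustration only, the pairing of source-language and target-language translation units can be sketched as a simple lookup table. The class name `TranslationMemory`, its methods, the whitespace and case normalization, and the sample sentence pair are all assumptions made for this sketch; they are not taken from any particular computer-assisted translation system.

```python
from typing import Dict, Optional


class TranslationMemory:
    """Toy translation memory: maps source-language units to target-language units."""

    def __init__(self) -> None:
        self._units: Dict[str, str] = {}

    def add(self, source: str, target: str) -> None:
        # Store a pair produced by a previous translation.
        self._units[self._normalize(source)] = target

    def lookup(self, source: str) -> Optional[str]:
        # Return the stored translation for an exact (normalized) match, else None.
        return self._units.get(self._normalize(source))

    @staticmethod
    def _normalize(text: str) -> str:
        # Assumption of this sketch: collapse whitespace and ignore case,
        # so that trivial formatting differences still match.
        return " ".join(text.split()).lower()


tm = TranslationMemory()
tm.add(
    "Issued by the regional weather station.",
    "Émis par la station météorologique régionale.",
)
```

In this sketch, a later call to `tm.lookup(...)` with the same heading text returns the stored translation, while text that has never been translated returns `None` and is left to the human translator.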
In certain highly repetitive documents, such as weather bulletins, financial reports, avalanche warnings, technical manuals, and so forth, substantial portions of the document are frequently repeated. For example, each successive weather bulletin includes common heading information indicating the issuing weather station, the geographical region covered by the bulletin, and so forth. Rather than re-translating this common information each time a new weather bulletin is issued, the translation memory is consulted to re-use the previous translation of the common content. Similarly, a revised version of a document often contains substantial portions of content that are unchanged from earlier versions. The translation memory is consulted to re-use the previous translation of the unrevised portions of content.
The likelihood of finding a match in the translation memory increases as the size of the translation unit decreases. For example, it is more likely that a single sentence will find a match in the translation memory than an entire paragraph. Accordingly, the translation units are typically chosen to be small, corresponding to textual units such as sentences, phrases, or bullet points.
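The effect of translation unit size on match likelihood can be illustrated with a small sketch. The helper names `split_sentences` and `hit_rate`, the naive punctuation-based sentence splitter, and the sample bulletin text are hypothetical and chosen only for illustration. A revised paragraph in which one sentence has changed is looked up once as a whole paragraph and once sentence by sentence.

```python
import re
from typing import List, Set


def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter (an assumption of this sketch): break after '.', '!' or '?'
    # when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def hit_rate(units: List[str], memory: Set[str]) -> float:
    # Fraction of units found verbatim in the memory.
    if not units:
        return 0.0
    return sum(u in memory for u in units) / len(units)


old_paragraph = "Winds are light. Snow is expected above 2000 m. Avalanche risk is moderate."
new_paragraph = "Winds are light. Snow is expected above 1500 m. Avalanche risk is moderate."

# Memory built from the earlier bulletin at two different granularities.
paragraph_memory = {old_paragraph}
sentence_memory = set(split_sentences(old_paragraph))

paragraph_rate = hit_rate([new_paragraph], paragraph_memory)
sentence_rate = hit_rate(split_sentences(new_paragraph), sentence_memory)
```

Under these assumptions, the paragraph-sized unit finds no match because a single sentence changed, whereas two of the three sentence-sized units are found and can be re-used.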
However, in using small translation units, surrounding contextual information is lost. A consequence of the lost contextual information is an increased likelihood of obtaining multiple matches, in which a given source language translation unit has several close matches in the translation memory. In a given text collection, such multiple close matches can arise in contexts such as similar section headings and standardized front/back matter such as copyright information, disclaimers, and so forth. In collections related to a narrow subject, formulaic constructions sometimes arise, such as “Select Open in the File menu” in software manual collections or “Keep out of the reach of children” in pharmaceutical notices. In such cases, the translation memory may provide a number of close, but not identical, matches without providing a rationale for selecting one close match over the others. As a result, the human translator is called upon to make the selection.
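The multiple-match situation can be sketched as follows. This is a minimal illustration only: the function name `close_matches`, the 0.75 similarity threshold, the use of the Python standard library's `difflib.SequenceMatcher` as the similarity measure, and the sample memory entries with their French translations are all assumptions of the sketch, not features of any particular translation memory.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def close_matches(
    query: str, memory: Dict[str, str], threshold: float = 0.75
) -> List[Tuple[float, str, str]]:
    # Score every stored source unit against the query and keep the close ones.
    results = []
    for source, target in memory.items():
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            results.append((score, source, target))
    # Best score first; no context is available to break near-ties.
    return sorted(results, reverse=True)


# Hypothetical memory contents from a software manual collection.
memory = {
    "Select Open in the File menu.": "Sélectionnez Ouvrir dans le menu Fichier.",
    "Select Open from the File menu.": "Choisissez Ouvrir dans le menu Fichier.",
    "Select Close in the File menu.": "Sélectionnez Fermer dans le menu Fichier.",
    "Keep out of the reach of children.": "Tenir hors de portée des enfants.",
}

candidates = close_matches("Select Open in the Edit menu.", memory)
for score, source, target in candidates:
    print(f"{score:.2f}  {source}  ->  {target}")
```

With these sample entries, several of the formulaic "Select ..." units exceed the threshold for a single query. The similarity score ranks the candidates, but it carries no information about the surrounding context, so the choice among them still falls to the human translator.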
This problem of multiple close matches without corresponding contextual rationale for selection can in principle be addressed by increasing the size of the translation units, thus retaining more contextual information and reducing the number of close matches. For example, rather than storing sentence-based or bullet point-based translation units, entire paragraphs or bullet lists can be stored. However, increasing the size of the translation units increases the likelihood that no match at all will be found, making the translation memory less valuable for freeing translators from repetitive work and increasing their productivity.
The following copending, commonly assigned applications: Bilingual Authoring Assistant for the “Tip of the Tongue” Problem (Xerox ID 20040609-US-NP, Ser. No. 11/018,758 filed Dec. 21, 2004); and Bi-Dimensional Rewriting Rules for Natural Language Processing (Xerox ID 20040117-US-NP, Ser. No. 11/018,892 filed Dec. 21, 2004) are herein incorporated by reference.