1. Field of the Invention
The present invention relates generally to statistical machine translation of documents and more specifically to systems and methods for processing annotations associated with a translation memory.
2. Description of the Related Art
In the field of computer-generated translations, there are two approaches to translating a document from a source language into a target language. The first approach is statistical machine translation (SMT), which uses a set of statistical probabilities to match a word or phrase in one language to an equivalent word or phrase in another language. The set of probabilities is generated from a large quantity of documents that have previously been translated from the source language to the target language.
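As a rough illustration of how such probabilities might be derived, the following sketch estimates phrase translation probabilities by relative frequency over already-aligned phrase pairs. The aligned pairs, the function name estimate_probabilities, and the use of simple relative frequency are assumptions made for illustration; practical SMT systems learn alignments and probabilities with more elaborate models.

```python
from collections import Counter, defaultdict

# Hypothetical, tiny set of aligned (source, target) phrase pairs.
# A real system would extract these from a large parallel corpus.
aligned_pairs = [
    ("mise à jour", "upgrade"),
    ("mise à jour", "upgrade"),
    ("mise à jour", "update"),
    ("serveur", "server"),
]


def estimate_probabilities(pairs):
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return table


table = estimate_probabilities(aligned_pairs)

# Pick the most probable target phrase for a given source phrase.
candidates = table["mise à jour"]
best = max(candidates, key=candidates.get)
print(best, candidates[best])
```

With these toy counts, "upgrade" is preferred over "update" for "mise à jour" because it was observed twice as often.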
The second approach is translation memory (TM), which uses bilingual databases of parallel translations of sentence segments. A segment of a source document in a source language is matched to an entry in the TM. The corresponding entry, in the target language, is provided as a translation of the segment. However, TM techniques can translate only those segments that have a corresponding entry in the TM database. For example, a sentence in French such as “Solaris - mise à jour (11-22 UC) serveur NET: 3 licence de utilisation” cannot be translated using a database with a similar entry such as:
Source (French): Solaris - mise à jour (33-64 UC) serveur SPARC: 1 licence de utilisation
Target (English): Solaris - SPARC server (33-64 CPU) upgrade: 1 RTU license

Source (French): Voir page 122 pour le numéro de référence de la mise à niveau de garantie recommandée.
Target (English): See page 122 for the recommended warranty upgrade part number.

because the database does not include an entry that matches the sentence exactly. To let the TM match segments that differ only in particular words or values, the TM includes abstract tags in the entries in place of those words or values. For example, the database above may be rewritten as:
Source (French): PRODUCTNAME - mise à jour (NUM-NUM COMPONENTNAME) serveur PRODUCTNAME: NUM licence de utilisation
Target (English): PRODUCTNAME - PRODUCTNAME server (NUM-NUM COMPONENTNAME) upgrade: NUM RTU license

Source (French): Voir page NUM pour le numéro de référence de la mise à niveau de garantie recommandée.
Target (English): See page NUM for the recommended warranty upgrade part number.

where the tags (in capital letters) appear in place of specific words and values such as the product name, the component name, and numerals. The rewritten database includes a match for the sentence above and can provide a perfect translation.
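The effect of the abstract tags can be sketched as follows. This is a minimal illustration, not the method of any particular TM product: the word lists PRODUCT_NAMES and COMPONENT_NAMES, the function abstract_segment, and the dictionary-based TM are all hypothetical, and the sketch assumes that "NET" is recognized as a product name. It shows an exact-match lookup failing on the raw sentence and succeeding once both the TM entry and the incoming segment are rewritten with tags.

```python
import re

# Hypothetical word lists; a real system would use much larger resources.
PRODUCT_NAMES = {"Solaris", "SPARC", "NET"}
COMPONENT_NAMES = {"UC", "CPU"}


def abstract_segment(segment: str) -> str:
    """Replace numerals, product names, and component names with tags."""
    def tag(match: re.Match) -> str:
        word = match.group(0)
        if word.isdigit():
            return "NUM"
        if word in PRODUCT_NAMES:
            return "PRODUCTNAME"
        if word in COMPONENT_NAMES:
            return "COMPONENTNAME"
        return word

    # \w+ matches runs of letters and digits, including accented letters.
    return re.sub(r"\w+", tag, segment)


french = "Solaris - mise à jour (33-64 UC) serveur SPARC: 1 licence de utilisation"
english = "Solaris - SPARC server (33-64 CPU) upgrade: 1 RTU license"

# Plain TM: keyed on the literal source segment.
exact_tm = {french: english}
# Tagged TM: both sides rewritten with abstract tags.
tagged_tm = {abstract_segment(french): abstract_segment(english)}

query = "Solaris - mise à jour (11-22 UC) serveur NET: 3 licence de utilisation"

print(exact_tm.get(query))                     # no exact match in the plain TM
print(tagged_tm.get(abstract_segment(query)))  # tagged template is found
```

The lookup in the tagged TM returns the tagged target template; substituting the concrete product names and numbers back into that template is a separate step that this sketch leaves out.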
However, if the modified TM is used to train an SMT engine, the tags can interfere with the training process. The resulting translation probabilities are less accurate when the system receives segments to be translated that have not been rewritten via the same process used to modify the TM, which may result in inaccurate translations.