1. Field of the Invention
The invention disclosed herein is generally related to machine translation and more specifically to modification of annotated bilingual segment pairs in syntax-based machine translation.
2. Description of the Related Art
To translate written documents from a source language, such as Arabic, to a target language, such as English, machine translation performed by a computer may be used. One technique for performing machine translation, statistical machine translation, includes generating a translation model comprising translation rules derived from phrases in the source language matched with phrases in the target language. These paired phrases include annotated bilingual segment pairs. An annotated bilingual segment pair may be a sentence, a fragment, or a phrase.
In a string-to-tree annotated bilingual segment pair, the target phrase may be represented as a tree having branches separating syntactic structures in the target phrase. The nodes of the tree are typically labeled based on the syntactic structure of the branch. Syntactic structures include noun phrases, verb phrases, adverb phrases, or the like. The annotated bilingual segment pair may further include alignments between the words in the source language and words in the target language.
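For illustration only, a string-to-tree annotated bilingual segment pair of the kind described above might be represented as in the following minimal Python sketch. The class names (Node, AnnotatedPair), the method names, the placeholder source tokens, and the example tree are hypothetical and are not taken from any particular translation engine. The labels S and VB are assumed in addition to the NP, VP, and ADVP labels named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    """One node of the target-side syntax tree (hypothetical structure)."""
    label: str                                   # syntactic label, e.g. "NP" or "VP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set only on leaf nodes

    def leaves(self) -> List[str]:
        """Return the target words at the leaves, left to right."""
        if self.word is not None:
            return [self.word]
        words: List[str] = []
        for child in self.children:
            words.extend(child.leaves())
        return words


@dataclass
class AnnotatedPair:
    """Source string, target tree, and word alignments between them."""
    source_words: List[str]
    target_tree: Node
    alignments: Set[Tuple[int, int]]             # (source index, target leaf index)

    def aligned_target_words(self, source_index: int) -> List[str]:
        """Target words aligned to the source word at source_index."""
        leaves = self.target_tree.leaves()
        return [leaves[t] for (s, t) in sorted(self.alignments) if s == source_index]


# Assumed example: placeholder source tokens aligned to a small English tree.
tree = Node("S", [
    Node("NP", word="he"),
    Node("VP", [Node("VB", word="left"), Node("ADVP", word="quickly")]),
])
pair = AnnotatedPair(
    source_words=["s1", "s2"],
    target_tree=tree,
    alignments={(0, 0), (1, 1), (1, 2)},         # "s2" aligns to two target words
)
```

In this sketch, a single source word may align to more than one target word, which is why the alignments are stored as a set of index pairs rather than as a one-to-one mapping.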
FIG. 1 is a diagram of a prior art process 100 for deriving translation rules from an annotated bilingual segment pair. The process 100 comprises, in a single iteration, receiving the annotated bilingual segment pair 102 and training a translation engine based on the annotated bilingual segment pair 102 to generate composed rules 104. The composed rules 104 may be used by the translation engine to translate a document from the source language to the target language.
The annotated bilingual segment pair 102 is a string-to-tree annotated bilingual segment pair and comprises one or more parent nodes that are each associated with at least two children. The children may, in turn, be parent nodes for other children. Each node is labeled with a syntactic structure identifier such as noun phrase (NP), verb phrase (VP), adverb phrase (ADVP), or the like. Each endpoint comprises a word in a target language, designated in FIG. 1 by the letter “a.” In the annotated bilingual segment pair 102, words in a source phrase designated by the letter “e” are each aligned via a dotted line to one or more words in the target phrase.
The annotations on a bilingual segment pair are generated automatically by a machine and may include inaccurate or imprecise labels, structures, and/or alignments. In machine translation, millions of annotated bilingual segment pairs may be used, and it may be impractical to correct each of them manually. Further, poorly annotated bilingual segment pairs may result in translations that are incomprehensible, nonsensical, or awkward.