The following relates generally to methods, and apparatus therefor, for performing bilingual word alignment.
Aligning the words of sentences that are mutual (i.e., bilingual) translations is an important problem because the aligned words are used in various linguistic and natural language applications. Examples of linguistic applications that make use of mutual translations include: the identification of phrases or templates in phrase-based machine translation; machine translation; and the projection of linguistic annotation across languages. Generally, known methods for performing word alignment are limited in performance (e.g., too computationally complex) or precision (e.g., no guarantee an alignment is “proper”, as defined herein). Accordingly, there continues to exist a need for improved methods for carrying out word alignment of sentences.
Methods, and systems and articles of manufacture therefor, are described herein for producing “proper” alignments of words at the sentence level. In accordance with various embodiments described herein of such methods, given a source sentence f=f1 . . . fi . . . fI composed of source words f1, . . . , fI, and a target sentence e=e1 . . . ej . . . eJ composed of target words e1, . . . , eJ, the various embodiments identify source words fi and target words ej in the source and target sentences such that the words are “properly” aligned (i.e., are in mutual correspondence). As defined herein, words from a source sentence and target sentence are “properly” aligned when they satisfy the constraints of “coverage” and “transitive closure”.
For the purpose of describing a proper alignment between words at the sentence level, a cept is characterized herein as a semantic invariant that links both sides of an alignment. An N-to-M alignment is therefore represented using cepts as an N-to-1-to-M alignment (i.e., N source words to one cept to M target words). As embodied in this notion of a cept, word alignment is proper if the alignment covers both sides (i.e., each word is aligned to at least one other word) and is closed by transitivity (i.e., all N source words in an N-to-M alignment are aligned to all M target words).
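As an illustration of the N-to-1-to-M representation, the following minimal Python sketch expands a list of cepts into a 0/1 alignment matrix. The function name cepts_to_alignment, the zero-based word indices, and the encoding of a cept as a pair of index sets are assumptions made for this example only; they are not taken from the embodiments described herein.

```python
def cepts_to_alignment(cepts, num_source, num_target):
    """Expand cepts into a 0/1 alignment matrix (hypothetical helper).

    Each cept is assumed to be a pair (source_indices, target_indices).
    Every source word of a cept is linked to every target word of the
    same cept, which is the N-to-1-to-M reading of an N-to-M alignment.
    """
    alignment = [[0] * num_target for _ in range(num_source)]
    for source_indices, target_indices in cepts:
        for i in source_indices:
            for j in target_indices:
                alignment[i][j] = 1
    return alignment


# Two cepts: source words 0 and 1 share target word 0 (a 2-to-1 alignment),
# and source word 2 is linked to target words 1 and 2 (a 1-to-2 alignment).
example_cepts = [({0, 1}, {0}), ({2}, {1, 2})]
example_alignment = cepts_to_alignment(example_cepts, 3, 3)
for row in example_alignment:
    print(row)
```

Because every word in this example belongs to exactly one cept, the resulting matrix both covers each sentence and is closed by transitivity.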
In accordance with the methods, apparatus and articles of manufacture described herein, words of natural language sentences are properly aligned. In performing the method, a corpus of aligned source sentences f=f1 . . . fi . . . fI and target sentences e=e1 . . . ej . . . eJ is received, where the source sentences are in a first natural language and the target sentences are in a second natural language. A translation matrix M is produced with association measures mij. Each association measure mij in the translation matrix provides a valuation of association strength between each source word fi and each target word ej. One or more of an alignment matrix A and cepts are produced that link aligned source and target words, where the alignment matrix and cepts define a proper N:M alignment between source and target words by satisfying coverage and transitive closure. Coverage is satisfied when each source word is aligned with at least one target word and each target word is aligned to at least one source word. Transitive closure is satisfied if, whenever source word fi is aligned to target words ej and el, and source word fk is aligned to target word el, source word fk is also aligned to target word ej.
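The two constraints can be checked directly on a 0/1 alignment matrix. The sketch below is one straightforward way to do so, assuming the matrix is given as a list of rows (one per source word) with one column per target word; the function names is_covered, is_transitively_closed and is_proper are hypothetical and chosen for this example only. It tests whether a given alignment is proper; it does not produce an alignment.

```python
def is_covered(alignment):
    """Coverage: every source word (row) and every target word (column)
    takes part in at least one alignment link."""
    rows_ok = all(any(row) for row in alignment)
    cols_ok = all(any(col) for col in zip(*alignment))
    return rows_ok and cols_ok


def is_transitively_closed(alignment):
    """Transitive closure: if source word i is aligned to target words j and l,
    and source word k is aligned to target word l, then k is aligned to j."""
    num_source = len(alignment)
    num_target = len(alignment[0]) if alignment else 0
    for i in range(num_source):
        for k in range(num_source):
            for l in range(num_target):
                if alignment[i][l] and alignment[k][l]:
                    # i and k share target word l, so k must have every link i has.
                    for j in range(num_target):
                        if alignment[i][j] and not alignment[k][j]:
                            return False
    return True


def is_proper(alignment):
    """An alignment is proper when it satisfies both constraints."""
    return is_covered(alignment) and is_transitively_closed(alignment)


proper = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]   # covered and closed
not_closed = [[1, 1], [0, 1]]                # source word 1 lacks the link to target word 0
not_covered = [[1, 0], [1, 0]]               # target word 1 is unaligned
print(is_proper(proper), is_proper(not_closed), is_proper(not_covered))
```

The nested loops follow the wording of the definition and are meant for clarity rather than speed. An equivalent test is that any two source words sharing a target word must have identical sets of aligned target words.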
Advantageously, these embodiments are computationally efficient methods for determining proper word alignment in sentences. In addition, further advantages may be realized as proper word alignment enables the production of improved translation models. For example, better chunks for phrase-based translation models may be extracted with properly aligned sentence translations. Further, proper word alignments advantageously ensure that cept-based phrases cover entire source and target sentences.