The present invention relates to aligning bilingual corpora. In particular, the present invention relates to length-based and word-correspondence-based alignment.
Sentence-aligned parallel bilingual corpora have proved very useful for applying machine learning to machine translation and other NLP tasks. Unfortunately, most available parallel bilingual corpora, such as the proceedings of the Canadian Parliament, do not originate in a sentence-aligned form. Thus, before the corpora can be used for machine learning, their sentences must be aligned.
Aligning sentences is not trivial because at times a single sentence in one language is translated as two or more sentences in the other language. In addition, because of imperfections in the corpora, a sentence found in one corpus may not be present in the other corpus.
In the past, two general techniques have been used for aligning bilingual corpora. The first approach is word-based or character-based. Under this approach, a bilingual lexicon is used to align individual words in each corpus. Because determining the alignment of individual words is complex, this approach is undesirably slow. In addition, it requires a bilingual lexicon and thus cannot be used to align corpora when no such lexicon is available.
The second general method for aligning bilingual corpora uses probabilistic modeling of the relationship between the length of sentences in one language and the length of their translations in the other language. Although such length-based systems are faster than the word-based systems, they are not as accurate.
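By way of illustration only, a length-based aligner of this general kind can be sketched as follows. The sketch assumes a Gaussian model in which the expected length of a translation is proportional to the length of the source sentence, and it uses dynamic programming over 1-1, 1-0, 0-1, 2-1 and 1-2 groupings of sentences. The function names `length_cost` and `align`, the Gaussian assumption, and every constant (`ratio`, `variance`, `skip_penalty`, `merge_penalty`) are hypothetical choices made for this example; they are not taken from any particular prior system.

```python
import math

def length_cost(src_len, tgt_len, ratio=1.0, variance=6.8):
    """Negative log-likelihood of tgt_len given src_len under an ASSUMED
    Gaussian length model (illustrative constants, not from the source)."""
    mean = src_len * ratio                 # expected translation length
    var = max(src_len, 1) * variance       # spread grows with sentence length
    delta = tgt_len - mean
    return 0.5 * delta * delta / var + 0.5 * math.log(2 * math.pi * var)

def align(src_lens, tgt_lens, skip_penalty=10.0, merge_penalty=2.0):
    """Return the lowest-cost list of (source indices, target indices) groups."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # (source sentences consumed, target sentences consumed, hypothetical penalty)
    moves = [(1, 1, 0.0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty
                if di and dj:
                    # Score merged sentences by their combined lengths.
                    step += length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end of both corpora to the start.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups

if __name__ == "__main__":
    # Sentence lengths (for example, in characters) for two toy corpora.
    print(align([20, 30, 40], [52, 41]))
```

Because the sketch examines only sentence lengths, it needs no bilingual lexicon and runs quickly, but two unrelated sentences of similar length receive a low cost, which is one way to see why purely length-based systems are less accurate.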
Thus, an alignment system is needed that is fast and highly accurate and that does not require a bilingual lexicon.