1. Field of the Invention
The present invention relates generally to statistical machine translation, and more particularly to systems and methods for statistical word alignment.
2. Description of Related Art
Word alignment is used in statistical machine translation (SMT) to improve translations of documents between two or more languages. SMT may first align sentences in order to extract parallel sentences from parallel documents. After determining sentence alignments, SMT typically further aligns words or fragments of the sentences. Conventionally, word alignment in SMT is performed to determine whether a specific word or phrase in one language (e.g., English) corresponds to a specific word or phrase in another language (e.g., French). More specifically, word alignment is a process in which a large collection of parallel documents is used to automatically identify word-to-word or word-to-phrase correspondences.
The Expectation-Maximization (E-M) algorithm is commonly used to perform word alignment in SMT. In the expectation step of the E-M algorithm, a hypothetical dictionary is used to induce word alignments in a large corpus containing millions of sentences. Based on the induced word alignments, the hypothetical dictionary is modified in the maximization step. The modified dictionary is then used to induce better word alignments by repeating the expectation step. This process is repeated as needed until the hypothetical dictionary remains substantially unmodified from cycle to cycle.
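By way of illustration only, the E-M cycle described above may be sketched as follows. The sketch assumes a simple word-to-word lexical model in the style of IBM Model 1, which the description above does not specify. The names train_dictionary and align, the tolerance parameter, and the three-sentence toy corpus are hypothetical and stand in for the large corpus.

```python
from collections import defaultdict


def train_dictionary(corpus, max_iterations=50, tolerance=1e-4):
    """Estimate a hypothetical dictionary P(target_word | source_word).

    corpus: list of (source_words, target_words) sentence pairs.
    Assumption: an IBM Model 1 style model with no NULL word.
    """
    target_vocab = {t for _, tgt in corpus for t in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform hypothetical dictionary.
    dictionary = defaultdict(lambda: uniform)

    for _ in range(max_iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)

        # Expectation step: induce fractional word alignments
        # from the current dictionary.
        for src, tgt in corpus:
            for t in tgt:
                norm = sum(dictionary[(s, t)] for s in src)
                for s in src:
                    fraction = dictionary[(s, t)] / norm
                    counts[(s, t)] += fraction
                    totals[s] += fraction

        # Maximization step: modify the dictionary based on
        # the induced alignments.
        largest_change = 0.0
        for (s, t), count in counts.items():
            updated = count / totals[s]
            largest_change = max(largest_change, abs(updated - dictionary[(s, t)]))
            dictionary[(s, t)] = updated

        # Stop once the dictionary is substantially unmodified.
        if largest_change < tolerance:
            break

    return dictionary


def align(src, tgt, dictionary):
    """Link each target word to its most probable source word."""
    return [(max(src, key=lambda s: dictionary[(s, t)]), t) for t in tgt]


toy_corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
toy_dictionary = train_dictionary(toy_corpus)
toy_alignment = align(["the", "house"], ["la", "maison"], toy_dictionary)
```

In this toy setting, the repeated co-occurrence of "the" with "la" pulls probability toward that pair, which in turn frees "maison" to align with "house". A practical system would use a richer alignment model and a far larger corpus.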
More recently, SMT systems perform an additional step after the E-M algorithm is completed. The additional step uses a small corpus comprising manual annotations that indicate word alignments. The additional step estimates another dictionary based on the small corpus and combines this dictionary with the hypothetical dictionary generated by the E-M algorithm. The combined dictionary is then used to correct word alignments in the large corpus in one final step. However, further improvements in the accuracy of SMT are still desired.
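For illustration, the additional step described above may be sketched as follows. The manner in which the two dictionaries are combined is not specified above; the sketch assumes a simple linear interpolation, and the names estimate_from_annotations, combine, and weight are hypothetical.

```python
from collections import Counter


def estimate_from_annotations(annotated_links):
    """Estimate P(target_word | source_word) by relative frequency.

    annotated_links: list of (source_word, target_word) pairs that
    annotators marked as aligned in the small corpus.
    """
    pair_counts = Counter(annotated_links)
    source_counts = Counter(s for s, _ in annotated_links)
    return {(s, t): c / source_counts[s] for (s, t), c in pair_counts.items()}


def combine(em_dictionary, annotated_dictionary, weight=0.5):
    """Interpolate the two dictionaries (an assumed combination rule).

    weight is the share given to the annotation-based dictionary.
    """
    combined = {}
    for pair in set(em_dictionary) | set(annotated_dictionary):
        combined[pair] = (
            (1.0 - weight) * em_dictionary.get(pair, 0.0)
            + weight * annotated_dictionary.get(pair, 0.0)
        )
    return combined


# Hypothetical values: the E-M dictionary prefers the wrong link for "house".
em_dictionary = {("house", "maison"): 0.4, ("house", "la"): 0.6}
links = [("house", "maison"), ("house", "maison"), ("house", "la")]
annotated_dictionary = estimate_from_annotations(links)
combined = combine(em_dictionary, annotated_dictionary, weight=0.5)
```

With these hypothetical values, the manual annotations shift the combined dictionary toward the "house"/"maison" correspondence, so a final alignment pass over the large corpus using the combined dictionary would correct that link.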