The present exemplary embodiment is directed to the field of machine translation. It finds particular application in connection with the translation of words which are used infrequently in a parallel corpus of text used for building a machine translation system.
Out-of-vocabulary (OOV) words are a problem faced by machine translation (MT) systems. Even when translating test sets similar in nature to a system's training data, there will almost always be at least a small number of source-language words for which the system can produce no target-language translation. Current practice varies on the treatment given to OOV words in the output of an MT system. They may be simply passed along into the output, deleted from the output, looked up in another translation or system resource, or handled through a variety of on-the-fly techniques such as attempted spelling correction or synonym substitution.
Phrase-based statistical machine translation (SMT) systems employ a phrase table as a central resource. This is a probabilistic dictionary associating short sequences of words in two languages. When translating from a source to a target language, the phrase table is accessed to retrieve a set of bi-phrases, each of which pairs a source phrase that matches part of a source sentence or other text string with a corresponding target phrase. The retrieved bi-phrases are input to a scoring model, which outputs an optimal translation of the source sentence using a subset of the retrieved bi-phrases.
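By way of illustration only, the retrieval step described above may be sketched as follows. The toy phrase table, the function names `retrieve_biphrases` and `uncovered_words`, and the `max_len` parameter are assumptions introduced for this sketch and are not part of any particular SMT system. A source word that no retrieved bi-phrase covers corresponds to an OOV word.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table: source phrase (as a token tuple) mapped to
# a list of (target phrase, probability) entries.
PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]


def retrieve_biphrases(sentence: str, table: PhraseTable, max_len: int = 3):
    """Return (start, end, source phrase, target phrase, prob) for every
    table entry whose source side matches a contiguous span of the sentence."""
    tokens = sentence.split()
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            src = tuple(tokens[start:end])
            for tgt, prob in table.get(src, []):
                matches.append((start, end, " ".join(src), tgt, prob))
    return matches


def uncovered_words(sentence: str, table: PhraseTable, max_len: int = 3):
    """Return the source words that no retrieved bi-phrase covers."""
    tokens = sentence.split()
    covered = [False] * len(tokens)
    for start, end, _src, _tgt, _prob in retrieve_biphrases(sentence, table, max_len):
        for i in range(start, end):
            covered[i] = True
    return [tok for tok, is_covered in zip(tokens, covered) if not is_covered]


# Illustrative data only.
toy_table: PhraseTable = {
    ("the",): [("le", 0.6), ("la", 0.4)],
    ("the", "cat"): [("le chat", 0.9)],
    ("sleeps",): [("dort", 0.8)],
}
```

In this sketch, "cat" has no single-word entry but is still covered by the bi-phrase for "the cat", whereas a word that appears in no entry at all is left uncovered. The scoring model that selects among the retrieved bi-phrases is outside the scope of the sketch.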
The phrase table is often obtained by first aligning a parallel corpus at the level of the individual words. This alignment often relies on a tool called GIZA++. GIZA++ is a statistical machine translation toolkit that is used to train IBM statistical translation models (Models 1-5) and an HMM word alignment model. A further procedure then extracts phrase pairs (bi-phrases) and inserts them in a phrase table, together with the appropriate frequency statistics. The Moses system is a widely used package for phrase extraction and decoding in statistical machine translation. For a description of the GIZA++ system, see Franz Josef Och and Hermann Ney, A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics, 29(1):19-51, 2003 (hereinafter, Och and Ney). For a description of the IBM statistical translation models, see Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2):263-311, 1993 (hereinafter, Brown, et al.). The Moses system is described in Philipp Koehn, et al., Moses: Open Source Toolkit for Statistical Machine Translation, Proc. ACL 2007 Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, 2007 (hereinafter, Koehn, et al. 2007).
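As a minimal sketch of the phrase extraction step, and not the Moses implementation, the following assumes a word alignment given as a set of (source index, target index) links and extracts the phrase pairs that are consistent with it, i.e., pairs in which no word inside the pair is linked to a word outside it. The function names, the `max_len` limit, and the example sentence are hypothetical, and the usual extension of phrase pairs over unaligned boundary words is omitted for brevity.

```python
from collections import Counter
from typing import List, Set, Tuple


def extract_phrase_pairs(src: List[str], tgt: List[str],
                         alignment: Set[Tuple[int, int]], max_len: int = 3):
    """Extract phrase pairs consistent with a word alignment.

    alignment holds (i, j) links meaning src[i] is aligned to tgt[j].
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word of the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue  # a span with no links yields no phrase pair
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no target word in the span may be linked
            # to a source word outside the source span.
            if any(t_start <= j <= t_end and not (s_start <= i <= s_end)
                   for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[s_start:s_end + 1]),
                          " ".join(tgt[t_start:t_end + 1])))
    return pairs


def count_phrase_pairs(corpus):
    """Accumulate frequency statistics over (src, tgt, alignment) triples."""
    counts = Counter()
    for src, tgt, alignment in corpus:
        counts.update(extract_phrase_pairs(src, tgt, alignment))
    return counts


# Illustrative data only.
src = ["the", "blue", "house"]
tgt = ["la", "maison", "bleue"]
alignment = {(0, 0), (1, 2), (2, 1)}
```

In this sketch, a source word that receives no alignment link never produces a single-word phrase pair of its own, which illustrates how an alignment error can propagate into the phrase table.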
In the standard GIZA++ word alignment of Och and Ney, the frequency of a word can have a large impact on its alignment results. It is often difficult to obtain precise alignments for low-frequency words under the IBM models implemented in GIZA++.
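For orientation, the simplest of the IBM models (Model 1, as described in Brown, et al.) may be sketched as an expectation-maximization loop over a sentence-aligned corpus. This is a minimal illustration under simplifying assumptions (no NULL word, uniform initialization, a fixed number of iterations) and is not the GIZA++ implementation; the function name `train_ibm_model1` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=5):
    """Estimate lexical translation probabilities t[(tgt_word, src_word)].

    bitext is a list of (source tokens, target tokens) sentence pairs.
    """
    tgt_vocab = {w for _src, tgt in bitext for w in tgt}
    # Uniform initialization over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt_word, src_word)
        total = defaultdict(float)  # expected count of src_word
        # E-step: distribute each target word over the source words
        # of the same sentence pair, in proportion to the current t.
        for src, tgt in bitext:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Illustrative data only.
toy_bitext = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
```

In such a model, the estimate for a word is driven entirely by the sentence pairs in which it occurs, so a word seen only once or twice contributes few co-occurrence counts from which to distinguish its correct translation from the other words in the same sentences.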
It has now been observed that a significant fraction of out-of-vocabulary words (i.e., words in the source language that an SMT system is unable to translate) in a phrase-based SMT system do occur in the training data, but they are lost while the system is being built because of imprecise rare-word alignment during the standard GIZA++ stage.
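The observation above suggests a simple diagnostic, sketched here under stated assumptions: each source word that the phrase table cannot cover is classified as either absent from the training data or present in the training data but missing from the phrase table. The function name `classify_oov` and its inputs (token lists and a list of phrase-table source phrases) are hypothetical and chosen only for illustration.

```python
def classify_oov(test_tokens, training_tokens, phrase_table_sources):
    """Split OOV words into those lost during system building and those
    never seen in the training data.

    phrase_table_sources is an iterable of source-side phrases (strings).
    Returns (lost, unseen) as sorted lists.
    """
    training_vocab = set(training_tokens)
    # Every word that appears in the source side of some bi-phrase.
    table_vocab = {w for phrase in phrase_table_sources for w in phrase.split()}
    lost, unseen = set(), set()
    for tok in test_tokens:
        if tok in table_vocab:
            continue  # translatable: covered by at least one bi-phrase
        if tok in training_vocab:
            lost.add(tok)    # occurred in training but did not reach the table
        else:
            unseen.add(tok)  # genuinely absent from the training data
    return sorted(lost), sorted(unseen)
```

Words in the first returned list are the ones of interest here, since they were available in the parallel corpus and could in principle have been translated.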
The exemplary embodiment provides a system and method which improve an initial word-to-word alignment, such as that output by GIZA++.