Machine translation is a process by which a textual input in a first language is automatically translated, using a computerized machine translation system, into a textual output in a second language. Some such systems operate using word-based translation. In those systems, each word in the input text, in the first language, is translated into a corresponding word in the output text, in the second language. Better-performing systems, however, translate phrases rather than individual words and are referred to as phrase-based translation systems. One example of those systems is set out in Koehn et al., Statistical Phrase-Based Translation, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) 127-133, Edmonton, Alberta, Canada (2003).
In order to train either of these two types of systems (and many other machine translation systems), current training systems often access a bilingual corpus. The training systems first align text fragments in the bilingual corpus such that a text fragment (e.g., a sentence) in the first language is aligned with a text fragment (e.g., a sentence) in the second language. When the text fragments are aligned sentences, this is referred to as a bilingual sentence-aligned data corpus.
In order to train the machine translation system, the training system must also know the individual word alignments within the aligned sentences. In other words, even though sentences have been identified as translations of one another in the bilingual, sentence-aligned corpus, the machine translation training system must also know which words in each sentence of the first language correspond to which words in the aligned sentence in the second language.
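By way of illustration only, a word alignment within one aligned sentence pair can be represented as a set of index pairs, each linking a word position in the first-language sentence to a word position in the second-language sentence. The following is a minimal sketch under that assumption; the example sentences and the names `source`, `target`, `alignment`, and `aligned_word_pairs` are hypothetical and do not come from any particular training system.

```python
from typing import List, Set, Tuple

# Hypothetical sentence pair, as it might appear in a bilingual
# sentence-aligned corpus (first language: English, second: German).
source: List[str] = ["the", "house", "is", "small"]
target: List[str] = ["das", "Haus", "ist", "klein"]

# A word alignment: each (i, j) link says that source[i] corresponds
# to target[j]. This representation is an assumption for illustration.
alignment: Set[Tuple[int, int]] = {(0, 0), (1, 1), (2, 2), (3, 3)}


def aligned_word_pairs(
    src: List[str], tgt: List[str], links: Set[Tuple[int, int]]
) -> List[Tuple[str, str]]:
    """Return the aligned word pairs, ordered by source position."""
    return [(src[i], tgt[j]) for i, j in sorted(links)]


pairs = aligned_word_pairs(source, target, alignment)
```

A set of links, rather than a one-to-one mapping, is used here so that one word may be linked to several words, or to none, in the other sentence.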
One current approach to word alignment makes use of five translation models and is discussed in Brown et al., The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2): 263-311 (1993). This approach to word alignment is sometimes augmented by a Hidden Markov Model (HMM) based model, or a combination of an HMM-based model and Brown et al.'s fourth model, which has been called “Model 6”. These latter models are discussed in F. Och and H. Ney, A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics 29(1):19-51 (2003).
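To give a sense of how such statistical alignment models are estimated, the following is a minimal sketch in the style of the simplest of the five models (commonly called Model 1), trained by expectation-maximization on a toy corpus. It is a simplification offered under stated assumptions: the NULL word is omitted, the three-sentence corpus is hypothetical, and the function names `train_model1` and `best_alignment` are illustrative only. It does not represent the higher-accuracy models discussed above.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t[(f, e)], an approximation of P(target word f | source word e).

    corpus: list of (source_tokens, target_tokens) sentence pairs.
    Assumption: no NULL source word, unlike the full published model.
    """
    tgt_vocab = {f for _, tgt in corpus for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts per source word e
        # E-step: distribute each target word over the source words.
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_alignment(src, tgt, t):
    """Link each target position j to its most probable source position i."""
    return [
        (max(range(len(src)), key=lambda i: t[(tgt[j], src[i])]), j)
        for j in range(len(tgt))
    ]


# Hypothetical toy corpus of sentence-aligned pairs.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_model1(corpus)
links = best_alignment(["the", "house"], ["das", "Haus"], t)
```

In this simplified setting each target word is aligned independently of the others, which is what keeps the estimation inexpensive.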
These word alignment models are less than ideal in a number of different ways. The higher-accuracy models are mathematically complex and are also difficult to train, because they do not permit a dynamic programming solution. It can thus take many hours of processing time on current standard computers to train the models and produce an alignment of a large parallel corpus.
The present invention addresses one, some, or all of these problems. However, these problems are not to be used to limit the scope of the invention in any way, and the invention can be used to address different problems, other than those mentioned, in machine translation.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.