The present invention relates to statistical word alignment. In particular, the present invention relates to training statistical word alignment models.
In statistical machine translation, parameters are generally trained to estimate the probability of a source language word being translated into one or more target language words. These translation probabilities can be used to estimate the probability of a sequence of words in the target language given a sequence of words in the source language. For example, under a well-known model, IBM Model 1, the probability of a sequence of words in the target language given the sequence of words in the source language is estimated as:
$$p(T \mid S) = \frac{\varepsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} \mathrm{tr}(t_j \mid s_i) \qquad \text{EQ. 1}$$

where $p(T \mid S)$ is the probability of a sequence of words in the target language given a sequence of words in a source language, $m$ is the number of words in the sequence of target language words, $l$ is the number of words in the sequence of source language words, $\varepsilon$ is the probability that a sequence of words in the target language will be $m$ words long, and $\mathrm{tr}(t_j \mid s_i)$ is the translation probability, which provides the probability of the $j$th word in the sequence of target language words given the $i$th word in the sequence of source language words.
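By way of illustration only, EQ. 1 may be evaluated as in the following minimal Python sketch. The function name `model1_probability`, the `<null>` token used for source position i = 0, the dictionary layout keyed by (target word, source word), and the toy probability values are all assumptions made for this example and are not taken from the model description above.

```python
from math import prod

NULL = "<null>"  # assumed placeholder for the null source word at position i = 0


def model1_probability(target, source, tr, epsilon=1.0):
    """Evaluate EQ. 1: p(T|S) = epsilon / (l + 1)^m * prod_j sum_i tr(t_j | s_i).

    tr maps (target_word, source_word) pairs to translation probabilities;
    pairs missing from the table are treated as probability 0.0.
    """
    positions = [NULL] + list(source)  # l + 1 source positions, including null
    l, m = len(source), len(target)
    # Inner sum over source positions for each target word t_j.
    per_word_sums = (sum(tr.get((t, s), 0.0) for s in positions) for t in target)
    return epsilon / (l + 1) ** m * prod(per_word_sums)


# Toy translation table (values invented for illustration).
tr = {
    ("the", "la"): 0.7,
    ("the", NULL): 0.1,
    ("house", "maison"): 0.8,
    ("house", "la"): 0.05,
}
p = model1_probability(["the", "house"], ["la", "maison"], tr)
```

In this sketch ε is passed in as a constant, since its value does not depend on the translation probabilities.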
The translation probabilities can also be used as part of a statistical word alignment model. Such models are used to identify an alignment between a source sentence and a target sentence, where an alignment identifies which source words and target words in the two sentences are translations of each other. If the translation model is limited such that each target word can be generated by exactly one source word (including a null word), an alignment $a$ can be represented by a vector $a_1, \ldots, a_m$, where each $a_j$ is the sentence position of the source word generating target word $t_j$ according to the alignment. Under this restriction, the most likely alignment $\hat{a}$ of a source sentence and a target sentence according to IBM Model 1 is given by:
$$\hat{a} = \arg\max_{a} \prod_{j=1}^{m} \mathrm{tr}(t_j \mid s_{a_j}) \qquad \text{EQ. 2}$$

where $s_{a_j}$ is the source word predicted by alignment $a_j$ for target word $t_j$. The notation $\arg\max_{a} f(a)$ means the value of $a$ for which $f(a)$ has the maximum value.
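Because the product in EQ. 2 factors over target positions, the maximization can be carried out independently for each target word. The following Python sketch illustrates this under stated assumptions: the function name `best_alignment`, the `<null>` token at source position 0, the dictionary layout keyed by (target word, source word), and the toy probabilities are hypothetical and chosen only for the example.

```python
NULL = "<null>"  # assumed placeholder for the null source word at position 0


def best_alignment(target, source, tr):
    """Return the vector a_1..a_m of EQ. 2 as a list of source positions.

    Position 0 denotes the null word; positions 1..l index the source words.
    Ties are broken in favor of the lowest source position.
    """
    positions = [NULL] + list(source)
    alignment = []
    for t in target:
        # Pick the source position whose word maximizes tr(t | s).
        best = max(range(len(positions)),
                   key=lambda i: tr.get((t, positions[i]), 0.0))
        alignment.append(best)
    return alignment


# Toy translation table (values invented for illustration).
tr = {
    ("the", "la"): 0.7,
    ("the", NULL): 0.1,
    ("house", "maison"): 0.8,
    ("house", "la"): 0.05,
}
alignment = best_alignment(["the", "house"], ["la", "maison"], tr)
```

With this toy table, "the" is aligned to source position 1 ("la") and "house" to source position 2 ("maison").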
Before a translation probability can be used in an alignment model or in a translation model, it must first be trained. Under the prior art, such translation models have typically been trained using an Expectation-Maximization (EM) algorithm. This algorithm relies on a corpus of paired sentences, where each sentence pair consists of a sentence in the source language and a translation of that sentence in the target language. During the expectation phase of the EM algorithm, counts are developed for word pairs, where a word pair consists of one word from the source language and one word from the target language that occur together in at least one of the paired sentences. Each occurrence of the word pair receives a count depending on the probability of the source word being translated into the target word, according to the current estimate of the translation probabilities.
Initially, the translation probabilities for each source word are set to a uniform distribution over the target language vocabulary. During the maximization phase, the counts are normalized and a probability is re-estimated for each translation. The process is then repeated using the updated translation probability estimates. Mathematically, it has been shown that as the number of iterations of this process increases, the EM algorithm will converge on the maximum likelihood estimates for the translation probabilities.
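The expectation and maximization phases described above can be sketched in Python as follows. This is a minimal illustration of conventional EM training for IBM Model 1, not a definitive implementation: the function name `train_model1`, the `<null>` token, the fixed iteration count, and the two-sentence toy corpus are assumptions introduced for the example.

```python
from collections import defaultdict

NULL = "<null>"  # assumed placeholder for the null source word


def train_model1(corpus, iterations=10):
    """Estimate tr[(t, s)] = tr(t | s) from (source_words, target_words) pairs."""
    target_vocab = {t for _, tgt in corpus for t in tgt}
    uniform = 1.0 / len(target_vocab)
    tr = defaultdict(lambda: uniform)  # uniform initialization over target words

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for word pairs (t, s)
        total = defaultdict(float)  # expected counts for each source word s

        # Expectation phase: each co-occurrence of (t, s) receives a fractional
        # count proportional to the current estimate of tr(t | s).
        for src, tgt in corpus:
            positions = [NULL] + list(src)
            for t in tgt:
                norm = sum(tr[(t, s)] for s in positions)
                for s in positions:
                    c = tr[(t, s)] / norm
                    count[(t, s)] += c
                    total[s] += c

        # Maximization phase: normalize the counts per source word.
        tr = defaultdict(float, {(t, s): c / total[s]
                                 for (t, s), c in count.items()})
    return tr


# Toy parallel corpus (invented for illustration).
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
tr = train_model1(corpus)
```

After a few iterations on this toy corpus, the estimate of tr(house | maison) rises above tr(the | maison), since "the" is increasingly explained by "la" and by the null word, which occur in both sentence pairs.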
Under the prior art, maximum likelihood estimation was thought to provide the best set of model parameters for alignment and translation. However, model parameters trained in this way have been less than ideal. One reason for this is that the EM algorithm trains the parameters to best fit the training data. If the training data is not representative of the actual data encountered during translation or alignment, the algorithm will over-fit the parameters so that they describe the training data instead of the actual data.
Thus, new techniques are needed to avoid the over-fitting of translation probability parameters during training.