This invention relates to automatic word alignment, for example, for use in training of a statistical machine translation (SMT) system.
SMT systems generally rely on translation rules obtained from parallel training corpora. In phrase-based SMT systems, the translation rule set includes rules that associate corresponding source language phrases and target language phrases, which may be referred to as associated phrase pairs. When a manually annotated corpus of associated phrase pairs is unavailable or inadequate, a first step in training the system includes identification and extraction of the translation phrase pairs, which involves the induction of links between the source and target words, a procedure known as word alignment. The quality of such word alignment can play a crucial role in the performance of an SMT system, particularly when the SMT system uses phrase-based rules.
SMT systems rely on automatic word alignment systems to induce links between source and target words in a sentence aligned training corpus. One such technique, IBM Model 4, uses unsupervised Expectation Maximization (EM) to estimate the parameters of a generative model according to which a sequence of target language words is produced from a sequence of source language words by a parametric random procedure.
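This style of EM training can be illustrated with a minimal sketch of the simpler IBM Model 1, which shares the same lexical translation parameters as Model 4 but omits fertility and distortion components. The function name and toy corpus below are illustrative assumptions, not part of any particular system.

```python
from collections import defaultdict

def ibm_model1_em(bitext, iterations=10):
    """Estimate lexical translation probabilities t(target | source) with EM.

    bitext: list of (source_words, target_words) sentence pairs.
    Illustrative IBM Model 1 sketch; Model 4 additionally models
    fertility and distortion.
    """
    # Uniform initialization over all co-occurring word pairs.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts (E-step)
        total = defaultdict(float)
        for src, tgt in bitext:
            src = ["NULL"] + src     # NULL token allows unlinked target words
            for tw in tgt:
                z = sum(t[(sw, tw)] for sw in src)   # normalize over source
                for sw in src:
                    c = t[(sw, tw)] / z              # posterior link probability
                    count[(sw, tw)] += c
                    total[sw] += c
        for (sw, tw), c in count.items():            # M-step: re-estimate t
            t[(sw, tw)] = c / total[sw]
    return t
```

On a classic two-sentence toy corpus (e.g., German-English "das Haus"/"the house" and "das Buch"/"the book"), a few iterations concentrate probability mass on the correct lexical links.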
EM is an iterative parameter estimation process and is prone to errors. Less than optimal parameter estimates may result in less than optimal alignments of the source and target language sentences. The quality of the outcome depends largely on the number of parallel sentences available in the training corpus (a larger corpus is preferable), and their purity (i.e., mutual translation quality). Thus, word alignment quality tends to be poor for resource-poor language pairs (e.g., English-Pashto or English-Dari). In some cases a large proportion of words can be incorrectly aligned or simply left unaligned. This can lead to inference of incorrect translation rules and have an adverse effect on SMT performance. Thus, improving alignment quality can have a significant impact on SMT accuracy.
Other work has sought to improve word alignment quality. For example, a number of “boosting” algorithms have been proposed. In some traditional boosting algorithms (e.g., AdaBoost) for binary classification tasks, an iterative weight update formula emphasizes incorrectly classified training samples and attenuates those that are correctly classified, in effect “moving” the class boundaries to accommodate the misclassified points. Classifiers trained at each boosting iteration (also known as weak learners) are combined to identify class labels for test samples. In many cases, this combination of weak learners results in better classification performance than using a standard train/test approach.
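The iterative weight update described above can be sketched for a generic binary classification task as follows. This is the textbook AdaBoost update, shown only to make the "emphasize the misclassified, attenuate the correct" behavior concrete; it is not the alignment-specific procedure of the invention.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost weight update for binary classification.

    weights: current sample weights (summing to 1).
    correct: booleans indicating which samples the current weak
             learner classified correctly.
    Returns (new_weights, alpha), where alpha is the weak learner's
    vote weight in the final combined classifier.
    """
    # Weighted error of the weak learner on the current distribution.
    eps = sum(w for w, c in zip(weights, correct) if not c)
    eps = min(max(eps, 1e-10), 1 - 1e-10)        # guard against log(0)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Misclassified samples are up-weighted, correct ones down-weighted.
    new = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)                                 # renormalize to a distribution
    return [w / z for w in new], alpha
```

A known property of this update is that, after renormalization, the previous weak learner's weighted error on the new distribution is exactly 0.5, forcing the next learner to focus on the hard samples.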
However, such placing of emphasis on poorly aligned sentence pairs can distort word alignments and reduce alignment quality over the entire corpus because poorly aligned sentence pairs tend to be lower quality or non-literal translations of each other.
Additionally, word alignment is significantly more complex than simple binary classification. Moreover, a direct measure of alignment quality (which can be used to update weights for boosting), such as alignment error rate (AER), can only be obtained from a hand-aligned reference corpus. Another issue is determining the best way to combine alignments from the weak learning iterations.
In one example, Wu et al. (“Boosting statistical word alignment using labeled and unlabeled data,” Proc. COLING/ACL, Morristown, N.J., USA pp 913-920) proposed a strategy for boosting statistical word alignment based on a small hand-aligned (labeled) reference corpus and a pseudo-reference set constructed from unlabeled data. Theirs was a straightforward extension of the AdaBoost algorithm using AER as a measure of goodness. They used a weighted majority voting scheme to pick the best target word to be linked to each source word based on statistics gathered from the boosting iterations. On a small scale, Wu's strategy is practical; however, larger hand-aligned reference corpora are extremely expensive to construct and very difficult to obtain for resource-poor language pairs.
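The weighted majority voting idea can be illustrated with a short sketch. The data structures below are hypothetical simplifications for illustration (each iteration's alignment is reduced to a one-link-per-source-word mapping), not Wu et al.'s actual implementation.

```python
from collections import defaultdict

def majority_vote_links(alignments, alphas):
    """Combine per-iteration alignments by weighted majority voting.

    alignments: one dict per boosting iteration, mapping a source word
                position to its linked target word position.
    alphas: the vote weight assigned to each iteration's weak learner.
    Returns, for each source position, the target position with the
    largest accumulated vote weight.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for align, a in zip(alignments, alphas):
        for s, t in align.items():
            votes[s][t] += a                     # accumulate weighted votes
    return {s: max(tv, key=tv.get) for s, tv in votes.items()}
```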
In another example, Ananthakrishnan et al. (“Alignment entropy as an automated measure of bitext fidelity for statistical machine translation,” ICON '09: Proc. 7th Int. Conf. on Natural Lang. Proc., December 2009) proposed a technique for automatically gauging alignment quality using bootstrap resampling. The resamples were word aligned and a measure of alignment variability, termed alignment entropy, was computed for each sentence pair. The measure was found to correlate well with AER. Subsequently, they proposed a coarse-grained measure of phrase pair reliability, termed phrase alignment confidence, based on the consistency of valid phrase pairs across resamples.
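One plausible formulation of such a variability measure is the entropy of the link distribution observed across the re-aligned resamples, sketched below. This is a hypothetical illustration of the idea; the exact definition used in the cited paper may differ.

```python
import math
from collections import Counter

def alignment_entropy(link_sets):
    """Entropy of link variability across bootstrap resamples.

    link_sets: for one sentence pair, a list of alignments, each a set of
    (source_index, target_index) links obtained by word-aligning a
    different bootstrap resample of the corpus. Higher entropy indicates
    a less stable (lower-confidence) alignment.
    """
    counts = Counter(link for links in link_sets for link in links)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sentence pair whose links are identical in every resample yields a low entropy, while a pair whose links shift from resample to resample yields a higher one.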
There is a need for an automatic word alignment system that improves upon traditional alignment techniques, for instance, for the purpose of creating corpora that are more representative of hand-aligned corpora.