This specification relates to machine learning.
Manual translation of text by a human operator can be time-consuming and costly. One goal of machine translation is to automatically translate text in a source language to corresponding text in a target language. There are several different approaches to machine translation, including example-based machine translation and statistical machine translation. Statistical machine translation attempts to identify the most probable translation in a target language given a particular input in a source language. For example, when translating a sentence from French to English, statistical machine translation identifies the most probable English sentence given the French sentence.
A commonly used training technique in statistical machine translation is the Minimum Error Rate Training (MERT) technique. The MERT technique is described, for example, in Franz Josef Och, “Minimum Error Rate Training in Statistical Machine Translation,” Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, July 2003.
Many conventional statistical machine translation systems use the MERT technique. The MERT technique trains parameters for a linear statistical machine translation model directly with respect to automatic evaluation metrics, i.e., metrics that do not require human evaluation, which can be time-consuming. Some examples of automatic evaluation metrics include word error rate, position independent error rate, National Institute of Standards and Technology (NIST) score, and Bilingual Evaluation Understudy (BLEU) score.
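By way of illustration only, two of the automatic evaluation metrics named above can be sketched as follows. The function names, the whitespace tokenization, and the single-reference setting are assumptions made for this sketch. The position independent error rate shown is one common formulation among several, and the sketch is not the scoring code of any particular system.

```python
from collections import Counter


def word_error_rate(hypothesis, reference):
    """Word-level edit distance divided by the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(hypothesis, reference):
    """Bag-of-words error rate that ignores word order (assumed formulation)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    errors = max(len(hyp), len(ref)) - matches
    return errors / len(ref)
```

In this sketch, a hypothesis that contains exactly the reference words in a different order has a position independent error rate of zero but a nonzero word error rate, which is the distinction between the two metrics.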
The MERT technique directly optimizes the objective function of interest and thereby avoids making approximations of other objective functions, for example, likelihood or margin. However, the MERT technique is generally efficient for training model parameters (i.e., weights) for only a relatively small number of feature functions (e.g., fewer than 20 or 30 feature functions). The MERT technique is slow if a large number of feature functions are considered, because the weight of only one feature function is updated at a time and each update involves iterating over the complete training corpus. Additionally, in the case of highly correlated features, the MERT technique tends to assign most of the weight to one of the correlated features, causing instability. Instability in the MERT technique occurs when different values of the initial weights result in very different final weights.
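The cost structure described above can be illustrated with a simplified sketch of a coordinate-wise search over the weights of a linear model. The sketch assumes that each source sentence has a fixed list of candidate translations, each represented by a feature vector and a precomputed error count. It tries a fixed grid of values for one weight at a time, whereas the technique described by Och uses an exact line search. The names `corpus_error` and `coordinate_search` and the toy data are hypothetical.

```python
def corpus_error(weights, nbest_lists):
    """Sum the errors of the top-scoring candidate for every sentence."""
    total = 0
    for candidates in nbest_lists:
        # Linear model: score = dot product of weights and feature values.
        best = max(
            candidates,
            key=lambda c: sum(w * f for w, f in zip(weights, c[0])),
        )
        total += best[1]
    return total


def coordinate_search(weights, nbest_lists, grid, sweeps=2):
    """Tune one weight at a time; each trial rescans the whole corpus."""
    weights = list(weights)
    for _ in range(sweeps):
        for k in range(len(weights)):
            best_value = weights[k]
            best_error = corpus_error(weights, nbest_lists)
            for value in grid:
                trial = weights[:k] + [value] + weights[k + 1:]
                error = corpus_error(trial, nbest_lists)
                if error < best_error:
                    best_value, best_error = value, error
            weights[k] = best_value
    return weights


# Toy data (hypothetical): two sentences, two candidates each,
# given as (feature vector, error count) pairs.
toy_nbest = [
    [((1.0, 0.0), 2), ((0.0, 1.0), 0)],
    [((0.5, 0.2), 1), ((0.1, 0.9), 0)],
]
initial = [1.0, 0.0]
tuned = coordinate_search(initial, toy_nbest, grid=[-1.0, -0.5, 0.0, 0.5, 1.0])
print(corpus_error(initial, toy_nbest), corpus_error(tuned, toy_nbest))
```

In this sketch, every trial value for every weight calls `corpus_error`, which visits all sentences, so the work per sweep grows with the number of feature functions multiplied by the corpus size. The result also depends on the starting weights and on the order in which the weights are visited.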