The exemplary embodiment relates to machine translation and finds particular application in connection with a system and method for incrementally training a reordering model for a statistical machine translation (SMT) system when new data becomes available.
Machine translation systems are trained on pairs of source and target language texts in which each target text is assumed to be a translation of the corresponding source text, at least in the source-to-target direction. Statistical machine translation systems rely on the availability of parallel corpora, in particular corpora of the target domain. Parallel data for training SMT models is constantly being generated by both professional and casual translators. In-domain parallel data may become available, for example, as users of the SMT system post-edit the automatic translations. It is often desirable to incorporate new data into the SMT model as soon as possible. This is particularly a concern for Computer Assisted Translation (CAT) systems, where it is advantageous to reflect manual corrections promptly to avoid repeating translation errors that have already been corrected.
Typically, large amounts of parallel data are employed to produce good SMT models, and training a model is an expensive process in terms of time and computational resources. Retraining the entire system at frequent intervals is thus often not feasible, leading to long lags between system updates. Approaches for incrementally updating an SMT system, given new parallel data, instead of retraining it, have therefore been sought.
Typical phrase-based SMT systems use a log-linear combination of various features that primarily represent three models: a translation model (TM), responsible for the selection of a target phrase for each source phrase; a language model (LM), addressing target language fluency; and a reordering model (RM). The reordering model takes into account that different languages follow different syntactic orderings. For example, adjectives in English precede the noun, while they typically follow the noun in French (the blue sky vs. le ciel bleu). In Modern Standard Arabic, the verb precedes the subject, and in Japanese the verb comes last. As a result, source language phrases often cannot be translated and placed in the generated target language translation in the same order as in the source text, and phrase movements have to be considered. The data available for estimating the exact distance of movement for each phrase tends to be too sparse. Instead, the lexicalized reordering model estimates phrase movements using only a few reordering types, such as a monotone order (mono), where the order is preserved, and a swap, where the order of two consecutive source phrases is inverted when their translations are placed on the target side. See, for example, Philipp Koehn, “Statistical machine translation,” Cambridge University Press (2009), hereinafter, Koehn 2009.
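By way of illustration only, the reordering types described above can be sketched in Python as follows. The representation of source spans as inclusive word-index pairs, the function names orientation and count_orientations, and the catch-all "other" type for movements that are neither mono nor swap are assumptions made for this sketch; they are not taken from any particular SMT toolkit.

```python
from collections import Counter, defaultdict


def orientation(prev_span, cur_span):
    """Classify the movement between two phrases that are consecutive
    on the target side, given their (start, end) source word spans."""
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end + 1:
        return "mono"   # source order is preserved
    if cur_end == prev_start - 1:
        return "swap"   # the two source phrases are inverted
    return "other"      # assumed catch-all for any other movement


def count_orientations(sentences):
    """Count reordering types per bi-phrase.

    Each sentence is a list of (biphrase, source_span) items given in
    target order, where biphrase is a (source, target) pair.
    """
    counts = defaultdict(Counter)
    for phrases in sentences:
        prev_span = (-1, -1)  # virtual start, so a phrase at index 0 is mono
        for biphrase, span in phrases:
            counts[biphrase][orientation(prev_span, span)] += 1
            prev_span = span
    return counts


# "the blue sky" -> "le ciel bleu", segmented word by word for illustration.
example = [
    (("the", "le"), (0, 0)),
    (("sky", "ciel"), (2, 2)),
    (("blue", "bleu"), (1, 1)),
]
counts = count_orientations([example])
```

In this sketch, normalizing the counts collected for each bi-phrase would give relative-frequency estimates of its reordering types; smoothing is omitted for brevity.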
Incremental training for SMT systems has mainly focused on updating the alignment of the parallel data, the most time-consuming step in SMT training. The alignment probabilities are needed for generating the translation model and the reordering model. GIZA++ is probably the best known alignment tool, and is also the tool used in the Moses translation system. See Franz Josef Och, et al., “A systematic comparison of various statistical alignment models,” Computational Linguistics, 29(1):19-51 (2003). However, even with its multi-threaded version, MGIZA++ (see, Qin Gao, et al., “Parallel implementations of word alignment tool,” Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, ACL (2008)), alignment remains the longest step in SMT model generation. GIZA++, like other alignment tools, uses the Expectation Maximization (EM) algorithm to learn the alignment and translation probabilities simultaneously (see, e.g., A. P. Dempster et al., “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statistical Society, Series B, 39(1):1-38 (1977); Peter F. Brown, et al., “The mathematics of statistical machine translation: parameter estimation,” Computational Linguistics, 19(2):263-311 (June 1993), hereinafter, Brown 1993).
EM generally relies on having all the data available in advance. When incremental updates to the SMT system are desired, online EM can be used to update the model parameters every time a new data point (e.g., a sentence pair) is introduced. This makes it feasible to perform more frequent updates with recent data. Several variants of online EM have been proposed. See, Percy Liang, et al., “Online EM for unsupervised models,” Proc. Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 611-619 (2009); Olivier Cappé et al., “On-line expectation-maximization algorithm for latent data models,” J. Royal Statistical Society: Series B (Statistical Methodology), 71(3):593-613 (2009). For example, stepwise EM has been used for updating the parameters of the translation and alignment models. See, Abby Levenberg, et al., “Stream-based translation models for statistical machine translation,” Proc. Human Language Technologies: The 2010 Annual Conf. of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 394-402 (2010). In this approach, using IBM Model 1 (Brown 1993) with HMM alignments (Stephan Vogel, et al., “HMM-based word alignment in statistical translation,” Proc. COLING, pp. 836-841 (1996)), counts for translations and alignments are collected and updated by interpolating the statistics of the old and the new data. Rather than updating the alignment model for each data point, updates are performed for a set of bi-sentences, referred to as a mini-batch.
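The interpolation of old and new statistics over a mini-batch can be sketched as follows. This is a simplified illustration rather than the procedure of any cited work: the dictionary representation of expected counts, the function names, and the decaying step-size schedule (k + 2)^(-alpha), which is commonly used in the stepwise EM literature, are assumptions of the sketch, and the E-step that would produce the mini-batch counts is omitted.

```python
def step_size(k, alpha=0.7):
    """Assumed decaying step size for the k-th mini-batch (k = 0, 1, ...)."""
    return (k + 2) ** (-alpha)


def interpolate_counts(old, new, eta):
    """Interpolate old statistics with those of a new mini-batch.

    old and new map (source_word, target_word) pairs to expected counts;
    eta is the weight given to the new mini-batch.
    """
    keys = set(old) | set(new)
    return {
        key: (1.0 - eta) * old.get(key, 0.0) + eta * new.get(key, 0.0)
        for key in keys
    }


def normalize(counts):
    """Turn interpolated counts into probabilities p(target | source)."""
    totals = {}
    for (src, _tgt), c in counts.items():
        totals[src] = totals.get(src, 0.0) + c
    return {
        (src, tgt): c / totals[src]
        for (src, tgt), c in counts.items()
        if totals[src] > 0.0
    }


# Hypothetical expected counts before and after a new mini-batch arrives.
old_counts = {("sky", "ciel"): 1.0}
batch_counts = {("sky", "ciel"): 1.0, ("sky", "cieux"): 1.0}
updated = interpolate_counts(old_counts, batch_counts, eta=step_size(0))
probs = normalize(updated)
```

Because only the statistics are stored and interpolated, the earlier data does not need to be revisited when a new mini-batch is processed.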
Once the alignments have been updated, it is possible to create new data structures for the translation, reordering and language models based on the entire data, which is a faster process than retraining.
Forced alignment is a technique for aligning new data using an existing model. See, Qin Gao, et al., “A semi-supervised word alignment algorithm with partial manual alignments,” Proc. Joint 5th Workshop on Statistical Machine Translation and Metrics MATR, WMT '10, pp. 1-10, ACL (2010). This enables adding the source and its translation as additional training material. It does not, however, make any updates to the model itself.
An alternative to incrementally updating alignments, referred to as Quick Updates, is to create separate models from the new data and add them as additional features, alongside the previous models, in the complete SMT log-linear model. This approach allows even faster updates, and in some settings yields comparable results to retraining the SMT model. See, Shachar Mirkin and Nicola Cancedda, “Assessing quick update methods of statistical translation models,” Proc. 10th International Workshop on Spoken Language Translation (IWSLT 2013), pp. 264-271 (December 2013). However, while multiple translation models and language models can be used, Moses, currently the most common SMT system, supports only a single reordering model. See, Philipp Koehn, et al., “Moses: Open source toolkit for statistical machine translation,” Proc. ACL Demo and Poster Sessions (2007). Hence, while it is possible to create small TMs and LMs quickly from the new data, this is not possible for the reordering model, leading to suboptimal results when the reordering model is not updated. In particular, bi-phrases absent from the reordering model receive a default score.
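The following Python sketch illustrates, under simplifying assumptions, how separate models can be combined log-linearly and how a bi-phrase that is absent from a model falls back to a default score. The dictionary-based models, the feature weights, the value of DEFAULT_PROB, and the function names are hypothetical and do not reflect the internal data structures or scoring of Moses.

```python
import math

DEFAULT_PROB = 1e-3  # hypothetical default probability for unseen bi-phrases


def feature_log_prob(model, biphrase, default=DEFAULT_PROB):
    """Log-probability of a bi-phrase under one model, with a default
    value when the model does not contain the bi-phrase."""
    return math.log(model.get(biphrase, default))


def log_linear_score(biphrase, models, weights):
    """Weighted sum of the log-probabilities given by each model."""
    return sum(
        weight * feature_log_prob(model, biphrase)
        for model, weight in zip(models, weights)
    )


# Hypothetical large model from the original data and a small model
# built quickly from newly available data.
old_tm = {("blue sky", "ciel bleu"): 0.5}
new_tm = {("blue sky", "ciel bleu"): 0.25, ("grey sky", "ciel gris"): 0.5}
weights = [1.0, 1.0]  # in practice, weights would be tuned on held-out data

seen_score = log_linear_score(("blue sky", "ciel bleu"), [old_tm, new_tm], weights)
new_only_score = log_linear_score(("grey sky", "ciel gris"), [old_tm, new_tm], weights)
```

In this sketch, a bi-phrase that appears only in the new data still receives an informed score from the small model, whereas a single model that has not been updated can offer only the default value for it.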
There remains a need for a system and method which allow for incremental updates of the reordering model within an SMT system, such as the Moses system.