Field
The present application relates to a novel machine translation system. More specifically, the present application relates to a system and method for training the machine translation system. The present application also relates to a system and method for online translation.
Related Art
Machine translation (MT), also known as automated translation, refers to a process that uses computers to translate text or speech from one language (“source language”) to another (“target language”). At a basic level, machine translation substitutes words in one language for words in another, but substitution alone usually cannot produce a good translation of a text. To achieve a good translation, the system needs to recognize whole phrases or complete sentences and find their closest counterparts in the target language.
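The limitation of word-level substitution can be illustrated with a minimal Python sketch. The toy lexicons, the example sentence, and the function names below are assumptions made purely for illustration; they do not describe any particular MT system.

```python
# Toy lexicons; every entry is an illustrative assumption.
WORD_TABLE = {
    "la": "the", "pomme": "apple", "de": "of",
    "terre": "earth", "est": "is", "chaude": "hot",
}
PHRASE_TABLE = {("pomme", "de", "terre"): "potato"}


def translate_word_by_word(tokens):
    """Replace each source word independently of its neighbors."""
    return [WORD_TABLE.get(tok, tok) for tok in tokens]


def translate_with_phrases(tokens, max_len=3):
    """Greedily match the longest known phrase, falling back to single words."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in PHRASE_TABLE:
                out.append(PHRASE_TABLE[span])
                i += n
                break
        else:
            # No phrase matched at this position; translate one word.
            out.append(WORD_TABLE.get(tokens[i], tokens[i]))
            i += 1
    return out


source = "la pomme de terre est chaude".split()
word_level = " ".join(translate_word_by_word(source))
phrase_level = " ".join(translate_with_phrases(source))
print(word_level)    # the apple of earth is hot
print(phrase_level)  # the potato is hot
```

In this sketch, the word-level output mistranslates the French phrase "pomme de terre", whereas matching the whole phrase yields the intended word "potato".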
Statistical machine translation (SMT) is the most widely studied and used MT method. For example, Google® Translate implements SMT. In SMT, translations are generated on the basis of statistical models whose parameters are derived from analysis of bilingual text corpora. The various statistical models, including the word-alignment model, the phrase-extraction model, and the language model, are basic components of SMT technology. To build a good SMT system, a large amount of training data is needed to train each of these basic components. Similarly, upgrading an SMT system can involve repeated training of these basic components.
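As a hedged illustration of how model parameters can be derived from a bilingual corpus, the following Python sketch estimates phrase translation probabilities by relative frequency. The phrase pairs are hypothetical and are assumed to have already been extracted from word-aligned text; the function name is likewise an assumption, and production SMT systems use considerably more elaborate estimation.

```python
from collections import Counter

# Hypothetical (source phrase, target phrase) pairs, assumed to come
# from an earlier word-alignment and phrase-extraction step.
phrase_pairs = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("chat", "cat"),
]


def estimate_translation_probs(pairs):
    """Relative-frequency estimate: P(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


probs = estimate_translation_probs(phrase_pairs)
for (src, tgt), p in sorted(probs.items()):
    print(f"P({tgt} | {src}) = {p:.3f}")
```

Because each estimate depends on counts gathered over the whole corpus, the cost of this step grows with the amount of training data, which is why larger corpora lead to longer training runs.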
In conventional approaches, MT training is typically performed on a single standalone machine. However, accurate translation often depends on multiple training runs over a large amount of training data, which makes training on a single machine inefficient. For example, a good MT system may need a training corpus containing well beyond 10 million sentences. A complete training run of the MT system on a server with 32 cores and 128 GB of RAM can require up to 60 hours. Because upgrading a commercial MT system can involve multiple iterations of training runs and tests, offline training of the various models (e.g., the word-alignment model, the phrase-extraction model, the language model, etc.) can become the bottleneck for upgrading the MT system.
Moreover, the training results produced on a single machine are often loaded into the memory of one machine so that they can be queried during the online translation process. However, loading the massive amount of data that makes up the training results into a single machine can slow down queries and, thus, make translation less efficient.