Machine translations involve the translation of information from a source language to a destination language via a computing device. Machine translations may be used to translate, for example, advertisements, government documents, academic works, text messages and emails, social networking posts, recordings of spoken language, and numerous other works.
There may be more than one possible way to translate a word, phrase, or sentence into the destination language. Although each of these possible translations may be correct in certain circumstances, some translations may not make sense in the context of the full translation. For example, assume that the phrase “very good” is translated into German. The word “very” is typically translated as “sehr.” However, the word “good” may be translated in different ways depending on how it is used. For example, the “good” in “good morning” is typically translated as “guten,” whereas the “good” in “that food is good” may be translated as “gut.” In this case, both “gut” and “guten” are reasonable translations of the word “good,” but “sehr gut” is preferable to “sehr guten” as a translation of “very good.”
Thus, multiple different translations may be generated for given source material. However, some possible translations may be more correct or more favored than others. Identifying which translations are favored and communicating this information in a way that a machine translation system can consistently apply can be a difficult and time-consuming process.
One traditional technique for improving a machine translation involves the use of the Bilingual Evaluation Understudy (BLEU) score. In this technique, a segment such as a sentence or phrase is translated by a machine into a destination language. The machine-generated translation is compared to one or more reference translations, typically good-quality translations prepared by a human. A score between 0 and 1 is assigned to the machine translation based on how well it approximates the reference translations. When training a translation model for a machine translation system, the translation model may be evaluated in view of the BLEU score calculated over multiple translations, and modified to improve its BLEU-score-based performance.
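The comparison described above can be illustrated with a minimal sketch of a simplified, sentence-level BLEU-style score. It assumes whitespace tokenization, no smoothing, and uniform n-gram weights. The names `ngrams`, `simple_bleu`, and `max_n` are hypothetical and chosen only for this illustration; this is not the implementation used by any particular translation system.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=4):
    """Simplified BLEU-style score in [0, 1] for one candidate segment.

    Assumptions (illustrative only): whitespace tokenization, no smoothing,
    and equal weight for every n-gram order from 1 to max_n.
    """
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            # Candidate is shorter than n tokens; no n-grams to score.
            return 0.0
        # For each n-gram, find its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            # Without smoothing, one zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference length closest to the candidate
    # length (ties broken toward the shorter reference).
    closest = min((len(r) for r in refs),
                  key=lambda ref_len: (abs(ref_len - len(cand)), ref_len))
    if len(cand) > closest:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - closest / len(cand))

    # Geometric mean of the modified n-gram precisions.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("sehr gut", ["sehr gut"], max_n=2))
    print(simple_bleu("sehr guten", ["sehr gut"], max_n=2))
```

Under these assumptions, scoring “sehr gut” against the single reference “sehr gut” with `max_n=2` yields 1.0, while “sehr guten” yields 0.0 because its only bigram does not appear in the reference. In practice, BLEU is usually computed over a whole corpus of segments rather than a single segment, which softens this all-or-nothing behavior.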
The BLEU score remains the industry standard in evaluating machine translations. However, several problems exist with techniques that rely on the BLEU score. One problem is that evaluations can be expensive to run. In order to accommodate different translations that are nevertheless correct, multiple reference translations may be used. Because each reference translation is typically generated by a human, producing these reference translations can be expensive and time-consuming. Moreover, there are questions as to how well the BLEU score measures translation quality. Among other issues, the BLEU score may not accurately capture sentence-level meaning, does not address grammatical correctness, and has difficulty evaluating translations involving languages that lack clear word-level boundaries.