The present invention relates generally to machine translators for performing translations of natural languages and more particularly to a method and apparatus for evaluating the quality of machine translation of natural languages performed by machine translators.
Efficient and effective development, selection, and/or maintenance of machine translation systems require some quantitative measure for evaluating their performance relative to a reference translation. Such a quantitative measure of machine translation performance may be used by a system developer for tuning a machine translation system being developed or maintained, by a user who requires some measure of performance for choosing between existing machine translators, or by a machine translation system itself to self-tune its internal system parameters and thereby improve future translation performance.
An example of a similarity measure frequently used to score translations produced by a machine is the IBM BLEU score (which is described in detail in the publication by Papineni et al., entitled “Bleu: a Method for Automatic Evaluation of Machine Translation”, published in IBM Research Report RC22176 (W0109-022), Sep. 17, 2001). The IBM BLEU score is based on counts of contiguous word n-grams common to the two sequences of symbols, where one sequence is a target translation and the other sequence is a reference translation.
More specifically, letting “c” be the length (in words) of the machine translation, and letting “r” be the length of the reference translation, a length penalty (LP) may be defined by:
  LP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{otherwise,} \end{cases}

and the BLEU score is then defined as:
  BLEU = LP \cdot \prod_{n=1}^{N} p_n^{w_n},

where p_n is the n-gram precision and w_n is the n-gram weight, which is usually set to 1/N. The BLEU score provides that the n-gram precision equals the fraction of contiguous n-grams in the machine translation that match some n-gram in the reference translation. Intuitively, the BLEU similarity measure is assessed in terms of the number of shared contiguous n-grams of length up to some fixed N between a machine translation and a reference translation. A multiplicative correction factor, the length penalty LP, is applied to penalize short translations.
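The definitions above can be sketched in code. The following is a minimal illustration for a single candidate/reference pair (the published BLEU metric also supports multiple references and corpus-level aggregation, which are omitted here); the function names `ngrams`, `modified_precision`, and `bleu` are illustrative, not part of any standard API:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams matched in the reference.

    Candidate counts are clipped by reference counts, so a repeated
    n-gram cannot be credited more times than it occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """BLEU = LP * prod_{n=1}^{N} p_n^{1/N}, per the formulas above."""
    c, r = len(candidate), len(reference)
    # Length penalty: 1 if the candidate is longer than the reference,
    # e^{1 - r/c} otherwise.
    lp = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # any zero precision drives the geometric mean to zero
    # Geometric mean with uniform weights w_n = 1/N, computed in log space.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return lp * math.exp(log_avg)
```

For example, a candidate identical to the reference scores 1.0, while a candidate sharing no words with the reference scores 0.0.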
Notwithstanding existing methods for scoring the quality of machine translations, such as the BLEU score, there continues to exist a need to provide improved measures of machine translation performance. Advantageously, with improved measures of machine translation performance (i.e., the accuracy with which the measure reflects the perceived quality of translations a machine translation system produces), machine translation systems may be developed and/or self-tuned to produce translations with improved quality.