1. Field of the Invention
The present invention generally relates to machine translation, and specifically relates to phrase alignment in machine translation.
2. Description of the Related Art
Statistical machine translation has been attracting attention as a framework for statistically extracting translation knowledge from a large corpus of bilingual sentence pairs and realizing highly accurate machine translation at minimal labor cost.
In particular, phrase-based statistical machine translation has been proposed as a method that addresses the weaknesses of word-based statistical machine translation such as the IBM model. Phrase-based statistical machine translation uses a phrase, instead of a word, as the translation unit. The advantages of phrase-based statistical machine translation over word-based statistical machine translation have been reported in, for example, Japanese Patent Laid-Open No. 2005-25474; “Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models”, Philipp Koehn, AMTA 2004; and “The alignment template approach to statistical machine translation”, Franz-Josef Och and Hermann Ney, Computational Linguistics, 30(4), pp. 417-449, 2004.
In conventional phrase-based statistical machine translation, the correspondence between words in a bilingual sentence pair is first acquired using a framework for word-based statistical machine translation such as the IBM model, and all candidate phrase pairs that are consistent with that correspondence are then stored in a bilingual phrase table.
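The conventional extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited systems: the function name extract_phrase_pairs, the max_len limit, and the representation of the word correspondence as a set of (source index, target index) pairs are all introduced here for illustration. A phrase pair is treated as agreeing with the correspondence when it contains at least one word link and no link connects a word inside the pair to a word outside it.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 4,  # hypothetical cap on phrase length, in words
) -> List[Tuple[str, str]]:
    """Enumerate every phrase pair consistent with a word alignment.

    `alignment` holds (source_index, target_index) word links, assumed to
    come from a word-based model. Brute force, for clarity only.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    links_inside = 0
                    consistent = True
                    for i, j in alignment:
                        in_src = i1 <= i <= i2
                        in_tgt = j1 <= j <= j2
                        if in_src and in_tgt:
                            links_inside += 1
                        elif in_src or in_tgt:
                            # A link crosses the phrase boundary.
                            consistent = False
                            break
                    if consistent and links_inside > 0:
                        pairs.append(
                            (" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
                        )
    return pairs
```

For the two-word pair "das Haus" / "the house" with links {(0, 0), (1, 1)}, this sketch yields the pairs ("das", "the"), ("das Haus", "the house"), and ("Haus", "house"). Every span that agrees with the word links is kept, regardless of whether it forms a linguistically meaningful unit.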
To improve the accuracy of phrase-based statistical machine translation, there is a need for a phrase alignment technology that can automatically extract better-matching bilingual phrase pairs from bilingual sentence pairs. This need is not limited to machine translation. Even in the field of translation aid systems that support manual translation work, there is a need to extract better-matching, or linguistically better motivated, bilingual phrases, because phrase-based translation supports human translation work most efficiently.
However, the conventional technology does not take linguistic information into account, so a phrase in a phrase pair acquired with the conventional technology is a simple word string that sometimes has no tangible meaning. Moreover, bilingual phrases that are specific to the field of the documents from which they were extracted often cannot be transferred effectively from one field to another.
One approach could be to introduce linguistic knowledge by using syntactic analysis in phrase-based statistical machine translation. However, such syntactic analysis relies on an existing word-based syntactic analyzer that works independently of word combinations and word translation probabilities. Therefore, an error in the syntactic analysis leads to erroneous extraction, or to missed extraction, of bilingual phrase pairs.