1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for performing machine translation on a sentence input in a source language to obtain a sentence in a target language.
2. Description of the Related Art
Machine translation technologies have been developed to automatically translate an expression in one language into a semantically equivalent expression in another language, for example between Japanese and English. Machine translation systems are widely used, especially for written text. The technologies used to realize machine translation can be divided into two broad types: rule-based translation and corpus-based translation.
In the rule-based translation, rules describe the grammar and vocabulary information of each language, together with the correspondences in vocabulary and sentence structure between the two languages dealt with in the translation, so that the conversion between the languages can be implemented according to the rules.
The development of rules and dictionaries for use in the rule-based translation requires not only a high level of knowledge of both the source language and the target language of the translation but also a high level of knowledge of the semantic and grammatical relationships between these languages. Furthermore, because of the infinite diversity of languages, the rule development requires enormous amounts of time and exhaustive work based on this high-level knowledge. In addition, such rule development needs to be performed for each pair of source and target languages. A further problem is that the outcome of the translation tends to be mechanical and unnatural, because infinitely variable sentences are translated based on a finite number of rules.
As a solution to such problems in the rule-based translation, corpus-based translation is widely applied. In the corpus-based translation, a large number of examples of expression pairs in two languages that are semantically equivalent to each other are collected, and the language conversion is performed with reference to the collected examples. Systems such as translation memory (TM), example-based machine translation (EBMT), and stochastic machine translation (SMT) are well known as forms of the corpus-based translation.
The TM system searches for example pairs that include the same expression in the source language as the one that is input, and outputs a translation of the expression. The EBMT system searches for example pairs including an expression in the source language that is similar to the one that is input, and obtains a semantically equivalent expression in the target language, based on the translations of the searched examples. The SMT system obtains a translation of an expression input in the source language, based on statistical information from massive example data that has become available.
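The difference between the exact-match search of the TM system and the similarity search of the EBMT system can be illustrated with a minimal Python sketch. The example pairs, the function names tm_lookup and ebmt_retrieve, and the word-level similarity measure are all assumptions made for illustration only; they are not taken from any particular system. The target-side strings are English glosses in brackets standing in for target-language sentences, and only the retrieval step of EBMT is shown, not the subsequent adaptation of the retrieved translation.

```python
from difflib import SequenceMatcher

# Hypothetical example pairs (source expression, target expression).
# Target strings are bracketed glosses standing in for real translations.
EXAMPLES = [
    ("where is the station", "[where is the station]"),
    ("where is the hotel", "[where is the hotel]"),
    ("thank you very much", "[thank you very much]"),
]

def tm_lookup(sentence, examples=EXAMPLES):
    """TM-style search: succeed only when the source side matches exactly."""
    for src, tgt in examples:
        if src == sentence:
            return tgt
    return None

def ebmt_retrieve(sentence, examples=EXAMPLES):
    """EBMT-style retrieval: return the example pair whose source side is
    most similar to the input, with its word-level similarity score."""
    def score(pair):
        return SequenceMatcher(None, sentence.split(), pair[0].split()).ratio()
    best = max(examples, key=score)
    return best, score(best)

if __name__ == "__main__":
    query = "where is the old station"
    print(tm_lookup(query))      # no exact match in the examples
    print(ebmt_retrieve(query))  # closest example pair and its score
```

In this sketch the TM-style lookup finds nothing for an input that differs from every stored example by a single word, whereas the EBMT-style retrieval still returns the closest example pair, whose translation would then serve as the basis for the output.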
The corpus-based translation is advantageous in that the translated outcome is natural and reliable, and in that development is relatively easy, which facilitates application to multiple languages. In the following description, the EBMT system is used as a typical example of the corpus-based translation unless otherwise specified.
In relation to the corpus-based translation, JP-A 2002-7392 (KOKAI) suggests a technology of setting a source language pattern and a target language pattern in accordance with a translation direction so that patterns do not have to be created for each translation language.
It should be noted that, when words are expressed, the meaning of the words can be interpreted not from the expression (literal sense of the words, or a string of characters) only, but from a combination of the expression and the situation in which the words are expressed.
It is this aspect of words that enhances the efficiency of words as a communication tool, with one word having various meanings depending on situations. The situation may include the standpoints, roles and relationship of a speaker and a listener, or the place, time, objects surrounding them, and already established conditions, and moreover, the knowledge and beliefs of the speaker and the listener, their knowledge and beliefs about each other, and many other factors.
For this reason, the expressions paired in an example used in the corpus-based translation can be considered as having an equivalent meaning only under a limited situation that is specific to each translation pair.
In most cases, however, translation example pairs in the corpus-based translation contain only the words of the examples in the different languages, that is, the “expressions” themselves, and the information on the situation in which such expressions are made is not included.
Moreover, because the corpus-based translation requires a massive corpus of translations, it is difficult to exclude from the corpus example pairs that are used only in a particular context (situation), or example pairs containing freely translated phrases or fixed phrases such as fable-based and idiomatic phrases.
According to the conventional corpus-based translation technologies as described in JP-A 2002-7392 (KOKAI), an example pair is selected in consideration of similarity in phrases only, regardless of the situation where the words are uttered, which sometimes results in a translated sentence that is not semantically equivalent. In other words, the outcome of the translation may be unnatural or incorrect, and naturalness and high reliability that are supposed to be advantages of the corpus-based translation may not be attained.
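The failure described above can be sketched in Python as follows. The example pairs, the function name select_by_phrase_similarity, and the similarity measure are hypothetical and are not taken from JP-A 2002-7392 (KOKAI); the target-side strings are bracketed English glosses standing in for target-language sentences. The sketch assumes an idiomatic example pair that is valid only in a particular situation and a selector that considers phrase similarity alone.

```python
from difflib import SequenceMatcher

# Hypothetical example pairs. The situation in which each pair is valid is
# noted only in a comment, because the pairs themselves carry no such data.
EXAMPLES = [
    ("break a leg", "[good luck]"),                   # said to a performer before a show
    ("he broke his arm", "[he fractured his arm]"),   # literal injury report
]

def select_by_phrase_similarity(sentence, examples=EXAMPLES):
    """Select the example pair by word-level similarity of the source side
    only, without regard to the situation of the utterance."""
    def score(pair):
        return SequenceMatcher(None, sentence.split(), pair[0].split()).ratio()
    return max(examples, key=score)

if __name__ == "__main__":
    # Input uttered after a skiing accident, where the literal sense is meant.
    print(select_by_phrase_similarity("did you break a leg"))
```

Here the idiomatic pair is selected because its source side overlaps most with the input, even though the input was uttered in a situation where the idiomatic translation is not semantically equivalent.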
To solve this problem, a method of automatically detecting part of the situation of an utterance, or of attaching part of the situation to the example pairs in advance, may be considered. However, it is very difficult to perform these operations mechanically. Partial information on the situation may be attached to the example pairs manually, but this undermines the ease of development that is an advantage of the corpus-based translation.