The present invention relates to a method for automatically extracting translation pairs of words to be registered in a bilingual dictionary which is used for a machine translation system or the like.
Machine translation systems require a bilingual dictionary that contains pairs of a source language word and its equivalent target language word. Such a bilingual dictionary must have sufficient word coverage in order to achieve high-quality translation.
Manufacturers of machine translation systems generally provide bilingual dictionaries of basic words, whereas system users need to create bilingual dictionaries of technical terms. It is expensive to create a technical term dictionary manually, and therefore a method for automatically extracting translation pairs of words from a bilingual pair of texts is desired. A bilingual dictionary of technical terms is indispensable not only for a machine translation system but also for a cross-language information retrieval system, and there is strong demand for automatic generation of bilingual dictionaries.
A method of automatically generating a bilingual dictionary from a bilingual pair of texts is disclosed, for example, in Japanese patent publication JP-A-Hei-7-28819 (hereinafter called "first prior art"). This method uses a pair of a source language text and a target language text which are aligned sentence by sentence. It counts how often each pair of a source language word and a target language word occurs together in pairs of aligned sentences. It also counts how often each word occurs in the source and target language texts. It calculates a correlation between each pair of a source language word and a target language word based on these frequencies, and selects pairs of words with high correlation as translation pairs of words.
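The frequency-based selection described above can be sketched as follows. This is only an illustrative sketch, not the procedure of the cited publication: the correlation measure is not specified here, so the Dice coefficient is assumed, words are assumed to be separated by whitespace, each word is counted once per sentence, and the function name `extract_pairs` and the `threshold` parameter are hypothetical.

```python
from collections import Counter
from itertools import product


def extract_pairs(aligned_sentences, threshold=0.9):
    """Select word pairs whose co-occurrence correlation is high.

    aligned_sentences: list of (source_sentence, target_sentence) tuples,
    already aligned sentence by sentence.
    The Dice coefficient used below is an assumed correlation measure.
    """
    src_freq = Counter()  # number of sentences containing each source word
    tgt_freq = Counter()  # number of sentences containing each target word
    co_freq = Counter()   # number of aligned sentence pairs containing both

    for src_sentence, tgt_sentence in aligned_sentences:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        co_freq.update(product(src_words, tgt_words))

    pairs = []
    for (src, tgt), together in co_freq.items():
        # Dice coefficient: 2 * joint count / (sum of individual counts)
        score = 2 * together / (src_freq[src] + tgt_freq[tgt])
        if score >= threshold:
            pairs.append((src, tgt, score))
    # Highest correlation first; ties broken alphabetically
    return sorted(pairs, key=lambda p: (-p[2], p[0], p[1]))


if __name__ == "__main__":
    corpus = [
        ("neko ga hashiru", "cat runs"),
        ("inu ga hashiru", "dog runs"),
        ("neko ga neru", "cat sleeps"),
    ]
    for src, tgt, score in extract_pairs(corpus):
        print(src, tgt, round(score, 2))
```

Note that every count in this sketch depends on knowing which target sentence corresponds to which source sentence, which is why the approach needs sentence-aligned input.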
Conventional methods, including the one described in the above-mentioned patent publication JP-A-Hei-7-28819, require a bilingual pair of texts that are aligned sentence by sentence. However, the bilingual pairs of texts that are usually available are not aligned sentence by sentence. They are merely translations of each other as a whole. On this account, the conventional methods require sentence-by-sentence alignment of a bilingual pair of texts before translation pairs of words can be extracted from it. This task, if carried out manually, is very expensive.
In this situation, studies are under way with the intention of carrying out sentence-by-sentence alignment of a bilingual pair of texts by use of a computer, as described, for example, in an article entitled "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics, Vol. 19, No. 1, pp. 75-102 (March 1993) (hereinafter called "second prior art"). However, it is still impossible to carry out the sentence-by-sentence alignment of a bilingual pair of texts automatically with perfect accuracy, because one sentence in the source language text sometimes corresponds to two or more sentences in the target language text, and vice versa, and because a sentence in the source language text may have no counterpart in the target language text, and vice versa. On this account, human checking and correction of the result of computer-based sentence-by-sentence alignment are unavoidable, and therefore the conventional technique for generating a bilingual dictionary, even if used with the above-mentioned second prior art, is still costly.
To cope with this problem, studies are under way with the intention of generating a bilingual dictionary from a bilingual pair of texts that are not aligned sentence by sentence, as described, for example, in an article entitled "Extraction of Technical Term Bilingual Dictionary from Bilingual Corpus", by Y. Yamamoto and M. Sakamoto, published in Japanese in Technical Report of Information Processing Society of Japan, NL-94-12 (March 1993) (hereinafter called "third prior art"). Consulting a bilingual dictionary of simple words, this method extracts translation pairs of compound words, each made up of two or more simple words, from a pair of a source language text and a target language text. It selects a pair of a source language compound word and a target language compound word when the constituent words of the source language compound word can be coupled with those of the target language compound word through the bilingual dictionary of simple words.
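The coupling test described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited article: compound words are assumed to be already extracted and written as whitespace-separated simple words, the two compound words are assumed to have the same number of constituents, and the names `compounds_match`, `extract_compound_pairs`, and `simple_dict` are hypothetical.

```python
from itertools import permutations


def compounds_match(src_words, tgt_words, simple_dict):
    """Return True if every source constituent can be coupled one-to-one
    with a target constituent through the simple-word dictionary.

    simple_dict maps a source simple word to a set of target simple words.
    Requiring equal constituent counts is an assumption of this sketch.
    """
    if len(src_words) != len(tgt_words):
        return False
    # Compound words are short, so trying every ordering is affordable
    return any(
        all(tgt in simple_dict.get(src, ()) for src, tgt in zip(src_words, order))
        for order in permutations(tgt_words)
    )


def extract_compound_pairs(src_compounds, tgt_compounds, simple_dict):
    """Pair each source compound with every target compound it couples with."""
    return [
        (src, tgt)
        for src in src_compounds
        for tgt in tgt_compounds
        if compounds_match(src.split(), tgt.split(), simple_dict)
    ]


if __name__ == "__main__":
    simple_dict = {
        "kikai": {"machine"},
        "honyaku": {"translation"},
        "jisho": {"dictionary"},
    }
    src_compounds = ["kikai honyaku", "honyaku jisho"]
    tgt_compounds = [
        "machine translation",
        "translation dictionary",
        "bilingual dictionary",
    ]
    print(extract_compound_pairs(src_compounds, tgt_compounds, simple_dict))
```

Because the test relies only on the constituent words, no sentence alignment is needed, but a compound word whose constituents are missing from the simple-word dictionary cannot be paired.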