1. Field of the Invention
The present invention relates to a system for retrieving related sentences between sentences written in one language and sentences written in another language, and more particularly to a system using sentences written in still another language as intermediate language sentences.
2. Description of the Related Art
With the improvement of the performance of computers, development of electronic dictionaries and progress of technology in natural language processing among other things, many machine translation techniques have been proposed so far.
However, it is still difficult to affirm that a machine translation system with a translation capability of sufficient accuracy has been implemented.
According to the related art, a technique has been proposed in which a large number of pair data pieces, each consisting of an original language (first language) sentence and its translated language (second language) sentence, are prepared; sentences similar to a first language input sentence are retrieved from the first language sentences in the pair data; the second language sentences corresponding to the retrieved first language sentences are then outputted from the pair data; and these outputted second language sentences are presented to the user for reference, thereby improving the quality of translation of the first language input sentence.
As methods for obtaining sentences similar to the first language input sentence from the set of first language sentences in the pair data, two have been proposed: a method that determines a sentence of high similarity based on the number of words common to the input sentence and each sentence to be retrieved; and, as disclosed by Japanese Published Unexamined Patent Application No. 9-50435 (1997), a method based on the vector space model, one of the similar document retrieving methods, that determines as the sentence of high similarity a first language sentence having a vector close to the vector corresponding to the first language input sentence.
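The two similarity measures mentioned above can be illustrated with a minimal sketch. The tokenizer, the function names, and the use of raw term counts with cosine similarity are assumptions made for illustration; the cited application may weight and compare vectors differently.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization with edge punctuation stripped.
    words = (w.strip(".,!?;:").lower() for w in sentence.split())
    return [w for w in words if w]

def common_word_count(a, b):
    # Similarity as the number of distinct words shared by both sentences.
    return len(set(tokenize(a)) & set(tokenize(b)))

def cosine_similarity(a, b):
    # Vector space model: each sentence becomes a term-count vector,
    # and closeness is measured by the cosine of the angle between vectors.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def most_similar(query, sentences, score=cosine_similarity):
    # Return the stored sentence that scores highest against the query.
    return max(sentences, key=lambda s: score(query, s))

# Hypothetical first language sentences taken from pair data.
first_language_sentences = [
    "Please open the window.",
    "The train leaves at noon.",
    "Close the door quietly.",
]
best = most_similar("open the window", first_language_sentences)
```

Both measures depend entirely on surface words, so a stored sentence that expresses the same meaning with different vocabulary scores zero or near zero.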
To obtain the same effect, another method for improving the translation quality is under study, by which each word in a query written in a first language is mechanically converted into a word or a phrase of a second language by using a dictionary, a corresponding sentence(s) is then retrieved from a set of second language sentences by utilizing the set of converted words and/or phrases of the second language, and the obtained second language sentence(s) is presented to the user for reference.
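The dictionary-based approach can be sketched as follows. The toy English-to-Spanish dictionary, the sentences, and the function names are hypothetical, and scoring by substring match is an assumption chosen so that multi-word phrases can also match; it is not taken from any particular system.

```python
def convert_query(query, dictionary):
    # Mechanically replace each first language word with every second
    # language word or phrase the dictionary lists for it.
    converted = set()
    for word in query.lower().split():
        converted.update(dictionary.get(word, []))
    return converted

def retrieve(converted_terms, second_language_sentences):
    # Rank second language sentences by how many converted terms they contain.
    def score(sentence):
        text = sentence.lower()
        return sum(1 for term in converted_terms if term in text)
    ranked = sorted(second_language_sentences, key=score, reverse=True)
    return [s for s in ranked if score(s) > 0]

# Hypothetical bilingual dictionary and second language sentence set.
dictionary = {
    "open": ["abrir", "abra"],
    "window": ["ventana"],
    "door": ["puerta"],
}
second_language_sentences = [
    "Abra la ventana, por favor.",
    "Cierre la puerta.",
    "El tren sale a las doce.",
]
terms = convert_query("open window", dictionary)
hits = retrieve(terms, second_language_sentences)
```

The sketch makes the limitation visible: retrieval succeeds only when the dictionary happens to list the exact word form used in the stored sentence.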
However, the methods according to the related art, by which a similar first language sentence(s) is obtained merely on the basis of the words contained in the first language input sentence, involve the disadvantage that a second language sentence adequate as a translation of the first language input sentence, even if present in the set of pair data pieces, cannot be obtained as a result of the retrieval if the expression of the corresponding first language sentence in the sentence pair differs from that of the first language input sentence. Thus the methods can be effective only if the set of pair data contains a sentence composed of a set of words which are substantially the same as those contained in the first language input sentence.
This disadvantage becomes more noticeable as the number of words contained in the input sentence becomes smaller. Where a document including a large number of sentences is inputted, the number of non-zero elements of the corresponding document vector increases significantly (the substantial dimensionality of the vector is raised), and the reliability of the retrieval result is accordingly enhanced. In most cases, however, actual pair data consists mostly of short sentences, and it is therefore practically difficult to obtain adequate related sentences (second language sentences) to refer to.
Furthermore, whereas a second language sentence(s) to refer to is obtained according to the related art by replacing individual words in the input sentence in the first language with words and/or phrases in the second language by using a dictionary, words and phrases in the second language available for expressing a given word in the first language are extremely diverse and, moreover, the adequate choice of a second language word out of those many alternatives depends on the meaning of the first language input sentence, making it virtually impossible to determine the choice in advance. Therefore, it is difficult to express in advance the relationship of correspondence between first language words and second language words in a comprehensive dictionary form, and it is difficult to obtain an adequate related sentence to refer to.
In view of this problem, the present applicant has already proposed a cross-lingual retrieval system capable of retrieving, on the basis of a query in a first language, a second language sentence(s) which is a more adequate related sentence(s) by using pair data (Japanese Unexamined Patent Application No. Hei 10-202788 [1998]).
This cross-lingual retrieval system stores, in a paired sentence storing unit, plural pairs each consisting of a sentence in a first language and a corresponding translated sentence in a second language. When a query written in the first language is received from a query receiving unit, a first retrieval unit performs retrieval processing, according to the query, on the set of first language sentences stored in the paired sentence storing unit. Then a second retrieval unit performs retrieval processing on the set of second language translated sentences stored in the paired sentence storing unit to find sentences similar to the translated sentences that correspond to the first language sentences retrieved by the first retrieval unit.
In other words, retrieval based on the first language sentence is performed on the pair data and, using the second language sentences corresponding to the result of this retrieval, the retrieval of similar second language sentences is performed on the pair data. The successive double retrieval in the first and second languages, using the pair data as a bridge, makes it possible to retrieve second language sentences which are adequate translations of the query written in the first language without being greatly affected by differences in expression or in the number of words or phrases contained, even if the input sentence in the first language is relatively short.
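The successive double retrieval can be illustrated with a minimal sketch. The pair data, the function names, the `top_k` parameter, and the use of cosine similarity over term counts in both stages are all assumptions for illustration; the earlier application does not necessarily use these choices.

```python
import math
from collections import Counter

def vectorize(sentence):
    # Assumption: term-count vector over lowercase whitespace tokens.
    return Counter(sentence.lower().split())

def cosine(va, vb):
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def double_retrieval(query, pairs, top_k=1):
    # First retrieval: rank pairs by similarity between the query and
    # the first language side, keeping the top_k pairs.
    q = vectorize(query)
    first_hits = sorted(pairs, key=lambda p: cosine(q, vectorize(p[0])),
                        reverse=True)[:top_k]
    # Second retrieval: use the second language side of each hit as a new
    # query against all second language sentences in the pair data.
    scores = {}
    for _, translated in first_hits:
        t = vectorize(translated)
        for _, candidate in pairs:
            s = cosine(t, vectorize(candidate))
            if s > 0:
                scores[candidate] = max(scores.get(candidate, 0.0), s)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical pair data: (first language sentence, second language sentence).
pairs = [
    ("Please open the window", "Abra la ventana por favor"),
    ("Could you let some air in", "Abra la ventana un poco"),
    ("The train leaves at noon", "El tren sale a las doce"),
]
results = double_retrieval("open the window", pairs)
```

In this toy data the second pair shares no words with the query on its first language side, yet its second language sentence is still returned because it resembles the translation found in the first retrieval.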