1. Field of the Invention
The present invention relates to a cross-lingual retrieval system for executing retrieval between a first language and a second language. In particular, it relates to a cross-lingual retrieval system that uses a set of pairs, each having a first language sentence and a second language sentence having the same meaning (hereinafter, each pair is referred to as pair data), to retrieve first language sentences according to a query written in the first language, and then to perform similar sentence retrieval of second language sentences which are similar to the second language sentences paired with the retrieved first language sentences.
2. Discussion of the Related Art
With the improvement of the performance of computers, development of electronic dictionaries and progress of technology in natural language processing, many machine translation techniques have been proposed.
However, a machine translation system with a translation capability of sufficient accuracy has not yet been realized.
[Related Art 1]
In a proposed system, a large number of sentence pairs, each having an original language (first language) sentence and a sentence translated from the original language into another language (second language), are prepared. A first language sentence is input to the system and similar sentences are retrieved from the first language sentences in the sentence pairs. Based on the retrieved first language sentences, the corresponding second language sentences are then retrieved from the sentence pairs. A user can refer to the second language sentences output from the system and can thereby improve the quality of translation from the first language sentence into the second language sentence.
For obtaining sentences similar to the first language sentence input to the system from the set of first language sentences in the sentence pairs, a method of determining a sentence of high similarity based on the number of words commonly included in the input sentence and sentences to be retrieved has been suggested. Also, Japanese Patent Application Laid-Open No. 9-50435 (1997) discloses a method of determining a first language sentence having a vector close to the vector corresponding to the input first language sentence as the sentence of high similarity based on the vector space model, one of the similar document retrieving methods.
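The common-word method mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: it assumes whitespace tokenization, uses English stand-ins for the first language sentences, and the names `common_word_count` and `rank_by_common_words` are hypothetical.

```python
def common_word_count(query, sentence):
    """Count the distinct words that appear in both the query and the sentence."""
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().split())
    return len(query_words & sentence_words)


def rank_by_common_words(query, sentences):
    """Return the sentences ordered from most to fewest words shared with the query."""
    return sorted(sentences, key=lambda s: common_word_count(query, s), reverse=True)


# Hypothetical stored first language sentences (English stand-ins).
stored = [
    "the valve opens slowly",
    "the rod is gradually tapered",
    "the rod is tapered",
]
ranked = rank_by_common_words("it is gradually tapered", stored)
```

A sentence that shares no word with the query receives a count of zero, however close its meaning may be.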
A method of obtaining a sentence having high similarity to an input sentence according to the vector space model, described in "Information Retrieval (a Japanese translation of "New Horizons in Information Retrieval")", David Ellis, 1990, pp. 53-57, is now explained.
In the vector space model, each sentence to be an object of retrieval and the sentence input as a query are represented as vectors. Suppose that there are N sentences to be the object of retrieval and M kinds of words (W1, W2, . . . , WM) in the N sentences. Then the vectors corresponding to the N sentences (S1, S2, . . . , SN) are defined as M-dimensional vectors as shown in the following expression (1). If a word Wj exists in a sentence Si, Tij is 1. If the word Wj does not exist in the sentence Si, Tij is 0.
$$S_1=(T_{11}, T_{12}, \ldots, T_{1M}),\quad S_2=(T_{21}, T_{22}, \ldots, T_{2M}),\quad \ldots,\quad S_N=(T_{N1}, T_{N2}, \ldots, T_{NM}) \tag{1}$$
In a similar way, the vector corresponding to a query Q is defined as shown in the following expression (2). If a word Wi exists in the query Q, Ti is 1. If the word Wi does not exist in the query Q, Ti is 0. Here, it is assumed that each element of the vector takes 1 or 0, namely, a binary value. However, it may be possible to allot a real numeric value to each element in accordance with a degree of importance of the word in the sentence.
$$Q=(T_1, T_2, \ldots, T_M) \tag{2}$$
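The binary vectors of expressions (1) and (2) can be built as in the following sketch. It assumes whitespace tokenization and English stand-in sentences; the helper names `build_vocabulary` and `to_binary_vector` are hypothetical and do not come from the cited references.

```python
def build_vocabulary(sentences):
    """Collect the M distinct words W1..WM of the N sentences, in first-seen order."""
    vocabulary = []
    seen = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def to_binary_vector(text, vocabulary):
    """Return an M-dimensional vector whose j-th element is 1 if Wj occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical sentences to be the object of retrieval (N = 2).
sentences = ["pipe tapers gradually", "valve opens slowly"]
vocabulary = build_vocabulary(sentences)                  # M = 6 words
S = [to_binary_vector(s, vocabulary) for s in sentences]  # expression (1)
Q = to_binary_vector("pipe opens", vocabulary)            # expression (2)
```

Query words that do not occur in any of the N sentences have no dimension in the vocabulary and therefore do not contribute to Q.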
In the vector space model, a sentence Si whose vector Si is close to the vector Q is determined to be a sentence having a high similarity to the query Q. Sentences are output in descending order of similarity as a result of retrieval. The distance D (Q, Si) between the vector Q and the vector Si is calculated in accordance with the following expression (3). Here, an expression (V, U) represents an inner product of a vector V and a vector U.
$$D(Q, S_i)=\frac{(Q, S_i)}{\sqrt{(Q, Q)\,(S_i, S_i)}} \tag{3}$$
In the vector space model, ordinarily, the words W1, W2, . . . , WM used for calculation are limited to content words. Function words such as postpositional particles (a part of speech in Japanese grammar) and auxiliary verbs are not taken into account. Moreover, a general word such as the verb "be" in English (namely, a stop word) is not taken into account even though it is a content word.
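Expression (3) is the cosine of the angle between the two vectors, so a larger value of D indicates a closer match. The following sketch ranks stored sentences by that value. It relies on the fact that, for binary vectors, the inner product (Q, Si) equals the number of shared words, while (Q, Q) and (Si, Si) equal the number of distinct words in each. The stop-word list, the whitespace tokenization, the English stand-in sentences, and the function names are all assumptions made for illustration.

```python
import math

# Illustrative stop-word list (an assumption); a real system would use a curated one.
STOP_WORDS = {"it", "is", "be", "a", "to", "the"}


def content_words(text):
    """Return the set of words in the text that are not stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def distance(q_words, s_words):
    """Expression (3) for binary vectors represented as word sets."""
    if not q_words or not s_words:
        return 0.0
    return len(q_words & s_words) / math.sqrt(len(q_words) * len(s_words))


def retrieve(query, sentences):
    """Score every stored sentence against the query, highest D first."""
    word_sets = [content_words(s) for s in sentences]
    vocabulary = set().union(*word_sets)
    # Only words W1..WM that occur in the stored sentences have a dimension.
    q_words = content_words(query) & vocabulary
    scored = [(distance(q_words, ws), s) for ws, s in zip(word_sets, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


stored = ["it is gradually tapered", "it tapers down to a point", "the rod is tapered"]
results = retrieve("it is gradually tapered", stored)
```

In this toy run the sentence "it tapers down to a point" scores 0.0 because it shares no content word with the query, even though its meaning is close.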
[Related Art 2]
To obtain the same effect as the above-described [Related Art 1], a method of improving the translation quality has been suggested. In the method, each word in a query written in a first language is automatically converted into a word or a phrase of a second language by using a dictionary, and then a corresponding sentence(s) is retrieved from a set of the second language sentences utilizing the set of converted words or phrases of the second language. Thereby the user can refer to the corresponding second language sentence(s).
However, the above-described [Related Art 1] and [Related Art 2] have problems as follows.
The above-described [Related Art 1] obtains a similar first language sentence(s) based only on the words contained in the query of the first language. Therefore, although a second language sentence adequate as a translation of the query of the first language is present in the set of second language sentences, it cannot be obtained as a result of the retrieval if the expression of the corresponding first language sentence in the sentence pairs differs from that of the query. The [Related Art 1] is effective only if the sentence pairs contain a sentence composed of a set of words which are the same as those contained in the query of the first language.
The inadequacy becomes more pronounced as the number of words contained in the query becomes smaller. Conversely, in the case where a document including a large number of sentences is input, the number of non-zero elements of the corresponding document vector increases (the dimension of the vector is substantially raised), and accordingly a highly reliable retrieval result is available. However, in most cases, actual translation data consists of short sentences, and therefore it is practically impossible to obtain adequate translations by [Related Art 1].
As an example, a case is considered in which the first language is Japanese, the second language is English, and a Japanese sentence "{character pullout}{character pullout}{character pullout}" (having much the same sense as "It is gradually tapered.") is input. The content words extracted from the sentence are "{character pullout}" (gradually) and "{character pullout}" (tapered). The verb "{character pullout}" (roughly "be", though the correspondence is not exact) is a stop word and is excluded from the following explanation.
According to [Related Art 1], Japanese sentences containing both "{character pullout}" and "{character pullout}" are obtained as sentences similar to the above query. However, it is impossible to obtain sentences that would be acceptable as adequate translations but are paired with different expressions (using different words) having the same meaning as the query, such as the following examples (a) and (b).
"It tapers down to a point." (a)
"It tapers into a sharp point." (b)
The above-described [Related Art 2] obtains second language sentences to be referred to by converting each word in the query of the first language into a word or phrase of the second language by utilizing the dictionary.
However, a word of the first language can be expressed by a variety of words or phrases of the second language. Further, selection of the second language word adequate to substitute for the first language word depends on the context of the query of the first language, and it is practically impossible to determine beforehand the words to be selected for substitution. Consequently, it is difficult to list exhaustively in a dictionary the correspondences between the first language words and the second language words. It is thus difficult to obtain an adequate translation by utilizing [Related Art 2].
As an example, under the same condition as the above example of [Related Art 1], "{character pullout}" and "{character pullout}" are converted into the English words or phrases (c) and (d), respectively, as follows.
"{character pullout}" → "gradually, by degrees, little by little, as soon as, order, the state of things, depends on" (c)
"{character pullout}" → "thin, narrow, fine, slim, slender" (d)
The proper English sentence shown as (a) or (b) cannot be obtained by an attempt to retrieve a sentence containing any word shown in (c) and (d). In fact, if an English sentence corresponding to "{character pullout}{character pullout}{character pullout}" is composed of the words in (c) and (d), an unnatural sentence is generated. The word "taper" contained in the English sentences (a) and (b) by itself means "It gradually thins (or it gradually narrows)" or "It decreases (or it is gradually reduced)". In [Related Art 2], the sentences (a) or (b) cannot be obtained unless "taper" is regarded as a translation of "{character pullout}" or "{character pullout}".
However, "taper" includes the meanings of both "gradually" and "thin (narrow, decrease or be reduced)", and it is inadequate as a translation of only "{character pullout}" or "{character pullout}". Like "taper", various English words, such as "peter", "wane" and "fade", include the meaning of "gradually" by themselves. These words are inadequate as a translation of the single word "{character pullout}" for the same reason as "taper". This means that the English word properly corresponding to a Japanese word is determined depending upon the context of the query, and therefore it is impossible to prepare such a dictionary in advance.
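The failure of [Related Art 2] on this example can be reproduced with a short sketch. The dictionary entries are taken from (c) and (d) and the stored sentences from (a) and (b). The keys `JA_WORD_1` and `JA_WORD_2` are placeholders for the two Japanese content words, which are not reproduced in this text, and the function names and the whole-word matching rule are assumptions made for illustration.

```python
import re

# Entries (c) and (d); the keys are placeholders for the Japanese content words.
DICTIONARY = {
    "JA_WORD_1": ["gradually", "by degrees", "little by little", "as soon as",
                  "order", "the state of things", "depends on"],
    "JA_WORD_2": ["thin", "narrow", "fine", "slim", "slender"],
}

# Second language sentences (a) and (b).
ENGLISH_SENTENCES = ["It tapers down to a point.", "It tapers into a sharp point."]


def contains_phrase(sentence, phrase):
    """True if the phrase occurs in the sentence as whole words (case-insensitive)."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def retrieve_by_dictionary(query_words, sentences):
    """Convert each query word through the dictionary, then keep matching sentences."""
    candidates = [p for w in query_words for p in DICTIONARY.get(w, [])]
    return [s for s in sentences if any(contains_phrase(s, p) for p in candidates)]


hits = retrieve_by_dictionary(["JA_WORD_1", "JA_WORD_2"], ENGLISH_SENTENCES)
```

Here `hits` is empty: none of the twelve candidate words or phrases occurs in (a) or (b), because both sentences express the combined meaning with the single verb "taper".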
In the method disclosed by the above-described Japanese Patent Application Laid-Open No. 9-50435 using the vector space model, a vector is provided not to each sentence, but to each word in advance, and a sentence vector is represented as a sum total of vectors corresponding to respective words contained in the sentence.
In this case, if the same words do not exist in the query and in a sentence to be the object of retrieval, the similarity is lowered, and the problem of [Related Art 1] is not removed by the method. Further, as described above, the meaning of each word is determined depending upon the context of the query, and it is impossible to determine a fixed vector for each word in advance. Accordingly, the method has the same problem as [Related Art 2].
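The word-vector summation described above can be sketched as follows. This illustrates only the summation step as characterized in this section, not the method of the cited application. The three-dimensional vectors in `WORD_VECTORS` are made-up values chosen for readability, and the function name `sentence_vector` is hypothetical.

```python
# Hypothetical fixed word vectors, assigned in advance and independent of context.
WORD_VECTORS = {
    "gradually": [1.0, 0.0, 0.0],
    "thin": [0.0, 1.0, 0.0],
    "taper": [0.0, 0.0, 1.0],
}
DIMENSIONS = 3


def sentence_vector(words):
    """Sum the fixed vectors of the words in a sentence; unknown words contribute nothing."""
    total = [0.0] * DIMENSIONS
    for word in words:
        vector = WORD_VECTORS.get(word, [0.0] * DIMENSIONS)
        for k, value in enumerate(vector):
            total[k] += value
    return total


query_vector = sentence_vector(["gradually", "thin"])
stored_vector = sentence_vector(["taper"])
```

Because each word vector is fixed beforehand, the sentence vector cannot reflect a meaning that a word takes on only in a particular context.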