1. Field of the Invention
The present invention relates to an associate document retrieving apparatus and a storage medium for storing an associate document retrieving program. In particular, the invention relates to an associate document retrieving apparatus for executing retrieval of an associate document taking into consideration the similarity between a retrieval expression and each of a set of keywords. The invention also relates to a storage medium for storing an associate document retrieving program which allows a computer to function as such an associate document retrieving apparatus.
2. Discussion of the Related Art
In a retrieval system which deals with an enormous amount of documents, a retrieval method using keywords is generally adopted. When an arbitrary keyword (retrieval word) is inputted into the retrieval system as a retrieval condition, all the documents containing the keyword are obtained as a result of the retrieval. The retrieval according to this method is called a full text retrieval. Also, another method is widely used in which one or more keywords for retrieval are added to each document in advance and the document having the keyword that matches an inputted retrieval word is obtained as a result of the retrieval.
However, according to the above-described retrieval systems, only the documents containing the retrieval word inputted by a user, or the documents to which the retrieval word has been added as a keyword, can be obtained. Therefore, these retrieval systems cannot comprehensively obtain all the documents requested by the user because a complete match between the retrieval word and the keywords is required.
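The limitation of complete-match retrieval can be illustrated by the following minimal sketch. The corpus `DOCS`, the document identifiers and the function `exact_match` are hypothetical and are introduced only for illustration; they are not taken from any actual retrieval system.

```python
# Hypothetical toy corpus: document id -> keywords added to the document in advance.
DOCS = {
    "d1": {"airplane", "engine"},
    "d2": {"aircraft", "wing"},
    "d3": {"ship", "harbor"},
}


def exact_match(retrieval_word, docs):
    """Return the ids of documents whose keyword set contains the retrieval word."""
    return sorted(doc_id for doc_id, keywords in docs.items()
                  if retrieval_word in keywords)


hits = exact_match("airplane", DOCS)
print(hits)  # d2 concerns "aircraft" but is not retrieved for "airplane"
```

In this sketch the document "d2" is overlooked because its keyword "aircraft" does not completely match the retrieval word "airplane", although the two words are close in meaning.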
Consequently, the technique of so-called associate document retrieval has been proposed. This technique outputs results of retrieval similar in meaning to the result of the retrieval directly obtained by the retrieval word as well as the result of retrieval completely matching the retrieval word inputted by the user.
The associate document retrieval can be realized by assigning to each pair of words a value of similarity quantitatively determined according to some criterion, called a degree of similarity between the words. When a user inputs a retrieval word, documents containing many words having large degrees of similarity to the retrieval word (namely, similar words) are outputted; that is, documents having a higher degree of matching are outputted. The associate document retrieval has advantages over the complete-match retrieval: fewer relevant documents are overlooked, and the results of retrieval can be outputted in order of degree of matching.
In a general keyword retrieval system, the retrieval is executed using a retrieval expression which connects retrieval words by logical operators such as "and" and "or". To actually utilize the associate document retrieval in the retrieval system, it is necessary to execute the calculation of the degree of similarity for not just a single retrieval word, but a whole retrieval expression. In other words, it is necessary to calculate the degree of similarity between a retrieval expression and a document, hereinafter referred to as the degree of association.
Japanese Patent Application Laid-Open No. Hei. 2-41564 (1990) discloses a conventional method of associate document retrieval in which the degree of association for a retrieval word is used. For example, assuming that the keywords are "word 1", "word 2" and "word 3" and that the retrieval expression is "(word 1) or (word 2) or (word 3)", the procedures of associate document retrieval according to the disclosed method are as follows.
In the first step, a degree of similarity is provided in advance to every binary combination (pair) of the keywords. It is assumed that the degree of similarity between the keywords "word a" and "word b" is represented as S(a, b) (or, equivalently, S(b, a)).
In the second step, values of the degree of similarity $Ri_1$, $Ri_2$ and $Ri_3$ between each of the keywords "word 1", "word 2" and "word 3" and the group of keywords contained in a document Di, {word $i_1$, word $i_2$, ..., word $i_m$}, are obtained by the following equations:

$$
\begin{aligned}
Ri_1 &= S(i_1, 1) \oplus S(i_2, 1) \oplus \cdots \oplus S(i_m, 1) \\
Ri_2 &= S(i_1, 2) \oplus S(i_2, 2) \oplus \cdots \oplus S(i_m, 2) \\
Ri_3 &= S(i_1, 3) \oplus S(i_2, 3) \oplus \cdots \oplus S(i_m, 3)
\end{aligned}
\tag{1}
$$

("$\oplus$" in the equations indicates a generalized sum operation.)
In the third step, the degree of association Ki between the document Di and the retrieval expression "(word 1) or (word 2) or (word 3)" is obtained according to the following equation:

$$
Ki = Ri_1 \oplus Ri_2 \oplus Ri_3
\tag{2}
$$

("$\oplus$" in the equation indicates a generalized sum operation.)
In the fourth step, the processes for the document Di in the second and third steps are applied to all documents which are the object of retrieval. The documents are outputted in descending order of the value of Ki.
According to the above procedures, it becomes possible to output the results of retrieval in the order of the degree of association. The results include not only the documents completely matching the retrieval expression "(word 1) or (word 2) or (word 3)" but also the documents closely associated with the retrieval expression.
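The four steps above may be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of the cited publication: the generalized sum is not defined in the text, so the probabilistic sum a + b - ab is assumed here; the similarity table, the documents and all function names are hypothetical.

```python
from functools import reduce


def gsum(a, b):
    """Assumed generalized sum (probabilistic sum); the source does not define it."""
    return a + b - a * b


def similarity(table, a, b):
    """Step 1: symmetric lookup S(a, b) = S(b, a).

    Assumption: identical words score 1.0 and unlisted pairs score 0.0.
    """
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))


def word_score(table, doc_keywords, query_word):
    """Step 2: combine S(i_1, j), ..., S(i_m, j) over the document's keywords."""
    return reduce(gsum,
                  (similarity(table, k, query_word) for k in doc_keywords),
                  0.0)


def association(table, doc_keywords, query_words):
    """Step 3: combine the per-word scores into the degree of association Ki."""
    return reduce(gsum,
                  (word_score(table, doc_keywords, q) for q in query_words),
                  0.0)


def rank(table, docs, query_words):
    """Step 4: score every document and sort in descending order of Ki."""
    scored = [(doc_id, association(table, kws, query_words))
              for doc_id, kws in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity table and documents.
S = {("airplane", "aircraft"): 0.9, ("ship", "harbor"): 0.5}
documents = {
    "D1": ["airplane", "engine"],
    "D2": ["aircraft"],
    "D3": ["harbor"],
}

ranking = rank(S, documents, ["airplane", "ship"])
for doc_id, k in ranking:
    print(doc_id, round(k, 3))
```

With these assumed values, the document "D2" is retrieved for the expression "(airplane) or (ship)" although it contains neither retrieval word, and it is ranked below the completely matching document "D1".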
However, in the associate document retrieval of the conventional art, the degree of association is calculated based on the degree of similarity provided in advance to the relation of the binary combination of the words. As a result, the following problems arise, and they make it difficult to execute the associate document retrieval practically.
The first problem is that it is impossible to obtain a result of retrieval which reflects the relation among the keywords connected by the logical operators in the retrieval expression.
In the case where the degree of association is calculated based on the degree of similarity provided to the relation of the binary combination of the words, there is no way to make the degree of association reflect the relation among the keywords connected by the logical operators in the retrieval expression except by algebraically calculating the degree of association. However, it is difficult to represent the relation among the keywords connected in the retrieval expression by an algebraic calculation. Accordingly, on the basis of the degree of similarity of the binary combination of the words alone, it is impossible to obtain an accurate result of associate document retrieval for a retrieval expression created by connecting the keywords with logical operators.
For example, it is assumed that "(airplane) or (aircraft) or (passenger plane) or (ship)" is given as a retrieval expression. In the associate document retrieval according to the conventional art, the sum of the values of the degree of similarity for each of the keywords "airplane", "aircraft", "passenger plane" and "ship" is obtained, and thereby the degree of association between the retrieval expression "(airplane) or (aircraft) or (passenger plane) or (ship)" and each of the documents is determined. According to this calculation method, the values of the degree of similarity corresponding to the respective keywords "airplane", "aircraft", "passenger plane" and "ship" are treated equally. Since "airplane", "aircraft" and "passenger plane" have many similar words in common (words having a higher degree of similarity), many documents related to the keyword "airplane" ("aircraft" or "passenger plane") appear in the result of associate document retrieval, and only a small number of documents related to "ship" appear. In other words, a concept for which the user happens to enter many synonyms has a disproportionately great influence on the degree of association and, therefore, on the results of associate document retrieval.
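The bias described above can be reproduced with a small numerical sketch. The per-keyword values in `R` are invented for illustration, ordinary addition is assumed as the simplest instance of the sum, and the name `summed_association` is hypothetical.

```python
# Hypothetical per-keyword degrees of similarity for two documents.
R = {
    "doc_about_planes": {"airplane": 0.9, "aircraft": 0.8,
                         "passenger plane": 0.8, "ship": 0.1},
    "doc_about_ships": {"airplane": 0.1, "aircraft": 0.1,
                        "passenger plane": 0.1, "ship": 0.9},
}
query = ["airplane", "aircraft", "passenger plane", "ship"]


def summed_association(scores, query_words):
    """Degree of association as a plain sum over the retrieval words (assumed)."""
    return sum(scores[word] for word in query_words)


K = {doc: summed_association(scores, query) for doc, scores in R.items()}
for doc, value in sorted(K.items(), key=lambda pair: pair[1], reverse=True):
    print(doc, round(value, 2))
```

Both documents match one concept of the retrieval expression equally strongly (0.9), yet the document about airplanes receives more than twice the degree of association merely because three of the four retrieval words are near-synonyms.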
In another example, assume that "(airplane) or (ship)" is given as the retrieval expression. In the associate document retrieval of the conventional art, the degree of association between the retrieval expression "(airplane) or (ship)" and each of the documents is determined by obtaining the sum of the values of the degree of similarity for the keywords "airplane" and "ship". Accordingly, in the result of associate document retrieval, the documents related to both "airplane" and "ship" have priority over the documents related to either "airplane" or "ship". However, the retrieval expression "(airplane) or (ship)" means that either "airplane" or "ship" is included. Therefore, it is inappropriate that the priority is given to the documents having a high degree of association with both "airplane" and "ship" as the result of retrieval. Giving a priority to the documents having a higher degree of association with both "airplane" and "ship" as a result of retrieval may be considered to correspond to the retrieval expression "(airplane) and (ship)".
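This second example can likewise be sketched numerically. The values are invented for illustration, and ordinary addition is again assumed for the sum; with a different generalized sum the ordering could differ, so the sketch shows the tendency and not a universal result.

```python
# Hypothetical degrees of similarity of two documents to each retrieval word.
doc_only_airplane = {"airplane": 1.0, "ship": 0.0}   # fully about airplanes
doc_both_partial = {"airplane": 0.6, "ship": 0.6}    # partly about both


def or_by_sum(scores):
    """Degree of association for "(airplane) or (ship)" as a plain sum (assumed)."""
    return scores["airplane"] + scores["ship"]


k_only = or_by_sum(doc_only_airplane)
k_both = or_by_sum(doc_both_partial)
print("only airplane:", k_only)
print("both, partially:", k_both)
```

Under these assumptions the document partially related to both words is ranked above the document that completely satisfies one side of the "or", which is the behavior one would expect of the retrieval expression "(airplane) and (ship)".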
A second problem arises in that it is impossible to execute retrieval effectively utilizing designation of bibliographic items included in a retrieval expression.
In many cases, the retrieval expression used in the actual retrieval system includes not only the keywords but also designation of various bibliographic items. However, in the associate document retrieval according to the conventional art, the degree of association is determined based only on the relation of binary combination of the words provided in advance. Therefore, the retrieval expression used in the calculation of the degree of association is limited to those consisting of keywords.
For example, consider the case where the associate document retrieval is executed according to the retrieval expression "(PD=19950101: 19951231) and (FK=game)". The retrieval expression is assumed to mean "a set of documents published in 1995 and containing the keyword 'game'". In this case, what is desired as the result of associate document retrieval is the documents related to a "document published in 1995 and describing a game". Therefore, if a document is related to such a document, it is desirable to obtain it even if it was published in a year other than 1995.
However, the associate document retrieval according to the conventional art cannot deal with such a retrieval expression. Even if the associate documents are obtained based on the keyword "game" and are then narrowed down by the condition "documents published in 1995", the result consists only of the "documents published in 1995" among the documents related to "documents describing a game regardless of the year of publication". A document published in another year is excluded, however closely it is related to a document published in 1995 and describing a game.
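The workaround described above, namely associating on the keyword and then narrowing down by the bibliographic item, may be sketched as follows. The records, the `game_score` values, the threshold and the function name are all hypothetical and serve only to make the shortcoming visible.

```python
from datetime import date

# Hypothetical records: publication date and an assumed degree of
# association with the keyword "game".
records = [
    {"id": "A", "published": date(1995, 6, 1), "game_score": 0.9},
    {"id": "B", "published": date(1996, 2, 1), "game_score": 0.8},
    {"id": "C", "published": date(1995, 3, 1), "game_score": 0.1},
]


def associate_then_filter(docs, start, end, threshold=0.5):
    """Keep documents associated with "game", then apply the date condition exactly."""
    related = [d for d in docs if d["game_score"] >= threshold]
    return [d["id"] for d in related if start <= d["published"] <= end]


result = associate_then_filter(records, date(1995, 1, 1), date(1995, 12, 31))
print(result)  # "B" is closely related to games but is dropped by the date filter
```

Because the date condition is applied as a complete match, the document "B" is never outputted, even though it may be closely related to documents published in 1995 that describe a game.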