1. Field of the Invention
The present invention relates to a document retrieval system, and more particularly to a document retrieval system which searches a large quantity of document data for documents coinciding with or corresponding to a retrieval request inputted by the user, and which ranks or classifies the documents on the basis of the degree of coincidence therebetween.
2. Description of the Prior Art
With the recent increase in the scale of document data bases, which have come to hold a tremendous quantity of data, it is frequently difficult to find the target document through the use of a prior key word searching technique or a global retrieval technique. Hence, even if such a technique is capable of producing a retrieval result at a high speed, a reduction of the total retrieval time is not always possible. One possible way to decrease the number of resultant documents is to narrow down the candidates, for example by additionally employing another key word, but it is difficult to add appropriate key words without missing necessary documents. For this reason, in addition to paying attention to the presence or absence of a letter string (word) in the documents to be searched, there has been known a method of ranking (sequencing) the searched documents on the basis of the frequency of occurrence of that word, so as to retrieve the target document with a high efficiency.
FIG. 27 is a block diagram showing an arrangement of a prior document retrieval system which sequences the retrieval results. As shown in FIG. 27, the document retrieval system is composed of document data 3101 under retrieval, a dictionary 3102, a word frequency index 3103 for retaining the frequencies of occurrence of the dictionary words in the documents, a word frequency information extracting means 3104 for obtaining the word occurrence frequency information from the document data 3101, a retrieval request inputting means 3105 for receiving a retrieval request inputted by the user, a word frequency calculating means 3106 for calculating the word occurrence frequency from the word frequency index 3103, a frequency score calculating means 3107 for calculating a frequency score of each document on the basis of the word occurrence frequency, a document score calculating means 3108 for calculating a document score indicative of the degree of coincidence between each document and the retrieval request on the basis of the frequency score, a document ranking means 3109 for rearranging the documents in the order of document score, and a retrieval result displaying means 3110 for displaying the resultant documents arranged in the order of score.
FIG. 28 is a flow chart showing a retrieval procedure of a prior document retrieval system which sequences the retrieval results. First of all, before retrieval, the word frequency information extracting means 3104 consults the document data 3101 to obtain word frequency information, which in turn is outputted, together with the total number of documents and the number of documents in which each word occurs, to the word frequency index 3103, where a word frequency index is made out in advance. At a step 4201, the user who intends to carry out the retrieval inputs the retrieval request through the retrieval request inputting means 3105. At a step 4202, the word frequency calculating means 3106 refers to the word frequency index 3103 to calculate a frequency of occurrence TFij, in a document Dj (j=1, 2, . . . , ND), of each dictionary word Wi (i=1, 2, . . . , NW, where NW corresponds to the number of dictionary words included in the retrieval request) included in the retrieval request inputted through the retrieval request inputting means 3105, and further to calculate the number of documents NDi in which that word appears.
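The index preparation and the step 4202 described above can be sketched as follows. This is a minimal illustration only: the function names, the whitespace tokenization, and the sample documents are assumptions introduced here, and the description does not specify how the word frequency index 3103 is actually stored.

```python
from collections import Counter

def build_word_frequency_index(documents, dictionary):
    """Count dictionary words per document (hypothetical stand-in for index 3103)."""
    index = []
    for text in documents:
        # Assumption: words are separated by whitespace and compared in lower case.
        counts = Counter(w for w in text.lower().split() if w in dictionary)
        index.append(counts)
    return index

def lookup_frequencies(index, request_words):
    """Return TF[i][j] and ND_i for each request word W_i (step 4202)."""
    # Counter returns 0 for a word that does not occur in a document.
    tf = [[doc_counts[w] for doc_counts in index] for w in request_words]
    # ND_i is the number of documents in which W_i occurs at least once.
    nd = [sum(1 for count in row if count > 0) for row in tf]
    return tf, nd

# Illustrative data only; not taken from the description.
documents = [
    "retrieval of document data by key word",
    "document ranking by word frequency and document score",
    "image compression method",
]
dictionary = {"document", "retrieval", "word", "ranking", "frequency", "score"}

index = build_word_frequency_index(documents, dictionary)
tf, nd = lookup_frequencies(index, ["document", "retrieval"])
print(tf, nd)
```

Here `tf[i][j]` plays the role of TFij and `nd[i]` the role of NDi; the index is built once in advance, so a retrieval request requires only look-ups.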
Furthermore, at a step 4203 the frequency score calculating means 3107 calculates a frequency score SFj of the document Dj according to an equation (1) on the basis of the output of the word frequency calculating means 3106. ##EQU1## where IDFi designates a parameter representative of a bias of the word Wi in all the documents.
Still further, at a step 4204 the document score calculating means 3108 obtains a document score Sj indicative of the degree of coincidence between the document Dj and the retrieval request on the basis of the frequency score SFj of the document Dj outputted from the frequency score calculating means 3107. In the prior retrieval system, the document score Sj is equal to the frequency score SFj, as given by an equation (2).
Sj = SFj (2)
Moreover, at a step 4205 the document ranking means 3109 rearranges the retrieval results in the order of the document score calculated in the document score calculating means 3108, followed by a step 4206 where the retrieval result displaying means 3110 shows the retrieval results to the user.
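The steps 4203 to 4205 can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch assumes a common form of frequency weighting, SFj = sum over i of TFij x IDFi with IDFi = log(ND/NDi); this form, the function names, and the sample values are assumptions for illustration and may differ from the equation actually used.

```python
import math

def frequency_scores(tf, nd, num_docs):
    """Step 4203 under an ASSUMED weighting: SF_j = sum_i TF_ij * IDF_i."""
    # ASSUMPTION: IDF_i = log(ND / ND_i); a word occurring in no document gets 0.
    idf = [math.log(num_docs / n) if n > 0 else 0.0 for n in nd]
    return [
        sum(tf[i][j] * idf[i] for i in range(len(tf)))
        for j in range(num_docs)
    ]

def rank_documents(scores):
    """Step 4205: document numbers arranged in descending order of score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Illustrative values: two request words, three documents.
tf = [[1, 2, 0],   # TF of word W1 in documents D1..D3
      [1, 0, 0]]   # TF of word W2 in documents D1..D3
nd = [2, 1]        # number of documents containing W1 and W2

sf = frequency_scores(tf, nd, num_docs=3)
s = sf             # step 4204, equation (2): Sj = SFj
ranking = rank_documents(s)
print(ranking)
```

Because the document score is simply the frequency score, the ranking depends only on how often the request words occur, which is the property the following paragraphs criticize.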
However, according to the above-mentioned prior arrangement, in cases where, as shown in FIG. 29, one word included in the retrieval request occurs at an extremely high frequency, a problem arises in that even a document contrary to the user's retrieving intention is ranked high. In addition, since the score used for ranking the documents under retrieval is calculated in units of documents, irrespective of the fields within each document, it is difficult to make practical use of information such as the heading of a paper article or the title of the invention in a patent application.
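The first problem can be illustrated numerically. The sketch below again assumes the weighting SFj = sum over i of TFij x IDFi with IDFi = log(ND/NDi), which is an assumption made here because equation (1) is not reproduced in this text; the counts are invented for illustration and are not the data of FIG. 29.

```python
import math

def score(tf_column, idf):
    """ASSUMED frequency score of one document: sum_i TF_ij * IDF_i."""
    return sum(t * w for t, w in zip(tf_column, idf))

# Hypothetical request with two words, W1 and W2, over ND = 4 documents.
# Document 0 contains both words once; document 1 contains only W1, 50 times.
tf = [[1, 50, 0, 0],   # occurrences of W1 in documents 0..3
      [1, 0, 0, 0]]    # occurrences of W2 in documents 0..3
num_docs = 4
nd = [sum(1 for c in row if c > 0) for row in tf]   # documents containing each word
idf = [math.log(num_docs / n) for n in nd]          # ASSUMED IDF form

scores = [score([tf[0][j], tf[1][j]], idf) for j in range(num_docs)]
ranking = sorted(range(num_docs), key=lambda j: scores[j], reverse=True)
print(ranking)
```

Under these assumptions, document 1 is ranked first although it does not contain W2 at all, while document 0, which contains both requested words, is ranked second.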
Besides, there are various other problems. In the case of making a plurality of retrieval requests, priority cannot be given among these retrieval requests, which makes it difficult to flexibly express the user's requests. In the case that a group of words including all the necessary words is given as the retrieval request, a document in which only one of the words occurs at an extremely high frequency comes to a high order. Further, it is difficult to express, as the retrieval request, a group of words which are required to occur close to one another, and to search for them.