Quantitative retrieval involves calculating a score for each retrieval object based on an information retrieval query and then outputting the retrieval objects in descending order of score. The retrieval objects may be documents, patent information, and the like. By handling the retrieval conditions quantitatively, this ranking makes it possible to respond flexibly to queries that are ambiguous or imperfect. An example of quantitative retrieval is shown in FIG. 5. Here, documents are taken as an example of retrieval objects. Each document is assumed to be given keywords used for its retrieval and a numerically represented attribute such as the publication year. The retrieval conditions are assumed to be that the document shall have K1 or K2 as a keyword, and that the value of the publication year C1 shall be as large as possible. A score of V1 is given when a document contains K1, a score of V2 when it contains K2, and a score of V3 according to the value of C1; the total score of each document is the sum of these.
Expressed as a numerical formula, the score calculation is given by V1*K1 + V2*K2 + f(C1).
Here, 1 is substituted for K1 if the document contains the keyword K1 and 0 if it does not, and likewise for K2. f( ) is assumed to be a function which is 1 when the publication year is the latest and which takes smaller values the older the publication year; f(C1) thus corresponds to the score V3 described above. In this way, the score may be calculated. By calculating the scores for all documents and ranking them, it is possible to output the results in sequence from the document which best meets the conditions.
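The score calculation and ranking described above can be sketched as follows. This is a minimal illustration, not the method of FIG. 5 itself: the `Document` class, the weights `v1=2.0` and `v2=1.0`, the sample data, and in particular the linear form of the recency function standing in for f( ) are all assumptions made for the example. The text only requires that f( ) be 1 for the latest publication year and smaller for older years.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    keywords: set
    year: int  # publication year, the numerical attribute C1


def recency(year, oldest, latest):
    """Hypothetical f(): 1.0 for the latest year, falling linearly to 0.0 for the oldest."""
    if latest == oldest:
        return 1.0
    return (year - oldest) / (latest - oldest)


def score(doc, v1, v2, oldest, latest):
    """Total score V1*K1 + V2*K2 + f(C1), with K1 and K2 as 0/1 indicators."""
    k1 = 1 if "K1" in doc.keywords else 0
    k2 = 1 if "K2" in doc.keywords else 0
    return v1 * k1 + v2 * k2 + recency(doc.year, oldest, latest)


def rank(docs, v1=2.0, v2=1.0):
    """Score every document and return them from highest score to lowest."""
    oldest = min(d.year for d in docs)
    latest = max(d.year for d in docs)
    return sorted(docs, key=lambda d: score(d, v1, v2, oldest, latest), reverse=True)


# Assumed sample data for illustration only.
docs = [
    Document("A", {"K1"}, 1990),
    Document("B", {"K1", "K2"}, 1985),
    Document("C", {"K2"}, 1995),
    Document("D", set(), 1995),
]
ranked_ids = [d.doc_id for d in rank(docs)]
```

Note that `rank` scores every document, which is exactly the exhaustive access that becomes impractical when the number of retrieval objects is large.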
The formula for calculating the score is not limited to a sum. A scheme that selects the larger of two values, for example max(V1*K1, V2*K2), may also be applied. In this way, various score calculating methods may be contemplated for retrieval requests.
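To make the difference between the two combination rules concrete, the short sketch below evaluates both for a document that contains both keywords. The function names and the weights 2.0 and 1.0 are assumptions chosen for illustration.

```python
def score_sum(k1, k2, v1, v2):
    """Sum rule: V1*K1 + V2*K2."""
    return v1 * k1 + v2 * k2


def score_max(k1, k2, v1, v2):
    """Max rule: the larger of V1*K1 and V2*K2."""
    return max(v1 * k1, v2 * k2)


# A document containing both keywords (K1 = K2 = 1), with assumed weights.
both_sum = score_sum(1, 1, 2.0, 1.0)
both_max = score_max(1, 1, 2.0, 1.0)

# A document containing only K1 (K1 = 1, K2 = 0).
only_k1_max = score_max(1, 0, 2.0, 1.0)
```

Under the max rule, a document containing both keywords scores no higher than one containing only the more heavily weighted keyword, whereas the sum rule rewards each additional match.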
Generally, the number of retrieval objects is very large, and it is impractical to access all of the retrieval objects every time a query is made, because this requires reading a large amount of data from external storage. As introduced in an article by Salton, et al., "Extended Boolean Information Retrieval", Communications of the ACM, Vol. 26 No. 12, 1983, a transposed file (also commonly called an inverted file) is utilized for making retrievals at high speed in this type of information retrieval. This procedure is shown in FIG. 2. The transposed file (21) is indexed by keyword values and numerical values, so that the retrieval objects having a given value can be traced from that value. The original file, in which the documents are arranged in order, is called a sequential file (24). It does not matter whether the contents of each document are contained in the sequential file or are stored outside of it. In the latter case, information about the position where the contents of the document are stored should be included in the sequential file. Using the transposed file, the identifiers of all documents which have at least one of the keywords and/or numerical values necessary for calculating scores are determined. This may be done by taking the union of the sets of document identifiers corresponding to the respective keywords and/or numerical values (22). The sequential file is then accessed for each of these document identifiers; the sets of keywords are obtained; and the scores are calculated, ranked, and output (23). When a transposed file is used in this way, the need to access all documents in order to calculate the scores is generally obviated, enabling the retrieval to be made more rapidly. Although the transposed file also needs to be stored in external storage, information sharing the same index is stored in physical proximity, so the necessary content can be extracted through the indexes with few accesses.
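The procedure of FIG. 2 can be sketched in memory as follows. This is a simplified illustration under stated assumptions: both files are ordinary Python dictionaries rather than structures on external storage, only keywords are indexed (numerical values such as the publication year are omitted for brevity), and the names `sequential_file`, `build_transposed_file`, `retrieve`, and the weights are hypothetical.

```python
from collections import defaultdict

# Sequential file (24): document identifier -> set of keywords (assumed sample data).
sequential_file = {
    "A": {"K1"},
    "B": {"K1", "K2"},
    "C": {"K3"},
    "D": {"K2"},
}


def build_transposed_file(seq):
    """Transposed file (21): keyword -> set of identifiers of documents having it."""
    index = defaultdict(set)
    for doc_id, keywords in seq.items():
        for kw in keywords:
            index[kw].add(doc_id)
    return index


def retrieve(seq, index, weights):
    """Rank only the documents that have at least one query keyword."""
    # Step (22): union of the identifier sets for the query keywords.
    candidates = set()
    for kw in weights:
        candidates |= index.get(kw, set())

    # Step (23): access the sequential file for the candidates only, then score.
    scored = []
    for doc_id in candidates:
        keywords = seq[doc_id]
        total = sum(w for kw, w in weights.items() if kw in keywords)
        scored.append((doc_id, total))

    # Highest score first; ties broken by identifier for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


index = build_transposed_file(sequential_file)
results = retrieve(sequential_file, index, {"K1": 2.0, "K2": 1.0})
```

In this sketch, document "C" is never read from the sequential file because it appears in neither identifier set. Every candidate is still read, however, including those that end up with low scores.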
Since the sequential file and the transposed file are very large in information retrieval, they are kept in external storage, and parts of them are transferred to internal memory for evaluating the retrieval conditions, calculating scores, and so on. Since a relatively long time is required for accessing the external storage, it is preferable to access it as infrequently as practical in order to attain higher retrieval speed. In the above-described conventional method, the frequency of access to the external storage is reduced by first using the transposed file to narrow the range of retrieval objects related to the query. However, quantitative retrieval generally requires outputting only the objects with sufficiently high scores; those with low scores are not needed. Accessing the sequential file for low-ranked objects, as the above-described method does, is therefore unnecessary. Moreover, in the case of numerical values, where many factors have a bearing on the score, accessing all such related objects reduces the range-narrowing effect. A new method and system for quantitative retrieval are needed to provide an effective way to reduce the frequency of access to the external storage.