1. Field of the Invention
The present invention relates to an information retrieving method and apparatus, and, more particularly, to a method and apparatus for performing a full-text search on document data having a large volume.
2. Description of the Background Art
Using a word-based full-text search is known as a method for retrieving, at high speed, documents desired by a user. In this retrieving method, a document that is a subject of search and retrieval is decomposed into words in advance by a morphological analysis or the like, and then a database (typically containing indices) for examining a relationship between the words and the document is constructed. The database is searched for a word that is desired by a user.
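As an illustration only, the word-to-document database described above can be sketched as a small inverted index. The function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical, and whitespace splitting stands in for the morphological analysis.

```python
from collections import defaultdict

def tokenize(text):
    # Stand-in for morphological analysis: lowercase and split on whitespace.
    return text.lower().split()

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents containing the given word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample documents, decomposed into words in advance.
documents = {
    1: "Internet access for the personal computer",
    2: "personal finance basics",
    3: "Internet routing protocols",
}
index = build_index(documents)
print(search(index, "internet"))  # [1, 3]
```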
In particular, when the volume of documents is very large, a single computer may be incapable of high-speed processing and thus performs poorly. A common method for solving this problem is to arrange a plurality of retrieving apparatuses in a distributed array.
A common dividing method is to divide the retrieving apparatus according to keys, which are either term (or word) identifiers or document identifiers. The former divide the retrieving apparatus by terms, and the latter divide it by documents.
When term identifiers are used as keys, each term is concentrated in a single retrieving apparatus but each document is distributed over a plurality of retrieving apparatuses. Therefore, when a search is made with respect to a single term (e.g., term A), only the retrieving apparatus containing that term is used, and hence the load on the retrieving apparatuses as a whole is light. However, when searching for documents that contain two terms A and B that are not assigned to a single retrieving apparatus, a large amount of information must be transferred between the two retrieving apparatuses, which greatly reduces the search speed.
Specifically, when apparatus division is made by using term identifiers as keys, if not all of the terms appearing in a search formula are assigned to a single retrieving apparatus, an information transfer occurs between the related retrieving apparatuses. For example, consider a retrieval request for "Internet AND personal computer." If the terms "Internet" and "personal computer" are held by different retrieving apparatuses, an information transfer is needed between the two apparatuses to execute the AND operation. That is, the retrieving apparatus that has searched for "Internet" must transfer its retrieval results to the other retrieving apparatus that holds the term "personal computer," or vice versa.
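The transfer cost in the example above can be sketched as follows. This is a minimal model under stated assumptions, not an implementation from the disclosure: `TermNode`, `and_search`, the node names, and the posting data are all hypothetical, and the transfer is counted simply as the number of document identifiers sent from one apparatus to the other.

```python
class TermNode:
    """Hypothetical retrieving apparatus holding the postings of some terms."""

    def __init__(self, postings):
        self.postings = postings  # term -> set of document ids

    def lookup(self, term):
        return self.postings.get(term, set())

def and_search(nodes, term_to_node, term_a, term_b):
    """Return (matching ids, number of ids transferred between nodes)."""
    node_a = nodes[term_to_node[term_a]]
    node_b = nodes[term_to_node[term_b]]
    hits_a = node_a.lookup(term_a)
    # If the terms live on different nodes, the hits for term_a must be
    # shipped to the node holding term_b before the AND can be evaluated.
    transferred = len(hits_a) if node_a is not node_b else 0
    return sorted(hits_a & node_b.lookup(term_b)), transferred

# Hypothetical assignment of terms to two retrieving apparatuses.
nodes = {
    "n0": TermNode({"internet": {1, 2, 3, 4}, "network": {2, 4}}),
    "n1": TermNode({"personal computer": {2, 4, 5}}),
}
term_to_node = {"internet": "n0", "network": "n0", "personal computer": "n1"}

print(and_search(nodes, term_to_node, "internet", "personal computer"))  # ([2, 4], 4)
print(and_search(nodes, term_to_node, "internet", "network"))            # ([2, 4], 0)
```

In this sketch both queries return the same two documents, but only the query whose terms are held by different apparatuses incurs a transfer.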
On the other hand, in the method of using document identifiers as keys, one document is concentrated in one retrieving apparatus but the same term is distributed over a plurality of retrieving apparatuses. Therefore, it is necessary to use all retrieving apparatuses even for a search for one term. Obtaining results produced by the total system of retrieving apparatuses is problematic, because it is necessary to collect and arrange results of all the retrieving apparatuses.
Specifically, when apparatus division is made by using document identifiers as keys, an information transfer may occur for a reason different from that in the above example. That is, there arises a problem that each retrieving apparatus cannot determine how many retrieval results it should collect in its search range. For example, for a request for the 100 highest-rank (in terms of evaluation value) retrieval results that satisfy "Internet AND personal computer," each retrieving apparatus cannot determine the specific number of retrieval results it should collect. Therefore, each retrieving apparatus sends the 100 highest-rank retrieval results it has produced to a controller (for instance, a central management apparatus), and the controller sorts the retrieval results received from the retrieving apparatuses by evaluation value and disregards retrieval results whose ranks among all the received retrieval results are lower than the 100th rank. This means that the transfers of the disregarded retrieval results are unnecessary.
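The unnecessary transfers described above can be illustrated with a short sketch. The names `local_top_k` and `merge_top_k`, the shard contents, and the evaluation values are hypothetical; each shard stands for one retrieving apparatus, and k is reduced from 100 to 2 to keep the example small.

```python
import heapq

def local_top_k(scored_docs, k):
    """Top-k (doc id, evaluation value) pairs found by one apparatus."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

def merge_top_k(shards, k):
    """Controller side: collect k results per shard, keep the global top k.

    Returns (final results, number of transferred results that were discarded).
    """
    received = []
    for shard in shards:
        received.extend(local_top_k(shard, k))
    final = heapq.nlargest(k, received, key=lambda pair: pair[1])
    wasted = len(received) - len(final)
    return final, wasted

# Hypothetical evaluation values held by two retrieving apparatuses.
shards = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d4", 0.7), ("d5", 0.2), ("d6", 0.05)],
]
final, wasted = merge_top_k(shards, 2)
print(final)   # [('d1', 0.9), ('d2', 0.8)]
print(wasted)  # 2
```

Here four results are transferred to the controller but only two survive the final sort, so half of the transfers serve no purpose.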
Additionally, neither of the above-described two methods sufficiently considers the output order of retrieval results according to evaluation values. That is, each method requires outputting retrieval results according to an evaluation formula that is derived from the search formula so as to reflect it faithfully. In determining an evaluation formula, it is important that the formula be suitable for the user. Although the processing speed can be increased by generating an evaluation formula that is favorable for implementation, an evaluation formula contrary to the user's intuition is difficult for the user to accept.
For example, consider a case where a first retrieving apparatus searches documents in which the weight of a term is 0.5 or more and a second retrieving apparatus searches documents in which the weight of a term is smaller than 0.5. In making a search for "Internet AND personal computer," the first retrieving apparatus searches documents in which the weights of both terms are 0.5 or more. Retrieval results are sorted so as to be arranged in order of the sum of the weights of the two terms and then output in the order thus determined. Next, retrieval results that have not been produced by the first retrieving apparatus (i.e., documents in which the weight of one of the two terms is smaller than 0.5) are sorted so as to be arranged in order of the sum of the weights and then output in the order thus determined. In this case, because of the evaluation scheme according to the sum of the weights of the terms, a document in which the weights of the terms "Internet" and "personal computer" are 1.0 and 0.49, respectively, has a higher rank than a document in which the weights of both terms are 0.5. However, since the former document is searched for by the second retrieving apparatus and the latter document is searched for by the first retrieving apparatus, the retrieval results of the latter document are output first.
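The ordering anomaly in this example can be reproduced with a short sketch. The function names and the document labels "X" and "Y" are hypothetical; the weights and the 0.5 threshold are those given in the example, and the evaluation value is taken to be the sum of the two term weights.

```python
def score(weights):
    # Evaluation value: sum of the weights of the two terms.
    return sum(weights)

def tiered_order(docs, threshold=0.5):
    """Output order when first-tier results are emitted before the rest."""
    # First apparatus: documents in which every term weight is >= threshold.
    first = [d for d, w in docs.items() if all(x >= threshold for x in w)]
    # Remaining documents are produced afterwards.
    rest = [d for d in docs if d not in first]
    by_score = lambda d: score(docs[d])
    return (sorted(first, key=by_score, reverse=True)
            + sorted(rest, key=by_score, reverse=True))

def global_order(docs):
    """Order by evaluation value alone, ignoring apparatus boundaries."""
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

# Weights of ("Internet", "personal computer") in each document.
docs = {"X": (1.0, 0.49), "Y": (0.5, 0.5)}

print(tiered_order(docs))  # ['Y', 'X']
print(global_order(docs))  # ['X', 'Y']
```

Document X has the higher sum (1.49 against 1.0), yet the tiered output places Y first because Y falls within the first apparatus's range.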
As exemplified above, users intuitively feel that an evaluation formula that is not continuous at the boundary lines between retrieving apparatuses is unnatural. Therefore, it is desirable that an evaluation formula be continuous at each boundary line between adjacent ones of hierarchical retrieving apparatuses.