The present invention relates to a document retrieval system for retrieving documents containing keywords specified by a user, a document number subsequence acquisition apparatus, and a document retrieval method, and in particular, relates to a document retrieval system for retrieving documents at high speed by analyzing a large quantity of electronic documents and using a transposed index, a document number subsequence acquisition apparatus, and a document retrieval method.
A technique called the transposed index method is known as a document retrieval method for retrieving documents containing a keyword string specified by the user (a retrieval keyword) from among a large quantity of electronic documents. In the transposed index method, a transposed index is created in advance. The transposed index has a data structure that pairs each keyword string that could be specified by the user with information indicating the set of documents containing that keyword string. This information takes the form of an array of document numbers (document number array), each of which uniquely identifies one document to be retrieved, so that each document in the set is associated with the keyword string through its document number.
When a document retrieval apparatus using the transposed index method is to retrieve documents containing a single retrieval keyword from a large quantity of documents, it first searches the transposed index for the keyword string corresponding to the retrieval keyword. Next, the document retrieval apparatus acquires the document number array corresponding to the detected keyword string from the transposed index. Then, the document retrieval apparatus extracts information about the document indicated by each document number contained in the acquired document number array (for example, the URL (Uniform Resource Locator) or title of the document) from a document database and outputs the information as a retrieval result.
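The single-keyword retrieval flow described above can be illustrated with a minimal sketch. It is not taken from any actual apparatus: the function names `build_transposed_index` and `retrieve`, the whitespace tokenization, and the sample document database are all assumptions introduced for illustration only.

```python
from collections import defaultdict

def build_transposed_index(documents):
    """Pair each keyword string with a sorted document number array."""
    index = defaultdict(list)
    for doc_number in sorted(documents):
        # Whitespace tokenization is an assumption made for this sketch.
        for keyword in set(documents[doc_number]["text"].lower().split()):
            index[keyword].append(doc_number)
    return dict(index)

def retrieve(index, documents, retrieval_keyword):
    """Look up the keyword string, then fetch document info by number."""
    doc_numbers = index.get(retrieval_keyword.lower(), [])
    return [(n, documents[n]["title"], documents[n]["url"]) for n in doc_numbers]

# Hypothetical document database keyed by document number.
DOCUMENTS = {
    1: {"title": "Information theory", "url": "http://example.com/1",
        "text": "information entropy coding"},
    2: {"title": "Cooking basics", "url": "http://example.com/2",
        "text": "recipes and kitchen tools"},
    3: {"title": "Retrieval systems", "url": "http://example.com/3",
        "text": "information retrieval index"},
}
INDEX = build_transposed_index(DOCUMENTS)
```

In this sketch, `retrieve(INDEX, DOCUMENTS, "information")` yields the titles and URLs of documents 1 and 3, and a keyword absent from the index yields an empty result.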
Incidentally, with the increasing capacity of storage devices in recent years, it has become possible to store a vast quantity of documents in a storage device. Moreover, with advancing information communication technology, it has also become possible to retrieve documents stored on many computers connected via a network. As a result, the quantity of documents to be retrieved in document retrieval is steadily increasing. This increase enlarges the quantity of data in a transposed index, so that a document retrieval apparatus takes a long time to fetch the document number array of documents containing the relevant keyword string from the transposed index.
Thus, in order to speed up the fetching of a document number array from the transposed index, parallelization of the fetching processing has been attempted.
FIG. 19 is a diagram exemplifying a conventional document retrieval system. The parallelized document retrieval system is provided with a plurality of transposed index storage devices 93a, 93b, 93c, and 93d, each of which stores a transposed index. The same keyword strings are set in each transposed index, and the document number of each document containing a given keyword string is registered in one of the transposed indexes. That is, document numbers corresponding to a vast quantity of documents are distributively stored in the plurality of transposed index storage devices 93a, 93b, 93c, and 93d.
A retrieval keyword (“information” in the example shown in FIG. 19) input into a retrieval keyword input apparatus 91 is delivered to each of a plurality of document number subsequence acquisition apparatuses 92a, 92b, 92c, and 92d (Step S91). The document number subsequence acquisition apparatuses 92a, 92b, 92c, and 92d retrieve keyword strings corresponding to the retrieval keyword from the mapped transposed index storage devices 93a, 93b, 93c, and 93d respectively (Step S92).
Further, the document number subsequence acquisition apparatuses 92a, 92b, 92c, and 92d acquire document number arrays mapped to the detected keyword strings from the mapped transposed index storage devices 93a, 93b, 93c, and 93d respectively (Step S93). Then, the document number subsequence acquisition apparatuses 92a, 92b, 92c, and 92d deliver the acquired document number arrays to a document number array summarization output apparatus 94 (Step S94). The document number array summarization output apparatus 94 summarizes the received document number arrays and outputs a summarization result as a retrieval result.
Thus, with document numbers being distributively stored in the plurality of transposed index storage devices 93a, 93b, 93c, and 93d, document number arrays can be fetched in parallel using the plurality of document number subsequence acquisition apparatuses 92a, 92b, 92c, and 92d. 
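The parallel flow of Steps S91 to S94 can be sketched as follows. This is an illustrative sketch only: the four dictionaries stand in for the transposed index storage devices 93a to 93d, the function names are hypothetical, a thread pool stands in for the separate acquisition apparatuses, and the summarization is assumed here to be a merge of the subsequences into a single sorted array.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions standing in for storage devices 93a-93d.
# Each holds the same keyword strings; each document number is
# registered in exactly one partition.
PARTITIONS = [
    {"information": [1, 5], "retrieval": [5]},
    {"information": [2], "retrieval": []},
    {"information": [], "retrieval": [3, 7]},
    {"information": [4, 8], "retrieval": []},
]

def acquire_subsequence(partition, retrieval_keyword):
    """Steps S92-S93: find the keyword string and fetch its array."""
    return partition.get(retrieval_keyword, [])

def parallel_retrieve(partitions, retrieval_keyword):
    """Steps S91 and S94: fan the keyword out, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        subsequences = list(
            pool.map(lambda p: acquire_subsequence(p, retrieval_keyword),
                     partitions)
        )
    # Summarization is assumed to mean merging into one sorted array.
    return sorted(n for sub in subsequences for n in sub)
```

The thread pool here only illustrates the structure of the fan-out and merge; in the system of FIG. 19 the speedup comes from separate apparatuses each reading its own storage device.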