The volume of documents in databases is rapidly expanding. It has been estimated that in excess of 90% of all desired intelligence information is available in documents residing in accessible databases. For the information in databases to be useful, a user must be able to locate specific documents relating to specific queries. Existing information retrieval systems make use of inefficient techniques for returning relevant documents. Generally, the existing techniques miss highly relevant documents associated with the user's query. For example, many systems use Boolean logic based query execution techniques wherein key words are connected together via logical or proximity operators. Such a Boolean system merely returns a list of documents, each of which includes one of the key word combinations.
The result of a Boolean search is not a quantifiable measure of how similar the returned documents are to the query. Quantifiable measures of similarity are very useful in retrieving documents from databases because documents can be ranked according to the quantifiable measure. In response to the shortcomings of Boolean type searches, vector space type search systems have been developed. In a vector space type search system, a score related to a particular query is computed for each document in the database. In general, a query "Q" and a document "D" can be compared by computing the shared and disjoint features of the query and the document over an orthogonal space of T terms. In such a comparison, for example, a similarity score can be computed by the following formula: ##EQU1## where Q_i refers to the terms in the query and D_j refers to the terms in the document.
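The scoring formula itself is not reproduced in this text, so the following is only an illustrative sketch of one common vector space score: the inner product of term-frequency vectors, which is nonzero only for terms shared by the query and the document. The function names `term_vector` and `inner_product_score`, the whitespace tokenization, and the sample documents are assumptions made for the example and are not taken from the system described.

```python
from collections import Counter

def term_vector(text):
    """Map each lowercase, whitespace-delimited term to its frequency."""
    return Counter(text.lower().split())

def inner_product_score(query, document):
    """Assumed scoring rule: sum of products of matching term frequencies."""
    q = term_vector(query)
    d = term_vector(document)
    # Counter returns 0 for absent terms, so disjoint terms contribute nothing.
    return sum(weight * d[term] for term, weight in q.items())

docs = ["the cat sat on the mat", "dogs chase the cat", "stock prices fell"]
# Unlike a Boolean result list, the scores allow the documents to be ranked.
ranked = sorted(docs, key=lambda doc: inner_product_score("the cat", doc),
                reverse=True)
```

Because every matching occurrence adds to the sum, a longer document tends to accumulate a larger score under this assumed rule, regardless of how focused it is on the query.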
A quantifiable similarity score for a document and query such as computed above is useful because the scores over various documents for a single query can be compared against each other. However, as an examination of the scoring formula makes clear, the score is significantly affected by variations in the number of terms per document. Since documents in a database typically have a wide range of sizes (e.g., from less than one page to hundreds of pages), the scoring must be normalized by size. One way to normalize the scoring is to divide individual documents into subdocuments having approximately the same size. The scoring is then computed on the basis of the subdocument. In this way, mere differences in the number of terms do not significantly skew the similarity analysis.
There are a variety of ways to create subdocuments from documents. A simple way is to create subdocuments that have precisely the same number of terms. Another way is to create subdocuments that have the same number of sentences. Each of these techniques helps to solve the problem of documents of differing sizes. However, each of these techniques ignores the content of the text of the document in creating the subdocument. A technique for creating subdocuments that both forms subdocuments of comparable size and takes account of the content of the subdocuments is to make the subdocuments correspond to the paragraphs in the document.
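Two of the subdocument strategies described above can be sketched as follows. This is a minimal illustration, assuming that terms are whitespace-delimited and that paragraphs are separated by blank lines; the function names and the `terms_per_chunk` parameter are hypothetical.

```python
def fixed_size_subdocuments(text, terms_per_chunk=50):
    """Split text into chunks holding the same number of terms.

    The final chunk may be shorter when the term count is not an
    exact multiple of terms_per_chunk.
    """
    terms = text.split()
    return [" ".join(terms[i:i + terms_per_chunk])
            for i in range(0, len(terms), terms_per_chunk)]

def paragraph_subdocuments(text):
    """Treat each blank-line-separated paragraph as one subdocument."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

sample = "a b c d e\n\nf g h"
by_count = fixed_size_subdocuments(sample, terms_per_chunk=3)
by_paragraph = paragraph_subdocuments(sample)
```

In the sample, the fixed-size split cuts across the paragraph boundary, while the paragraph split keeps each unit of content intact at the cost of less uniform sizes.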
One result of calculating similarity scores for subdocuments of large databases is that the computer must process a very large number of subdocuments. In particular, prior to any scoring or search query processing, the large database to be searched is inverted. Inverting a database involves creating a list of all the terms in the database and the identity of the subdocuments in which those terms appear. Creating this inverted database is difficult because the memory space occupied by the database in disk storage is much larger than the memory space of the main memory in the computer. The computer cannot, therefore, operate directly on the database in main memory because the entire database cannot be written to main memory. As a result, the computer must repeatedly access the database in disk storage to invert it. Since the time required to access data in disk storage is much longer than the time required to access data in the main memory, a significant factor in creating the inverted database is minimizing how often the computer must access disk storage.
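The inversion step can be illustrated on a toy scale. The sketch below builds the term-to-subdocument mapping entirely in main memory, which is exactly what is not possible for the large databases discussed above; it only shows the shape of the result. The function name `invert` and the use of list positions as subdocument identifiers are assumptions made for the example.

```python
from collections import defaultdict

def invert(subdocuments):
    """Return {term: sorted list of ids of subdocuments containing the term}."""
    postings = defaultdict(set)
    for sub_id, text in enumerate(subdocuments):
        for term in text.lower().split():
            postings[term].add(sub_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = invert(["the cat sat", "the dog ran", "a cat ran"])
```

Here `index["cat"]` lists subdocuments 0 and 2, so a query containing "cat" only needs to score those two subdocuments.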
A technique for minimizing how often disk storage is accessed during the creation of the inverted database is to assemble the large inverted database from several smaller subdatabases. This is a desirable technique because the computer is more efficient in creating smaller subdatabases. However, there still must be some way to merge the smaller subdatabases back together to create the large inverted database. Merging the inverted subdatabases together is a problem because the subdatabases are randomly organized with respect to each other. Attempting to directly merge these subdatabases would require a large number of operations to access the disk storage. This is because, for each term in a first inverted subdatabase, the corresponding term in a second inverted subdatabase would have to be located in disk storage, and it might be randomly placed in disk storage. Performing the large number of disk storage access operations required in a direct merge process significantly slows down the process of merging the subdatabases. In order to merge the subdatabases in a more efficient way, a sorting technique is required to organize the randomly ordered inverted subdatabases. The sort technique must be structured so that the subdatabases can be merged without requiring a large number of disk storage access operations.
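The text does not specify the sort technique, so the following is only a sketch of one familiar approach: order every subdatabase by term, then combine them in a single sequential pass (a k-way merge). In-memory dictionaries stand in for subdatabases on disk, and the names `sorted_postings` and `merge_subdatabases` are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def sorted_postings(inverted):
    """Return (term, subdocument_ids) pairs ordered by term."""
    return sorted(inverted.items())

def merge_subdatabases(*subdatabases):
    """Merge term-sorted streams, reading each stream once, front to back."""
    streams = [sorted_postings(db) for db in subdatabases]
    # heapq.merge consumes the sorted streams lazily and in order, so no
    # random lookups into any subdatabase are needed.
    merged = heapq.merge(*streams, key=itemgetter(0))
    result = {}
    for term, group in groupby(merged, key=itemgetter(0)):
        ids = set()
        for _, postings in group:
            ids.update(postings)
        result[term] = sorted(ids)
    return result

first = {"the": [0, 1], "cat": [0]}
second = {"ran": [2, 3], "cat": [2]}
combined = merge_subdatabases(first, second)
```

Once both inputs share the same term order, matching entries for a term arrive next to each other in the merged stream, which is what removes the need to search the second subdatabase for each term of the first.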