The volume of documents in databases is rapidly expanding. It has been estimated that in excess of 90% of all desired intelligence information is available in documents residing in accessible databases. In order for the information in databases to be useful, a user must be able to locate specific documents relating to specific queries. Existing information retrieval systems use inefficient techniques for returning relevant documents. Generally, the existing techniques miss highly relevant documents associated with a user's query. For example, many systems use Boolean logic-based query execution techniques wherein keywords are connected together via logical or proximity operators. Such a Boolean system merely returns a list of documents, wherein each of the documents includes one of the keyword combinations.
The result of a Boolean search has no quantifiable measure of how similar the returned documents are to the query. Quantifiable measures of similarity are very useful in retrieving documents from databases because the documents can be ranked by the quantifiable measure. In response to the shortcomings of Boolean-type searches, vector space type search systems have been developed. In a vector space type search system, a score related to a particular query is computed for each document in the database. In general, a query "Q" and a document "D" can be compared by computing the shared and disjoint features of the query and the document over an orthogonal space of T terms. In such a comparison, for example, a similarity score can be computed by the following formula: ##EQU1##
Where Qi refers to the terms in the query and Dj refers to the terms in the document.
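The scoring formula itself is not reproduced in this text, so the following is only a minimal sketch of one score of this general kind. It assumes the score is an unnormalized inner product of term counts, which is consistent with the description of comparing a query and a document over a space of terms but is not confirmed by the source. The names term_vector and similarity, the whitespace tokenization, and the sample strings are all hypothetical.

```python
from collections import Counter


def term_vector(text):
    """Map each lowercase, whitespace-separated term to its count (assumed tokenization)."""
    return Counter(text.lower().split())


def similarity(query, document):
    """Assumed score: unnormalized inner product of query and document term counts."""
    q = term_vector(query)
    d = term_vector(document)
    # A Counter returns 0 for terms it does not contain, so only shared terms contribute.
    return sum(weight * d[term] for term, weight in q.items())


query = "vector space retrieval"
short_doc = "vector space models rank documents"
# Same content repeated ten times: a longer document with no new information.
long_doc = " ".join([short_doc] * 10)

print(similarity(query, short_doc))
print(similarity(query, long_doc))
```

Under this assumed score, the longer document scores ten times higher than the shorter one even though it contains nothing new, which illustrates how a score of this kind depends on the number of terms per document.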
A quantifiable similarity score for a document and query, such as the one computed above, is useful because the scores of various documents for a single query can be compared against each other. However, as is clear from an examination of the scoring formula, the score is significantly affected by variations in the number of terms per document. Since documents in a database typically have a wide range of sizes (e.g., from less than one page to hundreds of pages), the scoring must be normalized by size. One way to normalize the scoring is to divide individual documents into subdocuments having approximately the same size. The score is then computed for each subdocument, and the scores of the subdocuments are compared with one another. In this way, mere differences in the number of terms do not significantly skew the similarity analysis.
There are a variety of ways to create subdocuments from documents. A simple way is to create subdocuments that have precisely the same number of terms. Another way is to create subdocuments that have the same number of sentences. Each of these techniques helps to solve the problem of documents of differing sizes. However, each of these techniques ignores the content of the text of the document in creating the subdocument. A technique that both forms subdocuments of comparable size and takes account of the content of the document is to make the subdocuments correspond to the paragraphs in the document.
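Two of the approaches described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, terms are assumed to be whitespace-separated tokens, and paragraphs are assumed to be separated by blank lines, none of which is specified by the text.

```python
import re


def fixed_term_subdocuments(text, terms_per_subdoc):
    """Split text into subdocuments with the same number of terms (the last may be shorter)."""
    terms = text.split()
    return [
        " ".join(terms[i:i + terms_per_subdoc])
        for i in range(0, len(terms), terms_per_subdoc)
    ]


def paragraph_subdocuments(text):
    """Split text into one subdocument per paragraph (assumes blank-line separators)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


document = (
    "First paragraph about retrieval systems.\n\n"
    "Second paragraph about scoring.\n\n"
    "Third."
)

for subdoc in fixed_term_subdocuments(document, 4):
    print(repr(subdoc))
for subdoc in paragraph_subdocuments(document):
    print(repr(subdoc))
```

In this sample, the fixed-term split produces a subdocument that straddles the first and second paragraphs, while the paragraph split keeps each unit of content intact at the cost of less uniform sizes.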
The result of calculating similarity scores of text based on subdocuments is a large list that associates a score with a subdocument identifier and a document identifier. The number of entries on this list is much larger than the number of documents in the database because there may be many subdocuments for each document. Additionally, this list is not sorted by subdocument score. Since the reason for calculating the similarity score is typically to operate on a rank-ordered (by score) list of subdocuments, the entire list must be sorted by score before any other analysis can be started. The sort operation is generally an inefficient and time-consuming process because a complete sort requires on the order of N log N operations, where N represents the number of subdocuments.
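The score list and the complete sort described above can be pictured with a small sketch. The ScoreEntry record, its field names, and the sample scores are hypothetical and chosen only for illustration; the sort uses Python's built-in sorted, which performs on the order of N log N comparisons in the worst case.

```python
from typing import NamedTuple


class ScoreEntry(NamedTuple):
    """One row of the score list (hypothetical layout)."""
    score: float
    doc_id: int
    subdoc_id: int


# Entries arrive in document order, not in score order.
entries = [
    ScoreEntry(0.42, 1, 0),
    ScoreEntry(0.91, 1, 1),
    ScoreEntry(0.17, 2, 0),
    ScoreEntry(0.66, 2, 1),
    ScoreEntry(0.58, 2, 2),
]


def rank_by_score(score_list):
    """Return a new list sorted from highest to lowest score (a complete sort)."""
    return sorted(score_list, key=lambda entry: entry.score, reverse=True)


ranked = rank_by_score(entries)
for entry in ranked:
    print(entry.score, entry.doc_id, entry.subdoc_id)
```

Even in this small sample there are five entries for two documents, and every entry must pass through the sort before the highest-scoring subdocument can be identified.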