Digital computers are often used to identify documents which contain particular textual elements such as words or phrases. Typically, an operator searching through a "document base" will be trying to identify those documents in the document base which are directed to particular topics. The topics will be characterized by one or more search queries, each of which contains a word, a phrase of one or more words comprising a text string, a relation of one word in a selected proximity relationship to another word or phrase, or the like.
In a conventional serial computer, a search in response to a query may proceed in a number of ways. In one way, the computer may, for each document in the document base, perform an initial search in which it compares the first word in the query, QW.sub.1, with each successive word DW.sub.i in the document. When the computer finds a word DW.sub.i in the document which matches the first word QW.sub.1 of the query, it proceeds to compare the successive words DW.sub.i+k in the document against the successive words QW.sub.1+k in the query, for each index "k," and if they all compare positively, the document is identified as one which meets the query. On the other hand, if the computer determines that a word DW.sub.i+k in the document does not favorably compare to the corresponding word QW.sub.1+k in the search query, it resumes the initial search with the word DW.sub.i+1 in the document text, that is, the word following the one which gave rise to the previous initial match, comparing it with the first word QW.sub.1 of the query. The computer may perform similar operations if the query requests identification of documents in which a word has a selected proximity relationship to another word or phrase, except that it need not perform the comparison with respect to each subsequent word of the document, but only with respect to words which have the required proximity relationship to the first word of the query.
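The serial scan described above can be sketched as follows. This is a minimal illustration only, assuming that documents and queries are split into words on whitespace and compared without regard to case; the names `document_matches_phrase` and `serial_search` are hypothetical and do not come from any particular system.

```python
def document_matches_phrase(document_words, query_words):
    """Return True if query_words occurs as a contiguous run in document_words."""
    if not query_words:
        return False
    last_start = len(document_words) - len(query_words)
    for i in range(last_start + 1):
        # Initial search: compare document word DW_i with the first query word QW_1.
        if document_words[i] != query_words[0]:
            continue
        # Initial match found: compare DW_(i+k) with QW_(1+k) for each index k.
        if all(document_words[i + k] == query_words[k]
               for k in range(1, len(query_words))):
            return True
        # Mismatch: the loop resumes the initial search at DW_(i+1).
    return False


def serial_search(document_base, query):
    """Scan every document in turn (assumed: whitespace tokens, case-insensitive)."""
    query_words = query.lower().split()
    return [doc_id for doc_id, text in document_base.items()
            if document_matches_phrase(text.lower().split(), query_words)]
```

Because every word of every document is examined for each query, the running time of such a scan grows with the total size of the document base.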
On a conventional serial computer, the search mechanism described above can be very time-consuming, and modern document search and retrieval systems have been developed to speed up searching. Most such systems have document bases which have the same three basic components, namely, a dictionary, an inverted index and a document textbase. The dictionary contains a list of all of the words which may be used in a query, which generally will be all of the words in the texts of all of the documents except for certain short or oft-used words such as articles ("a," "an" and "the"), pronouns, and the like. The inverted index lists the searchable words in, for example, alphabetical order, and accompanying each word are pointers which identify the particular documents which contain the word as well as the locations in each document at which the word occurs. Finally, the document textbase contains the actual text of each document which may be searched. To perform a search, instead of scanning through the documents in word order, the computer locates the pointers for the particular words identified in the query and processes them to identify the documents in which the words have the required order or proximity relationship. After identifying the documents which satisfy the query, the computer may use the pointers from the inverted index to locate in the document textbase the particular documents, and locations in each document, which satisfy the query, and provide them to an operator.
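The three components, and a proximity search that consults only the pointers, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a description of any particular system: the stop-word list is limited to the three articles, words are split on whitespace, word positions count every word including the excluded ones, and the names `build_document_base` and `find_within` are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}  # assumed exclusion list, for illustration only


def build_document_base(documents):
    """Build the dictionary, inverted index and textbase from {doc_id: text}."""
    textbase = dict(documents)
    # inverted_index[word][doc_id] -> list of word positions within that document
    inverted_index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in textbase.items():
        for position, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                inverted_index[word][doc_id].append(position)
    dictionary = sorted(inverted_index)  # searchable words in alphabetical order
    return dictionary, inverted_index, textbase


def find_within(inverted_index, first, second, max_distance):
    """Return {doc_id: positions of `first`} for documents in which `second`
    occurs between 1 and max_distance words after `first`."""
    hits = {}
    first_postings = inverted_index.get(first, {})
    second_postings = inverted_index.get(second, {})
    # Only documents containing both words are considered; no text is scanned.
    for doc_id in first_postings.keys() & second_postings.keys():
        later = set(second_postings[doc_id])
        starts = [p for p in first_postings[doc_id]
                  if any(p + d in later for d in range(1, max_distance + 1))]
        if starts:
            hits[doc_id] = starts
    return hits
```

With `max_distance` set to 1, `find_within` behaves as a two-word phrase search; the returned positions can then be used to locate the matching passages in the textbase.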
In many document retrieval systems, the computer files which contain the components of the document base become quite large, and a significant amount of time may be required to transfer the required portions from secondary storage into main memory for processing.