Digital computers are often used to identify documents which contain particular textual elements such as words or phrases. Typically, an operator searching through a "document base" will be trying to identify those documents in the document base which are directed to particular topics. The topics will be characterized by one or more search queries, each of which contains a word, a phrase of one or more words comprising a text string, a relation of one word in a selected proximity relationship to another word or phrase, or the like.
In a conventional serial computer, a search in response to a query may proceed in a number of ways. In one way, the computer performs, for each document in the document base, an initial search in which it compares the first word of the query, QW.sub.1, with each successive word DW.sub.i in the document. When the computer finds a word DW.sub.i in the document which matches the first word QW.sub.1 of the query, it proceeds to compare the successive words DW.sub.i+k in the document against the successive words QW.sub.1+k in the query, for each index "k," and if they all compare positively, the document is identified as one which meets the query. On the other hand, if the computer determines that a word DW.sub.i+k in the document does not compare favorably to the corresponding word QW.sub.1+k in the query, it resumes the initial search with the word DW.sub.i+1 in the document, that is, the word following the one which gave rise to the previous initial match, comparing it with the first word QW.sub.1 of the query. The computer may perform similar operations if the query requests identification of documents in which a word has a selected proximity relationship to another word or phrase, except that it need not perform the comparison with respect to each subsequent word of the document, but only with respect to words which have the required proximity relationship to the first word of the query.
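The serial phrase search described above can be illustrated with a minimal sketch. It assumes that documents and queries are plain strings split into words on whitespace and compared without regard to case; the names contains_phrase, search and document_base are hypothetical and chosen only for illustration.

```python
def contains_phrase(doc_words, query_words):
    """Return True if query_words occurs as a contiguous run in doc_words."""
    n, m = len(doc_words), len(query_words)
    if m == 0:
        return False
    for i in range(n - m + 1):
        # Initial search: compare the first query word with document word i.
        if doc_words[i] != query_words[0]:
            continue
        # Initial match found: compare the successive words of both sequences.
        k = 1
        while k < m and doc_words[i + k] == query_words[k]:
            k += 1
        if k == m:
            return True
        # Mismatch: the outer loop resumes the initial search at word i + 1.
    return False


def search(document_base, query):
    """Scan every document in turn and return the ids of those that match."""
    query_words = query.lower().split()
    return [
        doc_id
        for doc_id, text in document_base.items()
        if contains_phrase(text.lower().split(), query_words)
    ]
```

In this sketch every document is scanned word by word for every query, so the work grows with the product of the total text length and the query length, which is why the mechanism is slow on a serial computer.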
On a conventional serial computer, the search mechanism described above can be very time-consuming, and other mechanisms have been developed to speed up searching. In one well-known arrangement, the computer initially establishes inverted index files, in which the words which can be searched are listed in, for example, alphabetical order. Accompanying each word are pointers which identify the particular documents which contain the word, as well as the locations in each document at which the word occurs. To perform a search, instead of scanning through the documents in word order, the computer locates the pointers for the particular words identified in the query and processes them to identify the documents in which the words have the required order or proximity relationship. This mechanism is generally faster than the previously described serial document search mechanism, but it can still be slow.
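An inverted index of this kind can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular system: the index is held in memory as a mapping from each word to the documents containing it and the word positions within each document, words are split on whitespace and lowercased, and the names build_index and phrase_search are hypothetical.

```python
from collections import defaultdict


def build_index(document_base):
    """Map each word to {doc_id: [positions at which the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in document_base.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents in which the query words occur consecutively."""
    words = query.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    # Only documents containing the first query word are candidates.
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # The k-th query word must occur at position start + k.
            if all(start + k in index[w].get(doc_id, ()) for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits
```

Only the pointer lists for the query words are examined, so documents that lack the first query word are never touched. A proximity query could be handled in the same way by accepting any position within the required distance rather than exactly start + k.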
More recently, massively parallel computers have been developed which incorporate a large number of processing elements which perform processing generally in parallel. In an adaptation of the serial document search mechanism, the document base is divided among the processing elements, and a control processor broadcasts the words of the query to the processing elements. The processing elements perform the comparison operations in a manner similar to that described above with respect to the serial computer. This mechanism is feasible when the queries are short or the query rate, that is, the number of queries to be processed per unit time, is relatively small, but it can become unwieldy for long queries or if the query rate is large. If the amount of text in the document base is small, it may be more efficient to store the queries in the processing elements and broadcast the document base to the processing elements. However, neither of these mechanisms is efficient when both the document base and the number of queries are large.
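The first of these arrangements, in which the document base is divided among the processing elements and the query is broadcast to them, can be approximated in software as follows. This sketch only simulates the arrangement: a thread pool stands in for the processing elements, documents are dealt out round-robin, and the names partition, local_search and broadcast_search are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matches(doc_words, query_words):
    """Return True if query_words occurs as a contiguous run in doc_words."""
    m = len(query_words)
    return m > 0 and any(
        doc_words[i:i + m] == query_words
        for i in range(len(doc_words) - m + 1)
    )


def partition(document_base, num_elements):
    """Divide the document base among the simulated processing elements."""
    shards = [dict() for _ in range(num_elements)]
    for n, (doc_id, text) in enumerate(document_base.items()):
        shards[n % num_elements][doc_id] = text.lower().split()
    return shards


def local_search(shard, query_words):
    """Work done by one processing element on its own share of documents."""
    return [doc_id for doc_id, words in shard.items() if matches(words, query_words)]


def broadcast_search(shards, query):
    """Send the same query words to every element and merge the results."""
    query_words = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = list(pool.map(local_search, shards, [query_words] * len(shards)))
    return sorted(doc_id for part in parts for doc_id in part)
```

Each query still requires a full pass over every shard, which reflects why this arrangement becomes unwieldy as the query length or query rate grows.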