Information retrieval systems, such as search engines, run queries against an index of documents generated from a document corpus (e.g., the World Wide Web). A typical inverted index includes the words in each document, together with pointers to their locations within the documents. A document processing system prepares the inverted index by processing the contents of the documents, pages or sites retrieved from the document corpus using an automated or manual process. The document processing system may also store the contents of the documents, or portions of the content, in a repository for use by a query processor when responding to a query.
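By way of illustration only, the inverted index described above can be sketched as a mapping from each word to the documents that contain it and the positions at which it occurs. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample corpus are assumptions made for this sketch; they are not taken from any particular system.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each word to {doc_id: [word positions within that document]}.

    `corpus` is a hypothetical dict of doc_id -> document text.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for position, word in enumerate(text.lower().split()):
            # Record where the word occurs so a query can locate it later.
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Return the sorted ids of documents that contain `word`."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical two-document corpus.
corpus = {
    "doc1": "fresh results matter",
    "doc2": "stale results frustrate users",
}
index = build_inverted_index(corpus)
```

In this sketch, a query for a word is a single dictionary lookup, which is why the index, rather than the raw documents, is consulted when responding to a query.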
In some information retrieval systems, freshness of the results (i.e., the turnaround from when a document is updated to when the updated document is available to queries) is an important consideration. However, there are several obstacles to providing fresh results. One obstacle is the expense or overhead associated with rebuilding the document index each time the document repository is updated. For example, significant overhead is often associated with building small indexes from new and updated documents and periodically merging the small indexes with a main index. Furthermore, such systems typically suffer long latencies between a document update and the availability of that document in the repository index. A second obstacle is the difficulty of continuously processing queries against the document repository while updating the repository, without incurring large overhead. One aspect of this second obstacle is the need to synchronize access to key data structures in the document repository between the threads that execute queries and the threads that update the repository. This need for synchronization can present a significant obstacle to efficient operation of the document repository if document updates are performed frequently, which in turn is a barrier to maintaining freshness of the document repository.
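The merge overhead mentioned above can be illustrated with a minimal sketch. It assumes each index is a plain dictionary of word -> {doc_id: [positions]}, and the function name `merge_small_index` and its parameters are hypothetical. The point of the sketch is that replacing even one updated document requires a pass over every entry of the main index to discard that document's stale postings.

```python
def merge_small_index(main_index, small_index, updated_doc_ids):
    """Return a new index combining `main_index` with `small_index`.

    Both indexes are assumed to map word -> {doc_id: [positions]}.
    `updated_doc_ids` is the set of documents re-indexed in `small_index`;
    their old postings in the main index are stale and must be dropped.
    """
    merged = {}
    # Full pass over the main index: the cost grows with the size of the
    # main index, not with the number of updated documents.
    for word, postings in main_index.items():
        kept = {
            doc_id: positions
            for doc_id, positions in postings.items()
            if doc_id not in updated_doc_ids
        }
        if kept:
            merged[word] = kept
    # Add the fresh postings from the small index.
    for word, postings in small_index.items():
        merged.setdefault(word, {}).update(postings)
    return merged


# Hypothetical data: doc1 was rewritten from "fresh results" to "new results".
main_index = {
    "fresh": {"doc1": [0]},
    "results": {"doc1": [1], "doc2": [1]},
}
small_index = {
    "new": {"doc1": [0]},
    "results": {"doc1": [1]},
}
merged = merge_small_index(main_index, small_index, {"doc1"})
```

Until such a merge completes, queries against the main index in this sketch would still see the stale postings for the updated document, which corresponds to the latency described above.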