Index-based search systems have been developed to facilitate the identification of desired information within increasingly large quantities of electronic data. An index-based search of millions of electronic documents may produce a search result within a time that is noticeable but acceptable, e.g., a few seconds. A search result identifies relevant documents, and may also contain metadata such as a title and a description of the relevant documents, as well as snippets of relevant content.
The search index of an index-based search system typically contains structured files that store the contents of documents within field-value pairs. Such an index is often referred to as an “inverted index” because each record thereof contains a mapping from a unique word (i.e., a field) to a list (i.e., values) of document identifiers of all documents in the repository that contain that word. During operation, a search engine receives search terms, queries the search index to determine identifiers of documents which contain the search terms, and returns the document identifiers in an order based on their relevance to the search terms.
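By way of illustration only, the word-to-identifier mapping described above may be sketched as follows. This is a minimal sketch under stated assumptions: the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample documents are hypothetical, and relevance ranking is omitted, so matching identifiers are simply returned in ascending order.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each unique word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Hypothetical tokenization: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, terms):
    """Return IDs of documents that contain every search term (no ranking)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical three-document repository keyed by document identifier.
docs = {
    0: "fast index search",
    1: "content store files",
    2: "search index update",
}
index = build_inverted_index(docs)
```

In this sketch, `search(index, ["search", "index"])` intersects the identifier lists for the two words, which is why a query can be answered without opening any document.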
An inverted index is optimized for quick response. For example, a search index may return a list of document identifiers in response to a search query within an imperceptible time, e.g., tens of milliseconds. Once the list is returned, the identified documents are opened to retrieve the metadata and other fields (e.g., description, content snippets) to be returned in the search result. Opening the documents and retrieving information therefrom is by far the most time-consuming component of the search process.
In order to increase the speed at which this information is retrieved, some search systems store a blueprint of each stored document in a set of flat files, known as a content store. Using a content store, document information may be retrieved in tens of milliseconds. Document identifiers stored in the search index are used to reference corresponding document information from the content store.
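One possible arrangement of such a content store may be sketched as follows, for illustration only. The sketch assumes a single flat file of newline-delimited JSON records and a table of byte offsets indexed by document identifier; the record layout, the in-memory buffer standing in for a file, and the names `build_content_store` and `fetch` are all hypothetical.

```python
import io
import json


def build_content_store(records):
    """Write one JSON record per line; offsets[doc_id] is that record's byte position."""
    buffer = io.BytesIO()  # stands in for a flat file on disk
    offsets = []
    for record in records:  # records[i] is assumed to belong to document ID i
        offsets.append(buffer.tell())
        buffer.write(json.dumps(record).encode("utf-8") + b"\n")
    return buffer, offsets


def fetch(buffer, offsets, doc_id):
    """Seek directly to the record for doc_id and decode it."""
    buffer.seek(offsets[doc_id])
    return json.loads(buffer.readline())


# Hypothetical document information, ordered by document identifier.
records = [
    {"title": "Quarterly report", "description": "Sales figures"},
    {"title": "Design notes", "description": "Index layout"},
]
store, offsets = build_content_store(records)
```

Under these assumptions, retrieval of document information requires one seek and one read of the flat file, and no document in the repository is opened.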
Conventional systems require complete regeneration of a content store every time the search index is updated. In this regard, the document identifiers in the search index may change during an update to the search index, rendering the references in the content store invalid. By completely regenerating the entire content store each time the search index is updated, the content store references are always valid.
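The manner in which an index update may invalidate content store references can be illustrated with a small sketch. It assumes, purely for illustration, that document identifiers are assigned by position in a sorted listing of the repository; the function `assign_ids` and the file names are hypothetical, and actual search indexes may assign identifiers differently.

```python
def assign_ids(paths):
    """Assign sequential document IDs by sorted position (an assumed scheme)."""
    return {path: doc_id for doc_id, path in enumerate(sorted(paths))}


# Search index and content store generated from three documents.
old_ids = assign_ids(["a.txt", "b.txt", "c.txt"])
content_store = {doc_id: f"summary of {path}" for path, doc_id in old_ids.items()}

# The search index is updated after a.txt is deleted; remaining IDs shift down.
new_ids = assign_ids(["b.txt", "c.txt"])

# Looking up b.txt by its new ID in the stale content store returns a.txt's entry.
stale = content_store[new_ids["b.txt"]]
```

In this sketch, `b.txt` moves from identifier 1 to identifier 0, so the content store entry referenced by the updated identifier belongs to a different document until the content store is regenerated.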
Moreover, the content store files are typically sequential in order to facilitate fast retrieval of information therefrom. However, using conventional techniques, updating the content store files based only on search index updates while maintaining the sequential nature of the files is often more time- and resource-consuming than simply regenerating the entire content store based on the entire search index.
Completely regenerating a content store in response to search index updates results in a significant delay between changes to a document repository and reflection of the changes within the content store. This delay is exacerbated over time because the time required to generate a content store scales poorly as the number of documents in the search index increases.
Systems are desired to efficiently update a content store based on changes to a search index. For example, systems are desired to efficiently update a content store in response to the addition, deletion and/or modification of document identifiers within a search index.