As the amount of information stored in computer memory continues to grow, the ability to efficiently search through and find relevant documents becomes more and more important. The World Wide Web, for example, provides access to millions of documents covering just about every topic imaginable. Thus, when searching the Web for information, a challenge in computer science is to efficiently find those documents that the searcher is interested in.
Many of the most popular search techniques utilize indexing to keep track of documents on a network (such as the World Wide Web). Indexing is the process by which search engines extract information from a document repository so that its content can later be searched. In effect, a document repository index is a local snapshot in time of the repository's content. The index can then be quickly searched to find the most relevant documents in the repository for a given search query. Since the information in the document repository may be constantly changing, it is important to frequently update the index so that the snapshot is as current and accurate as possible.
There are many indexing techniques known in the art. An inverted index is the indexing technique of choice for most web documents. Search engines use an inverted index for HyperText Markup Language (HTML) documents, and Database Management Systems (DBMS) use it to support containment queries in eXtensible Markup Language (XML) documents. An inverted index is a collection of inverted lists, where each list is associated with a particular word. The inverted list for a given word is a collection of the document IDs of those documents that contain the word. If the position of a word occurrence in a document is needed, each document ID entry in the inverted list also contains a list of location IDs, where the location ID of a word is the position in the document at which the word occurs. Positional information of words relative to the document is usually stored because it is needed for proximity queries and query result ranking; omitting it from the inverted index is therefore a serious limitation. An entry in an inverted file is also called a posting, and it encodes the information in a tuple (word_id, doc_id, loc_id).
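The structure described above can be illustrated with a minimal Python sketch. It is not taken from any particular search engine: the function names (build_inverted_index, postings, within_distance), whitespace tokenization, and the use of the word string itself in place of a numeric word_id are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {doc_id: [loc_id, ...]} (one inverted list per word)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: words are lowercase whitespace-separated tokens,
        # and loc_id is the zero-based token position in the document.
        for loc_id, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(loc_id)
    return index


def postings(index):
    """Yield every posting as a (word, doc_id, loc_id) tuple."""
    for word in sorted(index):
        for doc_id in sorted(index[word]):
            for loc_id in index[word][doc_id]:
                yield (word, doc_id, loc_id)


def within_distance(index, first, second, max_gap):
    """Proximity query: documents where both words occur within max_gap positions."""
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    hits = []
    for doc_id in sorted(docs_first.keys() & docs_second.keys()):
        if any(abs(a - b) <= max_gap
               for a in docs_first[doc_id]
               for b in docs_second[doc_id]):
            hits.append(doc_id)
    return hits


documents = {
    1: "the quick brown fox",
    2: "the fox jumps over the quick dog",
}
index = build_inverted_index(documents)
print(index["fox"])                               # inverted list for "fox"
print(within_distance(index, "quick", "fox", 2))  # proximity query
```

The proximity query is only answerable because each document ID entry carries its location IDs; an index holding document IDs alone could say that both words appear in a document, but not how close together they are.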
Since web documents change frequently, keeping inverted indexes up to date is crucial to making the most recently indexed documents searchable. A crawler (also referred to as a spider) is a program that collects web documents to be indexed. It has been shown that an in-place, incremental crawler can improve the freshness of the inverted index. However, the index rebuild method commonly used to update the inverted index cannot take advantage of an incremental crawler, because updated documents crawled between rebuilds must wait until the next index rebuild before they become searchable.
One solution known in the art for keeping the index of a document repository up to date is to rebuild the index more frequently. As the interval between rebuilds gets smaller, however, the magnitude of change between two successive snapshots of the indexed collection also becomes smaller. A large portion of the inverted index remains unchanged from one rebuild to the next, so a large portion of the work done by each rebuild is redundant. A frequent rebuild solution is therefore inefficient because it rebuilds portions of the index that did not change.
Another approach known in the art is to store the updates that arrive between rebuilds in a searchable update log. This is similar to a ‘stop-press’ technique used to store the postings of documents that need to be inserted into the indexed collection. Each entry in this update log is either a delete posting operation or an insert posting operation. Query processing requires searching both the inverted index and the update log and merging the results of both. If positional information is stored in each posting, which is often the case, the size of the update log can become prohibitively large. Thus, the update log technique is generally unsatisfactory because the log grows too large and degrades query response time.
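The query-time merge described above can be sketched as follows. This is a hedged illustration only: the function name search_with_log, the in-memory list used as the log, and the (op, word, doc_id, loc_id) entry layout are assumptions, not a format defined by any known system.

```python
def search_with_log(word, index, update_log):
    """Return {doc_id: [loc_id, ...]} for `word`, merging the index with the log."""
    # Start from the (possibly stale) inverted list held in the main index.
    merged = {doc_id: set(locs) for doc_id, locs in index.get(word, {}).items()}
    # Replay every log entry in order; the whole log is scanned on each query.
    for op, log_word, doc_id, loc_id in update_log:
        if log_word != word:
            continue
        locs = merged.setdefault(doc_id, set())
        if op == "insert":
            locs.add(loc_id)
        elif op == "delete":
            locs.discard(loc_id)
    # Documents whose postings were all deleted drop out of the result.
    return {doc_id: sorted(locs) for doc_id, locs in sorted(merged.items()) if locs}


# Hypothetical index snapshot from the last rebuild: word -> {doc_id: [loc_id, ...]}.
stale_index = {"fox": {1: [3], 2: [1]}, "quick": {1: [1]}}

# Updates crawled since the rebuild: document 2 lost "fox", document 3 is new.
update_log = [
    ("delete", "fox", 2, 1),
    ("insert", "fox", 3, 0),
    ("insert", "fox", 3, 5),
]

print(search_with_log("fox", stale_index, update_log))
print(search_with_log("quick", stale_index, update_log))
```

Note that the log holds one entry per word occurrence rather than one per document, and that every query replays the log in addition to reading the index. Both effects scale with the volume of updates, which is the size and response-time concern raised above.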