With the advent of the Internet, search engines were created to assist users in locating information among the millions of documents, mostly web pages, available on the Internet. Because of the number of documents available for searching on the Internet, many search engines use distributed databases to store the search index. A distributed database stores records on many different machines that do not share a common central processing unit. Search engines also often use an inverted index, or posting list, to locate documents responsive to a query. An inverted index is an index that maps content to locations (e.g., an index that stores a mapping of words to the locations of the documents that contain the words) rather than mapping locations to content. For example, a library may contain many books, and each book contains words. One way to digitize the library is to store the words by book (a forward index), so that the words are grouped by book. Another way to digitize the library is to store the books by word (an inverted index), so that the books are grouped by word, similar to a concordance.
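The library example can be illustrated with a minimal sketch. The book names, texts, and function names below are hypothetical and chosen only for illustration; a real search index would also handle tokenization, positions, and storage, which are omitted here.

```python
from collections import defaultdict

# Hypothetical toy library: book identifier -> text of the book.
library = {
    "book_a": "the quick fox",
    "book_b": "the lazy dog",
}


def build_forward_index(docs):
    """Group words by book: each book maps to the words it contains."""
    return {book: text.split() for book, text in docs.items()}


def build_inverted_index(docs):
    """Group books by word: each word maps to a posting list of books."""
    postings = defaultdict(set)
    for book, text in docs.items():
        for word in text.split():
            postings[word].add(book)
    # Sort each posting list so the result is deterministic.
    return {word: sorted(books) for word, books in postings.items()}


forward_index = build_forward_index(library)
inverted_index = build_inverted_index(library)
```

In this sketch, answering "which books contain the word fox" is a single lookup in `inverted_index`, whereas the forward index would have to be scanned book by book.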
One difficulty in maintaining an inverted index (i.e., a posting list) occurs when a non-key value (e.g., the book) is removed from the index. The difficulty occurs because the entire index (every posting list) must be searched to find all occurrences of the non-key value. A large distributed database with frequent updates exacerbates this problem. For example, in a large distributed database, the posting list for a term (e.g., a word or phrase) may be stored on many different machines, especially for common terms. The system may know which machines store each key value (for example, each word), but the system does not know which non-key values (for example, books) are stored on each machine.
Consequently, in a large distributed database, deleting non-key values from the inverted index consumes a large amount of resources. This occurs because the distributed database index knows where key values are stored, but does not know where each non-key value, such as a particular book, is stored without retrieving each posting list for each key value. Thus, for example, deleting a book from the library may result in the system sending a request to each machine in the distributed database to delete the book, if it exists, from any of the posting lists that the machine stores. While this approach may work for a fairly small, static library, it is too slow and consumes too many resources for a large library with millions, or even billions, of deletions per day, as is the case for documents on the Internet.
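The cost of this broadcast approach can be made concrete with a small simulation. This is only a sketch under stated assumptions: the three-machine cluster, its contents, and the names `machines`, `term_locations`, and `delete_book_naive` are hypothetical, and in-memory dictionaries stand in for networked machines.

```python
# Hypothetical cluster: each machine stores posting lists for some terms.
# A common term such as "the" is spread across more than one machine.
machines = {
    "m1": {"the": ["book_a", "book_b"], "fox": ["book_a"]},
    "m2": {"the": ["book_c"], "dog": ["book_b"]},
    "m3": {"cat": ["book_c"]},
}

# The system knows which machines store each key value (term)...
term_locations = {}
for name, shard in machines.items():
    for term in shard:
        term_locations.setdefault(term, []).append(name)
# ...but it keeps no mapping from a book to the machines that mention it.


def delete_book_naive(cluster, book):
    """Send a delete request to every machine; each scans all its lists."""
    requests_sent = 0
    lists_scanned = 0
    for shard in cluster.values():
        requests_sent += 1
        # Copy the keys so emptied posting lists can be dropped while looping.
        for term in list(shard):
            lists_scanned += 1
            if book in shard[term]:
                shard[term].remove(book)
            if not shard[term]:
                del shard[term]
    return requests_sent, lists_scanned


stats = delete_book_naive(machines, "book_a")
```

Here the deletion touches all three machines and all five posting lists, even though the book appears in only two lists on one machine. The work grows with the size of the whole index rather than with the size of the deleted document.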