Search engines assist users in locating information found in a collection of data, including, for example, web pages, PDFs, word processing documents, images, and other types of files. Such files may generally be referred to as documents. In order to quickly and effectively search the various documents for relevant information, search engines may index the contents of the documents and use the index to respond to search queries. The index may be included as part of a repository, which is a processed, encoded, and indexed version of a collection of data. A repository for a large collection of data, such as the Internet, may include billions of documents. Thus, some repositories are distributed amongst many machines, dividing the repository into smaller, more manageable pieces. The index for the repository may be an inverted index that is composed of posting lists, with each posting list representing a term and the documents that contain the term. To represent a document, the posting list typically uses a document identifier or other pointer to the document. The posting list can also include other information, such as a position at which the term appears in the document, or other information about the terms or documents. The index may also include metadata for the posting lists and documents.
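The inverted index described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercased terms, and integer document identifiers. The names `build_inverted_index` and `search` are hypothetical and chosen only for illustration; they do not describe any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a posting list of (doc_id, [positions]) pairs.

    `documents` is assumed to be a dict of {doc_id: text}.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            # Record every position at which the term occurs in this document.
            index[term].setdefault(doc_id, []).append(position)
    # Store each posting list sorted by document identifier.
    return {term: sorted(postings.items()) for term, postings in index.items()}


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    result = None
    for term in query.lower().split():
        doc_ids = {doc_id for doc_id, _ in index.get(term, [])}
        result = doc_ids if result is None else result & doc_ids
    return sorted(result) if result else []


documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dog",
}
index = build_inverted_index(documents)
```

In this sketch, `index["quick"]` is the posting list for the term "quick": one entry per document containing the term, each holding the document identifier and the positions of the term within that document. A multi-term query is answered by intersecting the document identifiers of the relevant posting lists.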
In a distributed environment, the index may be divided amongst many machines. The division may be by document or by term. An index divided by document, or document-sharded index, minimizes network traffic between the computing devices but increases input/output (I/O) operations. An index divided by term, or term-sharded index, optimizes I/O operations, but increases network traffic. The purpose and size of the index may generally determine whether an index is term-sharded or document-sharded.
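The trade-off between the two divisions can be sketched as follows. This is a simplified, single-process illustration and not a distributed implementation: each "shard" is a plain dictionary, integer document identifiers are assigned to shards by modulo, and terms are assigned by a CRC32 hash. The function names, the assignment rules, and the counters `lookups` and `shipped` are all assumptions made for the example. `lookups` stands in for posting-list reads (I/O), and `shipped` stands in for posting entries that would have to cross the network to be intersected.

```python
import zlib
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def shard_by_document(documents, num_shards):
    """Document-sharded: each shard indexes only its own documents."""
    groups = [dict() for _ in range(num_shards)]
    for doc_id, text in documents.items():
        groups[doc_id % num_shards][doc_id] = text
    return [build_index(group) for group in groups]


def term_shard_of(term, num_shards):
    """Pick the shard that owns a term (deterministic hash, illustrative)."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def shard_by_term(documents, num_shards):
    """Term-sharded: each shard holds the complete posting list of its terms."""
    shards = [dict() for _ in range(num_shards)]
    for term, doc_ids in build_index(documents).items():
        shards[term_shard_of(term, num_shards)][term] = doc_ids
    return shards


def query_document_sharded(shards, terms):
    """Every shard reads every term locally; only final hits are returned."""
    lookups, hits = 0, set()
    for shard in shards:
        local = None
        for term in terms:
            lookups += 1  # one posting-list read per term on every shard
            ids = shard.get(term, set())
            local = ids if local is None else local & ids
        hits |= local or set()
    return sorted(hits), lookups


def query_term_sharded(shards, terms):
    """Each term is read once, but its posting list is shipped for merging."""
    lookups, shipped, hits = 0, 0, None
    for term in terms:
        lookups += 1  # one posting-list read per term in total
        ids = shards[term_shard_of(term, len(shards))].get(term, set())
        shipped += len(ids)  # entries sent to wherever the lists are merged
        hits = ids if hits is None else hits & ids
    return sorted(hits or set()), lookups, shipped


documents = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick brown dog"}
terms = ["quick", "brown"]
doc_hits, doc_lookups = query_document_sharded(shard_by_document(documents, 2), terms)
term_hits, term_lookups, term_shipped = query_term_sharded(shard_by_term(documents, 2), terms)
```

Both strategies return the same documents for the query. In the document-sharded case the number of posting-list reads grows with the number of shards, since each shard performs its own lookup for every term, while the intersection stays local to each shard. In the term-sharded case each term is read once, but whole posting lists must be brought together before they can be intersected.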