The present disclosure relates to techniques for organizing and retrieving information and, more particularly, to indexing of electronic documents to facilitate search query processing.
With the proliferation of electronic documents and communications, there has been an increased need to assist users in finding relevant documents. A search engine can scan documents in a corpus and extract text; however, real-time scanning of a large corpus of documents is impractical. Accordingly, it is now common practice for search engines to scan a corpus and create an index, i.e., a condensed representation of document content that can be readily searched. A typical indexing process involves creating a “forward” index, in which each document is associated with a list of words that appear in the document, then processing the forward index to create an “inverted” index, in which each word is associated with a list of documents that contain that word. (The inverted index is usually condensed using hashing techniques or the like to reduce storage requirements and also to facilitate locating a given word.) The inverted index is most often used as the starting point for processing search queries where the user specifies a particular word or words to be found.
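By way of illustration only, the relationship between a forward index and an inverted index can be sketched as follows. The function names (build_forward_index, invert, search), the whitespace tokenization, and the three-document sample corpus are assumptions made for this example and are not taken from any particular search engine; the hashing-based condensation mentioned above is omitted.

```python
from collections import defaultdict


def build_forward_index(corpus):
    """Forward index: each document ID maps to the distinct words it contains."""
    # Tokenization by lowercasing and splitting on whitespace is an assumption.
    return {doc_id: sorted(set(text.lower().split()))
            for doc_id, text in corpus.items()}


def invert(forward_index):
    """Inverted index: each word maps to the document IDs that contain it."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word in forward_index[doc_id]:
            inverted[word].append(doc_id)
    return dict(inverted)


def search(inverted_index, words):
    """Return the IDs of documents that contain every query word."""
    postings = [set(inverted_index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample corpus used only for this sketch.
corpus = {
    "doc1": "search engines scan documents",
    "doc2": "an index speeds up search",
    "doc3": "documents are assigned to an index",
}

forward = build_forward_index(corpus)
inverted = invert(forward)
```

In this sketch, a query is answered by looking up each query word in the inverted index and intersecting the resulting document lists, so no document text is rescanned at query time.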
The size of the index can become a limiting factor in both indexing and search processes. For example, the time needed to invert a forward index will generally scale with the number of documents or number of words. The time needed to search an index will also increase with the number of words and/or documents in the index.
One way to speed up indexing and search processes is to provide multiple indexes and to assign different documents to different indexes. Index construction and search processes can then be performed in parallel on multiple smaller indexes, resulting in faster performance. In systems where multiple search indexes are used, a given document can be randomly or arbitrarily assigned to one of the indexes.
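By way of illustration only, random assignment of documents to multiple indexes, with construction and search performed across the indexes concurrently, might be sketched as follows. The function names, the fixed random seed, the sample corpus, and the use of a thread pool are all assumptions made for this example; in CPython a thread pool mainly illustrates the structure, and a production system would more likely distribute the indexes across processes or machines.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def assign_to_shards(doc_ids, num_shards, seed=0):
    """Randomly assign each document ID to exactly one of num_shards indexes."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sketch repeatable
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[rng.randrange(num_shards)].append(doc_id)
    return shards


def build_inverted_index(corpus, doc_ids):
    """Build an inverted index over only the documents assigned to one shard."""
    index = defaultdict(set)
    for doc_id in doc_ids:
        for word in corpus[doc_id].lower().split():
            index[word].add(doc_id)
    return dict(index)


def build_sharded_indexes(corpus, num_shards, seed=0):
    """Build one smaller inverted index per shard, submitting the builds to a pool."""
    shards = assign_to_shards(sorted(corpus), num_shards, seed)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda ids: build_inverted_index(corpus, ids), shards))


def search_all(indexes, word):
    """Query every index for the word and merge the per-index results."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda idx: idx.get(word.lower(), set()), indexes)
        return sorted(set().union(*partial))


# Hypothetical sample corpus used only for this sketch.
corpus = {
    "doc1": "search engines scan documents",
    "doc2": "an index speeds up search",
    "doc3": "documents are assigned to an index",
}

indexes = build_sharded_indexes(corpus, num_shards=2)
```

Because a document containing a given word may have been assigned to any of the indexes, this sketch must query every index and merge the results.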