The World Wide Web (“web”) contains a vast amount of information that is ever-changing. Existing web-based information retrieval systems use web crawlers to identify information on the web. For example, a web crawler may receive feeds of documents from webmasters.
A web crawler may also exploit the link-based structure of the web to browse the web in a methodical, automated manner. A web crawler may start with a list of addresses (e.g., Uniform Resource Locators (URLs)) to visit. For each address on the list, the web crawler may visit the document associated with the address. The web crawler may identify outgoing links within the visited document and add the addresses associated with these links to the list of addresses to visit.
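The link-following process described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of any particular crawler: the names `crawl`, `LinkExtractor`, `fetch`, `max_documents`, and `TOY_WEB`, and the `example.test` addresses, are hypothetical, and the "web" here is a small in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_documents=100):
    """Visit documents in first-in, first-out order, starting from seed_urls.

    fetch is a caller-supplied function (an assumption of this sketch) that
    maps a URL to the document's HTML, or returns None if it is unavailable.
    Returns the URLs of the documents actually visited, in visit order.
    """
    unique_seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds
    to_visit = deque(unique_seeds)   # the list of addresses to visit
    seen = set(unique_seeds)         # every address ever added to the list
    visited = []

    while to_visit and len(visited) < max_documents:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue  # the document could not be retrieved; skip it
        visited.append(url)

        # Identify outgoing links and add unseen addresses to the list.
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                to_visit.append(target)

    return visited


# Hypothetical three-document web; /d is linked to but does not exist.
TOY_WEB = {
    "http://example.test/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "http://example.test/b": '<a href="/a">a</a> <a href="/c">c</a>',
    "http://example.test/c": '<a href="/d">missing</a>',
}

visited = crawl(["http://example.test/a"], TOY_WEB.get)
print(visited)
```

The `seen` set keeps the crawler from adding the same address twice, which matters because documents on the web frequently link back to one another. A production crawler would also need politeness delays, robots.txt handling, and error recovery, none of which are shown here.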
An indexer creates an index of the documents identified by the web crawler. A problem that indexers face is how to select documents to place in the index. The amount of space in the index is limited. Also, some documents might not be worth the cost (in money and/or time) of indexing and serving. Therefore, only a subset of the documents identified by the web crawler is placed in the index.