Developing a search engine that can index a large and diverse collection of documents, yet return to a user a short, relevant list of result documents in response to a query, has long been recognized as a difficult problem. Various metrics of document relevance have been developed in an attempt to solve this problem. One class of such metrics is the query-independent metrics. These metrics represent the relative importance or relevance of a document to a user independent of any query submitted. Examples of query-independent metrics include, but are not limited to, simple criteria based on intrinsic properties of the document itself (e.g., the length of the document), ad-hoc rules for assigning relevance based on the preassigned authority of a hosting site, and automatic determinations of relevance based on extrinsic information about the document. An example of an automatic relevance criterion based on extrinsic information is PageRank, described in detail in U.S. Pat. No. 6,285,999, hereby incorporated by reference in its entirety.
One goal of search engine design is to index documents in such a way that the list of documents returned in response to a query is ordered, at least approximately, by decreasing relevance. This task is made easier if the list of documents is already ordered by decreasing query-independent relevance. For computational efficiency, it is desirable that the internal representation of documents in the index reflect such an ordering. In this way, the list of documents returned to the user will contain the most highly relevant documents (as measured by a query-independent relevance metric), even when only the first few matching documents in the index are returned. Extracting only the first few documents from the index has advantages in computational efficiency, a critical factor when hundreds of millions of queries are served per day.
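The efficiency argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the index maps each term to a list of integer document tags sorted in ascending order, and that a smaller tag denotes higher query-independent relevance. The names `intersect` and `first_k_matches` are hypothetical and are not taken from any system described here.

```python
from functools import reduce
from itertools import islice


def intersect(a, b):
    """Lazily yield the tags present in both ascending sequences a and b."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)


def first_k_matches(postings, terms, k):
    """Return the first k tags that contain every query term.

    Assumption: each posting list is sorted by tag, and tag order mirrors
    decreasing query-independent relevance, so the first k matches are
    also the k most relevant matches under that metric.
    """
    lists = [postings.get(term, []) for term in terms]
    if not lists:
        return []
    stream = reduce(intersect, lists)
    # islice stops pulling from the lazy intersection after k results,
    # so the tails of the posting lists are never scanned.
    return list(islice(stream, k))


# Hypothetical toy index: term -> ascending document tags.
postings = {
    "search": [0, 1, 3, 4, 7],
    "engine": [1, 2, 4, 7, 9],
}
print(first_k_matches(postings, ["search", "engine"], 2))  # [1, 4]
```

Because the intersection is evaluated lazily, the work done depends on how far into the posting lists the first k matches lie rather than on the total size of the lists.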
In search engine systems that retrieve (“crawl”) and evaluate the entire contents of a collection of documents before building an index, the index is readily assembled to return documents in order of decreasing query-independent relevance. Some indexes employ an internal representation of a particular document, referred to as a document identification tag. In some systems, the document identification tags are integers. By examining the query-independent relevance of a document relative to the collection of documents prior to the assignment of a document identification tag to the document, it is possible to assign a document identification tag that encodes this information. For example, assuming sufficient computational resources, the entire collection of documents could be sorted in order of decreasing query-independent relevance and document identification tags assigned in sequential order to documents in the sorted list.
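The sort-then-number approach described above can be sketched as follows. This is a minimal illustration that assumes the whole collection and its query-independent scores fit in memory. The function name `assign_doc_ids`, the use of URLs as document keys, and the tie-breaking rule are assumptions made for the example.

```python
def assign_doc_ids(scores):
    """Assign sequential integer tags in order of decreasing score.

    scores: dict mapping a document key (here a URL string) to its
    query-independent relevance score. The most relevant document
    receives tag 0. Ties are broken by key so the result is deterministic
    (an assumption of this sketch, not a requirement of the text).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {key: doc_id for doc_id, (key, _score) in enumerate(ranked)}


# Hypothetical scores for three documents.
scores = {"a": 0.2, "b": 0.9, "c": 0.5}
print(assign_doc_ids(scores))  # {'b': 0, 'c': 1, 'a': 2}
```

Note that this sketch requires every score to be known before any tag is assigned, which is exactly the constraint that a completed crawl satisfies and an in-progress crawl does not.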
However, as the number of documents on the Internet grows, the delay between the time a page is crawled by a robot and the time it can be indexed and made available to a search engine grows ever longer. Likewise, it takes ever longer to replace or update a page once it has been indexed. Therefore, what is needed in the art are systems and methods for crawling and indexing web pages that reduce the latency between the time when a web page is either posted or updated on the Internet and the time when a representation of the new or updated web page is indexed and made available to a search engine.
Given the above background, it is desirable to devise a system and method for assigning document identification tags to documents to be indexed before retrieval of the entire contents of a collection of documents is complete. Furthermore, it is desirable to devise systems and methods for assigning document identification tags before a crawl is complete in such a way that the document identification tags encode information about the query-independent relevance of the document relative to the collection of documents.