Search engines provide a powerful source of indexed documents from the Internet that can be rapidly scanned. However, as the number of documents on the Internet grows, the delay between the time a page is crawled by a robot and the time it is indexed and made available to a search engine grows ever longer. Furthermore, it takes ever longer to replace or update a page once it has been indexed. Therefore, what is needed in the art are systems and methods for crawling and indexing web pages that reduce the latency between the time when a web page is either posted or updated on the Internet and the time when a representation of the new or updated web page is indexed and made available to a search engine.
In addition to the problems associated with the latency between the time the content of a web page changes and the time that content can be indexed, the growth in the number of documents on the Internet poses additional challenges to the development of an effective search engine system. When a user submits a query to a search engine system, the user expects a short list of highly relevant web pages to be returned. Previous search engine systems, when indexing a web page, associate only the contents of the web page itself with that web page. However, in a collection of linked documents, such as the documents that reside on the Internet, valuable information about a particular web page may be found outside the contents of the web page itself. For example, so-called “hyperlinks” that point to a web page often contain valuable information about that web page. The information in or neighboring a hyperlink pointing to a web page can be especially useful when the web page itself contains little or no textual information. Thus, what is needed in the art are methods and systems for indexing information about a document, where that information resides on other documents in a collection of linked documents, so as to produce an index that can return a list of the most highly relevant documents in response to a user-submitted query.
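By way of illustration only, the following minimal sketch shows one way that text inside hyperlinks could be associated with the page the hyperlink points to, rather than with the page that contains the hyperlink. It is not the claimed method. The names `AnchorExtractor` and `build_anchor_index`, the example URLs, and the whitespace-based splitting of terms are all assumptions made for this sketch, and text neighboring a hyperlink is not handled.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def build_anchor_index(pages):
    """Map each lowercase anchor-text term to the set of URLs it points to.

    `pages` (hypothetical input shape) maps a source URL to its HTML.
    """
    index = defaultdict(set)
    for html in pages.values():
        parser = AnchorExtractor()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            for term in text.lower().split():
                index[term].add(href)
    return index


# Hypothetical example: an image with no text of its own becomes
# findable through the words other pages use when linking to it.
pages = {
    "http://example.com/a": '<p>See the <a href="http://example.com/logo.png">company logo image</a>.</p>',
    "http://example.com/b": '<a href="http://example.com/logo.png">Logo</a> and <a href="http://example.com/a">home page</a>',
}
index = build_anchor_index(pages)
```

In this sketch, a query for the term "logo" would be matched to the image file even though the image itself contains no textual information, because the term was taken from hyperlinks on other documents.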