Search engines provide a powerful tool for locating documents in a large database of documents, such as the documents on the Internet or the documents stored on the computers of an Intranet. In the context of this application, a document is defined as a combination of a document address, e.g., a uniform resource locator (URL), and a document content.
A typical structure of a web search engine comprises a front end and a back end. The front end includes a query server for receiving a search query submitted by a user and displaying search results to the user, and a query processor for transforming the search query into a search request understood by the back end of the web search engine. The back end includes one or more web crawlers for retrieving documents from the Internet, a scheduler for providing addresses of the documents to the web crawlers, an indexer for indexing the documents retrieved by the web crawlers, and one or more databases for storing information about the retrieved documents, e.g., the indexes of the documents. Upon receipt of a search request, the front end searches the databases, identifies documents whose contents match the search request and returns them as the search results to the requester.
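By way of illustration only, the front end and back end structure described above may be sketched as follows. The class names, the in-memory FAKE_WEB dictionary standing in for the Internet, and the whitespace tokenization are all assumptions made for this sketch; they are not taken from any particular search engine, and a practical system would differ in nearly every detail.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A document: a document address (e.g., a URL) plus a document content."""
    address: str
    content: str


# Hypothetical stand-in for the Internet: address -> current content.
FAKE_WEB = {
    "http://a.example/": "fresh crawler news",
    "http://b.example/": "search engine index",
}


class BackEnd:
    """Scheduler, crawler, indexer, and databases, reduced to a toy."""

    def __init__(self, addresses):
        self.schedule = deque(addresses)   # scheduler: addresses to crawl
        self.store = {}                    # database: address -> Document
        self.index = defaultdict(set)      # database: term -> addresses

    def crawl_all(self):
        """Crawler: retrieve every scheduled address and hand it to the indexer."""
        while self.schedule:
            address = self.schedule.popleft()
            content = FAKE_WEB.get(address)
            if content is None:            # document no longer accessible
                continue
            self._index(Document(address, content))

    def _index(self, doc):
        """Indexer: record the document and the terms it contains."""
        self.store[doc.address] = doc
        for term in doc.content.lower().split():
            self.index[term].add(doc.address)

    def search(self, terms):
        """Return addresses of documents containing every requested term."""
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)


class FrontEnd:
    """Query server and query processor, reduced to a toy."""

    def __init__(self, back_end):
        self.back_end = back_end

    @staticmethod
    def process_query(query):
        """Query processor: turn a user query into a list of search terms."""
        return query.lower().split()

    def handle(self, query):
        """Query server: accept a query and return the matching addresses."""
        return self.back_end.search(self.process_query(query))
```

In this sketch, a FrontEnd constructed over a BackEnd that has crawled FAKE_WEB returns only the address of the second document for the query "Search Index", because the search requires every term to be present in the stored content.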
There are billions of documents accessible through the Internet. The life expectancy of a document's content (after which its contents may be replaced or changed) may vary from a few years to a few seconds. Every day, many thousands of new and revised documents are posted by various web servers all over the world, while other documents are deleted from their hosting web servers and are therefore no longer accessible. As a result, at least some of the document information stored in a web search engine is likely to be stale, even if the web search engine is continuously crawling the web so as to update its database. Stale content in a search engine database is said to be visible when the search engine returns a result (e.g., in response to a search query) that is based on stale information. In some cases, the stale content in the search engine may have no particular significance, because the changes to the documents listed in a search result are minor, or the relevance of the documents remains substantially the same. However, in other cases the search result may include links to documents that no longer exist, or whose content has changed such that the result is no longer relevant to the query (or has lower relevance to the query than the prior content of the documents). For purposes of this document, stale content is assumed to be visible whenever search results are returned based on the stale content, even if the search results are still useful to the user.
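The notion of visible stale content may be illustrated with a minimal sketch. The function names, the dictionary representation of stored and live content, and the ratio computed below are assumptions introduced for illustration only; in particular, the ratio is merely one hypothetical way of quantifying visibility and is not a measure defined in this document.

```python
def visible_stale_results(results, stored, live):
    """Return the addresses in `results` whose stored content is stale.

    `stored` maps address -> content held by the search engine.
    `live` maps address -> content currently served; a missing address
    means the document has been deleted from its web server.
    """
    return [
        address
        for address in results
        if stored.get(address) != live.get(address)
    ]


def stale_visibility(results, stored, live):
    """Hypothetical measure: fraction of returned results that are stale."""
    if not results:
        return 0.0
    return len(visible_stale_results(results, stored, live)) / len(results)


# Example data (invented for this sketch).
stored = {"u1": "old text", "u2": "unchanged text", "u3": "deleted page"}
live = {"u1": "new text", "u2": "unchanged text"}  # u3 no longer exists

stale = visible_stale_results(["u1", "u2", "u3"], stored, live)
```

Here stale content for "u1" and "u3" becomes visible only because those addresses appear in a returned result; stale entries that are never returned to a requester do not count as visible under the definition above.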
In general, it would be desirable to keep the document information in a search engine's databases as fresh as possible, while avoiding needless refreshing of content that is highly static. More generally, it would be desirable to schedule documents for downloading by a web crawler so as to minimize the visibility of stale document information in the databases of the search engine.