The World Wide Web is a network of computers and information resources, in which some information resources typically refer to other information resources via hyperlinks. For example, text or images can be encoded such that they refer to a network address (e.g., a URL, or Uniform Resource Locator) of another information resource.
The World Wide Web has grown explosively over the last few years to become a very large-scale, distributed, evolving repository of information resources. With this growth has come increased difficulty in identifying relevant information resources. To address this need, search engines have become a core capability of the Internet. For example, in performing a search, an Internet user may enter a word, a phrase, or a set of keywords into web browser software or a thin client toolbar running on the user's computer. The search engine, specifically its query processor, may find matching information resources, such as web pages, images, documents, and videos, and provide a response to the user. Search engines have also become prevalent in intranets, i.e., private enterprise networks, where such search engines use keywords to locate documents and files.
The search engine query processor may leverage a search database and content repository, populated by a type of software called a web crawler or web spider, operating in conjunction with an indexer. The web crawler follows hyperlinks to find locations on the web and visits each location it finds; the indexer relates the web page content to the location where it was found, for the purpose of responding to a search query at a later time. Content may be stored in compressed or uncompressed form in the content repository for the query processor to serve in conjunction with the locations. Unfortunately, some information resources may exist and be accessible but are not “published” via any hyperlink, so a crawler that only follows hyperlinks may never discover them. For example, a user may access these unpublished locations by typing the Uniform Resource Locator (URL) into a location text box in a browser application. Furthermore, the content stored at a published location may change without advance notice. Since the web crawler may not revisit the site for an extended period of time, the search engine may remain unaware of the change and continue to respond to queries based on outdated information.
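The crawler, indexer, and content repository described above can be illustrated with a minimal sketch. It is not drawn from any particular search engine: the in-memory `WEB` dictionary, the example.com URLs, and the function names `crawl_and_index` and `search` are all hypothetical, and a real crawler would fetch pages over the network instead of reading a dictionary. The sketch shows both limitations noted above: a location with no inbound hyperlink is never indexed, and a page that changes after the crawl is still served from the old index.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
WEB = {
    "http://example.com/": ("welcome home page", ["http://example.com/a"]),
    "http://example.com/a": ("news about crawlers", ["http://example.com/"]),
    # Accessible by typing the URL, but no page links to it ("unpublished").
    "http://example.com/hidden": ("unpublished report", []),
}


def crawl_and_index(web, seed):
    """Breadth-first crawl from `seed`; return (index, repository)."""
    index = {}       # word -> set of URLs whose content contained it
    repository = {}  # URL -> content as seen at crawl time
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to index
        text, links = web[url]
        repository[url] = text
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, repository


def search(index, word):
    """Return the sorted list of URLs indexed under `word`."""
    return sorted(index.get(word, set()))


index, repository = crawl_and_index(WEB, "http://example.com/")

# The page changes after the crawl; the index is not updated.
WEB["http://example.com/a"] = ("sports results", ["http://example.com/"])

print(search(index, "unpublished"))  # hidden page was never discovered
print(search(index, "news"))         # stale hit for content that is gone
print(search(index, "sports"))       # new content is not yet searchable
```

Until the crawler runs again, queries for the old wording still return the changed page and queries for the new wording return nothing, which is the staleness problem in miniature.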
For greater freshness, a web crawler may visit sites more often, but such an approach adds load to the network as well as the web servers populating it.
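The trade-off between freshness and load can be made concrete with simple arithmetic. The sketch below assumes, purely for illustration, that every page is revisited at one fixed interval and that each visit costs one fetch; the function name `crawl_tradeoff` and the figures passed to it are hypothetical.

```python
def crawl_tradeoff(num_pages, revisit_interval_hours):
    """Return (fetches per day, worst-case staleness in hours).

    Assumes each page is fetched exactly once per revisit interval.
    A page that changes just after a visit stays stale in the index
    for up to one full interval.
    """
    fetches_per_day = num_pages * 24 / revisit_interval_hours
    worst_case_staleness_hours = revisit_interval_hours
    return fetches_per_day, worst_case_staleness_hours


# Hypothetical figures: 1,000 pages, revisited daily versus every 6 hours.
daily = crawl_tradeoff(1000, 24)
six_hourly = crawl_tradeoff(1000, 6)
print(daily, six_hourly)
```

Under these assumptions, cutting the worst-case staleness from 24 hours to 6 hours quadruples the number of fetches the network and web servers must absorb.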