Information technologists have long recognized the need to properly index electronic content so that the content can be easily found by interested parties. In recent times, search engine tools and technologies have evolved to address the need to discover and index electronic content published and accessible through computer networks, such as the Internet or private networks.
To automate the discovery of electronic content, software tools commonly known as “crawlers” traverse computer networks by navigating from electronic document to electronic document along hyperlinks that are embedded in the documents and indicate the locations of other documents. In this manner, crawlers seek, acquire, and index electronic document content for later use by search engines.
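The link-following traversal described above can be sketched as a breadth-first walk over hyperlinks. This is only an illustrative sketch, not a description of any particular crawler: the `crawl` function, the `fetch` callable, the `LinkExtractor` class, and the `example.test` pages are all hypothetical names introduced here, and an in-memory dictionary stands in for real network requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first traversal from start_url; returns {url: outgoing links}.

    `fetch` is a hypothetical callable that maps a URL to its HTML text,
    or returns None when the document cannot be retrieved.
    """
    index = {}
    queue = deque([start_url])
    seen = {start_url}  # avoids revisiting documents that link to each other
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative hyperlinks against the current document's URL.
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "network" standing in for real HTTP requests.
PAGES = {
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/a">A</a>',
    "http://example.test/c": "no links here",
}

crawl_result = crawl("http://example.test/a", PAGES.get)
```

Here the "index" merely records each document's outgoing links; a real crawler would store document content for the search engine and would typically add politeness delays and error handling, which are omitted from this sketch.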
A crawler often begins with a seed list that contains uniform resource locators (URLs) indicating the locations of electronic documents that are to be indexed. Seed lists are often prepared by publishers of electronic content who wish to make their content known to search engines so that others may access the content. Where seed lists are used, a crawler is often configured to perform an initial “full” crawling session of all electronic documents that are discoverable using a given seed list. Thereafter, usually at scheduled intervals, the crawler is provided with seed lists that contain the URLs of only those electronic documents that have been updated since the previous crawling session. This reduces both the time required to update the index and the load on the computer processing and storage infrastructure. Under this arrangement, the more frequently crawling sessions are scheduled, the more up to date the index will be; however, more frequent sessions place a greater overall load on the computer processing and storage infrastructure than less frequent sessions would.
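The difference between the initial full session and later update-only sessions can be illustrated with a small sketch. It assumes, purely for illustration, that the publisher keeps a last-modified timestamp per URL; the function names, the integer timestamps, and the `example.test` URLs are hypothetical and do not come from any particular system.

```python
def full_seed_list(documents):
    """Seed list for the initial full session: every known document URL."""
    return sorted(documents)


def incremental_seed_list(documents, last_crawl_time):
    """Seed list holding only URLs modified after the previous session."""
    return sorted(
        url for url, modified in documents.items() if modified > last_crawl_time
    )


def run_session(index, seed_list, documents):
    """Record the current version of each seeded URL; return the work done."""
    for url in seed_list:
        index[url] = documents[url]
    return len(seed_list)


# Hypothetical publisher records: URL -> last-modified time (arbitrary units).
documents = {
    "http://example.test/1": 10,
    "http://example.test/2": 10,
    "http://example.test/3": 10,
}
index = {}

# Initial full session, assumed to run at time 20, touches every document.
full_count = run_session(index, full_seed_list(documents), documents)
last_crawl_time = 20

# One document changes at time 25; the next session is seeded with it alone.
documents["http://example.test/2"] = 25
update_seeds = incremental_seed_list(documents, last_crawl_time)
update_count = run_session(index, update_seeds, documents)
```

In this sketch the number of documents processed per session stands in for the load on the infrastructure: the update session handles one document instead of three, but every additional scheduled session still adds its own cost.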