The present invention relates to crawling (or traversing) of hyperlinked documents. More specifically, the invention relates to techniques for distributed crawling of hyperlinked documents in which requests to hosts are rate limited and the crawl adapts to the actual retrieval times of the hosts.
The World Wide Web (or “Web”) contains a vast amount of information in the form of hyperlinked documents (e.g., web pages). One reason for the explosive growth in the number of hyperlinked documents on the Web is that just about anyone can upload hyperlinked documents, which can include links to other hyperlinked documents. Although the Web undoubtedly contains a vast amount of useful information, its unstructured nature can make it difficult to find the information that is desired.
Search engines allow users to enter queries (e.g., keywords) that describe the information they are seeking. The search engines then scan the Web for hyperlinked documents that best satisfy the query. With millions of hyperlinked documents on the Web, web crawlers are typically used to scan, index, and store information regarding those documents so that the search engines can execute queries more efficiently.
As the size of the Web continues to increase, it becomes increasingly desirable to have innovative techniques for efficiently crawling the Web. Additionally, it would be beneficial to have web crawling techniques that are efficient yet do not impose unnecessary burdens on hosts on the Web.