The present invention relates to a system and method for accessing documents, called web pages, on the world wide web (WWW) and, more particularly, to a method for scheduling web crawlers to efficiently download web pages from the world wide web.
Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called "web pages" are stored on the global computer network known as the Internet, which includes the world wide web. Each web page on the world wide web has a distinct address called its uniform resource locator (URL), which identifies the location of the web page. Most of the documents on the world wide web are written in standard document description languages (e.g., HTML, XML). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to other web pages by clicking on their respective links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URLs. Links in a web page may refer to web pages that are stored on the same or different host computers.
A web crawler is a program that automatically finds and downloads documents from host computers in networks such as the world wide web. When a web crawler is given a set of starting URLs, the web crawler downloads the corresponding documents, then the web crawler extracts any URLs contained in those downloaded documents and downloads more documents using the newly discovered URLs. This process repeats indefinitely or until a predetermined stop condition occurs. As of 1999 there were approximately 500 million web pages on the world wide web and the number is continuously growing; thus, web crawlers need efficient data structures to keep track of downloaded documents and any discovered addresses of documents to be downloaded. One common data structure to keep track of addresses of documents to be downloaded is a first-in-first-out (FIFO) queue. Using FIFO queues, URLs are enqueued as they are discovered, and dequeued in the order enqueued when the crawler needs a new URL to download.
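The crawl loop described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular crawler: the function name crawl_fifo, the fetch_links callable, the max_pages stop condition, and the toy link graph WEB are all assumptions introduced for the example, and no network access is performed.

```python
from collections import deque

def crawl_fifo(seed_urls, fetch_links, max_pages=1000):
    """Crawl with a single FIFO queue of URLs.

    fetch_links is a hypothetical callable that takes a URL and returns
    the list of URLs found in the corresponding document.
    """
    frontier = deque(dict.fromkeys(seed_urls))  # URLs awaiting download, in discovery order
    seen = set(frontier)                        # every URL enqueued so far
    order = []                                  # download order, kept for inspection
    while frontier and len(order) < max_pages:  # assumed stop condition
        url = frontier.popleft()                # dequeue in the order enqueued
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:                # enqueue each URL only once
                seen.add(link)
                frontier.append(link)
    return order

# Toy in-memory link graph standing in for the web (hypothetical data).
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://a.example/1", "http://c.example/"],
}

order = crawl_fifo(["http://a.example/"], lambda u: WEB.get(u, []))
# order is ["http://a.example/", "http://a.example/1",
#           "http://b.example/", "http://c.example/"]
```

Note that the first two URLs dequeued refer to the same host, which is the clustering behavior that matters once downloads are issued in parallel.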
A high-performance web crawler typically has the capability to download multiple documents in parallel, either by using asynchronous I/O or multiple threads. A thread is an abstraction for an execution entity within a running computer program. When a running computer program is composed of more than one thread, the program is said to be "multi-threaded." The threads of a multi-threaded program run in parallel and share the same memory space, but each thread in a multi-threaded program executes independently of the others. Each thread in a multi-threaded program has its own program counter and stack.
Discovered URLs from any particular web page often tend to refer to documents located on the same host computer. Therefore, if a FIFO queue is used by a web crawler to store those discovered URLs, sequentially dequeued URLs could cause multiple parallel requests to the same host computer. Sending multiple parallel requests to the same host computer may overload the host, diminishing its responsiveness to page requests, or may even cause the host to crash, either of which may create a bottleneck in the web crawl and reduce the crawler's effective parallel processing.
Examples of known prior art methods aimed at preventing the issuance of multiple parallel requests to one host computer include the Internet Archive web crawler and the Scooter web crawler used by AltaVista.
The Internet Archive crawler keeps a separate FIFO queue per web host. During a crawling process, 64 FIFO queues are selected and assigned to the process. The 64 queues are processed in parallel with the crawler dequeuing one URL at a time from each queue and downloading the corresponding document. This process ensures that no more than one URL from each queue is downloaded at a time and that the crawler makes at most one request to each host computer at a time. The FIFO queues in the Internet Archive web crawler correspond one-to-one with the web hosts on the Internet; therefore, this approach requires a staggering number of queues, easily several million. However, only 64 queues are processed at a time; thus, not only do millions of queues sit idle, but the process also puts a prolonged load on a small fraction of the Internet's web hosts.
The Scooter web crawler used by AltaVista uses a different approach. Scooter keeps a first list of URLs of web pages to be downloaded, and a second list of host computers from which downloads are in progress. Newly discovered URLs are added to the end of the first list. To locate a new URL to download, Scooter compares items in the first list with the second list until it finds a URL whose host computer is not in the second list. Scooter then removes that URL from the first list, updates the second list, and downloads the corresponding document. One of the disadvantages of this approach is the time wasted scanning through the first list of URLs each time a thread in the crawler is ready to perform a download.
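The cost of the list-scanning approach can be illustrated with a short sketch. This follows only the description above and is not Scooter's actual code; the function name next_url_by_scan and the returned scan count are assumptions added to make the wasted work visible.

```python
from urllib.parse import urlsplit

def next_url_by_scan(pending, busy_hosts):
    """Return (url, entries_examined) for the first pending URL whose host
    has no download in progress, or (None, entries_examined) if none exists.

    pending is the first list (URLs to download); busy_hosts is the second
    list (hosts with downloads in progress), modeled here as a set.
    """
    for i, url in enumerate(pending):
        host = urlsplit(url).hostname
        if host not in busy_hosts:
            del pending[i]          # remove the URL from the first list
            busy_hosts.add(host)    # update the second list
            return url, i + 1
    return None, len(pending)

pending = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
busy = {"a.example"}
url, examined = next_url_by_scan(pending, busy)
# url is "http://b.example/1" and examined is 3: two entries for the busy
# host had to be skipped before a usable URL was found.
```

When many consecutive entries belong to busy hosts, every ready thread repeats this linear scan, which is the inefficiency noted above.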
The present invention provides more efficient web page downloading methods that avoid certain of the disadvantages and inefficiencies in the prior art methods.
The present invention provides a method and system for downloading data sets from among a plurality of host computers.
A given set of web pages typically contains addresses or URLs of one or more other web pages. Each address or URL typically includes a host address indicating the host computer of the particular web page. Addresses or URLs discovered during the process of downloading data sets are enqueued into a number of queues based on predetermined policies.
In this invention, a web crawler may have multiple first-in-first-out (FIFO) queues and use multiple threads to dequeue from those queues and to download documents from the world wide web. Each queue is assigned a single, fixed thread that dequeues URLs from that queue until it becomes empty. While a thread dequeues URLs from its assigned queue, it also enqueues any URLs discovered during the course of processing downloaded documents. In the exemplary embodiments, all URLs with the same host component are enqueued in the same queue. As a result, when all the threads are dequeuing in parallel from each of their respectively assigned queues, no more than one request to one host computer is made at the same time.
In a first exemplary embodiment, when a thread discovers a new URL (i.e., in a document it has downloaded from a web site), a numerical function is performed on the URL's host component to determine the queue in which to enqueue the new URL. Each queue may contain URLs referring to documents stored on different host computers; however, as stated previously, URLs referring to documents stored on the same host computer are always enqueued into the same queue.
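One way the first embodiment's queue selection might look is sketched below. The text above specifies only "a numerical function" of the host component; the use of a CRC-32 checksum reduced modulo the number of queues is an assumption made for illustration, as are the names NUM_QUEUES, queue_index, and enqueue.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

NUM_QUEUES = 4                                   # assumed number of queues
queues = [deque() for _ in range(NUM_QUEUES)]

def queue_index(url, num_queues=NUM_QUEUES):
    """Map a URL to a queue using only its host component.

    CRC-32 modulo the queue count is an assumed choice of numerical
    function; any deterministic function of the host would serve.
    """
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_queues

def enqueue(url):
    """Append a newly discovered URL to the queue chosen for its host."""
    queues[queue_index(url)].append(url)

for u in ["http://a.example/1", "http://b.example/x", "http://a.example/2"]:
    enqueue(u)
```

Because the index depends only on the host, both a.example URLs land in the same queue, while unrelated hosts may or may not share a queue with them.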
In a second exemplary embodiment, the mechanism for enqueuing URLs is based on a dynamic assignment of hosts to queues. When a new URL is discovered, the new URL is generally first enqueued into a main FIFO queue, and is later enqueued into one of the underlying FIFO queues based on the dynamic assignment of hosts to queues. However, if the main queue is empty, the new URL may be directly enqueued into one of the underlying queues. In this embodiment, not only are all URLs having the same host component enqueued into the same underlying queue, but all URLs in any particular one of the underlying queues have the same host component.
In the second exemplary embodiment, in which hosts are dynamically assigned to queues, when one of the underlying queues becomes empty, a different host may be assigned to it. For example, when a queue becomes empty, the empty queue's corresponding thread begins enqueuing URLs from the main queue into the underlying queues until the thread finds a URL whose corresponding host is not yet assigned to any underlying queue. The host of the new URL is assigned to the empty queue, and the new URL is enqueued into that queue in accordance with the new assignment. If the main queue becomes empty, the thread becomes idle and is blocked.
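The dynamic assignment of hosts to queues can be sketched as a small single-threaded data structure. This is an illustration under stated assumptions rather than the claimed implementation: the class name DynamicFrontier and its methods are hypothetical, locking between threads is omitted, a blocked thread is represented by a return value of None, and direct enqueuing into an underlying queue is performed only when the main queue is empty and the host is already assigned.

```python
from collections import deque
from urllib.parse import urlsplit

class DynamicFrontier:
    """Main FIFO queue plus underlying per-host queues (sketch, no locking)."""

    def __init__(self, num_queues):
        self.main = deque()
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}                 # host -> index of its queue
        self.queue_host = [None] * num_queues   # index -> host currently assigned

    def add(self, url):
        """Enqueue a newly discovered URL."""
        idx = self.host_to_queue.get(urlsplit(url).hostname)
        if not self.main and idx is not None:
            self.queues[idx].append(url)        # main queue empty: enqueue directly
        else:
            self.main.append(url)

    def refill(self, idx):
        """Underlying queue idx is empty: release its host, then move URLs out
        of the main queue until one with an unassigned host is found."""
        old = self.queue_host[idx]
        if old is not None:
            del self.host_to_queue[old]
            self.queue_host[idx] = None
        while self.main:
            url = self.main.popleft()
            host = urlsplit(url).hostname
            target = self.host_to_queue.get(host)
            if target is None:                  # unassigned host: claim it
                self.host_to_queue[host] = idx
                self.queue_host[idx] = host
                self.queues[idx].append(url)
                return True
            self.queues[target].append(url)     # host already has a queue
        return False                            # main queue empty: thread would block

    def next_url(self, idx):
        """Next URL for the thread that owns queue idx, or None if it would block."""
        if not self.queues[idx] and not self.refill(idx):
            return None
        return self.queues[idx].popleft()

frontier = DynamicFrontier(2)
for u in ["http://a.example/1", "http://a.example/2",
          "http://b.example/1", "http://a.example/3"]:
    frontier.add(u)
first = frontier.next_url(0)    # "http://a.example/1"; a.example is assigned to queue 0
second = frontier.next_url(1)   # "http://b.example/1"; a.example/2 is moved to queue 0 on the way
```

In this sketch each underlying queue holds URLs of exactly one host at a time, matching the property stated for the second embodiment.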
Both embodiments allow for the case where there are more queues than threads, in which case some threads will be assigned to dequeue from a set of multiple queues. In such embodiments, each thread dequeues URLs from each of its assigned queues until each of those queues becomes empty.
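The case of more queues than threads might be handled as in the following sketch. The round-robin ownership rule, the order in which a thread visits its queues, and the names assign_queues, drain, and run are assumptions; the text above does not prescribe them. For brevity the queues are pre-filled and no newly discovered URLs are enqueued during the run, and download is a hypothetical stand-in callable rather than a real network fetch.

```python
import threading
from collections import deque

def assign_queues(num_queues, num_threads):
    """Give queue i to thread i % num_threads (an assumed policy)."""
    return [list(range(t, num_queues, num_threads)) for t in range(num_threads)]

def drain(owned, download, results, lock):
    """Take one URL at a time from each owned queue until all are empty."""
    pending = [q for q in owned if q]
    while pending:
        for q in pending:
            page = download(q.popleft())   # one download at a time per thread
            with lock:
                results.append(page)
        pending = [q for q in pending if q]

def run(queues, num_threads, download):
    """Start one thread per group of queues and wait for all of them."""
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(
            target=drain,
            args=([queues[i] for i in owned], download, results, lock),
        )
        for owned in assign_queues(len(queues), num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Five queues shared between two threads: thread 0 owns queues 0, 2, 4 and
# thread 1 owns queues 1, 3.
owners = assign_queues(5, 2)
```

Since each queue has exactly one owning thread and a thread performs one download at a time, two requests to the same host are still never in flight together, provided all URLs of a host reside in one queue.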