Web documents, herein called Web pages, are stored on numerous server computers (hereinafter "servers") that are connected to the Internet. Each page on the Web has a distinct URL (uniform resource locator). Many of the documents stored on Web servers are written in a standard document description language called HTML (hypertext markup language). Using HTML, a designer of Web documents can associate hypertext links or annotations with specific words or phrases in a document and specify visual aspects and the content of a Web page. The hypertext links identify the URLs of other Web documents or other parts of the same document providing information related to the words or phrases.
A user accesses documents stored on the WWW using a Web browser (a computer program designed to display HTML documents and communicate with Web servers) running on a Web client connected to the Internet. Typically, the user selects a hypertext link (usually displayed by the Web browser as a highlighted word or phrase) within a document being viewed with the Web browser. The Web browser then issues an HTTP (hypertext transfer protocol) request for the requested document to the Web server identified by the requested document's URL. In response, the designated Web server returns the requested document to the Web browser, also using HTTP.
As of the end of 1995, the number of pages on the portion of the Internet known as the World Wide Web (hereinafter the "Web") had grown severalfold during the prior one-year period to at least 30 million pages. The present invention is directed at a system for keeping track of pages on the Web as the Web continues to grow.
The systems for locating pages on the Web are known variously as "Web crawlers," "Web spiders" and "Web robots." The present invention has been dubbed a "Web scooter" because it is so much faster than all known Web crawlers. The terms "Web crawler," "Web spider," "Web scooter," "Web crawler computer system," and "Web scooter computer system" are used interchangeably in this document.
Prior art Web crawlers work generally as follows. Starting with a root set of known Web pages, a disk file is created with a distinct entry for every known Web page. As additional Web pages are fetched and their links to other pages are analyzed, additional entries are made in the disk file to reference Web pages not previously known to the Web crawler. Each entry indicates whether or not the corresponding Web page has been processed as well as other status information. A Web crawler processes a Web page by (A) identifying all links to other Web pages in the page being processed and storing related information so that all of the identified Web pages that have not yet been processed are added to a list of Web pages to be processed or other equivalent data structure, and (B) passing the Web page to an indexer or other document processing system.
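The prior art procedure described above can be illustrated with a minimal sketch. It is not taken from any particular crawler: the names crawl, fetch_links and index_page, the in-memory dictionary standing in for the disk file, and the example URLs are all assumptions made for illustration.

```python
from collections import deque

def crawl(root_urls, fetch_links, index_page):
    """Sketch of the prior art loop: one record per known page plus a work queue."""
    # One entry per known page; a dict stands in for the disk file here.
    records = {url: {"processed": False} for url in root_urls}
    queue = deque(root_urls)
    while queue:
        url = queue.popleft()
        # (A) Identify all links in the page being processed.
        for link in fetch_links(url):
            if link not in records:          # one record lookup per reference
                records[link] = {"processed": False}
                queue.append(link)           # page not previously known
        # (B) Pass the page to the indexer or other document processor.
        index_page(url)
        records[url]["processed"] = True
    return records

# Toy in-memory "Web" with hypothetical URLs, used instead of real HTTP fetches.
toy_web = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": [],
}
indexed = []
result = crawl(["http://a.example/"], lambda u: toy_web.get(u, []), indexed.append)
```

In a prior art crawler each lookup against the record store would be a disk access rather than a dictionary lookup, which is the cost examined in the paragraphs that follow.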
The information about the Web pages already processed is generally stored in a disk file, because the amount of information in the disk file is too large to be stored in random access memory (RAM). For example, if an average of 100 bytes of information are stored for each Web page entry, a data file representing 30 million Web pages would occupy about 3 Gigabytes, which is too large for practical storage in RAM.
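The storage estimate above is simple arithmetic and can be checked as follows. The variable names are illustrative, and a Gigabyte is assumed here to mean 10^9 bytes.

```python
BYTES_PER_ENTRY = 100        # average information stored per Web page entry
KNOWN_PAGES = 30_000_000     # approximate size of the Web at the end of 1995

total_bytes = BYTES_PER_ENTRY * KNOWN_PAGES
total_gigabytes = total_bytes / 1e9   # assumes 1 Gigabyte = 10**9 bytes

print(f"{total_gigabytes:.1f} Gigabytes")
```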
Next we consider the disk I/O incurred when processing one Web page. For purposes of this discussion we will assume that a typical Web page contains 20 references to other Web pages, and that a disk storage device can handle no more than 50 seeks per second. The Web crawler must evaluate each of the 20 page references in the page being processed to determine if it already knows about those pages. To do this it must attempt to retrieve 20 records from the Web information disk file. If the record for a specified page reference already exists, then that reference is discarded because no further processing is needed. However, if a record for a specified page is not found, an attempt must be made to locate a record for each possible alias of the page's address, thereby increasing the average number of disk record seeks needed to analyze an average Web page to about 50 disk seeks per page.
If a disk file record for a specified page reference does not already exist, a new record for the referenced page is created and added to the disk file, and that page reference is either added to a queue of pages to be processed, or the disk file entry is itself used to indicate that the page has not yet been fetched and processed.
Thus, processing a single Web page requires approximately 50 disk seeks (for reading existing records and for writing new records). As a result, given a limitation of 50 disk seeks per second, only about one Web page can be processed per second.
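The disk-bound processing rate follows directly from the two figures assumed in this discussion: about 50 disk seeks per page once alias lookups are counted, and a device limit of 50 seeks per second. A minimal sketch of the calculation, with the function name disk_bound_rate chosen purely for illustration:

```python
def disk_bound_rate(seeks_per_second, seeks_per_page):
    """Upper bound on pages processed per second when disk seeks are the bottleneck."""
    return seeks_per_second / seeks_per_page

SEEKS_PER_SECOND = 50   # assumed limit of the disk storage device
SEEKS_PER_PAGE = 50     # average seeks per page, including alias lookups

pages_per_second = disk_bound_rate(SEEKS_PER_SECOND, SEEKS_PER_PAGE)
print(pages_per_second)
```

The same function shows how sensitive the bound is to the seek count: every additional record lookup per page lowers the achievable rate proportionally.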
In addition, there is the matter of network access latency. On average, it takes about 3 seconds to retrieve a Web page, although the amount of time is highly variable depending on the location of the Web server and the particular hardware and software being used on both the Web server and on the Web crawler computer. Network latency thus also tends to limit the number of Web pages that can be processed by prior art Web crawlers to about 0.33 Web pages per second. Due to disk seek limitations, network latency, and other delay factors, a typical prior art Web crawler cannot process more than about 30,000 Web pages per day.
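The latency-bound figures can be reproduced with a short calculation. This sketch assumes that pages are fetched one at a time, with no overlapping requests, which is the situation the 0.33 pages per second figure implies; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
FETCH_LATENCY_SECONDS = 3.0   # average time to retrieve one Web page

# With one outstanding request at a time, latency alone caps the rate.
pages_per_second = 1 / FETCH_LATENCY_SECONDS
pages_per_day = SECONDS_PER_DAY / FETCH_LATENCY_SECONDS

print(f"{pages_per_second:.2f} pages per second")
print(f"{pages_per_day:.0f} pages per day")
```

The daily total of 28,800 pages is consistent with the ceiling of about 30,000 Web pages per day stated above.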
Due to the rate at which Web pages are being added to the Web, and the rate at which Web pages are being deleted and revised, processing 30,000 Web pages per day is inadequate for maintaining a truly current directory or index of all the Web pages on the Web. Ideally, a Web crawler should be able to visit (i.e., fetch and analyze) at least 2.5 million Web pages per day.
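To put the target in the same units as the prior art figures, the daily goal can be converted to a per-second rate and compared with the prior art ceiling. The variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TARGET_PAGES_PER_DAY = 2_500_000      # desired visiting rate
PRIOR_ART_PAGES_PER_DAY = 30_000      # typical prior art ceiling

target_pages_per_second = TARGET_PAGES_PER_DAY / SECONDS_PER_DAY
required_speedup = TARGET_PAGES_PER_DAY / PRIOR_ART_PAGES_PER_DAY

print(f"{target_pages_per_second:.1f} pages per second")
print(f"{required_speedup:.0f}x faster than prior art")
```

The target therefore corresponds to roughly 29 pages per second, more than 80 times the prior art rate.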
It is therefore an object of the present invention to provide an improved Web crawler that processes millions of Web pages per day. It is a related goal of the present invention to provide an improved Web crawler that overcomes the aforementioned disk seek limitations and network latency limitations so as to enable the Web crawler's speed of operation to be limited primarily by the processing speed of the Web crawler's CPU. It is yet another related goal of the present invention to provide a Web crawler system that can fetch and analyze, on average, at least 30 Web pages per second, and more preferably at least 100 Web pages per second.