The present invention relates to a system and method for accessing documents, called web pages, on the world wide web (WWW) and, more particularly, to a method for associating an extensible set of data with each document downloaded by a web crawler.
Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called "web pages" are stored on the global computer network known as the Internet, which includes the world wide web. Each web page on the world wide web has a distinct address called its uniform resource locator (URL), which identifies the location of the web page. Most of the documents on the world wide web are written in standard document description languages (e.g., HTML, XML). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to other web pages by clicking on their respective links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URLs. Links in a web page may refer to web pages that are stored in the same or different host computers.
A web crawler is a program that automatically finds and downloads documents from host computers in networks such as the world wide web. When a web crawler is given a set of starting URLs, the web crawler downloads the corresponding documents, extracts any URLs contained in those downloaded documents and downloads more documents using the newly discovered URLs. This process repeats indefinitely or until a predetermined stop condition occurs. As of 1999 there were approximately 500 million web pages on the world wide web and the number is continuously growing; thus, web crawlers need efficient data structures to keep track of downloaded documents and any discovered addresses of documents to be downloaded.
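By way of illustration only, the basic crawl loop described above can be sketched in Python as follows. The in-memory FAKE_WEB table, the HREF pattern, the crawl function and the max_docs stop condition are all hypothetical stand-ins chosen to keep the sketch self-contained; an actual crawler would fetch documents over the network and use a proper HTML parser.

```python
import re
from collections import deque

# Hypothetical in-memory "web" (URL -> document text), used instead of
# real network access so that the sketch runs on its own.
FAKE_WEB = {
    "http://a.example/": 'See <a href="http://a.example/x">x</a> '
                         'and <a href="http://b.example/">b</a>',
    "http://a.example/x": 'Back to <a href="http://a.example/">home</a>',
    "http://b.example/": "No links here",
}

# Simplistic link extractor; an assumption made for brevity.
HREF = re.compile(r'href="([^"]+)"')

def crawl(seed_urls, fetch, max_docs=100):
    """Download documents, extract their URLs, and follow unseen ones."""
    to_visit = deque(seed_urls)      # discovered but not yet downloaded
    seen = set(seed_urls)            # every URL discovered so far
    downloaded = []
    while to_visit and len(downloaded) < max_docs:   # stop condition
        url = to_visit.popleft()
        doc = fetch(url)
        if doc is None:              # download failed; skip this URL
            continue
        downloaded.append(url)
        for link in HREF.findall(doc):
            if link not in seen:     # enqueue only newly discovered URLs
                seen.add(link)
                to_visit.append(link)
    return downloaded

order = crawl(["http://a.example/"], FAKE_WEB.get)
```

The seen set is the simplest form of the bookkeeping mentioned above; at the scale of hundreds of millions of pages a more compact structure would be needed.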
After a document is downloaded by the web crawler, the web crawler may extract and store information about the downloaded page. For instance, the web crawler may determine if the downloaded page contains any new URLs not previously known to the web crawler, and may enqueue those URLs for later processing. In addition, pages downloaded by the web crawler may be processed by a sequence of processing modules. For instance, one processing module might determine whether the document has already been included in a web page index, and whether the page has changed by more than a predefined amount since its entry in the web page index was last updated. Another processing module might add or update a document's entry in the web page index. Yet another processing module might look for information of a specific type in the downloaded documents, extract the information and store it in a directory or other data structure.
During the course of processing a downloaded document, various data can be collected about it. Examples include the date and time of the download, how long it took to perform the download, whether the download was successful, the document's size, its MIME type, the date and time it was last modified, its expiration date and time, and a checksum of its contents. These data can be used for a variety of purposes, including, but not limited to:
passing information from one processing module to a later processing module in a processing pipeline;
collecting statistics about the downloaded documents; and
determining, in the context of a continuous web crawler, when a document should next be downloaded (refreshed).
After a document has been processed, its associated data can be saved to disk and analyzed off line.
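By way of illustration only, the per-download data listed above might be gathered into a single record along the following lines. The function name collect_download_data, the field names, the use of SHA-256 as the checksum and the JSON serialization are assumptions made for this sketch; none of them is prescribed by the description.

```python
import hashlib
import json
from datetime import datetime, timezone

def collect_download_data(url, body, mime_type, started_at, elapsed_seconds,
                          status, last_modified=None, expires=None):
    """Build one record of data about a single download (hypothetical layout)."""
    return {
        "url": url,
        "downloaded_at": started_at.isoformat(),    # date and time of download
        "elapsed_seconds": elapsed_seconds,         # how long the download took
        "success": 200 <= status < 300,             # whether it succeeded
        "size": len(body),                          # document size in bytes
        "mime_type": mime_type,
        "last_modified": last_modified,
        "expires": expires,
        "checksum": hashlib.sha256(body).hexdigest(),  # checksum of contents
    }

record = collect_download_data(
    "http://a.example/", b"<html>hello</html>", "text/html",
    datetime(2000, 1, 1, tzinfo=timezone.utc), 0.25, 200)

# One JSON object per line is one simple way to save records for
# later offline analysis.
line = json.dumps(record)
```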
A continuous web crawler is one that automatically refreshes a database of information about the pages it has downloaded. A web page can have an assigned or purported expiration date and time, which indicates when the page should be assumed to be no longer valid. Furthermore, a web crawler can be configured to assume that certain types of pages, such as pages on certain types of web sites, cannot be valid for more than a particular length of time. Thus, pages on a news web site might be assumed to be valid for only a few hours, while pages of an online encyclopedia might be assumed to be valid for a much longer time, such as a month.
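By way of illustration only, one way to combine a page's purported expiration time with a per-site validity cap is sketched below. The host names, the specific durations and the function effective_expiry are hypothetical; the description states only that such caps may be configured.

```python
from datetime import datetime, timedelta

# Hypothetical per-host caps on how long a page may be assumed valid.
MAX_VALIDITY = {
    "news.example": timedelta(hours=3),
    "encyclopedia.example": timedelta(days=30),
}
DEFAULT_VALIDITY = timedelta(days=7)   # assumed default for other hosts

def effective_expiry(host, downloaded_at, declared_expiry=None):
    """Return the earlier of the page's declared expiry and the host's cap."""
    cap = downloaded_at + MAX_VALIDITY.get(host, DEFAULT_VALIDITY)
    if declared_expiry is None:
        return cap
    return min(declared_expiry, cap)

t = datetime(2000, 1, 1, 12, 0)
news_expiry = effective_expiry("news.example", t)
capped_expiry = effective_expiry("encyclopedia.example", t,
                                 declared_expiry=t + timedelta(days=365))
```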
In the context of a continuous web crawler, it may be advantageous to record not only the data associated with a document's most recent download, but also the data associated with its previous downloads. How complete a download history to keep for each document may vary depending on the user's requirements.
The Scooter (a trademark of AltaVista Company) web crawler saves a fixed set of data for each document it discovers and downloads, namely, the document's URL, the number of attempts that have been made to download it, the date and time of the last download attempt, the HTTP status code of the last download, and the document's last modification date and time.
The Sphinx web crawler developed by Bharat and Miller allows document classifiers to associate name/value pairs with a downloaded page. However, Sphinx discards any name/value pairs associated with a document once the document has been processed. Moreover, the values must be strings, not values of arbitrary types.
It would be desirable to provide a much more flexible mechanism that enables application programs that process downloaded pages to determine what information to save for each document downloaded. In that way the data structure for storing such information would be dynamically determined, and the manner in which that information is used would be dynamically determined, without having to customize the code of the web crawler for each application.
Every web crawler must maintain a data structure or set of data structures reflecting the set of URLs that still must be downloaded. In this document, that set of data structures is called "the Frontier." The crawler repeatedly selects a URL from the Frontier, downloads the corresponding document, processes the downloaded document, and then either removes the URL from the Frontier or reschedules it for downloading again at a later time. The latter scheme is used for so-called "continuous" web crawlers.
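By way of illustration only, the select, download, process, and remove-or-reschedule cycle can be sketched as follows. The Frontier class shown here, its ordering by due time, the fixed refresh_interval and the max_steps bound are simplifying assumptions for this sketch and do not represent the Frontier structure of the invention.

```python
import heapq

class Frontier:
    """Minimal stand-in: URLs ordered by the time at which they are due."""
    def __init__(self):
        self._heap = []      # entries are (due_time, sequence_number, url)
        self._seq = 0        # tie-breaker that preserves insertion order

    def add(self, url, due_time=0.0):
        heapq.heappush(self._heap, (due_time, self._seq, url))
        self._seq += 1

    def pop(self):
        due_time, _, url = heapq.heappop(self._heap)
        return due_time, url

    def __len__(self):
        return len(self._heap)

def run(frontier, download, process, continuous, refresh_interval, max_steps):
    """Repeatedly select a URL, download it, process it, then drop or reschedule."""
    log = []
    for _ in range(max_steps):
        if len(frontier) == 0:
            break
        due_time, url = frontier.pop()
        process(url, download(url))
        log.append(url)
        if continuous:
            # Continuous crawl: reschedule instead of discarding the URL.
            frontier.add(url, due_time + refresh_interval)
    return log

one_shot = Frontier()
for u in ("http://a.example/", "http://b.example/"):
    one_shot.add(u)
one_shot_log = run(one_shot, lambda u: "", lambda u, d: None,
                   continuous=False, refresh_interval=10.0, max_steps=10)

continuous = Frontier()
for u in ("http://a.example/", "http://b.example/"):
    continuous.add(u)
continuous_log = run(continuous, lambda u: "", lambda u, d: None,
                     continuous=True, refresh_interval=10.0, max_steps=4)
```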
When selecting a URL from the Frontier, the inventors have determined that it would often be desirable for the crawler to preferentially select certain URLs over others so as to maximize the quality of the information processed by the other applications to which the web crawler passes downloaded documents. For instance, the web crawler may pass downloaded pages to a document indexer. An index of documents on an Intranet or the Internet will be more accurate or higher quality if the documents of most interest to the users of the index have been preferentially updated so as to make sure that those documents are accurately represented in the index. To accomplish this, the web crawler might preferentially select URLs on web servers with known high quality content. Alternatively, heuristics might be used to gauge page quality. For instance, shorter URLs might be considered to be better candidates than longer URLs.
In the context of a continuous web crawler, it may be desirable to prefer URLs on web servers whose content is known to change rapidly, such as news sites. It may be desirable to prefer newly-discovered URLs over those that have been previously processed. Among the previously processed URLs, it may be advantageous to prefer URLs whose content has changed between the previous two downloads over URLs whose content has not changed, and to prefer URLs with earlier expiration dates over those with later expiration dates.
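By way of illustration only, several of the preferences described above can be combined into a single priority level as sketched below. The function priority_level, the additive weights, the 40-character threshold for a "short" URL and the layout of the history records are all assumptions made for this sketch; the expiration-date preference is omitted for brevity.

```python
from urllib.parse import urlsplit

def priority_level(url, history, fast_changing_hosts=frozenset(), num_levels=5):
    """Return a level from 1 to num_levels; higher means download sooner.

    history is a list of per-download records (oldest first), each a dict
    with a "checksum" entry. The weights below are arbitrary assumptions.
    """
    level = 1
    if not history:
        level += 2        # newly discovered URLs are preferred
    elif len(history) >= 2 and history[-1]["checksum"] != history[-2]["checksum"]:
        level += 1        # content changed between the previous two downloads
    if urlsplit(url).hostname in fast_changing_hosts:
        level += 1        # server known to change rapidly, e.g. a news site
    if len(url) <= 40:
        level += 1        # shorter URLs treated as better candidates
    return min(level, num_levels)

long_url = "http://a.example/" + "x" * 60
new_news = priority_level("http://news.example/", [], {"news.example"})
unchanged = priority_level(long_url, [{"checksum": "1"}, {"checksum": "1"}])
changed = priority_level(long_url, [{"checksum": "1"}, {"checksum": "2"}])
```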
As alluded to earlier, web crawlers are traditionally used to collect documents from the world wide web, as well as from Intranets, for some purpose, the most common of which is to build an index for a search engine. However, since many of the documents on the web and on Intranets change over time, at any given point in time, some fraction of any web index will contain stale content.
There are two obvious approaches to refreshing an index. One is to perform repeated complete or "scratch" crawls to rebuild the index from scratch. The disadvantage of this approach is that many of the documents may not have changed between the two scratch crawls, in which case valuable computer resources will be wasted unnecessarily refetching and processing documents. Another approach is to perform a more targeted crawl, but it is difficult to know a priori which documents need to be refetched, since the web does not include an invalidation mechanism. That is, the only way to discover that a page has changed is to query its web server.
Therefore it would be desirable to have a mechanism for keeping the results of a crawl up to date, using a continuous crawl that is somehow biased toward pages that are most likely to have been changed since the last time the crawler fetched them.
A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues.
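By way of illustration only, the arrangement of one queue per host processed by multiple threads can be sketched as follows. The HostQueues class, its claim_queue method and the policy of handing an entire host queue to a single thread are hypothetical simplifications; the description does not specify how threads are matched to queues.

```python
import threading
from collections import defaultdict, deque
from urllib.parse import urlsplit

class HostQueues:
    """All URLs sharing a host component are stored in the same queue."""
    def __init__(self):
        self._queues = defaultdict(deque)
        self._lock = threading.Lock()

    def enqueue(self, url):
        host = urlsplit(url).hostname
        with self._lock:
            self._queues[host].append(url)

    def claim_queue(self):
        """Hand one whole host queue to the calling thread, or None if empty."""
        with self._lock:
            if not self._queues:
                return None
            return self._queues.popitem()    # (host, deque of URLs)

def worker(frontier, results, results_lock):
    while True:
        claimed = frontier.claim_queue()
        if claimed is None:
            return
        host, urls = claimed
        for url in urls:                     # a real crawler would download here
            with results_lock:
                results.append((host, url))

frontier = HostQueues()
for u in ("http://a.example/1", "http://b.example/1",
          "http://a.example/2", "http://c.example/1"):
    frontier.enqueue(u)

results = []
results_lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(frontier, results, results_lock))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each host queue is handled by one thread at a time in this sketch, no host receives concurrent requests, and URLs on the same host are processed in the order they were enqueued.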
The web crawler includes a set of tools for storing an extensible set of data with each document address (URL) in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes a set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each URL, when to delete records of information, and so on.
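By way of illustration only, an extensible per-URL store of name/value records with an application-chosen retention limit might look like the sketch below. The class UrlHistory, its method names and the keep-the-most-recent retention policy are hypothetical and are not the tools of the invention; the point shown is that the application, not the crawler, chooses the names, the value types and the number of records kept.

```python
from datetime import datetime

class UrlHistory:
    """Keeps up to max_records download records per URL (oldest first)."""
    def __init__(self, max_records):
        self.max_records = max_records     # retention chosen by the application
        self._records = {}                 # url -> list of dicts

    def add_record(self, url, **fields):
        """Store one record; field names and value types are up to the caller."""
        records = self._records.setdefault(url, [])
        records.append(dict(fields))
        excess = len(records) - self.max_records
        if excess > 0:
            del records[:excess]           # discard the oldest records

    def records(self, url):
        return list(self._records.get(url, []))

history = UrlHistory(max_records=2)
url = "http://a.example/"
history.add_record(url, when=datetime(2000, 1, 1), status=200, checksum="aa")
history.add_record(url, when=datetime(2000, 1, 2), status=200, checksum="aa")
history.add_record(url, when=datetime(2000, 1, 3), status=304, tags=("news", "en"))
kept = history.records(url)
```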
In another aspect of the present invention, the Frontier includes a set of parallel "priority queues," each associated with a distinct priority level. Queue elements for URLs to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.
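By way of illustration only, the distribution of queue elements from priority queues to underlying queues can be sketched as follows. The class PriorityFrontier and the rule of moving up to N elements per round from the queue at priority level N are assumptions chosen for this sketch; the description requires only that distribution follow the relative priorities.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PriorityFrontier:
    def __init__(self, num_levels):
        # One queue per priority level; a higher number means higher priority.
        self.levels = {lvl: deque() for lvl in range(1, num_levels + 1)}
        # Underlying queues, here one per host (an assumption for the sketch).
        self.host_queues = defaultdict(deque)

    def add(self, url, level):
        self.levels[level].append(url)

    def distribute(self):
        """One round: move up to `level` URLs from each priority queue."""
        moved = []
        for level in sorted(self.levels, reverse=True):
            queue = self.levels[level]
            for _ in range(min(level, len(queue))):
                url = queue.popleft()
                self.host_queues[urlsplit(url).hostname].append(url)
                moved.append(url)
        return moved

pf = PriorityFrontier(num_levels=3)
for i in range(1, 5):
    pf.add(f"http://high.example/{i}", level=3)
for i in range(1, 3):
    pf.add(f"http://low.example/{i}", level=1)

first_round = pf.distribute()
second_round = pf.distribute()
```

Under this weighting, low-priority elements still make progress every round, so they are delayed rather than starved.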
In yet another aspect of the present invention, the web crawler performs a continuous crawl. The URL element for each downloaded document is assigned a priority level and then reinserted into the Frontier, in the priority queue corresponding to the assigned priority level. The priority level is determined as a function of the extensible set of data stored with the queue element. Each queue element for a newly found URL is also assigned a priority level. That priority level is based on the fact that it is a newly found URL and may also be based on properties of the URL itself, or the web page on which the URL was found.