Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called "web pages" may be stored on the global computer network known as the Internet, which includes the world wide web. Web pages can also be stored on Intranets, which are typically private networks maintained by corporations, government entities, and other groups. Each web page, whether on the world wide web or an Intranet, has a distinct address called its uniform resource locator (URL), which at least in part identifies the location or host computer of the web page. Many of the documents on Intranets and the world wide web are written in standard document description languages (e.g., HTML, XML). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to another web page by clicking on the links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URL's. Links in a web page may refer to web pages that are stored in the same or different host computers.
A web crawler is a program that automatically finds and downloads documents from host computers in an Intranet or the world wide web. When a web crawler is given a set of starting URL's, it downloads the corresponding documents and then extracts any URL's contained in those downloaded documents. Before downloading the documents associated with the newly discovered URL's, the web crawler needs to find out whether these documents have already been downloaded. If the documents associated with the newly discovered URL's have not been downloaded, the web crawler downloads the documents and extracts any URL's contained in them. This process repeats indefinitely or until a predetermined stop condition occurs.
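The crawl loop described above can be sketched roughly as follows. This is only an illustrative sketch, not the method of any particular crawler: the names `crawl`, `fetch`, `max_docs`, and `HREF_RE` are hypothetical, the stop condition is assumed to be a simple document count, and link extraction is reduced to a regular expression over double-quoted `href` attributes.

```python
import re
from collections import deque

# Simplified link extraction (assumption): only double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_docs=1000):
    """Return the URLs downloaded, in download order.

    `fetch` is a caller-supplied function (hypothetical) that maps a URL
    to the document text, or to None if the document is unavailable.
    `max_docs` stands in for the "predetermined stop condition".
    """
    downloaded = set()           # addresses of documents already downloaded
    frontier = deque(seed_urls)  # discovered addresses awaiting download
    order = []

    while frontier and len(downloaded) < max_docs:
        url = frontier.popleft()
        if url in downloaded:
            continue             # already downloaded: skip it
        body = fetch(url) or ""
        downloaded.add(url)
        order.append(url)
        # Extract URLs from the new document and queue the unseen ones.
        for link in HREF_RE.findall(body):
            if link not in downloaded:
                frontier.append(link)
    return order


# A tiny in-memory stand-in for a set of host computers.
pages = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="a">A</a> <a href="c">C</a>',
    "c": "",
}
result = crawl(["a"], pages.get)
```

Here the `downloaded` set plays the role of the record of already-downloaded documents; the cost of keeping such a record at web scale is the concern taken up in the following paragraphs.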
Typically, to find out whether the documents associated with a set of discovered URL's have already been downloaded, the web crawler checks a directory of downloaded document addresses. The directory stores the URL's of the downloaded documents, or representations of the URL's. The set of downloaded document addresses could potentially contain addresses of every document on the world wide web. As of 1999 there were approximately 500 million web pages on the world wide web and the number is continuously growing. Even Intranets can store millions of web pages. Thus, web crawlers need efficient data structures to keep track of downloaded documents and any discovered addresses of documents to be downloaded. Such data structures are needed to facilitate fast data checking and to avoid downloading a document multiple times.
One example of a known prior art method designed to facilitate fast data checking and to avoid downloading a document multiple times is the method implemented by the Scooter web crawler used by AltaVista. In the Scooter web crawler, the set of downloaded document addresses is represented by a set of corresponding fingerprints. Each fingerprint in the set of fingerprints is a fixed-size numerical checksum, calculated directly from its corresponding URL.
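As an illustration of the fingerprint idea, the sketch below represents each downloaded URL by a fixed-size numerical checksum and keeps only the checksums. The checksum function and fingerprint width used by Scooter are not stated here, so the choice of a 64-bit BLAKE2b digest, and the names `fingerprint` and `FingerprintSet`, are assumptions made purely for illustration.

```python
import hashlib


def fingerprint(url: str) -> int:
    """Fixed-size (64-bit) numerical checksum computed directly from a URL.

    The hash function and the 8-byte width are assumptions for this sketch.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class FingerprintSet:
    """Set of downloaded document addresses, stored only as fingerprints."""

    def __init__(self):
        self._seen = set()

    def add(self, url: str) -> bool:
        """Record a URL; return True if it was not already present."""
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

    def __contains__(self, url: str) -> bool:
        return fingerprint(url) in self._seen

    def __len__(self) -> int:
        return len(self._seen)


seen = FingerprintSet()
first = seen.add("http://example.com/index.html")   # newly recorded
second = seen.add("http://example.com/index.html")  # duplicate
```

Under the assumed 8 bytes per fingerprint, 500 million pages would need roughly 4 GB for the raw fingerprints alone, before any overhead of the containing data structure. Two distinct URLs can also map to the same fingerprint, in which case one of them would wrongly be treated as already downloaded; a wider fingerprint makes this less likely at the cost of more memory.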
For fast data access, the Scooter web crawler stores the set of fingerprints entirely in main memory. Due to the volume of documents on the world wide web, Scooter requires an extremely large main memory for storage of the directory of known web pages. The present invention provides more efficient document address representation and storage methods that avoid certain of the disadvantages and inefficiencies in the prior art.