Web crawlers and other systems maintain a database of information about web pages or documents accessible via a network. The network for which the database is maintained may be the Internet, an intranet, or another network, but for convenience we will herein refer to this database as a “web database.” The web database will generally store the address of each known web page, as well as information about the outbound links in the web page to other web pages. The addresses of web pages are often called uniform resource locators (URLs). Some web databases also store, for each page, information about all the links in other web pages that point to that page (herein called inbound links). The web database can also store additional information about the web pages, such as the last time the page was downloaded, the page's stated expiration date, and a fingerprint, sketch, or other representation that allows the page to be compared efficiently with other pages without comparing the actual contents of the pages.
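By way of illustration only, the kind of per-page record described above may be sketched as follows. The names `PageRecord`, `WebDatabase`, and `add_page`, and the field names, are hypothetical and are not taken from this description. The sketch stores every link as a full URL string, which is a naive baseline and not a memory-efficient representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class PageRecord:
    """One entry of a hypothetical web database."""
    url: str                                           # address of the page
    outbound: Set[str] = field(default_factory=set)    # URLs this page links to
    inbound: Set[str] = field(default_factory=set)     # URLs of pages linking here
    last_downloaded: Optional[float] = None            # timestamp, if ever fetched
    expires: Optional[float] = None                    # stated expiration, if any
    fingerprint: Optional[int] = None                  # compact content digest


class WebDatabase:
    """Maps each known URL to its record (naive, full-string storage)."""

    def __init__(self) -> None:
        self.pages: Dict[str, PageRecord] = {}

    def _record(self, url: str) -> PageRecord:
        # Create a record the first time a URL is seen, even if the page
        # itself has not been downloaded yet.
        rec = self.pages.get(url)
        if rec is None:
            rec = PageRecord(url)
            self.pages[url] = rec
        return rec

    def add_page(self, url: str, outbound_urls: Iterable[str]) -> PageRecord:
        # Record the page's outbound links and the matching inbound link
        # on each target page.
        page = self._record(url)
        for target in outbound_urls:
            page.outbound.add(target)
            self._record(target).inbound.add(url)
        return page


db = WebDatabase()
db.add_page("http://a.example/", ["http://b.example/", "http://c.example/"])
db.add_page("http://b.example/", ["http://c.example/"])
```

In this sketch each link is held twice, once as an outbound entry on the source page and once as an inbound entry on the target page, and each entry repeats a complete URL string.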
Referring to FIG. 1, there is shown a small portion of a hypothetical set 50 of interlinked pages 51–65 in a network. The figure shows the inbound links and outbound links for each page. The present invention is directed to a memory-space-efficient system and method for storing the outbound and/or inbound link information for a set of pages in a network.
If the number of web pages in the network is large, the amount of memory required to store the URLs and links in the web database will be correspondingly large. In systems in which it is important or desirable to store the entire web database in high-speed random access memory, such as web crawler systems, the link information should be stored efficiently so as to reduce the amount of memory required to store the web database.