The core of the World Wide Web (WWW) comprises several billion interlinked web pages. Accessing information on almost any of these web pages would be essentially impossible without the aid of systems that enable a user to search for specific text, or textual identifiers. Indeed, such systems, generally known as “search engines,” have increased in popularity as the WWW has grown in size.
However, to provide reasonable response times, search engines cannot search billions of web pages by accessing each page every time a user searches for a term. Instead, search engines typically rely on locally stored information that represents the relevant data, such as the text, from each web page. Thus, to identify one or more web pages that are responsive to a user's search query, a search engine need only access information local to the search engine.
Unfortunately, when dealing with billions of individual web pages, storing even a few kilobytes of data per page can require a total storage capacity of several terabytes. For example, a web page can be uniquely identified by its Uniform Resource Locator (URL). Thus, when storing relevant information about a web page, a search engine can identify the web page from which such information was obtained by its URL. Because a search engine may collect information from a single web page in multiple databases or data structures, it may need to reference that information using the web page's URL multiple times. A typical URL, expressed as plain text, can be a hundred bytes or more. Thus, for billions of web pages, the mere use of the URL to identify information obtained from the web page can, by itself, require several terabytes of storage capacity. Consequently, instead of using a text-based URL to identify a web page, search engines more commonly use a hash of the URL to identify a web page for purposes of storing information into their local search databases. Mathematically, at least 35 bits are required to uniquely identify between 16 and 32 billion web pages, and many search engines use hashes that produce hash values as large as 80 bits, or ten bytes. Nevertheless, even a ten-byte identifier for a web page can save terabytes of storage capacity when compared with a hundred-byte textual URL.
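The arithmetic in the preceding paragraph can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the ten-byte identifier is obtained by truncating a SHA-256 digest of the URL; the text above does not name a particular hash function, and the function names `url_id`, `min_bits`, and `bytes_saved` are hypothetical. Note also that the 35-bit figure is a lower bound for an ideal, collision-free assignment of identifiers, which is one reason a practical hash may use a wider value such as 80 bits.

```python
import hashlib


def url_id(url: str, id_bytes: int = 10) -> bytes:
    """Return a fixed-size identifier for a URL.

    Assumption: the identifier is the first `id_bytes` bytes of the
    SHA-256 digest of the URL text (10 bytes = 80 bits by default).
    """
    if not 0 < id_bytes <= 32:
        raise ValueError("id_bytes must be between 1 and 32 for SHA-256")
    return hashlib.sha256(url.encode("utf-8")).digest()[:id_bytes]


def min_bits(n_items: int) -> int:
    """Smallest number of bits able to give n_items distinct values."""
    if n_items < 1:
        raise ValueError("n_items must be positive")
    return max(1, (n_items - 1).bit_length())


def bytes_saved(n_pages: int, url_bytes: int = 100, id_bytes: int = 10) -> int:
    """Bytes saved per stored reference when an identifier replaces the URL text."""
    return n_pages * (url_bytes - id_bytes)


if __name__ == "__main__":
    # Hypothetical URL, used only to show the identifier's fixed size.
    print(len(url_id("http://www.example.com/some/long/path/page.html")))
    # 32 * 2**30 pages sits at the top of the 35-bit range.
    print(min_bits(32 * 2**30))
    # Savings for 20 billion pages, each referenced once.
    print(bytes_saved(20_000_000_000))
```

Under these assumptions, replacing a hundred-byte URL with a ten-byte identifier for 20 billion pages saves 1.8 trillion bytes for each place the identifier is stored, and the saving multiplies when the same page is referenced in several data structures.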
The problem of storing a large quantity of uniquely identifiable information is not unique to WWW search engines. For example, modern operating systems include an analog of a WWW search engine for providing users with an efficient interface to the users' ever-increasing collection of digital data. If each file is identified by its file path within the file system and its name, such information alone can require a hundred bytes or more. If a hundred thousand of the user's files are cataloged, the identification information alone can require several megabytes. Similarly, a large database comprising information associated with millions of individual entries can require several megabytes merely for the storage of identification information for those entries. In such cases, hashing often provides a mechanism by which the identifying information can be transformed into a value that requires less storage space. Unfortunately, the hashing mechanisms themselves often consume a large amount of storage space, offsetting some of the storage efficiency gains realized by using hashes in place of less space-efficient information.