World Wide Web (“Web”) search engines typically operate on very large data sets. For instance, it is not uncommon for a Web search engine to maintain more than 20 billion uniform resource locators (“URLs”) in its database. Each URL corresponds to a unique Web page. The URLs are variable-sized, ranging from approximately 5 to 1000 characters or more, and average approximately 80 characters in length. As a result, the mass storage capacity needed simply to store 20 billion URLs with an average length of 80 characters is in excess of 1.6 terabytes, assuming one byte per character. Because the set of URLs is so large, it is also very computationally expensive to perform processing operations on it.
In order to more efficiently perform processing functions on a large set of URLs, such as performing page rank computations, Web search engines commonly distribute the URLs over a group of server computers. The URLs assigned to each server computer are then mapped to contiguous integers locally on each of the computers. The integers are called rank identifiers (“rank IDs”). The rank IDs are utilized instead of the URLs to uniquely reference the corresponding Web pages because computers tend to be more efficient at processing integers than strings. In this way, identifiers for the Web pages can be stored and operated on using significantly less space than the actual URLs would require, which also improves performance.
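The distribution and local mapping described above can be illustrated with a minimal Python sketch. The hash-based assignment of URLs to servers, the function names `server_for` and `assign_rank_ids`, and the example URLs are assumptions made for illustration only; the text does not specify how URLs are assigned to server computers.

```python
import hashlib


def server_for(url: str, num_servers: int) -> int:
    """Pick a server for a URL by hashing it (an assumed partitioning scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def assign_rank_ids(urls, num_servers: int):
    """Return one URL -> rank ID table per server.

    Rank IDs are contiguous integers starting at 0, local to each server.
    """
    tables = [dict() for _ in range(num_servers)]
    for url in urls:
        table = tables[server_for(url, num_servers)]
        if url not in table:
            # The next unused integer on this server becomes the rank ID.
            table[url] = len(table)
    return tables


# Hypothetical example URLs.
example_urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
]
tables = assign_rank_ids(example_urls, num_servers=3)
```

Because each table numbers its URLs independently, the same integer may refer to different Web pages on different servers, which is why a rank ID is meaningful only on the server computer that assigned it.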
The process of distributing the URLs over the group of server computers and mapping the URLs to rank IDs on each server computer is, however, very computationally expensive. In fact, using previous solutions, the process of mapping the URLs to rank IDs can take as much as 25-30% of the total computation time of the page rank computation. Moreover, in order to exchange information regarding the URLs between the server computers, a rank ID local to one server computer must first be converted back to the corresponding URL, and then converted to a rank ID local to another server computer. Corresponding local rank IDs may be pre-computed, but this is also a computationally expensive process.
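The two-step conversion described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of any particular search engine: the class `LocalRankIds`, the function `translate`, and the example URLs are hypothetical, and the sketch assumes that both server computers hold an entry for the URL being exchanged.

```python
class LocalRankIds:
    """Rank ID table local to one server (hypothetical helper)."""

    def __init__(self):
        self._url_to_id = {}   # URL -> local rank ID
        self._id_to_url = []   # local rank ID -> URL

    def add(self, url: str) -> int:
        """Assign the next contiguous rank ID to a URL, or return the existing one."""
        rank_id = self._url_to_id.get(url)
        if rank_id is None:
            rank_id = len(self._id_to_url)
            self._url_to_id[url] = rank_id
            self._id_to_url.append(url)
        return rank_id

    def url_of(self, rank_id: int) -> str:
        return self._id_to_url[rank_id]

    def id_of(self, url: str) -> int:
        return self._url_to_id[url]


def translate(rank_id: int, source: LocalRankIds, target: LocalRankIds) -> int:
    """Convert a rank ID local to `source` into the rank ID local to `target`."""
    url = source.url_of(rank_id)   # step 1: local rank ID back to the URL string
    return target.id_of(url)       # step 2: URL string to the other local rank ID


# Hypothetical example: the same page has different rank IDs on two servers.
server_a, server_b = LocalRankIds(), LocalRankIds()
server_a.add("http://example.com/a")
shared_on_a = server_a.add("http://example.com/b")
shared_on_b = server_b.add("http://example.com/b")
translated = translate(shared_on_a, server_a, server_b)
```

Every exchange in this sketch passes through the full URL string, so both string lookups are repeated for each rank ID that crosses between server computers.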
It is with respect to these considerations and others that the disclosure made herein is presented.