Web search services allow users to submit queries, and in response, they return a set of links to web pages that satisfy the query. Because a query may potentially produce a large number of results, search engines typically display the results in a ranked order. There are many ways to rank-order the links resulting from a query, including content-based ranking, usage-based ranking, and link-based ranking. Content-based ranking techniques determine how relevant the content of a document is to a particular query. Usage-based ranking techniques monitor which result links users actually follow, and boost the rank of these result links for subsequent queries. Link-based ranking techniques examine how many other web pages link to a particular web page, and assign higher ranks to pages with many incoming links. Examples of link-based ranking algorithms include PageRank, HITS, and SALSA.
Link-based ranking algorithms view each page on the web as a node in a graph, and each hyperlink from one page to another as a directed edge between the two corresponding nodes in the graph. There are two variants of link-based ranking algorithms: query-independent ones (such as PageRank), which assign an importance score to every web page in the graph regardless of any particular query, and query-dependent ones (such as HITS and SALSA), which assign each web page in the result set of a query a relevance score with respect to that query. Query-independent scores can be computed prior to the arrival of any query, while query-dependent scores, by their very nature, can only be computed once the query has been received.
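The graph view described above can be illustrated with a minimal sketch. It assumes hyperlinks are supplied as (source, destination) pairs of page identifiers, and it scores each page simply by its number of incoming links. The function names `build_graph` and `in_degree_scores` are hypothetical, and the in-degree count is only a stand-in for link-based ranking in general; it is not PageRank, HITS, or SALSA.

```python
from collections import defaultdict


def build_graph(hyperlinks):
    """Build forward and backward adjacency lists from (src, dst) pairs."""
    forward = defaultdict(list)   # page -> pages it links to
    backward = defaultdict(list)  # page -> pages that link to it
    for src, dst in hyperlinks:
        forward[src].append(dst)
        backward[dst].append(src)
    return forward, backward


def in_degree_scores(hyperlinks):
    """Score every page by its number of incoming links (illustrative only)."""
    _, backward = build_graph(hyperlinks)
    pages = {page for link in hyperlinks for page in link}
    return {page: len(backward.get(page, [])) for page in pages}


if __name__ == "__main__":
    links = [("a", "b"), ("c", "b"), ("b", "a")]
    for page, score in sorted(in_degree_scores(links).items()):
        print(page, score)
```

Because such a score depends only on the link structure, it could be computed before any query arrives, which is what makes it query-independent in the sense used above.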
Users expect to receive answers to a query within a few seconds, and all major search engines strive to provide results in less than one second. Therefore, any query-dependent ranking algorithm should compute scores for all pages in the result set in under one second, and ideally in less than 100 milliseconds. However, the seek time of modern hard disks is on the order of 10 milliseconds, so only a handful of disk seeks fit within such a time budget, which makes disks too slow to be used as a medium to store the web graph. In order to meet the time constraints, the web graph (or at least the most frequently used portions of it) has to be stored in memory, such as RAM, as opposed to disk storage.
A graph induced by the web pages stored in the corpus of a major search engine is extremely large. For example, the MSN Search corpus contains 5 billion web pages, which in turn contain on the order of 100 billion hyperlinks; the Google corpus is believed to contain about 20 billion web pages containing on the order of 400 billion hyperlinks. A web graph of this size cannot be stored in the memory of a single machine, even if the most effective compression techniques are applied. Therefore, the graph is distributed (“partitioned”) across multiple machines. Distributing the graph is orthogonal to compressing it; in practice, one does both.
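A rough back-of-envelope calculation shows why a graph of this size calls for partitioning. The sketch below is illustrative only: the function name `machines_needed`, the figure of 8 bytes per uncompressed link, and the 16 GiB of RAM per machine are all assumptions made for the example, not values taken from the text above.

```python
def machines_needed(num_links, bytes_per_link, ram_bytes_per_machine):
    """Return the minimum number of machines whose combined RAM holds all links."""
    total_bytes = num_links * bytes_per_link
    # Integer ceiling division avoids floating-point rounding on large values.
    return -(-total_bytes // ram_bytes_per_machine)


if __name__ == "__main__":
    links = 100 * 10**9            # on the order of 100 billion hyperlinks
    bytes_per_link = 8             # assumed: one 64-bit identifier per link
    ram_per_machine = 16 * 2**30   # assumed: 16 GiB of RAM per machine
    print(machines_needed(links, bytes_per_link, ram_per_machine))
```

Lowering `bytes_per_link` models the effect of compression. It reduces the machine count but does not remove the need to partition, which is consistent with the observation that distributing and compressing are orthogonal.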
U.S. patent application Ser. No. 10/413,645, filed Apr. 15, 2003, entitled “System and method for maintaining a distributed database of hyperlinks”, and incorporated herein by reference in its entirety, describes a scheme for distributing a database of hyperlinks across multiple machines, such as database processors. An embodiment is referred to as the Scalable Hyperlink Store, or SHS.
SHS represents a web graph as three databases or “stores”: a uniform resource locator (URL) store, a forward link store, and a backward link store. Each store is partitioned across multiple machines; each machine will hold corresponding fractions (“partitions”) of each store in main memory to serve queries. The role and the layout of the stores as well as the partitioning algorithm are described in more detail herein.
Computers may fail for a variety of reasons, such as the failure of a hardware component (e.g., disk drives, power supplies, processors, memory, etc.). Distributed systems composed of multiple computers are more vulnerable to failure: in a distributed system of n computers, where each individual computer fails independently with probability p during a given time interval, the probability that at least one of the constituent computers has failed is 1 − (1 − p)^n, which is greater than p whenever n > 1 and 0 < p < 1, and which increases with increasing n. Therefore, distributed systems should be designed to be fault-tolerant; that is, they should continue to function even if one or more of their constituent elements have failed.
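The failure probability stated above can be evaluated directly. The following is a minimal sketch that assumes failures are independent and identically likely across machines. The function name `prob_any_failure` and the sample values of p and n are hypothetical and chosen only for illustration.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n machines fails, each with probability p.

    Assumes machine failures are independent of one another.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # (1 - p) ** n is the probability that every machine survives the interval.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.01  # assumed per-machine failure probability for the interval
    for n in (1, 10, 100, 1000):
        print(n, round(prob_any_failure(p, n), 4))
```

Even a small per-machine failure probability compounds quickly as n grows, which is the motivation for designing the distributed store to tolerate the loss of individual machines.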