Web search services allow users to submit queries, and in response, they return a set of links to web pages that satisfy the query. Because a query may potentially produce a large number of results, search engines typically display the results in a ranked order. There are many ways to rank-order the links resulting from a query, including content-based ranking, usage-based ranking, and link-based ranking. Content-based ranking techniques determine how relevant the content of a document is to a particular query. Usage-based ranking techniques monitor which result links users actually follow, and boost the rank of these result links for subsequent queries. Link-based ranking techniques examine how many other web pages link to a particular web page, and assign higher ranks to pages with many incoming links. Examples of link-based ranking algorithms include PageRank, HITS, and SALSA.
Link-based ranking algorithms view each page on the web as a node in a graph, and each hyperlink from one page to another as a directed edge between the two corresponding nodes. There are two variants of link-based ranking algorithms: query-independent ones (such as PageRank), which assign an importance score to every web page in the graph regardless of any particular query, and query-dependent ones (such as HITS and SALSA), which assign a relevance score with respect to a particular query to each web page returned in the result set of that query. Query-independent scores can be computed prior to the arrival of any query, while query-dependent scores can only be computed once the query has been received.
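The graph view described above can be illustrated with a minimal sketch. It builds forward and backward adjacency maps from a handful of made-up links and computes a query-independent score using a textbook power-iteration form of PageRank. The function names, the damping factor of 0.85, the iteration count, and the sample links are all assumptions chosen for illustration; this is not the implementation used by any particular search engine.

```python
def build_graph(links):
    """Build forward and backward adjacency maps from (source, target) pairs."""
    forward, backward = {}, {}
    for src, dst in links:
        # Make sure every page appears as a node, even with no links one way.
        for page in (src, dst):
            forward.setdefault(page, set())
            backward.setdefault(page, set())
        forward[src].add(dst)   # directed edge src -> dst
        backward[dst].add(src)  # same edge, indexed by its target
    return forward, backward


def pagerank(forward, damping=0.85, iterations=50):
    """Textbook power iteration; damping and iteration count are assumed values."""
    pages = list(forward)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not forward[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in forward[p]:
                new_rank[q] += damping * rank[p] / len(forward[p])
        rank = new_rank
    return rank


# Hypothetical five-edge web graph over pages "a" through "d".
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
forward, backward = build_graph(links)
scores = pagerank(forward)
in_degree = {page: len(sources) for page, sources in backward.items()}
```

Because the scores depend only on the graph, they can be computed before any query arrives. A query-dependent algorithm such as HITS or SALSA would instead run on the subgraph around a result set, which is why it needs fast access to the forward and backward links at query time.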
Users expect to receive answers to a query within a few seconds, and all major search engines strive to provide results in less than one second. Therefore, any query-dependent ranking algorithm should compute scores for all pages in the result set in under one second, and ideally in under 100 milliseconds. However, the seek time of modern hard disks is on the order of 10 milliseconds, so a 100-millisecond budget allows only about ten disk seeks, far too few to follow the links of every page in a result set. Hard disks are therefore too slow to be used as a medium to store the web graph. In order to meet the time constraints, the web graph (or at least the most frequently used portions of it) has to be stored in memory, such as RAM, as opposed to disk storage.
A graph induced by the web pages stored in the corpus of a major search engine is extremely large. For example, the MSN Search corpus contains 5 billion web pages, which in turn contain on the order of 100 billion hyperlinks; the Google corpus is believed to contain about 20 billion web pages containing on the order of 400 billion hyperlinks. A web graph of this size cannot be stored in the memory of a single machine, even if the most effective compression techniques are applied. Therefore, the graph is distributed (“partitioned”) across multiple machines. Distributing the graph is orthogonal to compressing it; in practice, one does both.
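A back-of-the-envelope sketch makes the size argument concrete. It uses the figure of 100 billion hyperlinks given above; the bytes per link (8 uncompressed, 2 compressed), the 16 GiB of RAM per machine, and the use of a CRC32 hash to assign URLs to partitions are assumptions made purely for illustration, not figures or methods taken from any deployed system.

```python
import math
import zlib


def machines_needed(num_links, bytes_per_link, ram_bytes_per_machine):
    """Number of machines needed to hold one copy of the link data in RAM."""
    total_bytes = num_links * bytes_per_link
    return math.ceil(total_bytes / ram_bytes_per_machine)


def partition_of(url, num_partitions):
    """Assign a URL to a partition (hypothetical scheme: CRC32 of the URL).

    CRC32 is used only because it gives the same answer in every process,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(url.encode("utf-8")) % num_partitions


NUM_LINKS = 100 * 10**9          # from the text: on the order of 100 billion links
RAM_PER_MACHINE = 16 * 2**30     # assumed: 16 GiB usable per machine

uncompressed = machines_needed(NUM_LINKS, 8, RAM_PER_MACHINE)  # assumed 8 bytes/link
compressed = machines_needed(NUM_LINKS, 2, RAM_PER_MACHINE)    # assumed 2 bytes/link

home = partition_of("http://www.example.com/index.html", compressed)
```

Under these assumed figures the estimate comes to 47 machines without compression and 12 with it, for a single direction of links. Compression shrinks the machine count but does not bring it down to one, which is why partitioning and compression are used together.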
U.S. patent application Ser. No. 10/413,645, filed Apr. 15, 2003, entitled “System and method for maintaining a distributed database of hyperlinks”, and incorporated herein by reference in its entirety, describes a scheme for distributing a database of hyperlinks across multiple machines, such as database processors. An embodiment is referred to as the Scalable Hyperlink Store, or SHS (used herein to refer to any distributed hyperlink database).
SHS represents a web graph as three databases or “stores”: a uniform resource locator (URL) store, a forward link store, and a backward link store. Each store is partitioned across multiple machines; each machine holds the corresponding fractions (“partitions”) of all three stores in main memory in order to serve queries.
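The three-store arrangement can be sketched in a single process as follows. Each `Partition` object stands in for one machine and holds its fraction of the URL store, the forward link store, and the backward link store. The class names, the `(partition, local id)` identifier format, and the CRC32-based assignment of URLs to partitions are hypothetical; the actual SHS layout is described in the referenced application and is not reproduced here.

```python
import zlib


class Partition:
    """Stand-in for one machine: its fraction of each of the three stores."""

    def __init__(self):
        self.url_store = {}       # URL -> local integer id
        self.urls = []            # local integer id -> URL
        self.forward_store = {}   # local id -> set of target ids
        self.backward_store = {}  # local id -> set of source ids


class ToyHyperlinkStore:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _owner(self, url):
        # Hypothetical placement rule: hash of the full URL.
        return zlib.crc32(url.encode("utf-8")) % len(self.partitions)

    def url_to_uid(self, url):
        """Return (partition, local id) for a URL, registering it if new."""
        p = self._owner(url)
        part = self.partitions[p]
        if url not in part.url_store:
            part.url_store[url] = len(part.urls)
            part.urls.append(url)
        return (p, part.url_store[url])

    def uid_to_url(self, uid):
        p, local = uid
        return self.partitions[p].urls[local]

    def add_link(self, src_url, dst_url):
        src, dst = self.url_to_uid(src_url), self.url_to_uid(dst_url)
        # The forward link lives with the source page's partition,
        # the backward link with the target page's partition.
        self.partitions[src[0]].forward_store.setdefault(src[1], set()).add(dst)
        self.partitions[dst[0]].backward_store.setdefault(dst[1], set()).add(src)

    def out_links(self, url):
        p, local = self.url_to_uid(url)
        targets = self.partitions[p].forward_store.get(local, set())
        return sorted(self.uid_to_url(t) for t in targets)

    def in_links(self, url):
        p, local = self.url_to_uid(url)
        sources = self.partitions[p].backward_store.get(local, set())
        return sorted(self.uid_to_url(s) for s in sources)


store = ToyHyperlinkStore(num_partitions=4)
store.add_link("http://a.example/", "http://b.example/")
store.add_link("http://a.example/", "http://c.example/")
store.add_link("http://c.example/", "http://b.example/")
```

In this sketch, answering `out_links` or `in_links` for a page touches only the partition that owns that page, plus URL lookups on the partitions that own its neighbors. That locality is the point of keeping corresponding fractions of all three stores on the same machine.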
Major search engines crawl the web continuously, causing their view of the web to change over time. These changes are reflected in the search engine's index in a timely fashion. A hyperlink database such as SHS should likewise be updated in a timely fashion.
Continuous crawling changes the search engine's view of the web graph in several ways: newly discovered pages should be added to the hyperlink database; pages that have become irretrievable should be deleted from it; the links in newly discovered pages should be added; the links in deleted pages should be deleted; and the links contained in changed pages should be updated. Currently, it is prohibitively complex and expensive to perform incremental updates on an existing hyperlink database or URL store. Supporting incremental updates is challenging because of, for example, the linear data structures used and the ordering of the URLs or links that those structures maintain.
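One way to see why linear, ordered structures resist incremental updates is the following minimal sketch. It assumes, purely for illustration, that the URL store is a sorted list in which a URL's identifier is its position, and that links are stored as pairs of such identifiers. The text above does not specify this representation; the function name and sample URLs are hypothetical.

```python
import bisect


def insert_url(sorted_urls, new_url):
    """Insert a URL into a sorted list; report its position and how many
    existing URLs had to move (and therefore changed identifier)."""
    pos = bisect.bisect_left(sorted_urls, new_url)
    shifted = len(sorted_urls) - pos
    sorted_urls.insert(pos, new_url)
    return pos, shifted


# Assumed representation: identifier == position in the sorted URL list.
urls = ["http://a.example/", "http://c.example/", "http://d.example/"]
links = [(0, 1), (1, 2), (2, 0)]  # (source id, target id) pairs

pos, shifted = insert_url(urls, "http://b.example/")

# Every identifier at or after the insertion point is now off by one,
# so every stored link that mentions such an identifier must be rewritten.
links = [(s + (s >= pos), d + (d >= pos)) for s, d in links]
```

Under this assumed layout, adding a single page near the front of the ordering forces a pass over nearly the entire URL list and over every link that refers to a shifted identifier. At the scale of billions of pages, that cost is incurred for each newly crawled page unless updates are batched or the data structures are changed.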