1. Field of the Invention
The present invention generally relates to the analysis of data on the World Wide Web (Internet), and in particular to web-scale analysis of structured data.
2. Brief Description of Related Developments
The explosive growth of the web, and the difficulty of performing complex data analysis tasks on unstructured data, have led to interest in providing relational views of certain types of web-wide information. A number of approaches have emerged for specifying these relational views and for addressing the resulting performance questions.
The most direct line of research focuses on applying traditional relational approaches to the web, by defining a set of useful relations, such as the HTML “points to” relation, and allowing users to define queries over these relations using a structured query language. These approaches provide a great deal of power and generality, but at a cost in efficiency, for two reasons. First, the particular algorithms of interest in this domain may have efficient special-purpose solutions that even the most sophisticated query optimizers cannot hope to recognize. Second, the general set of neighborhood-type queries that can be phrased using the HTML “points to” relation in particular represents a large set of graph-theoretic problems that can, in general, be intractable. However, the relation is not arbitrary: the underlying structure of the web graph contains a number of regularities that can be exploited by particular algorithms but that are not apparent to a generic engine that treats “points to” as an arbitrary binary relation.
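As an illustration of the contrast drawn above, the following minimal sketch computes the same two-hop neighborhood in two ways: as a self-join that treats “points to” as an arbitrary binary relation, and with an adjacency index that exploits the graph structure. The toy relation, the page names, and both function names are hypothetical and serve only as an example; the nested-loop join is a deliberately naive stand-in for a generic engine, not a description of any particular query optimizer.

```python
from collections import defaultdict

# Hypothetical toy "points to" relation: (source_page, target_page) pairs.
POINTS_TO = {
    ("a", "b"),
    ("a", "c"),
    ("b", "c"),
    ("c", "d"),
}


def two_hop_join(relation):
    """Self-join on target = source, treating the relation as arbitrary pairs.

    This naive nested loop compares every pair of rows.
    """
    return {
        (src, dst)
        for (src, mid) in relation
        for (mid2, dst) in relation
        if mid == mid2
    }


def two_hop_adjacency(relation):
    """Same result, using an out-link index so only real neighbors are visited."""
    out_links = defaultdict(set)
    for src, dst in relation:
        out_links[src].add(dst)
    return {
        (src, dst)
        for src, mids in out_links.items()
        for mid in mids
        for dst in out_links.get(mid, ())
    }


if __name__ == "__main__":
    print(sorted(two_hop_join(POINTS_TO)))
    print(sorted(two_hop_adjacency(POINTS_TO)))
```

Both functions return the same set of pairs; the difference lies only in how much of the relation each one has to examine to produce it.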
Thus, a second line of research focuses on solving particular web-scale analysis problems without invoking the machinery of a generic query language. Such approaches are numerous. To date, one of the most successful instances of this line of research focuses specifically on the “points to” relation on web pages. Approaches of this kind can be implemented as sequences of “joins across relations” derived from “points-to,” but each instead gives a more efficient specific algorithm drawn from graph theory or numerical computation.
PageRank™ is a static ranking of web pages initially presented in, and used as the core of, the Google™ search engine (http://www.google.com). It is the most visible link-based analysis scheme, and its success has led virtually every popular search engine to incorporate link-based components into its ranking function.
The convergence rate of the traditional iterative approach to PageRank™ computation is, for various parameters, slow enough to cast doubt on this well-established technique. After performing a fairly significant number of iterations of this computationally intensive operation, the average error per page is roughly 50% of the value of that page. Examining the reason for the slow convergence yields a characterization of the repetitive small-scale structure in the web graph. The connectivity between web sites allows fast convergence of PageRank™-style algorithms, and the network of linkages between sites is in some sense optimized for effective browsing. On the other hand, the structure within individual web sites is limiting from the perspective of PageRank™, as well as from the perspective of an individual attempting to browse and discover the pages on the site.
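The iterative computation and the per-page error measure discussed above can be sketched as follows. This is the textbook power-iteration form of a PageRank-style computation, offered as an assumption rather than as the exact procedure evaluated above. The damping value of 0.85, the uniform handling of pages without out-links, the toy graph, and the names `pagerank` and `mean_relative_error` are all hypothetical choices made for illustration. The sketch does not attempt to reproduce the 50% figure, which concerns the real web graph.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Run a fixed number of power iterations on {page: [target pages]}."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread uniformly (an assumption).
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            targets = out_links.get(src, [])
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def mean_relative_error(approx, reference):
    """Average over pages of |approx - reference| / reference."""
    return sum(
        abs(approx[p] - reference[p]) / reference[p] for p in reference
    ) / len(reference)


# Hypothetical four-page graph used only to exercise the functions.
TOY_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    reference = pagerank(TOY_GRAPH, iterations=200)
    for k in (1, 2, 5, 10, 20):
        err = mean_relative_error(pagerank(TOY_GRAPH, iterations=k), reference)
        print(k, err)
```

Here the error is measured against a long run of the same iteration, which stands in for the converged values. On a graph this small the error shrinks quickly; the slow convergence described above is a property of the structure of the actual web graph.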