1. Field of Invention
The invention includes a method, system, and computer program product for ranking information sources found in a distributed network with hypertext links. In particular, the present invention relates to link-analysis-based ranking of hits from a search in a distributed network environment. The software/firmware implementations of the method constitute one component of a system for searching a distributed information system by way of such link-analysis-based ranking. The methods are applicable to environments in which documents or other files are related by links, such as the Internet.
2. Discussion of the Background Art
FIG. 1 is a basic representation of the Internet, showing the parts commonly used to build a search engine for the World-Wide Web (WWW). The crawler 1 collects information about web pages out on the WWW 2. All relevant textual information is fed into an inverted index 3, used as an off-line snapshot of the information available on the crawled part of the WWW. Information about the link structure (that is, which other web pages each web page is pointing to) is saved in a link database 4. When a user performs a search, she issues a search query 5, which is sent to the inverted index 3. The result of scanning the inverted index is an unprioritized list of hits. This hit list is then ranked according to text relevance 6 and link structure 7. The two ranking measures are then merged into one prioritized and ranked list 8, which is returned to the user originating the search query as a prioritized search result 9.
When the query results are obtained from the inverted index, they will in general contain hits/documents that reside on different WWW domains on the Internet. The references (links) that the documents make to one another implicitly build a directed graph. This directed graph consists of the documents as nodes and the hypertext links as directed edges, and it is this directed graph that is used in link-based ranking. Link-based ranking then evaluates the “weight” or “importance” of the hits (documents), based not on their own content, but on how they are located in the larger information network (the directed graph).
Link-analysis-based ranking is useful in any context where the documents to be ranked are related by directed links pointing from one document to another, and where the links may be interpreted as a form of recommendation. That is, a link pointing from document u to document v implies that a user interested in document u may also be interested in document v. Link analysis allows one to combine, in a useful way, the information contained in all such ‘recommendations’ (links), so that one can rank the documents in a global sense. The outstanding example of this kind of approach is the application of GOOGLE's PAGERANK method to the set of linked documents called the World Wide Web.
There are several alternative ways of doing link-based ranking and finding document ‘weights’. All methods are based on finding the principal eigenvector (the eigenvector associated with the largest eigenvalue) of the graph's adjacency matrix A, in different modifications. GOOGLE's PAGERANK method (discussed below) obtains a ranking for each of the documents by computing the principal eigenvector of the transposed adjacency matrix with the columns normalized to unity. In the HITS method, due to Kleinberg (discussed below), two rankings are obtained: 1) the hub score is obtained by computing the principal eigenvector of the adjacency matrix composed with its own transpose, and 2) the authority score is obtained by computing the principal eigenvector of the transposed adjacency matrix composed with the adjacency matrix itself. However, there are no implementations that address rankings that arise from the unmodified adjacency matrix (used alone) and its transpose (also used alone).
The various methods for link analysis are most easily explained by defining two simple operators, F (Forward) and B (Backward), and their normalized versions, f and b respectively. In the spirit of a random walk, it is possible to imagine a certain weight (a positive number) associated with each node on a directed graph. The F operator takes the weight w(u) at each node u and sends it Forward, i.e., to all the nodes that are pointed to by node u. The B operator sends w(u) against the arrows, i.e., to each node that points towards node u. B is the adjacency matrix A, while F is its transpose. The operator f is the column-normalized version of F; it takes the weight w(u) at node u, divides it by the outdegree k_{out,u} of node u, and sends the result w(u)/k_{out,u} to each node pointed to by node u. Similarly, b is the normalized version of the backward operator B.
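The four operators can be sketched on a small edge list. The following is a minimal illustration only, not the disclosed implementation: the 4-node graph, the function names forward and backward, and the choice to normalize b by the indegree of the sending node (by analogy with the outdegree normalization described for f) are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical 4-node graph; each pair (u, v) is a link pointing from u to v.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4)]
NODES = [1, 2, 3, 4]


def forward(weights, edges, normalized=False):
    """Apply F (or f when normalized=True): send weight along the arrows."""
    outdeg = defaultdict(int)
    for u, _ in edges:
        outdeg[u] += 1
    result = {n: 0.0 for n in weights}
    for u, v in edges:
        # f divides w(u) by the outdegree of u before sending it to v.
        share = weights[u] / outdeg[u] if normalized else weights[u]
        result[v] += share
    return result


def backward(weights, edges, normalized=False):
    """Apply B (or b when normalized=True): send weight against the arrows."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    result = {n: 0.0 for n in weights}
    for u, v in edges:
        # Assumption: b divides w(v) by the indegree of v before sending it to u.
        share = weights[v] / indeg[v] if normalized else weights[v]
        result[u] += share
    return result


unit = {n: 1.0 for n in NODES}
print("F:", forward(unit, EDGES))
print("f:", forward(unit, EDGES, normalized=True))
print("B:", backward(unit, EDGES))
print("b:", backward(unit, EDGES, normalized=True))
```

In this example node 4 has no outlinks, so the weight it holds is not passed on by F or f; this is the simplest instance of the sink behavior discussed later in this section.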
PAGERANK uses the f (normalized forward) operator, supplemented by the ‘random surfer’ operator (see below). The HITS method uses the composite operator FB to obtain Authority scores, and BF to obtain Hub scores. The present invention can be used with any operator which is subject to the problem of sinks—in particular, any of the basic operators F, B, f, or b.
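The composite HITS operators can be illustrated by repeated application to a weight vector (power iteration), which approximates the principal eigenvector. This is a hedged sketch rather than the patented HITS procedure: the 4-node graph, the helper names, the fixed iteration count, and the sum-to-one normalization are assumptions chosen for brevity.

```python
# Hypothetical graph: node 1 links to 3 and 4, node 2 links to 3.
EDGES = [(1, 3), (2, 3), (1, 4)]
NODES = [1, 2, 3, 4]


def send_forward(weights, edges):
    """Unnormalized F: each node passes its weight to the nodes it points to."""
    result = {n: 0.0 for n in weights}
    for u, v in edges:
        result[v] += weights[u]
    return result


def send_backward(weights, edges):
    """Unnormalized B: each node passes its weight to the nodes pointing to it."""
    result = {n: 0.0 for n in weights}
    for u, v in edges:
        result[u] += weights[v]
    return result


def principal_vector(step, nodes, iterations=100):
    """Approximate the principal eigenvector of `step` by power iteration."""
    weights = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        weights = step(weights)
        total = sum(weights.values())
        if total == 0.0:
            raise ValueError("all weight vanished; the operator has no positive vector here")
        weights = {n: w / total for n, w in weights.items()}
    return weights


# FB applies B first and then F; BF applies F first and then B.
authority = principal_vector(lambda w: send_forward(send_backward(w, EDGES), EDGES), NODES)
hub = principal_vector(lambda w: send_backward(send_forward(w, EDGES), EDGES), NODES)
print("authority:", authority)
print("hub:", hub)
```

In this small graph node 3 receives the highest Authority score, because it is pointed to by both linking nodes, and node 1 receives the highest Hub score, because it points to both linked nodes.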
One issue that all link-based ranking schemes must handle is the case of ‘sinks’ in the directed link graph structure. A ‘sink’ is a node, or a set of nodes, that has only links pointing into it, and no links pointing from the set of sink nodes to other nodes lying outside the set of sink nodes. Typically, sinks are composed of a set of nodes rather than one node; such a set is called a ‘sink region’. Also, every node in a sink region will be termed a ‘sink node’.
A problem with random walks on a directed graph is that they are easily trapped in sinks, that is, regions of the graph that have a way in but no way out. PAGERANK corrects for sinks by adding completely random hops (independent of the links) with a certain probability, while WISENUT corrects for sinks by employing a “page weight reservoir,” which is a fictitious node connected bidirectionally to every other node in the graph. Sinks exist in general in distributed hypertext systems; hence every method involving random walks on the directed graph must deal with this problem somehow.
A different approach has been patented (U.S. Pat. No. 6,112,202, the contents of which are incorporated herein by reference) by Jon Kleinberg of Cornell University (USA), based on work done with IBM's CLEVER project. The algorithm is often called HITS (“Hypertext Induced Topic Selection”).
There is no known problem with sinks in the HITS approach since, in applying either of the HITS operators BF or FB, one alternates between following the arrows (directed arcs), and moving against them. This approach, and variations of it, is addressed in several patents (e.g., U.S. Pat. Nos. 6,112,203, 6,321,220, 6,356,899, and 6,560,600, the contents of which are incorporated herein by reference).
A simple graph with 13 nodes (documents) is shown in FIG. 2. In FIG. 2, there are two sink regions: one sink region consists of the set of nodes (6,7,8) and the other sink region consists of the set of nodes (10,11,12,13). Any movement which only follows the arrows will be trapped in either of these sink regions, once it first arrives there.
The presence of sinks presents a significant practical problem for importance ranking by link analysis. The problem is that, for some approaches, sink nodes or sink regions can accumulate all the weight, while the other non-sink nodes (documents) acquire zero weight. In that case it is not possible to obtain a stable, non-zero, positive distribution of weight over the whole directed graph. Without such a weight distribution, a meaningful ranking of documents becomes impossible. That is, documents are typically ranked, via link analysis, by computing a positive, nonzero ‘importance weight’ for each node (obtained from the principal eigenvector for the chosen modification of the adjacency matrix) and then using this ‘link-analysis importance weight’, along with other measures of importance (such as relevance of the text to the query), to compute an overall weight, which in turn gives a ranking of the documents. When there are sinks, the principal eigenvector is prone to having zero weight over large parts of the graph. Importance ranking based on such an eigenvector is not useful.
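The accumulation of weight in a sink region can be observed numerically. The sketch below repeatedly applies the normalized forward operator f to a uniform starting weight; the 4-node graph, the function names, and the step count of 200 are assumptions made for illustration and do not correspond to any figure of this disclosure.

```python
# Hypothetical graph: nodes 1 and 2 link to each other, node 2 also links
# into the region {3, 4}, and no link leads from {3, 4} back out.
EDGES = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
NODES = [1, 2, 3, 4]


def apply_f(weights, edges):
    """One application of f: split each node's weight evenly over its outlinks."""
    outdeg = {}
    for u, _ in edges:
        outdeg[u] = outdeg.get(u, 0) + 1
    result = {n: 0.0 for n in weights}
    for u, v in edges:
        result[v] += weights[u] / outdeg[u]
    return result


def iterate_f(nodes, edges, steps):
    """Start from a uniform weight distribution and apply f `steps` times."""
    weights = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(steps):
        weights = apply_f(weights, edges)
    return weights


final = iterate_f(NODES, EDGES, 200)
print("weight outside the sink region:", final[1] + final[2])
print("weight inside the sink region: ", final[3] + final[4])
```

Every node in this example has at least one outlink, so the total weight is conserved; it simply drains from nodes 1 and 2 into the sink region, leaving the non-sink nodes with essentially zero weight.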
Mathematically, to say that a graph has sinks is equivalent to saying that the graph is not strongly connected. A directed, strongly connected graph is one such that, for any pair u and v of nodes in the graph, there exists at least one directed path from u to v, and at least one directed path from v to u. These paths do not necessarily run through the same set of graph nodes. More colloquially: in moving through a strongly connected graph, following the directed links, one can reach any node from any starting place. The existence of sink nodes, or sink regions, violates this condition: one can get ‘stuck’ in the sink, and never come out. Hence, any graph with sinks is not strongly connected, and so any remedy for the sink problem seeks to make the whole link graph strongly connected.
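The connectivity condition described above can be checked directly from reachability. The following sketch is one straightforward way to do so, assuming a small graph held in memory; the function names and the 5-node example are hypothetical, and the quadratic-time approach is chosen for clarity rather than for use on Web-scale graphs.

```python
def build_adjacency(edges, reverse=False):
    """Map each node to the set of nodes it links to (or is linked from)."""
    adjacency = {}
    for u, v in edges:
        if reverse:
            u, v = v, u
        adjacency.setdefault(u, set()).add(v)
    return adjacency


def reachable(start, adjacency):
    """Return every node reachable from `start` by following directed links."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strongly_connected(nodes, edges):
    """True if any node can reach all others along and against the links."""
    everything = set(nodes)
    start = nodes[0]
    return (reachable(start, build_adjacency(edges)) == everything
            and reachable(start, build_adjacency(edges, reverse=True)) == everything)


def sink_regions(nodes, edges):
    """Find sets of nodes that can be entered but never left."""
    adjacency = build_adjacency(edges)
    reach = {u: reachable(u, adjacency) for u in nodes}
    regions = []
    for u in nodes:
        # u is a sink node if it cannot reach the whole graph and every
        # node it can reach is able to reach it back.
        if len(reach[u]) < len(nodes) and all(u in reach[v] for v in reach[u]):
            region = frozenset(reach[u])
            if region not in regions:
                regions.append(region)
    return regions


# Hypothetical graph with a two-node sink region {3, 4} and a single sink node 5.
NODES = [1, 2, 3, 4, 5]
EDGES = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (1, 5)]
print(is_strongly_connected(NODES, EDGES))
print(sorted(sorted(region) for region in sink_regions(NODES, EDGES)))

# Adding one link out of each sink makes this graph strongly connected.
REPAIRED = EDGES + [(4, 1), (5, 1)]
print(is_strongly_connected(NODES, REPAIRED))
```

The last step shows, for this example only, that a graph with sinks can become strongly connected after a small number of links is added.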
GOOGLE's PAGERANK algorithm remedies the sink problem by adding a link from each node to every other node. That is, for each node in the graph, an outlink with a small weight is added to every other node. This modification is called the ‘random surfer’ operator, because it mimics the effects of a Web surfer who can, from any page (node), hop at random to any other page.
Conceptually, when the random surfer operator is used, the original link graph is perturbed by a complete graph structure. A complete graph is a graph that has a directed link from any node to any other node in the graph. The perturbation of the link graph by a complete graph results in a new graph that is also complete. The sink problem is thus solved—since the new graph is strongly connected—and one is thus assured a global ranking for all the nodes. However, this does not come without a price. The price paid is that the sparse structure of the link graph is sacrificed and replaced by a new perturbed graph, which is dense. This can lead to two possible types of problems: 1) The algorithm used to compute the ranking normally becomes more time consuming when the matrix is dense, and 2) the structure of the link graph is altered.
The first problem does not arise for the PAGERANK method. Because of the special (complete and symmetric) structure of the random surfer operator, its effects can be computed very easily. Hence the computation time of the PAGERANK algorithm is not significantly increased by the addition of the complete graph structure.
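One common way to exploit this structure is sketched below: the dense random surfer term contributes the same amount to every node, so it can be added as a single scalar after one sparse pass over the original links. This is a textbook-style illustration and not necessarily the formulation used by GOOGLE; the damping value 0.85, the uniform hop to any node (including the current one), the redistribution of weight from nodes without outlinks, and the example graph are all assumptions.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power iteration for f plus a uniform random surfer term."""
    n = len(nodes)
    outdeg = {u: 0 for u in nodes}
    for u, _ in edges:
        outdeg[u] += 1
    weights = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Sparse part: one pass over the original links (the operator f).
        nxt = {u: 0.0 for u in nodes}
        for u, v in edges:
            nxt[v] += damping * weights[u] / outdeg[u]
        # Dense part collapsed to a scalar: every node receives the same
        # share, so the complete graph is never built explicitly.
        dangling = sum(weights[u] for u in nodes if outdeg[u] == 0)
        uniform = ((1.0 - damping) + damping * dangling) / n
        weights = {u: nxt[u] + uniform for u in nodes}
    return weights


# Hypothetical graph with a sink region {3, 4}.
NODES = [1, 2, 3, 4]
EDGES = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
ranks = pagerank(NODES, EDGES)
print(ranks)
```

Each iteration costs time proportional to the number of original links plus the number of nodes, even though the perturbed graph is conceptually dense. In this example every node ends up with a positive weight, while the sink region still holds most of it.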
The second problem remains. Of course, it is not possible to change a non-strongly-connected graph to a strongly connected graph without changing its structure somehow. However, there is a real sense in which the PAGERANK modification is ‘large’. Suppose the original graph is large, say with a million nodes. (The number of documents in the World Wide Web is in the billions.) Then to say that the graph is ‘sparse’ is to say that the total number of links in the graph is roughly proportional to the number of nodes: in this case, some number times a million. (This number is the average node degree.) After performing the PAGERANK modification, however, the number of links is around a million times a million, about one trillion.
FIG. 3 illustrates the effect of the random surfer operator on the graph of FIG. 2. Here only the outlinks which are added to node 1 are shown. That is, after the addition of the random surfer links, node 1 has 12 outlinks rather than 2. Every other node in the graph will also have 12 outlinks. The other random surfer links have not been drawn in FIG. 3, simply to avoid visual clutter (there are 135 random surfer links in total for this graph).
In short, the PAGERANK sink remedy involves adding a potentially huge number of new links to the original graph. This change, while in some sense large, can at least be done in an unbiased fashion, by giving equal weights to all the added links. The presently disclosed methods also seek to make the graph strongly connected, also in an unbiased way, but by adding only a small number of links to the original graph.
Another algorithm, used by the WISENUT search engine (US patent application 2002-0129014), is somewhat similar to PAGERANK. The WISENUT method also adds a large number of links, by connecting every node bidirectionally to a “page weight reservoir” (denoted R). This allows every node to reach every other; and in fact, in the algorithm, the two hops u→R→v are collapsed to one. Hence, topologically, this is the same as PAGERANK. However, the probabilities of using the hops through R are different in the WISENUT rule: nodes with lower outdegree have a higher probability of using R. Nevertheless, it appears from the patent application that the non-sparseness of the resulting WISENUT matrix is manageable in the same way as that found in the PAGERANK matrix. Thus, the same advantages and disadvantages exist with WISENUT as mentioned above for PAGERANK.
A third approach to link analysis is by Jon Kleinberg (U.S. Pat. No. 6,112,202, the contents of which are incorporated herein by reference) of Cornell University (USA), based on work done with IBM's CLEVER project. The algorithm is often called HITS (“Hypertext Induced Topic Selection”). The HITS algorithm does not use the adjacency matrix directly; instead, it uses composite matrices, which are so structured as to not have sinks. Hence the HITS method may be said to include a method for avoiding the sink problem. However the composite matrices have their own problems. For one, they can give an ‘effective graph’ which has no connection between nodes which are linked in the original graph. In some cases this can lead to a connected original graph giving rise to a disconnected effective graph. There is no way to obtain a meaningful, global importance function for a disconnected graph; hence further assumptions or modifications are then needed in such cases.
The composite matrices will also connect many pairs of nodes which are not connected in the original graph. Hence there are many more nonzero entries in the composite matrices than there are in the original adjacency matrix. However, empirical studies suggest that these composite matrices are still sparse: in one example, where there were on average around 8 links for each node in the original adjacency matrix, there were found to be about 43 links in the effective graph for each node. Hence the HITS method appears to give a manageable numerical computation also.
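Both effects of the composite matrices, namely losing connections that exist in the original graph and gaining connections that do not, can be seen in tiny examples. The sketch below lists the nonzero entries of the composite operator FB, under the assumption that two nodes are linked in the effective graph exactly when some third node points to both; the function names and the two example graphs are hypothetical.

```python
from itertools import product


def effective_links(edges):
    """Pairs (v1, v2) with a nonzero FB entry: some node points to both."""
    targets = {}
    for u, v in edges:
        targets.setdefault(u, set()).add(v)
    links = set()
    for outs in targets.values():
        links.update(product(outs, repeat=2))
    return links


def off_diagonal(links):
    """Keep only links between two distinct nodes."""
    return {(a, b) for a, b in links if a != b}


# A connected chain 1 -> 2 -> 3: no two nodes share an in-neighbor, so the
# effective graph has no link between distinct nodes and is disconnected.
CHAIN = [(1, 2), (2, 3)]
print(off_diagonal(effective_links(CHAIN)))

# A star 1 -> {2, 3, 4}: three original links produce six effective links
# among nodes 2, 3, and 4, none of which are linked in the original graph.
STAR = [(1, 2), (1, 3), (1, 4)]
print(len(off_diagonal(effective_links(STAR))))
```

The same construction with the roles of source and target swapped would give the effective graph for BF.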
Finally, composite matrices have seen little or no commercial use, while the non-composite PAGERANK approach has been enormously successful. In Applicants' own tests (unpublished), the HITS method gave rather poor results, while both PAGERANK and the method of U.S. patent application Ser. No. 10/687,602 gave good results. (In these tests, ‘good results’ means giving a high ranking to the best nodes.) Thus it seems that HITS and related methods, while mathematically elegant, do not give good performance in terms of ranking. One feature that is desired, as discovered by the present inventors, is an approach which does not rely on using composite matrices.