1. Field of Invention
The invention includes a method, system, and computer program product for ranking information sources which are found in a distributed network with hypertext links. The software/firmware implementations of the method constitute one component of a system for searching a distributed information system aimed at giving a node ranking based on the disclosed method of hypertext link analysis. A complete system may also have several other components including tools which give ranking scores based on text relevancy; an indexing system; a crawler; and a user interface.
2. Discussion of the Background Art
A problem addressed by many devices and algorithms concerns ranking of hits after a search over a distributed information database. That is, in those cases where the search is guided by topic (keywords) rather than by a request for a specific document, there can often arise more matches to the search criteria (“hits”) than the user can evaluate or even look at. Hits may number in the thousands, or even higher. Ranking of the hits is therefore crucial: without some guide as to which hits are most relevant or valuable, good hits may be lost in a sea of mediocre or irrelevant hits.
When one ranks hits obtained from a keyword search through a hyperlinked database, there are two general types of ranking possible: text relevance ranking, and ranking based on link analysis. Typical search engines use both—although, in many cases, the simplest possible link analysis technique, namely link popularity, is used.
Text relevance ranking is based upon the content of the documents ranked, i.e., the relevance of that content to the keywords of the search. Thus, text relevance ranking is mostly insensitive to whether one looks at the entire set of documents (the “whole graph”, or WG), or only a subset of documents (a “subgraph”).
In contrast, link analysis ranks documents based on their position in a hyperlinked network—a type of “community of documents.” Some documents are found to have a “high” or “central” position in the linked network, and so are given high ranking. Because link analysis ranking (except for the naïve link popularity technique) is sensitive to the overall structure of the network (graph), the ranking results are sensitive to whether one looks at the whole graph, or only at a subgraph.
FIGS. 1-4 illustrate the relationships between text relevance ranking and link analysis ranking, for the two cases just described: (i) link analysis ranking based on the whole graph (FIGS. 1 and 2); and (ii) link analysis ranking based on a subgraph (FIGS. 3 and 4). FIGS. 1 and 3 give a simplified general picture for cases (i) and (ii), respectively, while FIGS. 2 and 4 give more details of the system architecture for each case.
We begin with FIG. 1. In this figure, as in all of FIGS. 1-4, we assume that a crawler or other technique has built up a database which describes both the content and the link structure for the whole graph WG. In FIG. 1, we see that link analysis 113 is applied to the whole-graph database 103, so that link analysis ranking of the documents is based on their position in the whole graph, and is thus independent of search terms. Search terms 101 are then used to pick out a set of hits 105, which are then given a text relevance ranking 107. Finally, the ranking from the whole-graph link analysis 113 and the text relevance ranking 107 are combined to give a net ranking score for each document, yielding a prioritized hits list 111.
In FIG. 2 the whole-graph database 103 is broken up into its two chief components: a content database 103a, and a link structure database 103b. Here the link analysis ranking 113a is done based on the whole graph and results in a link analysis database 113b. Again we see that keywords 101a are used by a hits list generator 105a to select a hits list 105b. This list 105b is then subjected to text relevance ranking 107a and given a text relevance ranking 107b, using information from the content database 103a. The two rankings 113b, 107b are then merged 111a, using any of a number of different possible rules, and yield a net ranking score for each document in the hits list. Finally, the ranked list is truncated to a predetermined size 101b, so that only the highest-ranked documents 111b are stored and presented.
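The merge-and-truncate step described for FIG. 2 can be sketched as follows. This is only an illustration: the text says that any of a number of merge rules may be used, so the weighted sum, the parameter `alpha`, and the function name `merge_and_truncate` are all assumptions made for this example.

```python
def merge_and_truncate(link_scores, text_scores, alpha=0.5, size=10):
    """Combine two score tables for a hits list and keep the top `size` documents.

    link_scores and text_scores map document ids to scores. The weighted sum
    used here is a hypothetical merge rule, not one prescribed by the text.
    """
    merged = {
        doc: alpha * link_scores.get(doc, 0.0) + (1.0 - alpha) * text_score
        for doc, text_score in text_scores.items()
    }
    # Highest net ranking score first, then truncate to the requested size.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:size]


link = {"a": 0.9, "b": 0.1, "c": 0.5}
text = {"a": 0.2, "b": 0.8, "c": 0.5}
top_two = merge_and_truncate(link, text, alpha=0.5, size=2)
```

Only documents present in the hits list (the keys of `text_scores`) are scored, which mirrors the fact that the whole-graph link analysis covers many documents that are not hits.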
FIG. 3 portrays in schematic form the use of text relevance ranking, in combination with link analysis ranking, when the latter is applied only to a subgraph. The hits list 105 is ranked according to text relevance 107, and then truncated, before link analysis ranking 113 is performed. The truncated list (subgraph) is fed to the link analysis routine 113, which also needs information (dashed line) from the WG database 103. The resulting subgraph link analysis ranking is finally combined with the text relevance ranking for the same subgraph, to give a merged ranking score 111 for the selected subgraph.
FIG. 4 shows this in more detail. In contrast to FIG. 2, here the hits list 105b that is generated by the hits list generator 105a with the search terms 101a is given a text relevance ranking 107a1, and truncated with a truncation size 101b, before link analysis ranking is performed. The truncated list 107b1 is sent to a subgraph generator 113c, which will enlarge the list into an expanded subgraph 113d in such a way as to give a coherent linked “community” of topic-related documents. This expanded subgraph 113d is then subjected both to link analysis ranking 113a and to text relevance ranking 107a2 to produce an expanded subgraph relevance ranking 107b2 and an expanded subgraph link analysis ranking 113e. Finally, the resulting ranking scores are merged 111a to give a single ranking 111b for all documents in the subgraph.
The present invention is directed to a novel method, apparatus, and computer program product for link analysis ranking. As no details about the method of link analysis ranking are shown in any of FIGS. 1-4, the figures do not describe the invention, but rather only give the context in which the present invention, or any other method of link analysis ranking, may be applied.
Currently, there are two broad classes of methods for ranking hits. The first evaluates relevance of the hit according to an analysis of the text in the found document, known as text relevance analysis. For example, if the search keywords are “Norwegian elkhounds”, then an algorithm is used to attempt to evaluate the relevance of the search terms in the found document. While this kind of ranking is effective, it can be “fooled” by authors of the documents, who seek a high ranking by repeating important keywords (artificially) many times.
The second class of algorithms evaluates “weight” or “importance” of the hits, based not on their own content, but on how they are located in the larger information network. That is, this class of algorithms employs link analysis to determine how “central” a given hit (document or node) is in a linked network of documents. The present invention is a type of hypertext link analysis.
In hypertext link analysis, hypertext links may be viewed simply as directed arrows pointing from one document to another. The set of documents and hypertext links, taken together, form a directed graph. One then seeks a rule for assigning a weight or importance to each node (document) in the graph, based on the link structure (topology) of the directed graph.
For example, a node with many nodes pointing to it is said to have high indegree. One might assign a weight to each node based solely on its indegree. However, this simple weighting approach, often called the “link popularity” method, is easily fooled, since one can create a large number of spurious documents, all pointing to a single document and giving it artificially high indegree. Nevertheless, link popularity ranking is used by a number of commercial search engines, probably due to its simplicity.
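The weakness of link popularity can be seen in a small sketch. The graph, the node names, and the number of spurious pages are invented for illustration only.

```python
from collections import Counter


def indegree_scores(links):
    """Count inlinks per node; `links` is an iterable of (source, target) pairs."""
    return Counter(target for _source, target in links)


links = [("a", "b"), ("c", "b"), ("b", "d")]
# A hypothetical spammer adds 50 throwaway pages that all point to "d".
spam = [(f"spam{i}", "d") for i in range(50)]

honest = indegree_scores(links)
inflated = indegree_scores(links + spam)
```

In the honest graph "b" has the highest indegree; after the spurious pages are added, "d" dominates although nothing about its real neighborhood has changed.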
Another method, used by both the PageRank algorithm of Google (U.S. Pat. No. 6,285,999, the contents of which are incorporated herein by reference), and by the search engine WiseNut (U.S. Patent Application 2002-0129014, the contents of which are incorporated herein by reference), involves finding the fraction of time a random walker, moving over the graph and following the directed links between nodes, would spend at each node. Clearly, high indegree will contribute positively to this score; however other aspects of the neighborhood of each node are also important. For instance, those nodes pointing to a node having high indegree must also have significant weight; otherwise the high indegree gives little weight to the node in question. Hence the random-walker approach is more sensitive to the overall topological structure of the graph.
One problem with random walks on a directed graph is that they are easily trapped in “sinks”, that is, regions of the graph that have a way in, but no way out. PageRank corrects for sinks by adding completely random hops (independent of the links) with a certain probability, while WiseNut corrects for sinks by employing a “page weight reservoir,” which is a fictitious node connected bidirectionally to every other node in the graph. Sinks exist in general in distributed hypertext systems; hence every method involving random walks on the directed graph must deal with this problem somehow.
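The random-walker idea with random hops can be sketched as a simple iteration. This is a minimal illustration under stated assumptions, not the implementation of any search engine: the hop probability of 0.15, the rule that a walker at a node without outlinks always hops, and the name `random_walk_scores` are choices made for this example.

```python
def random_walk_scores(out_links, hop_prob=0.15, iterations=100):
    """Estimate the fraction of time a random walker spends at each node.

    out_links maps every node to the list of nodes it points to (every target
    must also appear as a key). With probability hop_prob the walker jumps to
    a uniformly random node; a walker at a node with no outlinks always jumps.
    """
    nodes = list(out_links)
    n = len(nodes)
    weight = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_weight = {node: 0.0 for node in nodes}
        jump_mass = 0.0
        for node in nodes:
            targets = out_links[node]
            if targets:
                jump_mass += hop_prob * weight[node]
                share = (1.0 - hop_prob) * weight[node] / len(targets)
                for target in targets:
                    new_weight[target] += share
            else:
                # Sink: all of the walker's weight leaves by a random hop.
                jump_mass += weight[node]
        for node in nodes:
            new_weight[node] += jump_mass / n
        weight = new_weight
    return weight


graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}
scores = random_walk_scores(graph)
```

Node "c" is a sink in this toy graph. Without the random hops all weight would end up there; with them, every node keeps a positive score and the total weight stays equal to 1.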
A different approach has been patented (U.S. Pat. No. 6,112,202, the contents of which are incorporated herein by reference) by Jon Kleinberg of Cornell University (USA), based on work done with IBM's CLEVER project. The algorithm is often called HITS (“Hypertext Induced Topic Selection”).
HITS is most easily explained by defining two simple operators: F (Forward) and B (Backward). In the spirit of a random walk, it is possible to imagine a certain weight (a positive number) associated with each node on a directed graph. The F operator takes the weight w(i) at each node i and sends it Forward, i.e., to all the nodes that are pointed to by node i. The B operator sends w(i) against the arrows, i.e., to each node that points towards node i.
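The two operators can be written down directly from these definitions. In this sketch the graph is stored as a dictionary of outlink lists; that representation and the function names are assumptions made for illustration.

```python
def forward(out_links, weight):
    """F: send a copy of w(i) to every node that node i points to."""
    new_weight = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        for target in targets:
            new_weight[target] += weight[node]
    return new_weight


def backward(out_links, weight):
    """B: send a copy of w(i) to every node that points to node i."""
    new_weight = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        for target in targets:
            # The link node -> target carries w(target) back to node.
            new_weight[node] += weight[target]
    return new_weight


graph = {"a": ["b", "c"], "b": ["c"], "c": []}
ones = {"a": 1.0, "b": 1.0, "c": 1.0}
sent_forward = forward(graph, ones)
sent_backward = backward(graph, ones)
```

Starting from unit weights, one application of F leaves each node with its indegree, and one application of B leaves each node with its outdegree.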
Next we explain the use of compound operators. Suppose for instance we wish always to first use the F operator, and then follow with the B operator. Using standard matrix notation, this compound operator (F followed by B) would be denoted BF. (Matrix operators act on vectors to the right; hence the rightmost operator acts first.) Similarly, a compound operator composed of B followed by F would be denoted FB.
Henceforth, we use the term “non-compound operator” to refer to the operators F and B (and to their normalized versions, denoted f and b). Of course, any product of operators (matrices) is a new operator (matrix), which can be used to redistribute weights on a graph. However, the compound operators BF and FB have the special property that they always alternate the direction of the “flow” of weight distribution, between flowing “with” the arrows of the hyperlinks, and “against” these arrows. The non-compound operators B and F, in contrast, may each be used in isolation from the other, so that the flow is never reversed. We will see that this difference can have large effects on the results of application of these operators for document ranking.
The HITS algorithm uses repeated application of the compound operators BF and FB, to obtain two importance scores for each node. For instance, after many repetitions of FB, the weights at each node will converge (up to an overall scale factor) to a stable value, which is then called the node's “Authority score”. Similarly, repeated operation by BF gives a “Hub score.” Thus, one may say that “good Authorities are pointed to by good Hubs”. That is, a node has a high Hub score if it points to many good (or a few very good) Authorities, i.e., nodes with relevant content. Also, a node has a high Authority score if it is pointed to by many good (or a few very good) Hubs. Thus the two scores are defined mutually.
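A minimal sketch of this iteration follows. It is an illustration only: the edge-list representation, the rescaling of the weights to sum to 1 after each step (needed because the non-normalized operators otherwise make the weights grow without bound), and the fixed iteration count are assumptions made for this example.

```python
def hits_scores(edges, iterations=50):
    """Return (authority, hub) scores from repeated use of FB and BF.

    edges is a list of (source, target) hyperlinks.
    """
    nodes = sorted({node for edge in edges for node in edge})

    def rescale(weight):
        total = sum(weight.values())
        return {node: value / total for node, value in weight.items()}

    def forward(weight):   # F: weight flows along the arrows
        out = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            out[target] += weight[source]
        return out

    def backward(weight):  # B: weight flows against the arrows
        out = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            out[source] += weight[target]
        return out

    authority = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        authority = rescale(forward(backward(authority)))  # FB: B acts first
        hub = rescale(backward(forward(hub)))              # BF: F acts first
    return authority, hub


edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
authority, hub = hits_scores(edges)
```

In this toy graph "a1" receives the highest Authority score, and "h1" and "h2", which point to both authorities, receive higher Hub scores than "h3".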
There is no known problem with sinks in the HITS approach since one alternates between following the arrows (directed arcs), and moving against them. This approach, and variations of it, are addressed in several patents (e.g., U.S. Pat. Nos. 6,112,203, 6,321,220, 6,356,899, and 6,560,600, the contents of which are incorporated herein by reference), and variations of HITS appear to be in use in the commercial search engines Teoma and AltaVista. This statement is based on examination of publicly available documents about existing search engines, including patents owned by them—in particular, AltaVista has several US patents based on variations of the HITS method.
An important feature of the HITS method is that the operators F and B are not “normalized”. A normalized operator does not change the total amount of “weight” present on the graph. For example, a normalized F operator (which we will write as f) will take the weight w(i) and redistribute it to all the nodes “downstream” of node i. That is, for the f operator the total weight sent out from node i is equal to the weight found at node i. In contrast, the (non-normalized) F operator sends a “copy” of weight w(i) to each node found downstream from i—so that the total weight sent out is w(i), multiplied by the outdegree of i.
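The difference between F and f can be checked numerically. The sketch below uses a single hypothetical function with a `normalized` flag; the graph is chosen so that every node has at least one outlink, since weight sitting at a node without outlinks is simply not sent anywhere by either operator.

```python
def send_forward(out_links, weight, normalized):
    """Apply F (normalized=False) or f (normalized=True) once."""
    new_weight = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        if not targets:
            continue  # no outlinks: this node's weight is not sent anywhere
        # f splits w(i) among the downstream nodes; F sends each a full copy.
        amount = weight[node] / len(targets) if normalized else weight[node]
        for target in targets:
            new_weight[target] += amount
    return new_weight


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ones = {"a": 1.0, "b": 1.0, "c": 1.0}
copied = send_forward(graph, ones, normalized=False)
split = send_forward(graph, ones, normalized=True)
```

With unit starting weights the total is 3. After f the total is still 3, while after F it is 4, the sum of w(i) multiplied by the outdegree of each node i.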
This feature may seem small, but it can have very large effects. There is an algorithm called SALSA (SALSA: The Stochastic Approach for Link-Structure Analysis, ACM Transactions on Information Systems 19(2), pp. 131-160, April 2001, the contents of which are incorporated herein by reference) which is essentially identical to the HITS algorithm, with the one exception that it uses the normalized operators fb and bf. This small change turns out to be highly nontrivial: the Hub and Authority scores for the SALSA algorithm turn out to be, respectively, simply the outdegree and indegree for each node. Thus, normalizing the HITS algorithm (making it “weight-conserving”) completely eliminates any sensitivity of the approach to the structure of the graph as a whole; instead, the results are equivalent to the naïve link-popularity approach.
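The collapse to indegree can be observed on a small example. The sketch below repeatedly applies the normalized compound operator fb to weights placed on the nodes that have inlinks. The function name, the starting weights, and the toy graph are assumptions for illustration, and the result is shown for a graph whose authorities are all connected through shared hubs.

```python
from collections import Counter


def salsa_authority(edges, iterations=100):
    """Repeatedly apply the normalized compound operator fb (b acts first)."""
    indegree = Counter(target for _source, target in edges)
    outdegree = Counter(source for source, _target in edges)
    # Start with equal weight on every node that has at least one inlink.
    weight = {node: 1.0 / len(indegree) for node in indegree}
    for _ in range(iterations):
        hub_side = Counter()
        for source, target in edges:   # b: split w(target) among its in-neighbors
            hub_side[source] += weight[target] / indegree[target]
        weight = Counter()
        for source, target in edges:   # f: split w(source) among its out-neighbors
            weight[target] += hub_side[source] / outdegree[source]
    return dict(weight)


edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
salsa = salsa_authority(edges)
```

Here "a1" has indegree 3 and "a2" has indegree 2, out of 5 links in total, and the converged weights are 3/5 and 2/5: exactly the indegrees divided by the number of links.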
A similar result holds for undirected graphs (where F and B become the same). Here a normalized version simply gives node degree, while the non-normalized version gives a score (“eigenvector centrality”) which is nontrivial, and sensitive to the overall graph structure.
One might conclude from this that normalized operators cannot give useful results in ranking nodes on graphs. This conclusion is however not correct. The PageRank algorithm used by Google—described above as a random walk—is equivalent to using the f operator (supplemented by completely random hops to escape sinks). Google is the dominant search engine on the Web today, and its PageRank algorithm is one of the important reasons for that dominance: it gives meaningful and useful ranking results.
One other normalized operator (b) has been briefly mentioned in a research paper by Ding et al. (LBNL Tech Report 49372, updated September 2002, the contents of which are incorporated herein by reference). Ding et al. offer an extremely short (one sentence) discussion of the performance of document ranking based on this operator, implying that it gives similar results to the Hub scores for the HITS algorithm. We use ‘DHHZS’ (first letters of the authors' last names) to refer to the study of the b operator in this paper.
In the following we summarize the above discussion of methods for ranking using hypertext link analysis. Two methods (SALSA and HITS) use compound operators. Both methods give two types of scores for each document. SALSA however is equivalent to link popularity, while HITS gives nontrivial results that depend on the overall link structure. PageRank uses only a normalized Forward operator, and yields a single score which is also more useful than naïve link counting. Finally, the paper of DHHZS mentions a normalized Backward operator, which also yields a single, nontrivial score.
Shortcomings of the four categories of algorithms listed above (i.e., normalized combined forward/backward; normalized backward only; normalized forward only; non-normalized combined backward/forward) are discussed below.
Some methods do not use link analysis at all in their ranking procedure. These methods include text relevance ranking (discussed above); paid rankings; and ranking according to human judgment.
Paid ranking is a very simple system which has a very different marketing approach and audience. Engines using paid rankings are employed by users for purposes other than finding the best information.
Ranking according to human judgment has the obvious disadvantage that it is too slow and expensive for covering very large systems such as the World Wide Web.
Text ranking is used by all commercial search engines. We expect text ranking to be an important component of any good ranking system. In fact, the best search systems will include both a text ranking system and a system of ranking by link analysis (see, e.g., the Google search engine).
Most, if not all, methods for ranking pages (i.e., documents) which employ hypertext analysis (in use, and/or patented) are based upon one of three methods.
Link popularity. Here one simply counts the number of pages that are linked to a given page (its “degree”). Hyperlinks have a direction; hence each node has two measures of link popularity: indegree (the number of pages pointing to the given page) and outdegree (the number of links coming from the given page). These two different measures of link popularity roughly correspond, respectively, to the Authority and Hub scores in the HITS method.
PageRank. Here a page's rank is roughly equal to the fraction of time a “random surfer” would visit the page. The random surfer follows outlinks only (with a certain probability); otherwise this surfer makes random jumps to a new page. Because PageRank follows only outlinks, its results are more like Authority scores than Hub scores. That is, a high PageRank score indicates that many good pages point to the given page.
HITS. Here there are two “mutually reinforcing” scores. In fact, they are mutually defined: a page is a good Authority if it is pointed to by (many) good Hubs; and a page is a good Hub if it points to (many) good Authorities. The basic idea is similar to link popularity, in that good Authorities are likely to have high indegree, and good Hubs are more likely to have high outdegree.
It is possible to compare the different known methods for ranking by hypertext link analysis. Link popularity has the clear shortcoming described above—that it is too susceptible to artificial means for raising one's own score by simply adding multiple inlinks to a site. The only advantage of link popularity over the other methods is its simplicity. The other two approaches—HITS and PageRank—are both promising techniques. It is more sensible to compute PageRank scores for a huge network such as the Web, than it is to compute Authority and Hub scores. The HITS method gets around this problem, typically, by doing the link analysis on a smaller subgraph of the whole graph. This subgraph is composed of the set of hits, their in- and out-neighbors, and the links between these documents.
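The construction of such a subgraph can be sketched as follows, assuming the link structure is available as a list of (source, target) pairs. The function name `expand_subgraph` and the example data are hypothetical.

```python
def expand_subgraph(edges, hits):
    """Return (nodes, links) for the hits plus their in- and out-neighbors."""
    hits = set(hits)
    nodes = set(hits)
    for source, target in edges:
        # Any link touching a hit brings its other endpoint into the subgraph.
        if source in hits or target in hits:
            nodes.update((source, target))
    # Keep only the links whose two endpoints both lie inside the subgraph.
    links = [(s, t) for s, t in edges if s in nodes and t in nodes]
    return nodes, links


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "a")]
nodes, links = expand_subgraph(edges, hits=["b"])
```

For the single hit "b", the subgraph contains "b", its in-neighbor "a", and its out-neighbor "c", together with the two links among them; pages two steps away are left out.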
In summary, the PageRank link analysis technique is applied to the whole graph, as in FIGS. 1 and 2. HITS and related techniques are, in contrast, applied to topic-related subgraphs, as shown in FIGS. 3 and 4. The tight coupling of the two types of scores in the HITS approach makes the application of the HITS method to the whole graph of dubious benefit. PageRank, on the other hand, has not to our knowledge been applied to subgraphs, and it is not clear what sort of results would be obtained.
What is required, as discovered by the present inventors, is an algorithm that may be used for the entire Web graph (as may PageRank), and yet one which (unlike PageRank) yields two distinct scores for each document. That is, the new algorithm should not use compound operators (thus avoiding known problems with the HITS method), and it should be possible to apply it either to the whole graph, or to a subset of documents which are confined to a single theme.