1. Field of Use
The present invention is a novel method and device for evaluating the relevance of an electronic document, as compared to other electronic documents in an electronic document set. Relevance may be defined with reference to a search (e.g., a set of keywords); alternatively, relevance may be defined independently of a search, but refers to a measure of the document's relevance with respect to the entire document set (in which case relevance may be termed ‘importance’).
2. Description of the Background
In the field of electronic document and text searching, a challenge exists to find ways for a machine to evaluate relevance of documents or other information objects (the term ‘document’ being construed to mean any kind of information object in a searchable set). Most methods fall into one of two categories: text relevance analysis (TRA), and link analysis (LA).
Text relevance analysis involves electronically analyzing the content of a given document D, and using computer-based methods to determine the document's relevance (typically, with respect to keywords in an electronic search).
In contrast, link analysis electronically analyzes the context in which a document lies (rather than its content). More specifically, this context is defined in terms of links between documents. Link analysis methods exploit information about the structure of the set of links between documents, in order to obtain a measure of the importance (i.e., relevance) of each document. The word centrality is illustrative here: Link analysis seeks to measure the “centrality” of each document, as seen in the context of all the relations (links) between documents.
The present invention is an improvement on at least one known method for link analysis. The famous PAGERANK algorithm of GOOGLE is an outstanding example of a link analysis approach. PAGERANK is search-independent—that is, the importance of a Web page is evaluated by the PAGERANK algorithm using only information on the set of links in the Web graph. Thus, a page's PAGERANK score does not depend on the search being performed.
A PAGERANK score is obtained by finding the principal eigenvector of a large matrix, where information about the hyperlinks in the Web graph is stored in this matrix. Computing the PAGERANK scores of all documents in a hyperlinked graph can be very time consuming (and also memory intensive) if the number of documents is large. One operation that is very time consuming is the multiplication of the matrix by a vector. This operation (matrix×vector) must be used repeatedly for finding the principal eigenvector of the matrix. For sparse matrices (which is typically the case for reasonable document sets), the time to multiply (matrix×vector) scales roughly linearly with the number of documents (N). Hence, for large document sets, the desire for computational efficiency drives interest in finding ways to reduce the number of (matrix×vector) operations that are needed (hereinafter, the number of iterations).
An alternate method to PAGERANK and related approaches for link analysis is termed ‘link popularity’. With link popularity, the importance of a document D (in a hyperlinked set) is considered to be proportional to the number of documents which link into the document D. The idea here is that a link pointing to D is a type of recommendation of D by the author of the document P which points to D. Hence, a computing device can estimate the importance or (search-independent) relevance of D by simply counting the number of recommendations (in-links) that D has.
An advantage of link popularity is that link popularity can be determined quickly. In fact, computing link popularity for an entire document set can be easily shown to be equivalent (in terms of time needed) to a single iteration of a simple matrix operation (matrix×vector). Thus, if the document set is large, finding link popularity scores may be performed much more quickly than finding the PAGERANK scores.
Link popularity has, however, an important weakness—link popularity gives each recommendation the same weight. That is, no consideration is given to the quality of the page P which is making the recommendation (i.e., pointing to D). This is contrary to typical human experience: normally a person wants to know the quality of the recommender before deciding how much weight to give to the recommendation.
Also, one can ‘spam’ the link popularity approach by laying in a set of worthless ‘dummy’ pages whose only function is to point to a document D and to thus increase the document's importance rating. The possibility of spamming link popularity scores is well known. A common fix for this problem is to make the weight given to a recommendation proportional to the importance of the recommending document. This eliminates the simple from of spamming described above. However, the mathematical result of implementing this requirement is that one must find the principal eigenvector of the link topology matrix. Thus, one comes back to a need for many (matrix×vector) iterations rather than just one (matrix×vector) iteration.
The inventors of the present invention have (in earlier co-pending applications Ser. Nos. 10/687,602, 10/918,713, 11/227,495, and 11/349,235) exploited a type of link analysis different from PAGERANK type methodologies. These prior inventions involve using hyperlinks and various aspects of link analysis and document importance scores, including: methods for analyzing hyperlinked document sets; methods for link-poor domains, where one builds and exploits a similarity graph; and also methods for a hybrid case, in which a computing device exploits both existing hyperlinks and similarity links. The methods of co-pending applications Ser. Nos. 10/687,602, 10/918,713, 11/227,495, and 11/349,235 for hyperlink analysis have been shown in tests to be at least as good at giving relevance scores as is the PAGERANK method of GOOGLE. Furthermore, the methods using similarity links provide useful relevance scores for documents in link-poor domains.
One technique used in the approaches of co-pending applications Ser. Nos. 11/227,495 and 11/349,235 is to build links between documents in document sets which do not have pre-existing links (e.g., do not have the pre-existing hyperlinks of the World Wide Web, which have been laid down by millions of Web page authors). This technique of co-pending applications Ser. Nos. 11/227,495 and 11/349,235 is predicated on the idea that one cannot ask a computing device to build one-way recommendations from one document to another, but that one can get a computing device to give a reasonable measure of how related (similar) two documents are. Hence, in the approaches of co-pending applications Ser. Nos. 11/227,495 and 11/349,235, a similarity link may be established between documents D and E by a machine-implementable algorithm. This similarity link has two properties which distinguish the similarity link from a typical hyperlink: (i) the similarity link of co-pending applications Ser. Nos. 11/227,495 and 11/349,235 is (typically) symmetric—the similarity link ‘points both ways’; and (ii) the similarity link of co-pending applications Ser. Nos. 11/227,495 and 11/349,235 may have a weight which is equal to the similarity score s(D,E) which is calculated for a document pair (D,E). Thus, the similarity link of co-pending applications Ser. Nos. 11/227,495 and 11/349,235 may be viewed as a kind of ‘two-way recommendation’ between D and E: if one is interested in D, and s(D,E) is large, then one is likely also to be interested in E (and vice versa).
In the technique of co-pending applications Ser. Nos. 11/227,495 and 11/349,235, the set of similarity links forms a graph, which is called the similarity graph; and the weights may be stored in an N×N matrix termed the similarity matrix S. In co-pending applications Ser. Nos. 11/227,495 and 11/349,235, methods for link analysis which exploit the similarity matrix S are described. In particular, application Ser. No. 11/349,235 describes methods which involve finding the principal eigenvector of the entire similarity matrix S, and using this vector to give importance scores for documents. In contrast, each of application Ser. No. 11/227,495 and application Ser. No. 11/349,235 discloses finding the principal eigenvector for a subgraph S' of the similarity matrix, and using this vector to give importance scores for documents.
The scores obtained from the principal eigenvector of a graph with weighted, symmetric links (such as the similarity graph S or a subgraph S' of the similarity graph) are often called eigenvector centrality scores (abbreviated EVC). This term (eigenvector centrality or EVC) will also be used here to denote the scores obtained from the principal eigenvector of the similarity matrix (whole or sub). The term is illuminating, since the scores do measure a type of graph centrality—and they are obtained from an eigenvector.
In addition to employing similarity links, both application Ser. No. 11/227,495 and application Ser. No. 11/349,235 discuss combining information about the hyperlink structure (where hyperlinks exist) with information about the similarity links, giving a ‘hybrid’ method (i.e., one using both hyperlinks and similarity links). In each case, the structure of the network of similarity links is exploited to estimate (by assigning scores) the importance of documents.
When the entire similarity matrix is used (as described in application Ser. No. 11/227,495 or as in application Ser. No. 11/349,235), the resulting importance scores are search-independent. Using the entire similarity matrix tends to give the greatest importance to documents which are ‘central’ in terms of the whole graph—regardless of the topic of the search. Such documents tend to be rather ‘generic’; but they may be useful for some purposes. On the other hand, application Ser. No. 11/227,495 and application Ser. No. 11/349,235 also define a subgraph of the similarity graph, obtained from a hit list of a search. Thus, this subgraph is topic-focused. A document which is most central in this subgraph is thus a document which may be regarded as ‘central’ with respect to the search topic—but not necessarily with respect to other search topics. In short: importance scores from a subgraph, as determined by the methods described in application Ser. No. 11/227,495 and in application Ser. No. 11/349,235, are search-dependent.
Depending on circumstances, it is often desirable to use the similarity subgraph technique of application Ser. No. 11/227,495 or application Ser. No. 11/349,235, so that importance scores are evaluated relative to the topic of the search.
However, the subgraph methods of application Ser. No. 11/227,495 and of application Ser. No. 11/349,235 have a disadvantage when compared to the whole-graph methods of these same Applications. In the subgraph methods of application Ser. No. 11/227,495 and application Ser. No. 11/349,235, it is very difficult to compute ahead of time (and offline) the importance scores with regard to every possible subgraph which might be generated for every possible search topic. That is, these subgraph-based scores must be computed in real time, after a search request is presented to a search system. However, if the hit list is large, the computation of the topic-focused relevance scores from the hit-list subgraph may become very time consuming for real time applications.
In contrast to the topic-focused subgraph methods of application Ser. No. 11/227,495 and application Ser. No. 11/349,235, scores for the whole-graph methods of these same Applications need only be computed once—as they are search-independent. These whole-graph scores may thus computed offline, and then updated (also possibly offline) whenever the document set is deemed to have changed significantly. However, the size of the matrix for these whole-graph methods is always equal to the number N of documents. Hence the whole-graph methods can be computationally complex if N is very large.
Methods for scoring documents which rely on hyperlinks—such as the conventional PAGERANK and link popularity methods (and other conventional approaches using hyperlinks)—have, in addition to the previously identified shortcomings, one further fundamental limitation: these methods cannot be used unless the hyperlinks have already been laid down. That is, conventional PAGERANK and link popularity methods cannot be used in so-called ‘link-poor domains.’
PAGERANK and other conventional approaches directed to finding the principal eigenvector of a matrix which represents the entire document set also have the disadvantage that the required calculation is very time consuming when the document set is very large. Link popularity lacks this problem (having a time requirement equivalent to one iteration). But, as noted above, link popularity tends to give poorer results, and is subject to spamming.
As noted above, the whole-graph methods of application Ser. No. 11/227,495 and application Ser. No. 11/349,235 may be calculated offline. However these whole-graph methods have the disadvantage—in common with PAGERANK—that the matrix used can be very large if the document set is large, so that each iteration can be very time consuming.
The subgraph methods of application Ser. No. 11/227,495 and application Ser. No. 11/349,235 use a smaller matrix in general than does any whole-graph method. However, the importance scores from any subgraph method must be calculated in real time. The need for real time calculations can, in some cases, offset the advantage of the subgraph methods over the whole-graph methods of application Ser. No. 11/227,495 and of application Ser. No. 11/349,235 that accrues due to the smaller matrix involved when only a subgraph is used. For example, if the document set is large and the search is not tightly focused, then the hit list of the methods of application Ser. No. 11/227,495 and application Ser. No. 11/349,235 which defines the topic-focused subgraph may have millions of hits. In such cases, it is not desirable to perform—in real time, while the user waits—the many iterations required to obtain importance scores from finding the principal eigenvector of the subgraph, because the subgraph itself may be too large.
Thus, the present inventors have conceived of a new approach which addresses the deficiencies noted above relative to conventional link analysis and link popularity approaches, and that improves upon the methods of co-pending applications 10/687,602, 10/918,713, 11/227,495, and 11/349,235.