1. Field of Use
A computer implemented method and device for intelligent on-line document searching, similarity scoring and retrieval. In particular, an efficient method of calculating similarity scores between electronic documents. Also, a plurality of embodiments for combining a similarity graph and a hyperlink structure graph for ranking hit lists from searches over a set of electronic documents.
2. Description of the Background
The following discusses methods both for computing similarity scores of electronic documents and for performing link-based analysis of electronic documents.
Known methods for computing similarity scores of electronic documents (e.g., Mining the web—Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan Kaufmann Publishers, 2003) are commonly based on normalized word frequencies. Such document similarity scores can serve many purposes. However, finding similarity scores for all document pairs (i.e., calculating the entire similarity matrix) of a large document set is not part of the state of the art, since conventional (brute force) methods are not able to cope with the computational burden engendered by a large document set. As we show below, the present invention offers a method for determining the whole similarity matrix efficiently. In order to give a complete view, the full procedure to be used in the determination of a similarity matrix of a document set is described below. It should be emphasized that this procedure is a preview of the way that the calculation is in fact performed; hence some of the steps are prior art, and some are novel. However it is convenient to describe the current state of the similarity technology in terms of the entire procedure. Hence, below, each step (denoted as A, B, C, D, and E) will be discussed in order, including its status in terms of prior art or novelty.
Step A—Build a corpus of words. A word corpus consists of the words that are considered important in the analysis. These words are stored in a form independent of the documents the words occur in. Substeps include:
a. List all the unique words in all the documents considered.
b. Remove stop-words (unimportant words), etc.
c. Perform stemming to reduce the set of words admissible to the corpus.
d. Perform other possible operations on the word corpus in order to decrease the size of the corpus (number of words).
The preprocessing step of building the corpus, as described here, is well known to any practitioner in the field.
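By way of illustration only, substeps a through c might be sketched as follows. The stop-word list and the suffix-stripping function `crude_stem` are hypothetical stand-ins chosen for brevity; a practical system would use a fuller stop-word list and an established stemming algorithm.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_corpus(documents):
    """Return the sorted list of unique, stemmed, non-stop words."""
    words = set()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:       # substep b
                words.add(crude_stem(token))  # substeps a and c
    return sorted(words)

docs = ["The cats are sleeping", "A cat sleeps in the garden"]
corpus = build_corpus(docs)
print(corpus)
```

The resulting corpus is stored independently of the documents, so the same word list can index the description vectors of every document in the set.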
Step B—Build a document description vector. Two choices are: building the document's word frequency vector, or mapping the document to a set of concepts. Concepts may be represented as weighted collections of words, and in this sense the two choices are descriptions of the same document, represented in different bases. For example, for each document $D_i$ we can build a word number vector $\vec{N}(D_i)$. Each element of the word number vector is an integer counting the number of times the corresponding word in the corpus vector occurs in the document $D_i$. It is also possible to give extra weight to word occurrences that imply more importance for the word, for example occurrences in the document's title, in bold text or in italic text.
The various choices for Step B, as described here, constitute known technology.
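A minimal sketch of the word number vector of Step B is given below, assuming a corpus that is already a list of words and a document whose text needs no stemming. The parameter `title_weight`, which adds extra counts for each occurrence of a word in the title, is a hypothetical choice for illustration.

```python
import re
from collections import Counter

def word_number_vector(text, corpus, title="", title_weight=2):
    """Count corpus-word occurrences; title occurrences earn extra weight."""
    body_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    title_counts = Counter(re.findall(r"[a-z]+", title.lower()))
    # Counter returns 0 for words that never occur, so no key checks are needed.
    return [body_counts[w] + title_weight * title_counts[w] for w in corpus]

corpus = ["cat", "garden", "sleep"]
plain = word_number_vector("cat sleep cat", corpus)
weighted = word_number_vector("cat sleep cat", corpus, title="garden cat")
print(plain, weighted)
```

Each position in the returned list corresponds to the word at the same position in the corpus, so vectors from different documents can be compared element by element.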
Step C—Normalize the document description vector. This step is optional, and does not need to be performed on the document description vectors. However, normalization will keep documents with many words from overwhelming documents with fewer words. In the example with the word number vector $\vec{N}(D_i)$ of document $D_i$, this can be normalized with respect to the size of the document. This is done as follows:
\[ \vec{n}(D_i) = \frac{\vec{N}(D_i)}{\sum_j N_j(D_i)}, \]
giving rise to the (normalized) word frequency vector $\vec{n}(D_i)$. Here the sum in the denominator is over all words $j$; hence the denominator is simply the total number of word occurrences found in the document (not counting words not in the corpus). Step C as described here is known to practitioners of the art.
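The normalization of Step C can be sketched as follows: each count is divided by the total number of corpus-word occurrences in the document. The handling of a document containing no corpus words (returning an all-zero vector) is an assumption made here so that the sketch never divides by zero.

```python
def normalize(word_numbers):
    """Divide each count by the total count, so the entries sum to one."""
    total = sum(word_numbers)
    if total == 0:
        # Assumed convention: a document with no corpus words maps to zeros.
        return [0.0] * len(word_numbers)
    return [count / total for count in word_numbers]

frequencies = normalize([2, 0, 1, 1])
print(frequencies)
```

After this step a long document and a short document with the same proportions of words have the same word frequency vector.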
Step D—Calculate a Similarity Score. The state of the art includes a number of methods for quantifying the similarity between two documents. Here we give an example method. In this example, the similarity score between two documents A and B is calculated based on the two documents' normalized word-frequency vectors:
\[ s(A, B) = \sum_i \sqrt{n_i(A)\, n_i(B)}. \]
Other methods can be used to calculate a similarity score, but this formula has the following useful properties: $s(A, A) = 1$; $0 \le s(A, B) \le 1$; and $s(A, B) = s(B, A)$. The choice of formula presented here is the one disclosed in the present inventors' co-pending application Ser. No. 11/227,495, and for the purposes of this invention is the preferred method for calculating a similarity score.
The three steps (steps B, C, and D) following the preprocessing step A are prior art. Any practitioner in the field will have to build some version of a document description vector for the corpus words in each document, and also define a measure of similarity between two such document description vectors.
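The stated properties can be checked numerically with a short sketch. It assumes that the score is the sum over corpus words of the square root of the product of the two normalized frequencies, which is the form consistent with $s(A, A) = 1$ when each frequency vector sums to one; the example vectors are hypothetical.

```python
import math

def similarity(n_a, n_b):
    """Assumed score: sum over words of sqrt(n_i(A) * n_i(B))."""
    return sum(math.sqrt(a * b) for a, b in zip(n_a, n_b))

doc_a = [0.5, 0.5, 0.0]  # normalized word frequencies of document A
doc_b = [0.0, 0.5, 0.5]  # normalized word frequencies of document B

print(similarity(doc_a, doc_a))  # self-similarity
print(similarity(doc_a, doc_b))  # only the shared middle word contributes
```

Words that occur in only one of the two documents contribute nothing to the sum, so documents with no corpus words in common score exactly zero.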
Step E—Calculate a Similarity Matrix. This step is non-conventional, since to our knowledge no method has been presented for performing this determination efficiently. Based on the similarity scores calculated pair-wise among the documents, one can create a similarity matrix. Suppose we have a set of documents {Di}, and the number of documents is m. Then the m×m symmetric similarity matrix S based on the document set {Di} is:
\[
S = \begin{bmatrix}
1 & s(D_1, D_2) & s(D_1, D_3) & \cdots & s(D_1, D_m) \\
s(D_2, D_1) & 1 & s(D_2, D_3) & \cdots & s(D_2, D_m) \\
s(D_3, D_1) & s(D_3, D_2) & 1 & \cdots & s(D_3, D_m) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s(D_m, D_1) & s(D_m, D_2) & s(D_m, D_3) & \cdots & 1
\end{bmatrix}
\]
Step E is an extremely attractive goal in the field of document similarity computing, as it gives a global view of the textual relations among all documents in the document set. However, this goal is unattainable for large document sets, unless some good method for streamlining the calculation is found. That is, for large document sets, both the calculation time and the storage requirement grow as the square $m^2$ of the number of documents $m$. Hence, when the number of documents in the collection reaches millions or even billions, it is not feasible to calculate the similarity matrix using conventional methods. Thus, using known methods, it is practically impossible to use the information contained in the full similarity matrix, unless the document set is sufficiently small. We offer a solution to this problem, which is disclosed herein.
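For reference, the conventional (brute force) construction of the similarity matrix can be sketched as below. This is the costly baseline discussed above, not the efficient method of the invention. The sketch assumes normalized word frequency vectors and the square-root form of the similarity score, and the example vectors are hypothetical.

```python
import math

def similarity(n_a, n_b):
    """Assumed score: sum over words of sqrt(n_i(A) * n_i(B))."""
    return sum(math.sqrt(a * b) for a, b in zip(n_a, n_b))

def similarity_matrix(vectors):
    """Brute force: one score per document pair, m(m-1)/2 evaluations."""
    m = len(vectors)
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        S[i][i] = 1.0                   # s(D_i, D_i) = 1 for normalized vectors
        for j in range(i + 1, m):
            score = similarity(vectors[i], vectors[j])
            S[i][j] = S[j][i] = score   # the matrix is symmetric
    return S

vectors = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
S = similarity_matrix(vectors)
for row in S:
    print(row)
```

The nested loops make the quadratic growth explicit: both the number of score evaluations and the number of stored entries scale with the square of the number of documents.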
Known methods for machine evaluation of the ‘importance’ of electronic documents (e.g., conventional methods for ranking a hit list from a search over an interlinked document set) consist of two main activities:
1. Link analysis, in which the hyperlinked structure among the documents is analyzed to yield a link analysis score for the documents, based only on how they are positioned in the network which is formed by the links between the documents.
2. Text analysis, in which each individual document is analyzed with regard to its textual relevance to the supplied search keywords, to produce a text analysis score.
Conventional methods then combine the two scores (i.e., the link analysis score and the text analysis score) into one net score, which is used for ranking the documents.
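The manner in which the two scores are combined is not specified here. Purely as an illustration, the sketch below assumes a weighted sum; the parameter `weight` and the example scores are hypothetical.

```python
def rank_hits(link_score, text_score, weight=0.5):
    """Rank documents by an assumed weighted sum of the two scores."""
    net = {
        doc: weight * link_score[doc] + (1.0 - weight) * text_score[doc]
        for doc in text_score
    }
    # Highest net score first.
    return sorted(net, key=net.get, reverse=True)

link_score = {"a": 0.2, "b": 0.7, "c": 0.1}
text_score = {"a": 0.9, "b": 0.3, "c": 0.4}
print(rank_hits(link_score, text_score))
print(rank_hits(link_score, text_score, weight=1.0))  # link analysis only
```

Setting `weight` to 1.0 or 0.0 reduces the ranking to pure link analysis or pure text analysis respectively, which makes the contribution of each score easy to inspect.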
Deficiencies with conventional methods for calculating similarity scores among a set of documents, and deficiencies with conventional methods for ranking hit lists from searches over an interlinked document set, are discussed below.
As noted above, known methods for calculating similarity scores among a set of documents are very computationally intensive. In order to calculate the whole similarity matrix, one will need on the order of $m^2/2$ similarity computations, where $m$ is the number of documents. This becomes a very daunting task when the number of documents in the document set reaches millions or even billions. Because many of these similarity scores will be zero (or very small), a lot of computational time is wasted on calculating very small numbers (including zeros). This is clearly highly undesirable.
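The scale of the problem can be made concrete with a short calculation. The number of distinct document pairs is $m(m-1)/2$, which is approximately $m^2/2$ for large $m$; the document-set sizes below are chosen only as examples.

```python
def pair_count(m):
    """Number of distinct document pairs: m(m-1)/2, roughly m^2/2."""
    return m * (m - 1) // 2

for m in (1_000, 1_000_000, 1_000_000_000):
    print(f"{m:>13,} documents -> {pair_count(m):,} pairwise scores")
```

For one million documents the count is already close to five hundred billion pairwise scores, most of which may be zero or very small.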
As an alternative to calculating the entire similarity matrix, one can choose to calculate similarities with respect to a single document of interest (hence calculating only one row of the matrix). This approach can be useful when one wishes to find documents which are similar to a given “working” document; and it gives a large saving in computational burden. However, this calculation must then be done in real time, when a suitable document of interest is chosen by the user. Also, for a given document, one is mostly interested in only those documents which are most similar to the given document. Hence even in this case it would be of great benefit to be able to avoid calculating many small or zero similarity scores with respect to the given document. In the absence of a method for avoiding the calculation of these small scores, one uses known methods to find the similarity of all the other documents to the given document, in order to be sure that no highly similar documents have been overlooked. In short: (i) finding only similarities with respect to a working document can be useful for some purposes, but not for similarity-based link analysis; (ii) even when only this one row of the matrix is needed, it is useful to find efficient ways for only calculating the largest similarity scores, and avoiding calculation of small or zero scores.
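The single-row calculation described above can be sketched as follows. Consistent with the known methods, the sketch scores every document against the working document and only afterwards keeps the `k` most similar ones. The square-root form of the score, the function names and the example vectors are assumptions.

```python
import heapq
import math

def similarity(n_a, n_b):
    """Assumed score: sum over words of sqrt(n_i(A) * n_i(B))."""
    return sum(math.sqrt(a * b) for a, b in zip(n_a, n_b))

def most_similar(working, vectors, k):
    """Score all documents against the working one; return the k best."""
    scored = ((similarity(working, v), index) for index, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, document index)

vectors = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
top = most_similar(vectors[0], vectors, k=2)
print(top)
```

The work per query is linear in the number of documents, yet every score is still evaluated, including those that turn out to be small or zero.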
Deficiencies associated with conventional methods for ranking hit lists from searches over an interlinked document set become clear when one looks carefully at how link analysis and text relevance are combined in the ranking of the search hits.
Link analysis can be performed in essentially two different ways: whole-graph and sub-graph. These two approaches are discussed below.
Whole-graph link analysis means that each document is scored depending on the intrinsic link structure among all the documents in the document set. For example, the search engine Google uses a whole-graph-based version of link analysis (PageRank, U.S. Pat. No. 6,285,999, the contents of which are incorporated herein by reference) for scoring web pages. This way of performing link analysis is independent of the keywords supplied in a search for any of these documents. The scores of all the documents can thus be calculated off-line, independently of the users' activity in searching for information.
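To illustrate that whole-graph scores depend only on the link structure, the following sketch applies a generic textbook damped random-walk iteration to a small hypothetical graph. It is not asserted to reproduce the method of the cited patent; the damping value, the iteration count and the treatment of documents without outgoing links are assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links maps each document to the list of documents it links to.

    Every link target is assumed to appear as a key of `links`.
    """
    docs = sorted(links)
    n = len(docs)
    score = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            # Assumed convention: a document with no out-links
            # spreads its score evenly over all documents.
            targets = links[d] or docs
            share = damping * score[d] / len(targets)
            for t in targets:
                new[t] += share
        score = new
    return score

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(links)
print(max(scores, key=scores.get))
```

No search keywords enter the calculation, which is why such scores can be computed off-line for the whole document set.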
Another way of performing link analysis is to restrict the link analysis to a subgraph of the document graph. Here, by subgraph, we simply mean a subset of the documents, and all links between the documents in this subset. There are many possible ways of defining such subgraphs. Most typically, the subgraph is defined by the keywords used in a search query, such that only the documents containing the search keywords are considered (along with the links among this subset of documents). Since the link analysis in this case is dependent on the keywords, the ranking of the documents has to be performed on the fly when the actual search is performed. As with whole-graph link analysis, it is the network context of a document that decides the score obtained through link analysis; and there is no explicit recourse to text relevance, other than using the keywords to define the subgraph.
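The keyword-defined subgraph can be sketched as follows: only documents containing every search keyword are kept, together with the links among them. The whole-word matching rule, the function name and the example data are hypothetical.

```python
def keyword_subgraph(texts, links, keywords):
    """Keep documents containing all keywords, and links among them only."""
    wanted = [k.lower() for k in keywords]
    hits = {
        doc for doc, text in texts.items()
        if all(k in text.lower().split() for k in wanted)
    }
    # Drop any link whose target lies outside the subset.
    return {
        doc: [t for t in links.get(doc, []) if t in hits]
        for doc in sorted(hits)
    }

texts = {"a": "cat garden", "b": "cat sleep", "c": "dog park"}
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
subgraph = keyword_subgraph(texts, links, ["cat"])
print(subgraph)
```

Because the subset changes with every query, any link analysis on this subgraph has to be carried out at search time.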
Text analysis, on the other hand, does not consider a document's network context. Text analysis is an assessment of the relevance of a document, using the text in the document, as evaluated with respect to the search query keywords. Good text relevance analysis is difficult to achieve using a machine—it involves asking a machine to estimate the relevance and/or quality of a given document, with respect to a given set of keywords. Hence state-of-the-art search engines also use link analysis.
Similarity link analysis may be viewed as having elements of both text analysis and link analysis. That is, the similarity of two documents depends obviously on their text; and yet it is a property of pairs of documents, and so introduces some sense of the context of a document. However, a major shortcoming of the conventional similarity analysis technology (apart from the present inventors' co-pending application Ser. No. 11/227,495) is the lack of any application of link analysis to the similarity matrix. Current link analysis methods depend entirely on pre-existing hyperlinks.
In the present inventors' co-pending application Ser. No. 11/227,495 the use of the full similarity matrix is considered. Upon considering the computational issues discussed above, the present inventors have discovered an efficient new way of calculating the whole similarity matrix. This efficient method for obtaining the full document similarity matrix renders possible the use of the entire similarity matrix for the purpose of scoring documents for ranking purposes—even when the document set is very large.
State of the art link analysis is performed on the intrinsic hyperlink matrix, whereas the present inventors' co-pending application Ser. No. 11/227,495 allows for a combination of the hyperlink matrix and the document similarity matrix. The present invention offers a number of additional novel approaches for combining the hyperlink matrix (whole graph or subgraph), together with the document similarity matrix (whole graph or subgraph). These methods are not mentioned in the present inventors' co-pending application Ser. No. 11/227,495 or in any conventional art.