The present invention relates to a similar document retrieving apparatus which designates, as a typical example, one or more document data from a document database (i.e., a set or assembly of document data) that is electronically stored as strings of character codes and is machine processible, or designates an arbitrary sentence not contained in this database. The similar document retrieving apparatus retrieves one or more documents similar to the designated typical example from the document database. Furthermore, the present invention relates to a relevant keyword extracting apparatus which extracts one or more keywords relating to the “typical example” from the document database. The relevant keyword extracting apparatus presents the extracted keywords to the users of this document database as an aid for comprehension of the retrieved document contents, or as a hint for preferable retrieval conditions (i.e., queries). In particular, the present invention makes it possible to perform highly accurate document retrieval and keyword extraction.
Due to the recent spread of word processors and personal computers, of large-capacity, low-cost storage media such as CD-ROM and DVD-ROM, and of networks such as Ethernet, all or most documents and character information can practically be stored as strings of character codes in a full text database. Such databases are now widely used.
According to a conventional full text database, a Boolean expression of keywords is generally designated as the query when retrieving documents. Each document is checked for whether or not the designated keywords appear in it, and the set of documents satisfying the Boolean expression is obtained as the retrieval result.
Recently, a so-called document ranking technique has been introduced and put to practical use. According to this ranking technique, the relevancy between each document in the obtained document set and the retrieval conditions (i.e., queries) is obtained according to the so-called “tf·idf” method or the like. Then, the documents are ranked in order of relevancy and are presented to the users.
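The ranking idea can be illustrated with a minimal sketch. The weighting used here (raw term frequency multiplied by log(N/df)) is only one common tf·idf variant and is an assumption; the function name `tfidf_rank` and the toy documents are hypothetical.

```python
import math

def tfidf_rank(docs, query):
    """Rank documents by a simple tf-idf score against the query terms."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    scores = []
    for idx, tokens in enumerate(tokenized):
        score = 0.0
        for term in query:
            tf = tokens.count(term)                      # term frequency in this document
            df = sum(1 for t in tokenized if term in t)  # documents containing the term
            if tf and df:
                score += tf * math.log(n_docs / df)      # idf = log(N / df), one common variant
        scores.append((score, idx))
    # Highest score first; ties are broken by document index.
    return [idx for score, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]

docs = [
    "latent semantic indexing of documents",
    "keyword retrieval from a document database",
    "semantic retrieval retrieval of keyword sets",
]
ranking = tfidf_rank(docs, ["retrieval", "semantic"])
```

In this toy data the third document contains both query terms and therefore ranks first.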
However, this conventional full text database system is disadvantageous in the following points.
(1) When no appropriate keywords come to mind or can be found, it is difficult to designate appropriate retrieval conditions (i.e., queries).
(2) Describing a complicated Boolean expression requires considerable skill and time.
(3) Because of the synonymy problem, an intended document may fail to be retrieved.
In view of these problems, research and development for a similar document retrieving system or a relevant keyword extracting system has recently become vigorous so as to effectively retrieve documents similar to a designated typical example or to extract and display relevant keywords relating to the designated documents or word set.
U.S. Pat. No. 4,839,853 discloses a conventional method for retrieving similar documents, which is called the LSI (latent semantic indexing) method.
To make clear the difference between the present invention and the LSI method, the gist of the LSI method will be explained.
When applied to a document database D containing N document data, the LSI method mechanically extracts keywords, i.e., characteristic words representing each document, and records the frequency of occurrence (i.e., the number of times) of each keyword appearing in each document. It is now assumed that a total of M kinds of keywords are extracted from the document database D.
Extracted keywords are arranged in dictionary order or another appropriate order. Then, the frequency-of-occurrence fdt of the t-th keyword in the d-th document is expressed as the element in the d-th line and t-th row of a matrix F. Then, through a matrix operation called incomplete singular value decomposition, this matrix F is approximately decomposed into a product of a matrix U of N lines and K rows having a document-side singular vector in each row, a diagonal matrix Λ of K lines and K rows having singular values aligned as diagonal elements, and a matrix V of K lines and M rows having a keyword-side singular vector in each line. In this case, K is sufficiently small compared with N and M. As a result, the original frequency-of-occurrence matrix F can be approximately expressed by a lower-rank matrix.
A total of K document-side singular vectors are obtained through the above decomposition. Thus, a feature vector Ud of the document d is obtained as a K-dimensional vector containing respective d-th components of the obtained K document-side singular vectors. Similarly, a total of K keyword-side singular vectors are obtained through the above decomposition. Thus, a feature vector Vt of the keyword t is obtained as a K-dimensional vector containing respective t-th components of the obtained K keyword-side singular vectors.
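The decomposition and the resulting feature vectors can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation of the cited patent: the toy matrix, the choice K = 2, and the variable names are hypothetical, and a full SVD followed by truncation stands in for the incomplete singular value decomposition. The "lines" and "rows" of the text correspond to the first and second array axes.

```python
import numpy as np

# Toy frequency-of-occurrence matrix F: N = 4 documents (lines), M = 5 keywords (rows).
F = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

K = 2  # number of singular values kept (K is much smaller than N and M in practice)

# Full SVD, then truncation to the K largest singular values.
U_full, s_full, Vh_full = np.linalg.svd(F, full_matrices=False)
U = U_full[:, :K]          # N x K, one document-side singular vector per column
Lam = np.diag(s_full[:K])  # K x K diagonal matrix of singular values
V = Vh_full[:K, :]         # K x M, one keyword-side singular vector per line

F_approx = U @ Lam @ V     # rank-K approximation of F

doc_features = U           # doc_features[d] is the K-dimensional vector Ud
kw_features = V.T          # kw_features[t] is the K-dimensional vector Vt
```

The approximation error of `F_approx` is determined by the discarded singular values, which is why a small K can still capture most of the structure of F.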
Subsequently, calculation of similarity and relevancy is performed according to the following three procedures so as to obtain documents and keywords having higher similarities and relevancies, thereby realizing the similar document retrieval and the relevant keyword extraction.
(1) The similarity between two documents a and b is obtained by calculating an inner product Ua·Ub between the document feature vectors Ua and Ub of these documents a and b.
(2) The relevancy between two keywords Ka and Kb is obtained by calculating an inner product Va·Vb between two keyword feature vectors Va and Vb of these keywords Ka and Kb.
(3) The keyword extraction result from an arbitrary (external) document is represented by an M-dimensional vector E having components representing the frequency-of-occurrence values of the M keywords appearing in this document. A retrieval condition document feature vector Ue corresponding to this external document is given by the expression Ue=Λ⁻¹VE. Then, the similarity between this external document and the document d in the document database is obtained as the inner product Ud·Ue. The above-described procedures are the fundamental framework of the LSI method.
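The three procedures above can be sketched as follows, assuming NumPy and a hypothetical toy matrix. The helper names `doc_similarity`, `keyword_relevancy` and `fold_in` are illustrative only; `fold_in` evaluates Ue = Λ⁻¹VE by dividing the components of VE by the singular values.

```python
import numpy as np

F = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)
K = 2

U_full, s, Vh = np.linalg.svd(F, full_matrices=False)
U, lam, V = U_full[:, :K], s[:K], Vh[:K, :]   # N x K, K singular values, K x M

def doc_similarity(a, b):
    """Procedure (1): inner product Ua . Ub of two document feature vectors."""
    return float(U[a] @ U[b])

def keyword_relevancy(ta, tb):
    """Procedure (2): inner product Va . Vb of two keyword feature vectors."""
    return float(V[:, ta] @ V[:, tb])

def fold_in(E):
    """Procedure (3): Ue = Lambda^{-1} V E for an M-dimensional count vector E."""
    return (V @ E) / lam

# Hypothetical external document containing keywords 0 and 1 once each.
E = np.array([1, 1, 0, 0, 0], dtype=float)
Ue = fold_in(E)
scores = U @ Ue                  # similarity Ud . Ue for every document d
best = int(np.argmax(scores))    # index of the most similar database document
```

A useful sanity check of the fold-in expression is that applying it to the count vector of a document already in the database reproduces that document's own feature vector, e.g. `fold_in(F[0])` agrees with `U[0]`.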
However, if the keyword frequency-of-occurrence fdt is directly used in the application of the LSI method to an actual document database, the obtained feature vectors will be biased by the presence of longer documents or frequently appearing keywords. This will significantly worsen the accuracy of similar document retrieval.
Hence, the LTC method conventionally used for relevance ranking in document retrieving systems, or a comparable method, is introduced to convert or normalize the keyword frequency-of-occurrence fdt. Then, a frequency-of-occurrence matrix F is created so as to contain the normalized frequency-of-occurrence values. Then, the incomplete singular value decomposition is performed to obtain a feature vector.
For example, according to the LTC conversion, the following equation is used to calculate a converted frequency-of-occurrence LTC(fdt) based on the actual frequency-of-occurrence fdt and the number nt of documents containing the keyword t. A matrix containing these values is subjected to the incomplete singular value decomposition.

$$\mathrm{LTC}(f_{dt}) = \frac{\left(1+\log_2 f_{dt}\right)\,\log_2\!\left(1+\dfrac{N}{n_t}\right)}{\sqrt{\displaystyle\sum_{j}\left\{\left(1+\log_2 f_{dj}\right)\,\log_2\!\left(1+\dfrac{N}{n_j}\right)\right\}^{2}}} \tag{1}$$
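A minimal sketch of the LTC conversion of equation 1 in plain Python follows. The function name `ltc_matrix` is hypothetical, and the treatment of zero counts (a keyword that does not occur in a document contributes 0, since log2 of 0 is undefined) is an assumption not stated in the text.

```python
import math

def ltc_matrix(F):
    """Convert raw counts f_dt to LTC values, one document (line) at a time."""
    n_docs = len(F)
    n_kw = len(F[0])
    # n[t]: number of documents containing keyword t
    n = [sum(1 for d in range(n_docs) if F[d][t] > 0) for t in range(n_kw)]
    converted = []
    for row in F:
        # Numerator of equation 1; zero counts are mapped to 0 (assumption).
        raw = [
            (1 + math.log2(f)) * math.log2(1 + n_docs / n[t]) if f > 0 else 0.0
            for t, f in enumerate(row)
        ]
        # Denominator of equation 1: Euclidean norm over the keywords j of this document.
        norm = math.sqrt(sum(x * x for x in raw))
        converted.append([x / norm if norm else 0.0 for x in raw])
    return converted

F = [[1, 0],
     [1, 2]]
ltc = ltc_matrix(F)
```

Here keyword 0 occurs in both documents and keyword 1 in only one, so in the second document the rarer keyword 1 receives the larger converted value.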
However, the conversion of keyword frequency-of-occurrence by the conventional LSI method causes the following problems.
Analysis according to the LSI method is performed on the assumption that the d-th line of the matrix F represents the feature of document d and the t-th row of the matrix F represents the feature of keyword t. With the conversion of equation 1, the square-sum of the elements of each line can be normalized to 1. However, the square-sum of the elements of each row cannot be normalized to 1 at the same time. Accordingly, the performed conversion becomes asymmetric between the document side and the keyword side. Thus, the simple conversion using the above equation 1 cannot normalize both the document side and the keyword side to 1. Such asymmetry is also found in conversions using other equations.
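The asymmetry can be checked numerically with a small sketch. The function `ltc_rows` is a hypothetical, compact restatement of the LTC conversion (zero counts are assumed to stay zero), and the counts are toy values. With 3 documents and 4 keywords, the line square-sums total 3, so the 4 row square-sums cannot all equal 1.

```python
import math

def ltc_rows(F):
    """Apply an LTC-style conversion line by line; zero counts stay zero (assumed)."""
    N, M = len(F), len(F[0])
    n = [sum(1 for d in range(N) if F[d][t] > 0) for t in range(M)]
    out = []
    for row in F:
        raw = [(1 + math.log2(f)) * math.log2(1 + N / n[t]) if f > 0 else 0.0
               for t, f in enumerate(row)]
        norm = math.sqrt(sum(x * x for x in raw))
        out.append([x / norm for x in raw])
    return out

# 3 documents (lines) x 4 keywords (rows); toy counts.
F = [[3, 1, 0, 0],
     [1, 0, 2, 0],
     [0, 4, 1, 5]]
G = ltc_rows(F)

# Document side: square-sum of each line of the converted matrix.
line_sq = [sum(x * x for x in row) for row in G]
# Keyword side: square-sum of each row (column) of the converted matrix.
row_sq = [sum(G[d][t] ** 2 for d in range(len(G))) for t in range(len(G[0]))]
```

Every entry of `line_sq` is 1 up to rounding, while the entries of `row_sq` differ from keyword to keyword.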
Furthermore, when a logarithmic function or other nonlinear function is used in the conversion as shown in equation 1, the feature of a certain document d is not identical with the feature of a document d′ consisting of two successive copies of document d. Therefore, the similarity between the document d and the document d′ is not equal to 1. Similarly, when two keywords t1 and t2 are identical in frequency-of-occurrence as well as in meaning, the frequency-of-occurrence matrix obtained by treating the two keywords t1 and t2 as the same keyword does not agree with the original frequency-of-occurrence matrix.
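The effect of the nonlinear conversion on a doubled document can be seen in a small sketch. As a simplifying assumption, the keyword-dependent factor log2(1 + N/nt) is held fixed at the hypothetical value 1 for both keywords, so that only the logarithmic term changes when the counts double; the function name is illustrative.

```python
import math

def log_tf_unit_vector(freqs, idf):
    """Log-scaled, length-normalized vector in the style of equation 1."""
    raw = [(1 + math.log2(f)) * w if f > 0 else 0.0 for f, w in zip(freqs, idf)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

idf = [1.0, 1.0]               # hypothetical keyword factors, held fixed
d = [1, 4]                     # counts of two keywords in document d
d_twice = [2 * f for f in d]   # the same document repeated twice

u = log_tf_unit_vector(d, idf)
v = log_tf_unit_vector(d_twice, idf)
similarity = sum(a * b for a, b in zip(u, v))
```

Although `d_twice` has exactly the same keyword proportions as `d`, the unnormalized vectors are (1, 3) and (2, 4), which are not parallel, so `similarity` is about 0.99 rather than 1.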
The above-described asymmetry, and the above-described instability of the document similarity or the keyword relevancy under merging of documents or keywords, cause the following phenomena when a large-scale document database is processed.
(1) In the retrieving and extracting operation on the non-normalized side (i.e., the keyword side in many cases), items with large norms (i.e., large square-sums of elements of F) are predominantly retrieved or extracted.
(2) When a document retrieval is performed with a keyword set, only certain keywords have very strong effects and the others are almost neglected.
Consequently, the obtained retrieval result is far from the intent of the retrieval, and the accuracy of retrieval is greatly worsened.
To solve the above-described problems of the prior art, the present invention has an object of providing a similar document retrieving apparatus and a relevant keyword extracting apparatus which can normalize both the document side and the keyword side and maintain higher retrieving accuracy.
To accomplish the above and other related objects, the present invention provides a first similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition (i.e., query) consisting of a document group including at least one document x1, ..., xr selected from the document database D and for retrieving documents similar to the document group of the retrieval condition from the document database D. The first similar document retrieving apparatus of this invention comprises: keyword frequency-of-occurrence calculating means for calculating keyword frequency-of-occurrence data F which represents a frequency-of-occurrence fdt of each keyword t appearing in each document d stored in the document database D; document length calculating means for calculating document length data L which represents a length ld of each document d; keyword weight calculating means for calculating keyword weight data W which represents a weight wt of each keyword t of the M kinds of keywords appearing in the document database D; document profile vector producing means for producing an M-dimensional document profile vector Pd having components respectively representing a relative frequency-of-occurrence pdt of each keyword t in the concerned document d; document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in the document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to the document profile vector Pd for each document d; and similar document retrieving means for receiving the retrieval condition consisting of the document group including at least one document x1, ..., xr selected from the document database D, calculating a similarity between each document d and the retrieval condition based on a document feature vector of the received document group and the document feature vector of each document d in the document database D, and outputting a designated number of similar documents in order of the calculated similarity.
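The data flow of the first similar document retrieving apparatus can be sketched as follows. This passage does not give the formulas, so the sketch rests on several assumptions that should not be read as the claimed method: pdt is taken as fdt divided by the total keyword count of document d, a plain unweighted principal component analysis replaces the analysis using document length and keyword weight, the retrieval condition is represented by the mean feature vector of the designated documents, and cosine similarity is used. All function names are hypothetical.

```python
import numpy as np

def document_profiles(F):
    """Relative frequency-of-occurrence p_dt = f_dt / sum_t f_dt (assumed definition)."""
    return F / F.sum(axis=1, keepdims=True)

def pca_features(P, K):
    """Project profile vectors onto the top-K principal axes (plain, unweighted PCA)."""
    centered = P - P.mean(axis=0)
    _, _, Vh = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vh[:K].T            # N x K document feature vectors

def retrieve_similar(features, query_docs, top_n):
    """Rank documents by cosine similarity to the mean feature vector of the query group."""
    q = features[query_docs].mean(axis=0)
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(q)
    sims = features @ q / np.where(norms > 0, norms, 1.0)
    order = np.argsort(-sims)             # highest similarity first
    return [int(d) for d in order[:top_n]]

F = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

features = pca_features(document_profiles(F), K=2)
result = retrieve_similar(features, query_docs=[0], top_n=2)
```

With document 0 as the retrieval condition, the designated document itself is returned first, followed by document 1, which shares its keywords.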
Furthermore, the present invention provides a second similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition (i.e., query) consisting of a keyword group including at least one keyword y1, ..., ys selected from the document database D and for retrieving documents relevant to the retrieval condition from the document database D. In addition to the above-described keyword frequency-of-occurrence calculating means, the document length calculating means, the keyword weight calculating means, and the document profile vector producing means, the second similar document retrieving apparatus of this invention comprises: keyword profile vector calculating means for calculating an N-dimensional keyword profile vector Qt having components respectively representing a relative frequency-of-occurrence qdt of the concerned keyword t in each document d; document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in the document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to the document profile vector Pd for each document d; keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in the document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to the keyword profile vector Qt for each keyword t, the keyword feature vector having the same dimension as that of the document feature vector, as well as for obtaining a keyword contribution factor (i.e., eigenvalue of a correlation matrix) θj of each dimension j; retrieval condition feature vector calculating means for receiving the retrieval condition (i.e., query) consisting of the keyword group including at least one keyword y1, ..., ys, and for calculating a retrieval condition feature vector corresponding to the retrieval condition (i.e., query) based on the keyword weight data of the received keyword group, the keyword feature vector and the keyword contribution factor; and similar document retrieving means for calculating a similarity between each document d and the retrieval condition based on the calculated retrieval condition feature vector and a document feature vector of each document d, and outputting a designated number of similar documents in order of the calculated similarity.
Furthermore, the present invention provides a first relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a keyword group including at least one keyword y1, ..., ys selected from the document database D and for extracting keywords relevant to the keyword group of the extracting condition from the document database D. In addition to the above-described keyword frequency-of-occurrence calculating means, the document length calculating means, and the keyword weight calculating means, the first relevant keyword extracting apparatus of this invention comprises: keyword profile vector calculating means for calculating an N-dimensional keyword profile vector Qt having components respectively representing a relative frequency-of-occurrence qdt of the concerned keyword t in each document d; keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in the document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to the keyword profile vector Qt for each keyword t; and relevant keyword extracting means for receiving the extracting condition consisting of the keyword group including at least one keyword y1, ..., ys selected from the document database D, calculating a relevancy between each keyword t and the extracting condition based on a keyword feature vector of the received keyword group and the keyword feature vector of each keyword t in the document database D, and outputting a designated number of relevant keywords in order of the calculated relevancy.
Furthermore, the present invention provides a second relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a document group including at least one document x1, ..., xr selected from the document database D and for extracting keywords relevant to the document group of the extracting condition from the document database D. In addition to the above-described keyword frequency-of-occurrence calculating means, the document length calculating means, the keyword weight calculating means, the document profile vector producing means, and the keyword profile vector calculating means, the second relevant keyword extracting apparatus of this invention comprises: document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in the document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to the document profile vector Pd for each document d as well as for obtaining a document contribution factor (i.e., eigenvalue of a correlation matrix) λj of each dimension j; keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in the document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to the keyword profile vector Qt for each keyword t, the keyword feature vector having the same dimension as that of the document feature vector; extracting condition feature vector calculating means for receiving the extracting condition consisting of the document group including at least one document x1, ..., xr, and for calculating an extracting condition feature vector corresponding to the extracting condition based on the document length data of the received document group, the document feature vector and the document contribution factor; and relevant keyword extracting means for calculating a relevancy between each keyword t and the extracting condition based on the calculated extracting condition feature vector and a keyword feature vector of each keyword t, and outputting a designated number of relevant keywords in order of the calculated relevancy.
According to the similar document retrieving apparatus and the relevant keyword extracting apparatus of the present invention, the frequency-of-occurrence of each keyword in a concerned document is expressed as a document profile vector, and the frequency-of-occurrence of a concerned keyword in each document is expressed as a keyword profile vector. A weighted principal component analysis considering the document length and the keyword weight is performed independently on each to obtain both a document feature vector and a keyword feature vector.
In this case, the vector representation in the document profile and in the keyword profile is not dependent on the conversion (i.e., normalization) of frequency-of-occurrence. The document length data and the keyword weight data, relevant to the conversion of frequency-of-occurrence, are indirectly reflected as the weight in the principal component analysis. Thus, it becomes possible to perform the normalization without depending on the conversion of frequency-of-occurrence.
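Why placing the document length in the weights, rather than in a frequency conversion, gives stable results can be illustrated with a small sketch. It assumes, beyond what this passage states, that the document length ld is the total keyword count of the document, that the profile is the count vector divided by ld, and that each document enters the analysis with a weight proportional to ld. The function name `weighted_covariance` is hypothetical, and keyword weights are omitted for brevity.

```python
import numpy as np

def weighted_covariance(F):
    """Length-weighted covariance of relative-frequency profile vectors."""
    lengths = F.sum(axis=1)            # assumed document length l_d
    P = F / lengths[:, None]           # profile vectors p_dt = f_dt / l_d
    w = lengths / lengths.sum()        # weights proportional to document length
    mean = w @ P                       # weighted mean profile
    centered = P - mean
    return (centered * w[:, None]).T @ centered

d = np.array([1.0, 4.0, 0.0])          # keyword counts of document d
e = np.array([2.0, 0.0, 3.0])          # keyword counts of another document

separate = np.vstack([d, d, e])        # d stored twice as separate documents
merged = np.vstack([2 * d, e])         # the two copies merged into one document

same_profile = np.allclose(d / d.sum(), (2 * d) / (2 * d).sum())
same_covariance = np.allclose(weighted_covariance(separate),
                              weighted_covariance(merged))
```

Under these assumptions the profile of the merged document equals the profile of d, and the weighted covariance matrix, and hence any principal axes derived from it, is the same whether the two copies are stored separately or merged.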
As a result, the present invention makes it possible to provide the similar document retrieving apparatus and the relevant keyword extracting apparatus which are highly accurate.