The present invention relates to a similar document searching system to search for a document similar to a specified document, and in particular, to a searching system, a searching method, and a program for processing the searching method efficiently applicable to a document including compound words each of which includes a plurality of words.
To increase the efficiency and quality of business in an organization, demand is growing for a knowledge management system in which the knowledge of the members of the organization is shared among the members for reuse.
Particularly, in a knowledge management system for use in a firm, there is an increasing desire to document the experiences, know-how, and the like of experts so that they can be shared and used through the resulting documents. A high-precision search or retrieval function that simply and appropriately searches the large amount of knowledge accumulated in various forms in the organization of the firm for the information desired by the user is quite important in the knowledge management system.
A similar document search technique that satisfies this requirement has attracted attention. In this technique, the user presents an example of a document (to be referred to as a seeds document or a query document hereinbelow) including the contents desired by the user, and a document similar to that document is retrieved.
A similar document searching method has been described, for example, on pages 363 to 376 of “Ranking Algorithms” by Donna Harman, Section 14 of “Information Retrieval” written by William B. Frakes and published by Prentice Hall PTR (1992). This technique (to be referred to as prior art technique 1 hereinbelow) uses a vector (to be referred to as a characteristic vector hereinbelow) whose elements are the term appearance frequency, or term frequency, of each word (to be referred to as a characteristic word hereinbelow) appearing in a document, and calculates similarity between documents according to the characteristic vectors.
An outline of prior art technique 1 is as follows. When a document is registered to a document database, a term frequency of a characteristic word included in the document to be registered is created as a characteristic vector (to be referred to as a registration document characteristic vector hereinbelow) of the registration document in advance.
To retrieve a similar document, a cosine of an angle in a vector space between a characteristic vector (to be referred to as a seeds document characteristic vector) of a seeds document specified as a retrieval condition and each registration document characteristic vector is calculated as similarity between the documents.
FIG. 20 shows an example of a processing procedure in prior art technique 1.
First, in step 200, a check is made to determine whether document registration processing or similar document search processing is to be executed. If the document registration processing is to be executed, the program executes step 210 to generate a registration document characteristic vector. That is, a registration document characteristic vector is created for the document to be registered.
If step 200 determines to execute the similar document search processing, the program executes step 220 to generate a seeds document characteristic vector for a seeds document specified as a retrieval condition.
Next, in step 221, step 222 to calculate similarity is repeatedly executed for all registration documents. That is, a cosine of an angle between the seeds document characteristic vector and each registration document characteristic vector in the vector space is calculated as similarity between the documents.
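The flow of FIG. 20 described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of prior art technique 1: the class name `SimilarDocumentIndex`, the function `cosine_similarity`, and the use of a Python dictionary mapping each characteristic word to its term frequency are all assumptions made for the example, and the characteristic vectors are assumed to have been generated already by the caller.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors held as dicts."""
    dot = sum(freq * b.get(word, 0) for word, freq in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


class SimilarDocumentIndex:
    """Hypothetical in-memory stand-in for the document database."""

    def __init__(self):
        # document name -> registration document characteristic vector
        self.registered = {}

    def register(self, name, vector):
        # Corresponds to step 210: keep the vector of the registered document.
        self.registered[name] = dict(vector)

    def search(self, seed_vector):
        # Corresponds to steps 221 and 222: score every registered document
        # against the seeds document characteristic vector.
        scores = [
            (name, cosine_similarity(seed_vector, vector))
            for name, vector in self.registered.items()
        ]
        # Sorting by descending similarity is a presentation choice of this sketch.
        return sorted(scores, key=lambda item: item[1], reverse=True)
```

A caller would invoke `register` for a registration request and `search` for a retrieval request, which plays the role of the check in step 200.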
FIG. 21 shows an example of the characteristic vector generation processing in prior art technique 1.
In this processing, the program first reads a document to be used to create a characteristic vector in step 301. In step 302, the program extracts each characteristic word from the document read in step 301.
In step 303, a term frequency is calculated for each characteristic word extracted in step 302. Finally, in step 304, the characteristic words extracted in step 302 and the term frequency calculated for each characteristic word in step 303 are stored as elements of the characteristic vector. The processing procedure of prior art technique 1 has been described.
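Steps 302 to 304 can be illustrated with a short sketch. The word extraction rule used here (lowercased runs of letters and digits, with no stop-word removal) is purely an assumption for the example; the source does not specify how characteristic words are extracted, and the function names are hypothetical. Reading the document (step 301) is assumed to have been done by the caller.

```python
import re
from collections import Counter


def extract_characteristic_words(text):
    # Step 302 (assumed rule): treat each lowercased alphanumeric run as a
    # characteristic word. A real system would use its own extraction method.
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_characteristic_vector(text):
    # Step 303: count the term frequency of each extracted word.
    frequencies = Counter(extract_characteristic_words(text))
    # Step 304: store (word, frequency) pairs as the elements of the vector.
    return dict(frequencies)


vector = generate_characteristic_vector("LAN cable and LAN card")
```

In this sketch the same function serves for both registration documents and the seeds document, since the description applies the same generation processing to each.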
FIG. 22 shows an outline of prior art technique 1.
According to prior art technique 1, processing request determining step 410 determines whether a processing request inputted to the system is a request for registration processing or for retrieval processing. If the registration processing is requested, step 210 is executed.
In step 210, the program extracts characteristic words contained in registration documents 1 and 2, calculates a term frequency of each characteristic word in each document, and generates registration document characteristic vectors 403 and 404 for registration documents 1 and 2, respectively.
A registration document characteristic vector 403 “document 1 (“LAN”,1) (“”,1) . . . ” is a characteristic vector of “document 1” and indicates that a characteristic word “LAN” appears once and a characteristic word “” appears once.
If step 410 determines that the retrieval processing is requested to retrieve a similar document, the program extracts characteristic words from a specified seeds document 406. In step 220, the program generates a seeds document characteristic vector 407 for the seeds document 406.
The program then calculates as similarity a cosine of an angle between the seeds document characteristic vector 407 and the registration document characteristic vector of each registration document generated in step 210.
In general, a cosine of an angle between vectors A and B is expressed as follows.
$$\text{Similarity} = \text{Cosine of angle between vectors } A \text{ and } B = \frac{A \cdot B}{|A| \times |B|} \tag{1}$$

where “A·B” is the inner product of vectors A and B and |A| is the magnitude of vector A.
The cosines of the angles between the seeds document characteristic vector 407 and the registration document characteristic vectors 403 and 404 shown in FIG. 22 are calculated as below. In expressions (2) and (3), vector A indicates the seeds document characteristic vector 407 and vector B indicates the registration document characteristic vector 403 or 404.
$$\text{Cosine of angle between vectors 407 and 403} = \frac{1 \times 0 + 1 \times 0 + 1 \times 1 + 1 \times 0}{\sqrt{1^2 + 1^2 + 1^2 + 1^2} \times \sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2}} = \frac{1}{2\sqrt{6}} = 0.204 \tag{2}$$
$$\text{Cosine of angle between vectors 407 and 404} = \frac{1 \times 1 + 1 \times 1 + 1 \times 1 + 1 \times 0}{\sqrt{1^2 + 1^2 + 1^2 + 1^2} \times \sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2}} = \frac{3}{2\sqrt{5}} = 0.670 \tag{3}$$
As a result, the program produces a similarity calculation result 408 of each registration document for the seeds document. A processing example of prior art technique 1 has been described.
In prior art technique 1 described above, characteristic words are extracted from registration documents to generate registration document characteristic vectors in advance. When a seeds document is specified as a retrieval condition, a cosine between a seeds document characteristic vector of the seeds document and each of the registration document characteristic vectors is calculated as similarity to retrieve a document having contents similar to those of the seeds document from a document database.
However, prior art technique 1 has a problem. That is, when a characteristic word as an element of the characteristic vector is a compound word including a plurality of words, some similar documents cannot be retrieved in certain cases.
FIG. 23 shows the problem of prior art technique 1. The problem will now be described by referring to FIG. 23. In this example shown in FIG. 23, the user inputs a seeds document    to a document database to which document 3     . . . ┘ and document 4    . . . ┘ are beforehand registered.
First, document registration processing is executed in step 210 to generate registration document characteristic vectors 403a and 404a for the respective documents. In the example, the characteristic vector 403a for document 3 is “document 3 (“”,1) (“”,1) (“”,1) (“”,1)” and the characteristic vector 404a for document 4 is “document 4 (“”,1) (“ ”,1)”.
Next, similar document search processing is executed in step 220 to generate a seeds document characteristic vector 407a for the seeds document. In this example, the generated seeds document characteristic vector 407a is “seeds document (“ ”,1)”.
In step 222, similarity of each registration document is calculated for the seeds document to produce a similarity calculation result 408a. In the example, similarity values of 0.000 and 0.710 are obtained for documents 3 and 4, respectively, as below.
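$$\text{Similarity} = \frac{1 \times 0}{\sqrt{1^2} \times \sqrt{1^2 + 1^2 + 1^2 + 1^2}} = \frac{0}{2} = 0.000 \tag{4}$$

$$\text{Similarity} = \frac{1 \times 1}{\sqrt{1^2} \times \sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} = 0.710 \tag{5}$$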
As a result, although the contents of document 3 are related to the seeds document, the calculation result of prior art technique 1 disadvantageously indicates that document 3 is not related to the seeds document at all.
This occurs as follows. Although a characteristic word extracted as an element of the seeds document characteristic vector includes a plurality of words, only the characteristic word “ ” obtained under a longest matching condition is employed as the element of the characteristic vector in the similarity calculation. Therefore, the concept of each word constituting the characteristic word is not reflected in the similarity. In short, no similarity is assigned to a registration document that includes the individual words constituting the characteristic word but not the characteristic word itself, and hence such a registration document is not retrieved.
On the other hand, the disadvantageous case described above can be prevented by using each of the words included in “”, namely, “” and “” in place of the characteristic word “ ” obtained under a longest matching condition. However, this possibly increases the chance that a document having a lower degree of similarity to “” is retrieved as noise. Problems of prior art technique 1 have been described.
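The trade-off described above can be reproduced with a small sketch. The original example words are not reproduced here, so the compound word “network management”, its constituent words, and all three document vectors are hypothetical English placeholders chosen only to mirror the shape of vectors 403a, 404a, and 407a; the `noise` document is an additional invented example. Splitting a compound on whitespace is likewise an assumption of the sketch.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors held as dicts."""
    dot = sum(freq * b.get(word, 0) for word, freq in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_compounds(vector):
    """Replace each compound word by its constituent words (split on spaces)."""
    result = {}
    for word, freq in vector.items():
        for part in word.split():
            result[part] = result.get(part, 0) + freq
    return result


# Hypothetical vectors: the compound appears only in the seed and in doc4.
seed = {"network management": 1}
docs = {
    "doc3": {"network": 1, "management": 1, "tool": 1, "server": 1},
    "doc4": {"policy": 1, "network management": 1},
    "noise": {"management": 1, "salary": 1, "personnel": 1, "office": 1},
}

# Longest matching only: doc3 shares no element with the seed and scores 0.
longest = {name: cosine(seed, vec) for name, vec in docs.items()}

# Constituent words only: doc3 is now found, but the unrelated document
# that merely contains "management" also receives a nonzero score.
decomposed = {
    name: cosine(split_compounds(seed), split_compounds(vec))
    for name, vec in docs.items()
}
```

With the longest-match vectors, `doc3` scores zero and `doc4` scores 1/√2, matching expressions (4) and (5). With the decomposed vectors, `doc3` is retrieved, but so is `noise`, which illustrates why neither choice alone is satisfactory.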