There are various kinds of electronic documents in an organization, including word processing documents, presentation documents, DTP documents, CAD documents, and spreadsheet documents. When a database in which electronic documents are accumulated is searched for an electronic document that one wants to browse, it is not preferable in terms of search efficiency that many similar electronic documents appear as search results. In an organization, for example, an electronic document created by a given member is copied and modified by another member, a member saves an electronic document as a new one upon every modification, or an electronic document created by a given member is distributed to many members by e-mail or the like. As a result, identical or very similar electronic documents accumulate and appear together as search results. As a method of specifying identical or very similar electronic documents, there is known a conventional technique called similar document detection or very similar document detection.
The most popular similar document detection method is the vector space model in reference 1 (“A Vector Space Model for Automatic Indexing”, Communications of the ACM, November, 1975, Vol. 18, No. 11, pp. 613-620). According to the method described in reference 1, each document is represented as a vector whose elements are the words appearing in the document and their frequencies. The similarity between documents is obtained based on the inner products of the vectors of the respective documents. However, the vector space model suffers from the following problems. First, a larger number of extracted words increases the number of vector elements and requires a larger-capacity memory. Second, a larger number of extracted words prolongs the time taken to calculate the inner product.
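The inner-product comparison described for reference 1 can be sketched as follows. This is a minimal illustration rather than the method of reference 1 itself: the whitespace tokenization, the function names, and the normalization of the inner product by the vector lengths (cosine similarity) are all assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector: word -> frequency.

    Splitting on whitespace is an assumption; a real system would use
    a proper tokenizer or morphological analyzer.
    """
    return Counter(text.lower().split())


def inner_product(u, v):
    """Inner product of two sparse word-frequency vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v[word] for word, count in u.items())


def cosine_similarity(u, v):
    """Inner product normalized by vector lengths (an assumed choice)."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)
```

Both the memory held by each `Counter` and the time spent in `inner_product` grow with the number of distinct words kept per document, which corresponds to the two problems noted above.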
These problems can be solved to some extent by narrowing down the words to be extracted. A word to be extracted after narrowing down is called a feature word. A variety of related techniques have been proposed for the feature word extraction method. The most famous method is tf*idf. Since tf*idf is a general feature word extraction method, a description thereof will be omitted. In another related technique, words which frequently co-occur in a document are used as feature words, as disclosed in, e.g., reference 2 (Japanese Patent Laid-Open No. 2005-222480) and reference 3 (“KeyGraph: Automatic Indexing by Segmenting and Unifying Co-occurrence Graphs”, IEICE Transactions, February, 1999, Vol. J82-D-I, No. 2, pp. 391-400). Also, there are many related techniques which use a compound word as a feature word, as disclosed in, e.g., reference 4 (Japanese Patent Laid-Open No. 2003-16092).
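For orientation only, narrowing down to feature words with tf*idf might look like the following sketch. It assumes one common variant (raw term count multiplied by the logarithm of the number of documents divided by the document frequency); the function name, the `top_k` parameter, and the whitespace tokenization are hypothetical and are not taken from any of the references.

```python
import math
from collections import Counter


def tfidf_feature_words(documents, top_k=3):
    """Return the top_k highest-scoring words for each document.

    Score = tf * log(N / df), one common tf*idf variant (an assumption).
    Ties are broken alphabetically so the result is deterministic.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    features = []
    for words in tokenized:
        term_freq = Counter(words)
        scores = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        features.append(ranked[:top_k])
    return features
```

A word that appears in every document receives a score of zero under this variant, so it is ranked below any word that is specific to a few documents.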
Reference 5 (Japanese Patent Laid-Open No. 9-198409) discloses another similar document search method. According to the method described in reference 5, words belonging to a specific part of speech are extracted from character strings in an electronic document. The order of occurrence of these words and their parts of speech are recorded. Then, documents or sentences are compared to specify similar documents or similar sentences. A main purpose of the technique disclosed in reference 5 is to detect an illegal copy which violates the copyright law, so similarity is determined more strictly than in the vector space model.
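The idea of recording the order of occurrence and the part of speech of selected words can be illustrated with the sketch below. It is not the procedure of reference 5: the input is assumed to be already tagged as (word, part of speech) pairs, the set of retained parts of speech is a hypothetical choice, and the comparison uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever comparison reference 5 actually employs.

```python
from difflib import SequenceMatcher

# Hypothetical choice of parts of speech to retain.
KEPT_POS = {"NOUN", "VERB"}


def pos_sequence(tagged_tokens):
    """Keep (word, pos) pairs of the selected parts of speech, in order."""
    return [(word.lower(), pos) for word, pos in tagged_tokens
            if pos in KEPT_POS]


def sequence_similarity(tagged_a, tagged_b):
    """Order-sensitive similarity in [0, 1] between two tagged texts."""
    seq_a = pos_sequence(tagged_a)
    seq_b = pos_sequence(tagged_b)
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return matcher.ratio()
```

Because the order of the words is part of the comparison, two texts that contain the same words in a different order score lower here than they would under a word-frequency vector, which is consistent with the stricter similarity described above.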
Reference 6 (Japanese Patent Laid-Open No. 2006-92344) discloses still another similar document search method. According to the method described in reference 6, various kinds of electronic documents (e.g., a word processing document, presentation document, DTP document, and CAD document) are converted into images of the same format and compared pixel by pixel, thereby determining document equivalence and detecting differences. The most significant feature of this technique is that equivalence determination can be made for figures and images in addition to character strings.
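The pixel-by-pixel comparison step can be sketched as follows. The sketch assumes that the documents have already been rendered into images of identical dimensions and that each image is given as a list of rows of pixel values; the rendering step and both function names are hypothetical and are not taken from reference 6.

```python
def pixel_diff(image_a, image_b):
    """Return the (row, column) positions at which two images differ.

    Each image is a list of rows, and each row is a list of pixel values.
    Both images must have the same dimensions.
    """
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")

    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(image_a, image_b))
        for c, (pa, pb) in enumerate(zip(row_a, row_b))
        if pa != pb
    ]


def is_equivalent(image_a, image_b):
    """Two rendered documents are equivalent if no pixel differs."""
    return not pixel_diff(image_a, image_b)
```

Since the comparison operates on rendered pixels rather than on extracted text, it treats figures and embedded images in the same way as character strings.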