1. Field of the Invention
The present invention relates to a retrieval apparatus and a retrieval method for retrieving similar documents by utilizing information relating to diagrams included in documents, and further relates to a computer-readable recording medium having recorded thereon a program for realizing the apparatus and the method.
2. Description of Related Art
A similar document retrieval system has a function of finding documents that are similar to a document input by the person doing the search (input document) from among document information that is being held. According to the similar document retrieval system, by inputting a document that serves as a basis for retrieving similar documents as a search expression, the person doing the search is thus able to acquire, as a search result, a group of similar documents that match the search expression.
Also, the similar document retrieval system is mainly provided with functional elements such as a crawler for collecting information for a search, a searcher for actually performing the search based on the information collected by the crawler, and a scoring function for ordering the search results.
In the similar document retrieval system, a search based on the information collected by the crawler is executed by the searcher with respect to the input document, and a search result is returned. At this time, a similarity based on some sort of index is computed by scoring, and ordering (ranking) is performed on the search result.
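As a non-limiting illustration, the division of roles among the crawler, the searcher and scoring described above may be sketched as follows. The function names (`crawl`, `search`, `cosine`), the whitespace tokenization and the use of cosine similarity as the index are assumptions made for this sketch and are not taken from any particular system.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def crawl(raw_documents):
    """Hypothetical crawler: collect term counts for each held document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in raw_documents.items()}


def cosine(a, b):
    """Scoring: cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, input_document):
    """Hypothetical searcher: score every held document and rank the matches."""
    query = Counter(tokenize(input_document))
    scored = ((doc_id, cosine(query, terms)) for doc_id, terms in index.items())
    # Keep only documents with a non-zero similarity, highest score first.
    return sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)
```

In this sketch, calling `search(crawl(held_documents), input_document)` returns a ranked list of pairs of a document identifier and a similarity, which corresponds to the ordered search result.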
Incidentally, with the scoring in the similar document retrieval system, in many cases the person doing the search requires that the similarity be computed based on the similarity of descriptive contents rather than a simple comparison of sentences. This is the case even with similar document retrieval of documents written in different languages such as Japanese and English. A number of techniques that involve performing a search by computing the similarity based on the contents of documents have thus been proposed as techniques for retrieving similar documents, aside from retrieval techniques that simply involve comparing texts.
For example, JP 2010-218216A (hereinafter, “Literature 1”) discloses a technique for performing a search by computing the similarity from the frequency with which keywords unique to a user that correspond to search terms appear, using a dictionary of related terms.
However, with the technique disclosed in Literature 1, a large-scale database of related terms is needed in addition to a database of documents in order to execute a search. Further, since character information such as keywords is targeted for evaluation, erroneous evaluation may result from the wording of complex sentences, or it may not be possible to compute the similarity of the contents of documents written in different languages.
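By way of example only, keyword-frequency scoring with a dictionary of related terms, and its dependence on the description language, may be sketched as follows. This is not the implementation of Literature 1; the names `expand_terms` and `keyword_frequency_score`, the dictionary layout and the whitespace tokenization are assumptions made for illustration.

```python
def expand_terms(search_terms, related_terms):
    """Add the related terms of each search term (hypothetical dictionary layout)."""
    expanded = set()
    for term in search_terms:
        expanded.add(term)
        expanded.update(related_terms.get(term, ()))
    return expanded


def keyword_frequency_score(document, search_terms, related_terms):
    """Fraction of the document's tokens that are search terms or related terms."""
    keywords = expand_terms(search_terms, related_terms)
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)
```

With a related-terms dictionary written only in English, a document describing the same contents in Japanese scores zero in this sketch, because no token matches the expanded keywords.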
Also, JP 2005-258831A (hereinafter, “Literature 2”) discloses a technique for computing the similarity by focusing on a section of a main element (claims, etc.) of a standard text such as patent filing documents, further dividing the section focused on, and comparing each of the resultant sections with each document in a group of documents. With the technique disclosed in Literature 2, because translation is performed before the similarity is computed in the case where the language of the input document differs from the language of documents in the database, it is possible to search for similar documents of different languages.
However, even with the technique disclosed in Literature 2, similarity may be erroneously evaluated in the case of documents of different languages, given the difficulty in computing the similarity correctly due to factors such as differences in grammar and the nuance of words.
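As a rough sketch of the section-based comparison attributed to Literature 2 above, a main section may be divided into pieces and each piece compared with each document. The splitting rule, the word-overlap measure, the averaging and the `translate` hook (which defaults to leaving the text unchanged) are all assumptions made for this sketch, not details taken from Literature 2.

```python
import re


def split_sections(text):
    """Divide a section into pieces at semicolons and periods (an assumed rule)."""
    return [piece.strip() for piece in re.split(r"[;.]", text) if piece.strip()]


def overlap(a, b):
    """Word-overlap (Jaccard) similarity between two pieces of text."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def section_similarity(main_section, documents, translate=lambda text: text):
    """Average, per document, of the overlap between each piece and the document."""
    # `translate` is a hypothetical hook applied before comparison.
    pieces = [translate(piece) for piece in split_sections(main_section)]
    scores = {}
    for doc_id, body in documents.items():
        total = sum(overlap(piece, body) for piece in pieces)
        scores[doc_id] = total / len(pieces) if pieces else 0.0
    return scores
```

Because every comparison in this sketch is made on words, any inaccuracy introduced by the `translate` step carries directly into the score.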
Thus, with regard to the conventional similar document retrieval techniques disclosed in Literature 1 and Literature 2, it has been pointed out that since the similarity is computed by focusing on character information, there is a problem in that the similarity evaluation is influenced by the description language.
Also, comparison information that can be focused on apart from character information includes information specifying diagrams cited within documents (hereinafter, “diagram information”). Given that the role of a diagram in a document is to summarize the contents mentioned preceding and/or following the diagram, diagram information is able to directly represent the contents described in the document. Additionally, since diagram information is constituted by image data, it also is possible to evaluate similarity without being affected by the description language.
For example, JP 2006-148263 (hereinafter, “Literature 3”) discloses technology for interpolating a region of an image that is missing because it is covered by ticker characters, and restoring an image that does not include the ticker characters. Further, JP 4545641 (hereinafter, “Literature 4”) discloses technology for dividing an image into small sections, and determining whether images are similar by comparing the similarity of the partial images.
By combining the technologies disclosed in Literature 3 and Literature 4, images can be generated from which the character information included within diagrams, that is, within images, has been eliminated, and it can be investigated whether images are similar based on the generated images. Since this combined technology enables similarity to be determined using partial images, similarity can also be determined using the other sections of an image from which character information has been eliminated, even when there are parts that could not be restored.
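The comparison of partial images described above may be illustrated, purely as a sketch, as follows. A grayscale image is assumed to be a list of equal-length rows of pixel values, and the tile size, the tolerance and the `skip` parameter (tiles that could not be restored) are hypothetical choices for this example rather than details of Literature 3 or Literature 4.

```python
def tile_means(image, tile):
    """Divide a grayscale image into tile x tile sections and average each one."""
    height, width = len(image), len(image[0])
    means = {}
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            pixels = [
                image[y][x]
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            ]
            means[(top // tile, left // tile)] = sum(pixels) / len(pixels)
    return means


def partial_image_similarity(img_a, img_b, tile=2, tolerance=10, skip=frozenset()):
    """Fraction of compared sections whose mean intensities agree within tolerance.

    Sections listed in `skip` (for example, parts that could not be restored)
    are left out of the comparison.
    """
    means_a, means_b = tile_means(img_a, tile), tile_means(img_b, tile)
    compared = [key for key in means_a if key in means_b and key not in skip]
    if not compared:
        return 0.0
    matches = sum(1 for key in compared if abs(means_a[key] - means_b[key]) <= tolerance)
    return matches / len(compared)
```

Passing the coordinates of an unrestored section in `skip` lets the remaining sections decide the result, which mirrors the point that similarity can still be determined from the other sections of the image.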
In view of the above points, diagram information is conceivably very useful material for judging similarity when similar document retrieval is to be performed without being influenced by factors such as the description language or the wording of complex sentences.
Additionally, JP 2008-252877A (hereinafter, “Literature 5”) discloses a technique for determining whether an original document imported as an image is similar to a registered image that is registered in advance, as a technique for evaluating the similarity of images. Specifically, with the technique disclosed in Literature 5, an original document image that includes characters and photographs is divided into character regions and image regions, and extraction of features and computation of feature amounts based on the features are performed for each region. The similarity between the original document image and the registered image is then determined using the computed feature amounts. According to the technique disclosed in Literature 5, it is thus possible to locate parts in which there are diagrams (images) from within an original document, and to evaluate the similarity thereof.
However, with the above-mentioned techniques respectively disclosed in Literature 3, Literature 4 and Literature 5, since only one image or one sheet of an original document is targeted for evaluation, and evaluation of similarity for an entire document is not taken into consideration, it is difficult to perform similar document retrieval that takes the contents of an entire document into consideration.
Also, JP 2010-250359A (hereinafter, “Literature 6”) discloses a technique for searching for a document that includes a target image, using a document that includes images as an input. Specifically, with the technique disclosed in Literature 6, first, feature amounts of image data such as diagrams included in a document and terms extracted from the captions of images are pasted into the document as a search index, and a pseudo document is thereby created. Thereafter, the target image or a document including the target image is searched for based on the pseudo document. Also, with the technique disclosed in Literature 6, since the person doing the search is able to selectively change the weight for determining the similarity with respect to images and terms, it is also conceivably possible to target only a plurality of pieces of diagram information that are dotted throughout a document, and search for a target document that includes those pieces of diagram information.
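As a minimal sketch of how a weight selected by the person doing the search could shift the similarity between images and terms, consider the following. The dictionary keys `image_features` and `caption_terms`, the Jaccard measure and the single `image_weight` parameter are assumptions made for this illustration and are not taken from Literature 6.

```python
def jaccard(a, b):
    """Set-overlap similarity; returns 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def combined_similarity(query, candidate, image_weight=0.5):
    """Weighted sum of image-feature similarity and caption-term similarity.

    image_weight=1.0 uses only diagram information; 0.0 uses only terms.
    """
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    image_sim = jaccard(query["image_features"], candidate["image_features"])
    term_sim = jaccard(query["caption_terms"], candidate["caption_terms"])
    return image_weight * image_sim + (1.0 - image_weight) * term_sim
```

Setting `image_weight` to 1.0 corresponds to targeting only diagram information; in that case the score in this sketch depends solely on how many image features the two documents share.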
However, with the technique disclosed in Literature 6, in the case where similarity is evaluated using only diagram information, there is concern that the evaluation will simply determine how many images are the same, rather than take the contents of the document into consideration. Thus, even with the technique disclosed in Literature 6, similarity may not be appropriately evaluated using only diagram information, since retrieval is not performed that looks in depth at the contents of the document that are of interest to the person doing the search, such as the flow of those contents.
Heretofore, in the field of similar document retrieval, various retrieval techniques have thus been proposed as techniques for finding documents with similar contents to a document input by the person doing the search. With conventional retrieval techniques that have been proposed, the character information within documents is focused on, and a search is performed by evaluating the similarity of the contents of documents based on the character information.
In other words, with conventional retrieval techniques, it has been pointed out that since similarity is evaluated based on character information, there is a problem in that the similarity may not be correctly computed depending on the wording of complex sentences, and in that it is difficult to evaluate the similarity of documents written in different languages given the differences in grammar and the nuance of words.