Rapid progress has been made toward paperless offices. Paper documents created on a personal computer, and old paper documents that have accumulated in binders or the like, can now be stored in a database by using a scanner to convert them to electronic documents such as image data.
Even now, however, paper is preferred for materials distributed at meetings, and electronic files stored in a database are frequently printed out as paper documents and delivered to users.
Consider a case where a user who has received a paper document wishes to archive or transmit the document electronically, or to extract content from the document and reuse it. If, instead of obtaining an electronic file by converting the paper document back into electronic form, the user could acquire the original electronic file from a database and use that file, convenience would be enhanced because the loss of information caused by the intermediate paper document would be eliminated.
However, devising a query and keying it in on a personal computer in order to locate the original electronic file involves considerable labor on the part of the ordinary user.
A system proposed in order to solve this problem reads a paper document with a scanner and retrieves data of similar content from a database, as described in the specification of Japanese Patent No. 3017851.
In documents generally used in an office or the like, the content of a page can be broadly divided into text information and non-text information such as photographs and diagrams. For this reason, the applicant believes that retrieval processing of greater precision can be achieved by calculating the degree of similarity in a manner suited to the characteristics of each type of information at the time of the search.
For example, the applicant has considered implementing highly precise retrieval processing as follows. Area identification processing of the kind described in the specification of U.S. Pat. No. 5,680,478 is used to extract text areas and photograph areas from the page image of a scanned document and from the page image of a registered document. For text areas, a degree of similarity is obtained using features of the character strings produced by character recognition processing. For photograph areas, a degree of similarity is obtained using image features such as color and edges. That is, the degree of similarity is found using different retrieval means depending upon whether an area is a text area or a photograph area.
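The idea of using a different retrieval means for each area type can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of either cited patent: the `Area` type, the character-bigram overlap used for text, and the coarse color-histogram intersection used for photographs are all hypothetical stand-ins chosen for brevity. Area identification and character recognition are assumed to have already run, and edge features are omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Area:
    """Hypothetical container for one area extracted from a page image."""
    kind: str                                   # "text" or "photo"
    text: str = ""                              # recognized characters (text areas)
    pixels: list = field(default_factory=list)  # (r, g, b) tuples (photo areas)


def text_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams; a stand-in string feature."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def color_histogram(pixels: list) -> dict:
    """Normalized histogram with each RGB channel quantized to 4 levels."""
    counts = Counter((r // 64, g // 64, b // 64) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()} if total else {}


def photo_similarity(a: list, b: list) -> float:
    """Histogram intersection of the two color histograms (0.0 to 1.0)."""
    hist_a, hist_b = color_histogram(a), color_histogram(b)
    return sum(min(v, hist_b.get(k, 0.0)) for k, v in hist_a.items())


def area_similarity(query: Area, registered: Area) -> float:
    """Dispatch to the retrieval means that matches the area type."""
    if query.kind != registered.kind:
        return 0.0  # assumption: areas of different types are not compared
    if query.kind == "text":
        return text_similarity(query.text, registered.text)
    return photo_similarity(query.pixels, registered.pixels)
```

A caller would pair each area of the scanned page with a candidate area of a registered page and call `area_similarity` on each pair; how those per-area results are then combined into a page-level score is a separate question.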
In particular, a photograph or picture contained in a document page is often characteristic of that page. It can be anticipated that obtaining the degree of similarity of a photograph or picture with high precision will contribute to a major improvement in the performance of the retrieval system.
In such a retrieval system, however, the documents handled are highly varied, ranging from documents consisting mostly of text to documents consisting mostly of photographs and line art, and layout differs greatly from document to document. If the documents to be searched include a mixture of widely different layouts, a problem arises in that a retrieval method which uniformly evaluates the results provided by a plurality of different retrieval means may exhibit lower retrieval precision, depending upon the document.
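The problem with uniform evaluation can be illustrated with a small numerical sketch. All names and scores below are hypothetical and chosen only for illustration: the query is assumed to be a photograph-heavy page with a short caption, registered document A is assumed to be the true original, and registered document B is assumed to be an unrelated page that happens to share similar caption text.

```python
from statistics import mean

# Hypothetical per-area similarity scores against a photograph-heavy query.
# "A" is assumed to be the true original; "B" is an unrelated page whose
# short caption text happens to resemble the query's caption.
scores = {
    "A": {"photo": 0.9, "text": 0.3},
    "B": {"photo": 0.5, "text": 0.9},
}


def uniform_score(per_area: dict) -> float:
    """Evaluate every retrieval means uniformly: a plain, unweighted mean."""
    return mean(per_area.values())


# Rank the registered documents by their uniformly evaluated score.
ranked = sorted(scores, key=lambda doc: uniform_score(scores[doc]), reverse=True)
print(ranked)
```

Under these assumed scores, the uniform mean places B ahead of A even though the photograph, which dominates the query page, matches A far better. The small caption area influences the result as strongly as the large photograph area does.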