1. Field of the Invention
The present invention relates to a computer-based document retrieval system and document management system, and more particularly to a document registration method, retrieval method, document registration/retrieval equipment, and storage media storing a document registration/retrieval program, which are used to search a set of image documents, electronic versions of paper documents, for the purpose of retrieving a document whose contents are similar to those of a user-specified document with high accuracy.
2. Description of Related Art
Large quantities of documents exist in an office. In recent years, sharing documents within an office and managing them so that user-specified documents can be offered promptly and accurately has become important for improving work efficiency. As a document sharing method for electronic data that is created by word-processing software or the like, a document management system has already been commercialized to offer a high-speed, efficient scheme for retrieving specified documents. As a paper document sharing system, an image document management system is available for reading paper documents with a scanner or like device and managing them as image data.
The image document management system is required to offer means for registering image data with ease and recalling stored image data for reuse. For the reuse of stored image data, it is essential that the image document management system provide means for retrieving image data and other electronic data containing user-specified information at high speed and with high efficiency.
As a method for retrieving electronic data containing user-specified information at high speed and with high efficiency, a similar-documents retrieval technology has been commercialized. With this technology, the user presents, as an example, a document that contains the user-specified contents (hereinafter referred to as a seed document), and the technology retrieves documents similar to the seed document.
A typical similar-documents retrieval method capable of handling image data is disclosed by JP-A No. 115330/1996 (hereinafter referred to as Prior Art 1). In a document registration process, Prior Art 1 reads a paper document as image data, converts the image data to text data by exercising a character recognition function to extract character information from the image data, and registers the text data together with the image data. To perform a document retrieval process, this technology reads a paper document as image data, converts the image data to text data by exercising a character recognition function to extract character information from the image data, and automatically searches the text data to extract a character string that characterizes the paper document (hereinafter referred to as a characteristic character string).
It is known that a character recognition error can occur when the character recognition technology is exercised to extract character information. However, Prior Art 1 presumes that the same scanner and OCR (Optical Character Recognition) device are used for the document registration process and document retrieval process. Based on such a presumption, Prior Art 1 can assure consistent character recognition accuracy for generated text data. More specifically, the text data entered as retrieval condition data and text data targeted for retrieval have the same tendencies in terms of erroneously recognized characters; therefore, Prior Art 1 cannot possibly incur a mismatch of characteristic character strings.
However, the above presumption makes it necessary to use exactly the same machine for registration and retrieval. This is inconvenient because a person who intends to retrieve a document must take the trouble to move to the registration machine. Even if the same scanner and OCR device are used, these character recognition devices do not always generate the same results when they encounter the same characters. The character recognition results may vary with the inclination of the read paper document and with the size, vividness, inclination, font, and other factors of the characters existing in the read document. Therefore, any character can be correctly recognized in one situation and erroneously recognized in another.
When, for example, the character “E” exists within image data, the character recognition result normally produced by an OCR device is the character “E”. However, if the character is inclined, blurred, or otherwise degraded in quality due, for instance, to paper document contamination, it may often be erroneously recognized as the character “F”, “B”, “Σ”, “L”, or “Γ” even when the same OCR device is used. Therefore, if a certain character is erroneously recognized in either one of a seed document and a document targeted for retrieval and correctly recognized in the other, the characteristic character strings may fail to match, causing inadequate retrieval.
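The mismatch described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that characteristic character strings are whitespace-delimited words; the description of Prior Art 1 given here does not state how they are actually extracted. The function names and the sample texts are hypothetical.

```python
def characteristic_strings(text):
    """Return the set of characteristic character strings in a text.

    Assumption for illustration only: each whitespace-delimited word is
    treated as one characteristic character string.
    """
    return set(text.split())


def shared_strings(seed_text, target_text):
    """Return the characteristic strings found in both texts (exact match)."""
    return characteristic_strings(seed_text) & characteristic_strings(target_text)


# Hypothetical OCR outputs for the same printed words "ENGINE REPORT":
# the seed document is recognized correctly, while in the target document
# the leading "E" is erroneously recognized as "F".
seed_ocr = "ENGINE REPORT"
target_ocr = "FNGINE REPORT"

# Only "REPORT" survives exact matching; "ENGINE" no longer matches "FNGINE".
print(shared_strings(seed_ocr, target_ocr))
```

With exact matching, a single erroneously recognized character removes the affected string from the set of matches, which is the cause of the inadequate retrieval discussed above.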
Further, the presumption made by Prior Art 1 does not hold true when the user makes a seed document entry by keying in natural text, when the scanner used for seed document setup differs from the scanner used for documents targeted for retrieval, or when the OCR device used for seed document setup differs from the OCR device used for documents targeted for retrieval. As a result, inadequate retrieval may occur because the characteristic character strings existing in a seed document do not match the characteristic character strings in documents targeted for retrieval.
Suppose that an existing paper document containing the character string “(Japanese soccer representatives compete with Brazil)” is character-recognized as “” by an OCR device. Also, suppose that characteristic character strings such as “”, “”, “”, “”, and “” are extracted from the above character recognition result. In this situation, documents targeted for retrieval in which “(soccer)” is erroneously recognized as “” can be retrieved, wherein “” is produced because of the first character “” erroneously recognized as “”; however, documents in which “” is correctly recognized as “” or erroneously recognized as “” will not be retrieved so that inadequate retrieval results.
In the case of “”, the character “” is recognized as “” when the OCR device failed to recognize the shorter vertical line of “”. In addition, “” is produced instead of “” because of the second character “” erroneously recognized as “”. In this case, the character “” was regarded as “” because both have a curved line on the right side and one or two short bars on the left side although they are different in character size.
Further, since the character “” is erroneously recognized as “”, for a reason that both characters have two horizontal lines and a vertical line laid on the upper horizontal line, the document retrieval result includes a document that contains the character string “ (Nintoku Emperor's tomb, a representative Japanese burial mound)” and is unnecessary for a document-retrieving user. If the user enters the character string “” to specify the seed document for retrieval, documents in which “” is erroneously recognized as “” will not be retrieved.
In short, there is in reality a character-recognition-induced gap between characteristic character strings specified as retrieval conditions or extracted from a seed document and characteristic character strings existing in documents targeted for retrieval. Since Prior Art 1 does not perform a process for bridging this gap, it incurs a mismatch of characteristic character strings, thereby reducing the retrieval accuracy.
A typical retrieval method for bridging a character-recognition-induced gap between characteristic character strings specified as retrieval conditions and characteristic character strings existing in documents targeted for retrieval is disclosed by JP-A No. 158478/1992 (hereinafter referred to as Prior Art 2). This technology learns about the tendency in the occurrence of a recognition error in advance and uses the result of such learning for retrieval to tolerate erroneously recognized characters in the documents targeted for retrieval, thereby conducting a full-text search with high accuracy and without requiring human proofreading. The term “full-text search” refers to a technology for retrieving documents that contain user-entered character strings for retrieval.
In Prior Art 2, the text data produced by an OCR device is registered as a document without being corrected. That is to say, Prior Art 2 avoids inadequate retrieval due to erroneously recognized characters contained in the retrieval target by improving the retrieval processing, without requiring human correction operations before document registration.
For certain characters, Prior Art 2 stores, in a similar-characters table, the recognition candidate characters that are likely to be produced as a result of erroneous character recognition. In a retrieval process, this technology divides a character string for retrieval into individual characters, looks up each character in the similar-characters table, and develops a plurality of character strings (hereinafter referred to as developed words) by combining the recognition candidate characters for all the referenced characters. To retrieve documents containing one or more of the developed words, this technology conducts a full-text search for the logical sum (OR) of the developed words (hereinafter referred to as an extended characteristic character string), thereby tolerating erroneously recognized characters in the documents targeted for retrieval.
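The development of words from a similar-characters table can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Prior Art 2: the table contents are limited to the “E” example given earlier, the names `SIMILAR_CHARS`, `develop_words`, and `full_text_search` are hypothetical, and the full-text search is modeled as a plain substring test.

```python
from itertools import product

# Hypothetical similar-characters table: for each character, the recognition
# candidate characters it is likely to be confused with. Only the "E" entry
# from the earlier example is included.
SIMILAR_CHARS = {
    "E": ["F", "B", "Σ", "L", "Γ"],
}


def develop_words(query, table):
    """Develop every variant of the query by combining, position by position,
    the original character with its recognition candidate characters."""
    candidates = [[ch] + table.get(ch, []) for ch in query]
    return ["".join(combo) for combo in product(*candidates)]


def full_text_search(documents, query, table):
    """Return the IDs of documents containing at least one developed word,
    that is, the logical OR of all developed words."""
    developed = develop_words(query, table)
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(word in text for word in developed)
    ]


# Hypothetical registered texts; document "b" contains recognition errors.
documents = {
    "a": "SEE THE TEXT",
    "b": "SFE THE TFXT",
    "c": "NOTHING",
}

print(develop_words("TE", SIMILAR_CHARS))
print(full_text_search(documents, "SEE", SIMILAR_CHARS))
```

Note that the number of developed words is the product of the candidate counts at each position (36 for “SEE” with this table), so the extended characteristic character string grows quickly as the query lengthens or the table becomes denser.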
When the full-text search method provided by Prior Art 2 above is applied to the retrieval of similar documents, retrieval can be achieved while tolerating erroneously recognized characters existing in the documents targeted for retrieval. However, Prior Art 2 cannot solve problems that are caused by erroneously recognized characters existing in a seed document. For example, if the above-mentioned character string “(soccer)” is erroneously recognized as “” or “” in a document targeted for retrieval, retrieval can be accomplished by the use of Prior Art 2.