The present invention relates to a method of retrieving document information in a system which converts paper documents into electronic documents for storage and management.
JP4158478, U.S. Pat. No. 5,265,242 and U.S. Pat. No. 4,985,863 disclose a character string search method and system.
With the advent of a full-fledged information-oriented society, document management systems that digitize documents for storage and management have come into wide use, replacing the conventional method of filing documents as recorded paper and storing and managing the paper files. Early document management systems generated image data by taking in a paper document with a scanner, registered the image data in association with bibliographic information such as “creator,” “date of generation” and “keyword,” and used that bibliographic information as the subject of search when retrieving a desired document. With a search that uses only the bibliographic information, however, it is difficult to find the desired document. Because a full text retrieval technique that covers the entire content of a document has already been put to practical use, document management systems with a full text retrieval function have also come into wide use in the field of image documents.
In such a document management system, a document in the form of recorded paper is taken in by a scanner and stored as image data, which is then converted into text data by character recognition processing. The text data is stored in addition to the image data. In retrieving a document, the full text retrieval is performed on the text data. When displaying the result of the search, the system displays the retrieved text data or the corresponding image data. The full text retrieval is based on the premise that the subject text data basically has no errors. However, since the text data used for the search is generated from image data by character recognition processing using OCR (optical character recognition), the text data may contain recognition errors. Hence, the search may fail to hit text data that would have been found if the characters had been recognized correctly.
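The search miss described above can be illustrated with a minimal sketch. The function name `full_text_search` and the two sample texts are hypothetical and serve only as an illustration; the sketch assumes that full text retrieval is a plain exact-substring match, which is a simplification of a real retrieval engine.

```python
def full_text_search(query, documents):
    """Return the ids of documents whose text contains `query` exactly."""
    return [doc_id for doc_id, text in documents.items() if query in text]


# Hypothetical OCR output. Document 2 should read "lock", but the
# character "l" is assumed to have been recognized as the digit "1".
documents = {
    1: "turn the lock before leaving",
    2: "turn the 1ock before leaving",
}

hits = full_text_search("lock", documents)
print(hits)  # document 2 is not found because of the recognition error
```

Document 2 escapes the search even though its original paper document contains the search character string.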
To solve the above-described problem of a document escaping the search, text data that may contain OCR recognition errors has conventionally been proofread manually. That is, during the process of registering a document, the text data output from the OCR is compared with the original document to check for recognition errors, which are then corrected manually so that the registered document can be retrieved normally. With this method, however, the manual proofreading and correction work places a heavy burden on the user, and document registration takes considerable time and labor. As a technique to solve this problem, JP4158478 discloses a method that allows for a certain degree of ambiguity in the subject of the search. This conventional technique registers a document without making any correction to the text data output from the OCR. That is, the error-containing document obtained from the OCR is registered as it is, and provisions are made in the search process to eliminate the need for manual correction work.
The conventional technique divides a search character string into individual characters, checks each character against a similarity table to pick up candidate characters with which it is likely to be confused in recognition, and combines the candidate characters of the search characters to form a plurality of character strings (hereinafter referred to as expanded words). When the search character string specified in the document search is long, the number of expanded words increases dramatically, prolonging the time taken by the search.
For example, when a search character string is “lock” and it is assumed that the search characters have five candidate characters each, such as (l, I, !, 1, i), (o, O, 0, Q, 6), (c, C, G, e, q) and (k, K, h, b, R), then the number of different expanded words generated by combining all of these candidate characters is 5×5×5×5 = 5⁴ = 625.
Similarly, for any other four-character search character string whose search characters have five candidate characters each, the number of different expanded words generated by combining all of these candidate characters is 5×5×5×5 = 5⁴ = 625.
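The expansion in the “lock” example can be sketched as follows. This is only an illustration of the combination step, not the implementation of the cited technique: the names `SIMILARITY_TABLE` and `expand` are hypothetical, and the table holds just the four candidate lists given in the example.

```python
from itertools import product

# Candidate characters taken from the "lock" example in the text.
# A real similarity table would cover every character (assumption).
SIMILARITY_TABLE = {
    "l": ["l", "I", "!", "1", "i"],
    "o": ["o", "O", "0", "Q", "6"],
    "c": ["c", "C", "G", "e", "q"],
    "k": ["k", "K", "h", "b", "R"],
}


def expand(query, table):
    """Combine the candidate characters of each search character."""
    # A character missing from the table is kept as its only candidate.
    candidates = [table.get(ch, [ch]) for ch in query]
    return ["".join(combo) for combo in product(*candidates)]


expanded = expand("lock", SIMILARITY_TABLE)
print(len(expanded))  # 625
```

The list contains the original word “lock” as well as misrecognized forms such as “1ock”, so a search over all expanded words can hit text that contains recognition errors.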
For a longer search character string made up of eight characters, the number of expanded words is as large as 5⁸ = 390,625, indicating that as the character string becomes long, the number of expanded words increases sharply. Because the search operation is based on the full text search using a logical sum (OR) set of the expanded words, an increase in the number of expanded words results in an increase in the search time. Thus, as the search character string becomes long, the time taken by the search also increases significantly.
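The growth in the number of expanded words, and its effect on an OR search, can be sketched as follows. The names `expanded_word_count` and `or_search` are hypothetical, and the OR search is modeled simply as testing each expanded word in turn against the text, which is an assumption about how the logical sum is evaluated.

```python
from itertools import product


def expanded_word_count(query_length, candidates_per_char=5):
    """Number of expanded words when every character has the same
    number of candidate characters."""
    return candidates_per_char ** query_length


def or_search(text, candidate_lists):
    """Return True if any expanded word occurs in `text`.

    In the worst case (no hit) every combination is tested, so the
    work grows with the number of expanded words.
    """
    return any("".join(combo) in text for combo in product(*candidate_lists))


counts = {n: expanded_word_count(n) for n in (4, 8)}
print(counts)  # {4: 625, 8: 390625}

# Candidate lists for "lock", taken from the example in the text.
lock_candidates = [
    ["l", "I", "!", "1", "i"],
    ["o", "O", "0", "Q", "6"],
    ["c", "C", "G", "e", "q"],
    ["k", "K", "h", "b", "R"],
]
found = or_search("turn the 1ock before leaving", lock_candidates)
```

Doubling the length of the search character string from four to eight characters multiplies the number of expanded words by 625, which is why the search time rises so sharply for long strings.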