1. Field of the Invention
The present invention relates to an apparatus for searching document images using results obtained by performing character recognition on the document images.
2. Description of the Related Art
Recently, document management systems that digitize paper documents using a scanner or the like and share them have come into use, improving business efficiency in organizations such as enterprises through both information sharing and rapid access. In personal environments too, personal computers have become widespread, and the need for a document management system has increased because electronic documents must be linked with conventional paper documents.
In a document management system, a paper document is read using a scanner, and the document image is stored. However, a keyword must be attached in order to search for the image later. There are a variety of methods for attaching a keyword, one of which is full-text search.
In this method, instead of attaching as a keyword a special word that summarizes and represents a document, the full text of the document is used for search. In other words, the full text is searched using the keyword characters inputted for search. In this case, there is no need to select a word that summarizes the document, nor is there any concern as to whether the summary word really represents the document. Therefore, although the method has the disadvantage of a long processing time, it is widely used.
Full-text search technology is broadly classified into two categories. One is the so-called grep search method, which collates an inputted keyword with the text to be searched word by word. The other is the index search method, which prepares a search index from the text in advance and collates a keyword with this index when the keyword is inputted. Generally, grep search is practical for a small number of documents, while index search is recommended for a large number of documents, since grep search takes too long.
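The difference between the two categories can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names, the use of character bigrams as index keys, and the substring-based collation are all assumptions made for illustration.

```python
from collections import defaultdict


def grep_search(documents, keyword):
    """Grep-style search: scan the full text of every document at query time."""
    return sorted(doc_id for doc_id, text in documents.items() if keyword in text)


def build_index(documents, n=2):
    """Index preparation: map each character n-gram to the documents containing it.

    Using character bigrams as index keys is an assumption of this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index


def index_search(index, documents, keyword, n=2):
    """Index-style search: look up the keyword's n-grams, then verify candidates."""
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        # Keyword shorter than n characters: fall back to a full scan.
        return grep_search(documents, keyword)
    # Only documents containing every n-gram of the keyword can match.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify, because sharing all n-grams does not guarantee a contiguous match.
    return sorted(d for d in candidates if keyword in documents[d])


documents = {
    "d1": "document image search",
    "d2": "character recognition",
    "d3": "image recognition",
}
index = build_index(documents)
print(grep_search(documents, "image"))
print(index_search(index, documents, "image"))
```

Both functions return the same documents. The grep version reads every text on each query, whereas the index version pays the scanning cost once at registration and then touches only the candidate documents.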
When paper documents are made searchable in a general document management system, at the time of document registration a character area is automatically extracted from the document image, a text is generated by performing character recognition on the area, and the document image and the recognized text are paired and managed as one document. When documents are searched, the stored text is collated with an inputted keyword and the corresponding document is extracted.
In the system of a large organization, full-text search is used to improve search accuracy, and index search is used instead of grep search to improve search speed. However, since the accuracy of a character recognition process is not 100%, recognition errors always occur, and the accuracy of searching the recognized text sometimes degrades as a result. The following conventional technologies are used to prevent the degradation of search accuracy due to such recognition errors.
(1) A Technology for Generating a Correct Text by Automatically Correcting Wrongly Recognized Characters to Improve Search Accuracy
    (a) Japanese Patent Application Laid-open No. 7-182465, “Character Recognition Method” (1995).
In this method, when characters are recognized, a confidence degree is calculated, a word dictionary is consulted using one candidate character with a specific confidence degree as an index, and candidate words are extracted. Then, a character string with the highest probability is generated based on the location information as well as the collation cost of a word, and the first candidate is replaced with this character string.
    (b) Japanese Patent Application Laid-open No. 10-207988, “Character Recognition Method and Character Recognition Apparatus” (1998).
In this method, candidate characters, each with a confidence degree, are generated by character recognition. If the first candidate includes a low-confidence character, a plurality of three-character strings are generated for the three positions consisting of the low-confidence character, the character before it, and the character after it, using candidate characters whose confidence degree is equal to or greater than a specific threshold value. Then, an already stored correct text document is searched using these character strings, the character string that appears most frequently is designated as the correct character string, and the recognition result is automatically corrected. This method assumes that a large amount of correct text is already available.
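The three-character correction described for method (b) can be sketched roughly as follows. This is an illustrative reading of the description, not the implementation of the cited application: the function name, the data layout of the candidate list, the threshold value, and the sample corpus are all hypothetical.

```python
from itertools import product


def correct_low_confidence(candidates, position, corpus, threshold=0.5):
    """Pick the most frequent three-character string around a low-confidence character.

    candidates: one list per character position, each holding (char, confidence)
                pairs (hypothetical layout assumed for this sketch).
    position:   index of the low-confidence character; assumed to have a
                neighbour on both sides.
    corpus:     already stored text that is assumed to be correct.
    Returns the chosen three-character string, or None if nothing is found.
    """
    # The window covers the character before, the low-confidence character,
    # and the character after.
    window = candidates[position - 1:position + 2]
    # Keep only candidate characters at or above the confidence threshold.
    options = [[ch for ch, conf in slot if conf >= threshold] for slot in window]
    trigrams = ["".join(chars) for chars in product(*options)]
    if not trigrams:
        return None
    # Count how often each three-character string appears in the correct text.
    counts = {t: corpus.count(t) for t in trigrams}
    best = max(trigrams, key=lambda t: counts[t])
    return best if counts[best] > 0 else None


# Hypothetical recognition result for the middle of "search": the recognizer
# ranks "o" slightly above the correct "a".
candidates = [
    [("e", 0.95)],
    [("o", 0.45), ("a", 0.40), ("c", 0.05)],
    [("r", 0.90)],
]
corpus = "full-text search over a text archive; search the index"
print(correct_low_confidence(candidates, 1, corpus, threshold=0.3))
```

With this sample data the strings "eor" and "ear" are generated, and only "ear" occurs in the stored text, so it is selected. The sketch also shows the dependence noted above: without a sufficiently large body of correct text, no string can be chosen.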
(2) A Technology for Expanding a Keyword into a Plurality of Keywords to Improve Search Accuracy at the Time of Search
    (c) Japanese Patent Application Laid-open No. 4-92971, “Image Recording Apparatus” (1992).
In this method, a keyword character string inputted at the time of search is expanded into a plurality of character strings, and results obtained by searching a document using all the obtained search character strings are integrated and outputted. When a keyword character string is expanded, a search character string is generated by specifying a wrongly recognized character that is easily mistaken for a specific character, using a recognition error table and replacing the wrongly recognized character with a correct character.
    (d) Japanese Patent Application Laid-open No. 4-328682, “Image Recording Apparatus” (1992).
In this method, when a keyword is collated with a search target, up to N mismatched characters are ignored and the keyword is regarded as correctly collated.
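The two approaches in category (2) can be contrasted with a short Python sketch. It is one possible reading of methods (c) and (d), not the cited implementations: the contents of the recognition error table, the direction of substitution (each keyword character is swapped for characters the recognizer is assumed to confuse it with), and all function names are assumptions.

```python
from itertools import product

# Hypothetical recognition error table: each character maps to the characters
# the recognizer is assumed to confuse it with. A real table would be measured.
ERROR_TABLE = {"l": ["1", "I"], "o": ["0"], "O": ["0"]}


def expand_keyword(keyword, error_table=ERROR_TABLE):
    """Expand a keyword into every variant obtainable by confusable substitutions."""
    choices = [[ch] + error_table.get(ch, []) for ch in keyword]
    return ["".join(chars) for chars in product(*choices)]


def search_expanded(text, keyword, error_table=ERROR_TABLE):
    """Sketch of method (c): a hit if any expanded search string occurs in the text."""
    return any(variant in text for variant in expand_keyword(keyword, error_table))


def tolerant_find(text, keyword, max_errors):
    """Sketch of method (d): positions matching with at most max_errors mismatches."""
    hits = []
    for i in range(len(text) - len(keyword) + 1):
        window = text[i:i + len(keyword)]
        mismatches = sum(a != b for a, b in zip(window, keyword))
        if mismatches <= max_errors:
            hits.append(i)
    return hits


recognized_text = "the t0o1 is here"  # "tool" with two recognition errors
print(len(expand_keyword("tool")))
print(search_expanded(recognized_text, "tool"))
print(tolerant_find(recognized_text, "tool", 2))
```

Under this table, `expand_keyword("tool")` produces 12 search strings, because the alternatives for each character multiply. `tolerant_find` needs no table, but it compares the keyword against every position of the text character by character rather than using an ordinary exact-match text search.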
However, the conventional search methods described above have the following problems.
In method (1), which uses automatic correction, the accuracy of the automatic correction is not 100%, so not all wrongly recognized characters are corrected. This method generates a uniquely determined character string as the correction result and replaces the original text with it, so only one candidate character string is used. Therefore, if the generated character string is wrong, the document cannot be found by search. Thus, the accuracy of extracting a correct text for search from the recognition result information is not sufficient.
In method (2), which uses keyword expansion, the candidate character information of the recognition result is not used, so a large number of search character strings are generated. In other words, because recognition result information is not sufficiently used, many inappropriate search keywords are generated, even for an ambiguous search method. Therefore, the search time becomes enormous and search accuracy also degrades. Furthermore, because method (d) cannot be implemented with ordinary text search, a special search method is needed.