A document-filing device electronically accumulates documents by converting them into images with an image input device such as an image scanner, so that a user can search for a document later. Such devices have been put into practical use.
To search image documents that have been read in as image data, index information for searching must be assigned manually to each image document. This is very laborious.
A device has also been proposed that specifies the position of a character region (text region), performs OCR (Optical Character Reader) recognition, and thus allows full-text search according to the contents of the text. An example of a conventional technique that uses OCR recognition is disclosed in Japanese Unexamined Patent Publication No. 1995-152774 (Tokukaihei 7-152774).
However, OCR recognition requires a large amount of calculation and therefore takes a long time. Further, OCR recognition does not attain a high character recognition rate, so characters may be wrongly recognized and consequently missed in a search. OCR recognition is therefore problematic in terms of search accuracy.
On the other hand, Japanese Unexamined Patent Publication No. 1998-74250 (Tokukaihei 10-74250) discloses a technique for allowing automatic full text search without using OCR recognition.
In the technique disclosed in Japanese Unexamined Patent Publication No. 1998-74250, a category dictionary is prepared in which characters are classified in advance, according to their image features, into similar-character categories, each of which groups characters that look alike. In registering an image document, each character of a text region (character region) is not recognized as a character. Instead, an image feature of the character is extracted, the character is classified into a character category according to the image feature, and the category sequence identified for the characters is stored in combination with the input image. In searching the image documents, each character of a search keyword is converted into its corresponding category, and every image document that contains, as a part, the category sequence derived from the conversion is extracted as a result of the search.
It is described that the technique provides document filing that allows high-speed registration of documents with little computing power and allows searching with few omissions.
However, the technique disclosed in Japanese Unexamined Patent Publication No. 1998-74250 has the following problems.
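The registration and search flow described above can be sketched roughly as follows. This is a minimal illustration, not the publication's implementation: the table `CATEGORY_OF`, the category numbers, and the helper names are all hypothetical, and plain text characters stand in for the character images whose categories would really be derived from image features.

```python
# Hypothetical category dictionary: characters that look alike share one
# category ID. In the described technique the category would be derived
# from image features; here it is looked up directly for illustration.
CATEGORY_OF = {
    "O": 0, "0": 0, "Q": 0,
    "l": 1, "1": 1, "I": 1,
    "S": 2, "5": 2,
    "B": 3, "8": 3,
    "A": 4,
    "T": 5,
}


def to_category_sequence(text):
    """Map each character to its similar-character category ID."""
    return [CATEGORY_OF[ch] for ch in text]


def contains(sequence, pattern):
    """Return True if pattern occurs as a contiguous run inside sequence."""
    m = len(pattern)
    return any(sequence[i:i + m] == pattern
               for i in range(len(sequence) - m + 1))


def register(store, doc_id, text):
    """Store the category sequence for a document (stand-in for an image)."""
    store[doc_id] = to_category_sequence(text)


def search(store, keyword):
    """Return IDs of documents whose sequence contains the keyword's categories."""
    pattern = to_category_sequence(keyword)
    return sorted(doc_id for doc_id, seq in store.items()
                  if contains(seq, pattern))


store = {}
register(store, "doc1", "BOAT")
register(store, "doc2", "8QAT")  # visually similar to "BOAT"
register(store, "doc3", "TA5I")
print(search(store, "BOAT"))
```

In this toy data, "doc2" is returned together with "doc1" because both collapse to the same category sequence, which is how the approach avoids omissions without recognizing individual characters.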
In the technique disclosed in this publication, with respect to each similar-character category, a representative vector that is an average of feature vectors of characters belonging to the similar-character category is determined, and any one of character codes of the characters belonging to the similar-character category is determined as a representative character code.
When an image document is registered, a feature vector of each character image included in a text region of the image document is matched against the representative vectors of the similar-character categories, and the similar-character category to which each character belongs is identified. Each character image included in the text region is replaced with the representative character code of the identified similar-character category, and the character images are stored as a sequence of representative character codes.
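As a rough sketch of this registration step, the following code averages member feature vectors into a representative vector, picks one member as the representative character code, and encodes character features by nearest representative. The 2-D feature values, the category names, the choice of the first member as representative code, and the use of squared Euclidean distance are all assumptions made for illustration; the publication does not specify them here.

```python
# Hypothetical 2-D feature vectors for a few characters (illustrative only).
FEATURES = {
    "O": (1.0, 1.0), "0": (1.2, 0.8), "Q": (0.8, 1.2),
    "l": (5.0, 5.0), "1": (5.2, 4.8), "I": (4.8, 5.2),
}
# Hypothetical similar-character categories.
CATEGORIES = {"round": ["O", "0", "Q"], "stroke": ["l", "1", "I"]}


def squared_distance(u, v):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def build_category_table(categories, features):
    """For each category, compute the representative vector and code."""
    table = {}
    for name, members in categories.items():
        table[name] = {
            "representative_vector": mean_vector([features[c] for c in members]),
            # Any member may serve as the code; the first is an arbitrary pick.
            "representative_code": members[0],
        }
    return table


def classify(feature, table):
    """Return the category whose representative vector is nearest."""
    return min(table, key=lambda name: squared_distance(
        feature, table[name]["representative_vector"]))


def encode(char_features, table):
    """Replace each character feature with its category's representative code."""
    return "".join(table[classify(f, table)]["representative_code"]
                   for f in char_features)


table = build_category_table(CATEGORIES, FEATURES)
encoded = encode([(1.1, 0.9), (4.9, 5.1), (0.9, 1.0)], table)
print(encoded)
```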
However, although matching the feature vector of a character to be recognized against representative vectors requires less calculation, it yields a less exact matching result than directly matching that feature vector against the feature vector of each individual character. This may result in omissions in searching. Further, such matching and the subsequent indexing are generally performed offline, and therefore are not so convenient for a user. More exact matching is preferable for the user.
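The loss of exactness can be seen in a small contrived example. The feature values below are hypothetical and deliberately chosen so that the two methods disagree: the query lies closest to an individual character of category "A", yet the averaged representative vector of category "B" is nearer. Squared Euclidean distance is an assumed metric.

```python
# Hypothetical 2-D feature vectors, chosen so the two methods disagree.
CATEGORY_MEMBERS = {
    "A": {"a1": (0.0, 0.0), "a2": (6.0, 0.0)},
    "B": {"b1": (2.0, 2.0), "b2": (2.0, 3.0)},
}


def squared_distance(u, v):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    vectors = list(vectors)
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def match_by_representative(query):
    """One comparison per category, against the averaged vector."""
    representatives = {name: mean_vector(members.values())
                       for name, members in CATEGORY_MEMBERS.items()}
    return min(representatives,
               key=lambda name: squared_distance(query, representatives[name]))


def match_directly(query):
    """One comparison per character; return the nearest character's category."""
    candidates = [(squared_distance(query, vec), name)
                  for name, members in CATEGORY_MEMBERS.items()
                  for vec in members.values()]
    return min(candidates)[1]


query = (0.5, 1.0)
by_representative = match_by_representative(query)
direct = match_directly(query)
print(by_representative, direct)
```

Representative matching needs two comparisons here and direct matching needs four, which reflects the trade-off between calculation amount and exactness stated above.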
Further, the technique disclosed in this publication also has a problem in searching. When a search is performed, the search keyword is converted, with reference to a character code/category correspondence table, into a sequence of representative character codes of the categories that include the characters of the search keyword. Then, the sequence of representative character codes converted from the keyword is searched for among the sequences of representative character codes obtained from the registered image documents, specifically by use of an index made of those sequences.
However, searching by converting the keyword into a sequence of representative character codes does not allow specifying where a character of the keyword lies within its similar-character category. Consequently, characters belonging to the same similar-character category show the same degree of relevance regardless of whether they are more similar or less similar to the keyword character. As a result, it is impossible to present image documents exactly in descending order of relevance.
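A short sketch makes the ranking problem concrete. The correspondence table `REPRESENTATIVE_OF` and the 0/1 `relevance` score are hypothetical stand-ins; the point is only that a document containing the keyword exactly and a document containing look-alike characters collapse to the same representative code sequence and therefore cannot be ranked apart.

```python
# Hypothetical character code/category correspondence table: each character
# maps to the representative character code of its similar-character category.
REPRESENTATIVE_OF = {
    "O": "O", "0": "O", "Q": "O",
    "l": "l", "1": "l", "I": "l",
    "B": "B", "8": "B",
}


def to_representative_codes(text):
    """Convert text to its sequence of representative character codes."""
    return "".join(REPRESENTATIVE_OF[ch] for ch in text)


def relevance(doc_text, keyword):
    """Return 1 if the keyword's code sequence occurs in the document, else 0.

    This is the only distinction the representative-code search can make.
    """
    return int(to_representative_codes(keyword)
               in to_representative_codes(doc_text))


keyword = "BOIl"
doc_exact = "BOIl"      # contains the keyword itself
doc_lookalike = "80l1"  # contains only visually similar characters

score_exact = relevance(doc_exact, keyword)
score_lookalike = relevance(doc_lookalike, keyword)
print(score_exact, score_lookalike)
```

Both documents receive the same score, so nothing in the stored code sequences indicates that the first document is the better match.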