1. Field of the Invention
The present invention relates to a document processing apparatus which reads and stores a document as an image, and in particular to a document processing apparatus having a retrieval function for retrieving text content from a document image.
2. Discussion of the Related Art
Document filing systems capable of converting a document into an image by an image input device such as an image scanner, storing the image electronically and retrieving it later have been put to practical use. However, many such systems have required manual assignment of retrieval attributes, such as keywords, to every input image; therefore much labor has been necessary.
In document retrieval, it is inherently desirable to carry out full-text retrieval based on the contents of the text. Full-text retrieval can be executed on an electronic document prepared by desktop publishing (DTP) or the like, but it cannot be carried out directly on a document image. Therefore, Japanese Patent Application Laid-Open No. 62-44878 (1987), for example, discloses that character recognition is performed on the text portion of a document, and that full-text retrieval is made possible by coding the text contents. Moreover, the candidates for each character obtained in the process of character recognition are retained so that retrieval omissions caused by recognition errors are reduced. However, in character recognition, and in particular in the character recognition of a document written in Japanese, which has a large number of character types, feature vectors of several hundred dimensions are obtained and matched against the features of approximately 3,000 or more character types; accordingly, the matching process of the feature vectors incurs a high computation cost. Besides, there is a possibility that a retrieval keyword is incorrectly recognized because the accuracy of character recognition is not very high. Japanese Patent Application Laid-Open No. 62-285189 (1987) discloses an invention which obtains a character string that is well-formed as Japanese by utilizing morphological analysis after character recognition, and automatically corrects the incorrectly recognized characters. In an invention disclosed in Japanese Patent Application Laid-Open No. 5-54197 (1993), Japanese characters are replaced with representative characters to reduce the number of character types to be dealt with, and then words are identified by utilizing a rate transition matrix for correcting the incorrectly recognized characters.
However, these inventions basically incur a high computation cost at document registration because character recognition must be executed, and if the ultimately desired object is merely a document image that includes the word designated in the retrieval, most of the character recognition performed would be wasted.
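The retention of recognition candidates described above can be illustrated by the following minimal Python sketch. It is not taken from the cited publication: the function name `find_with_candidates` and the representation of the recognition result as one list of candidate characters per position are assumptions made purely for illustration.

```python
from typing import List, Sequence


def find_with_candidates(candidates: Sequence[Sequence[str]], keyword: str) -> int:
    """Return the first position at which `keyword` matches, or -1.

    `candidates[i]` holds every character the recognizer retained for
    position i (assumed representation). A keyword character matches a
    position if it appears anywhere in that position's candidate list,
    not only as the top-ranked result.
    """
    n = len(keyword)
    for start in range(len(candidates) - n + 1):
        if all(keyword[i] in candidates[start + i] for i in range(n)):
            return start
    return -1


# The top-ranked reading is "eat", but "c" was retained as a second candidate.
recognized: List[List[str]] = [["e", "c"], ["a", "o"], ["t", "l"]]
top_ranked = "".join(c[0] for c in recognized)
position = find_with_candidates(recognized, "cat")
```

In this sketch the keyword "cat" is still found although the top-ranked reading is "eat". Note that the candidate lists themselves can only be produced by running the full recognition process, which is the cost pointed out above.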
According to "Keyword Search for Japanese Image Text", Yusa et al., Information Media, 19-1, January 1995, the features obtained from each character image are directly converted into 36-bit codes instead of being subjected to character recognition, the features of a retrieval keyword image are extracted in the same manner for feature matching, and character string retrieval is thereby performed using the codes. However, it is necessary either to input the retrieval keyword as an image or to generate an image from the character font images corresponding to the keyword; that is, the method is vulnerable to differences between the fonts of the keyword image and those used in the document image.
In "Document Reconstruction: A Thousand Words from One Picture", Reynar J. et al., in Proc. of 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 367-384, Las Vegas, April 1995, an attempt is disclosed in which characters in a text image in a language of European origin (English) are classified into a small number of categories based on their sizes and positions, and words are identified according to the sequence of the categories. U.S. Pat. Nos. 5,325,444 (1994) and 5,438,630 (1995) disclose a technology which measures the frequency of occurrence of a specific word and identifies a word without using OCR by utilizing an image feature per word unit called "Word Shape". However, it is difficult to intuitively find a feature that can serve as a key for a language having a large number of character types, such as Japanese or Chinese. Besides, it is impossible to obtain word units directly from an image because, unlike in languages of European origin, there is no physical space between the words on the image. For this reason, it is difficult to directly identify the words in a text written in Japanese or the like according to the disclosed methods.
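The category-sequence approach described above for English text can be illustrated by the following minimal Python sketch. The three categories used here (ascender or capital height "A", descender "g", x-height "x") and the names `shape_category`, `shape_code` and `build_shape_index` are assumptions chosen for illustration and are not the categories of the cited paper. In an actual system the category would be derived from the size and position of each character image; the sketch simulates this from known characters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

ASCENDERS = set("bdfhklt")   # lowercase letters rising above the x-height
DESCENDERS = set("gjpqy")    # lowercase letters dropping below the baseline


def shape_category(ch: str) -> str:
    """Map one character to a coarse shape category (assumed scheme)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    return "x"


def shape_code(word: str) -> str:
    """Sequence of shape categories for a word delimited by spaces."""
    return "".join(shape_category(ch) for ch in word)


def build_shape_index(lexicon: Iterable[str]) -> Dict[str, List[str]]:
    """Group lexicon words by shape code so a code can be looked up."""
    index: Dict[str, List[str]] = defaultdict(list)
    for word in lexicon:
        index[shape_code(word)].append(word)
    return index


index = build_shape_index(["dog", "bag", "cat", "the"])
```

Here "dog" and "bag" share the code "Axg", so a shape code identifies a small set of candidate words rather than a single word. The sketch also relies on words already being separated, which, as noted above, does not hold for Japanese text.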
Japanese Patent Application Laid-Open No. 4-199467 (1992) discloses an invention which groups character types that are apt to be mistaken for one another in recognition and assigns a character code to each group, the group codes being used in retrieval. In this method, character codes are first obtained by executing a character recognition process, and are then converted into the codes indicating the groups. Therefore, retrieval omissions are prevented by the grouping, but the high computation cost and time of character recognition are still required.
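The grouping of easily confused character types can be illustrated by the following minimal Python sketch. The confusion groups shown are hypothetical and use Latin letters and digits only for brevity; the names `CONFUSION_GROUPS`, `to_group_codes` and `find_keyword` are likewise assumptions and do not come from the cited publication.

```python
from typing import Dict, List

# Hypothetical groups of characters a recognizer tends to confuse.
CONFUSION_GROUPS: List[str] = ["0O", "1lI", "5S", "8B"]

# Every member of a group is represented by the group's first character.
GROUP_CODE: Dict[str, str] = {
    ch: group[0] for group in CONFUSION_GROUPS for ch in group
}


def to_group_codes(text: str) -> str:
    """Replace each character by its group code; others stay unchanged."""
    return "".join(GROUP_CODE.get(ch, ch) for ch in text)


def find_keyword(recognized_text: str, keyword: str) -> int:
    """Search in group-code space; returns the match position or -1.

    The mapping is one character to one character, so the returned
    position is also valid in the original recognized text.
    """
    return to_group_codes(recognized_text).find(to_group_codes(keyword))


recognized = "INV0lCE N0. 5"      # recognizer output for "INVOICE NO. 5"
plain_hit = recognized.find("INVOICE")
grouped_hit = find_keyword(recognized, "INVOICE")
```

A plain string search misses the keyword, whereas the search on group codes finds it at position 0. The input to `to_group_codes` is still the output of a complete character recognition pass, so the recognition cost noted above is not avoided.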
Japanese Patent Application Laid-Open No. 7-152774 (1995) discloses a technique in which, if a character apt to be incorrectly recognized is included in a character string in the retrieval condition expression, plural candidates for the retrieval condition expression are prepared and retrieval is executed with them. Furthermore, in an invention disclosed in Japanese Patent Application Laid-Open No. 6-103319 (1994), characters that cannot be converted normally are left indefinite, and retrieval is executed on data containing such indefinite characters. According to these techniques, retrieval omissions can be reduced, but these techniques also require the high computation cost and time of character recognition.
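The preparation of plural candidates for a retrieval condition expression can be illustrated by the following minimal Python sketch. The table `CONFUSABLE` and the functions `expand_query` and `search` are hypothetical names introduced for illustration, and the confusable pairs are Latin examples rather than the character sets of the cited publication.

```python
from itertools import product
from typing import Dict, List

# Hypothetical table: for each error-prone character, the character itself
# followed by the characters it may have been recognized as.
CONFUSABLE: Dict[str, str] = {
    "O": "O0", "0": "0O",
    "I": "Il1", "l": "lI1", "1": "1lI",
}


def expand_query(keyword: str) -> List[str]:
    """Generate every variant of `keyword` over the confusable table."""
    choices = [CONFUSABLE.get(ch, ch) for ch in keyword]
    return ["".join(combo) for combo in product(*choices)]


def search(recognized_text: str, keyword: str) -> bool:
    """True if any expanded variant occurs in the recognized text."""
    return any(variant in recognized_text for variant in expand_query(keyword))


variants = expand_query("NO")
found = search("INV0lCE N0. 5", "NO")
```

In this sketch the number of variants is the product of the number of alternatives at each position (six for "OIL" with the table above), so the expansion grows quickly for keywords containing many error-prone characters.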