1. Field of the Invention
The present invention relates to technologies that improve recognition precision when recognizing characters in image data obtained by optically reading a document.
2. Description of Related Art
OCR (Optical Character Recognition), a technology for recognizing characters in image data obtained by optically reading a document, is in common use. A variety of technologies have been proposed in the OCR field to improve the precision of character recognition.
One known technique improves recognition precision by updating a recognition dictionary based on correcting operations by a user. With this technique, characters that could not be recognized or were incorrectly recognized are corrected by the user, and the feature vector of the character shape registered for the corrected character in a feature vector database is updated to reflect the feature vector of the character shape observed when that character was recognized.
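The correction-driven update described above can be illustrated with a minimal sketch. The database layout (a dict from character to feature vector), the nearest-vector matching, the blending weight `alpha`, and the function names `recognize` and `apply_correction` are all assumptions made for illustration; the related art does not specify how the stored vector is made to reflect the observed one.

```python
from typing import Dict, List

# Hypothetical feature vector database: character -> stored shape vector.
FeatureDB = Dict[str, List[float]]


def recognize(db: FeatureDB, observed: List[float]) -> str:
    """Return the character whose stored vector is nearest to the observed one
    (squared Euclidean distance; an assumed matching rule)."""
    return min(
        db,
        key=lambda ch: sum((s - o) ** 2 for s, o in zip(db[ch], observed)),
    )


def apply_correction(
    db: FeatureDB, corrected_char: str, observed: List[float], alpha: float = 0.2
) -> None:
    """Blend the observed shape vector into the stored vector of the character
    the user corrected to. alpha is an assumed blending weight in [0, 1]."""
    stored = db.get(corrected_char)
    if stored is None:
        # No entry yet: register the observed shape as is.
        db[corrected_char] = list(observed)
        return
    if len(stored) != len(observed):
        raise ValueError("feature vectors must have the same length")
    db[corrected_char] = [
        (1 - alpha) * s + alpha * o for s, o in zip(stored, observed)
    ]
```

For instance, with `db = {"O": [1.0, 0.0], "0": [0.0, 1.0]}` and an observed vector `[0.4, 0.6]`, `recognize` initially returns `"0"`. If the user corrects the result to `"O"` and `apply_correction(db, "O", [0.4, 0.6], alpha=0.5)` is called, the same observed vector is subsequently recognized as `"O"`.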
Another known technique improves recognition precision by updating a recognition dictionary after performing grammatical analysis. With this technique, a grammatical analysis is performed on the recognition results, characters that need to be corrected to grammatically correct characters are identified, and the recognition dictionary is updated so that the grammatically correct characters can be recognized without grammatical analysis.
A further known technique improves recognition precision by correcting recognition results through grammatical analysis, using the appearance frequency of words for the correction. With this technique, if, during the grammatical analysis of the recognition results, plural words are possible candidates for a character string in the recognition results, then one word is chosen based on how frequently the various words appear in the recognition results.
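The frequency-based choice among candidate words can be sketched as follows. This is only an illustration under assumptions: the function names `count_words` and `choose_candidate` are hypothetical, the recognition results are assumed to be already split into words, and ties are resolved in favor of the earlier candidate, which the related art does not specify.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_words(recognized_words: Iterable[str]) -> Counter:
    """Count how often each word appears in the recognition results."""
    return Counter(recognized_words)


def choose_candidate(candidates: Sequence[str], freq: Counter) -> str:
    """Pick the candidate with the highest appearance frequency.

    Words absent from freq count as zero; on a tie, the earliest
    candidate in the sequence is kept (an assumed tie-breaking rule).
    """
    if not candidates:
        raise ValueError("at least one candidate word is required")
    return max(candidates, key=lambda word: freq[word])
```

For example, if the recognition results contain "form" twice and "farm" once, `choose_candidate(["farm", "form"], freq)` selects "form" for an ambiguous character string.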
In OCR of printed documents, it may be possible to increase the recognition precision by using, for the character recognition, a feature vector database that is adapted to the fonts used for the printing. For example, the number of fonts used in a limited environment, such as a company or a department, is limited, so that it is possible to prepare a feature vector database that is sufficiently adapted to the fonts used for printing. The recognition precision for documents within that limited environment will then improve if such a feature vector database is used.
Moreover, in OCR of handwritten documents, it may be possible to increase the recognition precision by using, for the character recognition, a feature vector database that is adapted to the authors of those documents. For example, the number of persons who prepare handwritten documents within the above-mentioned limited environment is limited, so that it is possible to prepare a feature vector database that is sufficiently adapted to the authors of those documents. The recognition precision for documents within that limited environment will then improve if such a feature vector database is used.
Moreover, if a grammatical analysis is added, as in the technologies described above, then it may be possible to improve the recognition precision by performing a grammatical analysis that is adapted to the above-noted limited environment. For example, if uncommon words that are used within this limited environment are registered in a dictionary for grammatical analysis, then it is possible to reduce the number of unknown words (unregistered words), which are one cause of lowered precision in grammatical analysis, thereby increasing the recognition precision. As another example, it is also conceivable to increase the recognition precision by registering, in the dictionary for grammatical analysis, the usage frequency of the various words used in the above-noted limited environment, and performing the grammatical analysis based on these usage frequencies.
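The effect of registering environment-specific words can be sketched as below. The dictionary for grammatical analysis is modeled here, purely as an assumption, as a mapping from word to usage frequency; the names `unknown_words` and `register_usage` and the sample word "KX-200" are hypothetical.

```python
from typing import Dict, Iterable, List


def unknown_words(tokens: Iterable[str], lexicon: Dict[str, int]) -> List[str]:
    """Return the tokens that are not registered in the lexicon."""
    return [token for token in tokens if token not in lexicon]


def register_usage(lexicon: Dict[str, int], tokens: Iterable[str]) -> None:
    """Register each token and accumulate its usage frequency."""
    for token in tokens:
        lexicon[token] = lexicon.get(token, 0) + 1
```

With a lexicon containing only "the" and "report", the token list `["the", "KX-200", "report"]` yields one unknown word. After `register_usage(lexicon, ["KX-200", "KX-200"])`, the same token list yields none, and the lexicon records a usage frequency of 2 for "KX-200".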
Thus, it is possible to increase the recognition precision by performing a recognition process that is adapted to the characteristics of the documents subjected to OCR. However, in any of these cases, it is necessary to register, in advance, information that is adapted to the characteristics of the documents subjected to OCR in a dictionary used for the recognition. Furthermore, in order to attain a dictionary that is sufficiently adapted to the limited environment, a large amount of information that is adapted to the characteristics of the documents subjected to OCR within the limited environment has to be collected in advance. So far, no technique for collecting this information efficiently has been proposed.
The present invention has been made in view of the above circumstances and provides a technique for the efficient collection of data that contributes to an improvement of estimation accuracy when estimating characters in image data obtained by optically reading a document within a limited environment, without requiring any additional effort.