Recently, with the enforcement of a law that permits documents to be stored electronically, demand for Optical Character Recognition (OCR) has been increasing, and the situations in which OCR is used have become more diverse. Until now, a dictionary containing a finite number of fixed categories (also called character codes), such as the characters in Japanese Industrial Standards (JIS) first level, Japanese hiragana and katakana, symbols, and the characters in JIS second level, could satisfy the requirements of typical users. However, as users become more diverse, the categories required for OCR differ from user to user. Therefore, it is necessary to cope with such a situation.
A conventional OCR apparatus has a user dictionary as a framework for flexibly adding categories for each user. In this mechanism, when the user manually cuts out and registers a character to be recognized (including patterns such as symbols; hereinafter, the word “character” includes such patterns), a feature vector of the character is registered. In the subsequent recognition processing, the character registered in the user dictionary can then be recognized and a recognition result obtained, even if the character is not registered in the system dictionary.
On the other hand, methods that use distribution characteristics, such as the Modified Quadratic Discriminant Function (MQDF), have recently come into use in order to improve the accuracy of character recognition. Such a method uses the distribution of the learning samples for each character code, and thereby realizes more accurate character recognition than recognition based on the conventional city block distance, which is calculated using only the average vector of the feature vectors of the learning samples.
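The difference between the two measures can be illustrated with a minimal sketch. It assumes a diagonal variance-covariance matrix, so that the eigenvalues are simply the per-dimension variances; the function names, the parameters `k` and `delta`, and the sample values are hypothetical and are not taken from any cited apparatus.

```python
import math

def city_block(x, mean):
    # Uses only the average vector of the learning samples.
    return sum(abs(a - m) for a, m in zip(x, mean))

def mqdf_diag(x, mean, variances, k, delta):
    # MQDF-style score under the ASSUMPTION of a diagonal covariance:
    # the k largest variances are kept as they are, and the remaining
    # ones are replaced by the constant delta. Smaller is better.
    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    major = set(order[:k])
    score = 0.0
    for i, (a, m) in enumerate(zip(x, mean)):
        lam = variances[i] if i in major else delta
        score += (a - m) ** 2 / lam + math.log(lam)
    return score

# Hypothetical two-dimensional classes: "A" is spread widely along the
# first dimension, "B" is tightly concentrated around its mean.
classes = {
    "A": {"mean": (0.0, 0.0), "var": (4.0, 0.05)},
    "B": {"mean": (3.0, 0.0), "var": (0.05, 0.05)},
}
x = (1.8, 0.0)

by_city_block = min(classes, key=lambda c: city_block(x, classes[c]["mean"]))
by_mqdf = min(
    classes,
    key=lambda c: mqdf_diag(x, classes[c]["mean"], classes[c]["var"], k=1, delta=0.05),
)
print(by_city_block, by_mqdf)
```

In this sketch the city block distance selects "B" because its average vector is nearer, whereas the distribution-aware score selects "A" because the sample lies well within the wide spread of "A" and far outside the narrow spread of "B".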
Under such a situation, for example, Japanese Laid-open Patent Publication No. 08-16725 discloses a technique that simplifies the job of registering new character information into a character recognition dictionary that includes, for each character, a feature vector and a variance-covariance matrix, and that constructs the dictionary with high accuracy. Specifically, an image of an unknown character, which is not included in the recognition dictionary, is read, and feature vector data is extracted from the character image. Next, the character whose feature vector is closest to the extracted feature vector is retrieved from the recognition dictionary. When the unknown character is registered, its character code and feature vector are stored in the dictionary, and the variance-covariance matrix of the previously retrieved character is stored as the variance-covariance matrix of the unknown character. This technique has no user dictionary. Therefore, registering unknown characters into the recognition dictionary causes problems, for example, that the size of the recognition dictionary becomes large.
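The registration procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the publication's actual data layout and closeness measure are not given here, so the dictionary structure, the squared Euclidean distance, and all names and values are hypothetical.

```python
import copy

def squared_euclidean(a, b):
    # ASSUMPTION: the closeness measure is not specified in the text above.
    return sum((p - q) ** 2 for p, q in zip(a, b))

def register_unknown(dictionary, new_code, feature):
    # Retrieve the existing character whose feature vector is closest.
    nearest = min(
        dictionary,
        key=lambda code: squared_euclidean(feature, dictionary[code]["mean"]),
    )
    # Store the new code and feature vector, and reuse the retrieved
    # character's variance-covariance matrix for the new entry.
    dictionary[new_code] = {
        "mean": list(feature),
        "cov": copy.deepcopy(dictionary[nearest]["cov"]),
    }
    return nearest

def stored_values(dictionary):
    # Number of numeric values held by the dictionary.
    return sum(
        len(e["mean"]) + sum(len(row) for row in e["cov"])
        for e in dictionary.values()
    )

# Hypothetical two-dimensional recognition dictionary.
dictionary = {
    "A": {"mean": [0.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]]},
    "B": {"mean": [5.0, 5.0], "cov": [[2.0, 0.5], [0.5, 2.0]]},
}
before = stored_values(dictionary)
nearest = register_unknown(dictionary, "X", [4.5, 5.2])
after = stored_values(dictionary)
print(nearest, after - before)
```

For an n-dimensional feature vector, each registration in this sketch adds n values for the feature vector and n × n values for the copied matrix, which illustrates why the recognition dictionary grows quickly when every unknown character is stored this way.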
Returning to the user dictionary, distribution information is not registered in it. Therefore, a method that uses distribution information, such as MQDF, cannot be applied to the characters registered in the user dictionary. Specifically, two types of distance values, namely city block and MQDF, become mixed when determining how characters are cut out or how candidate characters are ranked. As a result, when the user dictionary and a technique such as MQDF are used simultaneously, the recognition accuracy becomes lower than in a case where they are not used simultaneously.
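The mixing problem can be illustrated with a short sketch. It assumes that system dictionary entries are scored with a diagonal quadratic score (a simplified stand-in for MQDF) and that user dictionary entries, which hold no distribution information, are scored with the city block distance; the entries and values are hypothetical.

```python
import math

def city_block(x, mean):
    return sum(abs(a - m) for a, m in zip(x, mean))

def quadratic_score(x, mean, variances):
    # Simplified stand-in for MQDF with a diagonal covariance (ASSUMPTION).
    # The log terms mean the score can be negative.
    return sum((a - m) ** 2 / v + math.log(v) for a, m, v in zip(x, mean, variances))

# Hypothetical entries: the system dictionary stores variances,
# the user dictionary stores only a feature vector.
system_dictionary = {"S": {"mean": (1.5, 1.0), "var": (0.05, 0.05)}}
user_dictionary = {"U": {"mean": (1.0, 1.0)}}

x = (1.0, 1.0)  # identical to the user-registered feature vector

candidates = []
for code, e in system_dictionary.items():
    candidates.append((code, quadratic_score(x, e["mean"], e["var"])))
for code, e in user_dictionary.items():
    candidates.append((code, city_block(x, e["mean"])))

# Sorting the raw values together treats two different scales as one.
ranking = sorted(candidates, key=lambda item: item[1])
print(ranking)
```

Here the user-registered character matches the input exactly and receives a city block distance of zero, yet the system dictionary entry still ranks first because its score is negative. The two values are not on a common scale, so a single ordering of candidates built from them is unreliable.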
In short, the conventional techniques include no character recognition processing technique that improves the recognition accuracy without excessively enlarging the size of the dictionary.