The present invention relates to a character recognition device which divides into subregions the area of a single character in scanned character image data to obtain a character code based on a quantification of the features of the subregions.
A conventional character recognition device operates as follows. Each single character in the character image data scanned from a source text is divided into a series of contiguous rectangular subregions. The features of the image data in each of these subregions are then extracted, and the extracted feature data is used to determine the character code corresponding to the image data for that single character, thereby recognizing the scanned character.
One of the features of the image data evaluated in each subregion is average density, and a method which uses the average density as one feature of the subregion is the "mesh method". The mesh method determines the character code for the scanned character image data by generating a mesh pattern in which the feature is assigned a value of 1 when the average density of the subregion exceeds a predetermined threshold value, and is assigned a value of 0 when the threshold is not exceeded. The mesh pattern is then compared with standard character patterns similarly generated from the standard character image data for each of the possible candidate characters to count the number of subregions for which these assigned values differ. The character is thus recognized to be that character for which the number of differing subregions in the standard and scanned character patterns is smallest.
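The mesh method described above can be sketched in a few lines. The following is a minimal illustration only, not the implementation of any particular device: it assumes a binary image (1 = ink, 0 = background) at least as large as the mesh, and the names `mesh_pattern`, `differing_subregions`, and `recognize`, the 2 x 2 mesh, the threshold of 0.5, and the two toy "standard characters" are all hypothetical.

```python
from typing import Dict, List, Sequence

Image = Sequence[Sequence[int]]  # rows of pixels: 1 = ink, 0 = background


def mesh_pattern(image: Image, rows: int, cols: int, threshold: float) -> List[int]:
    """Return one value per subregion: 1 if its average density exceeds the threshold, else 0."""
    height, width = len(image), len(image[0])
    bits: List[int] = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            ink = sum(image[y][x] for y in range(top, bottom) for x in range(left, right))
            area = (bottom - top) * (right - left)  # non-zero because image >= mesh size (assumed)
            bits.append(1 if ink / area > threshold else 0)
    return bits


def differing_subregions(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the subregions whose assigned values differ between two mesh patterns."""
    return sum(x != y for x, y in zip(a, b))


def recognize(image: Image, standards: Dict[str, List[int]],
              rows: int, cols: int, threshold: float) -> str:
    """Return the code of the standard pattern with the fewest differing subregions."""
    pattern = mesh_pattern(image, rows, cols, threshold)
    return min(standards, key=lambda code: differing_subregions(pattern, standards[code]))


# Toy standard character images (hypothetical): a vertical bar and a horizontal bar.
VERTICAL = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
HORIZONTAL = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
STANDARDS = {
    "vertical": mesh_pattern(VERTICAL, 2, 2, 0.5),
    "horizontal": mesh_pattern(HORIZONTAL, 2, 2, 0.5),
}

# A scanned vertical bar with one missing pixel.
SCANNED = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
RESULT = recognize(SCANNED, STANDARDS, 2, 2, 0.5)
```

In this toy case the scanned image yields the same mesh pattern as the vertical bar, so `RESULT` is `"vertical"`. Note that the features depend only on the image data itself, not on which other characters are candidates.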
As thus described, character recognition devices employing a mesh method as above directly extract the features of each subregion from the image data in that subregion (i.e., the features are for the image data itself). As a result, when the characteristics of specific hiragana (one of the two Japanese "kana" syllabaries) are extracted, the features of a specific hiragana extracted from a sentence written only with hiragana, and the features of said same hiragana extracted from a sentence containing both hiragana and JIS level-1 kanji characters are the same.
However, the features of differences in character shape in a character group comprising only hiragana (of which there are 46 in total) are different from the features of differences in character shape in a character group comprising both JIS level-1 kanji and hiragana (of which there are approximately 3000 in total). As a result, during recognition of a specific hiragana, the features recognized when that hiragana is part of a string consisting of only hiragana, and the features recognized when that hiragana is part of a string consisting of both hiragana and kanji, may reasonably be expected to be different.
Because a conventional character recognition device as described above directly extracts the features of each subregion from the image data, it is possible to express the features of the image data in that subregion, but it is not possible to express the features of the differences in character shapes in the character recognition group. As a result, there is a difference in the ability to recognize a given character when said character is contained in a hiragana-only string and when the same character is contained in a mixed string of hiragana and JIS level-1 kanji.
In addition, when the area of a single character is divided into subregions, it is divided into a series of uniform contiguous rectangles. The character recognition performance of the device is therefore also reduced during recognition of handwritten text: the positions of the lines composing a character vary from writer to writer, so that lines composing the same character occupy different subregions in the single character area of the standard character and in the single character area of the character to be recognized.
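The effect of a displaced line on a uniform rectangular mesh can be illustrated with a small sketch. It is an illustration under stated assumptions only: the 4 x 4 binary image, the 2 x 2 mesh, the threshold of 0.25, and the helper names `mesh_pattern` and `vertical_stroke` are hypothetical and chosen so that a one-pixel shift moves the stroke across a subregion boundary.

```python
from typing import List


def mesh_pattern(image: List[List[int]], rows: int, cols: int, threshold: float) -> List[int]:
    """One value per uniform rectangular subregion: 1 if average density exceeds the threshold."""
    height, width = len(image), len(image[0])
    bits = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            ink = sum(image[y][x] for y in range(top, bottom) for x in range(left, right))
            bits.append(1 if ink / ((bottom - top) * (right - left)) > threshold else 0)
    return bits


def vertical_stroke(size: int, x: int) -> List[List[int]]:
    """A size-by-size image containing a single one-pixel-wide vertical line at column x."""
    return [[1 if col == x else 0 for col in range(size)] for _ in range(size)]


# Standard character: stroke in column 1. Handwritten sample: same stroke, one pixel to the right.
standard = mesh_pattern(vertical_stroke(4, 1), 2, 2, 0.25)
handwritten = mesh_pattern(vertical_stroke(4, 2), 2, 2, 0.25)

# Count the subregions whose assigned values differ.
mismatches = sum(a != b for a, b in zip(standard, handwritten))
```

Here the one-pixel displacement moves the stroke from the left-hand subregions to the right-hand ones, so all four subregions differ even though the character shape is unchanged.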
Moreover, because the area of each rectangle is equal when the single character area is divided into subregions as described above, each subregion is not a shape which can contain elements in which the differences in character shapes in the character string being recognized are well expressed. Therefore, the features of these subregions cannot sufficiently express the character shape differences in the recognition character string, and when it is attempted to recognize characters based on the features of the subregions, it is necessary to obtain the features for all subregions comprising the single character area, thus resulting in low efficiency in the character recognition process.