Character recognition technology has been proposed for digitizing a printed document original, such as one created with a word processor or the like, so that an information processing apparatus such as a computer can handle it. In this technology, the document original is read by an image scanner or the like, the characters on it are recognized, and the recognized characters are converted into character codes (for alphanumeric characters, hiragana characters, Chinese characters, and so on) and saved.
After recognition, language processing is generally applied to the recognized character string to correct character recognition errors. In a typical approach, the string is matched against a word dictionary, basically by head matching, and either a candidate found in the dictionary or a candidate rated as appropriate by language analysis such as morphological analysis is assumed to be correct. The character string of the recognition result is then modified accordingly.
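The dictionary-based correction described above can be illustrated with a minimal sketch. It is not taken from any cited publication: the function names, the one-mismatch limit, and the rule of preferring the longest matching head are all assumptions made for illustration.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def correct_token(token: str, dictionary: set, max_mismatches: int = 1) -> str:
    """Replace token with a dictionary word if one fits, else keep it.

    Assumed rule (hypothetical): a candidate must have the same length,
    differ in at most max_mismatches characters, and share a non-empty
    head (prefix) with the token. The longest shared head wins.
    """
    if token in dictionary:
        return token
    best, best_prefix = None, 0
    for word in sorted(dictionary):  # sorted for deterministic tie-breaking
        if len(word) != len(token):
            continue
        mismatches = sum(x != y for x, y in zip(word, token))
        if mismatches > max_mismatches:
            continue
        prefix = common_prefix_len(word, token)
        if prefix > best_prefix:
            best, best_prefix = word, prefix
    return best if best is not None else token


def correct_line(line: str, dictionary: set) -> str:
    """Correct each space-delimited token of a recognized line."""
    return " ".join(correct_token(t, dictionary) for t in line.split(" "))
```

With the hypothetical dictionary {"invoice", "total", "amount"}, correct_line("invo1ce tota1", ...) yields "invoice total". A line recognized as "t o t a l", in which every character is followed by a space, is returned unchanged, because no single-character token matches a dictionary word.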
However, in some document originals, such as Japanese business forms in which characters are arranged at a fixed interval within predetermined boxes, the characters are equally spaced. In such a document original, a space between characters that is not actually a word separator is regarded as a word separator, so matching with the word dictionary fails and the correction processing cannot achieve a sufficient effect.
Japanese Laid-Open Patent Publication No. 8-263587 discloses a technology intended to solve this problem. In that technology, the space between the image of a character cut out from a character string image representing one line and the image of the adjacent character is detected. When the detected space is larger than a predetermined size, the two characters are identified as belonging to different words. For a character string image within a predetermined area of the scanned document image, however, this identification result is invalidated.
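The gap-based identification described above might be sketched as follows. This is an illustrative reading of the description, not the implementation of the cited publication: the representation of a character image as a (left, right) pair of horizontal pixel coordinates, the name word_boundaries, and the modelling of the predetermined area as a horizontal range are all assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int]  # (left, right) horizontal extent of one character image


def word_boundaries(
    boxes: List[Box],
    threshold: int,
    ignore_area: Optional[Tuple[int, int]] = None,
) -> List[bool]:
    """For each adjacent pair of characters, report whether they are
    identified as belonging to different words.

    A pair is a boundary when the gap between the characters is larger
    than threshold. If both characters lie inside ignore_area (the
    hypothetical predetermined area), that identification is invalidated.
    """
    result = []
    for current, following in zip(boxes, boxes[1:]):
        gap = following[0] - current[1]
        is_boundary = gap > threshold
        if is_boundary and ignore_area is not None:
            lo, hi = ignore_area
            if current[0] >= lo and following[1] <= hi:
                is_boundary = False  # inside the area: treat as the same word
        result.append(is_boundary)
    return result
```

For the hypothetical boxes [(0, 10), (12, 22), (40, 50), (70, 80)] and a threshold of 5, the gaps are 2, 18, and 20, so the result is [False, True, True]. Passing ignore_area=(30, 100) invalidates only the last boundary, because only the last two characters lie entirely inside that range.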
The same problem arises not only in Japanese business forms but also in a document original containing characters in a fixed-pitch font such as MS Gothic, a font originally intended for displaying and printing Japanese characters and the like. Specifically, with a fixed-pitch font, the space in front of or behind a character whose width is relatively narrow, such as “i”, is not a space character serving as a word separator but is nevertheless recognized as one, making it impossible to sufficiently obtain the effect of the correction processing.
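The fixed-pitch problem can be made concrete with a small numerical sketch. All values here are hypothetical: the pitch of 10 pixels, the ink widths of the glyphs, the centering of each glyph in its cell, and the threshold of 4 are assumptions chosen only to show how a narrow character produces wide gaps.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int]  # (left, right) extent of the inked part of a glyph


def glyph_boxes(text: str, pitch: int, ink_widths: Dict[str, int]) -> List[Box]:
    """Place each glyph centered in a cell of fixed width pitch."""
    boxes = []
    for index, ch in enumerate(text):
        width = ink_widths[ch]
        left = index * pitch + (pitch - width) // 2
        boxes.append((left, left + width))
    return boxes


def gaps(boxes: List[Box]) -> List[int]:
    """Blank width between each pair of adjacent glyphs."""
    return [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]


def split_by_gap(text: str, boxes: List[Box], threshold: int) -> List[str]:
    """Split text wherever the gap exceeds threshold (naive separator rule)."""
    if not text:
        return []
    words = [text[0]]
    for ch, gap in zip(text[1:], gaps(boxes)):
        if gap > threshold:
            words.append(ch)   # gap taken as a word separator
        else:
            words[-1] += ch
    return words


# Hypothetical ink widths: "i" is much narrower than the other letters.
INK = {"w": 8, "i": 2, "d": 8, "e": 8}
BOXES = glyph_boxes("wide", pitch=10, ink_widths=INK)
```

Under these assumptions the gaps for "wide" are [5, 5, 2], so split_by_gap("wide", BOXES, 4) returns ["w", "i", "de"]: the blank areas on both sides of the narrow “i” are taken for word separators even though no space character is present.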
Japanese Laid-Open Patent Publication No. 8-263587 neither discloses nor suggests this problem.