The present invention relates to a character recognition post-processing method, and in particular to a post-processing method that obtains a correct recognition result by correcting the recognition result for each character in a sentence (such as an English language sentence) composed of character strings separated by spaces.
Conventional technology for character recognition of English language text includes "A character separation method stressing the size and position of characters in English language character recognition processing" (Suzuki et al., 1988 Electronic Information and Communications Society Spring National Conference, Preliminary Edition, Vol. 1, page 191). The technology disclosed in this literature takes as a premise that all of the characters in a line have the same height. On that premise, a histogram of the upper ends and lower ends of the rectangles in contact with the outer edges of the characters in the line is used as the basis to extract two reference lines. The positions of the characters relative to these reference lines are then used to classify the characters in the line. This technology is termed Conventional Technology A.
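The reference line extraction described above can be illustrated with a minimal sketch. It is not the method of the cited literature: it assumes that the two reference lines are simply the most frequent upper-end and lower-end y-coordinates, and the function names, the tolerance `tol`, and the class labels are hypothetical.

```python
from collections import Counter

def reference_lines(boxes):
    """Return (upper_line, lower_line) for one text line.

    boxes: list of (top, bottom) pixel rows of each character's
    bounding rectangle, with y increasing downward.
    Assumption: each reference line is the histogram peak, i.e. the
    most frequent top value and the most frequent bottom value.
    """
    upper = Counter(top for top, _ in boxes).most_common(1)[0][0]
    lower = Counter(bottom for _, bottom in boxes).most_common(1)[0][0]
    return upper, lower

def classify(box, upper, lower, tol=1):
    """Classify one character by its position relative to the two lines."""
    top, bottom = box
    rises = top < upper - tol      # extends above the upper reference line
    drops = bottom > lower + tol   # extends below the lower reference line
    if rises and drops:
        return "full"
    if rises:
        return "ascender"
    if drops:
        return "descender"
    return "x-height"

# Illustrative boxes for the characters p, l, a, y, o, n (made-up coordinates).
boxes = [(10, 25), (4, 20), (10, 20), (10, 25), (10, 20), (10, 20)]
upper, lower = reference_lines(boxes)
labels = [classify(b, upper, lower) for b in boxes]
```

Because both lines are taken from histogram peaks over the whole line, the sketch depends on the same premise as the description: the characters in the line share one height.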
Apart from this technology, Japanese Patent Laid Open Publication No.162087-1982 discloses an optical character reading apparatus. In this apparatus, the positional relationship between the y-coordinates of a character image and those of the character images before and after it is examined, together with whether that relationship is consistent with the recognition results, and the outcome is used as the basis for increasing the reliability of the recognition results. This technology is termed Conventional Technology B.
In addition, Japanese Patent Laid Open Publication No.162086-1982 describes a method in which the difference between the center of a character image and its bottom point is held in a dictionary as a bottom point compensation value, and this value is used to perform compensation and thereby improve the reliability of the recognition results. This technology is termed Conventional Technology C.
Furthermore, Japanese Patent Laid Open Publication No.39175-1986 discloses a method that, in the case of similar characters, corrects the result of determining the character type according to the types of the characters before and after that character. This technology is termed Conventional Technology D.
Still furthermore, Japanese Patent Laid Open Publication No.30991-1988 discloses a character recognition apparatus. In this apparatus, the character types and order of the first candidate characters are used as the basis for determining which characters require correction and what their character type should be after correction. For each character requiring correction, a candidate having the character type judged to be the type after correction is selected from that character's group of recognition candidates, and the selected candidate is made the first candidate character. This technology is termed Conventional Technology E.
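As an illustration of the kind of neighbor-based correction attributed to Conventional Technology D, the following is a minimal sketch only. The look-alike table, the three coarse character types, and the function names are assumptions made for this example and are not taken from the cited publication.

```python
# Hypothetical table of similar characters (digit -> look-alike letter).
TO_LETTER = {"0": "O", "1": "l", "5": "S"}
TO_DIGIT = {letter: digit for digit, letter in TO_LETTER.items()}

def char_type(c):
    """Coarse character type used by this sketch."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "other"

def correct_by_neighbors(text):
    """Replace a similar character when both neighbors share one type.

    Neighbor types are read from the uncorrected input, so an earlier
    correction does not influence a later one.
    """
    chars = list(text)
    out = chars[:]
    for i in range(1, len(chars) - 1):
        before = char_type(chars[i - 1])
        after = char_type(chars[i + 1])
        if before != after or before == "other":
            continue  # neighbors disagree: no basis for a correction
        if before == "letter" and chars[i] in TO_LETTER:
            out[i] = TO_LETTER[chars[i]]
        elif before == "digit" and chars[i] in TO_DIGIT:
            out[i] = TO_DIGIT[chars[i]]
    return "".join(out)

fixed_word = correct_by_neighbors("W0RD")    # digit between letters
fixed_number = correct_by_neighbors("1O24")  # letter between digits
```

The sketch forces a character to the type of its neighbors, so a genuine mixed string such as "A1B" would also be rewritten.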
The following problems exist with each of the conventional technologies described above.
With Conventional Technology A, when the sentences to be recognized are skewed, it is difficult to obtain the peaks of the height histogram, and skew compensation therefore becomes necessary. In addition, the processing is performed in line units, so sentences in which the font size changes within a line cannot be processed. Furthermore, because the coordinates of the top end and the bottom end of the rectangle in contact with the periphery of each character are used, the influence of noise is great.
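The effect of skew on the histogram peaks can be seen in a small numeric sketch. The character spacing, baseline position, and skew angle below are made-up values, and the model (each lower end shifted by its x-position times the tangent of the skew angle) is a simplifying assumption.

```python
import math
from collections import Counter

def bottom_histogram(xs, baseline, skew_deg):
    """Histogram of character lower-end y-coordinates on a skewed line.

    xs: x-positions of the characters; baseline: y of the lower ends
    when the line is not skewed; skew_deg: skew angle in degrees.
    """
    slope = math.tan(math.radians(skew_deg))
    return Counter(baseline + round(x * slope) for x in xs)

xs = range(0, 400, 20)                     # 20 characters, 20 pixels apart
flat = bottom_histogram(xs, 100, 0.0)      # all lower ends fall in one bin
skewed = bottom_histogram(xs, 100, 3.0)    # lower ends spread over many bins

flat_peak = max(flat.values())
skewed_peak = max(skewed.values())
```

With no skew, every lower end falls in the same bin and the peak is unambiguous. With a skew of a few degrees the lower ends are spread over many bins, so no dominant peak remains from which to extract a reference line.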
Conventional Technology B cannot determine whether a character was recognized correctly or incorrectly when the relationship between the character being processed and the characters before and after it is contradictory. Accordingly, the number of rejected characters increases.
Conventional Technology C has difficulty handling sentences that contain multiple fonts, because the bottom point compensation values differ depending on the font used.
Conventional Technology D takes as a premise that the type of the character being processed is the same as the type of the characters before and after it, but there is no guarantee that this relationship holds in general. In addition, if Conventional Technology D is applied as it is to alphanumeric character OCR, the number of characters recognized as similar increases, making them difficult to correct.
With Conventional Technology E, the first candidate character types are used as the basis for determining the type of a character string. Unless pattern matching is performed in a manner that increases the reliability of the first candidate character types, the character type cannot be judged correctly. Accordingly, it is difficult to apply Conventional Technology E to multiple fonts.