The present invention generally relates to a character region extracting method suitable for character recognition, and to an apparatus capable of implementing the method.
As is well known, characters or character strings on a document are extracted as a preprocess of character recognition. For this purpose, a projection of the document image in the direction of alignment of the characters is extracted. Then a region in which the projection is continuous in the direction perpendicular to the character strings is segmented as a character line. Subsequently, a projection of the document image is extracted for each of the segmented character lines. Then a region in which the projection is continuous in the direction of the character strings is segmented as a region of a rectangular shape.
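The conventional two-pass projection procedure described above can be sketched as follows. This is a minimal illustration only, assuming a binary image given as a list of rows of 0/1 pixel values and horizontally written text; the names `nonzero_runs` and `segment_regions` are hypothetical and do not appear in the original description.

```python
def nonzero_runs(profile):
    """Return (start, end) pairs, end exclusive, of maximal runs of nonzero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of continuous projection begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ends at a gap in the projection
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_regions(image):
    """Segment character lines, then rectangular regions within each line.

    Assumption: `image` is a list of equal-length rows of 0/1 pixels and the
    character strings run horizontally.
    """
    # Pass 1: project along the character direction (sum each row).
    row_profile = [sum(row) for row in image]
    regions = []
    for top, bottom in nonzero_runs(row_profile):
        band = image[top:bottom]
        # Pass 2: project each character line onto the horizontal axis.
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in nonzero_runs(col_profile):
            regions.append((top, bottom, left, right))
    return regions


page = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]
regions = segment_regions(page)
```

In this toy image the blank third row separates two character lines, and each line yields two rectangular regions, reported as (top, bottom, left, right) with exclusive end coordinates.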
However, the conventional character region segmenting or extracting method has a disadvantage in that it cannot correctly segment character lines on a document in which a character string composed of relatively small characters is located adjacent, in the direction of the character string, to a relatively large character or to a character string composed of relatively large characters. This is because the projection of the character string composed of relatively small characters is included in the projection of the relatively large character or character string, so that the projection of the small characters cannot be discriminated from that of the large character or character string. As a result, the small characters cannot be segmented, and their recognition becomes impossible.
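The disadvantage described above can be reproduced with a small, purely illustrative example. The sketch assumes a binary image given as rows of 0/1 pixels with horizontal text; the layout and the helper name `nonzero_runs` are hypothetical. A tall character occupies the left columns, and two lines of small characters sit to its right.

```python
def nonzero_runs(profile):
    """Return (start, end) pairs, end exclusive, of maximal runs of nonzero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Columns 0-1: one large character spanning all five rows.
# Columns 3-4: two lines of small characters (rows 0-1 and rows 3-4).
page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]

# Projection of the whole image: the large character fills the gap at row 2,
# so the two small lines merge into a single character line.
whole_profile = [sum(row) for row in page]
lines_found = nonzero_runs(whole_profile)

# Projection of the small characters alone would show the gap.
small_profile = [sum(row[3:]) for row in page]
lines_expected = nonzero_runs(small_profile)
```

Here `lines_found` contains a single line covering rows 0 to 4, whereas `lines_expected` contains the two small lines separately, which is the situation the conventional method fails to resolve.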
In general, in character recognition for a language such as Japanese, which has a number of characters of similar patterns, it is difficult to select the correct character from a plurality of candidate characters with high accuracy by using a recognition process that operates character by character. Therefore, an improved character recognition method is being studied in which a candidate of the character string is compared with character lines which are related to categories and are stored in a knowledge dictionary, and one candidate character is identified for each character included in the character string by referring to the result of the comparison. In general, information indicating the category represented by the characters included in a character string is provided on the document, so that the collating sequence with the knowledge dictionary can be facilitated.
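One way to picture the collating sequence is sketched below. The description does not specify the dictionary format or the matching rule, so everything here is an assumption made for illustration: the knowledge dictionary is modeled as a mapping from a category name to a list of entries, each character position carries a list of candidate characters, and an entry matches when every one of its characters appears among the candidates at the same position. The function name `collate` and the sample data are hypothetical, with Latin letters and digits standing in for similar-looking characters.

```python
def collate(candidates, dictionary, category=None):
    """Return (category, entry) pairs consistent with the per-position candidates.

    Assumptions: `candidates` is a list with one list of candidate characters
    per position; `dictionary` maps a category name to a list of entries.
    If `category` is given, only that category is searched.
    """
    categories = [category] if category is not None else list(dictionary)
    matches = []
    for cat in categories:
        for entry in dictionary.get(cat, []):
            if len(entry) != len(candidates):
                continue
            # An entry matches when each character is among the candidates
            # produced for the same position.
            if all(ch in cands for ch, cands in zip(entry, candidates)):
                matches.append((cat, entry))
    return matches


# Hypothetical recognition output: each position has several similar-looking
# candidates that character-by-character recognition cannot tell apart.
candidates = [["O", "0", "Q"], ["S", "5"], ["A", "4"], ["K", "X"], ["A", "4"]]
dictionary = {
    "city": ["OSAKA", "TOKYO"],
    "name": ["OSADA"],
}

with_hint = collate(candidates, dictionary, category="city")
without_hint = collate(candidates, dictionary)
```

When a category indication is available, only the entries of that category are compared, which illustrates how such information on the document can reduce the work of the collating sequence.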
However, the processing speed of the collating sequence is not high at present.