1. Field of the Invention
The present invention relates to an apparatus and a method for extracting a character string, for use in recognizing characters on a paper sheet on which characters and non-character patterns are intermixed.
2. Description of the Related Art
Paper sheets on which characters and graphic patterns other than characters are mixedly present include printed matter such as slips, drawings, maps, documents, books and magazines, as well as handwritten memoranda.
In the field of techniques for automatically recognizing what appears on the paper sheets, a technique for separating graphic pattern regions and character regions is indispensable.
In the conventional character-region extracting method, attention has been paid to the fact that a character is generally smaller than a graphic pattern other than a character. In this method, a character region is extracted by measuring the size of a connected graphic region and comparing this size with a known character size. By this method, however, a character region cannot be extracted exactly if a character is in contact with a graphic pattern other than a character, because the two then form a single region larger than the known character size.
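The size-based approach described above, and the way it fails for a touching character, can be illustrated with a minimal sketch. This is not taken from any particular prior-art system: the function names, the 4-connectivity rule, the bounding-box size test, and the sample images are all assumptions chosen for illustration.

```python
from collections import deque

def connected_regions(image):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    regions of 1-pixels in a binary image given as a list of rows."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def character_candidates(image, max_char_size):
    """Keep only regions whose bounding box fits within the known
    character size (hypothetical criterion: both sides <= max_char_size)."""
    result = []
    for top, left, bottom, right in connected_regions(image):
        height = bottom - top + 1
        width = right - left + 1
        if height <= max_char_size and width <= max_char_size:
            result.append((top, left, bottom, right))
    return result

# A 2x2 "character" separated from a long ruled line.
ISOLATED = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

# The same "character" touching the ruled line.
TOUCHING = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(character_candidates(ISOLATED, 3))  # the isolated character is found
print(character_candidates(TOUCHING, 3))  # the touching character is lost
```

In the second image the character and the line form one connected region eight pixels wide, so the size test rejects the whole region and no character candidate is returned.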
A method for solving this problem has been proposed. In this method, in order to separate a character from a non-character graphic pattern with which it is in contact, a geometric shape such as a circle or a straight line is assumed for the contacting background pattern on the paper sheet, the assumed shape is extracted from the patterns on the paper sheet, and the remaining, non-extracted pattern is recognized as a character. With this approach, a non-character pattern is likely to be included among the patterns extracted as character candidates. Such a non-character pattern is simply removed from the character candidates on the basis of geometric characteristics, such as its size and its positional relationship to a character candidate pattern. Consequently, when the geometric characteristics of the character are similar to those of the background pattern, the character string cannot be extracted exactly.
Many characters, such as Chinese characters comprising a left-hand radical and a right-hand radical, or blurred characters, are each formed by merging adjacent graphic patterns. A character string, in turn, is formed by merging characters. When graphic patterns are merged, it is necessary to determine the range of graphic patterns to be merged (hereinafter referred to as "pattern merging range"). Since this range varies depending on the size of each character or the interval between characters (hereinafter referred to as "character interval"), it must be properly determined for each area on the paper sheet if characters and character strings of different sizes are mixedly present on the paper sheet. In the conventional character region extracting method, the pattern merging range must be determined in advance. If the location at which a character is to be written is unknown, the same value of the pattern merging range is applied to the entire area of the paper sheet. If the size of each character or the character interval differs from paper sheet to paper sheet, the value of the pattern merging range must be varied for each paper sheet. In addition, if characters of different sizes and character strings of different character intervals are mixedly present on the same paper sheet, the above conventional method is not applicable.
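The limitation of a single, preset pattern merging range can be shown with a small sketch. Everything here is an illustrative assumption rather than a description of a specific prior-art apparatus: patterns are represented by bounding boxes (left, top, right, bottom), two patterns are merged when the gap between their boxes does not exceed the merging range, and the coordinates are invented for the example.

```python
def box_gap(a, b):
    """Gap between two boxes (left, top, right, bottom): the larger of the
    horizontal and vertical separations, or 0 if the boxes overlap."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)

def merge_patterns(boxes, merging_range):
    """Merge every group of boxes linked by gaps <= merging_range and
    return the sorted bounding boxes of the merged groups."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= merging_range:
                parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    merged = []
    for members in groups.values():
        merged.append((min(b[0] for b in members), min(b[1] for b in members),
                       max(b[2] for b in members), max(b[3] for b in members)))
    return sorted(merged)

# Hypothetical sheet: a string of two small characters (interval 2), an
# unrelated small pattern 8 units to its right, and a string of two large
# characters (interval 8) further down the sheet.
BOXES = [
    (0, 0, 10, 10), (12, 0, 22, 10),        # small string
    (30, 0, 40, 10),                        # unrelated small pattern
    (0, 100, 40, 140), (48, 100, 88, 140),  # large string
]

print(merge_patterns(BOXES, 2))  # small string merged, large string split
print(merge_patterns(BOXES, 8))  # large string merged, unrelated pattern absorbed
```

With a merging range of 2 the large characters stay separate, and with a merging range of 8 the unrelated small pattern is absorbed into the small string, so no single value applied to the whole sheet yields both strings correctly.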
In most of the prior art, it is necessary either to assume that the direction of the character string is horizontal or vertical, or to assume that a mark indicating the direction of the character string, for example a long line element drawn parallel to the character line, is present near the character string. Consequently, it is difficult to exactly extract the character string from the face of an ordinary paper sheet on which character strings are arranged at arbitrary locations in arbitrary directions.
In the case where the sizes of characters, the character intervals and the directions of character strings are unknown, it is still more difficult to form a character string in an area where graphic patterns are concentrated. In particular, when many non-character patterns are included in the character candidate patterns, it is necessary to determine which patterns should be treated as characters and incorporated in the character string, to determine the size of the characters, and to extract the character string while assuming its direction. Thus, exact extraction of the character string is difficult and time-consuming.
In the conventional character string region extracting apparatus and method, since the geometric shape of the background pattern needs to be assumed in order to extract a character region in contact with the background pattern, exact character region extraction cannot be performed when it is difficult to assume the shape of the background pattern. In particular, if a non-character pattern is included in the graphic patterns extracted as character candidates, the criterion for determining whether or not a candidate is a real character pattern is limited to geometric features alone. Thus, it is difficult to precisely extract the character string when a background graphic pattern similar in size to a character is present in the surrounding area.
In the conventional apparatus and method for extracting a character string, when a single character candidate, for example a Chinese character, is formed by coupling several graphic patterns, or when several character candidates are coupled to form a character string candidate, it is necessary to determine the pattern merging range. In the prior art, the value of the pattern merging range is preset and the same value is applied to the entire face of the paper sheet. Consequently, when the size of the characters or the character interval varies from sheet to sheet, the value of the pattern merging range must be changed for each sheet. Furthermore, if characters of different sizes and character strings of different character intervals are mixedly present on the same paper sheet, the above conventional method is not applicable.
In most of the prior art, it is presupposed that character strings are to be read horizontally or vertically. Consequently, it is difficult to exactly extract the character string from the face of an ordinary paper sheet on which character strings are arranged at arbitrary locations in arbitrary directions.
In the case where the sizes of characters, the character intervals and the directions of character lines are unknown, it is still more difficult to form a character string in an area where graphic patterns are concentrated, in particular, in an area where many non-character patterns are included in character candidate patterns.