1. Field of the Invention:
The present invention relates to an optical character recognition system, and also to a method of setting reference lines which are useful in character recognition by an apparatus such as an OCR (Optical Character Reader) for recognizing printed alphanumeric characters.
2. Description of the Prior Art:
When alphanumeric characters printed on a paper sheet are to be recognized using an OCR, image data of the characters are first input to the OCR, and a string of character image data is isolated from the input data. Hereinafter, such a string of character image data is referred to as "a character row". Then, reference lines (generally, two reference lines) are formed in the character row.
Reference lines are virtual or assumed lines which are set in the direction of the character row so as to extend respectively along the upper and lower extracting ordinates (or upper and lower extracting lines) of characters having neither upward projecting portions nor downward projecting portions, i.e., characters such as "a", "c", "e", "m", "n", "o", "r", "s", "u", "v" and "w". These characters are hereinafter referred to as "reference line characters".
Such reference lines are used in character recognition in order to differentiate similar characters (e.g., capital and small letters such as "S" and "s", "C" and "c", etc.) or marks of the same shape but in different positions (e.g., "'" and ",", "." and ".", etc.). These similar characters or same-shaped marks can be distinguished by detecting their positions relative to the reference lines.
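By way of illustration only, the positional test described above may be sketched as follows. The sketch assumes image coordinates in which the vertical ordinate increases downward. The function names, the tolerance `tol`, and the decision rules are hypothetical and are not taken from any particular OCR.

```python
def classify_letter_case(top, upper_ref, tol=1):
    """Distinguish e.g. "S" from "s" by the top edge of the character.

    Assumption: a capital letter rises above the upper reference line,
    whereas a small letter starts at (or within `tol` pixels of) it.
    """
    return "capital" if top < upper_ref - tol else "small"


def classify_mark(bottom, lower_ref, tol=1):
    """Distinguish "'" from "," by the bottom edge of the mark.

    Assumption: an apostrophe ends well above the lower reference line,
    whereas a comma reaches or crosses it.
    """
    return "apostrophe" if bottom < lower_ref - tol else "comma"


# Hypothetical reference lines at ordinates 10 (upper) and 20 (lower).
print(classify_letter_case(top=2, upper_ref=10))   # capital
print(classify_letter_case(top=10, upper_ref=10))  # small
print(classify_mark(bottom=6, lower_ref=20))       # apostrophe
print(classify_mark(bottom=23, lower_ref=20))      # comma
```

Because both rules depend only on the ordinates of the reference lines, any error in setting those lines is carried directly into the classification.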
In a conventional system of setting reference lines from a character row, coordinates of pixels which constitute the character images are first detected. Then, a histogram is prepared to obtain the frequency distribution, i.e., the number of pixels lying on each horizontal line. From the resulting histogram, two points at which the frequency distribution exhibits the greatest change along the vertical axis are detected. Two horizontal lines which respectively pass through these two points are determined as reference lines. In other conventional systems of setting reference lines, a histogram prepared from horizontal line segments alone (Japanese Laid-open Patent Publication No. 64-29986) is used; a weighted histogram prepared along the horizontal axis is used; or the results of character recognition are utilized (Japanese Laid-open Patent Publication No. 63-216189).
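The histogram-based procedure described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited publication: the character row is assumed to be a binary image given as a list of rows (1 for a character pixel, 0 for background), and "greatest change" is interpreted here as the largest rise and the largest fall between adjacent histogram bins. The names `row_histogram`, `reference_lines` and `SAMPLE_ROW` are hypothetical.

```python
def row_histogram(image):
    """Count the character pixels lying on each horizontal line."""
    return [sum(row) for row in image]


def reference_lines(image):
    """Return (upper, lower) row indices of the assumed reference lines."""
    hist = row_histogram(image)
    # Pad with zeros so that changes at the image borders are also measured.
    padded = [0] + hist + [0]
    diffs = [padded[i + 1] - padded[i] for i in range(len(padded) - 1)]
    # Largest rise: first row of the dense body of the characters.
    upper = max(range(len(diffs)), key=lambda i: diffs[i])
    # Largest fall: the row just after the last row of the dense body.
    lower = min(range(len(diffs)), key=lambda i: diffs[i]) - 1
    return upper, lower


# Hypothetical row: one ascender (rows 0-1), two character bodies
# (rows 2-5) and one descender (rows 6-7).
SAMPLE_ROW = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]

print(row_histogram(SAMPLE_ROW))    # [1, 1, 6, 4, 4, 6, 1, 1]
print(reference_lines(SAMPLE_ROW))  # (2, 5)
```

In this example the histogram rises sharply at row 2 and falls sharply after row 5, so these two rows are taken as the upper and lower reference lines.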
The conventional system utilizing a histogram prepared from the number of pixels on each horizontal line has the drawback that, when a paper sheet is not appropriately placed in the OCR, character strings printed on the sheet are inclined with respect to a reading unit of the OCR, so that the OCR cannot detect areas between adjacent character strings. More specifically, the pixels of an inclined character string are spread over many horizontal lines, so that no significant change appears along the vertical axis of the resulting histogram. Thus, reference lines cannot be accurately set.
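The effect of inclination on the histogram may be illustrated by the following sketch. It assumes a simplified character row modeled as a solid horizontal band, and an inclination modeled as a vertical shift of one pixel for every `columns_per_row` columns. All names and dimensions are hypothetical.

```python
def row_histogram(image):
    """Count the character pixels lying on each horizontal line."""
    return [sum(row) for row in image]


def max_step(hist):
    """Largest change between adjacent histogram bins (zero-padded)."""
    padded = [0] + hist + [0]
    return max(abs(b - a) for a, b in zip(padded, padded[1:]))


def make_band(height, width, top, bottom):
    """Solid band of character pixels occupying rows top..bottom."""
    return [[1 if top <= y <= bottom else 0 for _ in range(width)]
            for y in range(height)]


def skew(image, columns_per_row):
    """Shift each column downward by one pixel per `columns_per_row` columns."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny = y + x // columns_per_row
            if image[y][x] and ny < height:
                out[ny][x] = 1
    return out


straight = make_band(height=12, width=40, top=4, bottom=7)
tilted = skew(straight, columns_per_row=10)

print(row_histogram(tilted))              # [0, 0, 0, 0, 10, 20, 30, 40, 30, 20, 10, 0]
print(max_step(row_histogram(straight)))  # 40
print(max_step(row_histogram(tilted)))    # 10
```

For the straight band the histogram jumps by 40 pixels at each edge of the band, whereas for the inclined band the same edges are spread into steps of only 10 pixels, so that the points of greatest change no longer mark the upper and lower boundaries clearly.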
In the conventional system utilizing the results of character recognition, the accuracy in setting reference lines depends on the accuracy in the character recognition. Thus, when characters cannot be accurately recognized, reference lines cannot be set with accuracy.