The present invention relates to an optical character reader (OCR) and an optical character reading method.
Conventionally, an optical character reading apparatus is used as a data input means for computers. The conventional apparatus reads an image from a document by exposing the document to light from a light source, and inputs the information constituting the read image, i.e., information identifying each character (alphanumeric or symbol). To input various characters, the optical character reading apparatus divides the read image into character lines by detecting the space between adjacent character lines, thereby allowing each character on each line to be identified.
To discriminate between a capital letter and the corresponding small letter of similar shape (such as "C" and "c" or "S" and "s") and between similar symbols (such as "." and "."), and to identify a space between words, the conventional apparatus uses the concept of base lines. FIG. 13 illustrates this concept.
Basically, letters, numerals and symbols (hereinafter referred to as characters) are written with their bottom portions flush with a hypothetical line l1. Although the characters "p" and "y" project below the line l1, their major portions rest on the line l1. The characters "t" and "h" project upward from another hypothetical line l2, but their major portions are located under the line l2. In other words, the major portion of each character is written between the hypothetical lines l1 and l2. The hypothetical lines l1 and l2 are referred to below as common lines.
According to the concept of base lines, another hypothetical line l3 is located below the common line l1, and still another hypothetical line l4 is located above the common line l2. The hypothetical lines l3 and l4 define the lower and upper limit positions, respectively, of each written character.
By detecting the common lines l1 and l2, it is possible to discriminate between similar capital and small letters and identify symbols.
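As an illustration only, the following minimal Python sketch shows how the positions of the common lines l1 and l2 could separate similarly shaped characters by their vertical extent. The function name classify_extent, the tolerance tol, the returned labels, and the convention that row numbers increase downward are all assumptions made for this sketch and are not taken from the conventional apparatus.

```python
def classify_extent(top, bottom, l1, l2, tol=2):
    """Classify a character by its vertical extent relative to the common lines.

    top, bottom: first and last image rows occupied by the character
                 (assumption: row numbers increase downward).
    l1: row of the lower common line; l2: row of the upper common line.
    tol: hypothetical tolerance, in rows, for small misalignments.
    """
    projects_above = top < l2 - tol      # reaches above the upper common line l2
    projects_below = bottom > l1 + tol   # reaches below the lower common line l1
    if projects_above and projects_below:
        return "above-and-below"
    if projects_above:
        return "tall"          # e.g. capital "C", "S", or "t", "h"
    if projects_below:
        return "descending"    # e.g. "p", "y"
    if (bottom - top) < (l1 - l2) / 2:
        return "small-mark"    # e.g. a period resting on l1
    return "between-lines"     # e.g. small "c", "s"


# Example with l2 at row 20 and l1 at row 40:
print(classify_extent(10, 40, l1=40, l2=20))  # capital "C"
print(classify_extent(20, 40, l1=40, l2=20))  # small "c"
```

In this sketch a capital "C" and a small "c" of the same shape differ only in whether the top of the character lies above l2, which is why the accuracy of the detected common lines matters.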
Conventionally, there are two methods of detecting these base lines. The first method relies on a histogram, taken along the horizontal direction of FIG. 13, of the picture elements which constitute the character lines shown in FIG. 13. An example of the histogram is shown in FIG. 14. The second method relies on the area occupied by the smallest character of a character line, as shown in FIG. 13.
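The first method can be sketched roughly as follows, assuming that the histogram counts the black picture elements in each horizontal row of a binarized character line. The function names and the half-of-maximum threshold used to pick the common lines are assumptions made for this sketch; the conventional method does not specify how the lines are read off the histogram.

```python
def row_histogram(image):
    """Count the black picture elements (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def estimate_common_lines(hist):
    """Estimate the rows of l2 (upper) and l1 (lower) from a row histogram.

    Assumption for this sketch: the common lines bound the band of rows
    whose count is at least half of the maximum count.
    """
    threshold = max(hist) / 2
    dense_rows = [y for y, count in enumerate(hist) if count >= threshold]
    return dense_rows[0], dense_rows[-1]


# A tiny 6-row binary image: rows 0-1 hold only a part projecting upward,
# rows 2-4 hold the major portions, row 5 holds a part projecting downward.
image = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
]
hist = row_histogram(image)
l2_row, l1_row = estimate_common_lines(hist)
```

A fixed threshold of this kind works only when the histogram has a clearly dense band; when the counts fall off gradually, the estimated rows shift with the choice of threshold.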
In the first method, the obtained histogram does not show clear peaks, as can be seen in FIG. 14, so it is difficult to determine the base lines. The second method is based on the assumption that each character on each line is identified accurately, so the base lines cannot be determined correctly if a character is identified incorrectly.